# web-scraper

Scrape web pages and save them as HTML or Markdown (with text and images). Minimal dependencies: only `requests` and `beautifulsoup4`. Use this skill when the user provides a URL and wants to download or archive the content locally.
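For reference, the underlying flow is the standard `requests` + BeautifulSoup pattern. A minimal sketch of that flow (the `fetch_page` helper is hypothetical, not the actual `scrape.py` code):

```python
# Minimal sketch of the single-page flow: fetch with requests, parse with
# BeautifulSoup. Hypothetical helper, not the actual scrape.py code.
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str, timeout: float = 30.0) -> BeautifulSoup:
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # surface HTTP errors instead of saving them
    return BeautifulSoup(resp.text, "html.parser")

soup = fetch_page("https://example.com")
print(soup.title.string)                      # page title
print(soup.get_text(" ", strip=True)[:200])   # first 200 chars of body text
```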
## Installation

```bash
npx skill4agent add agentbay-ai/agentbay-skills web-scraper
```

Then install the Python dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
# or install the two packages directly:
pip install requests beautifulsoup4
```

## Quick start

Save a single page as HTML:

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
```

Save a single page as Markdown:

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

Recursively archive a documentation site as Markdown:

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

## Examples

Save a docs page as HTML:

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

Save a Wikipedia article as Markdown; images are downloaded into an `images/` folder next to `web-scraping.md`:

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

Keep the remote image URLs instead of downloading copies:

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```
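When images are downloaded (the default), the saved document has to be pointed at the local copies. A rough sketch of that step, assuming `scrape.py` does something along these lines (the `localize_images` helper is hypothetical):

```python
# Sketch of the image step: download each <img> into images/ and rewrite the
# tag to reference the local copy. The localize_images helper is hypothetical.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def localize_images(soup: BeautifulSoup, base_url: str, out_dir: str) -> None:
    img_dir = os.path.join(out_dir, "images")
    os.makedirs(img_dir, exist_ok=True)
    for img in soup.find_all("img", src=True):
        src = urljoin(base_url, img["src"])  # resolve relative URLs
        name = os.path.basename(urlparse(src).path) or "image"
        with open(os.path.join(img_dir, name), "wb") as f:
            f.write(requests.get(src, timeout=30).content)
        img["src"] = f"images/{name}"  # point the document at the local file
```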
Omit `--output` to save under an auto-generated name in the current directory:

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
```

Recursively archive a documentation site:

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

The recursive crawl mirrors the site structure:

```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/          # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```
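`--recursive` implies a link-following crawl bounded by depth and page limits. A sketch of how such a crawl can be implemented (the `crawl` helper is hypothetical, not the script's actual logic):

```python
# Sketch of a same-domain, depth- and page-limited crawl of the kind
# --recursive implies. Hypothetical helper, not the actual implementation.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start: str, max_depth: int = 2, max_pages: int = 50):
    domain = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    fetched = 0
    while queue and fetched < max_pages:
        url, depth = queue.popleft()
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        fetched += 1
        yield url, soup  # caller converts and saves each page
        if depth >= max_depth:
            continue
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # drop fragments
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```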
--url "https://blog.example.com" \
--format html \
--recursive \
--max-depth 3 \
--max-pages 100 \
--output ~/Archives/blog-backup{baseDir}/scripts/scrape.py \
--url "https://example.com" \
--format md \
--recursive \
--no-respect-robots \
--rate-limit 1.0{baseDir}/scripts/scrape.py \
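Checking robots.txt plus a fixed delay between requests is the standard politeness pattern. A sketch using only the standard library and `requests` (the `polite_get` helper is hypothetical):

```python
# Politeness sketch: check robots.txt (unless --no-respect-robots) and wait
# --rate-limit seconds before each request. Hypothetical helper.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

def polite_get(url: str, rate_limit: float = 1.0,
               respect_robots: bool = True) -> requests.Response:
    if respect_robots:
        parts = urlparse(url)
        # A real crawler would cache the parsed robots.txt per host.
        robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        if not robots.can_fetch("*", url):
            raise PermissionError(f"robots.txt disallows {url}")
    time.sleep(rate_limit)  # fixed delay between requests
    return requests.get(url, timeout=30)
```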
--url "https://yoursite.com" \
--format md \
--recursive \
--rate-limit 0.2images/images/--no-download-images--recursive--max-depth--max-pages--same-domain--rate-limit--max-depth 1 --max-pages 10--no-respect-robots--same-domain--timeout
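For reference, the documented flags map naturally onto `argparse`. A sketch of the CLI surface (the defaults shown are illustrative assumptions, not the script's actual defaults):

```python
# Sketch of how the documented flags could be declared with argparse.
# Defaults here are assumptions for illustration only.
import argparse

parser = argparse.ArgumentParser(description="Scrape web pages to HTML/Markdown")
parser.add_argument("--url", required=True)
parser.add_argument("--format", choices=["html", "md"], default="md")
parser.add_argument("--output")
parser.add_argument("--recursive", action="store_true")
parser.add_argument("--max-depth", type=int, default=1)
parser.add_argument("--max-pages", type=int, default=10)
parser.add_argument("--same-domain", action="store_true")
parser.add_argument("--rate-limit", type=float, default=1.0)
parser.add_argument("--no-download-images", action="store_true")
parser.add_argument("--no-respect-robots", action="store_true")
parser.add_argument("--timeout", type=float, default=30.0)
args = parser.parse_args()
```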