jb-docs-scraper
Scrape documentation websites into local markdown files for AI context. Takes a base URL and crawls the documentation, storing results in ./docs (or custom path). Uses crawl4ai with BFS deep crawling.
12 installs
Source: bjesuiter/skills
NPX Install
npx skill4agent add bjesuiter/skills jb-docs-scraper
SKILL.md Content
Documentation Scraper
Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.
Quick Start
bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>
# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
Output goes to ./docs/<auto-detected-name>/ by default.
Prerequisites (First Time Only)
bash
uv run --with crawl4ai playwright install
Usage
bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
Options
| Option | Description | Default |
|---|---|---|
| --output | Output directory | ./docs/<auto-detected-name>/ |
| --max-depth | Maximum link depth | |
| --max-pages | Maximum pages to scrape | |
| --url-pattern | URL filter (glob) | Auto-detected |
| | Suppress verbose output | |
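For readers who want to see how such a command line might be parsed, here is a minimal sketch using Python's argparse. It is not the parser from scrape_docs.py, which is not shown here. It only covers the four flags that appear in the usage examples (--output, --max-depth, --max-pages, --url-pattern). The defaults are set to None as a placeholder because the actual default values are not stated, and the option that suppresses verbose output is left out because its name is not given.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags seen in the usage examples."""
    parser = argparse.ArgumentParser(prog="scrape_docs.py")
    parser.add_argument("url", help="Base URL of the documentation to scrape")
    # None is a placeholder: the real defaults are not documented here.
    parser.add_argument("--output", default=None, help="Output directory")
    parser.add_argument("--max-depth", type=int, default=None, help="Maximum link depth")
    parser.add_argument("--max-pages", type=int, default=None, help="Maximum pages to scrape")
    parser.add_argument("--url-pattern", default=None, help="URL filter (glob)")
    return parser


# Parse an argument list equivalent to a "limit crawl scope" invocation.
args = build_parser().parse_args(
    ["https://example.com/docs/", "--max-pages", "50", "--max-depth", "3"]
)
```

argparse converts the dashes in flag names to underscores, so the values are read as args.max_pages and args.max_depth.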
Examples
bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
https://mediasoup.org/documentation/v3/
# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
https://docs.rombo.co/tailwind \
--output ./my-tailwind-docs
# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
https://tanstack.com/start/latest/docs/framework/react/overview \
--max-pages 50 \
--max-depth 3
# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
https://example.com/docs/api/v2/ \
--url-pattern "*api/v2/*"
How It Works
- Auto-detects domain and URL pattern from the input URL
- Crawls using BFS (breadth-first search) strategy
- Filters to stay within the documentation section
- Converts pages to clean markdown
- Saves with directory structure mirroring the URL paths
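The crawl steps above can be sketched with the standard library alone. This is an illustration of the general technique, not the code in scrape_docs.py, and it does not call crawl4ai. The function names derive_pattern and bfs_crawl, the glob shape produced from the base URL, and the in-memory LINKS graph that stands in for real page fetches are all assumptions made for the example.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def derive_pattern(base_url: str) -> str:
    """Turn a base URL into a glob covering its path and everything below it."""
    path = urlparse(base_url).path.strip("/")
    return f"*{path}/*" if path else "*"


def bfs_crawl(start, links, pattern, max_depth, max_pages):
    """Return pages in breadth-first order, following only links that match."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, ignore its links
        for nxt in links.get(url, []):
            if nxt not in seen and fnmatchcase(nxt, pattern):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
BASE = "https://example.com/docs/"
LINKS = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/blog/x",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a": [
        "https://example.com/docs/a/deep",
        "https://example.com/docs/b",
    ],
    "https://example.com/docs/a/deep": ["https://example.com/docs/a/deep/er"],
}

pages = bfs_crawl(BASE, LINKS, derive_pattern(BASE), max_depth=2, max_pages=10)
```

In this sketch the blog link is dropped by the glob filter, and the page three links away from the start is never reached because of the depth limit. A glob matched against the whole URL is deliberately loose, which is one reason a stricter pattern can help when the wrong pages are scraped.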
Output Structure
docs/<name>/
  index.md # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
Troubleshooting
| Issue | Solution |
|---|---|
| Run |
| Empty output | Check if URL pattern matches actual doc URLs. Try |
| Missing pages | Increase --max-depth or --max-pages |
| Wrong pages scraped | Use stricter --url-pattern |
Tips
- Test first - Use --max-pages 10 to verify config before full crawl
- Check output name - Script auto-detects from URL path segments
- Rerun safe - Files are overwritten, duplicates skipped
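To make the output naming easier to predict, here is a minimal sketch of how a name and file path could be derived from a URL. It is an assumption rather than the script's actual logic. Joining path segments with underscores is chosen because it reproduces the one documented case, where https://mediasoup.org/documentation/v3/ is saved to ./docs/documentation_v3/. The helper names, the host name fallback for a URL with no path, and the rule that the root page becomes index.md are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def _segments(url: str) -> list:
    """Non-empty path segments of a URL."""
    return [s for s in urlparse(url).path.split("/") if s]


def detect_name(base_url: str) -> str:
    """Join the path segments with underscores (assumed naming rule)."""
    parts = _segments(base_url)
    # Falling back to the host name is an assumption for URLs with no path.
    return "_".join(parts) if parts else urlparse(base_url).netloc


def output_path(page_url: str, base_url: str, out_dir: str) -> PurePosixPath:
    """Mirror the page's path below the base URL as a .md file under out_dir."""
    base, page = _segments(base_url), _segments(page_url)
    rel = page[len(base):] if page[: len(base)] == base else page
    if not rel:
        return PurePosixPath(out_dir) / "index.md"  # root page
    return PurePosixPath(out_dir, *rel[:-1], rel[-1] + ".md")


BASE = "https://mediasoup.org/documentation/v3/"
name = detect_name(BASE)
root_file = output_path(BASE, BASE, f"docs/{name}")
# The page URL below is hypothetical and only illustrates nesting.
nested_file = output_path(BASE + "api/client", BASE, f"docs/{name}")
```

Because each URL maps to exactly one path, writing the same page twice overwrites the same file, which is consistent with reruns being safe.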