# jb-docs-scraper

Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.

## Quick Start
```bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>

# Examples
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
```

Output goes to `./docs/<auto-detected-name>/` by default.

## Prerequisites (First Time Only)
```bash
uv run --with crawl4ai playwright install
```

## Usage
```bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
```

### Options
| Option | Description | Default |
|---|---|---|
| `--output` | Output directory | `./docs/<auto-detected-name>/` |
| `--max-depth` | Maximum link depth | |
| `--max-pages` | Maximum pages to scrape | |
| `--url-pattern` | URL filter (glob) | Auto-detected |
| | Suppress verbose output | |
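The `--url-pattern` value is a glob matched against candidate URLs. As an illustrative sketch (not the script's actual code), a substring-style glob filter could be built on Python's `fnmatch`:

```python
from fnmatch import fnmatch

def url_matches(url: str, pattern: str) -> bool:
    """Return True if the URL passes a glob-style filter.

    The pattern is wrapped in '*' so a value like "api/v2/" matches
    anywhere inside the URL (assumption about the filter's semantics).
    """
    return fnmatch(url, f"*{pattern}*")

# Keep only pages under the api/v2 section
urls = [
    "https://example.com/docs/api/v2/users",
    "https://example.com/docs/api/v1/users",
    "https://example.com/blog/post",
]
kept = [u for u in urls if url_matches(u, "api/v2/")]
```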
## Examples

```bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://docs.rombo.co/tailwind \
  --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://tanstack.com/start/latest/docs/framework/react/overview \
  --max-pages 50 \
  --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py \
  https://example.com/docs/api/v2/ \
  --url-pattern "api/v2/"
```

## How It Works
- Auto-detects domain and URL pattern from the input URL
- Crawls using BFS (breadth-first search) strategy
- Filters to stay within the documentation section
- Converts pages to clean markdown
- Saves with directory structure mirroring the URL paths
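The BFS crawl above can be sketched in plain Python. This is an illustrative model over an in-memory link graph, not the script's crawl4ai-based implementation; the `max_depth` and `max_pages` limits stand in for `--max-depth` and `--max-pages`:

```python
from collections import deque

def bfs_crawl(links, start, url_filter, max_depth=3, max_pages=100):
    """Breadth-first crawl over a link graph.

    links: dict mapping each URL to the URLs it links to.
    url_filter: predicate deciding whether a URL is in the doc section.
    Returns the URLs visited, in BFS order.
    """
    visited = []
    seen = {start}
    queue = deque([(start, 0)])      # (url, depth)
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)          # the real script fetches + converts here
        if depth >= max_depth:
            continue
        for link in links.get(url, []):
            if link not in seen and url_filter(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph: only /docs/ pages are in scope
graph = {
    "/docs/": ["/docs/a", "/docs/b", "/blog/x"],
    "/docs/a": ["/docs/c"],
}
pages = bfs_crawl(graph, "/docs/", lambda u: u.startswith("/docs/"))
```

BFS visits all pages at depth 1 before any at depth 2, so shallow (usually more important) pages survive a `--max-pages` cutoff.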
## Output Structure
```
docs/<name>/
  index.md            # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
```
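One way a URL-mirroring layout like this could be produced (a hypothetical sketch; the actual script may differ) is to map each page URL's path below the start URL onto a relative `.md` path:

```python
from urllib.parse import urlparse

def url_to_md_path(url: str, base_path: str) -> str:
    """Map a page URL to a markdown file path mirroring the URL structure.

    Hypothetical helper: base_path is the path prefix of the start URL;
    the root page becomes index.md, deeper pages keep their segments.
    """
    path = urlparse(url).path
    rel = path[len(base_path):].strip("/")
    if not rel:
        return "index.md"
    return rel + ".md"

url_to_md_path("https://x.dev/docs/api/overview", "/docs/")  # -> "api/overview.md"
```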
## Troubleshooting
| Issue | Solution |
|---|---|
| Playwright browser missing | Run `uv run --with crawl4ai playwright install` |
| Empty output | Check that the URL pattern matches actual doc URLs; try a custom `--url-pattern` |
| Missing pages | Increase `--max-depth` or `--max-pages` |
| Wrong pages scraped | Use a stricter `--url-pattern` |
## Tips

- Test first - Use `--max-pages 10` to verify config before a full crawl
- Check output name - Script auto-detects it from URL path segments
- Rerun safe - Files are overwritten, duplicates skipped
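The auto-detected output name appears to come from the start URL's path segments (the basic example above maps `/documentation/v3/` to `documentation_v3`). A plausible reconstruction of that derivation, not the script's actual code:

```python
from urllib.parse import urlparse

def auto_name(url: str) -> str:
    """Derive an output directory name from the URL's path segments.

    Assumed behavior: join non-empty path segments with underscores,
    falling back to the hostname for a bare domain URL.
    """
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    return "_".join(segments) if segments else parsed.netloc

auto_name("https://mediasoup.org/documentation/v3/")  # -> "documentation_v3"
```

Checking this name before a full crawl avoids surprises when two different sites share a generic path like `/docs/`.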