jb-docs-scraper

Documentation Scraper

Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.

Quick Start

```bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>
```

Examples

```bash
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
```

Output goes to `./docs/<auto-detected-name>/` by default.

Prerequisites (First Time Only)

```bash
uv run --with crawl4ai playwright install
```

Usage

```bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
```

Options

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output PATH` | Output directory | `./docs/<auto-detected-name>` |
| `--max-depth N` | Maximum link depth | 6 |
| `--max-pages N` | Maximum pages to scrape | 500 |
| `--url-pattern PATTERN` | URL filter (glob) | Auto-detected |
| `-q, --quiet` | Suppress verbose output | False |
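How the script applies the `--url-pattern` glob internally is not shown here, but glob matching against URLs can be sketched with Python's `fnmatch`. This is an illustrative assumption (including the choice to wrap the pattern in `*...*` so it matches anywhere in the URL), not the script's exact logic; the URLs are made up.

```python
# Hedged sketch of glob-style URL filtering, as --url-pattern might apply it.
# Wrapping the pattern in "*...*" is an assumption made for illustration.
from fnmatch import fnmatch

def url_matches(url: str, pattern: str) -> bool:
    # Match the glob anywhere inside the URL string.
    return fnmatch(url, f"*{pattern}*")

keep = url_matches("https://example.com/docs/api/v2/auth", "api/v2/")  # matches
drop = url_matches("https://example.com/blog/post", "api/v2/")         # filtered out
```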

Examples

```bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py https://tanstack.com/start/latest/docs/framework/react/overview --max-pages 50 --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py https://example.com/docs/api/v2/ --url-pattern "api/v2/"
```

How It Works

工作原理

  1. Auto-detects domain and URL pattern from the input URL
  2. Crawls using BFS (breadth-first search) strategy
  3. Filters to stay within the documentation section
  4. Converts pages to clean markdown
  5. Saves with directory structure mirroring the URL paths
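The crawl strategy above (BFS with depth, page, and URL-pattern limits) can be sketched in miniature. The real script uses crawl4ai's async crawler; this in-memory version over a made-up link graph only demonstrates the traversal and filtering logic.

```python
# Illustrative BFS crawl: honors max-depth, max-pages, and a glob URL filter.
# The link graph is hypothetical; no network access is performed.
from collections import deque
from fnmatch import fnmatch

def bfs_crawl(start, links, url_pattern="*", max_depth=6, max_pages=500):
    """Return URLs in BFS order, bounded by depth, page count, and pattern."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(url, []):
            # Skip already-seen URLs and anything outside the doc section.
            if nxt not in seen and fnmatch(nxt, url_pattern):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

# Hypothetical documentation link graph.
links = {
    "https://example.com/docs/": [
        "https://example.com/docs/api/",
        "https://example.com/blog/",       # outside the docs section
    ],
    "https://example.com/docs/api/": ["https://example.com/docs/api/client"],
}
pages = bfs_crawl("https://example.com/docs/", links,
                  url_pattern="*example.com/docs/*")
```

The blog URL is never queued because it fails the pattern filter, mirroring step 3 above.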

Output Structure

```
docs/<name>/
  index.md           # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
```
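The mirroring of URL paths onto output files can be sketched as follows. The mapping rules here (section root becomes `index.md`, trailing slashes stripped) are assumptions for illustration, not the script's exact logic, and `docs/site` is a hypothetical output directory.

```python
# Illustrative mapping from a page URL to an output markdown path.
# Assumes the crawl root maps to index.md; the real script may differ.
from urllib.parse import urlparse

def output_path(url: str, base_url: str, out_dir: str = "docs/site") -> str:
    base = urlparse(base_url).path.rstrip("/")
    path = urlparse(url).path.rstrip("/")
    rel = path[len(base):].strip("/")  # path relative to the crawl root
    return f"{out_dir}/{rel}.md" if rel else f"{out_dir}/index.md"

p_root = output_path("https://example.com/docs/", "https://example.com/docs/")
p_page = output_path("https://example.com/docs/api/overview",
                     "https://example.com/docs/")
```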

Troubleshooting

| Issue | Solution |
|-------|----------|
| Playwright browser binaries are missing | Run `uv run --with crawl4ai playwright install` |
| Empty output | Check whether the URL pattern matches the actual doc URLs; try `--url-pattern` |
| Missing pages | Increase `--max-depth` or `--max-pages` |
| Wrong pages scraped | Use a stricter `--url-pattern` |

Tips

  1. Test first - Use `--max-pages 10` to verify the configuration before a full crawl
  2. Check output name - The script auto-detects it from URL path segments
  3. Rerun safe - Existing files are overwritten and duplicate pages are skipped