jb-docs-scraper

Documentation Scraper

Scrape any documentation website into local markdown files. Uses crawl4ai for async web crawling.

Quick Start

```bash
# Scrape any documentation URL
uv run --with crawl4ai python ./references/scrape_docs.py <URL>
```

Examples

```bash
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind
```

Output goes to `./docs/<auto-detected-name>/` by default.

Prerequisites (First Time Only)

```bash
uv run --with crawl4ai playwright install
```

Usage

```bash
uv run --with crawl4ai python ./references/scrape_docs.py <URL> [OPTIONS]
```

Options

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output PATH` | Output directory | `./docs/<auto-detected-name>` |
| `--max-depth N` | Maximum link depth | 6 |
| `--max-pages N` | Maximum pages to scrape | 500 |
| `--url-pattern PATTERN` | URL filter (glob) | Auto-detected |
| `-q, --quiet` | Suppress verbose output | False |
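How the script applies the `--url-pattern` glob internally is not shown here, but glob matching against URLs can be sketched with Python's `fnmatch`. This is an illustrative assumption (including the choice to wrap the pattern in `*...*` so it matches anywhere in the URL), not the script's exact logic; the URLs are made up.

```python
# Hedged sketch of glob-style URL filtering, as --url-pattern might apply it.
# Wrapping the pattern in "*...*" is an assumption made for illustration.
from fnmatch import fnmatch

def url_matches(url: str, pattern: str) -> bool:
    # Match the glob anywhere inside the URL string.
    return fnmatch(url, f"*{pattern}*")

keep = url_matches("https://example.com/docs/api/v2/auth", "api/v2/")  # matches
drop = url_matches("https://example.com/blog/post", "api/v2/")         # filtered out
```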

Examples

```bash
# Basic - scrape to ./docs/documentation_v3/
uv run --with crawl4ai python ./references/scrape_docs.py https://mediasoup.org/documentation/v3/

# Custom output directory
uv run --with crawl4ai python ./references/scrape_docs.py https://docs.rombo.co/tailwind --output ./my-tailwind-docs

# Limit crawl scope
uv run --with crawl4ai python ./references/scrape_docs.py https://tanstack.com/start/latest/docs/framework/react/overview --max-pages 50 --max-depth 3

# Custom URL pattern filter
uv run --with crawl4ai python ./references/scrape_docs.py https://example.com/docs/api/v2/ --url-pattern "api/v2/"
```

How It Works

工作原理

  1. Auto-detects domain and URL pattern from the input URL
  2. Crawls using BFS (breadth-first search) strategy
  3. Filters to stay within the documentation section
  4. Converts pages to clean markdown
  5. Saves with directory structure mirroring the URL paths
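The crawl strategy above (BFS with depth, page, and URL-pattern limits) can be sketched in miniature. The real script uses crawl4ai's async crawler; this in-memory version over a made-up link graph only demonstrates the traversal and filtering logic.

```python
# Illustrative BFS crawl: honors max-depth, max-pages, and a glob URL filter.
# The link graph is hypothetical; no network access is performed.
from collections import deque
from fnmatch import fnmatch

def bfs_crawl(start, links, url_pattern="*", max_depth=6, max_pages=500):
    """Return URLs in BFS order, bounded by depth, page count, and pattern."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(url, []):
            # Skip already-seen URLs and anything outside the doc section.
            if nxt not in seen and fnmatch(nxt, url_pattern):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

# Hypothetical documentation link graph.
links = {
    "https://example.com/docs/": [
        "https://example.com/docs/api/",
        "https://example.com/blog/",       # outside the docs section
    ],
    "https://example.com/docs/api/": ["https://example.com/docs/api/client"],
}
pages = bfs_crawl("https://example.com/docs/", links,
                  url_pattern="*example.com/docs/*")
```

The blog URL is never queued because it fails the pattern filter, mirroring step 3 above.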

Output Structure

```
docs/<name>/
  index.md           # Root page
  getting-started.md
  api/
    overview.md
    client.md
  guides/
    installation.md
```
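The mirroring of URL paths onto output files can be sketched as follows. The mapping rules here (section root becomes `index.md`, trailing slashes stripped) are assumptions for illustration, not the script's exact logic, and `docs/site` is a hypothetical output directory.

```python
# Illustrative mapping from a page URL to an output markdown path.
# Assumes the crawl root maps to index.md; the real script may differ.
from urllib.parse import urlparse

def output_path(url: str, base_url: str, out_dir: str = "docs/site") -> str:
    base = urlparse(base_url).path.rstrip("/")
    path = urlparse(url).path.rstrip("/")
    rel = path[len(base):].strip("/")  # path relative to the crawl root
    return f"{out_dir}/{rel}.md" if rel else f"{out_dir}/index.md"

p_root = output_path("https://example.com/docs/", "https://example.com/docs/")
p_page = output_path("https://example.com/docs/api/overview",
                     "https://example.com/docs/")
```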

Troubleshooting

| Issue | Solution |
|-------|----------|
| Playwright browser binaries are missing | Run `uv run --with crawl4ai playwright install` |
| Empty output | Check whether the URL pattern matches the actual doc URLs; try `--url-pattern` |
| Missing pages | Increase `--max-depth` or `--max-pages` |
| Wrong pages scraped | Use a stricter `--url-pattern` |

Tips

  1. Test first - Use `--max-pages 10` to verify the configuration before a full crawl
  2. Check output name - The script auto-detects it from URL path segments
  3. Rerun safe - Existing files are overwritten and duplicate pages are skipped