# Crawler Skill
Converts any URL into clean markdown using a robust 3-tier fallback chain.
## Quick start
```bash
uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

Markdown is saved to the file specified by `--output`. Progress/errors go to stderr. Exit code `0` on success, `1` if all scrapers fail.
## How it works
The script tries each tier in order and returns the first success:
| Tier | Module | Requires |
|---|---|---|
| 1 | Firecrawl (`firecrawl_scraper.py`) | `FIRECRAWL_API_KEY` environment variable |
| 2 | Jina Reader (`jina_reader.py`) | Nothing — free, no key needed |
| 3 | Scrapling (`scrapling_scraper.py`) | Local headless browser (auto-installs via pip) |
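The control flow can be sketched roughly like this. The function names and bodies below are illustrative stand-ins, not the actual API of the modules above; only the try-in-order-and-validate pattern mirrors the script.

```python
# Minimal sketch of a 3-tier fallback chain. The real scrapers live in
# scripts/src/; these stand-ins only illustrate the control flow.
def try_firecrawl(url):
    raise RuntimeError("no FIRECRAWL_API_KEY set")  # tier 1 fails here

def try_jina(url):
    # Tier 2 succeeds with some markdown content.
    return "# Example Domain\n\n" + "Stand-in markdown body. " * 10

def try_scrapling(url):
    return "# Example Domain\n\n" + "Fetched by the local headless tier. " * 5

def crawl(url):
    for tier in (try_firecrawl, try_jina, try_scrapling):
        try:
            markdown = tier(url)
        except Exception:
            continue  # fall through to the next tier
        if markdown and len(markdown) >= 100:  # basic content validation
            return markdown
    raise SystemExit(1)  # all tiers failed -> exit code 1

print(crawl("https://example.com")[:16])  # → "# Example Domain"
```

Here the first tier raises, so the chain falls through to Jina and returns its result.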
## File layout
```
crawler-skill/
├── SKILL.md                      ← this file
├── scripts/
│   ├── crawl.py                  ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py      ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py  ← Tier 1: Firecrawl API
│       ├── jina_reader.py        ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py  ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py             ← 70 pytest tests (all passing)
```
## Usage examples
```bash
# Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling.
# Always prefer using --output to avoid terminal encoding issues.
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

# If no --output is provided, markdown goes to stdout (not recommended on Windows).
uv run scripts/crawl.py --url https://example.com

# With a Firecrawl API key for best results.
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
```
## URL requirements
Only `http://` and `https://` URLs are accepted. Passing any other scheme (`ftp://`, `file://`, `javascript:`, a bare path, etc.) exits with code `1` and prints a clear error — no scraping is attempted.
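A check equivalent to this rule might look like the following sketch (not the script's literal code):

```python
from urllib.parse import urlparse

def is_allowed(url: str) -> bool:
    # Only http:// and https:// pass; everything else is rejected
    # before any scraping is attempted.
    return urlparse(url).scheme in ("http", "https")

assert is_allowed("https://example.com")
assert is_allowed("http://example.com")
assert not is_allowed("ftp://example.com/file.txt")
assert not is_allowed("file:///tmp/page.html")
assert not is_allowed("reports/example.md")  # bare path has no scheme
```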
## Saving Reports
When the user asks to save the crawled content or a summary to a file, ALWAYS use the `--output` argument and save the file into the `reports/` directory at the project root (for example, `{project_root}/reports`). If the directory does not exist, the script will create it.

Example: if asked to "save to result.md", you should run:

```bash
uv run scripts/crawl.py --url <URL> --output reports/result.md
```
## Point at a self-hosted Firecrawl instance
```bash
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
```
## Content validation
Each scraper validates its output before returning success:
- Minimum 100 characters of content (rejects empty/error pages)
- Detection of CAPTCHA / bot-verification pages (Firecrawl)
- Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)
- Detection of Jina error page indicators (`Access Denied`, `Error:`, etc.)
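In spirit, each validator does something like the following (the indicator list is abbreviated and the function is a sketch, not the modules' actual code):

```python
# Abbreviated indicator list; the real scrapers check more markers.
ERROR_INDICATORS = ("Access Denied", "Error:")

def looks_valid(markdown: str) -> bool:
    if markdown is None or len(markdown.strip()) < 100:
        return False  # rejects empty/error pages
    head = markdown[:500]  # error banners show up near the top
    return not any(marker in head for marker in ERROR_INDICATORS)

assert not looks_valid("")                                # too short
assert not looks_valid("Access Denied. " + "x " * 100)    # error indicator
assert looks_valid("# Real article\n\n" + "Body text. " * 20)
```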
## Domain routing
Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in `scripts/src/domain_router.py`.

| Domain | Skipped tiers | Active chain |
|---|---|---|
| `medium.com` | firecrawl | jina → scrapling |
| `mp.weixin.qq.com` | firecrawl + jina | scrapling only |
| everything else | — | firecrawl → jina → scrapling |

Sub-domain matching follows a suffix rule: `blog.medium.com` matches the `medium.com` rule because its hostname ends with `.medium.com`. An exact sub-domain like `other.weixin.qq.com` does not match `mp.weixin.qq.com`.
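The suffix rule can be sketched as follows, assuming the two rules described above; this is a simplification, not the literal contents of `domain_router.py`:

```python
# Hypothetical rule table mirroring the routing described above.
SKIP_RULES = {
    "medium.com": {"firecrawl"},
    "mp.weixin.qq.com": {"firecrawl", "jina"},
}

def skipped_tiers(hostname: str) -> set:
    for domain, skipped in SKIP_RULES.items():
        # Exact match, or suffix match on ".domain".
        if hostname == domain or hostname.endswith("." + domain):
            return skipped
    return set()  # everything else runs the full chain

assert skipped_tiers("blog.medium.com") == {"firecrawl"}
assert skipped_tiers("mp.weixin.qq.com") == {"firecrawl", "jina"}
assert skipped_tiers("other.weixin.qq.com") == set()  # no suffix match
assert skipped_tiers("example.com") == set()
```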
## Running tests
```bash
uv run pytest tests/ -v
```

All 70 tests use mocking — no network calls, no API keys required.
## Dependencies (auto-installed by `uv run`)
- `firecrawl-py>=2.0` — Firecrawl Python SDK
- `httpx>=0.27` — HTTP client for Jina Reader
- `scrapling>=0.2` — Headless scraping with stealth support
- `html2text>=2024.2.26` — HTML-to-markdown conversion
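Since `crawl.py` declares these as PEP 723 inline metadata, the top of the script presumably contains a block along these lines (a sketch; the actual header may differ):

```python
# /// script
# dependencies = [
#     "firecrawl-py>=2.0",
#     "httpx>=0.27",
#     "scrapling>=0.2",
#     "html2text>=2024.2.26",
# ]
# ///
```

`uv run` reads this block and installs the listed packages into an ephemeral environment before executing the script.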
## When to invoke this skill
Invoke `crawl.py` whenever you need the text content of a web page:

```python
import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True, text=True
)
if result.returncode == 0:
    markdown = result.stdout
```

Or simply run it directly from the terminal as shown in Quick start above.