# Scrapling CLI
Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.
## Prerequisites

```bash
# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'

# Install fetcher browser engines (one-time)
scrapling install
```

Verify: `scrapling --help`

## Fetcher Selection
| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|---|---|---|---|---|---|---|
| HTTP | `extract get` / `extract post` | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | `extract fetch` | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | `extract stealthy-fetch` | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to the HTTP tier; only escalate when the page requires JS rendering or blocks HTTP requests.
## Output Format

Determined by the output file extension:

| Extension | Output | Best For |
|---|---|---|
| `.html` | Raw HTML | Parsing, further processing |
| `.md` | HTML converted to Markdown | Reading, LLM context |
| `.txt` | Text content only | Clean text extraction |

Always use `/tmp/scrapling-*.{md,txt,html}` for output files. Read the file after extraction.
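The naming convention above can be wrapped in a small helper. This is an illustrative sketch, not part of scrapling itself; the `scrapling_out` name and the timestamp-plus-PID uniqueness scheme are assumptions:

```bash
# Hypothetical helper: build a unique output path following the
# /tmp/scrapling-*.{md,txt,html} convention. The extension argument
# (html, md, or txt) selects scrapling's output format.
scrapling_out() {
  printf '/tmp/scrapling-%s-%s.%s\n' "$(date +%s)" "$$" "$1"
}
```

For example, `scrapling extract get "https://example.com" "$(scrapling_out md)"` fetches a page as Markdown into a fresh file.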
## Core Commands
### HTTP Tier: GET

```bash
scrapling extract get URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Example |
|---|---|---|
| `-s` | Extract matching elements only | `-s "article"` |
| `--impersonate` | Force specific browser | `--impersonate firefox` |
| `-H` | Custom headers (repeatable) | `-H "Authorization: Bearer TOKEN"` |
| | Cookie string | |
| `--proxy` | Proxy URL | `--proxy "http://proxy:8080"` |
| | Query params (repeatable) | |
| | Timeout in seconds (default: 30) | |
| | Skip SSL verification | For self-signed certs |
| | Don't follow redirects | For redirect inspection |
| `--no-stealthy-headers` | Disable stealth headers | For debugging |

Examples:
```bash
# Basic page fetch as markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md

# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

# Multiple CSS selectors
scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"

# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

# Random browser impersonation from list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
```
### HTTP Tier: POST

```bash
scrapling extract post URL OUTPUT_FILE [OPTIONS]
```

Additional options over GET:

| Flag | Purpose | Example |
|---|---|---|
| `-d` | Form data | `-d "q=test&page=1"` |
| `-j` | JSON body | `-j '{"query": "test"}'` |

```bash
# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'
```

PUT and DELETE share the same interface as POST and GET respectively.

### Dynamic Tier: fetch
For JS-rendered pages. Launches a headless Playwright browser.

```bash
scrapling extract fetch URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Default |
|---|---|---|
| | Headless mode | True |
| `--disable-resources` | Drop images/CSS/fonts for speed | False |
| `--network-idle` | Wait for network idle | False |
| | Timeout in milliseconds | 30000 |
| `--wait` | Extra wait after load (ms) | 0 |
| `-s` | CSS selector extraction | — |
| `--wait-selector` | Wait for element before proceeding | — |
| | Use installed Chrome instead of bundled | False |
| | Proxy URL | — |
| | Extra headers (repeatable) | — |
```bash
# Fetch JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

# Wait for specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000
```
### Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (a patched Firefox).

```bash
scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]
```

Additional options over `fetch`:

| Flag | Purpose | Default |
|---|---|---|
| `--solve-cloudflare` | Solve Cloudflare challenges | False |
| `--block-webrtc` | Block WebRTC (prevents IP leak) | False |
| `--hide-canvas` | Add noise to canvas fingerprinting | False |
| `--block-webgl` | Block WebGL fingerprinting | False (allowed) |
```bash
# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"
```
## Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use `extract get` alone: always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare `extract get`.

### Step 1: HTTP Tier

```bash
scrapling extract get "URL" /tmp/scrapling-out.md
```

Read and validate `/tmp/scrapling-out.md` before proceeding.

### Step 2: Validate Content
Check the scraped output for thin-content indicators, signs that the site requires JS rendering:

| Indicator | Pattern | Example |
|---|---|---|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Less than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton UI text | Dashboard apps |

If ANY indicator is present → escalate to the Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
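The indicator checks above can be partially automated. A minimal POSIX-shell sketch: the `is_content_rich` name is illustrative, the grep patterns and the 20-line threshold mirror the table, and a manual pass over the output remains the authoritative check:

```bash
# Sketch of the thin-content check. Returns 0 (content looks rich) or
# 1 (thin content: escalate to the Dynamic tier).
is_content_rich() {
  file="$1"
  # JS-disabled warnings or loading walls anywhere in the output
  if grep -qiE 'enable javascript|javascript is (required|disabled)|please wait|loading\.\.\.' "$file"; then
    return 1
  fi
  # Fewer than ~20 non-empty lines counts as thin
  lines=$(grep -c . "$file")
  [ "$lines" -ge 20 ]
}
```

Usage: `is_content_rich /tmp/scrapling-out.md || echo "escalate to fetch"`.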
### Step 3: Dynamic Tier (if content validation fails)

```bash
scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources
```

Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.
### Step 4: Stealthy Tier (if Dynamic tier fails)

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare
```

If still blocked, add the maximum stealth flags:

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl
```

## Consumer Skill Integration
When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

- `extract get` → Read → Validate content
- Content thin? → `extract fetch --network-idle --disable-resources` → Read → Validate
- Still blocked? → `extract stealthy-fetch --solve-cloudflare` → Read
- All tiers fail? → Skip and label "scrapling blocked"

Known JS-rendered sites (always start at the Dynamic tier):

- iSpot.pl: React SPA, HTTP tier returns only the nav shell
- Single-page apps with client-side routing (hash or history API URLs)
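Put together, the four steps above can be sketched as a single fallback chain. Assumptions: a `validate` function implementing the Step 2 thin-content checks exists, and the `scrape` wrapper name is illustrative; the flags are the ones documented above:

```bash
# Hypothetical wrapper around the auto-escalation protocol: try each tier
# in order, validating the output file after every attempt.
scrape() {
  url="$1"; out="$2"
  scrapling extract get "$url" "$out" && validate "$out" && return 0
  scrapling extract fetch "$url" "$out" --network-idle --disable-resources \
    && validate "$out" && return 0
  scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare \
    && validate "$out" && return 0
  echo "scrapling blocked: $url" >&2
  return 1
}
```

A consumer skill would call `scrape "$URL" /tmp/scrapling-out.md` and, on a non-zero exit, skip the URL and label it "scrapling blocked".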
## Interactive Shell

```bash
# Launch REPL
scrapling shell

# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'
```
## Troubleshooting

| Issue | Fix |
|---|---|
| | Reinstall: `uv tool install 'scrapling[all]'` |
| fetch/stealthy-fetch fails | Run `scrapling install` |
| Cloudflare still blocks | Add `--block-webrtc --hide-canvas --block-webgl` to stealthy-fetch |
| Timeout | Increase the timeout value |
| SSL error | Add the skip-SSL-verification flag |
| Empty output with selector | Try without `-s` first |
## Constraints

- Output file path is required; scrapling writes to a file, not stdout
- CSS selectors return ALL matches, concatenated
- HTTP tier timeout is in seconds; fetch/stealthy-fetch timeout is in milliseconds
- `--impersonate` is only available on the HTTP tier (fetch/stealthy handle it internally)
- `--solve-cloudflare` is only available on the stealthy-fetch tier
- Stealth headers are enabled by default on the HTTP tier; disable with `--no-stealthy-headers` for debugging