cli-web-scrape

# Scrapling CLI

Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.

## Prerequisites

```bash
# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'

# Install fetcher browser engines (one-time)
scrapling install
```

Verify: `scrapling --help`

## Fetcher Selection

| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|------|---------|--------|-------|---------|----|----------|
| HTTP | `extract get/post/put/delete` | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | `extract fetch` | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | `extract stealthy-fetch` | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to the HTTP tier — only escalate when the page requires JS rendering or blocks HTTP requests.
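The escalation rule can be sketched as a tiny decision helper. This is purely illustrative — `choose_tier` and its parameters are hypothetical, not part of the CLI:

```python
def choose_tier(needs_js: bool = False, blocked_http: bool = False,
                behind_cloudflare: bool = False) -> str:
    """Pick the cheapest scrapling subcommand that can handle the page."""
    if behind_cloudflare:
        return "stealthy-fetch"  # Camoufox; pair with --solve-cloudflare
    if needs_js or blocked_http:
        return "fetch"           # headless Playwright
    return "get"                 # httpx with TLS impersonation
```

The ordering encodes the rule above: cheapest tier first, escalate only on evidence.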

## Output Format

Determined by the output file extension:

| Extension | Output | Best For |
|-----------|--------|----------|
| `.html` | Raw HTML | Parsing, further processing |
| `.md` | HTML converted to Markdown | Reading, LLM context |
| `.txt` | Text content only | Clean text extraction |

Always use `/tmp/scrapling-*.{md,txt,html}` for output files. Read the file after extraction.

## Core Commands

### HTTP Tier: GET

```bash
scrapling extract get URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Example |
|------|---------|---------|
| `-s, --css-selector` | Extract matching elements only | `-s ".article-body"` |
| `--impersonate` | Force a specific browser | `--impersonate firefox` |
| `-H, --headers` | Custom headers (repeatable) | `-H "Authorization: Bearer tok"` |
| `--cookies` | Cookie string | `--cookies "session=abc123"` |
| `--proxy` | Proxy URL | `--proxy "http://user:pass@host:port"` |
| `-p, --params` | Query params (repeatable) | `-p "page=2" -p "limit=50"` |
| `--timeout` | Seconds (default: 30) | `--timeout 60` |
| `--no-verify` | Skip SSL verification | For self-signed certs |
| `--no-follow-redirects` | Don't follow redirects | For redirect inspection |
| `--no-stealthy-headers` | Disable stealth headers | For debugging |

Examples:

```bash
# Basic page fetch as Markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md

# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

# Selector matching multiple elements
scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"

# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

# Random browser impersonation from a list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
```

### HTTP Tier: POST

```bash
scrapling extract post URL OUTPUT_FILE [OPTIONS]
```

Additional options over GET:

| Flag | Purpose | Example |
|------|---------|---------|
| `-d, --data` | Form data | `-d "param1=value1&param2=value2"` |
| `-j, --json` | JSON body | `-j '{"key": "value"}'` |

```bash
# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'
```

PUT and DELETE share the same interface as POST and GET respectively.

### Dynamic Tier: fetch

For JS-rendered pages. Launches a headless Playwright browser.

```bash
scrapling extract fetch URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Default |
|------|---------|---------|
| `--headless/--no-headless` | Headless mode | True |
| `--disable-resources` | Drop images/CSS/fonts for speed | False |
| `--network-idle` | Wait for network idle | False |
| `--timeout` | Milliseconds | 30000 |
| `--wait` | Extra wait after load (ms) | 0 |
| `-s, --css-selector` | CSS selector extraction | |
| `--wait-selector` | Wait for element before proceeding | |
| `--real-chrome` | Use installed Chrome instead of bundled | False |
| `--proxy` | Proxy URL | |
| `-H, --extra-headers` | Extra headers (repeatable) | |

```bash
# Fetch a JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

# Wait for a specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000
```

### Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (patched Firefox).

```bash
scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]
```

Additional options over `fetch`:

| Flag | Purpose | Default |
|------|---------|---------|
| `--solve-cloudflare` | Solve Cloudflare challenges | False |
| `--block-webrtc` | Block WebRTC (prevents IP leak) | False |
| `--hide-canvas` | Add noise to canvas fingerprinting | False |
| `--block-webgl` | Block WebGL fingerprinting | False (allowed) |

```bash
# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"
```

## Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use `extract get` alone — always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare `extract get`.

### Step 1: HTTP Tier

```bash
scrapling extract get "URL" /tmp/scrapling-out.md
```

Read `/tmp/scrapling-out.md` and validate content before proceeding.

### Step 2: Validate Content

Check the scraped output for thin-content indicators — signs that the site requires JS rendering:

| Indicator | Pattern | Example |
|-----------|---------|---------|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Less than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton UI text | Dashboard apps |

If ANY indicator is present → escalate to the Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
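The first and fourth indicators can be approximated with a small heuristic. This is a sketch only — the function name, patterns, and 20-line threshold are assumptions, not part of scrapling:

```python
import re

# Phrases that signal a JS-disabled wall or a loading placeholder
JS_WALL_PATTERNS = re.compile(
    r"enable javascript|javascript is (?:disabled|required)|"
    r"loading\.\.\.|please wait",
    re.IGNORECASE,
)

def looks_thin(markdown: str, min_lines: int = 20) -> bool:
    """Heuristic check for thin content that suggests escalating tiers."""
    if JS_WALL_PATTERNS.search(markdown):
        return True  # explicit JS wall or loading placeholder
    # Count meaningful lines: drop blanks and bare markdown link lines (nav/menus)
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    meaningful = [ln for ln in lines
                  if not re.fullmatch(r"[-*]?\s*\[.*\]\(.*\)", ln)]
    return len(meaningful) < min_lines
```

A wrapper script could read `/tmp/scrapling-out.md`, call `looks_thin`, and decide whether to re-run at the Dynamic tier.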

### Step 3: Dynamic Tier (if content validation fails)

```bash
scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources
```

Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.

### Step 4: Stealthy Tier (if Dynamic tier fails)

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare
```

If still blocked, add maximum stealth flags:

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl
```

## Consumer Skill Integration

When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

1. `extract get` → Read → Validate content
2. Content thin? → `extract fetch --network-idle --disable-resources` → Read → Validate
3. Still blocked? → `extract stealthy-fetch --solve-cloudflare` → Read
4. All tiers fail? → Skip and label "scrapling blocked"

Known JS-rendered sites (always start at the Dynamic tier):

- iSpot.pl — React SPA, HTTP tier returns only a nav shell
- Single-page apps with client-side routing (hash or history API URLs)
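The four-step pattern can be sketched as a driver loop. Everything here is illustrative: `run` would wrap `subprocess.run(["scrapling", "extract", ...])` plus reading the output file, and `thin`/`blocked` stand in for the Step 2 validators — none of these names come from scrapling itself:

```python
OUT = "/tmp/scrapling-out.md"

def scrape_with_escalation(url, run, thin, blocked):
    """Walk the tiers (get -> fetch -> stealthy-fetch); None means all blocked.

    run(cmd):  executes one tier (e.g. via subprocess.run) and returns the
               output file's text.
    thin(text) / blocked(text): the Step 2 content validators.
    """
    tiers = [
        ["get", url, OUT],
        ["fetch", url, OUT, "--network-idle", "--disable-resources"],
        ["stealthy-fetch", url, OUT, "--solve-cloudflare"],
    ]
    for cmd in tiers:
        text = run(["scrapling", "extract", *cmd])
        if not thin(text) and not blocked(text):
            return text
    return None  # all tiers failed -> label the source "scrapling blocked"
```

Injecting `run` keeps the escalation logic testable without launching browsers; a consumer skill would pass a real subprocess wrapper.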

## Interactive Shell

```bash
# Launch REPL
scrapling shell

# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'
```

## Troubleshooting

| Issue | Fix |
|-------|-----|
| `ModuleNotFoundError: click` | Reinstall: `uv tool install --force 'scrapling[all]'` |
| fetch/stealthy-fetch fails | Run `scrapling install` to install browser engines |
| Cloudflare still blocks | Add `--block-webrtc --hide-canvas` to stealthy-fetch |
| Timeout | Increase `--timeout` (seconds for HTTP, milliseconds for fetch/stealthy) |
| SSL error | Add `--no-verify` (HTTP tier only) |
| Empty output with selector | Try without `-s` first to verify the page loads, then refine the selector |

## Constraints

- Output file path is required — scrapling writes to a file, not stdout
- CSS selectors return ALL matches, concatenated
- HTTP tier timeout is in seconds; fetch/stealthy-fetch timeout is in milliseconds
- `--impersonate` is only available on the HTTP tier (fetch/stealthy handle impersonation internally)
- `--solve-cloudflare` is only available on the stealthy-fetch tier
- Stealth headers are enabled by default on the HTTP tier — disable with `--no-stealthy-headers` for debugging