# Scrapling CLI
Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.
## Prerequisites

```bash
# Install with all extras (CLI needs click, fetchers need playwright/camoufox)
uv tool install 'scrapling[all]'

# Install fetcher browser engines (one-time)
scrapling install
```

Verify: `scrapling --help`

## Fetcher Selection
| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|---|---|---|---|---|---|---|
| HTTP | `extract get` / `extract post` | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | `extract fetch` | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | `extract stealthy-fetch` | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to the HTTP tier; only escalate when the page requires JS rendering or blocks HTTP requests.
## Output Format

Determined by the output file extension:

| Extension | Output | Best For |
|---|---|---|
| `.html` | Raw HTML | Parsing, further processing |
| `.md` | HTML converted to Markdown | Reading, LLM context |
| `.txt` | Text content only | Clean text extraction |

Always use `/tmp/scrapling-*.{md,txt,html}` for output files. Read the file after extraction.
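The naming convention above can be wrapped in a small helper. This is an illustrative sketch, not part of scrapling itself; the `scrapling_out` name and the timestamp-plus-PID uniqueness scheme are assumptions:

```bash
# Hypothetical helper: build a unique output path following the
# /tmp/scrapling-*.{md,txt,html} convention. The extension argument
# (html, md, or txt) selects scrapling's output format.
scrapling_out() {
  printf '/tmp/scrapling-%s-%s.%s\n' "$(date +%s)" "$$" "$1"
}
```

For example, `scrapling extract get "https://example.com" "$(scrapling_out md)"` fetches a page as Markdown into a fresh file.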
## Core Commands
### HTTP Tier: GET

```bash
scrapling extract get URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Example |
|---|---|---|
| `-s` | Extract matching elements only | `-s "article"` |
| `--impersonate` | Force specific browser | `--impersonate firefox` |
| `-H` | Custom headers (repeatable) | `-H "Authorization: Bearer TOKEN"` |
| | Cookie string | |
| `--proxy` | Proxy URL | `--proxy "http://proxy:8080"` |
| | Query params (repeatable) | |
| | Timeout in seconds (default: 30) | |
| | Skip SSL verification | For self-signed certs |
| | Don't follow redirects | For redirect inspection |
| `--no-stealthy-headers` | Disable stealth headers | For debugging |

Examples:
```bash
# Basic page fetch as markdown
scrapling extract get "https://example.com" /tmp/scrapling-out.md

# Extract only article content
scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

# Multiple CSS selectors
scrapling extract get "https://hn.com" /tmp/scrapling-out.txt -s ".titleline > a"

# With auth header
scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

# Impersonate Firefox
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

# Random browser impersonation from list
scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

# With proxy
scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
```
### HTTP Tier: POST

```bash
scrapling extract post URL OUTPUT_FILE [OPTIONS]
```

Additional options over GET:

| Flag | Purpose | Example |
|---|---|---|
| `-d` | Form data | `-d "q=test&page=1"` |
| `-j` | JSON body | `-j '{"query": "test"}'` |

```bash
# POST with form data
scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

# POST with JSON
scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'
```

PUT and DELETE share the same interface as POST and GET respectively.

### Dynamic Tier: fetch
For JS-rendered pages. Launches a headless Playwright browser.

```bash
scrapling extract fetch URL OUTPUT_FILE [OPTIONS]
```

| Flag | Purpose | Default |
|---|---|---|
| | Headless mode | True |
| `--disable-resources` | Drop images/CSS/fonts for speed | False |
| `--network-idle` | Wait for network idle | False |
| | Timeout in milliseconds | 30000 |
| `--wait` | Extra wait after load (ms) | 0 |
| `-s` | CSS selector extraction | — |
| `--wait-selector` | Wait for element before proceeding | — |
| | Use installed Chrome instead of bundled | False |
| | Proxy URL | — |
| | Extra headers (repeatable) | — |
```bash
# Fetch JS-rendered SPA
scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

# Wait for specific element to load
scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

# Fast mode: skip images/CSS, wait for network idle
scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

# Extra wait for slow-loading content
scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000
```
### Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (a patched Firefox).

```bash
scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]
```

Additional options over `fetch`:

| Flag | Purpose | Default |
|---|---|---|
| `--solve-cloudflare` | Solve Cloudflare challenges | False |
| `--block-webrtc` | Block WebRTC (prevents IP leak) | False |
| `--hide-canvas` | Add noise to canvas fingerprinting | False |
| `--block-webgl` | Block WebGL fingerprinting | False (allowed) |
```bash
# Bypass Cloudflare
scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

# Maximum stealth
scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

# Stealthy with CSS selector
scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"
```
## Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use `extract get` alone: always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare `extract get`.

### Step 1: HTTP Tier

```bash
scrapling extract get "URL" /tmp/scrapling-out.md
```

Read and validate `/tmp/scrapling-out.md` before proceeding.

### Step 2: Validate Content
Check the scraped output for thin-content indicators, signs that the site requires JS rendering:

| Indicator | Pattern | Example |
|---|---|---|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Less than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |
| Login/loading wall | "Loading...", "Please wait", skeleton UI text | Dashboard apps |

If ANY indicator is present → escalate to the Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
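The indicator checks above can be partially automated. A minimal POSIX-shell sketch: the `is_content_rich` name is illustrative, the grep patterns and the 20-line threshold mirror the table, and a manual pass over the output remains the authoritative check:

```bash
# Sketch of the thin-content check. Returns 0 (content looks rich) or
# 1 (thin content: escalate to the Dynamic tier).
is_content_rich() {
  file="$1"
  # JS-disabled warnings or loading walls anywhere in the output
  if grep -qiE 'enable javascript|javascript is (required|disabled)|please wait|loading\.\.\.' "$file"; then
    return 1
  fi
  # Fewer than ~20 non-empty lines counts as thin
  lines=$(grep -c . "$file")
  [ "$lines" -ge 20 ]
}
```

Usage: `is_content_rich /tmp/scrapling-out.md || echo "escalate to fetch"`.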
### Step 3: Dynamic Tier (if content validation fails)

```bash
scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources
```

Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.
### Step 4: Stealthy Tier (if Dynamic tier fails)

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare
```

If still blocked, add the maximum stealth flags:

```bash
scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl
```

## Consumer Skill Integration
When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

- `extract get` → Read → Validate content
- Content thin? → `extract fetch --network-idle --disable-resources` → Read → Validate
- Still blocked? → `extract stealthy-fetch --solve-cloudflare` → Read
- All tiers fail? → Skip and label "scrapling blocked"

Known JS-rendered sites (always start at the Dynamic tier):

- iSpot.pl: React SPA, HTTP tier returns only the nav shell
- Single-page apps with client-side routing (hash or history API URLs)
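Put together, the four steps above can be sketched as a single fallback chain. Assumptions: a `validate` function implementing the Step 2 thin-content checks exists, and the `scrape` wrapper name is illustrative; the flags are the ones documented above:

```bash
# Hypothetical wrapper around the auto-escalation protocol: try each tier
# in order, validating the output file after every attempt.
scrape() {
  url="$1"; out="$2"
  scrapling extract get "$url" "$out" && validate "$out" && return 0
  scrapling extract fetch "$url" "$out" --network-idle --disable-resources \
    && validate "$out" && return 0
  scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare \
    && validate "$out" && return 0
  echo "scrapling blocked: $url" >&2
  return 1
}
```

A consumer skill would call `scrape "$URL" /tmp/scrapling-out.md` and, on a non-zero exit, skip the URL and label it "scrapling blocked".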
## Interactive Shell

```bash
# Launch REPL
scrapling shell

# One-liner evaluation
scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'
```
## Troubleshooting

| Issue | Fix |
|---|---|
| | Reinstall: `uv tool install 'scrapling[all]'` |
| fetch/stealthy-fetch fails | Run `scrapling install` |
| Cloudflare still blocks | Add `--block-webrtc --hide-canvas --block-webgl` to stealthy-fetch |
| Timeout | Increase the timeout value |
| SSL error | Add the skip-SSL-verification flag |
| Empty output with selector | Try without `-s` first |
## Constraints

- Output file path is required; scrapling writes to a file, not stdout
- CSS selectors return ALL matches, concatenated
- HTTP tier timeout is in seconds; fetch/stealthy-fetch timeout is in milliseconds
- `--impersonate` is only available on the HTTP tier (fetch/stealthy handle it internally)
- `--solve-cloudflare` is only available on the stealthy-fetch tier
- Stealth headers are enabled by default on the HTTP tier; disable with `--no-stealthy-headers` for debugging