scrapling
Scrapling is a powerful Python web scraping library with a comprehensive CLI for extracting data from websites directly from the terminal, without writing code. The primary use case is the `extract` command group for quick data extraction.

Installation

Install with the shell extras using uv:

```bash
uv tool install "scrapling[shell]"
```

Then install the fetcher dependencies (browsers, system dependencies, fingerprint manipulation):

```bash
scrapling install
```
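If you don't use uv, the same steps should work with plain pip — this is a sketch assuming the package publishes the same `shell` extra on PyPI:

```shell
# Assumption: the "shell" extra is available via pip as well
pip install "scrapling[shell]"
scrapling install
```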

Extract Commands (Primary Usage)

The `scrapling extract` command group allows you to download and extract content from websites without writing any code. The output format is determined by the file extension:
  • `.md` - Convert HTML to Markdown
  • `.html` - Save raw HTML
  • `.txt` - Extract clean text content

Quick Start

```bash
# Basic website download as text
scrapling extract get "https://example.com" page_content.txt

# Download as markdown
scrapling extract get "https://blog.example.com" article.md

# Save raw HTML
scrapling extract get "https://example.com" page.html
```

Decision Guide: Which Command to Use?

| Use Case | Command |
| --- | --- |
| Simple websites, blogs, news articles | `get` |
| Modern web apps, dynamic content (JavaScript) | `fetch` |
| Protected sites, Cloudflare, anti-bot | `stealthy-fetch` |
| Form submissions, APIs | `post`, `put`, `delete` |
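The table suggests a natural escalation path: try the fast HTTP path first and only pay the cost of a full browser when it fails. As a shell sketch (the URL is illustrative):

```shell
# Try a plain HTTP GET first; fall back to a browser fetch
# (which renders JavaScript) only if the first attempt fails.
scrapling extract get "https://app.example.com" app.md \
  || scrapling extract fetch "https://app.example.com" app.md --network-idle
```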

HTTP Request Commands


GET Request

Most common command for downloading website content:

```bash
# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md \
  --cookies "session=abc123; user=john"

# Add a user agent
scrapling extract get "https://api.site.com" data.json \
  -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html \
  -H "Accept: text/html" \
  -H "Accept-Language: en-US"

# With query parameters
scrapling extract get "https://api.example.com" data.json \
  -p "page=1" -p "limit=10"
```

**GET options:**

```
-H, --headers TEXT       HTTP headers "Key: Value" (multiple allowed)
--cookies TEXT           Cookies "name1=value1;name2=value2"
--timeout INTEGER        Request timeout in seconds (default: 30)
--proxy TEXT             Proxy URL from $PROXY_URL env variable
-s, --css-selector TEXT  Extract specific content with CSS selector
-p, --params TEXT        Query parameters "key=value" (multiple)
--follow-redirects / --no-follow-redirects  (default: True)
--verify / --no-verify   SSL verification (default: True)
--impersonate TEXT       Browser to impersonate (chrome, firefox)
--stealthy-headers / --no-stealthy-headers  (default: True)
```
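These options compose freely. A sketch combining impersonation, a selector, and a proxy read from the environment — the URL and selector are illustrative:

```shell
# Fetch article bodies while impersonating Chrome, routing through
# a proxy taken from the PROXY_URL environment variable.
scrapling extract get "https://blog.example.com" articles.md \
  -s "article" \
  --impersonate "chrome" \
  --proxy "$PROXY_URL" \
  --timeout 60
```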

POST Request

```bash
# Submit form data
scrapling extract post "https://api.site.com/search" results.html \
  --data "query=python&type=tutorial"

# Send JSON data
scrapling extract post "https://api.site.com" response.json \
  --json '{"username": "test", "action": "search"}'
```

**POST options:** (same as GET plus)

```
-d, --data TEXT  Form data "param1=value1&param2=value2"
-j, --json TEXT  JSON data as string
```
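Since `--json` takes a string, larger payloads can be kept in a file and spliced in with ordinary shell command substitution (the filename here is illustrative):

```shell
# Keep the payload in payload.json and pass its contents inline
scrapling extract post "https://api.site.com" response.json \
  --json "$(cat payload.json)"
```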

PUT Request

```bash
# Send data
scrapling extract put "https://api.example.com" results.html \
  --data "update=info" \
  --impersonate "firefox"

# Send JSON data
scrapling extract put "https://api.example.com" response.json \
  --json '{"username": "test", "action": "search"}'
```

DELETE Request

```bash
scrapling extract delete "https://api.example.com/resource" response.txt

# With impersonation
scrapling extract delete "https://api.example.com/" response.txt \
  --impersonate "chrome"
```

Browser Fetching Commands


Use browser-based fetching for JavaScript-heavy sites or when HTTP requests fail.

fetch - Handle Dynamic Content

For websites that load content dynamically or have slight protection:

```bash
# Wait for JavaScript to load and network activity to finish
scrapling extract fetch "https://example.com" content.md --network-idle

# Wait for a specific element to appear
scrapling extract fetch "https://example.com" data.txt \
  --wait-selector ".content-loaded"

# Visible browser mode for debugging
scrapling extract fetch "https://example.com" page.html \
  --no-headless --disable-resources

# Use the installed Chrome browser
scrapling extract fetch "https://example.com" content.md --real-chrome

# With CSS selector extraction
scrapling extract fetch "https://example.com" articles.md \
  --css-selector "article" \
  --network-idle
```

**fetch options:**

```
--headless / --no-headless  Run browser headless (default: True)
--disable-resources         Drop unnecessary resources for a speed boost
--network-idle              Wait for network idle
--timeout INTEGER           Timeout in milliseconds (default: 30000)
--wait INTEGER              Additional wait time in ms (default: 0)
-s, --css-selector TEXT     Extract specific content
--wait-selector TEXT        Wait for selector before proceeding
--locale TEXT               User locale (default: system)
--real-chrome               Use the installed Chrome browser
--proxy TEXT                Proxy URL
-H, --extra-headers TEXT    Extra headers (multiple)
```

stealthy-fetch - Bypass Protection

For websites with anti-bot protection or Cloudflare:

```bash
# Bypass basic protection
scrapling extract stealthy-fetch "https://example.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt \
  --solve-cloudflare \
  --css-selector "#padded_content a"

# Use a proxy for anonymity (set the PROXY_URL environment variable)
scrapling extract stealthy-fetch "https://site.com" content.md \
  --proxy "$PROXY_URL"

# Hide the canvas fingerprint
scrapling extract stealthy-fetch "https://example.com" content.md \
  --hide-canvas \
  --block-webrtc
```

**stealthy-fetch options:** (same as fetch plus)

```
--block-webrtc                 Block WebRTC entirely
--solve-cloudflare             Solve Cloudflare challenges
--allow-webgl / --block-webgl  Allow WebGL (default: True)
--hide-canvas                  Add noise to canvas operations
```

CSS Selector Examples

Extract specific content with the `-s` or `--css-selector` flag:

```bash
# Extract all articles
scrapling extract get "https://blog.example.com" articles.md -s "article"

# Extract a specific class
scrapling extract get "https://example.com" titles.txt -s ".title"

# Extract by ID
scrapling extract get "https://example.com" content.md -s "#main-content"

# Extract links (href attributes)
scrapling extract get "https://example.com" links.txt -s "a::attr(href)"

# Extract text only
scrapling extract get "https://example.com" titles.txt -s "h1::text"

# Extract multiple elements with fetch
scrapling extract fetch "https://example.com" products.md \
  -s ".product-card" \
  --network-idle
```
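Because everything runs from the terminal, these commands compose with ordinary shell loops. A sketch (URL and query-parameter name are illustrative) that pulls several pages of a paginated listing into separate files:

```shell
# Download pages 1-3 of a paginated listing, one Markdown file each
for n in 1 2 3; do
  scrapling extract get "https://blog.example.com" "page_$n.md" \
    -p "page=$n" -s "article"
done
```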

Help Commands

```bash
scrapling --help
scrapling extract --help
scrapling extract get --help
scrapling extract post --help
scrapling extract fetch --help
scrapling extract stealthy-fetch --help
```

Resources
