# scrapling
Scrapling is a powerful Python web scraping library with a comprehensive CLI for extracting data from websites directly from the terminal, without writing code. The primary use case is the `extract` command group for quick data extraction.

## Installation
Install with the shell extras using uv:
```bash
uv tool install "scrapling[shell]"
```

Then install the fetcher dependencies (browsers, system dependencies, fingerprint manipulation):

```bash
scrapling install
```

## Extract Commands (Primary Usage)
The `extract` command group allows you to download and extract content from websites without writing any code. The output format is determined by the output file's extension:

- `.md` - Convert HTML to Markdown
- `.html` - Save raw HTML
- `.txt` - Extract clean text content
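The extension-based dispatch can be sketched in plain Python. This is a hypothetical illustration of the mapping above, not Scrapling's actual implementation:

```python
from pathlib import Path

# Hypothetical mapping of output extension to extraction mode,
# mirroring the list above; not Scrapling's real code.
FORMATS = {".md": "markdown", ".html": "raw html", ".txt": "clean text"}

def output_format(outfile: str) -> str:
    """Pick the extraction mode from the output file's extension."""
    suffix = Path(outfile).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMATS[suffix]

print(output_format("article.md"))  # markdown
```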
## Quick Start
```bash
# Basic website download as text
scrapling extract get "https://example.com" page_content.txt

# Download as markdown
scrapling extract get "https://blog.example.com" article.md

# Save raw HTML
scrapling extract get "https://example.com" page.html
```

## Decision Guide: Which Command to Use?
| Use Case | Command |
|---|---|
| Simple websites, blogs, news articles | `get` |
| Modern web apps, dynamic content (JavaScript) | `fetch` |
| Protected sites, Cloudflare, anti-bot | `stealthy-fetch` |
| Form submissions, APIs | `post` |
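When scripting the CLI from Python, the decision guide can be encoded as a small argv builder. This helper is hypothetical; only the `scrapling extract` subcommand names come from this page:

```python
def build_extract_argv(url: str, outfile: str, *, dynamic: bool = False,
                       protected: bool = False, method: str = "get") -> list[str]:
    """Pick a scrapling extract subcommand per the decision guide and build argv."""
    if protected:
        subcommand = "stealthy-fetch"  # anti-bot protection, Cloudflare
    elif dynamic:
        subcommand = "fetch"           # JavaScript-heavy, dynamic content
    else:
        subcommand = method            # plain HTTP: get, post, put, delete
    return ["scrapling", "extract", subcommand, url, outfile]

print(build_extract_argv("https://example.com", "page.md", protected=True))
```

The resulting list can be passed directly to `subprocess.run` without invoking a shell.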
## HTTP Request Commands

### GET Request

The most common command for downloading website content:
```bash
# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md \
  --cookies "session=abc123; user=john"

# Add a user agent
scrapling extract get "https://api.site.com" data.json \
  -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html \
  -H "Accept: text/html" \
  -H "Accept-Language: en-US"

# With query parameters
scrapling extract get "https://api.example.com" data.json \
  -p "page=1" -p "limit=10"
```
**GET options:**

```
-H, --headers TEXT       HTTP headers "Key: Value" (multiple allowed)
--cookies TEXT           Cookies "name1=value1;name2=value2"
--timeout INTEGER        Request timeout in seconds (default: 30)
--proxy TEXT             Proxy URL, e.g. from the $PROXY_URL env variable
-s, --css-selector TEXT  Extract specific content with a CSS selector
-p, --params TEXT        Query parameters "key=value" (multiple allowed)
--follow-redirects / --no-follow-redirects  (default: True)
--verify / --no-verify   SSL verification (default: True)
--impersonate TEXT       Browser to impersonate (chrome, firefox)
--stealthy-headers / --no-stealthy-headers  (default: True)
```
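The `--cookies` flag takes a single `name1=value1;name2=value2` string. The following is a small, hypothetical Python sketch of how such a string splits into pairs; it illustrates the flag's format, not Scrapling's parser:

```python
def parse_cookie_string(raw: str) -> dict[str, str]:
    """Split a "name1=value1; name2=value2" cookie string into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing semicolons
        name, _, value = pair.partition("=")
        cookies[name.strip()] = value.strip()
    return cookies

print(parse_cookie_string("session=abc123; user=john"))
```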
### POST Request
```bash
# Submit form data
scrapling extract post "https://api.site.com/search" results.html \
  --data "query=python&type=tutorial"

# Send JSON data
scrapling extract post "https://api.site.com" response.json \
  --json '{"username": "test", "action": "search"}'
```
**POST options:** (same as GET, plus)

```
-d, --data TEXT  Form data "param1=value1&param2=value2"
-j, --json TEXT  JSON data as a string
```
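`--data` sends a form-encoded body, while `--json` sends a JSON body. The two encodings can be compared with the Python standard library (illustrative only; Scrapling handles the encoding internally):

```python
import json
from urllib.parse import urlencode

# Roughly the body sent by --data "query=python&type=tutorial"
form_body = urlencode({"query": "python", "type": "tutorial"})
print(form_body)  # query=python&type=tutorial

# Roughly the body sent by --json '{"username": "test", "action": "search"}'
json_body = json.dumps({"username": "test", "action": "search"})
print(json_body)
```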
### PUT Request
```bash
# Send form data
scrapling extract put "https://api.example.com" results.html \
  --data "update=info" \
  --impersonate "firefox"

# Send JSON data
scrapling extract put "https://api.example.com" response.json \
  --json '{"username": "test", "action": "search"}'
```
### DELETE Request
```bash
# Basic DELETE request
scrapling extract delete "https://api.example.com/resource" response.txt

# With browser impersonation
scrapling extract delete "https://api.example.com/" response.txt \
  --impersonate "chrome"
```

## Browser Fetching Commands
Use browser-based fetching for JavaScript-heavy sites or when HTTP requests fail.
### fetch - Handle Dynamic Content
For websites that load content dynamically or have light protection:

```bash
# Wait for JavaScript to load and network activity to finish
scrapling extract fetch "https://example.com" content.md --network-idle

# Wait for a specific element to appear
scrapling extract fetch "https://example.com" data.txt \
  --wait-selector ".content-loaded"

# Visible browser mode for debugging
scrapling extract fetch "https://example.com" page.html \
  --no-headless --disable-resources

# Use the installed Chrome browser
scrapling extract fetch "https://example.com" content.md --real-chrome
```
**fetch options:**

```
--headless / --no-headless  Run browser headless (default: True)
--disable-resources         Drop unnecessary resources for a speed boost
--network-idle              Wait for network idle
--timeout INTEGER           Timeout in milliseconds (default: 30000)
--wait INTEGER              Additional wait time in ms (default: 0)
-s, --css-selector TEXT     Extract specific content
--wait-selector TEXT        Wait for selector before proceeding
--locale TEXT               User locale (default: system)
--real-chrome               Use the installed Chrome browser
--proxy TEXT                Proxy URL
-H, --extra-headers TEXT    Extra headers (multiple allowed)
```
### stealthy-fetch - Bypass Protection

For websites with anti-bot protection or Cloudflare:

```bash
# Bypass basic protection
scrapling extract stealthy-fetch "https://example.com" content.md

# Solve Cloudflare challenges
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt \
  --solve-cloudflare \
  --css-selector "#padded_content a"

# Use a proxy for anonymity (set the PROXY_URL environment variable)
scrapling extract stealthy-fetch "https://site.com" content.md \
  --proxy "$PROXY_URL"
```
**stealthy-fetch options:** (same as fetch, plus)

```
--block-webrtc                 Block WebRTC entirely
--solve-cloudflare             Solve Cloudflare challenges
--allow-webgl / --block-webgl  Allow WebGL (default: True)
--hide-canvas                  Add noise to canvas operations
```
## CSS Selector Examples

Extract specific content with the `-s` or `--css-selector` flag:

```bash
# Extract all articles
scrapling extract get "https://blog.example.com" articles.md -s "article"

# Extract a specific class
scrapling extract get "https://example.com" titles.txt -s ".title"

# Extract by ID
scrapling extract get "https://example.com" content.md -s "#main-content"

# Extract links (href attributes)
scrapling extract get "https://example.com" links.txt -s "a::attr(href)"

# Extract text only
scrapling extract get "https://example.com" titles.txt -s "h1::text"
```
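The `::text` and `::attr(name)` pseudo-elements select text nodes and attribute values rather than whole elements. A rough standard-library sketch of what `a::attr(href)` extracts (Scrapling uses a full selector engine; this is only an illustration):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href attributes from <a> tags, roughly what a::attr(href) selects."""
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<p><a href="/docs">Docs</a> <a href="https://example.com">Home</a></p>')
print(collector.hrefs)  # ['/docs', 'https://example.com']
```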
## Help Commands
```bash
scrapling --help
scrapling extract --help
scrapling extract get --help
scrapling extract post --help
scrapling extract fetch --help
scrapling extract stealthy-fetch --help
```

## Resources
- Documentation: https://scrapling.readthedocs.io/
- GitHub: https://github.com/D4Vinci/Scrapling