scrapling-official
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseScrapling
Scrapling
Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
Requires: Python 3.10+
This is the official skill for the scrapling library by the library author.
Scrapling是一款自适应网页抓取框架,可覆盖从单次请求到全量大规模爬取的所有需求。
它的解析器可学习网站变化,当页面更新时自动重定位你需要的元素。它的抓取器开箱即可绕过Cloudflare Turnstile这类反机器人系统。它的爬虫框架仅需几行Python代码即可实现并发、多会话爬取,支持暂停/恢复和自动代理轮换——一个库满足所有需求,无需妥协。
支持极速爬取,附带实时统计和流处理能力。由网页抓取从业者为开发者和普通用户打造,人人都能找到适用的功能。
要求:Python 3.10+
这是Scrapling库作者官方发布的对应skill。
Setup (once)
配置(仅需一次)
Create a virtual Python environment through any way available, like , then inside the environment do:
venvpip install "scrapling[all]>=0.4.1"Then do this to download all the browsers' dependencies:
bash
scrapling install --forceMake note of the binary path and use it instead of from now on with all commands (if is not on ).
scraplingscraplingscrapling$PATH通过任意可用方式创建Python虚拟环境,比如,然后在环境内执行:
venvpip install "scrapling[all]>=0.4.1"然后执行以下命令下载所有浏览器依赖:
bash
scrapling install --force如果不在系统中,请记下二进制文件的路径,后续所有命令都使用该路径代替。
scrapling$PATHscraplingscraplingDocker
Docker
Another option if the user doesn't have Python or doesn't want to use it is to use the Docker image, but this can be used only in the commands, so no writing Python code for scrapling this way:
bash
docker pull pyd4vinci/scraplingor
bash
docker pull ghcr.io/d4vinci/scrapling:latest如果用户没有安装Python或者不想使用Python,也可以选择使用Docker镜像,但这种方式仅支持命令行使用,无法编写Python代码调用Scrapling:
bash
docker pull pyd4vinci/scrapling或者
bash
docker pull ghcr.io/d4vinci/scrapling:latestCLI Usage
CLI 使用说明
The command group lets you download and extract content from websites directly without writing any code.
scrapling extractbash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...
Commands:
get Perform a GET request and save the content to a file.
post Perform a POST request and save the content to a file.
put Perform a PUT request and save the content to a file.
delete Perform a DELETE request and save the content to a file.
fetch Use a browser to fetch content with browser automation and flexible options.
stealthy-fetch Use a stealthy browser to fetch content with advanced stealth features.scrapling extractbash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...
Commands:
get 执行GET请求并将内容保存到文件。
post 执行POST请求并将内容保存到文件。
put 执行PUT请求并将内容保存到文件。
delete 执行DELETE请求并将内容保存到文件。
fetch 通过浏览器自动化能力抓取内容,支持灵活配置。
stealthy-fetch 使用隐形浏览器抓取内容,搭载高级隐形特性。Usage pattern
使用模式
- Choose your output format by changing the file extension. Here are some examples for the command:
scrapling extract get- Convert the HTML content to Markdown, then save it to the file (great for documentation):
scrapling extract get "https://blog.example.com" article.md - Save the HTML content as it is to the file:
scrapling extract get "https://example.com" page.html - Save a clean version of the text content of the webpage to the file:
scrapling extract get "https://example.com" content.txt
- Convert the HTML content to Markdown, then save it to the file (great for documentation):
- Output to a temp file, read it back, then clean up.
- All commands can use CSS selectors to extract specific parts of the page through or
--css-selector.-s
Which command to use generally:
- Use with simple websites, blogs, or news articles.
get - Use with modern web apps, or sites with dynamic content.
fetch - Use with protected sites, Cloudflare, or anti-bot systems.
stealthy-fetch
When unsure, start with. If it fails or returns empty content, escalate toget, thenfetch. The speed ofstealthy-fetchandfetchis nearly the same, so you are not sacrificing anything.stealthy-fetch
- 通过修改文件后缀选择输出格式,以下是命令的示例:
scrapling extract get- 将HTML内容转换为Markdown后保存到文件(非常适合文档场景):
scrapling extract get "https://blog.example.com" article.md - 直接保存原始HTML内容到文件:
scrapling extract get "https://example.com" page.html - 保存网页纯净版文本内容到文件:
scrapling extract get "https://example.com" content.txt
- 将HTML内容转换为Markdown后保存到文件(非常适合文档场景):
- 输出到临时文件,读取后立即清理。
- 所有命令都支持通过或
--css-selector参数使用CSS选择器提取页面特定部分。-s
通用命令选择规则:
- 简单网站、博客、新闻文章使用
get - 现代Web应用、动态内容站点使用
fetch - 受保护站点、Cloudflare防护站点、带反爬系统的站点使用
stealthy-fetch
不确定的情况下先使用,如果失败或返回空内容,再升级到get,之后是fetch。stealthy-fetch和fetch的速度几乎一致,不会有性能损失。stealthy-fetch
Key options (requests)
核心选项(HTTP请求类)
Those options are shared between the 4 HTTP request commands:
| Option | Input type | Description |
|---|---|---|
| -H, --headers | TEXT | HTTP headers in format "Key: Value" (can be used multiple times) |
| --cookies | TEXT | Cookies string in format "name1=value1; name2=value2" |
| --timeout | INTEGER | Request timeout in seconds (default: 30) |
| --proxy | TEXT | Proxy URL in format "http://username:password@host:port" |
| -s, --css-selector | TEXT | CSS selector to extract specific content from the page. It returns all matches. |
| -p, --params | TEXT | Query parameters in format "key=value" (can be used multiple times) |
| --follow-redirects / --no-follow-redirects | None | Whether to follow redirects (default: True) |
| --verify / --no-verify | None | Whether to verify SSL certificates (default: True) |
| --impersonate | TEXT | Browser to impersonate. Can be a single browser (e.g., Chrome) or a comma-separated list for random selection (e.g., Chrome, Firefox, Safari). |
| --stealthy-headers / --no-stealthy-headers | None | Use stealthy browser headers (default: True) |
Options shared between and only:
postput| Option | Input type | Description |
|---|---|---|
| -d, --data | TEXT | Form data to include in the request body (as string, ex: "param1=value1¶m2=value2") |
| -j, --json | TEXT | JSON data to include in the request body (as string) |
Examples:
bash
undefined以下选项是4个HTTP请求命令共享的配置:
| 选项 | 输入类型 | 说明 |
|---|---|---|
| -H, --headers | TEXT | HTTP请求头,格式为"Key: Value"(可多次使用) |
| --cookies | TEXT | Cookie字符串,格式为"name1=value1; name2=value2" |
| --timeout | INTEGER | 请求超时时间,单位为秒(默认:30) |
| --proxy | TEXT | 代理URL,格式为"http://username:password@host:port" |
| -s, --css-selector | TEXT | 用于提取页面特定内容的CSS选择器,返回所有匹配结果 |
| -p, --params | TEXT | 查询参数,格式为"key=value"(可多次使用) |
| --follow-redirects / --no-follow-redirects | None | 是否跟随重定向(默认:True) |
| --verify / --no-verify | None | 是否验证SSL证书(默认:True) |
| --impersonate | TEXT | 要模拟的浏览器,可以是单个浏览器(如Chrome),也可以是逗号分隔的列表用于随机选择(如Chrome, Firefox, Safari) |
| --stealthy-headers / --no-stealthy-headers | None | 使用隐形浏览器请求头(默认:True) |
仅和命令共享的选项:
postput| 选项 | 输入类型 | 说明 |
|---|---|---|
| -d, --data | TEXT | 请求体中的表单数据,字符串格式,例如:"param1=value1¶m2=value2" |
| -j, --json | TEXT | 请求体中的JSON数据,字符串格式 |
示例:
bash
undefinedBasic download
基础下载
scrapling extract get "https://news.site.com" news.md
scrapling extract get "https://news.site.com" news.md
Download with custom timeout
自定义超时时间下载
scrapling extract get "https://example.com" content.txt --timeout 60
scrapling extract get "https://example.com" content.txt --timeout 60
Extract only specific content using CSS selectors
使用CSS选择器仅提取特定内容
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"
Send a request with cookies
携带Cookie发送请求
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"
Add user agent
添加用户代理
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"
Add multiple headers
添加多个请求头
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
undefinedscrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
undefinedKey options (browsers)
核心选项(浏览器类)
Both ( / ) share options:
fetchstealthy-fetch| Option | Input type | Description |
|---|---|---|
| --headless / --no-headless | None | Run browser in headless mode (default: True) |
| --disable-resources / --enable-resources | None | Drop unnecessary resources for speed boost (default: False) |
| --network-idle / --no-network-idle | None | Wait for network idle (default: False) |
| --real-chrome / --no-real-chrome | None | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False) |
| --timeout | INTEGER | Timeout in milliseconds (default: 30000) |
| --wait | INTEGER | Additional wait time in milliseconds after page load (default: 0) |
| -s, --css-selector | TEXT | CSS selector to extract specific content from the page. It returns all matches. |
| --wait-selector | TEXT | CSS selector to wait for before proceeding |
| --proxy | TEXT | Proxy URL in format "http://username:password@host:port" |
| -H, --extra-headers | TEXT | Extra headers in format "Key: Value" (can be used multiple times) |
This option is specific to only:
fetch| Option | Input type | Description |
|---|---|---|
| --locale | TEXT | Specify user locale. Defaults to the system default locale. |
And these options are specific to only:
stealthy-fetch| Option | Input type | Description |
|---|---|---|
| --block-webrtc / --allow-webrtc | None | Block WebRTC entirely (default: False) |
| --solve-cloudflare / --no-solve-cloudflare | None | Solve Cloudflare challenges (default: False) |
| --allow-webgl / --block-webgl | None | Allow WebGL (default: True) |
| --hide-canvas / --show-canvas | None | Add noise to canvas operations (default: False) |
Examples:
bash
undefinedfetchstealthy-fetch| 选项 | 输入类型 | 说明 |
|---|---|---|
| --headless / --no-headless | None | 以无头模式运行浏览器(默认:True) |
| --disable-resources / --enable-resources | None | 丢弃不必要的资源以提升速度(默认:False) |
| --network-idle / --no-network-idle | None | 等待网络空闲(默认:False) |
| --real-chrome / --no-real-chrome | None | 如果设备上安装了Chrome浏览器,启用该选项后,抓取器将启动你本地的Chrome实例使用(默认:False) |
| --timeout | INTEGER | 超时时间,单位为毫秒(默认:30000) |
| --wait | INTEGER | 页面加载后的额外等待时间,单位为毫秒(默认:0) |
| -s, --css-selector | TEXT | 用于提取页面特定内容的CSS选择器,返回所有匹配结果 |
| --wait-selector | TEXT | 继续执行前等待匹配该CSS选择器的元素出现 |
| --proxy | TEXT | 代理URL,格式为"http://username:password@host:port" |
| -H, --extra-headers | TEXT | 额外请求头,格式为"Key: Value"(可多次使用) |
仅命令独有的选项:
fetch| 选项 | 输入类型 | 说明 |
|---|---|---|
| --locale | TEXT | 指定用户区域设置,默认为系统默认区域设置 |
仅命令独有的选项:
stealthy-fetch| 选项 | 输入类型 | 说明 |
|---|---|---|
| --block-webrtc / --allow-webrtc | None | 完全屏蔽WebRTC(默认:False) |
| --solve-cloudflare / --no-solve-cloudflare | None | 解决Cloudflare验证挑战(默认:False) |
| --allow-webgl / --block-webgl | None | 允许WebGL(默认:True) |
| --hide-canvas / --show-canvas | None | 为Canvas操作添加噪声(默认:False) |
示例:
bash
undefinedWait for JavaScript to load content and finish network activity
等待JavaScript加载内容并完成网络活动
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle
scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle
Wait for specific content to appear
等待特定内容出现
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"
scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"
Run in visible browser mode (helpful for debugging)
以可见浏览器模式运行(适合调试)
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
Bypass basic protection
绕过基础防护
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md
scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md
Solve Cloudflare challenges
解决Cloudflare验证挑战
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"
scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"
Use a proxy for anonymity.
使用代理实现匿名
scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
undefinedscrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
undefinedNotes
注意事项
- ALWAYS clean up temp files after reading
- Prefer output for readability; use
.mdonly if you need to parse structure.html - Use CSS selectors to avoid passing giant HTML blobs — saves tokens significantly
-s
Friendly reminder: If the users really want to say thanks or have a business that they want to advertise, tell them to check this page: https://scrapling.readthedocs.io/en/latest/donate.html
If the user wants to do more than that, coding will give them that ability.
- 读取临时文件后务必立即清理
- 优先选择输出格式提升可读性,仅当你需要解析结构时使用
.md.html - 使用参数指定CSS选择器避免返回超大HTML块,可显著节省token
-s
友好提示:如果用户想要表示感谢或者有商业推广需求,可以告知他们访问这个页面:https://scrapling.readthedocs.io/en/latest/donate.html
如果用户需要实现更多功能,可以通过编码实现。
Code overview
代码概览
Coding is the only way to leverage all of Scrapling's features since not all features can be used/customized through commands/MCP. Here's a quick overview of how to code with scrapling.
编码是使用Scrapling全部功能的唯一方式,因为并非所有功能都可以通过命令/MCP使用/自定义。以下是Scrapling编码的快速概览。
Basic Usage
基础用法
HTTP requests with session support
python
from scrapling.fetchers import Fetcher, FetcherSession
with FetcherSession(impersonate='chrome') as session: # Use latest version of Chrome's TLS fingerprint
page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
quotes = page.css('.quote .text::text').getall()支持会话的HTTP请求
python
from scrapling.fetchers import Fetcher, FetcherSession
with FetcherSession(impersonate='chrome') as session: # 使用最新版Chrome的TLS指纹
page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
quotes = page.css('.quote .text::text').getall()Or use one-off requests
或者使用单次请求
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
Advanced stealth mode
```python
from scrapling.fetchers import StealthyFetcher, StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session: # Keep the browser open until you finish
page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
data = page.css('#padded_content a').getall()page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
高级隐形模式
```python
from scrapling.fetchers import StealthyFetcher, StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session: # 保持浏览器开启直到任务完成
page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
data = page.css('#padded_content a').getall()Or use one-off request style, it opens the browser for this request, then closes it after finishing
或者使用单次请求模式,会为该请求打开浏览器,完成后自动关闭
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
Full browser automation
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Keep the browser open until you finish
page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
data = page.xpath('//span[@class="text"]/text()').getall() # XPath selector if you prefer itpage = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
全量浏览器自动化
```python
from scrapling.fetchers import DynamicFetcher, DynamicSession
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # 保持浏览器开启直到任务完成
page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
data = page.xpath('//span[@class="text"]/text()').getall() # 可选择使用XPath选择器Or use one-off request style, it opens the browser for this request, then closes it after finishing
或者使用单次请求模式,会为该请求打开浏览器,完成后自动关闭
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
undefinedpage = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
undefinedSpiders
爬虫
Build full crawlers with concurrent requests, multiple session types, and pause/resume:
python
from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
name = "quotes"
start_urls = ["https://quotes.toscrape.com/"]
concurrent_requests = 10
async def parse(self, response: Response):
for quote in response.css('.quote'):
yield {
"text": quote.css('.text::text').get(),
"author": quote.css('.author::text').get(),
}
next_page = response.css('.next a')
if next_page:
yield response.follow(next_page[0].attrib['href'])
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")Use multiple session types in a single spider:
python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
name = "multi"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
# Route protected pages through the stealth session
if "protected" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast", callback=self.parse) # explicit callbackPause and resume long crawls with checkpoints by running the spider like this:
python
QuotesSpider(crawldir="./crawl_data").start()Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same , and it will resume from where it stopped.
crawldir构建支持并发请求、多会话类型、暂停/恢复的全功能爬虫:
python
from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
name = "quotes"
start_urls = ["https://quotes.toscrape.com/"]
concurrent_requests = 10
async def parse(self, response: Response):
for quote in response.css('.quote'):
yield {
"text": quote.css('.text::text').get(),
"author": quote.css('.author::text').get(),
}
next_page = response.css('.next a')
if next_page:
yield response.follow(next_page[0].attrib['href'])
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")在单个爬虫中使用多会话类型:
python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
name = "multi"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
# 受保护的页面通过隐形会话路由
if "protected" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast", callback=self.parse) # 显式指定回调通过以下方式运行爬虫即可实现断点续爬:
python
QuotesSpider(crawldir="./crawl_data").start()按下Ctrl+C即可优雅暂停,进度会自动保存。后续重新启动爬虫时传入相同的,即可从停止位置恢复爬取。
crawldirAdvanced Parsing & Navigation
高级解析与导航
python
from scrapling.fetchers import Fetcherpython
from scrapling.fetchers import FetcherRich element selection and navigation
丰富的元素选择与导航能力
page = Fetcher.get('https://quotes.toscrape.com/')
page = Fetcher.get('https://quotes.toscrape.com/')
Get quotes with multiple selection methods
多种选择器获取名言
quotes = page.css('.quote') # CSS selector
quotes = page.xpath('//div[@class="quote"]') # XPath
quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-style
quotes = page.css('.quote') # CSS选择器
quotes = page.xpath('//div[@class="quote"]') # XPath
quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup风格
Same as
等价于
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # 还有更多用法...
Find element by text content
通过文本内容查找元素
quotes = page.find_by_text('quote', tag='div')
quotes = page.find_by_text('quote', tag='div')
Advanced navigation
高级导航
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall() # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall() # 链式选择器
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent
Element relationships and similarity
元素关系与相似度
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
You can use the parser right away if you don't want to fetch websites like below:
```python
from scrapling.parser import Selector
page = Selector("<html>...</html>")And it works precisely the same way!
imilar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
如果你不需要抓取网站,也可以直接使用解析器:
```python
from scrapling.parser import Selector
page = Selector("<html>...</html>")使用方式完全一致!
Async Session Management Examples
异步会话管理示例
python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
async with FetcherSession(http3=True) as session: # `FetcherSession` is context-aware and can work in both sync/async patterns
page1 = session.get('https://quotes.toscrape.com/')
page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
async with FetcherSession(http3=True) as session: # `FetcherSession`支持上下文感知,可同时在同步/异步模式下工作
page1 = session.get('https://quotes.toscrape.com/')
page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')Async session usage
异步会话使用
async with AsyncStealthySession(max_pages=2) as session:
tasks = []
urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
task = session.fetch(url)
tasks.append(task)
print(session.get_pool_stats()) # Optional - The status of the browser tabs pool (busy/free/error)
results = await asyncio.gather(*tasks)
print(session.get_pool_stats())undefinedasync with AsyncStealthySession(max_pages=2) as session:
tasks = []
urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
task = session.fetch(url)
tasks.append(task)
print(session.get_pool_stats()) # 可选 - 浏览器标签池状态(繁忙/空闲/错误)
results = await asyncio.gather(*tasks)
print(session.get_pool_stats())undefinedReferences
参考文档
You already had a good glimpse of what the library can do. Use the references below to dig deeper when needed
- — MCP server tools and capabilities
references/mcp-server.md - — Everything you need for parsing HTML
references/parsing - — Everything you need to fetch websites and session persistence
references/fetching - — Everything you need to write spiders, proxy rotation, and advanced features. It follows a Scrapy-like format
references/spiders - — A quick API comparison between scrapling and Beautifulsoup
references/migrating_from_beautifulsoup.md - — Full official docs in Markdown for quick access (use only if current references do not look up-to-date).
https://github.com/D4Vinci/Scrapling/tree/main/docs
This skill encapsulates almost all the published documentation in Markdown, so don't check external sources or search online without the user's permission.
你已经大致了解了这个库的能力,需要深入了解时可以使用以下参考文档:
- — MCP服务器工具与能力
references/mcp-server.md - — HTML解析相关的所有内容
references/parsing - — 网站抓取与会话持久化相关的所有内容
references/fetching - — 编写爬虫、代理轮换、高级功能相关的所有内容,遵循类Scrapy的格式
references/spiders - — Scrapling与Beautifulsoup的快速API对比
references/migrating_from_beautifulsoup.md - — 完整的官方Markdown文档,仅当当前参考文档不是最新版本时使用
https://github.com/D4Vinci/Scrapling/tree/main/docs
本skill封装了几乎所有已发布的Markdown格式文档,未经用户许可请勿查阅外部资源或在线搜索。
Guardrails (Always)
安全规则(始终遵守)
- Only scrape content you're authorized to access.
- Respect robots.txt and ToS.
- Add delays (download_delay) for large crawls.
- Don't bypass paywalls or authentication without permission.
- Never scrape personal/sensitive data.
- 仅抓取你有权访问的内容
- 遵守robots.txt和服务条款
- 大规模爬取时添加延迟(download_delay)
- 未经许可不要绕过付费墙或身份验证
- 绝不抓取个人/敏感数据