crawl4ai
High-performance web crawler with intelligent chunking. Crawls web pages and extracts content as markdown using LLM-based skeleton planning.
Commands
crawl_url (alias: webCrawl)
Crawl a web page with LangGraph workflow and LLM-based intelligent chunking.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | - | Target URL to crawl (required) |
| action | str | "smart" | Action mode: "smart", "skeleton", "crawl" |
| fit_markdown | bool | true | Clean and simplify markdown output |
| max_depth | int | 0 | Maximum crawling depth (0=single page) |
| return_skeleton | bool | false | Also return document skeleton (TOC) |
| chunk_indices | list[int] | - | List of section indices to extract |
Action Modes:
| Mode | Description | Use Case |
|---|---|---|
| smart | LLM generates chunk plan, then extracts relevant sections | Large docs where you need specific info |
| skeleton | Extract lightweight TOC without full content | Quick overview, decide what to read |
| crawl | Return full markdown content | Small pages, complete content needed |
Examples:
```python
# Smart crawl with LLM chunking (default)
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com"})

# Skeleton only - get TOC quickly
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "action": "skeleton"})

# Full content crawl
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "action": "crawl"})

# Extract specific sections
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "chunk_indices": [0, 1, 2]})

# Deep crawl (follow links up to depth N)
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "max_depth": 2})

# Get skeleton with full content
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "return_skeleton": true})
```
Core Concepts
| Topic | Description | Reference |
|---|---|---|
| Skeleton Planning | LLM sees TOC (~500 tokens) not full content (~10k+) | smart-chunking.md |
| Chunk Extraction | Token-aware section extraction | chunking.md |
| Deep Crawling | Multi-page crawling with BFS strategy | deep-crawl.md |
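The skeleton-planning idea can be sketched in a few lines: build a lightweight TOC from the page's markdown headings so the planning LLM reads a few hundred tokens instead of the full content. The `build_skeleton` helper below is a hypothetical illustration, not part of crawl4ai's API; its entry indices correspond to what chunk_indices would refer to.

```python
import re

def build_skeleton(markdown: str) -> list[dict]:
    # Collect every markdown heading as an indexed TOC entry.
    # The index is the handle later used to request specific sections.
    return [
        {"index": i, "level": len(m.group(1)), "title": m.group(2).strip()}
        for i, m in enumerate(re.finditer(r"^(#{1,6})\s+(.+)$", markdown, re.MULTILINE))
    ]

doc = "# Intro\nSome text.\n## Install\npip install steps.\n## Usage\nExamples here.\n"
print(build_skeleton(doc))
```

The skeleton is all the planner sees; the full section bodies stay out of the LLM context until explicitly requested.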
Best Practices
- Use `skeleton` mode first for large documents to understand their structure
- Use `chunk_indices` to extract specific sections instead of full content
- Set `max_depth` > 0 carefully to limit the pages crawled and prevent runaway crawling
- Keep `fit_markdown=true` for cleaner output; set it to false for raw content
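The skeleton-first workflow above boils down to: fetch the TOC, pick section indices, then extract only those sections under a token budget. A minimal sketch of the token-aware extraction step, approximating tokens as whitespace-separated words (`extract_chunks` is a hypothetical helper, not crawl4ai's API):

```python
import re

def extract_chunks(markdown: str, chunk_indices: list[int], max_tokens: int = 2000) -> str:
    # Split the page into sections at each heading; section i lines up
    # with skeleton index i in this sketch.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6}\s)", markdown) if s.strip()]
    out, budget = [], max_tokens
    for i in chunk_indices:
        if not 0 <= i < len(sections):
            continue
        words = len(sections[i].split())
        if words > budget:  # stop before exceeding the token budget
            break
        out.append(sections[i])
        budget -= words
    return "".join(out)

doc = "# Intro\nOverview text.\n## Install\npip steps.\n## Usage\nCall examples.\n"
print(extract_chunks(doc, [0, 2]))
```

Skipping unrequested sections is what keeps the LLM context small even on pages with tens of thousands of tokens of content.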
Advanced
- Batch multiple URLs with separate calls
- Combine with knowledge tools for RAG pipelines
- Use skeleton + LLM to auto-generate chunk plans for custom extraction
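To illustrate the RAG-pipeline combination: markdown chunks returned by the crawler can be dropped into a retrieval index and queried later. A toy keyword-overlap retriever (a sketch only; a real pipeline would use embeddings and a vector store):

```python
def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    # Score each crawled chunk by word overlap with the query, return top-k.
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

chunks = [
    "Install the crawler with pip and configure the API key.",
    "Deep crawling follows links breadth-first up to max_depth.",
    "Skeleton planning sends only the TOC to the LLM.",
]
print(retrieve(chunks, "how does deep crawling follow links", k=1))
```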