# crawl4ai

High-performance web crawler with intelligent chunking. Crawls web pages and extracts content as markdown using LLM-based skeleton planning.

## Commands

### crawl_url (alias: webCrawl)
Crawl a web page with a LangGraph workflow and LLM-based intelligent chunking.

**Parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | - | Target URL to crawl (required) |
| `action` | `str` | `"smart"` | Action mode: `"smart"`, `"skeleton"`, `"crawl"` |
| `fit_markdown` | `bool` | `true` | Clean and simplify markdown output |
| `max_depth` | `int` | `0` | Maximum crawling depth (0 = single page) |
| `return_skeleton` | `bool` | `false` | Also return the document skeleton (TOC) |
| `chunk_indices` | `list[int]` | - | List of section indices to extract |

**Action Modes:**

| Mode | Description | Use Case |
|---|---|---|
| `smart` (default) | LLM generates a chunk plan, then extracts relevant sections | Large docs where you need specific info |
| `skeleton` | Extract a lightweight TOC without full content | Quick overview; decide what to read |
| `crawl` | Return full markdown content | Small pages where complete content is needed |

**Examples:**

```python
# Smart crawl with LLM chunking (default)
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com"})

# Skeleton only - get TOC quickly
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "action": "skeleton"})

# Full content crawl
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "action": "crawl"})

# Extract specific sections
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "chunk_indices": [0, 1, 2]})

# Deep crawl (follow links up to depth N)
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "max_depth": 2})

# Get skeleton with full content
@omni("crawl4ai.CrawlUrl", {"url": "https://example.com", "return_skeleton": true})
```
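The two-step pattern the examples suggest - fetch the skeleton first, then extract only the sections you need - can be sketched in plain Python. Here `call_omni` is a hypothetical stand-in (stubbed below) for whatever client dispatches `@omni` tool calls in your environment; the stubbed responses are for illustration only.

```python
# Hypothetical dispatcher; the stubbed responses only illustrate the shapes
# a skeleton call and a chunk-extraction call might return.
def call_omni(tool: str, params: dict) -> dict:
    if params.get("action") == "skeleton":
        return {"skeleton": [
            {"index": 0, "title": "Introduction"},
            {"index": 1, "title": "Installation"},
            {"index": 2, "title": "API Reference"},
        ]}
    return {"chunks": [f"...content of section {i}..."
                       for i in params["chunk_indices"]]}

# Step 1: fetch the lightweight TOC.
toc = call_omni("crawl4ai.CrawlUrl",
                {"url": "https://example.com", "action": "skeleton"})["skeleton"]

# Step 2: pick the sections you care about and extract only those.
wanted = [s["index"] for s in toc if "API" in s["title"]]
result = call_omni("crawl4ai.CrawlUrl",
                   {"url": "https://example.com", "chunk_indices": wanted})
print(result["chunks"])
```

This keeps the full page content out of the model's context entirely: only the TOC and the chosen sections are ever returned.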

## Core Concepts

| Topic | Description | Reference |
|---|---|---|
| Skeleton Planning | LLM sees the TOC (~500 tokens), not full content (~10k+ tokens) | smart-chunking.md |
| Chunk Extraction | Token-aware section extraction | chunking.md |
| Deep Crawling | Multi-page crawling with a BFS strategy | deep-crawl.md |
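The token savings behind skeleton planning come from reducing a document to its headings. A minimal sketch (not crawl4ai's actual implementation) of building such a skeleton from markdown:

```python
import re

def extract_skeleton(markdown: str) -> list[dict]:
    """Build a lightweight TOC (skeleton) from markdown headings.

    Each entry records a section index, heading level, and title -
    enough for an LLM to plan which chunks to extract, at a tiny
    fraction of the full document's token count.
    """
    skeleton = []
    for i, m in enumerate(re.finditer(r"^(#{1,6})\s+(.+)$", markdown, re.M)):
        skeleton.append({
            "index": i,
            "level": len(m.group(1)),   # number of '#' characters
            "title": m.group(2).strip(),
        })
    return skeleton

doc = "# Intro\nSome text.\n## Install\nSteps here.\n## Usage\nExamples here."
toc = extract_skeleton(doc)
print(toc)
```

The indices in the skeleton are what a `chunk_indices` request would later refer back to.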

## Best Practices

- Use `skeleton` mode first on large documents to understand their structure
- Use `chunk_indices` to extract specific sections instead of full content
- Set `max_depth` > 0 with care; the depth limit caps how many pages are crawled and prevents runaway crawls
- Keep `fit_markdown=true` for cleaner output; set it to `false` for raw content
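To see why a depth cap matters, here is a generic sketch of breadth-first crawling (the strategy deep-crawl.md names), assuming a `fetch_links` callback that returns the outgoing links of a page; it is not crawl4ai's internal code.

```python
from collections import deque

def bfs_crawl(start_url, fetch_links, max_depth):
    """Visit pages level by level, never following links past max_depth hops."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth budget exhausted: do not expand this page's links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(bfs_crawl("a", lambda u: graph.get(u, []), max_depth=1))
```

Without the depth check, every reachable link would be enqueued, which is exactly the runaway behavior the best practice warns about.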

## Advanced

- Batch multiple URLs with separate calls
- Combine with knowledge tools for RAG pipelines
- Use skeleton + LLM to auto-generate chunk plans for custom extraction
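Batching via separate calls can be wrapped in a small helper. `crawl_many` and the `call_omni` dispatcher it takes are hypothetical names for this sketch, not part of crawl4ai's API:

```python
def crawl_many(urls, call_omni, action="smart"):
    """Crawl several URLs, issuing one CrawlUrl call per URL."""
    return {url: call_omni("crawl4ai.CrawlUrl", {"url": url, "action": action})
            for url in urls}

# Fake dispatcher for illustration; substitute your environment's real one.
fake = lambda tool, params: {"url": params["url"], "markdown": "..."}
results = crawl_many(["https://a.example", "https://b.example"], fake)
```

Keeping one call per URL preserves per-page error handling and lets each page use its own action mode if needed.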