Loading...
Loading...
This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
npx skill4agent add aiskillstore/marketplace crawl4aicrwlpip install crawl4ai
crawl4ai-setup
# Verify installation
crawl4ai-doctor# Basic crawling - returns markdown
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache
# See more examples
crwl --exampleimport asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:500])
asyncio.run(main())| Concept | CLI | SDK |
|---|---|---|
| Browser settings | | |
| Crawl settings | | |
| Extraction | | |
| Content filter | | |
headlessviewport_width/heightuser_agentproxy_configpage_timeoutwait_forcache_modejs_codecss_selector# Basic markdown
crwl https://docs.example.com -o markdown
# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit
# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown) # Filtered
print(result.markdown.raw_markdown) # Original# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json{
"name": "products",
"baseSelector": ".product-card",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"crwl https://shop.com -e extract_llm.yml -o jsoncrwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0for url in url1 url2 url3; do crwl "$url" -o markdown; doneurls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)# login_crawler.yml
session_id: "user_session"
js_code: |
document.querySelector('#username').value = 'user';
document.querySelector('#password').value = 'pass';
document.querySelector('#submit').click();
wait_for: "css:.dashboard"# Login
crwl https://site.com/login -C login_crawler.yml
# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"# browser.yml
headless: true
proxy_config:
server: "http://proxy:8080"
username: "user"
password: "pass"
user_agent_mode: "random"crwl https://example.com -B browser.yml# Search Google and get results as JSON
python scripts/google_search.py "your search query" 20
# Example
python scripts/google_search.py "2026年Go语言展望" 20google_search_results.jsoncrwl https://docs.example.com -o markdown > docs.md# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done# First view content
crwl https://example.com -o markdown
# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"| Document | Purpose |
|---|---|
| CLI Guide | Command-line interface reference |
| SDK Guide | Python SDK quick reference |
| Complete SDK Reference | Full API documentation (5900+ lines) |
--bypass-cachecrwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"crwl https://example.com -B browser.yml# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"# Debug: see full output
crwl https://example.com -o all -v
# Try different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session