# Documentation Scraper with slurp-ai

## Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; it is FOR AI consumption.
## CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:

- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`cat`, `head`, `grep`, etc.)

The sandbox blocks the network access and file operations required for web scraping.
## Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

```bash
which slurp || echo "NOT INSTALLED"
```

If not installed, ask the user to run:

```bash
npm install -g slurp-ai
```

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
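The Node.js v20+ requirement can be checked mechanically before installing. A minimal sketch; the hard-coded `node_version` is a stand-in for the output of `node --version`:

```shell
# Gate on Node.js major version >= 20 before installing slurp-ai.
# Stand-in value; in practice: node_version=$(node --version)
node_version="v20.11.1"
major=${node_version#v}      # strip the leading "v"
major=${major%%.*}           # keep only the major component
if [ "$major" -ge 20 ]; then
  echo "Node.js OK ($node_version)"
else
  echo "Node.js v20+ required (found $node_version)" >&2
fi
```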
## Commands

| Command | Purpose |
|---|---|
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url> [version]` | Download docs to partials only |
| `slurp compile` | Compile partials into single file |
| `slurp read <package> [version]` | Read local documentation |

Output: Creates `slurp_compiled/compiled_docs.md` from partials in `slurp_partials/`.
## CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your `--max` and `--base-path` decisions.
### Step 1: Run Sitemap Analysis

```bash
node analyze-sitemap.js https://docs.example.com
```

This outputs:

- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
### Step 2: Interpret the Output

Example output:

```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs    182 pages
   /api      45 pages
   /blog     20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:
   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```
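The section grouping in that output can be approximated with standard tools. A sketch over a hand-written URL list (in practice the URLs would come from the sitemap's `<loc>` entries):

```shell
# Illustrative sitemap URLs; a real run extracts these from sitemap.xml
urls="https://docs.example.com/docs/guides/intro
https://docs.example.com/docs/guides/setup
https://docs.example.com/api/auth
https://docs.example.com/blog/release-1"

# Total page count (informs --max)
echo "$urls" | wc -l

# Pages per top-level section, most populous first (informs --base-path)
echo "$urls" | sed -E 's|https://[^/]+(/[^/]*).*|\1|' | sort | uniq -c | sort -rn
```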
### Step 3: Choose Scope Based on Analysis

| Sitemap Shows | Action |
|---|---|
| < 50 pages total | Scrape entire site: `slurp <url>` |
| 50-200 pages | Scope to relevant section with `--base-path` |
| 200+ pages | Must scope down; pick a specific subsection |
| No sitemap found | Start with the default `--max` (20), inspect partials, adjust |
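The table's thresholds can be encoded as a small helper; the function name is illustrative, the thresholds are the ones above:

```shell
# Map a sitemap page count to a scoping recommendation (mirrors the table).
suggest_scope() {
  if [ "$1" -lt 50 ]; then
    echo "scrape entire site"
  elif [ "$1" -le 200 ]; then
    echo "scope to a section with --base-path"
  else
    echo "scope down to a specific subsection"
  fi
}

suggest_scope 247   # the 247-page example sitemap above falls in the 200+ bucket
```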
### Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55
```

Key insight: the starting URL is where crawling begins; the base path filters which links get followed. They can differ (useful when the base path itself returns a 404).
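The filtering behavior reduces to a prefix match. This sketch mimics what `--base-path` conceptually does (the URLs are made up):

```shell
# Follow only links that share the base-path prefix.
base="https://docs.example.com/docs/api/"
links="https://docs.example.com/docs/api/routes
https://docs.example.com/blog/new-release
https://docs.example.com/docs/api/errors"

echo "$links" | while read -r url; do
  case $url in
    "$base"*) echo "follow: $url" ;;
    *)        echo "skip:   $url" ;;
  esac
done
```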
## Common Scraping Patterns

### Library Documentation (versioned)

```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
```

### API Reference Only

```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```

### Full Documentation Site

```bash
slurp https://docs.example.com/
```
## CLI Options

| Flag | Default | Purpose |
|---|---|---|
| `--max` | 20 | Maximum pages to scrape |
| | 5 | Parallel page requests |
| | true | Use headless browser |
| `--base-path` | start URL | Filter links to this prefix |
| | `./slurp_partials` | Output directory for partials |
| | 3 | Retries for failed requests |
| | 1000 | Delay between retries (ms) |
| `--yes` | - | Skip confirmation prompts |
## Compile Options

| Flag | Default | Purpose |
|---|---|---|
| | `./slurp_partials` | Input directory |
| | `./slurp_compiled/compiled_docs.md` | Output file |
| | true | Keep metadata blocks |
| | true | Strip nav elements |
| | true | Eliminate duplicates |
| | - | JSON array of regex patterns to exclude |
## When to Disable Headless Mode

- Static HTML documentation sites
- Faster scraping when JS rendering is not needed

The default is headless (true), which works for most modern doc sites, including SPAs.
## Output Structure

```
slurp_partials/          # Intermediate files
├── page1.md
└── page2.md

slurp_compiled/          # Final output
└── compiled_docs.md     # Compiled result
```
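Conceptually, the compile step folds the partials into the single output file. The real `slurp compile` also cleans and deduplicates; this sketch only shows the shape of the pipeline, using stand-in partials:

```shell
# Stand-in partials; a real run populates slurp_partials/ via slurp fetch.
mkdir -p slurp_partials slurp_compiled
printf '# Page 1\n\nFirst page body.\n' > slurp_partials/page1.md
printf '# Page 2\n\nSecond page body.\n' > slurp_partials/page2.md

# Naive compile: concatenate all partials into the compiled output.
cat slurp_partials/*.md > slurp_compiled/compiled_docs.md
grep -c '^# ' slurp_compiled/compiled_docs.md   # two top-level headings
```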
## Quick Reference

```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md
```
## Common Issues

| Problem | Cause | Solution |
|---|---|---|
| Wrong `--max` value | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure headless mode is on (default) |
| Crawl stuck/slow | Rate limiting | Reduce concurrency |
| Duplicate sections | Similar content | Keep duplicate elimination on (default) |
| Wrong pages included | Base path too broad | Use sitemap to find the correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add the `--yes` flag |
## Post-Scrape Usage

The output markdown is designed for AI context injection:

```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```
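For context budgeting, the byte count can be turned into a rough token estimate. A sketch using the common ~4 characters-per-token heuristic (actual tokenization varies by model; the generated 4000-byte file is a stand-in for a real compiled output):

```shell
# Create a 4000-byte stand-in for slurp_compiled/compiled_docs.md
mkdir -p slurp_compiled
printf 'x%.0s' $(seq 1 4000) > slurp_compiled/compiled_docs.md

# ~4 chars per token is a rough budgeting heuristic, not a tokenizer
bytes=$(wc -c < slurp_compiled/compiled_docs.md)
echo "approx tokens: $((bytes / 4))"
```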
## When NOT to Use

- API specs in OpenAPI/Swagger: use dedicated parsers instead
- GitHub READMEs: fetch directly via raw.githubusercontent.com
- npm package docs: often better to read source + README
- Frequently updated docs: consider a caching strategy