Fetches web pages and converts them to clean markdown using a robust 3-tier chain (Firecrawl → Jina Reader → Scrapling stealth browser). Use this skill instead of WebFetch whenever the user provides a URL and needs the page's text content — especially for sites that block direct access: medium.com articles (paywalled/metered), WeChat public accounts (mp.weixin.qq.com, geo-restricted), documentation sites with bot protection, or any page where simple HTTP fetching might return a CAPTCHA or empty page. Triggers for: "read this URL", "summarize this article/page", "grab the content from", "extract text from", "what does this page say", "fetch this link", or any request to access and process a specific web page. Do NOT trigger for: building scrapers, checking HTTP status codes, parsing already-downloaded HTML files, answering conceptual questions about scraping tools, or monitoring page changes.
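The tier fallback just described can be sketched as a short loop. This is a minimal illustration, not the skill's actual code — the three `fetch_via_*` functions are hypothetical stand-ins for the real scraper modules:

```python
def fetch_via_firecrawl(url: str) -> str:   # stand-in: real tier calls the Firecrawl API
    raise RuntimeError("no API key configured")

def fetch_via_jina(url: str) -> str:        # stand-in: real tier proxies through r.jina.ai
    return "# Example Domain\n\nSome text."

def fetch_via_scrapling(url: str) -> str:   # stand-in: real tier drives a headless browser
    return "# Example Domain (browser)"

def fetch_markdown(url: str) -> str:
    """Try each tier in order; return the first non-empty result."""
    errors = []
    for tier in (fetch_via_firecrawl, fetch_via_jina, fetch_via_scrapling):
        try:
            markdown = tier(url)
            if markdown.strip():              # an empty page counts as a failure
                return markdown
        except Exception as exc:              # network error, CAPTCHA page, timeout…
            errors.append(f"{tier.__name__}: {exc}")
    raise RuntimeError("All tiers failed: " + "; ".join(errors))
```

Here tier 1 fails (no key), so the result comes from tier 2 — the same degradation path the skill follows when `FIRECRAWL_API_KEY` is unset.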
## Installation

```bash
npx skill4agent add gn00678465/crawler-skill crawler
```

## Quick start

```bash
uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

## Tier chain

| Tier | Module | Requires |
|---|---|---|
| 1 | Firecrawl (`firecrawl_scraper.py`) | `FIRECRAWL_API_KEY` env var (optional, best results) |
| 2 | Jina Reader (`jina_reader.py`) | Nothing — free, no key needed |
| 3 | Scrapling (`scrapling_scraper.py`) | Local headless browser (auto-installs via pip) |
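A caller can check up front which tiers are usable in the current environment. A hedged sketch, assuming the only credential is the `FIRECRAWL_API_KEY` variable shown in the usage examples:

```python
import os

def available_tiers() -> list[str]:
    """Tiers usable right now, in fallback order."""
    tiers = []
    if os.environ.get("FIRECRAWL_API_KEY"):   # tier 1 needs an API key
        tiers.append("firecrawl")
    tiers += ["jina", "scrapling"]            # tiers 2-3 need no credentials
    return tiers
```

Tiers 2 and 3 are always listed, which is why the chain degrades gracefully rather than failing outright when no key is set.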
## Layout

```
crawler-skill/
├── SKILL.md                     ← this file
├── scripts/
│   ├── crawl.py                 ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py     ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py ← Tier 1: Firecrawl API
│       ├── jina_reader.py       ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py            ← 70 pytest tests (all passing)
```

## Usage
```bash
# Basic fetch — tries Firecrawl, falls back to Jina, then Scrapling.
# Always prefer using --output to avoid terminal encoding issues.
uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

# If no --output is provided, markdown goes to stdout (not recommended on Windows)
uv run scripts/crawl.py --url https://example.com

# With a Firecrawl API key for best results
FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
```

## URL validation

Only `http://` and `https://` URLs are accepted; other schemes (`ftp://`, `file://`, `javascript:`) are rejected with exit code 1.

## Output paths

A relative `--output` path such as `reports/result.md` resolves under `{project_root}/reports`:

```bash
uv run scripts/crawl.py --url <URL> --output reports/result.md
```

## Self-hosted Firecrawl

Point the script at a self-hosted Firecrawl instance via `FIRECRAWL_API_URL`:

```bash
FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com
```

## Errors

Failures are reported with an `Error:` prefix (e.g. `Error: Access Denied`) and a non-zero exit code.

## Domain routing

Some sites are known to defeat specific tiers, so `scripts/src/domain_router.py` skips them up front:

| Domain | Skipped tiers | Active chain |
|---|---|---|
| `medium.com` (and `*.medium.com`) | firecrawl | jina → scrapling |
| `mp.weixin.qq.com` | firecrawl + jina | scrapling only |
| everything else | — | firecrawl → jina → scrapling |
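The routing table can be mirrored in a few lines. A hypothetical sketch — the rule keys and suffix-matching behavior are assumptions for illustration; the authoritative rules live in `scripts/src/domain_router.py`:

```python
from urllib.parse import urlparse

# Per-domain tiers to skip, mirroring the table above.
SKIP_RULES = {
    "medium.com": {"firecrawl"},                 # paywalled/metered
    "mp.weixin.qq.com": {"firecrawl", "jina"},   # geo-restricted
}

ALL_TIERS = ["firecrawl", "jina", "scrapling"]

def active_chain(url: str) -> list[str]:
    """Return the tier chain to try for this URL, in order."""
    host = urlparse(url).hostname or ""
    for domain, skipped in SKIP_RULES.items():
        # Exact host or any subdomain of the rule's domain.
        if host == domain or host.endswith("." + domain):
            return [t for t in ALL_TIERS if t not in skipped]
    return ALL_TIERS
```

Under these assumptions, `blog.medium.com` picks up the `medium.com` rule while unrelated hosts fall through to the full three-tier chain.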
Matching is suffix-based: `blog.medium.com` matches the `medium.com` rule (via the `.medium.com` suffix), while `other.weixin.qq.com` matches neither rule and gets the full chain.

## Tests

```bash
uv run pytest tests/ -v
```

## Dependencies

`crawl.py` declares its dependencies inline (PEP 723), so `uv run` resolves them automatically:

- `firecrawl-py>=2.0`
- `httpx>=0.27`
- `scrapling>=0.2`
- `html2text>=2024.2.26`

## Programmatic use

Invoke `crawl.py` as a subprocess and read the markdown from stdout:

```python
import subprocess

url = "https://example.com"
result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True, text=True,
)
if result.returncode == 0:
    markdown = result.stdout
```
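When stdout encoding is a concern (the Windows caveat above), it is safer to write to a file via `--output` and read it back as UTF-8. A sketch — the temp-directory handling here is illustrative, not part of the skill:

```python
import subprocess
import tempfile
from pathlib import Path

def crawl_to_markdown(url: str) -> str:
    """Run crawl.py with --output and read the result back as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "page.md"
        result = subprocess.run(
            ["uv", "run", "scripts/crawl.py", "--url", url, "--output", str(out)],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr.strip())   # e.g. "Error: Access Denied"
        return out.read_text(encoding="utf-8")
```

Reading the file directly sidesteps the console codepage entirely, so the markdown arrives byte-for-byte as the scraper wrote it.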