Katana web crawling

Katana is a fast crawler/spider from ProjectDiscovery, aimed at automation pipelines (URLs in → discovered endpoints out). Official docs and flags: the repository README and `katana -h`.

Scope and ethics

Use only on systems you own or are explicitly authorized to test (contract, bug bounty program rules, internal env). Crawl gently: set concurrency, rate limits, and depth to reduce load. Misuse can violate law and terms of service—you are responsible for your actions (tool ships with that warning).
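As a concrete starting point, the load-related flags above can be combined into a deliberately gentle run; the target URL and all values below are illustrative, not recommendations:

```shell
# Gentle crawl: shallow depth (-d), few parallel fetchers (-c),
# capped requests per second (-rl), and a hard time limit (-ct).
# Target URL and values are placeholders; tune to your engagement rules.
katana -u https://target.example -d 2 -c 5 -rl 10 -ct 10m -silent
```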

Installation

Go (requires Go 1.25+ per upstream; verify the current README if install fails):

```bash
CGO_ENABLED=1 go install github.com/projectdiscovery/katana/cmd/katana@latest
```

Docker:

```bash
docker pull projectdiscovery/katana:latest
docker run projectdiscovery/katana:latest -u https://example.com
```

Headless in Docker often needs `-system-chrome` and Chrome/Chromium available; see the upstream Docker section.

Input

  • Single/multiple URLs: `-u https://a.com` (comma-separate for multiple, e.g. `-u https://a.com,https://b.com`)
  • File: `-list urls.txt`
  • STDIN: `echo https://example.com | katana` or `cat domains | httpx | katana`

Modes

| Mode | When |
| --- | --- |
| Standard (default) | Fast; uses Go HTTP client; no full JS/DOM render, so it may miss post-render routes |
| Headless (`-headless`) | Browser context; better for JS-heavy apps; optional `-system-chrome` |

Enable JS file parsing for more endpoints: `-js-crawl` (`-jc`). `-jsluice` is heavier.
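Putting the modes together, a JS-heavy target might be crawled with the headless browser plus JS parsing; the target URL is a placeholder, and flag availability should be checked against your build's `katana -h`:

```shell
# Headless render plus JS file parsing; -system-chrome reuses a locally
# installed Chrome/Chromium instead of the bundled one.
# Target URL is illustrative only.
katana -u https://app.example -headless -system-chrome -jc -d 2
```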

Flags to know first

| Flag | Purpose |
| --- | --- |
| `-d`, `-depth` | Max crawl depth (default 3) |
| `-c`, `-concurrency` | Parallel fetchers |
| `-rl`, `-rate-limit` | Max requests per second |
| `-ct`, `-crawl-duration` | Cap total crawl time (e.g. `5m`) |
| `-cs` / `-cos` | In-scope / out-of-scope URL regex |
| `-ns` | Disable default host scope if you need cross-host (use carefully) |
| `-iqp` | Ignore same path with different query strings |
| `-fs`, `-filter-similar` | Reduce near-duplicate paths |
| `-kf`, `-known-files` | `robots.txt` / `sitemap.xml` etc. (min depth 3 for full coverage per docs) |
| `-j`, `-jsonl` | JSONL output for scripting |
| `-o`, `-output` | Write to file |
| `-sr`, `-store-response` | Store HTTP responses for review (disk use) |
| `-proxy` | HTTP/SOCKS5 proxy |
| `-H` | Extra headers (auth, cookies) via `header:value` |

Run `katana -h` for the full list (filters, form fill, tech detect, TLS options, etc.).
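Since `-jsonl` is the flag most often scripted against, here is a minimal sketch of pulling unique endpoints out of the output with standard Unix tools; the `request.endpoint` field layout is an assumption and should be checked against the actual JSONL your katana version emits:

```shell
# Fake a tiny sample of katana -jsonl output (field layout assumed; verify
# against your version), then extract unique endpoint URLs from it.
printf '%s\n' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/api/v1"}}' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}' \
  > crawl.jsonl
# Match the endpoint field, take the quoted URL, de-duplicate.
grep -o '"endpoint": *"[^"]*"' crawl.jsonl | cut -d'"' -f4 | sort -u
```

For anything beyond a quick look, prefer a real JSON parser (e.g. `jq`) over `grep`, since the text match breaks on escaped quotes or reordered fields.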

Minimal examples

```bash
katana -u https://example.com -d 2 -silent
```

```bash
katana -u https://example.com -jsonl -o endpoints.jsonl
```

```bash
katana -list seeds.txt -d 3 -cs '.*\.example\.com.*' -rl 30 -jsonl
```

Headless (JS-heavy target):

```bash
katana -u https://example.com -headless -d 2
```

Pipelines

Common pattern: resolve live HTTP first, then crawl:

```bash
cat domains.txt | httpx -silent | katana -jsonl -o crawl.jsonl
```

Combine with other PD tools (naabu, nuclei, etc.) only in authorized assessments.
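Downstream of such a pipeline, a quick per-host tally of discovered URLs helps decide what to triage first; a small sketch, with `urls.txt` standing in for endpoints already extracted from katana output:

```shell
# Count discovered endpoints per host. urls.txt is a placeholder for a
# plain URL list pulled out of katana's output.
printf '%s\n' \
  'https://a.example.com/login' \
  'https://b.example.com/api' \
  'https://a.example.com/reset' \
  > urls.txt
# Field 3 of scheme://host/path is the host; tally and sort by count.
awk -F/ '{print $3}' urls.txt | sort | uniq -c | sort -rn | awk '{print $2, $1}'
```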

Troubleshooting

  • `CGO_ENABLED=1` is required for `go install` per the README.
  • Headless failures: try `-system-chrome`, make sure Chrome/Chromium is installed, or use the Docker image with the documented Chrome setup.
  • Health check: `-health-check` / `-hc`.

References
