Katana web crawling

Katana is a fast crawler/spider from ProjectDiscovery, aimed at automation pipelines (URLs in → discovered endpoints out). Official docs and flags: the repository README and `katana -h`.

Scope and ethics

Use only on systems you own or are explicitly authorized to test (contract, bug bounty program rules, internal env). Crawl gently: set concurrency, rate limits, and depth to reduce load. Misuse can violate law and terms of service—you are responsible for your actions (tool ships with that warning).
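As a concrete starting point, the load-related flags above can be combined into a deliberately gentle run; the target URL and all values below are illustrative, not recommendations:

```shell
# Gentle crawl: shallow depth (-d), few parallel fetchers (-c),
# capped requests per second (-rl), and a hard time limit (-ct).
# Target URL and values are placeholders; tune to your engagement rules.
katana -u https://target.example -d 2 -c 5 -rl 10 -ct 10m -silent
```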

Installation

Go (requires Go 1.25+ per upstream; verify the current README if install fails):

```bash
CGO_ENABLED=1 go install github.com/projectdiscovery/katana/cmd/katana@latest
```

Docker:

```bash
docker pull projectdiscovery/katana:latest
docker run projectdiscovery/katana:latest -u https://example.com
```

Headless in Docker often needs `-system-chrome` and Chrome/Chromium available; see the upstream Docker section.

Input

  • Single/multiple URLs: `-u https://a.com` (comma-separate for multiple, e.g. `-u https://a.com,https://b.com`)
  • File: `-list urls.txt`
  • STDIN: `echo https://example.com | katana` or `cat domains | httpx | katana`

Modes

| Mode | When |
| --- | --- |
| Standard (default) | Fast; uses Go HTTP client; no full JS/DOM render, so it may miss post-render routes |
| Headless (`-headless`) | Browser context; better for JS-heavy apps; optional `-system-chrome` |

Enable JS file parsing for more endpoints: `-js-crawl` (`-jc`). `-jsluice` is heavier.
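Putting the modes together, a JS-heavy target might be crawled with the headless browser plus JS parsing; the target URL is a placeholder, and flag availability should be checked against your build's `katana -h`:

```shell
# Headless render plus JS file parsing; -system-chrome reuses a locally
# installed Chrome/Chromium instead of the bundled one.
# Target URL is illustrative only.
katana -u https://app.example -headless -system-chrome -jc -d 2
```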

Flags to know first

| Flag | Purpose |
| --- | --- |
| `-d`, `-depth` | Max crawl depth (default 3) |
| `-c`, `-concurrency` | Parallel fetchers |
| `-rl`, `-rate-limit` | Max requests per second |
| `-ct`, `-crawl-duration` | Cap total crawl time (e.g. `5m`) |
| `-cs` / `-cos` | In-scope / out-of-scope URL regex |
| `-ns` | Disable default host scope if you need cross-host (use carefully) |
| `-iqp` | Ignore same path with different query strings |
| `-fs`, `-filter-similar` | Reduce near-duplicate paths |
| `-kf`, `-known-files` | `robots.txt` / `sitemap.xml` etc. (min depth 3 for full coverage per docs) |
| `-j`, `-jsonl` | JSONL output for scripting |
| `-o`, `-output` | Write to file |
| `-sr`, `-store-response` | Store HTTP responses for review (disk use) |
| `-proxy` | HTTP/SOCKS5 proxy |
| `-H` | Extra headers (auth, cookies) via `header:value` |

Run `katana -h` for the full list (filters, form fill, tech detect, TLS options, etc.).
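Since `-jsonl` is the flag most often scripted against, here is a minimal sketch of pulling unique endpoints out of the output with standard Unix tools; the `request.endpoint` field layout is an assumption and should be checked against the actual JSONL your katana version emits:

```shell
# Fake a tiny sample of katana -jsonl output (field layout assumed; verify
# against your version), then extract unique endpoint URLs from it.
printf '%s\n' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/api/v1"}}' \
  '{"request": {"method": "GET", "endpoint": "https://example.com/login"}}' \
  > crawl.jsonl
# Match the endpoint field, take the quoted URL, de-duplicate.
grep -o '"endpoint": *"[^"]*"' crawl.jsonl | cut -d'"' -f4 | sort -u
```

For anything beyond a quick look, prefer a real JSON parser (e.g. `jq`) over `grep`, since the text match breaks on escaped quotes or reordered fields.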

Minimal examples

```bash
katana -u https://example.com -d 2 -silent
```

```bash
katana -u https://example.com -jsonl -o endpoints.jsonl
```

```bash
katana -list seeds.txt -d 3 -cs '.*\.example\.com.*' -rl 30 -jsonl
```

Headless (JS-heavy target):

```bash
katana -u https://example.com -headless -d 2
```

Pipelines

Common pattern: resolve live HTTP first, then crawl:

```bash
cat domains.txt | httpx -silent | katana -jsonl -o crawl.jsonl
```

Combine with other PD tools (naabu, nuclei, etc.) only in authorized assessments.
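Downstream of such a pipeline, a quick per-host tally of discovered URLs helps decide what to triage first; a small sketch, with `urls.txt` standing in for endpoints already extracted from katana output:

```shell
# Count discovered endpoints per host. urls.txt is a placeholder for a
# plain URL list pulled out of katana's output.
printf '%s\n' \
  'https://a.example.com/login' \
  'https://b.example.com/api' \
  'https://a.example.com/reset' \
  > urls.txt
# Field 3 of scheme://host/path is the host; tally and sort by count.
awk -F/ '{print $3}' urls.txt | sort | uniq -c | sort -rn | awk '{print $2, $1}'
```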

Troubleshooting

  • `CGO_ENABLED=1` is required for `go install` per the README.
  • Headless failures: try `-system-chrome`, make sure Chrome/Chromium is installed, or use the Docker image with the documented Chrome setup.
  • Health check: `-health-check` / `-hc`.

References
