paper-fetch


Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

Resolution order

1. Unpaywall: `https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL`, read `best_oa_location.url_for_pdf` (skipped if `UNPAYWALL_EMAIL` is not set)
2. Semantic Scholar: `https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds`
3. arXiv: if `externalIds.ArXiv` is present, `https://arxiv.org/pdf/{arxiv_id}.pdf`
4. PubMed Central OA: if a PMCID is present, `https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/`
5. bioRxiv / medRxiv: if the DOI prefix is `10.1101`, query `https://api.biorxiv.org/details/{server}/{doi}` for the latest version PDF URL
6. Publisher direct (institutional mode only, `PAPER_FETCH_INSTITUTIONAL=1`): DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the `%PDF` check and fall through to step 7.
7. Sci-Hub mirrors (on by default; disable with `PAPER_FETCH_NO_SCIHUB=1`): last-resort fallback. Tries the mirror list in `PAPER_FETCH_SCIHUB_MIRRORS` (or built-in defaults `sci-hub.ru`, `sci-hub.st`, `sci-hub.su`, `sci-hub.box`, `sci-hub.red`, `sci-hub.al`, `sci-hub.mk`, `sci-hub.ee`) in order; on full miss, scrapes `https://www.sci-hub.pub/` once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
8. Otherwise → report failure with title/authors so the user can request via ILL
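
The "first hit wins" behaviour of the chain above can be sketched as follows. This is a minimal illustration, not the code in `scripts/fetch.py`: the `first_hit` helper and the two stand-in resolvers are hypothetical names, and the stand-ins avoid any network I/O.

```python
from typing import Callable, Optional

# Each resolver takes a DOI and returns a PDF URL, or None on a miss.
Resolver = Callable[[str], Optional[str]]


def first_hit(doi: str, resolvers: list[tuple[str, Resolver]]) -> dict:
    """Try resolvers in priority order and stop at the first one that returns a URL."""
    tried = []
    for name, resolve in resolvers:
        tried.append(name)
        url = resolve(doi)
        if url:
            return {"success": True, "source": name, "pdf_url": url, "sources_tried": tried}
    return {"success": False, "source": None, "pdf_url": None, "sources_tried": tried}


# Stand-in resolvers (hypothetical): real ones would call the APIs listed above.
def unpaywall_miss(doi: str) -> Optional[str]:
    return None


def arxiv_stub(doi: str) -> Optional[str]:
    prefix = "10.48550/arXiv."
    if doi.startswith(prefix):
        return f"https://arxiv.org/pdf/{doi[len(prefix):]}.pdf"
    return None


CHAIN = [("unpaywall", unpaywall_miss), ("arxiv", arxiv_stub)]
result = first_hit("10.48550/arXiv.1706.03762", CHAIN)
print(result["source"], result["sources_tried"])
```

Keeping the resolvers as an ordered list of `(name, function)` pairs makes `sources_tried` fall out of the loop for free, which mirrors the field of the same name in the result envelope.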

If only a title is given, pass it directly via `--title "<title>"`. Resolution chain:

1. Crossref `query.title`: primary; covers all major journal/conference DOIs
2. Semantic Scholar `/paper/search/match`: fallback when Crossref's top match is low-confidence (`match_score < 40`) or the gap to the runner-up is `< 3`. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical `10.48550/arXiv.<id>` is synthesized so the download chain stays uniform.
3. Crossref's best guess (low-confidence): used only when both resolvers struggled. The result envelope sets `meta.title_resolution.low_confidence: true` plus a `low_confidence_reason` (`score_below_threshold` / `ambiguous_runner_up`) so an agent can either bail or confirm via `--dry-run`.

Either way, the resolved DOI, the winning resolver, the full `resolvers_tried` list, and the top candidate matches are all surfaced under `meta.title_resolution`.
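
The confidence test that decides whether Crossref's top match is trusted can be sketched like this. The two thresholds come from the text above; the function name and its `(top, runner_up)` input shape are assumptions made for illustration, and the real scoring may differ.

```python
from typing import Optional

SCORE_THRESHOLD = 40  # from the text: match_score < 40 is low-confidence
RUNNER_UP_GAP = 3     # from the text: a gap to the runner-up < 3 is ambiguous


def low_confidence_reason(top: float, runner_up: Optional[float] = None) -> Optional[str]:
    """Return a low_confidence_reason string, or None when the top match is trusted."""
    if top < SCORE_THRESHOLD:
        return "score_below_threshold"
    if runner_up is not None and top - runner_up < RUNNER_UP_GAP:
        return "ambiguous_runner_up"
    return None


print(low_confidence_reason(55.0, 30.0))
print(low_confidence_reason(35.0, 10.0))
print(low_confidence_reason(50.0, 48.5))
```

A non-`None` result is the signal to try the Semantic Scholar fallback, or to flag the envelope as low-confidence if that fallback also struggles.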

If `asta-skill` is registered, the agent can alternatively resolve title → DOI through the Asta MCP first, then pass the DOI directly here. This skips paper-fetch's two-stage Crossref/S2 chain in favor of Asta's richer search surface (relevance ranking, snippet search, citation graph). Workflow: call `asta__search_paper_by_title("<title>", fields="title,year,authors,externalIds")`, read `externalIds.DOI` (or `10.48550/arXiv.<ArXiv>` when only `ArXiv` is present), then `paper-fetch <doi>`. Use `--title` when Asta isn't available or when a single command is preferred.
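
Picking the DOI out of an `externalIds` object, including the arXiv-only case, might look like the sketch below. The helper name is hypothetical; the key names `DOI` and `ArXiv` and the `10.48550/arXiv.<id>` pattern are taken from the text above.

```python
from typing import Optional


def doi_from_external_ids(external_ids: dict) -> Optional[str]:
    """Prefer a registered DOI; otherwise synthesize the arXiv DOI; otherwise give up."""
    doi = external_ids.get("DOI")
    if doi:
        return doi
    arxiv_id = external_ids.get("ArXiv")
    if arxiv_id:
        return f"10.48550/arXiv.{arxiv_id}"
    return None


print(doi_from_external_ids({"ArXiv": "1706.03762"}))
```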

Usage

```bash
python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description
```

Flags

The flags below are the ones an agent composes in normal use. For the complete contract, including `--dry-run`, `--pretty`, `--stream`, `--overwrite`, `--timeout`, and `--version`, plus parameter types and exit-code mappings, run `python scripts/fetch.py schema` (machine-readable, drift-checked via `schema_version`).

| Flag | Default | Description |
| --- | --- | --- |
| `doi` | - | DOI to fetch (positional). Use `-` to read a single DOI from stdin |
| `--title TITLE` | - | Paper title; resolved to a DOI via Crossref (with Semantic Scholar as fallback) before download. Mutually exclusive with positional DOI / `--batch` |
| `--batch FILE` | - | File with one DOI per line for bulk download. Use `-` to read from stdin |
| `--out DIR` | `pdfs` | Output directory |
| `--format` | auto | `json` for agents, `text` for humans. Auto-detects: `json` when stdout is not a TTY, `text` when it is |
| `--idempotency-key KEY` | - | Safe-retry key. Re-running with the same key replays the original envelope from `<out>/.paper-fetch-idem/` without network I/O |

Agent discovery: `schema` subcommand

```bash
python scripts/fetch.py schema
```

Emits a complete machine-readable description of the CLI on stdout (no network). Includes `cli_version` / `schema_version`, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against `schema_version`, and re-read when the cached version drifts.
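
The cache-and-recheck pattern can be sketched as follows. It assumes the cached schema is a JSON object with a top-level `schema_version` key, which the text implies but does not spell out; the helper name and the cached value are illustrative.

```python
import json


def schema_is_stale(cached_schema: dict, observed_version: str) -> bool:
    """True when the cached schema no longer matches the version the CLI reports."""
    return cached_schema.get("schema_version") != observed_version


# A session event as emitted on stderr, carrying the current versions.
session_line = (
    '{"event": "session", "request_id": "req_...", "elapsed_ms": 0, '
    '"cli_version": "0.13.1", "schema_version": "1.9.0"}'
)
observed = json.loads(session_line)["schema_version"]

cached = {"schema_version": "1.8.0"}  # hypothetical older cached copy
print(schema_is_stale(cached, observed))
```

When the check reports drift, the agent would re-run `python scripts/fetch.py schema` and replace its cached copy.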

Output contract

stdout emits a single JSON envelope. Every envelope carries a `meta` slot.

Success (all DOIs resolved):

```json
{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://www.nature.com/articles/s41586-021-03819-2.pdf",
        "file": "pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        "meta": {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
        "sources_tried": ["unpaywall"]
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_a908f5156fc1",
    "latency_ms": 2036,
    "schema_version": "1.9.0",
    "cli_version": "0.13.1",
    "sources_tried": ["unpaywall"]
  }
}
```

Partial (batch mode: some DOIs failed, exit code reflects the failure class):

```json
{
  "ok": "partial",
  "data": {
    "results": [
      { "doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall", ... },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "sources_tried": ["unpaywall", "semantic_scholar"],
        "error": {
          "code": "not_found",
          "message": "No open-access PDF found",
          "retryable": true,
          "retry_after_hours": 168,
          "reason": "OA availability changes over time; retry after embargo lifts or preprint appears"
        }
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  },
  "meta": { ... }
}
```

The `next` slot is an array of suggested follow-up commands: re-invoking them retries only the failed subset. Combine with `--idempotency-key` to make the whole batch safely retriable without re-downloading the already-succeeded items.
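
An orchestrator consuming a partial envelope might pull out the failed items like this. The sketch uses a trimmed copy of the partial example above with the elided fields left out; `failed_items` is a hypothetical helper, not part of the CLI.

```python
import json

ENVELOPE_TEXT = """
{
  "ok": "partial",
  "data": {
    "results": [
      {"doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall"},
      {"doi": "10.1234/nonexistent", "success": false, "source": null,
       "error": {"code": "not_found", "retryable": true, "retry_after_hours": 168}}
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  }
}
"""


def failed_items(envelope: dict) -> list[tuple[str, str, object]]:
    """Return (doi, error code, retry_after_hours) for every item that did not succeed."""
    out = []
    for item in envelope.get("data", {}).get("results", []):
        if not item.get("success"):
            err = item.get("error", {})
            out.append((item["doi"], err.get("code"), err.get("retry_after_hours")))
    return out


envelope = json.loads(ENVELOPE_TEXT)
print(failed_items(envelope))
print(envelope["data"]["next"])
```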

Failure (bad arguments, exit code 3):

```json
{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  },
  "meta": { ... }
}
```

Per-item skipped (destination already exists, no `--overwrite`):

```json
{
  "doi": "10.1038/s41586-021-03819-2",
  "success": true,
  "source": "unpaywall",
  "pdf_url": "https://...",
  "file": "pdfs/Jumper_2021_...pdf",
  "skipped": true,
  "skip_reason": "file_exists",
  "sources_tried": ["unpaywall"]
}
```

Idempotency replay (re-run with the same `--idempotency-key`): the cached envelope is returned verbatim, but `meta.request_id` and `meta.latency_ms` are re-stamped for the current call, and `meta.replayed_from_idempotency_key` is set. No network I/O occurs.
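
To check that a replayed envelope carries the same payload as the original, a caller could compare the two while ignoring the re-stamped `meta` fields. This is a sketch under assumptions: `same_payload` is a hypothetical helper, and the value stored in `meta.replayed_from_idempotency_key` is assumed here to be the key string.

```python
# Fields the text says differ between the original call and a replay.
VOLATILE_META = {"request_id", "latency_ms", "replayed_from_idempotency_key"}


def _stable(envelope: dict) -> dict:
    """Copy of the envelope with the per-call meta fields removed."""
    meta = {k: v for k, v in envelope.get("meta", {}).items() if k not in VOLATILE_META}
    return {**envelope, "meta": meta}


def same_payload(original: dict, replay: dict) -> bool:
    return _stable(original) == _stable(replay)


original = {"ok": True, "data": {"results": []},
            "meta": {"request_id": "req_a", "latency_ms": 2036, "schema_version": "1.9.0"}}
replay = {"ok": True, "data": {"results": []},
          "meta": {"request_id": "req_b", "latency_ms": 3, "schema_version": "1.9.0",
                   "replayed_from_idempotency_key": "monday-review-batch"}}
print(same_payload(original, replay))
```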

Stderr progress (NDJSON)

When `--format json`, stderr emits one JSON object per line for liveness:

```
{"event": "session",     "request_id": "req_...", "elapsed_ms": 0,    "cli_version": "0.13.1", "schema_version": "1.9.0"}
{"event": "start",       "request_id": "req_...", "elapsed_ms": 2,    "doi": "10.1038/..."}
{"event": "source_try",  "request_id": "req_...", "elapsed_ms": 2,    "doi": "...", "source": "unpaywall"}
{"event": "source_hit",  "request_id": "req_...", "elapsed_ms": 2036, "doi": "...", "source": "unpaywall", "pdf_url": "..."}
{"event": "download_ok", "request_id": "req_...", "elapsed_ms": 4120, "doi": "...", "file": "..."}
```

Event types: `session`, `start`, `source_try`, `source_hit`, `source_miss`, `source_skip`, `source_enrich`, `source_enrich_failed`, `download_ok`, `download_error`, `download_skip`, `dry_run`, `not_found`. All events share `request_id` and `elapsed_ms`, letting an orchestrator correlate progress across stderr and the final stdout envelope. The `session` event fires once per invocation, before any DOI work or network I/O, and carries `cli_version` / `schema_version` so agents can detect schema drift against a cached copy without waiting for the final envelope. `source_enrich` fires when Semantic Scholar is called purely to backfill missing `author` / `title` after another source already provided the PDF URL; its `fields` array lists exactly which fields were filled in. `source_enrich_failed` fires when that enrichment call fails; the Unpaywall PDF URL is still used and the filename falls back to `unknown_<year>_…`.

When `--format text`, stderr emits human-readable prose.
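
A consumer of the NDJSON stream might group event names by `request_id` as sketched below. `group_events` is a hypothetical helper; skipping lines that fail to parse is a defensive choice of this sketch, not documented CLI behaviour.

```python
import json
from collections import defaultdict


def group_events(stderr_text: str) -> dict:
    """Map request_id -> ordered list of event names found in an NDJSON stream."""
    by_request = defaultdict(list)
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines
        if isinstance(event, dict) and "event" in event:
            by_request[event.get("request_id")].append(event["event"])
    return dict(by_request)


SAMPLE = """
{"event": "session", "request_id": "req_1", "elapsed_ms": 0, "schema_version": "1.9.0"}
{"event": "start", "request_id": "req_1", "elapsed_ms": 2, "doi": "10.1038/..."}
not json at all
{"event": "download_ok", "request_id": "req_1", "elapsed_ms": 4120, "file": "..."}
"""
print(group_events(SAMPLE))
```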

Exit codes


| Code | Meaning | Retryable class |
| --- | --- | --- |
| `0` | All DOIs resolved / previewed | - |
| `1` | Unresolved: one or more DOIs had no OA copy; no transport failure | Not now (retry after `retry_after_hours`) |
| `2` | Reserved for auth errors (currently unused) | - |
| `3` | Validation error (bad arguments, missing input) | No |
| `4` | Transport error (network / download / IO failure) | Yes |

The taxonomy lets an orchestrator route failures deterministically: exit 4 is worth retrying immediately, exit 1 is not, exit 3 is a bug in the caller.
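
The routing described above could be sketched as a simple lookup. The action labels are hypothetical names chosen for this example; exit code 2 is reserved and currently unused, so the sketch treats it like any other unexpected code.

```python
# Action labels are illustrative; only the exit codes come from the table above.
ACTIONS = {
    0: "done",
    1: "retry_later",  # wait for retry_after_hours from the error object
    3: "fix_caller",   # validation error: a bug in the calling code
    4: "retry_now",    # transport error: worth retrying immediately
}


def route(exit_code: int) -> str:
    """Map a paper-fetch exit code to an orchestrator action."""
    return ACTIONS.get(exit_code, "unknown")


for code in (0, 1, 2, 3, 4):
    print(code, route(code))
```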


Error codes in JSON

Every retryable error carries a `retry_after_hours` hint in the error object, so an orchestrator can schedule retries without guessing.

| Code | Meaning | Retryable | `retry_after_hours` |
| --- | --- | --- | --- |
| `validation_error` | Bad arguments or empty input | No | - |
| `title_resolve_failed` | Crossref returned no items for the given `--title` query (try a longer / cleaner title, or pass the DOI directly) | No | - |
| `not_found` | No open-access PDF found | Yes | `168` (one week: OA lands on embargo / preprint timescale) |
| `download_network_error` | Network failure during download | Yes | `1` |
| `download_not_a_pdf` | Response was not a PDF (HTML landing page) | No | - |
| `download_host_not_allowed` | PDF URL failed SSRF safety check (private IP / non-http(s) / non-80,443 / blocked metadata host) | No | - |
| `download_size_exceeded` | Response exceeded 50 MB limit | Yes | `24` |
| `download_io_error` | Local filesystem write failed | Yes | `1` |
| `internal_error` | Unexpected error | No | - |

The canonical mapping lives in `RETRY_AFTER_HOURS` in `scripts/fetch.py` and is surfaced in `schema.error_codes`.
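
Turning the hint into a concrete retry time is a small calculation. The sketch below reads only the fields shown in the error object; `next_attempt` is a hypothetical helper and the timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_attempt(error: dict, now: datetime) -> Optional[datetime]:
    """Earliest time to retry, or None when the error is not retryable."""
    if not error.get("retryable"):
        return None
    hours = error.get("retry_after_hours")
    if hours is None:
        return None
    return now + timedelta(hours=hours)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_found = {"code": "not_found", "retryable": True, "retry_after_hours": 168}
bad_args = {"code": "validation_error", "retryable": False}
print(next_attempt(not_found, now))
print(next_attempt(bad_args, now))
```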

Examples

```bash
# Single DOI (JSON output when piped; text when in a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2

# Single title (resolved to DOI via Crossref, then downloaded)
python scripts/fetch.py --title "Highly accurate protein structure prediction with AlphaFold"

# Dry-run preview (resolve without downloading)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --dry-run

# Title + dry-run: preview the resolved DOI and candidate matches
python scripts/fetch.py --title "Attention Is All You Need" --dry-run

# Force JSON (for agents even inside a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format json

# Human-readable with pretty colors in a pipeline
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format text

# Batch download, safely retriable
python scripts/fetch.py --batch dois.txt --out ./papers \
  --idempotency-key monday-review-batch

# Pipe DOIs from another tool
zot -F ids.json query ... | jq -r '.[].doi' | python scripts/fetch.py --batch -

# Agent discovery
python scripts/fetch.py schema --pretty

# Streaming mode: one result per line as each DOI resolves
python scripts/fetch.py --batch dois.txt --stream

# Works without UNPAYWALL_EMAIL (skips Unpaywall, uses the remaining sources)
python scripts/fetch.py 10.1038/s41586-020-2649-2
```

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| `UNPAYWALL_EMAIL` | unset | Contact email for Unpaywall API. Optional but recommended. Without it, Unpaywall is skipped (remaining sources still work). |
| `PAPER_FETCH_INSTITUTIONAL` | unset | Set to any value (e.g. `1`) to opt into institutional mode, which activates a 1 req/s rate limiter and the publisher-direct fallback. See below. |
| `PAPER_FETCH_NO_SCIHUB` | unset | Set to any value to disable the Sci-Hub fallback (step 7). |
| `PAPER_FETCH_SCIHUB_MIRRORS` | unset | Comma-separated mirror hostnames to try in priority order (e.g. `sci-hub.ru,sci-hub.st,sci-hub.su`). Overrides built-in defaults. |

Institutional access (opt-in)

Many researchers have legitimate subscription access through their institution's IP range (on-campus or VPN). Paper-fetch can use that access by letting the publisher's own auth (your IP, your session cookies) decide whether to serve the PDF.

Host reachability does not differ between modes — public mode already trusts URLs returned by the OA APIs (Unpaywall, Semantic Scholar, bioRxiv, PMC) and fetches any HTTPS host that passes SSRF defense. Institutional mode adds two things: (1) a publisher-direct fallback (step 6 above) that constructs a publisher-side PDF URL by DOI prefix when every OA source missed, so your institutional IP/cookies can authorize the fetch, and (2) a 1 req/s rate limiter to keep batch jobs from getting your IP throttled or banned for "systematic downloading."

Opt in:

```bash
export PAPER_FETCH_INSTITUTIONAL=1
```

What changes in institutional mode:

| Aspect | Public (default) | Institutional |
| --- | --- | --- |
| Host reachability | Any public HTTPS host passing SSRF defense | Same |
| SSRF defense | Enforced (private IP / non-http(s) / non-80,443 / cloud metadata all blocked) | Enforced, same rules |
| Publisher-direct fallback | Off | On: DOI-prefix → publisher PDF URL, last resort after all OA sources miss |
| Rate limit | None | 1 req/s token bucket (all outbound) |
| `meta.auth_mode` | `"public"` | `"institutional"` |
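
For readers unfamiliar with the term, a 1 req/s token bucket can be sketched as below. This is a generic illustration of the technique, not the limiter in `scripts/fetch.py`; the class name, the capacity of one token, and the injectable `clock` (used so the example is deterministic) are all assumptions.

```python
import time


class TokenBucket:
    """Generic token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate: float = 1.0, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def wait_time(self) -> float:
        """Reserve one token and return the seconds the caller should sleep first."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1.0  # may go negative: that debt is the required wait
        return max(0.0, -self.tokens / self.rate)


# Deterministic demo with a fake clock instead of real sleeping.
fake_now = [0.0]
bucket = TokenBucket(clock=lambda: fake_now[0])
first = bucket.wait_time()   # a token is available immediately
second = bucket.wait_time()  # no token left, so the caller must wait
print(first, second)
```

In real use the caller would pass the returned value to `time.sleep` before issuing each outbound request.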

What stays the same:

- `%PDF` magic-byte check and 50 MB size cap (prevents HTML landing pages and oversized responses slipping through)
- No CAPTCHA solving, ever. If a publisher shows a challenge, the response won't start with `%PDF` and paper-fetch falls through to the next source.
- No browser automation, no Playwright, no stealth.
- Agent cannot opt in on its own: `PAPER_FETCH_INSTITUTIONAL` must be set by the human operator in the shell environment. This is the trust boundary.
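
The two download checks named above can be sketched as a single classifier over the response body. The return values reuse the error codes from the error table; the function name, the order of the checks, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are assumptions of this sketch.

```python
from typing import Optional

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap; binary megabytes are an assumption


def classify_body(body: bytes) -> Optional[str]:
    """Return an error code for a rejected body, or None when it looks like a PDF."""
    if len(body) > MAX_BYTES:
        return "download_size_exceeded"
    if not body.startswith(b"%PDF"):
        return "download_not_a_pdf"
    return None


print(classify_body(b"%PDF-1.7\n..."))
print(classify_body(b"<!DOCTYPE html><html>challenge page</html>"))
```

A CAPTCHA or login page is ordinary HTML, so it fails the magic-byte test without any special handling.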

When paper-fetch can't find an OA copy and you're in public mode, the error envelope includes `suggest_institutional: true` and a hint telling the user to set the env var. Agents can surface this verbatim rather than failing silently.

ToS notice: almost every publisher subscription prohibits "systematic downloading." The 1 req/s rate limit plus the existing per-file idempotency are designed to keep individual research use within acceptable bounds. Running many parallel paper-fetch processes, or lifting the rate limit, can trigger a publisher-wide IP ban affecting your entire institution. Don't.

Notes

- Auth is delegated. The agent never runs a login subcommand. The human or the orchestrator sets `UNPAYWALL_EMAIL` in the environment; the agent inherits it. A missing email degrades gracefully to the remaining sources.
- Trust is directional. CLI arguments are validated once at the entry point. SSRF defense, the `%PDF` magic-byte check, and the 50 MB size cap are enforced in the environment layer, not at the agent's request. An agent cannot loosen safety by passing a flag; opting into institutional mode (and its rate-limit risk profile) is an operator action via environment variable.
- Downloads are naturally idempotent. Re-running against the same `--out` skips files that already exist (deterministic filename: `{first_author}_{year}_{journal_abbrev}_{short_title}.pdf`; the journal segment is omitted if metadata lacks a journal/venue). Pair with `--idempotency-key` to also replay the exact envelope without any network I/O.
- Default output directory: `./pdfs/`.
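
The deterministic filename pattern can be sketched as follows. The slug rule and the 40-character title limit are assumptions inferred from the example path `pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf` shown in the output contract; the real implementation may sanitize or truncate differently.

```python
import re
from typing import Optional


def slug(text: str) -> str:
    """Collapse every run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def pdf_filename(first_author: str, year: int, short_title: str,
                 journal_abbrev: Optional[str] = None, max_title: int = 40) -> str:
    """Build {first_author}_{year}_{journal_abbrev}_{short_title}.pdf, journal optional."""
    parts = [slug(first_author), str(year)]
    if journal_abbrev:  # omitted when metadata lacks a journal/venue
        parts.append(slug(journal_abbrev))
    parts.append(slug(short_title)[:max_title])
    return "_".join(parts) + ".pdf"


name = pdf_filename("Jumper", 2021,
                    "Highly accurate protein structure prediction with AlphaFold")
print(name)
```

Because the name depends only on metadata, a second run against the same `--out` directory computes the same path and can skip the download.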
