paper-fetch

Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.

Resolution order

  1. Unpaywall: query https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL and read best_oa_location.url_for_pdf (skipped if UNPAYWALL_EMAIL is not set).
  2. Semantic Scholar: https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds
  3. arXiv: if externalIds.ArXiv is present, https://arxiv.org/pdf/{arxiv_id}.pdf
  4. PubMed Central OA: if a PMCID is present, https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/
  5. bioRxiv / medRxiv: if the DOI prefix is 10.1101, query https://api.biorxiv.org/details/{server}/{doi} for the latest version's PDF URL.
  6. Publisher direct (institutional mode only, PAPER_FETCH_INSTITUTIONAL=1): DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the %PDF check and fall through to step 7.
  7. Sci-Hub mirrors (on by default; disable with PAPER_FETCH_NO_SCIHUB=1): last-resort fallback. Tries the mirror list in PAPER_FETCH_SCIHUB_MIRRORS (or the built-in defaults sci-hub.ru, sci-hub.st, sci-hub.su, sci-hub.box, sci-hub.red, sci-hub.al, sci-hub.mk, sci-hub.ee) in order; on a full miss, scrapes https://www.sci-hub.pub/ once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
  8. Otherwise: report failure with title/authors so the user can request the paper via ILL.
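The "first hit wins" behavior of the list above can be sketched in a few lines. This is a minimal illustration, not the code in scripts/fetch.py: the names first_hit, looks_like_pdf, and the stub resolvers are hypothetical, and real resolvers would make HTTP calls where the stubs return fixed values.

```python
from typing import Callable, Iterable, Optional, Tuple

# A resolver takes a DOI and returns a candidate PDF URL, or None on a miss.
Resolver = Callable[[str], Optional[str]]


def looks_like_pdf(head: bytes) -> bool:
    """Magic-byte check: a real PDF body starts with %PDF."""
    return head.startswith(b"%PDF")


def first_hit(
    doi: str,
    resolvers: Iterable[Tuple[str, Resolver]],
    fetch_head: Callable[[str], bytes],
) -> dict:
    """Try each (name, resolver) in priority order; stop at the first real PDF."""
    tried = []
    for name, resolve in resolvers:
        tried.append(name)
        url = resolve(doi)
        if url is None:
            continue  # this source has nothing for the DOI
        if looks_like_pdf(fetch_head(url)):
            return {"success": True, "source": name, "pdf_url": url, "sources_tried": tried}
        # HTML landing page, CAPTCHA, or paywall: fall through to the next source
    return {"success": False, "source": None, "pdf_url": None, "sources_tried": tried}


# Stub resolvers and a stub fetcher standing in for real HTTP calls.
stub_chain = [
    ("unpaywall", lambda doi: None),
    ("semantic_scholar", lambda doi: "https://example.org/landing.html"),
    ("arxiv", lambda doi: "https://example.org/paper.pdf"),
]
stub_bodies = {
    "https://example.org/landing.html": b"<html>",
    "https://example.org/paper.pdf": b"%PDF-1.7",
}
result = first_hit("10.1234/example", stub_chain, stub_bodies.__getitem__)
print(result["source"], result["sources_tried"])
```

Passing the fetcher in as a parameter keeps the ordering logic testable without network access.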
If only a title is given, pass it directly via --title "<title>". Resolution chain:
  1. Crossref query.title: primary; covers all major journal/conference DOIs.
  2. Semantic Scholar /paper/search/match: fallback when Crossref's top match is low-confidence (match_score < 40) or the gap to the runner-up is < 3. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical 10.48550/arXiv.<id> is synthesized so the download chain stays uniform.
  3. Crossref's best guess (low-confidence): used only when both resolvers struggled. The result envelope sets meta.title_resolution.low_confidence: true plus a low_confidence_reason (score_below_threshold / ambiguous_runner_up) so an agent can either bail or confirm via --dry-run.
Either way the resolved DOI, the winning resolver, the full resolvers_tried list, and the top candidate matches are all surfaced under meta.title_resolution.
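The confidence test that triggers the Semantic Scholar fallback can be sketched as below. The thresholds (40 and 3) and the two reason strings come from the text above; the function names are hypothetical, and checking the score before the runner-up gap is an assumption about ordering.

```python
from typing import Optional

SCORE_THRESHOLD = 40  # match_score below this counts as low-confidence
RUNNER_UP_GAP = 3     # a smaller gap to the runner-up counts as ambiguous


def crossref_confidence(top_score: float, runner_up_score: Optional[float]) -> Optional[str]:
    """Return None when the top Crossref match is confident, else a reason string."""
    if top_score < SCORE_THRESHOLD:
        return "score_below_threshold"
    if runner_up_score is not None and top_score - runner_up_score < RUNNER_UP_GAP:
        return "ambiguous_runner_up"
    return None


def synthesize_arxiv_doi(arxiv_id: str) -> str:
    """Build the canonical arXiv DOI for a paper that has only an arXiv id."""
    return f"10.48550/arXiv.{arxiv_id}"


print(crossref_confidence(72.0, 41.5))
print(crossref_confidence(35.0, 10.0))
print(crossref_confidence(55.0, 53.5))
print(synthesize_arxiv_doi("1706.03762"))
```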
If asta-skill is registered, the agent can alternatively resolve title → DOI through the Asta MCP first, then pass the DOI directly here. This skips paper-fetch's two-stage Crossref/S2 chain in favor of Asta's richer search surface (relevance ranking, snippet search, citation graph). Workflow: call asta__search_paper_by_title("<title>", fields="title,year,authors,externalIds"), read externalIds.DOI (or 10.48550/arXiv.<ArXiv> when only ArXiv is present), then paper-fetch <doi>. Use --title when Asta isn't available or when a single command is preferred.

Usage

bash
python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema           # machine-readable self-description

Flags

The flags below are the ones an agent composes in normal use. For the complete contract (including --dry-run, --pretty, --stream, --overwrite, --timeout, --version, plus parameter types and exit-code mappings), run python scripts/fetch.py schema (machine-readable, drift-checked via schema_version).
| Flag | Default | Description |
| --- | --- | --- |
| doi | | DOI to fetch (positional). Use - to read a single DOI from stdin |
| --title TITLE | | Paper title; resolved to a DOI via Crossref before download. Mutually exclusive with positional DOI / --batch |
| --batch FILE | | File with one DOI per line for bulk download. Use - to read from stdin |
| --out DIR | pdfs | Output directory |
| --format | auto | json for agents, text for humans. Auto-detects: json when stdout is not a TTY, text when it is |
| --idempotency-key KEY | | Safe-retry key. Re-running with the same key replays the original envelope from <out>/.paper-fetch-idem/ without network I/O |

Agent discovery: schema subcommand

bash
python scripts/fetch.py schema
Emits a complete machine-readable description of the CLI on stdout (no network). Includes cli_version, schema_version, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against schema_version, and re-read when the cached version drifts.
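One way an agent might implement the "read once, cache, re-read on drift" advice is sketched here. The helper load_schema and the cache file name are hypothetical, and the sketch assumes schema_version is a top-level key in the schema output; the live version would typically come from the envelope's meta.schema_version.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_schema(
    cache_path: Path,
    live_schema_version: str,
    fetch_schema: Callable[[], dict],
) -> dict:
    """Return the cached schema unless its schema_version differs from the live one."""
    if cache_path.exists():
        cached = json.loads(cache_path.read_text(encoding="utf-8"))
        if cached.get("schema_version") == live_schema_version:
            return cached  # cache is still current
    # In practice: run `python scripts/fetch.py schema` and parse its stdout.
    fresh = fetch_schema()
    cache_path.write_text(json.dumps(fresh), encoding="utf-8")
    return fresh


# Stub fetcher that counts how often the schema is actually re-read.
calls = []


def stub_fetch() -> dict:
    calls.append(1)
    return {"schema_version": "1.9.0", "exit_codes": {}}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "paper-fetch-schema.json"
    load_schema(cache, "1.9.0", stub_fetch)  # cold cache: fetches
    load_schema(cache, "1.9.0", stub_fetch)  # warm cache: no fetch
    load_schema(cache, "2.0.0", stub_fetch)  # version drift: fetches again
print(len(calls))
```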

Output contract

stdout emits a single JSON envelope. Every envelope carries a meta slot.
Success (all DOIs resolved):
json
{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://www.nature.com/articles/s41586-021-03819-2.pdf",
        "file": "pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        "meta": {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
        "sources_tried": ["unpaywall"]
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_a908f5156fc1",
    "latency_ms": 2036,
    "schema_version": "1.9.0",
    "cli_version": "0.13.1",
    "sources_tried": ["unpaywall"]
  }
}
Partial (batch mode — some DOIs failed, exit code reflects the failure class):
json
{
  "ok": "partial",
  "data": {
    "results": [
      { "doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall", ... },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "sources_tried": ["unpaywall", "semantic_scholar"],
        "error": {
          "code": "not_found",
          "message": "No open-access PDF found",
          "retryable": true,
          "retry_after_hours": 168,
          "reason": "OA availability changes over time; retry after embargo lifts or preprint appears"
        }
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  },
  "meta": { ... }
}
The next slot is an array of suggested follow-up commands: re-invoking them retries only the failed subset. Combine with --idempotency-key to make the whole batch safely retriable without re-downloading the already-succeeded items.
Failure (bad arguments, exit code 3):
json
{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  },
  "meta": { ... }
}
Per-item skipped (destination already exists, no --overwrite):
json
{
  "doi": "10.1038/s41586-021-03819-2",
  "success": true,
  "source": "unpaywall",
  "pdf_url": "https://...",
  "file": "pdfs/Jumper_2021_...pdf",
  "skipped": true,
  "skip_reason": "file_exists",
  "sources_tried": ["unpaywall"]
}
Idempotency replay (re-run with the same --idempotency-key):
The cached envelope is returned verbatim, but meta.request_id and meta.latency_ms are re-stamped for the current call, and meta.replayed_from_idempotency_key is set. No network I/O occurs.
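A consumer of these envelopes has to handle three shapes of ok (true, "partial", false). The sketch below shows one way to pull out the failed items; failed_items is a hypothetical helper, and the sample envelope is a trimmed copy of the partial example above.

```python
import json


def failed_items(envelope: dict) -> list:
    """Return (doi, error_code) pairs for every failure in an envelope."""
    if envelope.get("ok") is False:
        # Whole-call failure: no data.results, only a top-level error.
        return [(None, envelope["error"]["code"])]
    results = envelope.get("data", {}).get("results", [])
    return [(r["doi"], r["error"]["code"]) for r in results if not r["success"]]


raw = json.dumps({
    "ok": "partial",
    "data": {
        "results": [
            {"doi": "10.1038/s41586-021-03819-2", "success": True},
            {"doi": "10.1234/nonexistent", "success": False,
             "error": {"code": "not_found", "retryable": True, "retry_after_hours": 168}},
        ],
        "summary": {"total": 2, "succeeded": 1, "failed": 1},
        "next": ["paper-fetch 10.1234/nonexistent --out pdfs"],
    },
    "meta": {},
})
envelope = json.loads(raw)
print(failed_items(envelope))
print(envelope["data"]["next"])
```

Comparing with `is False` matters here: "partial" is a truthy string, so a plain truthiness test would treat a partial batch as a full success.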

Stderr progress (NDJSON)

With --format json, stderr emits one JSON object per line for liveness:
{"event": "session",     "request_id": "req_...", "elapsed_ms": 0,    "cli_version": "0.13.1", "schema_version": "1.9.0"}
{"event": "start",       "request_id": "req_...", "elapsed_ms": 2,    "doi": "10.1038/..."}
{"event": "source_try",  "request_id": "req_...", "elapsed_ms": 2,    "doi": "...", "source": "unpaywall"}
{"event": "source_hit",  "request_id": "req_...", "elapsed_ms": 2036, "doi": "...", "source": "unpaywall", "pdf_url": "..."}
{"event": "download_ok", "request_id": "req_...", "elapsed_ms": 4120, "doi": "...", "file": "..."}
Event types: session, start, source_try, source_hit, source_miss, source_skip, source_enrich, source_enrich_failed, download_ok, download_error, download_skip, dry_run, not_found. All events share request_id and elapsed_ms, letting an orchestrator correlate progress across stderr and the final stdout envelope. The session event fires once per invocation, before any DOI work or network I/O, and carries cli_version / schema_version so agents can detect schema drift against a cached copy without waiting for the final envelope.
source_enrich fires when Semantic Scholar is called purely to backfill missing author / title after another source already provided the PDF URL; its fields array lists exactly which fields were filled in. source_enrich_failed fires when that enrichment call fails; the Unpaywall PDF URL is still used and the filename falls back to unknown_<year>_….
With --format text, stderr emits human-readable prose.
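An orchestrator reading the NDJSON stream might group events by request_id as sketched below. The helper group_events is hypothetical; skipping lines that are not JSON objects is a defensive assumption, not documented behavior.

```python
import json
from collections import defaultdict


def group_events(stderr_text: str) -> dict:
    """Group NDJSON progress events by request_id, skipping non-JSON lines."""
    by_request = defaultdict(list)
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stderr
        if not isinstance(event, dict):
            continue
        by_request[event.get("request_id")].append(event)
    return dict(by_request)


sample = "\n".join([
    '{"event": "session", "request_id": "req_1", "elapsed_ms": 0, "schema_version": "1.9.0"}',
    '{"event": "source_try", "request_id": "req_1", "elapsed_ms": 2, "source": "unpaywall"}',
    '{"event": "source_hit", "request_id": "req_1", "elapsed_ms": 2036, "source": "unpaywall"}',
    "not json",
])
events = group_events(sample)
print([e["event"] for e in events["req_1"]])
```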

Exit codes

| Code | Meaning | Retryable class |
| --- | --- | --- |
| 0 | All DOIs resolved / previewed | |
| 1 | Unresolved: one or more DOIs had no OA copy; no transport failure | Not now (retry after retry_after_hours) |
| 2 | Reserved for auth errors (currently unused) | |
| 3 | Validation error (bad arguments, missing input) | No |
| 4 | Transport error (network / download / IO failure) | Yes |
The taxonomy lets an orchestrator route failures deterministically: exit 4 is worth retrying immediately, exit 1 is not, exit 3 is a bug in the caller.
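The routing described above amounts to a small lookup. In this sketch the action names are hypothetical labels for an orchestrator's own use; only the exit codes and their meanings come from the table.

```python
def route_exit_code(code: int) -> str:
    """Map a paper-fetch exit code to an orchestrator action."""
    actions = {
        0: "done",
        1: "retry_later",  # no OA copy yet; honor retry_after_hours
        3: "fix_caller",   # bad arguments: retrying unchanged will not help
        4: "retry_now",    # transport failure
    }
    # Exit 2 is reserved and unused; treat it and anything unknown as needing a human.
    return actions.get(code, "escalate")


for code in (0, 1, 2, 3, 4):
    print(code, route_exit_code(code))
```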

Error codes in JSON

Every retryable error carries a retry_after_hours hint in the error object, so an orchestrator can schedule retries without guessing.
| Code | Meaning | Retryable | retry_after_hours |
| --- | --- | --- | --- |
| validation_error | Bad arguments or empty input | No | |
| title_resolve_failed | Crossref returned no items for the given --title query (try a longer / cleaner title, or pass the DOI directly) | No | |
| not_found | No open-access PDF found | Yes | 168 (one week; OA lands on embargo / preprint timescale) |
| download_network_error | Network failure during download | Yes | 1 |
| download_not_a_pdf | Response was not a PDF (HTML landing page) | No | |
| download_host_not_allowed | PDF URL failed SSRF safety check (private IP / non-http(s) / non-80,443 / blocked metadata host) | No | |
| download_size_exceeded | Response exceeded 50 MB limit | Yes | 24 |
| download_io_error | Local filesystem write failed | Yes | 1 |
| internal_error | Unexpected error | No | |
The canonical mapping lives in RETRY_AFTER_HOURS in scripts/fetch.py and is surfaced in schema.error_codes.
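Turning the retry_after_hours hint into a concrete schedule could look like the sketch below. The helper next_attempt is hypothetical, and falling back to one hour when a retryable error lacks the hint is an assumption rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_attempt(error: dict, now: datetime) -> Optional[datetime]:
    """Return when to retry a failed item, or None if it should not be retried."""
    if not error.get("retryable"):
        return None
    hours = error.get("retry_after_hours", 1)  # assumed fallback if the hint is missing
    return now + timedelta(hours=hours)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_found = {"code": "not_found", "retryable": True, "retry_after_hours": 168}
bad_args = {"code": "validation_error", "retryable": False}
print(next_attempt(not_found, now))
print(next_attempt(bad_args, now))
```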

Examples

bash
# Single DOI (JSON output when piped; text when in a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2

# Single title (resolved to DOI via Crossref, then downloaded)
python scripts/fetch.py --title "Highly accurate protein structure prediction with AlphaFold"

# Dry-run preview (resolve without downloading)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --dry-run

# Title + dry-run: preview the resolved DOI and candidate matches
python scripts/fetch.py --title "Attention Is All You Need" --dry-run

# Force JSON (for agents even inside a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format json

# Human-readable with pretty colors in a pipeline
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format text

# Batch download, safely retriable
python scripts/fetch.py --batch dois.txt --out ./papers \
    --idempotency-key monday-review-batch

# Pipe DOIs from another tool
zot -F ids.json query ... | jq -r '.[].doi' | python scripts/fetch.py --batch -

# Agent discovery
python scripts/fetch.py schema --pretty

# Streaming mode: one result per line as each DOI resolves
python scripts/fetch.py --batch dois.txt --stream

# Works without UNPAYWALL_EMAIL (skips Unpaywall, uses remaining 4 sources)
python scripts/fetch.py 10.1038/s41586-020-2649-2

Environment

| Variable | Default | Purpose |
| --- | --- | --- |
| UNPAYWALL_EMAIL | unset | Contact email for Unpaywall API. Optional but recommended. Without it, Unpaywall is skipped (remaining sources still work). |
| PAPER_FETCH_INSTITUTIONAL | unset | Set to any value (e.g. 1) to opt into institutional mode: activates a 1 req/s rate limiter and the publisher-direct fallback. See below. |
| PAPER_FETCH_NO_SCIHUB | unset | Set to any value to disable the Sci-Hub fallback (step 7). |
| PAPER_FETCH_SCIHUB_MIRRORS | unset | Comma-separated mirror hostnames to try in priority order (e.g. sci-hub.ru,sci-hub.st,sci-hub.su). Overrides built-in defaults. |

Institutional access (opt-in)

Many researchers have legitimate subscription access through their institution's IP range (on-campus or VPN). Paper-fetch can use that access by letting the publisher's own auth (your IP, your session cookies) decide whether to serve the PDF.
Host reachability does not differ between modes — public mode already trusts URLs returned by the OA APIs (Unpaywall, Semantic Scholar, bioRxiv, PMC) and fetches any HTTPS host that passes SSRF defense. Institutional mode adds two things: (1) a publisher-direct fallback (step 6 above) that constructs a publisher-side PDF URL by DOI prefix when every OA source missed, so your institutional IP/cookies can authorize the fetch, and (2) a 1 req/s rate limiter to keep batch jobs from getting your IP throttled or banned for "systematic downloading."
Opt in:
export PAPER_FETCH_INSTITUTIONAL=1
What changes in institutional mode:
| Aspect | Public (default) | Institutional |
| --- | --- | --- |
| Host reachability | Any public HTTPS host passing SSRF defense | Same |
| SSRF defense | Enforced (private IP / non-http(s) / non-80,443 / cloud metadata all blocked) | Enforced, same rules |
| Publisher-direct fallback | Off | On: DOI-prefix → publisher PDF URL, last resort after all OA sources miss |
| Rate limit | None | 1 req/s token bucket (all outbound) |
| meta.auth_mode | "public" | "institutional" |
What stays the same:
  • %PDF magic-byte check and 50 MB size cap (prevents HTML landing pages and oversized responses slipping through)
  • No CAPTCHA solving, ever. If a publisher shows a challenge, the response won't start with %PDF and paper-fetch falls through to the next source.
  • No browser automation, no Playwright, no stealth.
  • Agent cannot opt in on its own: PAPER_FETCH_INSTITUTIONAL must be set by the human operator in the shell environment. This is the trust boundary.
When paper-fetch can't find an OA copy and you're in public mode, the error envelope includes suggest_institutional: true and a hint telling the user to set the env var. Agents can surface this verbatim rather than failing silently.
ToS notice: almost every publisher subscription prohibits "systematic downloading." The 1 req/s rate limit plus the existing per-file idempotency are designed to keep individual research use within acceptable bounds. Running many parallel paper-fetch processes, or lifting the rate limit, can trigger a publisher-wide IP ban affecting your entire institution. Don't.
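For readers unfamiliar with the term, a 1 req/s token bucket can be sketched as follows. This is a generic illustration of the technique, not the limiter in scripts/fetch.py; the class and method names are hypothetical, and the clock is injectable so the behavior can be shown without sleeping.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate: float = 1.0, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def wait_time(self) -> float:
        """Reserve one token and return how long the caller should sleep first."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate  # time until the reserved token exists


# Three back-to-back requests at t=0 under a fake clock.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=1.0, clock=lambda: fake_now[0])
delays = [bucket.wait_time() for _ in range(3)]
print(delays)
```

A caller would do `time.sleep(bucket.wait_time())` before each outbound request. With a capacity of 1 there is no burst allowance, so requests are spaced at least one second apart.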

Notes

  • Auth is delegated. The agent never runs a login subcommand. The human or the orchestrator sets UNPAYWALL_EMAIL in the environment; the agent inherits it. Missing email degrades gracefully to the remaining 4 sources.
  • Trust is directional. CLI arguments are validated once at the entry point. SSRF defense, the %PDF magic-byte check, and the 50 MB size cap are enforced in the environment layer, not at the agent's request. An agent cannot loosen safety by passing a flag; opting into institutional mode (and its rate-limit risk profile) is an operator action via environment variable.
  • Downloads are naturally idempotent. Re-running against the same --out skips files that already exist (deterministic filename: {first_author}_{year}_{journal_abbrev}_{short_title}.pdf; the journal segment is omitted if metadata lacks a journal/venue). Pair with --idempotency-key to also replay the exact envelope without any network I/O.
  • Default output directory: ./pdfs/.
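The safety checks named in these notes (SSRF defense, the %PDF magic-byte check, the 50 MB cap) can be illustrated with a simplified sketch. Everything here is an assumption about shape rather than the tool's actual rules: the function names, the one-entry blocklist, and the MiB interpretation of "50 MB" are illustrative, and a real SSRF defense would also resolve hostnames and re-check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap; MB vs MiB is an assumption
ALLOWED_PORTS = {80, 443}
BLOCKED_HOSTS = {"metadata.google.internal"}  # illustrative, not the tool's list


def url_is_allowed(url: str) -> bool:
    """Reject non-http(s) schemes, unusual ports, metadata hosts, and non-public IP literals."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port
    except ValueError:
        return False  # malformed port
    if port is not None and port not in ALLOWED_PORTS:
        return False
    host = parts.hostname.lower()
    if host in BLOCKED_HOSTS:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname; DNS-level checks are out of scope for this sketch
    return ip.is_global


def body_is_acceptable(body: bytes) -> bool:
    """Accept only bodies that start with %PDF and fit under the size cap."""
    return body.startswith(b"%PDF") and len(body) <= MAX_BYTES


print(url_is_allowed("https://www.nature.com/articles/s41586-021-03819-2.pdf"))
print(url_is_allowed("http://169.254.169.254/latest/meta-data/"))
print(body_is_acceptable(b"<html>login required</html>"))
```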