paper-fetch
Fetch the PDF for a paper given a DOI (or title). Tries multiple sources in priority order and stops at the first hit.
Resolution order
1. Unpaywall: `https://api.unpaywall.org/v2/{doi}?email=$UNPAYWALL_EMAIL`, read `best_oa_location.url_for_pdf` (skipped if `UNPAYWALL_EMAIL` not set)
2. Semantic Scholar: `https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=openAccessPdf,externalIds`
3. arXiv: if `externalIds.ArXiv` present, `https://arxiv.org/pdf/{arxiv_id}.pdf`
4. PubMed Central OA: if PMCID present, `https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/`
5. bioRxiv / medRxiv: if DOI prefix is `10.1101`, query `https://api.biorxiv.org/details/{server}/{doi}` for the latest version PDF URL
6. Publisher direct (institutional mode only, `PAPER_FETCH_INSTITUTIONAL=1`): DOI-prefix → publisher PDF template (Nature / Science / Wiley / Springer / ACS / PNAS / NEJM / Sage / T&F / Elsevier). The caller's own subscription IP / cookies / EZproxy are what authorize the fetch; unauthorized responses fail the `%PDF` check and fall through to step 7.
7. Sci-Hub mirrors (on by default; disable with `PAPER_FETCH_NO_SCIHUB=1`): last-resort fallback. Tries the mirror list in `PAPER_FETCH_SCIHUB_MIRRORS` (or built-in defaults `sci-hub.ru`, `sci-hub.st`, `sci-hub.su`, `sci-hub.box`, `sci-hub.red`, `sci-hub.al`, `sci-hub.mk`, `sci-hub.ee`) in order; on full miss, scrapes `https://www.sci-hub.pub/` once per process for fresh mirrors. CAPTCHA / missing-paper pages have no PDF iframe and fall through silently.
8. Otherwise → report failure with title/authors so the user can request via ILL
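
The "stop at the first hit" behavior of the chain above can be sketched as follows. This is a minimal illustration rather than paper-fetch's actual code: `Source`, `resolve_pdf_url`, and the two stub lookups are hypothetical names, and the stubs return canned values where the real sources would make HTTP requests.

```python
from typing import Callable, Optional

# A source takes a DOI and returns a candidate PDF URL, or None on a miss.
Source = Callable[[str], Optional[str]]


def resolve_pdf_url(doi: str, sources: list[tuple[str, Source]]) -> dict:
    """Try each source in priority order and stop at the first hit."""
    tried = []
    for name, lookup in sources:
        tried.append(name)
        url = lookup(doi)
        if url:
            return {"doi": doi, "success": True, "source": name,
                    "pdf_url": url, "sources_tried": tried}
    return {"doi": doi, "success": False, "source": None,
            "pdf_url": None, "sources_tried": tried}


# Stub lookups standing in for the real API calls (hypothetical).
def unpaywall_stub(doi: str) -> Optional[str]:
    return None  # simulate a miss


def semantic_scholar_stub(doi: str) -> Optional[str]:
    return f"https://example.org/{doi}.pdf"  # simulate a hit


result = resolve_pdf_url(
    "10.1234/example",
    [("unpaywall", unpaywall_stub), ("semantic_scholar", semantic_scholar_stub)],
)
print(result["source"], result["sources_tried"])
```

Later sources are never called once an earlier one returns a URL, which is why `sources_tried` in the output envelope can be shorter than the full chain.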
If only a title is given, pass it directly via `--title "<title>"`. Resolution chain:

1. Crossref `query.title`: primary; covers all major journal/conference DOIs
2. Semantic Scholar `/paper/search/match`: fallback when Crossref's top match is low-confidence (`match_score < 40`) or the gap to the runner-up is `< 3`. Critically, S2 covers arXiv-only preprints (no Crossref DOI). When S2 surfaces a paper that has only an arXiv id, the canonical `10.48550/arXiv.<id>` is synthesized so the download chain stays uniform.
3. Crossref's best guess (low-confidence): used only when both resolvers struggled. The result envelope sets `meta.title_resolution.low_confidence: true` plus a `low_confidence_reason` (`score_below_threshold` / `ambiguous_runner_up`) so an agent can either bail or confirm via `--dry-run`.

Either way the resolved DOI, the winning resolver, the full `resolvers_tried` list, and the top candidate matches are all surfaced under `meta.title_resolution`.

If `asta-skill` is registered, the agent can alternatively resolve title → DOI through the Asta MCP first, then pass the DOI directly here. This skips paper-fetch's two-stage Crossref/S2 chain in favor of Asta's richer search surface (relevance ranking, snippet search, citation graph). Workflow: call `asta__search_paper_by_title("<title>", fields="title,year,authors,externalIds")`, read `externalIds.DOI` (or `10.48550/arXiv.<ArXiv>` when only `ArXiv` is present), then `paper-fetch <doi>`. Use `--title` when Asta isn't available or when a single command is preferred.
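
The confidence rule and the arXiv DOI synthesis described above can be sketched as below. The function names are hypothetical, the pairing of each condition with a `low_confidence_reason` value is an assumption based on the reason names, and the handling of an empty score list is likewise assumed.

```python
from typing import Optional

SCORE_THRESHOLD = 40   # a top Crossref match below this is low-confidence
RUNNER_UP_GAP = 3      # the top match must lead the runner-up by at least this


def crossref_confidence(scores: list[float]) -> Optional[str]:
    """Return None if the top match is trusted, else the reason it is not."""
    if not scores:
        # Assumption: treat "no candidates" like a below-threshold score.
        return "score_below_threshold"
    ranked = sorted(scores, reverse=True)
    if ranked[0] < SCORE_THRESHOLD:
        return "score_below_threshold"
    if len(ranked) > 1 and ranked[0] - ranked[1] < RUNNER_UP_GAP:
        return "ambiguous_runner_up"
    return None


def doi_from_external_ids(external_ids: dict) -> Optional[str]:
    """Prefer a real DOI; otherwise synthesize the arXiv DOI from the arXiv id."""
    if external_ids.get("DOI"):
        return external_ids["DOI"]
    if external_ids.get("ArXiv"):
        return f"10.48550/arXiv.{external_ids['ArXiv']}"
    return None


print(crossref_confidence([60.0, 58.5]))
print(doi_from_external_ids({"ArXiv": "1706.03762"}))
```

A non-`None` reason is the signal to fall back to Semantic Scholar; the same helper for `externalIds` also covers the Asta workflow, which reads the same field.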
Usage
```bash
python scripts/fetch.py <DOI> [options]
python scripts/fetch.py --title "<paper title>" [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema    # machine-readable self-description
```

Flags
The flags below are the ones an agent composes in normal use. For the complete contract (including `--dry-run`, `--pretty`, `--stream`, `--overwrite`, `--timeout`, `--version`, plus parameter types and exit-code mappings) run `python scripts/fetch.py schema` (machine-readable, drift-checked via `schema_version`).

| Flag | Default | Description |
|---|---|---|
| `<DOI>` | - | DOI to fetch (positional). Use … |
| `--title` | - | Paper title; resolved to a DOI via Crossref (Semantic Scholar as fallback) before download. Mutually exclusive with positional DOI / `--batch` |
| `--batch` | - | File with one DOI per line for bulk download. Use `-` to read from stdin |
| `--out` | `./pdfs/` | Output directory |
| `--format` | auto | `json` or `text`; by default JSON when piped, text in a terminal |
| `--idempotency-key` | - | Safe-retry key. Re-running with the same key replays the original envelope from … |
Agent discovery: `schema` subcommand

```bash
python scripts/fetch.py schema
```

Emits a complete machine-readable description of the CLI on stdout (no network). Includes `cli_version`, `schema_version`, parameter types, exit codes, error codes, envelope shapes, and environment variables. Agents should read this once, cache it against `schema_version`, and re-read when the cached version drifts.
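
One way an agent might implement the "read once, re-read on drift" advice is sketched below. `SchemaCache` and `load_schema` are hypothetical helpers, and the sketch assumes the schema JSON exposes `schema_version` as a top-level key, which the text above does not confirm.

```python
import json
import subprocess
from typing import Callable, Optional


def load_schema() -> dict:
    """Run the schema subcommand and parse its stdout (no network needed)."""
    out = subprocess.run(
        ["python", "scripts/fetch.py", "schema"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


class SchemaCache:
    """Keep one schema copy and refresh it only when schema_version drifts."""

    def __init__(self, loader: Callable[[], dict] = load_schema):
        self._loader = loader
        self._schema: Optional[dict] = None

    def get(self, observed_version: Optional[str] = None) -> dict:
        # Reload on first use, or when a version seen elsewhere (for example
        # in an envelope's meta slot) differs from the cached copy.
        stale = self._schema is None or (
            observed_version is not None
            and self._schema.get("schema_version") != observed_version
        )
        if stale:
            self._schema = self._loader()
        return self._schema
```

Passing the `schema_version` from each envelope's `meta` slot as `observed_version` keeps the cache honest without re-running the subcommand on every call.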
Output contract
stdout emits a single JSON envelope. Every envelope carries a `meta` slot.

Success (all DOIs resolved):

```json
{
  "ok": true,
  "data": {
    "results": [
      {
        "doi": "10.1038/s41586-021-03819-2",
        "success": true,
        "source": "unpaywall",
        "pdf_url": "https://www.nature.com/articles/s41586-021-03819-2.pdf",
        "file": "pdfs/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        "meta": {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
        "sources_tried": ["unpaywall"]
      }
    ],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_a908f5156fc1",
    "latency_ms": 2036,
    "schema_version": "1.9.0",
    "cli_version": "0.13.1",
    "sources_tried": ["unpaywall"]
  }
}
```

Partial (batch mode: some DOIs failed, exit code reflects the failure class):
```json
{
  "ok": "partial",
  "data": {
    "results": [
      { "doi": "10.1038/s41586-021-03819-2", "success": true, "source": "unpaywall", ... },
      {
        "doi": "10.1234/nonexistent",
        "success": false,
        "source": null,
        "pdf_url": null,
        "file": null,
        "meta": {},
        "sources_tried": ["unpaywall", "semantic_scholar"],
        "error": {
          "code": "not_found",
          "message": "No open-access PDF found",
          "retryable": true,
          "retry_after_hours": 168,
          "reason": "OA availability changes over time; retry after embargo lifts or preprint appears"
        }
      }
    ],
    "summary": {"total": 2, "succeeded": 1, "failed": 1},
    "next": ["paper-fetch 10.1234/nonexistent --out pdfs"]
  },
  "meta": { ... }
}
```

The `next` slot is an array of suggested follow-up commands: re-invoking them retries only the failed subset. Combine with `--idempotency-key` to make the whole batch safely retriable without re-downloading the already-succeeded items.

Failure (bad arguments, exit code 3):
```json
{
  "ok": false,
  "error": {
    "code": "validation_error",
    "message": "Provide a DOI or --batch file",
    "retryable": false
  },
  "meta": { ... }
}
```

Per-item skipped (destination already exists, no `--overwrite`):
```json
{
  "doi": "10.1038/s41586-021-03819-2",
  "success": true,
  "source": "unpaywall",
  "pdf_url": "https://...",
  "file": "pdfs/Jumper_2021_...pdf",
  "skipped": true,
  "skip_reason": "file_exists",
  "sources_tried": ["unpaywall"]
}
```

Idempotency replay (re-run with the same `--idempotency-key`):

The cached envelope is returned verbatim, but `meta.request_id` and `meta.latency_ms` are re-stamped for the current call, and `meta.replayed_from_idempotency_key` is set. No network I/O occurs.
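
A consumer of these envelopes might branch on the three `ok` values roughly as follows. This is a sketch built only on the fields shown in the examples above; `summarize_envelope` and its return shape are hypothetical.

```python
import json


def summarize_envelope(raw: str) -> dict:
    """Classify a stdout envelope and collect what an orchestrator needs next."""
    env = json.loads(raw)
    ok = env["ok"]  # True, "partial", or False
    if ok is False:
        # Top-level failure: no per-item results, only an error object.
        return {"status": "failed", "error_code": env["error"]["code"],
                "failed_dois": [], "next": []}
    data = env["data"]
    failed = [r["doi"] for r in data["results"] if not r["success"]]
    return {
        "status": "ok" if ok is True else "partial",
        "error_code": None,
        "failed_dois": failed,
        "next": data.get("next", []),
    }


example = json.dumps({
    "ok": "partial",
    "data": {
        "results": [
            {"doi": "10.1038/s41586-021-03819-2", "success": True},
            {"doi": "10.1234/nonexistent", "success": False,
             "error": {"code": "not_found", "retryable": True}},
        ],
        "summary": {"total": 2, "succeeded": 1, "failed": 1},
        "next": ["paper-fetch 10.1234/nonexistent --out pdfs"],
    },
    "meta": {},
})
print(summarize_envelope(example))
```

Because `ok` is either a boolean or the string `"partial"`, the sketch compares with `is True` / `is False` instead of relying on truthiness, which would treat `"partial"` as success.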
Stderr progress (NDJSON)
When `--format json`, stderr emits one JSON object per line for liveness:

```
{"event": "session", "request_id": "req_...", "elapsed_ms": 0, "cli_version": "0.13.1", "schema_version": "1.9.0"}
{"event": "start", "request_id": "req_...", "elapsed_ms": 2, "doi": "10.1038/..."}
{"event": "source_try", "request_id": "req_...", "elapsed_ms": 2, "doi": "...", "source": "unpaywall"}
{"event": "source_hit", "request_id": "req_...", "elapsed_ms": 2036, "doi": "...", "source": "unpaywall", "pdf_url": "..."}
{"event": "download_ok", "request_id": "req_...", "elapsed_ms": 4120, "doi": "...", "file": "..."}
```

Event types: `session`, `start`, `source_try`, `source_hit`, `source_miss`, `source_skip`, `source_enrich`, `source_enrich_failed`, `download_ok`, `download_error`, `download_skip`, `dry_run`, `not_found`. All events share `request_id` and `elapsed_ms`, letting an orchestrator correlate progress across stderr and the final stdout envelope. The `session` event fires once per invocation, before any DOI work or network I/O, and carries `cli_version` / `schema_version` so agents can detect schema drift against a cached copy without waiting for the final envelope.

`source_enrich` fires when Semantic Scholar is used only to fill in a missing `author` / `title` after another source has already supplied the PDF URL; its `fields` array lists which fields were filled in. `source_enrich_failed` fires when that enrichment call fails: the Unpaywall PDF URL is still used, and the filename falls back to `unknown_<year>_…`.

When `--format text`, stderr emits human-readable prose.
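
An orchestrator could consume the NDJSON stream along these lines. The sketch assumes only the `event` and `request_id` fields documented above; `group_events` is a hypothetical helper, and skipping non-JSON lines is a defensive choice rather than documented behavior.

```python
import json
from collections import defaultdict


def group_events(stderr_text: str) -> dict:
    """Group NDJSON progress events by request_id, skipping non-JSON lines."""
    by_request = defaultdict(list)
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stderr
        if isinstance(event, dict) and "event" in event:
            by_request[event.get("request_id")].append(event)
    return dict(by_request)


sample = "\n".join([
    '{"event": "session", "request_id": "req_1", "elapsed_ms": 0, "schema_version": "1.9.0"}',
    '{"event": "source_try", "request_id": "req_1", "elapsed_ms": 2, "source": "unpaywall"}',
    '{"event": "download_ok", "request_id": "req_1", "elapsed_ms": 4120, "file": "a.pdf"}',
])
events = group_events(sample)["req_1"]
print([e["event"] for e in events])
```

The same `request_id` appears in the final envelope's `meta` slot, so the grouped events can be joined to the result they belong to.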
Exit codes
| Code | Meaning | Retryable class |
|---|---|---|
| `0` | All DOIs resolved / previewed | - |
| `1` | Unresolved: one or more DOIs had no OA copy; no transport failure | Not now (retry after `retry_after_hours`) |
| `2` | Reserved for auth errors (currently unused) | - |
| `3` | Validation error (bad arguments, missing input) | No |
| `4` | Transport error (network / download / IO failure) | Yes |
The taxonomy lets an orchestrator route failures deterministically: exit 4 is worth retrying immediately, exit 1 is not, exit 3 is a bug in the caller.
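
The routing described here reduces to a small lookup. In this sketch the action names are illustrative, and codes 0 and 2 are inferred from the order of the table rows; only 1, 3, and 4 are named explicitly in the text.

```python
def route_exit_code(code: int) -> str:
    """Map a paper-fetch exit code to an orchestrator action (names are illustrative)."""
    actions = {
        0: "done",
        1: "retry_later",  # no OA copy yet; wait before retrying
        2: "reserved",     # auth errors, currently unused
        3: "fix_caller",   # validation error: do not retry unchanged
        4: "retry_now",    # transport error: worth retrying immediately
    }
    return actions.get(code, "unknown")


for code in (0, 1, 3, 4):
    print(code, route_exit_code(code))
```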
Error codes in JSON
Every retryable error carries a `retry_after_hours` hint in the error object, so an orchestrator can schedule retries without guessing.

| Code | Meaning | Retryable | `retry_after_hours` |
|---|---|---|---|
| `validation_error` | Bad arguments or empty input | No | - |
| | Crossref returned no items for the given title | No | - |
| `not_found` | No open-access PDF found | Yes | 168 |
| | Network failure during download | Yes | |
| | Response was not a PDF (HTML landing page) | No | - |
| | PDF URL failed SSRF safety check (private IP / non-http(s) / non-80,443 / blocked metadata host) | No | - |
| | Response exceeded 50 MB limit | Yes | |
| | Local filesystem write failed | Yes | |
| | Unexpected error | No | - |

The canonical mapping lives in `RETRY_AFTER_HOURS` in `scripts/fetch.py` and is surfaced in `schema.error_codes`.
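
Scheduling from the hint might look like the sketch below. `next_retry_at` is a hypothetical helper, and treating a missing `retry_after_hours` as "retry immediately" is an assumption rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_retry_at(error: dict, now: Optional[datetime] = None) -> Optional[datetime]:
    """Return when to retry, or None if the error is not retryable."""
    if not error.get("retryable"):
        return None
    now = now or datetime.now(timezone.utc)
    # Assumption: a retryable error without a hint can be retried right away.
    hours = error.get("retry_after_hours", 0)
    return now + timedelta(hours=hours)


err = {"code": "not_found", "retryable": True, "retry_after_hours": 168}
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(next_retry_at(err, start))
```

With the 168-hour hint from the `not_found` example, the retry lands exactly one week after the failed attempt.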
Examples
```bash
# Single DOI (JSON output when piped; text when in a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2

# Single title (resolved to DOI via Crossref, then downloaded)
python scripts/fetch.py --title "Highly accurate protein structure prediction with AlphaFold"

# Dry-run preview (resolve without downloading)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --dry-run

# Title + dry-run: preview the resolved DOI and candidate matches
python scripts/fetch.py --title "Attention Is All You Need" --dry-run

# Force JSON (for agents even inside a terminal)
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format json

# Human-readable with pretty colors in a pipeline
python scripts/fetch.py 10.1038/s41586-020-2649-2 --format text

# Batch download, safely retriable
python scripts/fetch.py --batch dois.txt --out ./papers \
  --idempotency-key monday-review-batch

# Pipe DOIs from another tool
zot -F ids.json query ... | jq -r '.[].doi' | python scripts/fetch.py --batch -

# Agent discovery
python scripts/fetch.py schema --pretty

# Streaming mode: one result per line as each DOI resolves
python scripts/fetch.py --batch dois.txt --stream

# Works without UNPAYWALL_EMAIL (skips Unpaywall, uses remaining 4 sources)
python scripts/fetch.py 10.1038/s41586-020-2649-2
```

Environment
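| Variable | Default | Purpose |
|---|---|---|
| `UNPAYWALL_EMAIL` | unset | Contact email for Unpaywall API. Optional but recommended. Without it, Unpaywall is skipped (remaining sources still work). |
| `PAPER_FETCH_INSTITUTIONAL` | unset | Set to any value (e.g. `1`) to enable institutional mode (publisher-direct fallback plus a 1 req/s rate limit). |
| `PAPER_FETCH_NO_SCIHUB` | unset | Set to any value to disable the Sci-Hub fallback (step 7). |
| `PAPER_FETCH_SCIHUB_MIRRORS` | unset | Comma-separated mirror hostnames to try in priority order; when unset, the built-in defaults are used. |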
Institutional access (opt-in)
Many researchers have legitimate subscription access through their institution's IP range (on-campus or VPN). Paper-fetch can use that access by letting the publisher's own auth (your IP, your session cookies) decide whether to serve the PDF.
Host reachability does not differ between modes — public mode already trusts URLs returned by the OA APIs (Unpaywall, Semantic Scholar, bioRxiv, PMC) and fetches any HTTPS host that passes SSRF defense. Institutional mode adds two things: (1) a publisher-direct fallback (step 6 above) that constructs a publisher-side PDF URL by DOI prefix when every OA source missed, so your institutional IP/cookies can authorize the fetch, and (2) a 1 req/s rate limiter to keep batch jobs from getting your IP throttled or banned for "systematic downloading."
Opt in:

```bash
export PAPER_FETCH_INSTITUTIONAL=1
```

What changes in institutional mode:
| Aspect | Public (default) | Institutional |
|---|---|---|
| Host reachability | Any public HTTPS host passing SSRF defense | Same |
| SSRF defense | Enforced (private IP / non-http(s) / non-80,443 / cloud metadata all blocked) | Enforced — same rules |
| Publisher-direct fallback | Off | On — DOI-prefix → publisher PDF URL, last resort after all OA sources miss |
| Rate limit | None | 1 req/s token bucket (all outbound) |
What stays the same:
- `%PDF` magic-byte check and 50 MB size cap (prevents HTML landing pages and oversized responses slipping through)
- No CAPTCHA solving, ever. If a publisher shows a challenge, the response won't start with `%PDF` and paper-fetch falls through to the next source.
- No browser automation, no Playwright, no stealth.
- Agent cannot opt in on its own: `PAPER_FETCH_INSTITUTIONAL` must be set by the human operator in the shell environment. This is the trust boundary.

When paper-fetch can't find an OA copy and you're in public mode, the error envelope includes `suggest_institutional: true` and a hint telling the user to set the env var. Agents can surface this verbatim rather than failing silently.

ToS notice: almost every publisher subscription prohibits "systematic downloading." The 1 req/s rate limit plus the existing per-file idempotency are designed to keep individual research use within acceptable bounds. Running many parallel paper-fetch processes, or lifting the rate limit, can trigger a publisher-wide IP ban affecting your entire institution. Don't.
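
The `%PDF` magic-byte check and the 50 MB cap listed above apply in both modes and can be sketched as a guard over a streamed response body. `read_pdf` is a hypothetical helper, and interpreting "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from typing import Iterable

MAX_BYTES = 50 * 1024 * 1024  # 50 MB cap (exact byte count is an assumption)


def read_pdf(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> bytes:
    """Collect a streamed response, rejecting non-PDF or oversized bodies."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError("response exceeds size limit")
        # Fail fast once enough bytes have arrived to check the magic prefix.
        if len(buf) >= 4 and not buf.startswith(b"%PDF"):
            raise ValueError("response is not a PDF")
    if not buf.startswith(b"%PDF"):
        raise ValueError("response is not a PDF")  # empty or very short body
    return bytes(buf)


body = read_pdf([b"%P", b"DF-1.7 minimal"])
print(len(body))
```

An HTML login page or CAPTCHA challenge fails the prefix check on the first chunk, which is what lets an unauthorized publisher response fall through to the next source.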
Notes
- Auth is delegated. The agent never runs a login subcommand. The human or the orchestrator sets `UNPAYWALL_EMAIL` in the environment; the agent inherits it. Missing email degrades gracefully to the remaining 4 sources.
- Trust is directional. CLI arguments are validated once at the entry point. SSRF defense, the `%PDF` magic-byte check, and the 50 MB size cap are enforced in the environment layer, not at the agent's request. An agent cannot loosen safety by passing a flag; opting into institutional mode (and its rate-limit risk profile) is an operator action via environment variable.
- Downloads are naturally idempotent. Re-running against the same `--out` skips files that already exist (deterministic filename: `{first_author}_{year}_{journal_abbrev}_{short_title}.pdf`; the journal segment is omitted if metadata lacks a journal/venue). Pair with `--idempotency-key` to also replay the exact envelope without any network I/O.
- Default output directory: `./pdfs/`.
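
The deterministic filename pattern from the notes above can be sketched as follows. `build_filename` and `_slug` are hypothetical helpers; the underscore slugging and the 40-character title limit are assumptions inferred from the example filename in the output contract, not confirmed behavior.

```python
import re
from typing import Optional


def _slug(text: str) -> str:
    """Collapse runs of non-alphanumeric characters into single underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def build_filename(first_author: str, year: int, journal_abbrev: Optional[str],
                   title: str, max_title_len: int = 40) -> str:
    """Build {first_author}_{year}_{journal_abbrev}_{short_title}.pdf,
    leaving out the journal segment when no journal/venue is known."""
    parts = [_slug(first_author), str(year)]
    if journal_abbrev:
        parts.append(_slug(journal_abbrev))
    parts.append(_slug(title)[:max_title_len])
    return "_".join(parts) + ".pdf"


print(build_filename("Jumper", 2021, None,
                     "Highly accurate protein structure prediction with AlphaFold"))
```

Because the name depends only on metadata, a second run against the same output directory produces the same path and can be skipped with `skip_reason: "file_exists"`.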