literature-engineer
Literature Engineer (evidence collector)
Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.
This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports / enable network, not to fabricate.
Inputs
- `queries.md` - `keywords`, `exclude`, `max_results`, time window
- Optional offline sources (any combination; all are merged):
  `papers/import.(csv|json|jsonl|bib)`, `papers/arxiv_export.(csv|json|jsonl|bib)`, `papers/imports/*.(csv|json|jsonl|bib)`
- Optional snowball exports (offline):
  `papers/snowball/*.(csv|json|jsonl|bib)`
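The exact layout of `queries.md` is not shown in this README; a minimal sketch, assuming a simple key/value format and using the four field names listed above (the values are purely illustrative):

```markdown
keywords: LLM agents, tool use, planning
exclude: survey-only, non-English
max_results: 400
time window: 2020-2025
```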
Outputs
- `papers/papers_raw.jsonl` - 1 record per line; minimum fields:
  - `title` (str), `authors` (list[str]), `year` (int|""), `url` (str)
  - stable identifier(s): `arxiv_id` and/or `doi`
  - `abstract` (str; may be empty in offline mode)
  - `source` (str) + `provenance` (list[dict])
- `papers/papers_raw.csv` - (human scan)
- `papers/retrieval_report.md` - (route counts, missing-meta stats, next actions)
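The minimum-field contract above can be sketched as a small validator; the field names and types come from this README, while the helper name and the example record values are illustrative:

```python
import json

# Required string/list fields from the minimum-field list above.
REQUIRED = {"title": str, "authors": list, "url": str, "abstract": str,
            "source": str, "provenance": list}

def is_valid_record(rec: dict) -> bool:
    """Check one parsed papers_raw.jsonl record against the minimum schema."""
    if not all(isinstance(rec.get(k), t) for k, t in REQUIRED.items()):
        return False
    # year is int or "" (empty string) in this schema
    if not (isinstance(rec.get("year"), int) or rec.get("year") == ""):
        return False
    # at least one stable identifier must be present and non-empty
    return bool(rec.get("arxiv_id") or rec.get("doi"))

line = json.dumps({
    "title": "ReAct: Synergizing Reasoning and Acting in Language Models",
    "authors": ["S. Yao"], "year": 2022, "url": "https://arxiv.org/abs/2210.03629",
    "arxiv_id": "2210.03629", "doi": "", "abstract": "",
    "source": "arxiv_api", "provenance": [{"route": "online", "query": "llm agents"}],
})
print(is_valid_record(json.loads(line)))  # → True
```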
Workflow (multi-route)
- Offline-first merge: ingest all available offline exports (and label provenance per file).
- Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
- Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
- Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning `provenance`.
- Report: write a concise retrieval report with coverage buckets and missing-meta counts.
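The normalize + dedupe step can be sketched as keying each record on its strongest stable identifier; this is an illustrative sketch, not the script's actual logic (DOI lowercasing assumes DOIs are case-insensitive, and arXiv version suffixes like `v2` are stripped so revisions collapse to one entry):

```python
import re

def canonical_key(rec: dict) -> str:
    """Prefer DOI, then arXiv id (version suffix stripped), then URL."""
    if rec.get("doi"):
        return "doi:" + rec["doi"].lower()
    if rec.get("arxiv_id"):
        return "arxiv:" + re.sub(r"v\d+$", "", rec["arxiv_id"])
    return "url:" + rec.get("url", "").rstrip("/")

def dedupe(records: list) -> list:
    """Merge duplicate records, unioning their provenance lists."""
    merged = {}
    for rec in records:
        key = canonical_key(rec)
        if key in merged:
            # keep first-seen metadata; union (concatenate) provenance entries
            merged[key]["provenance"].extend(rec.get("provenance", []))
        else:
            entry = dict(rec)
            entry["provenance"] = list(rec.get("provenance", []))
            merged[key] = entry
    return list(merged.values())
```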
Quality checklist
- Candidate pool size target met (A150++: ≥1200) without fabrication.
- Each record has a stable identifier (`arxiv_id` or `doi`), plus `url`.
- Each record has provenance: which route/file/API produced it.
Script
Quick Start
python .codex/skills/literature-engineer/scripts/run.py --help
All Options
- See `python .codex/skills/literature-engineer/scripts/run.py --help`.
- Reads retrieval config from `queries.md`.
- Offline inputs (merged if present): `papers/import.(csv|json|jsonl|bib)`, `papers/arxiv_export.(csv|json|jsonl|bib)`, `papers/imports/*.(csv|json|jsonl|bib)`.
- Optional offline snowball inputs: `papers/snowball/*.(csv|json|jsonl|bib)`.
- Online expansion requires network: use `--online` and/or `--snowball`.
- Online retrieval is best-effort: the arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
- For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts + a small prior-survey seed set) so `ref.bib` can include must-cite anchors even when keyword search misses them.
- If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the `r.jina.ai` proxy so the pipeline can still self-boot without manual exports.
- When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
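The pinned id_list route boils down to one arXiv API query. A minimal sketch: the IDs below are the public arXiv numbers of the papers named above, but the script's actual pinned set and fetch code are assumptions here:

```python
# Hypothetical pinned set, shown for illustration only.
PINNED = [
    "2210.03629",  # ReAct
    "2302.04761",  # Toolformer
    "2303.11366",  # Reflexion
    "2305.16291",  # Voyager
    "2305.10601",  # Tree of Thoughts
]

def arxiv_id_list_url(ids, max_results=50):
    """Build an arXiv API id_list query URL (fetched only when --online is set)."""
    return ("http://export.arxiv.org/api/query?"
            f"id_list={','.join(ids)}&max_results={max_results}")

print(arxiv_id_list_url(PINNED))
```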
Examples
- Offline imports only:
  - Put exports under `papers/imports/` then run:
    python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
- Explicit offline inputs (multi-route):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
- Online arXiv retrieval (needs network):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
- Snowballing (needs network unless you provide offline snowball exports):
  python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball
Troubleshooting
Issue: can't reach ≥1200 papers
Symptom:
- `papers/papers_raw.jsonl` size is far below target; later stages will fail mapping/bindings and citation density.
Causes:
- Only a small offline export was provided.
- Network is blocked so online retrieval/snowballing can't run.
Solutions:
- Provide additional exports under `papers/imports/` (multiple routes/queries).
- Provide snowball exports under `papers/snowball/`.
- Enable network and rerun with `--online --snowball`.
Issue: many records missing stable IDs
Symptom:
- Report shows many entries with empty `arxiv_id` and `doi`.
Solutions:
- Prefer arXiv/OpenReview/ACL exports that include stable IDs.
- If you have network, rerun with `--online` to backfill arXiv IDs.
- Filter out ID-less entries before downstream citation generation.
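Filtering ID-less entries before citation generation can be done with a short pass over the JSONL file; the function name and paths are illustrative, not part of the script:

```python
import json

def filter_with_ids(in_path: str, out_path: str) -> int:
    """Copy only records that carry a stable identifier (arxiv_id or doi)."""
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("arxiv_id") or rec.get("doi"):
                dst.write(json.dumps(rec, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```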