Literature Engineer (evidence collector)

Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.
This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports / enable network, not to fabricate.

Inputs

  • queries.md: keywords, exclude, max_results, time window
  • Optional offline sources (any combination; all are merged):
    • papers/import.(csv|json|jsonl|bib)
    • papers/arxiv_export.(csv|json|jsonl|bib)
    • papers/imports/*.(csv|json|jsonl|bib)
  • Optional snowball exports (offline):
    • papers/snowball/*.(csv|json|jsonl|bib)
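
A minimal queries.md might look like the sketch below. The field names follow the list above, but the exact syntax the script parses is an assumption and the values are purely illustrative; check the script's --help for the authoritative format.

```markdown
keywords: LLM agents; tool use; planning
exclude: blog posts; non-archival
max_results: 500
time window: 2019-2025
```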

Outputs

  • papers/papers_raw.jsonl
    • 1 record per line; minimum fields:
      • title (str), authors (list[str]), year (int|""), url (str)
      • stable identifier(s): arxiv_id and/or doi
      • abstract (str; may be empty in offline mode)
      • source (str) + provenance (list[dict])
  • papers/papers_raw.csv (human scan)
  • papers/retrieval_report.md (route counts, missing-meta stats, next actions)
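
As a sketch, one conforming papers_raw.jsonl line could be produced like this. All field values below are placeholders, not real paper metadata:

```python
import json

# One papers_raw.jsonl record with the minimum fields listed above.
# Every value here is a placeholder, not real paper metadata.
record = {
    "title": "Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "year": 2023,               # int, or "" when unknown
    "url": "https://arxiv.org/abs/0000.00001",
    "arxiv_id": "0000.00001",   # stable identifier (and/or "doi")
    "doi": "",
    "abstract": "",             # may be empty in offline mode
    "source": "arxiv_export",
    "provenance": [{"route": "offline", "file": "papers/arxiv_export.csv"}],
}
# One record per line: serialize compactly, no embedded newlines.
line = json.dumps(record, ensure_ascii=False)
```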

Workflow (multi-route)

  1. Offline-first merge: ingest all available offline exports (and label provenance per file).
  2. Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
  3. Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
  4. Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
  5. Report: write a concise retrieval report with coverage buckets and missing-meta counts.
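
Step 4 might look like the following sketch. The canonicalization policy here (strip arXiv version suffixes; prefer arxiv_id, then doi, then url as the dedupe key) is one plausible choice, not necessarily the script's exact behavior:

```python
import re

def canonical_key(rec):
    """Pick a dedupe key: arXiv id first, then DOI, then URL."""
    arxiv = (rec.get("arxiv_id") or "").lower()
    arxiv = re.sub(r"v\d+$", "", arxiv)  # 2303.11366v2 -> 2303.11366
    if arxiv:
        return ("arxiv", arxiv)
    doi = (rec.get("doi") or "").lower().removeprefix("https://doi.org/")
    if doi:
        return ("doi", doi)
    return ("url", (rec.get("url") or "").rstrip("/"))

def merge(records):
    """Merge duplicate records, unioning their provenance lists."""
    out = {}
    for rec in records:
        key = canonical_key(rec)
        if key in out:
            seen = out[key]["provenance"]
            for p in rec.get("provenance", []):
                if p not in seen:
                    seen.append(p)
        else:
            # Copy the record and its provenance list so the merge
            # never mutates its inputs.
            out[key] = {**rec, "provenance": list(rec.get("provenance", []))}
    return list(out.values())
```

Copying each record and its provenance list on first sight keeps the merge side-effect free, which matters when the same export is ingested by more than one route.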

Quality checklist

  • Candidate pool size target met (A150++: ≥1200) without fabrication.
  • Each record has a stable identifier (arxiv_id or doi, plus url).
  • Each record has provenance: which route/file/API produced it.
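
A quick self-check against this list could be scripted as below. Field names follow the output schema above; the size target is passed in explicitly rather than hard-coded:

```python
import json

def quality_report(lines, target=1200):
    """Check a pool of JSONL lines against the checklist:
    size target, stable IDs, and provenance coverage."""
    total = with_id = with_prov = 0
    for line in lines:
        rec = json.loads(line)
        total += 1
        if rec.get("arxiv_id") or rec.get("doi"):
            with_id += 1
        if rec.get("provenance"):
            with_prov += 1
    return {
        "total": total,
        "target_met": total >= target,
        "missing_stable_id": total - with_id,
        "missing_provenance": total - with_prov,
    }
```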

Script

Quick Start

  • python .codex/skills/literature-engineer/scripts/run.py --help

All Options

  • See python .codex/skills/literature-engineer/scripts/run.py --help.
  • Reads retrieval config from queries.md.
  • Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
  • Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
  • Online expansion requires network: use --online and/or --snowball.
  • Online retrieval is best-effort: the arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
  • For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts plus a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
  • If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.
  • When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
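
The rerun-on-empty advice can be automated with a small wrapper. The fetch callable here is a stand-in for whatever retrieval call the script makes, not its real interface:

```python
import time

def retry_on_empty(fetch, attempts=3, delay=5.0):
    """Re-invoke a best-effort retrieval routine when it returns no records.

    `fetch` is any zero-argument callable returning a list of records;
    it is a placeholder, not the script's actual API.
    """
    records = []
    for attempt in range(attempts):
        records = fetch()
        if records:
            return records                  # got verifiable records; stop
        time.sleep(delay * (attempt + 1))   # back off before the rerun
    return records                          # still empty: report, never fabricate
```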

Examples

  • Offline imports only:
    • Put exports under papers/imports/ then run:
      • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
  • Explicit offline inputs (multi-route):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
  • Online arXiv retrieval (needs network):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
  • Snowballing (needs network unless you provide offline snowball exports):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:
  • papers/papers_raw.jsonl size is far below target; downstream stages will fail on mapping/bindings and citation density.
Causes:
  • Only a small offline export was provided.
  • Network is blocked, so online retrieval/snowballing can't run.
Solutions:
  • Provide additional exports under papers/imports/ (multiple routes/queries).
  • Provide snowball exports under papers/snowball/.
  • Enable network and rerun with --online --snowball.

Issue: many records missing stable IDs
Symptom:
  • Report shows many entries with empty arxiv_id and doi.
Solutions:
  • Prefer arXiv/OpenReview/ACL exports that include stable IDs.
  • If you have network, rerun with --online to backfill arXiv IDs.
  • Filter out ID-less entries before downstream citation generation.
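
Filtering ID-less entries before citation generation can be as simple as the sketch below (it assumes the papers_raw.jsonl schema described earlier):

```python
import json

def drop_idless(lines):
    """Keep only records carrying a stable identifier (arxiv_id or doi)."""
    kept = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("arxiv_id") or rec.get("doi"):
            kept.append(rec)
    return kept
```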