Literature Engineer (evidence collector)

Goal: build a large, verifiable candidate pool for downstream dedupe/rank, mapping, notes, citations, and drafting.
This skill is intentionally evidence-first: if you can't reach the target size with verifiable IDs/provenance, the correct behavior is to block and ask for more exports / enable network, not to fabricate.

Inputs

  • queries.md: keywords, exclude, max_results, time window
  • Optional offline sources (any combination; all are merged):
    • papers/import.(csv|json|jsonl|bib)
    • papers/arxiv_export.(csv|json|jsonl|bib)
    • papers/imports/*.(csv|json|jsonl|bib)
  • Optional snowball exports (offline):
    • papers/snowball/*.(csv|json|jsonl|bib)
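
A minimal queries.md might look like the sketch below. The field names follow the list above, but the exact syntax the script parses is an assumption and the values are purely illustrative; check the script's --help for the authoritative format.

```markdown
keywords: LLM agents; tool use; planning
exclude: blog posts; non-archival
max_results: 500
time window: 2019-2025
```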

Outputs

  • papers/papers_raw.jsonl
    • 1 record per line; minimum fields:
      • title (str), authors (list[str]), year (int|""), url (str)
      • stable identifier(s): arxiv_id and/or doi
      • abstract (str; may be empty in offline mode)
      • source (str) + provenance (list[dict])
  • papers/papers_raw.csv (human scan)
  • papers/retrieval_report.md (route counts, missing-meta stats, next actions)
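
As a sketch, one conforming papers_raw.jsonl line could be produced like this. All field values below are placeholders, not real paper metadata:

```python
import json

# One papers_raw.jsonl record with the minimum fields listed above.
# Every value here is a placeholder, not real paper metadata.
record = {
    "title": "Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "year": 2023,               # int, or "" when unknown
    "url": "https://arxiv.org/abs/0000.00001",
    "arxiv_id": "0000.00001",   # stable identifier (and/or "doi")
    "doi": "",
    "abstract": "",             # may be empty in offline mode
    "source": "arxiv_export",
    "provenance": [{"route": "offline", "file": "papers/arxiv_export.csv"}],
}
# One record per line: serialize compactly, no embedded newlines.
line = json.dumps(record, ensure_ascii=False)
```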

Workflow (multi-route)

  1. Offline-first merge: ingest all available offline exports (and label provenance per file).
  2. Online retrieval (optional): if enabled, run arXiv API retrieval for each keyword query.
  3. Snowballing (optional): expand from seed papers via references/cited-by (online), or merge offline snowball exports.
  4. Normalize + dedupe: canonicalize IDs/URLs, merge duplicates while unioning provenance.
  5. Report: write a concise retrieval report with coverage buckets and missing-meta counts.
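
Step 4 might look like the following sketch. The canonicalization policy here (strip arXiv version suffixes; prefer arxiv_id, then doi, then url as the dedupe key) is one plausible choice, not necessarily the script's exact behavior:

```python
import re

def canonical_key(rec):
    """Pick a dedupe key: arXiv id first, then DOI, then URL."""
    arxiv = (rec.get("arxiv_id") or "").lower()
    arxiv = re.sub(r"v\d+$", "", arxiv)  # 2303.11366v2 -> 2303.11366
    if arxiv:
        return ("arxiv", arxiv)
    doi = (rec.get("doi") or "").lower().removeprefix("https://doi.org/")
    if doi:
        return ("doi", doi)
    return ("url", (rec.get("url") or "").rstrip("/"))

def merge(records):
    """Merge duplicate records, unioning their provenance lists."""
    out = {}
    for rec in records:
        key = canonical_key(rec)
        if key in out:
            seen = out[key]["provenance"]
            for p in rec.get("provenance", []):
                if p not in seen:
                    seen.append(p)
        else:
            # Copy the record and its provenance list so the merge
            # never mutates its inputs.
            out[key] = {**rec, "provenance": list(rec.get("provenance", []))}
    return list(out.values())
```

Copying each record and its provenance list on first sight keeps the merge side-effect free, which matters when the same export is ingested by more than one route.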

Quality checklist

  • Candidate pool size target met (A150++: ≥1200) without fabrication.
  • Each record has a stable identifier (arxiv_id or doi, plus url).
  • Each record has provenance: which route/file/API produced it.
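
A quick self-check against this list could be scripted as below. Field names follow the output schema above; the size target is passed in explicitly rather than hard-coded:

```python
import json

def quality_report(lines, target=1200):
    """Check a pool of JSONL lines against the checklist:
    size target, stable IDs, and provenance coverage."""
    total = with_id = with_prov = 0
    for line in lines:
        rec = json.loads(line)
        total += 1
        if rec.get("arxiv_id") or rec.get("doi"):
            with_id += 1
        if rec.get("provenance"):
            with_prov += 1
    return {
        "total": total,
        "target_met": total >= target,
        "missing_stable_id": total - with_id,
        "missing_provenance": total - with_prov,
    }
```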

Script

Quick Start

  • python .codex/skills/literature-engineer/scripts/run.py --help

All Options

  • See python .codex/skills/literature-engineer/scripts/run.py --help.
  • Reads retrieval config from queries.md.
  • Offline inputs (merged if present): papers/import.(csv|json|jsonl|bib), papers/arxiv_export.(csv|json|jsonl|bib), papers/imports/*.(csv|json|jsonl|bib).
  • Optional offline snowball inputs: papers/snowball/*.(csv|json|jsonl|bib).
  • Online expansion requires network: use --online and/or --snowball.
  • Online retrieval is best-effort: the arXiv API can be flaky in some environments; the script will also attempt a Semantic Scholar route when needed.
  • For LLM-agent topics, the script also performs a best-effort pinned arXiv id_list fetch (canonical classics like ReAct/Toolformer/Reflexion/Voyager/Tree-of-Thoughts plus a small prior-survey seed set) so ref.bib can include must-cite anchors even when keyword search misses them.
  • If HTTPS/TLS to external domains is unstable, the Semantic Scholar route is fetched via the r.jina.ai proxy so the pipeline can still self-boot without manual exports.
  • When an online run returns 0 records due to transient network errors, a simple rerun is often sufficient (the pipeline should not fabricate).
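
The rerun-on-empty advice can be automated with a small wrapper. The fetch callable here is a stand-in for whatever retrieval call the script makes, not its real interface:

```python
import time

def retry_on_empty(fetch, attempts=3, delay=5.0):
    """Re-invoke a best-effort retrieval routine when it returns no records.

    `fetch` is any zero-argument callable returning a list of records;
    it is a placeholder, not the script's actual API.
    """
    records = []
    for attempt in range(attempts):
        records = fetch()
        if records:
            return records                  # got verifiable records; stop
        time.sleep(delay * (attempt + 1))   # back off before the rerun
    return records                          # still empty: report, never fabricate
```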

Examples

  • Offline imports only:
    • Put exports under papers/imports/ then run:
      • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws>
  • Explicit offline inputs (multi-route):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --input path/to/a.bib --input path/to/b.jsonl
  • Online arXiv retrieval (needs network):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --online
  • Snowballing (needs network unless you provide offline snowball exports):
    • python .codex/skills/literature-engineer/scripts/run.py --workspace <ws> --snowball

Troubleshooting

Issue: can't reach ≥1200 papers

Symptom:
  • papers/papers_raw.jsonl size is far below target; downstream stages will fail on mapping/bindings and citation density.
Causes:
  • Only a small offline export was provided.
  • Network is blocked, so online retrieval/snowballing can't run.
Solutions:
  • Provide additional exports under papers/imports/ (multiple routes/queries).
  • Provide snowball exports under papers/snowball/.
  • Enable network and rerun with --online --snowball.

Issue: many records missing stable IDs
Symptom:
  • Report shows many entries with empty arxiv_id and doi.
Solutions:
  • Prefer arXiv/OpenReview/ACL exports that include stable IDs.
  • If you have network, rerun with --online to backfill arXiv IDs.
  • Filter out ID-less entries before downstream citation generation.
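
Filtering ID-less entries before citation generation can be as simple as the sketch below (it assumes the papers_raw.jsonl schema described earlier):

```python
import json

def drop_idless(lines):
    """Keep only records carrying a stable identifier (arxiv_id or doi)."""
    kept = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("arxiv_id") or rec.get("doi"):
            kept.append(rec)
    return kept
```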