arxiv-search

arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Input

`queries.md` (keywords, excludes, time window)

Outputs

`papers/papers_raw.jsonl` (JSONL; 1 paper per line)
- Each record includes at least: `title`, `authors`, `year`, `url`, `abstract`
- When using the arXiv API online mode, records also include helpful metadata: `arxiv_id`, `pdf_url`, `categories`, `primary_category`, `published`, `updated`, `doi`, `journal_ref`, `comment`

Convenience index (optional but generated by the script):
- `papers/papers_raw.csv`
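
To make the record shape concrete, here is a minimal sketch of one online-mode record serialized as a single JSONL line. Only the field names come from the lists above; every value is a placeholder, and the value types (for example `categories` as a list, timestamps as ISO strings) are assumptions rather than something the script is known to guarantee.

```python
import json

# Placeholder values for illustration only; field names follow the lists above.
record = {
    "title": "An Example Paper Title",
    "authors": ["First Author", "Second Author"],
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract": "One-paragraph abstract text.",
    # Extra metadata available in online mode (value types are assumptions).
    "arxiv_id": "0000.00000",
    "pdf_url": "https://arxiv.org/pdf/0000.00000",
    "categories": ["cs.CL", "cs.AI"],
    "primary_category": "cs.CL",
    "published": "2024-01-01T00:00:00Z",
    "updated": "2024-01-02T00:00:00Z",
    "doi": None,
    "journal_ref": None,
    "comment": None,
}

# JSONL means one JSON object per line, so the serialized form must not contain newlines.
line = json.dumps(record, ensure_ascii=False)
print(line)
```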

Decision: online vs offline

- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline but gain network access later, you can enrich missing fields (abstract/authors/categories) via arXiv `id_list` using `--enrich-metadata` or by setting `enrich_metadata: true` in `queries.md`.
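
The hybrid path can be sketched as follows: pick the records that lack enrichable fields, then build one arXiv API request for their IDs using the API's `id_list` parameter. This is an illustration of the idea, not the logic of `run.py`; the helper names `needs_enrichment` and `build_id_list_url` are hypothetical, and the second sample ID is a placeholder.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
ENRICHABLE = ("abstract", "authors", "categories")

def needs_enrichment(record):
    """True if any enrichable field is missing or empty (hypothetical helper)."""
    return any(not record.get(field) for field in ENRICHABLE)

def build_id_list_url(arxiv_ids):
    """Build one request URL for a batch of arXiv IDs (hypothetical helper)."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    # Keep commas literal so the ID list stays readable in the URL.
    return f"{ARXIV_API}?{urlencode(params, safe=',')}"

records = [
    {"arxiv_id": "2509.02547", "title": "t", "authors": [], "abstract": ""},
    {"arxiv_id": "0000.00000", "title": "t", "authors": ["A"],
     "abstract": "x", "categories": ["cs.CL"]},
]

# Only records with an ID and at least one missing field are worth a request.
todo = [r["arxiv_id"] for r in records if r.get("arxiv_id") and needs_enrichment(r)]
url = build_id_list_url(todo)
print(url)
```

The sketch only builds the URL; fetching and parsing the Atom response needs network access and is left out.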

Workflow (heuristic)

- Read `queries.md` and expand into concrete query strings.
- Retrieve results (online) or import an export (offline).
- Normalize every record to include at least:
  - `title`, `authors` (array), `year`, `url`, `abstract`
- Keep the set broad at this stage; dedupe/ranking comes next.
- Apply time window and `max_results` if specified.
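
The normalize and limit steps above can be sketched as below. The function names, the author separators, the fallback from `published` to `year`, and the choice to keep records whose year is unknown are all assumptions made for illustration; the script may behave differently.

```python
def normalize(raw):
    """Map a loosely structured row onto the minimum record shape."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: string author lists are separated by ';' or ','.
        sep = ";" if ";" in authors else ","
        authors = [a.strip() for a in authors.split(sep) if a.strip()]
    year = raw.get("year")
    if year is None and raw.get("published"):
        # Assumption: 'published' is an ISO timestamp starting with the year.
        year = raw["published"][:4]
    return {
        "title": " ".join(str(raw.get("title", "")).split()),
        "authors": list(authors),
        "year": int(year) if year not in (None, "") else None,
        "url": raw.get("url", ""),
        "abstract": " ".join(str(raw.get("abstract", "")).split()),
    }

def apply_limits(records, year_from=None, year_to=None, max_results=None):
    """Filter by an inclusive year window, then cap the total."""
    kept = [
        r for r in records
        if r["year"] is None  # assumption: unknown years are kept to stay broad
        or ((year_from is None or r["year"] >= year_from)
            and (year_to is None or r["year"] <= year_to))
    ]
    return kept[:max_results] if max_results is not None else kept

rows = [
    {"title": "  A   title ", "authors": "Ann Lee; Bo Chen",
     "published": "2023-05-01T00:00:00Z", "url": "u1", "abstract": "x"},
    {"title": "Old", "authors": ["C"], "year": "2019", "url": "u2", "abstract": "y"},
]
records = [normalize(r) for r in rows]
kept = apply_limits(records, year_from=2022, year_to=2025, max_results=200)
print([r["url"] for r in kept])
```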

Quality checklist

- `papers/papers_raw.jsonl` exists.
- Each line is valid JSON and contains `title`, `authors`, `year`, `url`.
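
A checklist like this is easy to automate. The following is a minimal sketch of a validator for the second item; `check_jsonl` is a hypothetical helper, not part of the script, and it works on any iterable of lines so it can be pointed at an open file or at sample strings.

```python
import json

REQUIRED = ("title", "authors", "year", "url")

def check_jsonl(lines):
    """Return (line_number, problem) pairs; an empty list means every line passed."""
    problems = []
    for n, line in enumerate(lines, start=1):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(rec, dict):
            problems.append((n, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in rec]
        if missing:
            problems.append((n, "missing: " + ", ".join(missing)))
    return problems

sample = [
    '{"title": "T", "authors": ["A"], "year": 2024, "url": "u"}',
    '{"title": "T"}',
    "not json",
]
problems = check_jsonl(sample)
print(problems)
```

To run it against the real output, pass an open file handle for `papers/papers_raw.jsonl` in place of `sample`.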

Side effects

- Allowed: create/overwrite `papers/papers_raw.jsonl`; append notes to `STATUS.md`.
- Not allowed: write prose sections in `output/` before writing is approved.

Script

Quick Start

python .codex/skills/arxiv-search/scripts/run.py --help

Online:

python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200

Offline import:

python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>

All Options

- `--query <q>`: repeatable; multiple queries are unioned
- `--exclude <term>`: repeatable; excludes applied after retrieval
- `--max-results <n>`: cap total retrieved
- `--input <export.*>`: offline mode (CSV/JSON/JSONL)
- `--enrich-metadata`: best-effort enrich via arXiv `id_list` (needs network)

`queries.md` also supports: `keywords`, `exclude`, `time window`, `max_results`, `enrich_metadata`
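
To illustrate what "unioned" and "applied after retrieval" could mean in practice, here is a small sketch. It assumes the union is keyed on `url` and that an exclude term is a case-insensitive substring match against title and abstract; neither detail is stated in this document, and both function names are hypothetical.

```python
def union_results(result_sets):
    """Merge per-query result lists, keeping the first record seen per URL."""
    merged = {}
    for results in result_sets:
        for rec in results:
            merged.setdefault(rec["url"], rec)  # assumption: url identifies a paper
    return list(merged.values())

def apply_excludes(records, terms):
    """Drop records whose title or abstract contains any exclude term."""
    lowered = [t.lower() for t in terms]

    def hit(rec):
        text = f"{rec.get('title', '')} {rec.get('abstract', '')}".lower()
        return any(term in text for term in lowered)

    return [r for r in records if not hit(r)]

# Two toy result sets standing in for two --query runs.
query_a = [
    {"url": "u1", "title": "LLM agent planning", "abstract": ""},
    {"url": "u2", "title": "A Survey of agents", "abstract": ""},
]
query_b = [
    {"url": "u1", "title": "LLM agent planning", "abstract": ""},
    {"url": "u3", "title": "Tool use", "abstract": "We study tool use."},
]

merged = union_results([query_a, query_b])
kept = apply_excludes(merged, ["survey"])
print([r["url"] for r in kept])
```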

Examples

Online (multi-query + excludes):

python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300

Fetch a single paper by arXiv ID (direct `id_list` fetch):

python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1

Offline auto-detect (no flags):

Place `papers/import.csv` (or `.json/.jsonl`) under the workspace, then run:

python .codex/skills/arxiv-search/scripts/run.py --workspace <ws>

Offline import + time window (via `queries.md`):
- Set `- time window: { from: 2022, to: 2025 }`, then run offline import normally
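
For readers wiring up their own tooling around `queries.md`, the time-window line shown above can be read with a small regular expression. This is only a sketch that handles the exact form in the example; `parse_time_window` is a hypothetical helper, and the script's own parser may accept other forms.

```python
import re

# Matches the form shown above: time window: { from: 2022, to: 2025 }
WINDOW = re.compile(
    r"time window:\s*\{\s*from:\s*(\d{4})\s*,\s*to:\s*(\d{4})\s*\}"
)

def parse_time_window(text):
    """Return (from_year, to_year), or None if no window is configured."""
    match = WINDOW.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

window = parse_time_window("- time window: { from: 2022, to: 2025 }")
print(window)
```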

Troubleshooting

Common Issues

Issue: `papers/papers_raw.jsonl` is empty

Symptom:

- Script exits with “No results returned …” or output file is empty.

Causes:

- Network is blocked (online mode).
- Queries are too narrow or `queries.md` is empty.

Solutions:

- Use offline import: place `papers/import.csv|json|jsonl` in the workspace or pass `--input`.
- Broaden keywords and reduce excludes in `queries.md`.
- Run with explicit `--query` to sanity-check the parser.

Issue: Offline import records are missing fields

Symptom:

- Downstream steps fail because records are missing `authors/year/abstract/url`.

Causes:

- Export columns don’t match expected fields; upstream export is incomplete.

Solutions:

- Ensure the export contains at least `title`, `authors`, `year`, `url`, `abstract`.
- If you later have network access, use `--enrich-metadata` to backfill missing fields (best effort).
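
Before importing, a quick header check can catch a mismatched CSV export early. The sketch below compares the header row with the five minimum fields; `missing_columns` is a hypothetical helper, and matching column names case-insensitively is an assumption, since the script's actual column mapping is not described here.

```python
import csv
import io

EXPECTED = {"title", "authors", "year", "url", "abstract"}

def missing_columns(csv_text):
    """Return the expected field names that the CSV header does not provide."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: header names are compared case-insensitively after trimming.
    header = {name.strip().lower() for name in (reader.fieldnames or [])}
    return sorted(EXPECTED - header)

export = "Title,Authors,Year\nSome paper,Ann Lee,2024\n"
print(missing_columns(export))
```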

Recovery Checklist

- Confirm `queries.md` has non-empty `keywords` (or pass `--query`).
- If offline: confirm workspace has `papers/import.*` and rerun.
- Spot-check 3–5 JSONL lines: valid JSON + required fields.
