arxiv-search
arXiv Search (metadata-first)
Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.
When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.
Input
- `queries.md` (keywords, excludes, time window)
Outputs
- `papers/papers_raw.jsonl` (JSONL; 1 paper per line)
  - Each record includes at least: `title`, `authors`, `year`, `url`, `abstract`
  - When using the arXiv API online mode, records also include helpful metadata: `arxiv_id`, `pdf_url`, `categories`, `primary_category`, `published`, `updated`, `doi`, `journal_ref`, `comment`
- Convenience index (optional but generated by the script): `papers/papers_raw.csv`
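As a rough illustration of the two output shapes, the sketch below serializes records to JSONL text and to a flat CSV index. The helper names (`to_jsonl`, `to_csv`), the `"; "` author separator, and the example record are assumptions made for illustration; they are not taken from the script.

```python
import csv
import io
import json

REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract")


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records, columns=REQUIRED_FIELDS):
    """Build a flat CSV index; list-valued fields are joined with '; '."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    for r in records:
        row = []
        for c in columns:
            v = r.get(c, "")
            row.append("; ".join(v) if isinstance(v, list) else v)
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical record, for illustration only (not a real paper).
example = {
    "title": "An Example Paper",
    "authors": ["A. Author", "B. Author"],
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract": "Placeholder abstract.",
}

print(to_jsonl([example]), end="")
print(to_csv([example]), end="")
```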
Decision: online vs offline
- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline but have network access later, you can enrich missing fields (abstract/authors/categories) via arXiv `id_list` using `--enrich-metadata` or `enrich_metadata: true` in `queries.md`.
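The decision can be read as a small function. This is only a sketch of the rule as described here: `choose_mode`, its parameters, and the step names are hypothetical, and treating a supplied export as taking priority over online retrieval is an assumption.

```python
def choose_mode(has_network, import_path=None, enrich_metadata=False):
    """Return the ordered steps for one run (step names are illustrative)."""
    if import_path is not None:
        steps = ["offline_import"]
        # Hybrid: enrichment is only possible when the network is available.
        if enrich_metadata and has_network:
            steps.append("enrich_via_id_list")
        return steps
    if has_network:
        return ["online_retrieval"]
    raise RuntimeError("No network access and no export to import")


print(choose_mode(True))
print(choose_mode(False, "papers/import.csv"))
print(choose_mode(True, "papers/import.csv", enrich_metadata=True))
```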
Workflow (heuristic)
- Read `queries.md` and expand into concrete query strings.
- Retrieve results (online) or import an export (offline).
- Normalize every record to include at least:
  - `title`, `authors` (array), `year`, `url`, `abstract`
- Keep the set broad at this stage; dedupe/ranking comes next.
- Apply time window and `max_results` if specified.
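The normalization and limiting steps might look roughly like the sketch below. The function names, the author-splitting rule (split on `;` or `,`, which would mishandle "Last, First" names), and the fallback from `published` to `year` are assumptions, not the script's documented behavior.

```python
import re


def normalize(raw):
    """Coerce one raw record into the five minimum fields."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: authors in a single string are separated by ';' or ','.
        authors = [a for a in re.split(r"\s*[;,]\s*", authors.strip()) if a]
    year = raw.get("year")
    if year in (None, "") and raw.get("published"):
        year = str(raw["published"])[:4]  # e.g. "2023-05-01T00:00:00Z"
    try:
        year = int(year)
    except (TypeError, ValueError):
        year = None
    return {
        "title": " ".join(str(raw.get("title") or "").split()),
        "authors": list(authors),
        "year": year,
        "url": str(raw.get("url") or "").strip(),
        "abstract": " ".join(str(raw.get("abstract") or "").split()),
    }


def apply_limits(records, year_from=None, year_to=None, max_results=None):
    """Keep records inside the time window, then cap the total count."""
    kept = []
    for r in records:
        y = r.get("year")
        if year_from is not None and (y is None or y < year_from):
            continue
        if year_to is not None and (y is None or y > year_to):
            continue
        kept.append(r)
    return kept if max_results is None else kept[:max_results]
```

Records with no usable year are dropped only when a window bound is set; whether the script treats them the same way is not stated in this document.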
Quality checklist
- `papers/papers_raw.jsonl` exists.
- Each line is valid JSON and contains `title`, `authors`, `year`, `url`.
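A check along these lines could be automated as follows. This is a minimal sketch, not part of the skill: `check_jsonl` is a hypothetical helper that takes the file's text and reports problems by line number.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl(text):
    """Return (line_number, problem) pairs; an empty list means the text passes."""
    problems = []
    seen = 0
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        seen += 1
        try:
            rec = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(rec, dict):
            problems.append((n, "not a JSON object"))
            continue
        missing = [k for k in REQUIRED if k not in rec]
        if missing:
            problems.append((n, "missing: " + ", ".join(missing)))
    if seen == 0:
        return [(0, "file is empty")]
    return problems
```

Typical use would be `check_jsonl(open("papers/papers_raw.jsonl", encoding="utf-8").read())` from the workspace root.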
Side effects
- Allowed: create/overwrite `papers/papers_raw.jsonl`; append notes to `STATUS.md`.
- Not allowed: write prose sections in `output/` before writing is approved.
Script
Quick Start
- `python .codex/skills/arxiv-search/scripts/run.py --help`
- Online: `python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --query "<query>" --max-results 200`
- Offline import: `python .codex/skills/arxiv-search/scripts/run.py --workspace <workspace_dir> --input <export.csv|json|jsonl>`
All Options
- `--query <q>`: repeatable; multiple queries are unioned
- `--exclude <term>`: repeatable; excludes applied after retrieval
- `--max-results <n>`: cap total retrieved
- `--input <export.*>`: offline mode (CSV/JSON/JSONL)
- `--enrich-metadata`: best-effort enrich via arXiv `id_list` (needs network)
- `queries.md` also supports: `keywords`, `exclude`, `time window`, `max_results`, `enrich_metadata`
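To make "unioned" and "excludes applied after retrieval" concrete, here is one possible reading as code. The helper names, the identity key (`arxiv_id`, then `url`, then lowercased title), and the case-insensitive substring match on title plus abstract are all assumptions; the script may behave differently.

```python
def union_results(result_lists):
    """Union several result lists, keeping the first occurrence of each paper."""
    seen = set()
    merged = []
    for results in result_lists:
        for rec in results:
            # Assumed identity key: arxiv_id, else url, else lowercased title.
            key = rec.get("arxiv_id") or rec.get("url") or (rec.get("title") or "").lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append(rec)
    return merged


def apply_excludes(records, excludes):
    """Drop records whose title or abstract contains any exclude term."""
    terms = [t.lower() for t in excludes]

    def is_excluded(rec):
        text = ((rec.get("title") or "") + " " + (rec.get("abstract") or "")).lower()
        return any(t in text for t in terms)

    return [r for r in records if not is_excluded(r)]
```

Filtering after the union means a paper matched by several queries is tested against the exclude terms once.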
Examples
- Online (multi-query + excludes):
  `python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300`
- Fetch a single paper by arXiv ID (direct `id_list` fetch):
  `python .codex/skills/arxiv-search/scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1`
- Offline auto-detect (no flags):
  - Place `papers/import.csv` (or `.json`/`.jsonl`) under the workspace, then run: `python .codex/skills/arxiv-search/scripts/run.py --workspace <ws>`
- Offline import + time window (via `queries.md`):
  - Set `- time window: { from: 2022, to: 2025 }` then run offline import normally
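For context on the `id_list` example, the public arXiv API accepts either a `search_query` or an `id_list` parameter on its query endpoint. The sketch below builds such URLs without sending any request. The ID-detection regex (new-style IDs only) and the `all:"..."` phrase wrapping are assumptions about how a tool might map `--query` values, not a description of this script.

```python
import re
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
# New-style arXiv identifiers such as 2509.02547 or 2509.02547v2.
NEW_STYLE_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def build_query_url(query, max_results=10):
    """Build an arXiv API URL: id_list for bare IDs, search_query otherwise."""
    q = query.strip()
    if NEW_STYLE_ID.match(q):
        params = {"id_list": q, "max_results": max_results}
    else:
        params = {"search_query": f'all:"{q}"', "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_query_url("2509.02547", max_results=1))
print(build_query_url("tool use", max_results=300))
```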
Troubleshooting
Common Issues
Issue: `papers/papers_raw.jsonl` is empty
Symptom:
- Script exits with “No results returned …” or output file is empty.
Causes:
- Network is blocked (online mode).
- Queries are too narrow or `queries.md` is empty.
Solutions:
- Use offline import: place `papers/import.csv|json|jsonl` in the workspace or pass `--input`.
- Broaden keywords and reduce excludes in `queries.md`.
- Run with explicit `--query` to sanity-check the parser.
Issue: Offline import records are missing fields
Symptom:
- Downstream steps fail because records are missing `authors/year/abstract/url`.
Causes:
- Export columns don’t match expected fields; upstream export is incomplete.
Solutions:
- Ensure the export contains at least `title`, `authors`, `year`, `url`, `abstract`.
- If you later have network, use `--enrich-metadata` to backfill missing fields (best effort).
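One way to diagnose a column mismatch before importing is to map export headers onto the expected field names and list what is still missing. The alias table below is hypothetical (the column names the script actually accepts are not listed in this document), and `read_export` and `missing_fields` are illustrative helpers.

```python
import csv
import io

# Hypothetical header aliases, compared case-insensitively.
ALIASES = {
    "title": ("title",),
    "authors": ("authors", "author"),
    "year": ("year",),
    "url": ("url", "link"),
    "abstract": ("abstract", "summary"),
}


def read_export(csv_text):
    """Read CSV text and map its columns onto the expected field names."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lowered = {
            k.strip().lower(): (v or "").strip()
            for k, v in row.items()
            if k is not None  # skip overflow cells from malformed rows
        }
        rec = {}
        for field, names in ALIASES.items():
            rec[field] = next((lowered[n] for n in names if lowered.get(n)), "")
        rows.append(rec)
    return rows


def missing_fields(rec):
    """List the expected fields that are empty in a mapped record."""
    return [f for f in ALIASES if not rec[f]]


export = "Title,Author,Year,Link\nA paper,X Y,2024,http://example.org/1\n"
for rec in read_export(export):
    print(rec["title"], "is missing:", missing_fields(rec))
```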
Recovery Checklist
- Confirm `queries.md` has non-empty `keywords` (or pass `--query`).
- If offline: confirm workspace has `papers/import.*` and rerun.
- Spot-check 3–5 JSONL lines: valid JSON + required fields.