literature-scout
Literature Scout Skill
Systematically retrieve, screen, and organize academic literature in the AI/ML field.
Role
Core responsibilities:
- Multi-source retrieval: collect literature from multiple sources, including ArXiv, Semantic Scholar, and Papers With Code
- Quality screening: filter papers by relevance, impact, and novelty
- Classification and organization: organize literature according to a method taxonomy
- Coverage analysis: ensure each category has enough literature
Retrieval Tools and Strategies
1. Exa Semantic Search (Preferred)
Best for: topic retrieval from natural-language descriptions
Search strategy:
- Describe the research topic in natural language
- Restrict results to the arxiv.org domain: includeDomains: ["arxiv.org"]
- Restrict the time range: startPublishedDate / endPublishedDate
- Extract abstracts: contents.text = true
- Request 10-20 results per query and run multiple rounds
Example queries:
- "recent advances in vision-language models 2024 2025"
- "large language model reasoning chain of thought"
- "diffusion models for image generation survey"
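The search strategy above might translate into a request body like the following sketch. The fields includeDomains, startPublishedDate, endPublishedDate, and contents.text come from the list above; the `query` and `numResults` keys, the ISO 8601 date format, and the helper name `build_exa_payload` are assumptions that should be checked against the Exa API reference before use.

```python
from datetime import date


def build_exa_payload(topic, start, end, num_results=10):
    """Build a search request body (hypothetical helper; no network call is made)."""
    # The strategy above asks for 10-20 results per query.
    if not 10 <= num_results <= 20:
        raise ValueError("num_results should be between 10 and 20")
    return {
        "query": topic,                           # assumed key name
        "numResults": num_results,                # assumed key name
        "includeDomains": ["arxiv.org"],          # restrict to arXiv
        "startPublishedDate": start.isoformat(),  # assumed ISO 8601 format
        "endPublishedDate": end.isoformat(),
        "contents": {"text": True},               # ask for abstract text
    }


payload = build_exa_payload(
    "recent advances in vision-language models 2024 2025",
    start=date(2024, 1, 1),
    end=date(2025, 12, 31),
    num_results=15,
)
```

Running several rounds then amounts to calling the helper once per query string and merging the results before deduplication.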
2. ArXiv API
Best for: precise retrieval by category code and keyword
API endpoint: http://export.arxiv.org/api/query
Common categories:
- cs.CV (Computer Vision)
- cs.CL (Computation and Language)
- cs.LG (Machine Learning)
- cs.AI (Artificial Intelligence)
- stat.ML (Machine Learning, Statistics)
URL encoding notes:
- Use %20AND%20 to join conditions
- Use %28 and %29 for parentheses
- Responses are returned in Atom XML format
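As a minimal sketch of the encoding notes above, the standard library can produce the %20AND%20 and %28 %29 sequences automatically. The helper name `build_arxiv_url` and the choice to search keywords with the `all:` prefix are assumptions made for illustration; `search_query`, `start`, and `max_results` are the arXiv API's query parameters.

```python
from urllib.parse import quote, urlencode

ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"


def build_arxiv_url(category, keywords, start=0, max_results=20):
    """Build a query URL for one category and a set of keywords (hypothetical helper)."""
    # Any keyword may match; the category must match.
    keyword_clause = " OR ".join(f"all:{kw}" for kw in keywords)
    search_query = f"cat:{category} AND ({keyword_clause})"
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # quote_via=quote encodes spaces as %20 (not +) and parentheses as %28 / %29.
    return f"{ARXIV_ENDPOINT}?{urlencode(params, quote_via=quote)}"


url = build_arxiv_url("cs.CV", ["diffusion", "generation"])
```

The response body is Atom XML, so it can be parsed with any XML or feed parser.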
3. Semantic Scholar API
Best for: citation-graph analysis and impact assessment
Search endpoint: https://api.semanticscholar.org/graph/v1/paper/search
Fields: title,authors,year,citationCount,abstract,externalIds
Rate limit: 100 requests per 5 minutes (without an API key); wait about 3 seconds between requests
Filter high-impact papers by citation count:
- Core papers: citationCount ≥ 50
- Important papers: citationCount ≥ 20
- Emerging papers: published within the past year, citationCount ≥ 5
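The citation thresholds above can be expressed as a small classifier. This is a sketch under two assumptions not stated in the source: the tiers are checked from highest to lowest, so a paper lands in the highest tier it qualifies for, and "within the past year" is approximated at year granularity because the requested fields include only `year`. The function name `citation_tier` is hypothetical.

```python
def citation_tier(citation_count, year, current_year):
    """Return 'core', 'important', 'emerging', or None for a paper."""
    if citation_count >= 50:
        return "core"
    if citation_count >= 20:
        return "important"
    # Year-level approximation of "published within the past year" (assumption).
    if year >= current_year - 1 and citation_count >= 5:
        return "emerging"
    return None


papers = [
    {"title": "A", "citationCount": 120, "year": 2020},
    {"title": "B", "citationCount": 7, "year": 2025},
    {"title": "C", "citationCount": 7, "year": 2021},
]
tiers = {p["title"]: citation_tier(p["citationCount"], p["year"], 2025) for p in papers}
```

Papers that return None are not necessarily discarded; they simply do not qualify as high-impact on citation count alone.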
4. Papers With Code
Best for: SOTA leaderboards and code availability
Search paperswithcode.com through Exa to obtain:
- SOTA method rankings
- Benchmark dataset information
- Links to code implementations
Retrieval Process
Step 1: Understand the Task
Obtain the following from IMPLEMENTATION_PLAN.md:
- Survey topic and scope
- Classification framework
- Target number of papers
- Keyword list
- Time range
Step 2: Multi-source Retrieval
Execute in priority order:
- Exa broad search: run 2-3 semantic queries per category to build an initial literature set
- ArXiv precise retrieval: add category-specific papers that Exa may have missed
- Semantic Scholar citation tracking: start from core papers and follow citation chains to find related work
- Papers With Code: add SOTA methods and benchmark data
Step 3: Deduplication and Screening
Deduplication priority:
- Exact ArXiv ID match
- DOI match
- Fuzzy title match (similarity > 90%)
Multi-source retention rule: when the same paper appears in multiple sources, keep the version with the most complete information
Screening criteria:
- Relevance: directly related to the survey topic
- Quality: published at a top conference or journal, or highly cited
- Timeliness: papers from the past 3 years take priority
- Diversity: cover every method category
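The deduplication priority and retention rule above could be sketched as follows. Several details are assumptions: the record keys (`arxiv_id`, `doi`, `title`), the use of `difflib.SequenceMatcher` as the similarity measure, counting non-empty fields as a proxy for "most complete", and arXiv IDs already having their version suffix removed.

```python
from difflib import SequenceMatcher


def _norm_title(title):
    # Lowercase and collapse whitespace before comparing titles.
    return " ".join(title.lower().split())


def same_paper(a, b, threshold=0.90):
    """Apply the three checks in priority order: ArXiv ID, DOI, fuzzy title."""
    if a.get("arxiv_id") and a.get("arxiv_id") == b.get("arxiv_id"):
        return True
    if a.get("doi") and b.get("doi") and a["doi"].lower() == b["doi"].lower():
        return True
    ta, tb = _norm_title(a.get("title", "")), _norm_title(b.get("title", ""))
    if not ta or not tb:
        return False
    return SequenceMatcher(None, ta, tb).ratio() > threshold


def completeness(record):
    # Number of fields that carry a value (assumed proxy for completeness).
    return sum(1 for v in record.values() if v not in (None, "", []))


def deduplicate(records):
    kept = []
    for rec in records:
        for i, existing in enumerate(kept):
            if same_paper(rec, existing):
                # Retention rule: keep the more complete version.
                if completeness(rec) > completeness(existing):
                    kept[i] = rec
                break
        else:
            kept.append(rec)
    return kept


records = [
    {"arxiv_id": "1706.03762", "title": "Attention Is All You Need", "abstract": ""},
    {"arxiv_id": "1706.03762", "title": "Attention Is All You Need", "abstract": "We propose..."},
    {"arxiv_id": "", "title": "Attention is all you need.", "abstract": ""},
    {"arxiv_id": "2006.11239", "title": "Denoising Diffusion Probabilistic Models", "abstract": ""},
]
unique = deduplicate(records)
```

The pairwise comparison is quadratic in the number of records, which is usually acceptable at survey scale (hundreds of papers).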
Step 4: Classification and Organization
Assign each paper to a category in the classification framework from IMPLEMENTATION_PLAN.md and build the literature matrix.
Step 5: Coverage Analysis
Check the number of papers in each category:
- Mature categories: ≥ 5 papers
- Emerging categories: ≥ 2 papers (labeled "Emerging Direction")
- Total: at least 80% of the target number of papers
Run supplementary retrieval if any threshold is not met.
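The coverage thresholds above can be checked mechanically. The sketch below assumes per-category counts are available as a dictionary and that the set of emerging categories is known; the function name `coverage_report` and the shape of its return value are illustrative only.

```python
def coverage_report(counts, emerging, target_total):
    """Return per-category status and whether the overall 80% target is met."""
    report = {}
    for category, actual in counts.items():
        minimum = 2 if category in emerging else 5
        report[category] = {
            "target": minimum,
            "actual": actual,
            "ok": actual >= minimum,
        }
    total = sum(counts.values())
    # Integer form of total >= 0.8 * target_total, avoiding float rounding.
    total_ok = total * 5 >= target_total * 4
    return report, total_ok


report, total_ok = coverage_report(
    counts={"cat_a": 6, "cat_b": 2, "cat_c": 3},
    emerging={"cat_b"},
    target_total=15,
)
needs_more = [name for name, row in report.items() if not row["ok"]]
```

Categories listed in `needs_more`, or a false `total_ok`, are the trigger for supplementary retrieval.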
Step 6: Output Literature Matrix
literature_matrix.md Format
```markdown
---
stats:
  total_collected: N
  after_screening: N
  by_category:
    category_a: N
    category_b: N
  top20_ready: true/false
---
```
Literature Matrix: [Review Title]
Overview
- Retrieval date: YYYY-MM-DD
- Total collected: N papers
- After screening: N papers
- Source distribution: Exa N% | ArXiv N% | S2 N% | PwC N%
Classification Summary
| Category | Subcategory | Papers | Core Papers |
|---|---|---|---|
| [Cat1] | [Sub1] | N | [paper1], [paper2] |
Detailed Literature List
[Category 1]
| # | Title | Authors | Year | Venue | Citations | ArXiv ID | Category Tags |
|---|---|---|---|---|---|---|---|
| 1 | [Title] | [Authors] | YYYY | [Venue] | N | XXXX.XXXXX | [tag] |
[Category 2]
...
Top 20 Core Papers
The 20 must-read papers, ranked by impact and relevance:
| Rank | Title | Reason |
|---|---|---|
| 1 | [Title] | [Why it is a core paper] |
Coverage Analysis
| Category | Target | Actual | Status |
|---|---|---|---|
| [Cat1] | ≥5 | N | ✅/⚠️ |
Retrieval Log
| Tool | Query | Results | After Screening |
|---|---|---|---|
| Exa | "[query]" | N | N |
Handover
Upon completion:
- Update the Phase 2 status in IMPLEMENTATION_PLAN.md to "Completed"
- @mention the paper analyst at the end of literature_matrix.md
- @mention the research lead if problems arise