arxiv

arXiv Paper Search & Download

Search topic or arXiv paper ID: $ARGUMENTS

Constants

  • PAPER_DIR - Local directory to save downloaded PDFs. Default: `papers/` in the current project directory.
  • MAX_RESULTS = 10 - Default number of search results.
  • FETCH_SCRIPT - `tools/arxiv_fetch.py` relative to the ARIS install, or the same path relative to the current project. Fall back to inline Python if not found.
Overrides (append to arguments):
  • `/arxiv "attention mechanism" - max: 20` - return up to 20 results
  • `/arxiv "2301.07041" - download` - download a specific paper by ID
  • `/arxiv "query" - dir: literature/` - save PDFs to a custom directory
  • `/arxiv "query" - download: all` - download all result PDFs

Workflow

Step 1: Parse Arguments

Parse `$ARGUMENTS` for directives:
  • Query or ID: main search term or a bare arXiv ID such as `2301.07041` or `cs/0601001`
  • `- max: N`: override MAX_RESULTS (e.g., `- max: 20`)
  • `- dir: PATH`: override PAPER_DIR (e.g., `- dir: literature/`)
  • `- download`: download the first result's PDF after listing
  • `- download: all`: download PDFs for all results
If the argument matches an arXiv ID pattern (new-style `YYMM.NNNNN`, or `YYMM.NNNN` for papers before 2015, or old-style `category/NNNNNNN`), skip the search and go directly to Step 3.
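The directive parsing described in this step can be sketched in Python as follows. This is a minimal illustration, not the skill's actual parser: the names `ParsedArgs`, `parse_arguments`, and `ARXIV_ID_RE` are hypothetical, and the sketch assumes directives are separated from the query by ` - ` (so a query that itself contains ` - ` would be split incorrectly).

```python
import re
from dataclasses import dataclass

# Assumed ID grammar: new-style (0704.0001, 2301.07041) or old-style
# (cs/0601001, math.GT/0309136), each with an optional version suffix (v2).
ARXIV_ID_RE = re.compile(
    r"^(?:\d{4}\.\d{4,5}|[a-z][a-z\-]*(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?$"
)


@dataclass
class ParsedArgs:
    query: str
    max_results: int = 10       # MAX_RESULTS default
    paper_dir: str = "papers/"  # PAPER_DIR default
    download: str = "none"      # "none", "first", or "all"
    is_id: bool = False         # True when the query is a bare arXiv ID


def parse_arguments(raw: str) -> ParsedArgs:
    """Split the raw argument string into a query plus directives."""
    head, *directives = re.split(r"\s+-\s+", raw.strip())
    args = ParsedArgs(query=head.strip().strip('"'))
    for directive in directives:
        key, _, value = directive.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "max" and value.isdigit():
            args.max_results = int(value)
        elif key == "dir" and value:
            args.paper_dir = value
        elif key == "download":
            args.download = "all" if value.lower() == "all" else "first"
    # A bare ID means the search can be skipped (go straight to Step 3).
    args.is_id = bool(ARXIV_ID_RE.match(args.query))
    return args
```

For example, `parse_arguments('"query" - dir: literature/ - download: all')` yields a `ParsedArgs` with `paper_dir` set to `literature/` and `download` set to `all`, while unknown directives are silently ignored.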

Step 2: Search arXiv

Locate the fetch script:
bash
SCRIPT=$(python3 -c "
import pathlib
candidates = [
    pathlib.Path('tools/arxiv_fetch.py'),
    pathlib.Path.home() / '.claude' / 'skills' / 'arxiv' / 'arxiv_fetch.py',
]
for p in candidates:
    if p.exists():
        print(p)
        break
" 2>/dev/null)
If SCRIPT is found, run:
bash
python3 "$SCRIPT" search "QUERY" --max MAX_RESULTS
If SCRIPT is not found, fall back to inline Python:
bash
python3 - <<'PYEOF'
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2005/Atom"
query = urllib.parse.quote("QUERY")
url = (f"http://export.arxiv.org/api/query"
       f"?search_query={query}&start=0&max_results=MAX_RESULTS"
       f"&sortBy=relevance&sortOrder=descending")
with urllib.request.urlopen(url, timeout=30) as r:
    root = ET.fromstring(r.read())
papers = []
for entry in root.findall(f"{{{NS}}}entry"):
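    # API IDs always end in a version suffix (vN); cut at the last "v" so
    # old-style categories that contain a "v" (e.g. solv-int) are not truncated
    aid = entry.findtext(f"{{{NS}}}id", "").split("/abs/")[-1].rsplit("v", 1)[0]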
    title = (entry.findtext(f"{{{NS}}}title", "") or "").strip().replace("\n", " ")
    abstract = (entry.findtext(f"{{{NS}}}summary", "") or "").strip().replace("\n", " ")
    authors = [a.findtext(f"{{{NS}}}name", "") for a in entry.findall(f"{{{NS}}}author")]
    published = entry.findtext(f"{{{NS}}}published", "")[:10]
    cats = [c.get("term", "") for c in entry.findall(f"{{{NS}}}category")]
    papers.append({
        "id": aid,
        "title": title,
        "authors": authors,
        "abstract": abstract,
        "published": published,
        "categories": cats,
        "pdf_url": f"https://arxiv.org/pdf/{aid}.pdf",
        "abs_url": f"https://arxiv.org/abs/{aid}",
    })
print(json.dumps(papers, ensure_ascii=False, indent=2))
PYEOF
Present results as a table:
text
| # | arXiv ID   | Title               | Authors        | Date       | Category |
|---|------------|---------------------|----------------|------------|----------|
| 1 | 1706.03762 | Attention Is All... | Vaswani et al. | 2017-06-12 | cs.CL    |
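
One way to produce such a table from the list of paper dictionaries printed by the fallback script is sketched below. The helper names (`first_author_label`, `truncate`, `render_table`) are hypothetical, and the sketch assumes the first entry in `categories` is the primary category and that the last word of an author name is the surname.

```python
def first_author_label(authors):
    """Return 'Surname' or 'Surname et al.' from a list of full names."""
    if not authors:
        return "Unknown"
    surname = authors[0].split()[-1]
    return surname if len(authors) == 1 else f"{surname} et al."


def truncate(text, width=40):
    """Shorten long titles so the table stays readable."""
    return text if len(text) <= width else text[: width - 3] + "..."


def render_table(papers):
    """Render paper dicts (keys as in the fallback script) as a Markdown table."""
    lines = [
        "| # | arXiv ID | Title | Authors | Date | Category |",
        "|---|----------|-------|---------|------|----------|",
    ]
    for i, p in enumerate(papers, start=1):
        category = p["categories"][0] if p["categories"] else ""
        # Escape pipes so a title cannot break the table layout.
        title = truncate(p["title"]).replace("|", "\\|")
        lines.append(
            f"| {i} | {p['id']} | {title} | {first_author_label(p['authors'])} "
            f"| {p['published']} | {category} |"
        )
    return "\n".join(lines)
```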

Step 3: Fetch Details for a Specific ID

When a single paper ID is requested (either directly or from Step 2):
bash
python3 "$SCRIPT" search "id:ARXIV_ID" --max 1

# or fallback:
python3 -c "
import urllib.request, xml.etree.ElementTree as ET
NS = 'http://www.w3.org/2005/Atom'
url = 'http://export.arxiv.org/api/query?id_list=ARXIV_ID'
with urllib.request.urlopen(url, timeout=30) as r:
    root = ET.fromstring(r.read())
# print full details ...
"

Display: title, all authors, categories, full abstract, published date, PDF URL, abstract URL.
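
The "print full details" step is elided in the fallback above. A minimal sketch of what it could look like follows, parsing one Atom entry into the fields listed for display. The function names and `SAMPLE_FEED` are hypothetical; the sample XML is hand-written with placeholder values so the example runs offline, and is not real arXiv metadata.

```python
import re
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written stand-in for an API response (placeholder values only).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2301.07041v2</id>
    <published>2023-01-17T00:00:00Z</published>
    <title>Example Paper
      Title</title>
    <summary>  Placeholder abstract.
    </summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <category term="cs.LG"/>
    <category term="cs.AI"/>
  </entry>
</feed>"""


def squash(text):
    """Collapse newlines and runs of spaces into single spaces."""
    return " ".join(text.split())


def parse_entry(feed_xml):
    """Return a dict of display fields for the first entry, or None."""
    root = ET.fromstring(feed_xml)
    entry = root.find("atom:entry", NS)
    if entry is None:
        return None
    raw_id = entry.findtext("atom:id", "", NS).split("/abs/")[-1]
    aid = re.sub(r"v\d+$", "", raw_id)  # drop the version suffix
    return {
        "id": aid,
        "title": squash(entry.findtext("atom:title", "", NS)),
        "authors": [a.findtext("atom:name", "", NS)
                    for a in entry.findall("atom:author", NS)],
        "categories": [c.get("term", "")
                       for c in entry.findall("atom:category", NS)],
        "abstract": squash(entry.findtext("atom:summary", "", NS)),
        "published": entry.findtext("atom:published", "", NS)[:10],
        "pdf_url": f"https://arxiv.org/pdf/{aid}.pdf",
        "abs_url": f"https://arxiv.org/abs/{aid}",
    }


def format_details(d):
    """Format the parsed fields as plain text, one field per line."""
    return "\n".join([
        f"Title: {d['title']}",
        f"Authors: {', '.join(d['authors'])}",
        f"Categories: {', '.join(d['categories'])}",
        f"Published: {d['published']}",
        f"Abstract: {d['abstract']}",
        f"PDF: {d['pdf_url']}",
        f"Abstract page: {d['abs_url']}",
    ])
```

In real use, `feed_xml` would be the bytes returned by `urlopen(...).read()` in the fallback command rather than `SAMPLE_FEED`.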

Step 4: Download PDFs

When download is requested, run the following for each paper ID:
bash
# Using the fetch script:
python3 "$SCRIPT" download ARXIV_ID --dir PAPER_DIR

# Fallback:
mkdir -p PAPER_DIR && python3 -c "
import pathlib
import sys
import urllib.request

out = pathlib.Path('PAPER_DIR/ARXIV_ID.pdf')
if out.exists():
    # Never overwrite an existing PDF
    print(f'Already exists: {out}')
    sys.exit(0)
req = urllib.request.Request(
    'https://arxiv.org/pdf/ARXIV_ID.pdf',
    headers={'User-Agent': 'arxiv-skill/1.0'},
)
with urllib.request.urlopen(req, timeout=60) as r:
    out.write_bytes(r.read())
print(f'Downloaded: {out} ({out.stat().st_size // 1024} KB)')
"

After each download:

- Confirm file size > 10 KB (reject smaller files - likely an error HTML page)
- Add a 1-second delay between consecutive downloads to avoid rate limiting
- Report: `Downloaded: papers/2301.07041.pdf (842 KB)`
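
The post-download checks above are not part of the inline fallback itself. A hedged sketch of how they might be combined is shown below; `save_pdf`, `fetch_pdf`, and `download_all` are hypothetical names, and the `fetch` and `sleep` parameters are injectable only so the logic can be exercised without network access. Unlike the fallback, the sketch validates the size before writing, so a too-small response never reaches disk.

```python
import pathlib
import time
import urllib.request

MIN_PDF_BYTES = 10 * 1024  # smaller responses are likely HTML error pages


def save_pdf(data: bytes, out: pathlib.Path) -> str:
    """Write data to out unless it exists or is too small; return a report line."""
    if out.exists():
        return f"Already exists: {out}"
    if len(data) <= MIN_PDF_BYTES:
        return f"Rejected: {out.name} is only {len(data)} bytes (likely an error page)"
    out.parent.mkdir(parents=True, exist_ok=True)  # also covers old-style IDs like cs/0601001
    out.write_bytes(data)
    return f"Downloaded: {out} ({len(data) // 1024} KB)"


def fetch_pdf(arxiv_id: str, timeout: int = 60) -> bytes:
    """Fetch the PDF bytes for one ID (same URL and header as the fallback)."""
    req = urllib.request.Request(
        f"https://arxiv.org/pdf/{arxiv_id}.pdf",
        headers={"User-Agent": "arxiv-skill/1.0"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return r.read()


def download_all(ids, paper_dir, fetch=fetch_pdf, delay=1.0, sleep=time.sleep):
    """Download each ID in turn, pausing between consecutive requests."""
    reports = []
    requested = False
    for arxiv_id in ids:
        out = pathlib.Path(paper_dir) / f"{arxiv_id}.pdf"
        if out.exists():
            reports.append(f"Already exists: {out}")
            continue  # no request made, so no delay is needed
        if requested:
            sleep(delay)
        requested = True
        reports.append(save_pdf(fetch(arxiv_id), out))
    return reports
```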

Step 5: Summarize

For each paper (downloaded or fetched by API):
markdown
[Title]

  • arXiv: [ID] - [abs_url]
  • Authors: [full author list]
  • Date: [published]
  • Categories: [cs.LG, cs.AI, ...]
  • Abstract: [full abstract]
  • Key contributions (extracted from abstract):
    • [contribution 1]
    • [contribution 2]
    • [contribution 3]
  • Local PDF: papers/[ID].pdf (if downloaded)
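
The template above could be filled from a paper dictionary roughly as follows. This is only a sketch: `render_summary` is a hypothetical helper, the dictionary keys are assumed to match those produced by the Step 2 fallback, and the key contributions are passed in by the caller because extracting them from the abstract is not something this code attempts.

```python
def render_summary(paper, contributions=(), local_pdf=None):
    """Fill the per-paper summary template; optional parts are omitted when empty."""
    lines = [
        paper["title"],
        "",
        f"  • arXiv: {paper['id']} - {paper['abs_url']}",
        f"  • Authors: {', '.join(paper['authors'])}",
        f"  • Date: {paper['published']}",
        f"  • Categories: {', '.join(paper['categories'])}",
        f"  • Abstract: {paper['abstract']}",
    ]
    if contributions:
        lines.append("  • Key contributions (extracted from abstract):")
        lines.extend(f"    • {c}" for c in contributions)
    if local_pdf:
        lines.append(f"  • Local PDF: {local_pdf}")
    return "\n".join(lines)
```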

Step 6: Final Output

Summarize what was done:
  • Found N papers for "query"
  • Downloaded: papers/2301.07041.pdf (842 KB)
    (for each download)
  • Any warnings (rate limit hit, file too small, already exists)
Suggest follow-up skills:
text
/research-lit "topic"     - multi-source review: Zotero + Obsidian + local PDFs + web
/novelty-check "idea"     - verify your idea is novel against these papers

Key Rules

  • Always show the arXiv ID prominently - users need it for citations and reproducibility
  • Verify downloaded PDFs: file must be > 10 KB; warn and delete if smaller
  • Rate limit: wait 1 second between consecutive PDF downloads; retry once after 5 seconds on HTTP 429
  • Never overwrite an existing PDF at the same path - skip it and report "already exists"
  • Handle both arXiv ID formats: new (`2301.07041`) and old (`cs/0601001`)
  • PAPER_DIR is created automatically if it does not exist
  • If the arXiv API is unreachable, report the error clearly and suggest using `/research-lit` with `- sources: web` as a fallback
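
The retry rule for HTTP 429 can be sketched as a small wrapper around any fetch function. The name `with_429_retry` is hypothetical, and the `sleep` parameter is injectable only so the behavior can be checked without actually waiting; other HTTP errors are re-raised unchanged.

```python
import time
import urllib.error


def with_429_retry(fetch, arxiv_id, wait=5.0, sleep=time.sleep):
    """Call fetch(arxiv_id); on HTTP 429, wait once and try a second time."""
    try:
        return fetch(arxiv_id)
    except urllib.error.HTTPError as err:
        if err.code != 429:
            raise  # only rate limiting is retried
        sleep(wait)
        # A second failure propagates to the caller, who reports the warning.
        return fetch(arxiv_id)
```

Retrying exactly once keeps the behavior predictable: a persistent rate limit surfaces as an error after about five seconds instead of looping indefinitely.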