arxiv
arXiv Paper Search & Download
Search topic or arXiv paper ID: $ARGUMENTS
Constants
- PAPER_DIR - Local directory to save downloaded PDFs. Default: `papers/` in the current project directory.
- MAX_RESULTS = 10 - Default number of search results.
- FETCH_SCRIPT - `tools/arxiv_fetch.py` relative to the ARIS install, or the same path relative to the current project. Fall back to inline Python if not found.
Overrides (append to arguments):
- `/arxiv "attention mechanism" - max: 20` - return up to 20 results
- `/arxiv "2301.07041" - download` - download a specific paper by ID
- `/arxiv "query" - dir: literature/` - save PDFs to a custom directory
- `/arxiv "query" - download: all` - download all result PDFs
Workflow
Step 1: Parse Arguments
Parse `$ARGUMENTS` for directives:
- Query or ID: main search term or a bare arXiv ID such as `2301.07041` or `cs/0601001`
- `- max: N`: override MAX_RESULTS (e.g., `- max: 20`)
- `- dir: PATH`: override PAPER_DIR (e.g., `- dir: literature/`)
- `- download`: download the first result's PDF after listing
- `- download: all`: download PDFs for all results
If the argument matches an arXiv ID pattern (`YYMM.NNNNN` or `category/NNNNNNN`), skip the search and go directly to Step 3.
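The parsing rules above can be sketched in Python. This is a minimal sketch under stated assumptions, not the skill's actual parser: the function name `parse_arguments`, the keys of the returned dictionary, and the regular expressions are all hypothetical. Only the directive names and the defaults come from the text.

```python
import re

# Assumed patterns for the two arXiv ID formats named above.
NEW_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")               # e.g. 2301.07041
OLD_ID = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")   # e.g. cs/0601001

# A directive is " - name", optionally followed by ": value".
DIRECTIVE = re.compile(r"\s+-\s+(max|dir|download)\b(?::\s*(\S+))?")


def parse_arguments(arguments: str) -> dict:
    """Split a raw argument string into the query and its overrides (hypothetical helper)."""
    opts = {"max_results": 10, "paper_dir": "papers/", "download": None}
    for name, value in DIRECTIVE.findall(arguments):
        if name == "max" and value:
            opts["max_results"] = int(value)
        elif name == "dir" and value:
            opts["paper_dir"] = value
        elif name == "download":
            # "- download: all" fetches every result; bare "- download" only the first.
            opts["download"] = "all" if value == "all" else "first"
    # Everything before the first directive is the query or the bare ID.
    first = DIRECTIVE.search(arguments)
    head = arguments[: first.start()] if first else arguments
    query = head.strip().strip('"')
    opts["query"] = query
    opts["is_id"] = bool(NEW_ID.match(query) or OLD_ID.match(query))
    return opts
```

When `is_id` is true, the caller would skip the search and go to Step 3; otherwise `query` and `max_results` feed the search in Step 2.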
Step 2: Search arXiv
Locate the fetch script:
```bash
SCRIPT=$(python3 -c "
import pathlib
candidates = [
    pathlib.Path('tools/arxiv_fetch.py'),
    pathlib.Path.home() / '.claude' / 'skills' / 'arxiv' / 'arxiv_fetch.py',
]
for p in candidates:
    if p.exists():
        print(p)
        break
" 2>/dev/null)
```
If SCRIPT is found, run:
```bash
python3 "$SCRIPT" search "QUERY" --max MAX_RESULTS
```
If SCRIPT is not found, fall back to inline Python:
```bash
python3 - <<'PYEOF'
import json
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2005/Atom"
query = urllib.parse.quote("QUERY")
url = (f"http://export.arxiv.org/api/query"
       f"?search_query={query}&start=0&max_results=MAX_RESULTS"
       f"&sortBy=relevance&sortOrder=descending")
with urllib.request.urlopen(url, timeout=30) as r:
    root = ET.fromstring(r.read())
papers = []
for entry in root.findall(f"{{{NS}}}entry"):
    # Strip only a trailing version suffix (v1, v2, ...). A plain split("v")
    # would truncate old-style IDs whose archive name contains a "v",
    # such as those in the solv-int archive.
    raw_id = entry.findtext(f"{{{NS}}}id", "").split("/abs/")[-1]
    aid = re.sub(r"v\d+$", "", raw_id)
    title = (entry.findtext(f"{{{NS}}}title", "") or "").strip().replace("\n", " ")
    abstract = (entry.findtext(f"{{{NS}}}summary", "") or "").strip().replace("\n", " ")
    authors = [a.findtext(f"{{{NS}}}name", "") for a in entry.findall(f"{{{NS}}}author")]
    published = entry.findtext(f"{{{NS}}}published", "")[:10]
    cats = [c.get("term", "") for c in entry.findall(f"{{{NS}}}category")]
    papers.append({
        "id": aid,
        "title": title,
        "authors": authors,
        "abstract": abstract,
        "published": published,
        "categories": cats,
        "pdf_url": f"https://arxiv.org/pdf/{aid}.pdf",
        "abs_url": f"https://arxiv.org/abs/{aid}",
    })
print(json.dumps(papers, ensure_ascii=False, indent=2))
PYEOF
```
Present results as a table:
```text
| # | arXiv ID   | Title               | Authors        | Date       | Category |
|---|------------|---------------------|----------------|------------|----------|
| 1 | 1706.03762 | Attention Is All... | Vaswani et al. | 2017-06-12 | cs.CL    |
```
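One way to turn the JSON list from the search into that table is sketched below. The helper names `shorten` and `format_results_table`, the 40-character title limit, the "first author et al." convention, and the choice of the first listed category are assumptions for illustration. The dictionary keys match the inline fallback script.

```python
def shorten(text: str, limit: int) -> str:
    """Truncate text to at most `limit` characters, ending in '...' when cut."""
    if len(text) <= limit:
        return text
    return text[: limit - 3].rstrip() + "..."


def format_results_table(papers: list[dict], title_limit: int = 40) -> str:
    """Render search results as a pipe table (hypothetical helper)."""
    lines = [
        "| # | arXiv ID | Title | Authors | Date | Category |",
        "|---|---|---|---|---|---|",
    ]
    for number, paper in enumerate(papers, start=1):
        authors = paper.get("authors", [])
        if not authors:
            who = ""
        elif len(authors) == 1:
            who = authors[0]
        else:
            who = f"{authors[0]} et al."
        categories = paper.get("categories", [])
        cells = [
            str(number),
            paper["id"],
            shorten(paper["title"], title_limit),
            who,
            paper.get("published", ""),
            # Assumption: the first listed category is shown.
            categories[0] if categories else "",
        ]
        # Escape pipes so a title containing "|" cannot break the row.
        lines.append("| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |")
    return "\n".join(lines)
```

The sketch does not pad columns; markdown renderers align pipe tables without padding.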
Step 3: Fetch Details for a Specific ID
When a single paper ID is requested (either directly or from Step 2):
```bash
python3 "$SCRIPT" search "id:ARXIV_ID" --max 1
```
or fallback:
```bash
python3 -c "
import urllib.request, xml.etree.ElementTree as ET
NS = 'http://www.w3.org/2005/Atom'
url = 'http://export.arxiv.org/api/query?id_list=ARXIV_ID'
with urllib.request.urlopen(url, timeout=30) as r:
    root = ET.fromstring(r.read())
# print full details ...
"
```
Display: title, all authors, categories, full abstract, published date, PDF URL, abstract URL.
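The fallback above stops at parsing the response. As a sketch of how the listed fields could be pulled out of the Atom XML, the hypothetical helper below takes the response body and returns a dictionary. The function name `entry_details` and the dictionary keys are assumptions chosen to mirror the Step 2 fallback, not part of the fetch script.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2005/Atom"


def entry_details(atom_xml) -> dict:
    """Extract display fields from the first <entry> of an Atom response (hypothetical helper)."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{{{NS}}}entry")
    if entry is None:
        raise ValueError("no entry in response; check the arXiv ID")

    def text(tag: str) -> str:
        # Collapse the line breaks and runs of spaces found inside long fields.
        return " ".join((entry.findtext(f"{{{NS}}}{tag}") or "").split())

    # Drop a trailing version suffix such as "v2" but keep the rest of the ID.
    aid = re.sub(r"v\d+$", "", text("id").split("/abs/")[-1])
    return {
        "id": aid,
        "title": text("title"),
        "authors": [
            " ".join((a.findtext(f"{{{NS}}}name") or "").split())
            for a in entry.findall(f"{{{NS}}}author")
        ],
        "categories": [c.get("term", "") for c in entry.findall(f"{{{NS}}}category")],
        "abstract": text("summary"),
        "published": text("published")[:10],
        "pdf_url": f"https://arxiv.org/pdf/{aid}.pdf",
        "abs_url": f"https://arxiv.org/abs/{aid}",
    }
```

In the fallback, `entry_details(r.read())` would replace the `ET.fromstring` call, and the returned fields can then be printed in the order listed above.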
Step 4: Download PDFs
When download is requested, for each paper ID to download:
Using fetch script:
```bash
python3 "$SCRIPT" download ARXIV_ID --dir PAPER_DIR
```
Fallback:
```bash
mkdir -p PAPER_DIR && python3 -c "
import pathlib
import sys
import urllib.request
out = pathlib.Path('PAPER_DIR/ARXIV_ID.pdf')
if out.exists():
    print(f'Already exists: {out}')
    sys.exit(0)
# Old-style IDs such as cs/0601001 contain a slash, so create the subdirectory first.
out.parent.mkdir(parents=True, exist_ok=True)
req = urllib.request.Request(
    'https://arxiv.org/pdf/ARXIV_ID.pdf',
    headers={'User-Agent': 'arxiv-skill/1.0'},
)
with urllib.request.urlopen(req, timeout=60) as r:
    out.write_bytes(r.read())
print(f'Downloaded: {out} ({out.stat().st_size // 1024} KB)')
"
```
After each download:
- Confirm file size > 10 KB (reject smaller files - likely an error HTML page)
- Add a 1-second delay between consecutive downloads to avoid rate limiting
- Report: `Downloaded: papers/2301.07041.pdf (842 KB)`
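The post-download checks can be folded into a single loop. The sketch below is one possible arrangement, not the fetch script's implementation: `fetch_pdf`, `download_all`, the `fetch` and `sleep` parameters (injected so the loop can be exercised without network access), and the `Rejected:` and `Failed:` report lines are assumptions. The 10 KB threshold, the 1-second delay, the skip-if-exists behaviour, and the single retry after 5 seconds on HTTP 429 come from this document. The sketch checks the size before writing, so an undersized file never reaches disk.

```python
import pathlib
import time
import urllib.error
import urllib.request

MIN_BYTES = 10 * 1024  # responses at or below this size are treated as error pages


def fetch_pdf(arxiv_id: str) -> bytes:
    """Download the PDF bytes for one paper (network call)."""
    req = urllib.request.Request(
        f"https://arxiv.org/pdf/{arxiv_id}.pdf",
        headers={"User-Agent": "arxiv-skill/1.0"},
    )
    with urllib.request.urlopen(req, timeout=60) as r:
        return r.read()


def download_all(ids, paper_dir="papers", fetch=fetch_pdf, sleep=time.sleep):
    """Download each ID in turn and return one report line per paper."""
    report = []
    fetched_before = False
    for arxiv_id in ids:
        out = pathlib.Path(paper_dir) / f"{arxiv_id}.pdf"
        if out.exists():
            report.append(f"Already exists: {out}")
            continue
        if fetched_before:
            sleep(1)  # 1-second gap between consecutive downloads
        fetched_before = True
        data = None
        for attempt in range(2):
            try:
                data = fetch(arxiv_id)
                break
            except urllib.error.HTTPError as err:
                if err.code == 429 and attempt == 0:
                    sleep(5)  # rate limited: wait, then retry once
                    continue
                report.append(f"Failed: {arxiv_id} (HTTP {err.code})")
                break
        if data is None:
            continue
        if len(data) <= MIN_BYTES:
            report.append(f"Rejected: {arxiv_id} ({len(data)} bytes, likely an error page)")
            continue
        # Old-style IDs contain a slash, so the parent directory may not exist yet.
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_bytes(data)
        report.append(f"Downloaded: {out} ({len(data) // 1024} KB)")
    return report
```

Calling `download_all(["2301.07041"], "papers")` would perform a real download; passing a stub for `fetch` keeps the loop testable offline.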
Step 5: Summarize
For each paper (downloaded or fetched by API):
```markdown
[Title]
- arXiv: [ID] - [abs_url]
- Authors: [full author list]
- Date: [published]
- Categories: [cs.LG, cs.AI, ...]
- Abstract: [full abstract]
- Key contributions (extracted from abstract):
  - [contribution 1]
  - [contribution 2]
  - [contribution 3]
- Local PDF: papers/[ID].pdf (if downloaded)
```
Step 6: Final Output
Summarize what was done:
- `Found N papers for "query"`
- `Downloaded: papers/2301.07041.pdf (842 KB)` (for each download)
- Any warnings (rate limit hit, file too small, already exists)
Suggest follow-up skills:
```text
/research-lit "topic" - multi-source review: Zotero + Obsidian + local PDFs + web
/novelty-check "idea" - verify your idea is novel against these papers
```
Key Rules
- Always show the arXiv ID prominently - users need it for citations and reproducibility
- Verify downloaded PDFs: file must be > 10 KB; warn and delete if smaller
- Rate limit: wait 1 second between consecutive PDF downloads; retry once after 5 seconds on HTTP 429
- Never overwrite an existing PDF at the same path - skip it and report "already exists"
- Handle both arXiv ID formats: new (`2301.07041`) and old (`cs/0601001`)
- PAPER_DIR is created automatically if it does not exist
- If the arXiv API is unreachable, report the error clearly and suggest using `/research-lit` with `- sources: web` as a fallback