paper2code-arxiv-implementation

paper2code — Arxiv Paper to Working Implementation

Skill by ara.so — Daily 2026 Skills collection.
paper2code is a Claude Code agent skill that converts any arxiv paper URL into a citation-anchored Python implementation. Every code decision references the exact paper section and equation it implements, and all gaps/ambiguities are explicitly flagged rather than silently filled in.

Install

bash
npx skills add PrathamLearnsToCode/paper2code/skills/paper2code
During install you'll choose:
  • Agents: which coding agents get the skill (e.g., Claude Code)
  • Scope: Global (recommended) or project-level
  • Method: Symlink (recommended) or copy
Then launch your agent:
bash
claude

Core Commands

Basic usage

/paper2code https://arxiv.org/abs/1706.03762

With framework override

/paper2code https://arxiv.org/abs/2006.11239 --framework jax
/paper2code https://arxiv.org/abs/2006.11239 --framework pytorch   # default
/paper2code https://arxiv.org/abs/2006.11239 --framework tensorflow

With mode flag

/paper2code 1706.03762 --mode minimal       # architecture only (default)
/paper2code 1706.03762 --mode full          # includes training loop + data pipeline
/paper2code 1706.03762 --mode educational   # extra comments + pedagogical notebook

Bare arxiv ID (no URL required)

/paper2code 1706.03762
/paper2code 2106.09685
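
Both the full URL and the bare ID refer to the same paper. As a rough sketch of how the two forms could be reduced to one canonical identifier, the snippet below uses a hypothetical helper, `normalize_arxiv_ref`, and a regular expression chosen for illustration; neither is part of paper2code, and only new-style IDs such as 1706.03762 are handled.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optionally followed by a version.
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def normalize_arxiv_ref(ref: str) -> str:
    """Return the bare arXiv ID for a URL or ID string (hypothetical helper)."""
    ref = ref.strip()
    # Strip an arxiv.org /abs/ or /pdf/ URL prefix if one is present.
    ref = re.sub(r"^https?://(www\.)?arxiv\.org/(abs|pdf)/", "", ref)
    # Drop a trailing ".pdf" left over from PDF links.
    ref = re.sub(r"\.pdf$", "", ref)
    match = _ARXIV_ID.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a recognised arXiv reference: {ref!r}")
    return match.group(1)  # the version suffix, if any, is discarded
```

Running a reference through a check like this before invoking the skill is one way to catch typos early; how paper2code itself parses its argument is not documented here.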

Output Structure

Every run produces a directory named after the paper slug:
attention_is_all_you_need/
├── README.md                  # Paper summary + quick-start
├── REPRODUCTION_NOTES.md      # Ambiguity audit, unspecified choices, known deviations
├── requirements.txt           # Pinned dependencies
├── src/
│   ├── model.py               # Architecture — every layer cited to paper section
│   ├── loss.py                # Loss functions with equation references
│   ├── data.py                # Dataset skeleton with preprocessing TODOs
│   ├── train.py               # Training loop (full/educational mode)
│   ├── evaluate.py            # Metric computation
│   └── utils.py               # Shared utilities
├── configs/
│   └── base.yaml              # All hyperparams — each cited or flagged [UNSPECIFIED]
└── notebooks/
    └── walkthrough.ipynb      # Paper section → code → shape checks
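
A quick way to confirm that a run produced the layout above is to check for each file. The sketch below is illustrative only: `missing_files` and `EXPECTED_FILES` are hypothetical names, `src/train.py` is left out because the listing marks it as full/educational mode, and treating every other file as present in all modes is an assumption.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory listing; src/train.py is omitted
# because it is described as specific to the full and educational modes.
EXPECTED_FILES = [
    "README.md",
    "REPRODUCTION_NOTES.md",
    "requirements.txt",
    "src/model.py",
    "src/loss.py",
    "src/data.py",
    "src/evaluate.py",
    "src/utils.py",
    "configs/base.yaml",
    "notebooks/walkthrough.ipynb",
]


def missing_files(output_dir: str, expected: List[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected relative paths that are not regular files under output_dir."""
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, `missing_files("attention_is_all_you_need")` returns an empty list when every listed file exists.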

Citation Anchoring Convention

The core value of paper2code is traceability. Every non-trivial decision is tagged:
| Tag | Meaning |
| --- | --- |
| §X.Y | Directly specified in section X.Y |
| §X.Y, Eq. N | Implements equation N from section X.Y |
| [UNSPECIFIED] | Paper doesn't state this; a choice is made and alternatives are listed |
| [PARTIALLY_SPECIFIED] | Paper mentions it but is ambiguous; the relevant quote is included |
| [ASSUMPTION] | Reasonable inference; the reasoning is explained |
| [FROM_OFFICIAL_CODE] | Taken from the authors' official implementation |
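
Because the tags are plain text in comments, they are easy to audit mechanically. The following is a minimal sketch, assuming tags only appear in `#` comments; `count_tags` is a hypothetical helper, not something paper2code ships, and it ignores docstrings and will misread a `#` that sits inside a string literal.

```python
import re
from collections import Counter

# Bracketed tags from the table, plus section anchors such as "§3.2" or "§3.2.1".
BRACKET_TAG = re.compile(
    r"\[(UNSPECIFIED|PARTIALLY_SPECIFIED|ASSUMPTION|FROM_OFFICIAL_CODE)\]"
)
SECTION_TAG = re.compile(r"§\d+(?:\.\d+)*")


def count_tags(source: str) -> Counter:
    """Count citation tags found in the '#' comments of a source string."""
    counts = Counter()
    for line in source.splitlines():
        # Keep only the text after the first '#'; skip lines without a comment.
        _, sep, comment = line.partition("#")
        if not sep:
            continue
        for match in BRACKET_TAG.finditer(comment):
            counts[match.group(1)] += 1
        # Count at most one section anchor per commented line.
        if SECTION_TAG.search(comment):
            counts["SECTION"] += 1
    return counts


example = """
self.W_o = nn.Linear(d_model, d_model)  # §3.2, Eq. 4: W^O projection
# [UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
"""
print(count_tags(example))
```

A high share of `UNSPECIFIED` or `ASSUMPTION` entries relative to section anchors is a hint that the paper leaves a lot open and that REPRODUCTION_NOTES.md deserves a careful read.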

Example — model.py with citation anchors

python
import torch
import torch.nn as nn
import math


class MultiHeadAttention(nn.Module):
    """§3.2 — Multi-Head Attention
    
    Implements Eq. 4: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    """

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        # §3.2: d_model = 512, h = 8 (base model row of Table 3)
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads  # §3.2 — d_k = d_v = d_model / h = 64
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # §3.2, Eq. 4 — W^O projection

        # [UNSPECIFIED] Dropout rate for attention weights not stated in §3.2
        # Using 0.1 matching the model-wide dropout (§5.4, Table 3)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # §3.2, Eq. 4 — project into h heads
        Q = self.W_q(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # §3.2.1, Eq. 1 — Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # §3.2.3 — decoder masks future positions with -inf before softmax
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        out = torch.matmul(attn_weights, V)  # (batch, heads, seq, d_k)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(out)  # §3.2, Eq. 4 — W^O output projection


class TransformerBlock(nn.Module):
    """§3.1 — Encoder/Decoder layer structure"""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)

        # [ASSUMPTION] Using pre-norm based on stability; paper Figure 1 shows post-norm
        # Post-norm: x = LayerNorm(x + sublayer(x)) — §3.1
        # [PARTIALLY_SPECIFIED] "We apply layer normalization" — position ambiguous
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # §3.3 — FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # §3.3 — "ReLU activation"
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm variant (see the [ASSUMPTION] note in __init__): normalise once, reuse for q, k, v
        normed = self.norm1(x)
        attn_out = self.attention(normed, normed, normed, mask)
        # §3.1: residual connection around each sub-layer
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

Example — configs/base.yaml with citations

yaml
# base.yaml: all hyperparameters for attention_is_all_you_need
# Each value is either cited from the paper or flagged [UNSPECIFIED]

model:
  d_model: 512             # §3, Table 3: "d_model = 512"
  num_heads: 8             # §3.2, Table 3: "h = 8"
  d_ff: 2048               # §3.3, Table 3: "d_ff = 2048"
  num_encoder_layers: 6    # §3, Table 3: "N = 6"
  num_decoder_layers: 6    # §3, Table 3: "N = 6"
  dropout: 0.1             # §5.4, Table 3: "P_drop = 0.1"
  max_seq_len: 512         # [UNSPECIFIED] not stated; using 512 (common default)
                           # Alternatives: 256, 1024

training:
  batch_size: 25000        # §5.1: "each batch ~25,000 source + target tokens"
  optimizer: adam          # §5.3: "Adam optimizer"
  beta1: 0.9               # §5.3: "β1 = 0.9"
  beta2: 0.98              # §5.3: "β2 = 0.98"
  epsilon: 1.0e-9          # §5.3: "ε = 10^-9"
  warmup_steps: 4000       # §5.3: "warmup_steps = 4000"
  label_smoothing: 0.1     # §5.4: "ε_ls = 0.1"
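
For context on how `d_model` and `warmup_steps` are consumed: §5.3 (Eq. 3) of the paper defines the learning rate as d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). A minimal sketch in plain Python follows; the function name `transformer_lr` is an illustrative assumption, and whether paper2code emits a scheduler in this form is not stated in this document.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given 1-based training step, following Eq. 3 of the paper.

    The rate grows linearly for the first warmup_steps steps, then decays
    proportionally to the inverse square root of the step number.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the rate at a few steps around the warmup boundary.
for s in (1, 1000, 4000, 8000, 100000):
    print(s, transformer_lr(s))
```

The two arguments of `min` are equal at `step == warmup_steps`, so the rate peaks there, which is why the warmup length and the model width are the only two values the schedule needs from the config.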

Example — REPRODUCTION_NOTES.md structure

markdown
# Reproduction Notes: Attention Is All You Need

## Ambiguity Audit

### SPECIFIED (high confidence)

| Choice | Value | Source |
| --- | --- | --- |
| d_model | 512 | §3, Table 3 |
| num_heads | 8 | §3.2, Table 3 |
| optimizer | Adam β1=0.9, β2=0.98 | §5.3 |

### PARTIALLY_SPECIFIED (judgment call made)

| Choice | Our Decision | Paper Quote | Alternatives |
| --- | --- | --- | --- |
| Norm position | pre-norm | "layer norm before each sub-layer" (§3.1) conflicts with Figure 1 | post-norm |

### UNSPECIFIED (our defaults)

| Choice | Our Default | Rationale | Alternatives |
| --- | --- | --- | --- |
| LayerNorm epsilon | 1e-6 | common default | 1e-5, 1e-8 |
| max_seq_len | 512 | common for WMT | 256, 1024 |

## Known Deviations

- data.py provides a skeleton only; WMT14 preprocessing is not implemented
- No beam search decoding (§6.1 reports beam size 4; not implemented here)

What paper2code Will NOT Do

Understanding the limits prevents wasted debugging time:
  • Won't guarantee correctness: the code matches what the paper describes, so if the paper is wrong, the code is wrong
  • Won't invent details silently: gaps are always flagged `[UNSPECIFIED]`, never filled in confidently
  • Won't download datasets: `data.py` gives a `Dataset` skeleton with instructions
  • Won't set up training infrastructure: no distributed training, no experiment tracking
  • Won't implement baselines: only the paper's core contribution
  • Won't reimplement standard components: it imports them or notes the dependency

Common Patterns

Pattern 1 — Implement a new architecture paper

/paper2code https://arxiv.org/abs/2010.11929 --mode minimal
Focus: `src/model.py` will contain the full architecture. Review `REPRODUCTION_NOTES.md` to understand every ambiguous choice before running.

Pattern 2 — Reproduce a training method

/paper2code https://arxiv.org/abs/2006.11239 --mode full --framework pytorch
Focus: `src/train.py` will contain the full training loop. `configs/base.yaml` will list every hyperparameter with paper citations.

Pattern 3 — Educational deep-dive

/paper2code 1706.03762 --mode educational
Focus: `notebooks/walkthrough.ipynb` walks through each paper section, shows the corresponding code, and runs CPU-safe shape checks.

Pattern 4 — Quick architecture prototype

/paper2code 2010.11929  # ViT
Then inspect and run:
bash
cd vision_transformer/
pip install -r requirements.txt
python -c "
from src.model import VisionTransformer
import torch
model = VisionTransformer()  # toy config
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)
"

Troubleshooting

Skill not triggering

  • Confirm the install completed: `npx skills list` should show `paper2code-arxiv-implementation`
  • Use the explicit trigger: `/paper2code <url>`
  • Try the bare arxiv ID format: `/paper2code 1706.03762`

Generated code has import errors

  • Run `pip install -r requirements.txt` first
  • Check `REPRODUCTION_NOTES.md` for noted dependencies
  • Standard components (e.g., HuggingFace transformers) are imported, not reimplemented; install them separately
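
If import errors persist after installing, it can help to list which entries of `requirements.txt` are not installed in the active environment. The sketch below uses only the standard library; `missing_requirements` is a hypothetical helper, it checks whether a distribution is present rather than whether the pinned version matches, and it assumes one requirement per line.

```python
import re
from importlib import metadata
from typing import List


def missing_requirements(requirements_text: str) -> List[str]:
    """Return distribution names from a requirements file that are not installed."""
    missing = []
    for raw in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # blank lines, comments, and pip options such as "-r other.txt"
        # The distribution name is the leading run of name characters,
        # before any version specifier, extras, or environment marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

A typical call would be `missing_requirements(open("requirements.txt").read())`, run from inside the generated directory.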

"Paper not found" or fetch errors

  • Confirm the arxiv ID exists: `https://arxiv.org/abs/<ID>`
  • Try the full URL instead of the bare ID
  • Some very new papers (hours old) may not be indexed yet

Silent assumptions in generated code

  • This should not happen by design; if you find one, it's a bug
  • Check `REPRODUCTION_NOTES.md` first; the assumption may be documented there
  • Report via the repo issues if a gap was genuinely filled silently

Framework-specific issues

  • Default framework is PyTorch: omitting `--framework` gives PyTorch output
  • JAX output requires `jax`, `flax`, and `optax` (listed in `requirements.txt`)
  • TensorFlow output requires `tensorflow>=2.x`

Contributing

Add a worked example

  1. Run: `/paper2code https://arxiv.org/abs/XXXX.XXXXX`
  2. Save the output to `skills/paper2code/worked/{paper_slug}/`
  3. Write `review.md` evaluating correctness, flagged ambiguities, and any mistakes
  4. Submit a PR

Improve guardrails

Add patterns where the skill makes silent assumptions to `guardrails/`.

Add domain knowledge

Papers in your subfield reference common components? Add a knowledge file to `knowledge/` (e.g., `knowledge/graph_neural_networks.md`).

Resources
