experiment-bridge


Workflow 1.5: Experiment Bridge


Implement and deploy experiments from plan: $ARGUMENTS

Overview


This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.
Workflow 1 output:                    This skill:                                    Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md   →   implement → GPT-5.4 review → deploy → collect → initial results ready
refine-logs/EXPERIMENT_TRACKER.md     code        (cross-model)    /run-experiment     for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md

Constants


  • CODE_REVIEW = true — GPT-5.4 xhigh reviews experiment code before deployment. Catches logic bugs before wasting GPU hours. Set false to skip.
  • AUTO_DEPLOY = true — Automatically deploy experiments after implementation + review. Set false to manually inspect code before deploying.
  • SANITY_FIRST = true — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
  • MAX_PARALLEL_RUNS = 4 — Maximum number of experiments to deploy in parallel (limited by available GPUs).
Override:
/experiment-bridge "EXPERIMENT_PLAN.md" — code review: false, auto deploy: false

Inputs


This skill expects one or more of:
  1. refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
  2. refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
  3. refine-logs/FINAL_PROPOSAL.md — method description for implementation context
  4. IDEA_REPORT.md — fallback if refine-logs don't exist
If none exist, ask the user what experiments to implement.

Workflow


Phase 1: Parse the Experiment Plan


Read EXPERIMENT_PLAN.md and extract:
  1. Run order and milestones — which experiments run first (sanity → baseline → main → ablation → polish)
  2. For each experiment block:
    • Dataset / split / task
    • Compared systems and variants
    • Metrics to compute
    • Setup details (backbone, hyperparameters, seeds)
    • Success criterion
    • Priority (MUST-RUN vs NICE-TO-HAVE)
  3. Compute budget — total estimated GPU-hours
  4. Method details from FINAL_PROPOSAL.md — what exactly to implement
Present a brief summary:
📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
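The extraction step above can be sketched in a few lines, assuming the plan tags priorities with literal MUST-RUN / NICE-TO-HAVE markers and writes estimates as "N GPU-hours" (a hypothetical format — adapt the patterns to the actual plan):

```python
import re
from pathlib import Path

def summarize_plan(path="refine-logs/EXPERIMENT_PLAN.md"):
    """Count must-run vs nice-to-have experiments and sum GPU-hour estimates."""
    text = Path(path).read_text()
    must = len(re.findall(r"MUST-RUN", text))
    nice = len(re.findall(r"NICE-TO-HAVE", text))
    # Assumes estimates appear as e.g. "2 GPU-hours" or "1.5 GPU-hours"
    hours = sum(float(h) for h in re.findall(r"(\d+(?:\.\d+)?)\s*GPU-hours?", text))
    return {"must_run": must, "nice_to_have": nice, "gpu_hours": hours}
```

The returned dict maps directly onto the summary template above.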

Phase 2: Implement Experiment Code


For each milestone (in order), write the experiment scripts:
  1. Check existing code — scan the project for existing experiment scripts, model code, data loaders. Reuse as much as possible.
  2. Implement missing pieces:
    • Training scripts with proper argparse (all hyperparameters configurable)
    • Evaluation scripts computing the specified metrics
    • Data loading / preprocessing if needed
    • Baseline implementations if not already present
    • Fixed random seeds for reproducibility
    • Results saved to JSON/CSV for later analysis
    • Proper logging (wandb if configured in CLAUDE.md)
  3. Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
  4. Self-review before deploying:
    • Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
    • Is the random seed fixed and controllable?
    • Are results saved in a parseable format (JSON/CSV)?
    • Does the code match FINAL_PROPOSAL.md's method description?
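The self-review checklist above can be sketched as a training-script skeleton (hyperparameter names here are illustrative placeholders, not taken from any particular plan):

```python
import argparse
import json
import random

def parse_args(argv=None):
    """Every hyperparameter from EXPERIMENT_PLAN.md should surface here."""
    p = argparse.ArgumentParser(description="experiment runner skeleton")
    p.add_argument("--lr", type=float, default=3e-4)
    p.add_argument("--batch-size", type=int, default=32)
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--seed", type=int, default=0)        # fixed, controllable seed
    p.add_argument("--out", default="results/run.json")  # parseable results path
    return p.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    random.seed(args.seed)  # a real script would also seed numpy/torch here
    # ... training and evaluation loop goes here ...
    metrics = {"seed": args.seed, "lr": args.lr, "key_metric": None}  # placeholder
    with open(args.out, "w") as f:
        json.dump(metrics, f, indent=2)  # JSON output for later analysis
    return metrics
```

Saving metrics as JSON (not just printing them) is what lets Phase 5 and the auto-review-loop parse results later.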

Phase 2.5: Cross-Model Code Review (when CODE_REVIEW = true)


Skip this step if CODE_REVIEW is false.
Before deploying, send the experiment code to GPT-5.4 xhigh for review:
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    Review the following experiment implementation for correctness.

    ## Experiment Plan:
    [paste key sections from EXPERIMENT_PLAN.md]

    ## Method Description:
    [paste from FINAL_PROPOSAL.md]

    ## Implementation:
    [paste the experiment scripts]

    Check for:
    1. Does the code correctly implement the method described in the proposal?
    2. Are all hyperparameters from the plan reflected in the code?
    3. Are there any logic bugs (wrong loss function, incorrect data split, missing eval)?
    4. Is the evaluation metric computed correctly?
    5. Any potential issues (OOM risk, numerical instability, missing seeds)?

    For each issue found, specify: CRITICAL / MAJOR / MINOR and the exact fix.
On review results:
  • No CRITICAL issues → proceed to Phase 3
  • CRITICAL issues found → fix them, then re-submit for review (max 2 rounds)
  • Codex MCP unavailable → skip silently, proceed to Phase 3 (graceful degradation)

Phase 3: Sanity Check (if SANITY_FIRST = true)


Before deploying the full experiment suite, run the sanity-stage experiment:
/run-experiment [sanity experiment command]
Wait for completion. Verify:
  • Training loop runs without errors
  • Metrics are computed and saved correctly
  • GPU memory usage is within bounds
  • Output format matches expectations
If sanity fails → fix the code, re-run. Do not proceed to full deployment with broken code.
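One way to sketch the verification step, assuming the sanity run writes its metrics to a JSON file (the path and key names below are hypothetical):

```python
import json
from pathlib import Path

def verify_sanity(result_path="results/sanity.json", required_keys=("key_metric",)):
    """Return (ok, reason): does the sanity run's output look usable?"""
    path = Path(result_path)
    if not path.exists():
        return False, f"missing output file: {path}"
    try:
        metrics = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    missing = [k for k in required_keys if k not in metrics]
    if missing:
        return False, f"metrics missing keys: {missing}"
    return True, "sanity output looks good"
```

A failed check means fixing the code and re-running sanity before any full deployment.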

Phase 4: Deploy Full Experiments


Deploy experiments following the plan's milestone order:
/run-experiment [experiment commands]
For each milestone:
  1. Deploy experiments in parallel (up to MAX_PARALLEL_RUNS)
  2. Use /monitor-experiment to track progress
  3. Collect results as experiments complete
🚦 Checkpoint (if AUTO_DEPLOY = false):
🔧 Code implementation complete. Ready to deploy:

Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]

Total estimated: ~X GPU-hours on [N] GPUs

Deploy now? Or review the code first?
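The per-milestone loop above could bound parallelism like this (a sketch; launch_experiment is a stand-in for however /run-experiment is actually dispatched):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_RUNS = 4

def launch_experiment(cmd):
    """Placeholder for dispatching one /run-experiment invocation."""
    return f"launched: {cmd}"

def deploy_milestone(commands, max_parallel=MAX_PARALLEL_RUNS):
    """Run one milestone's experiments with at most max_parallel in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(launch_experiment, commands))
```

Milestones stay sequential; only the experiments inside a milestone run concurrently, capped by MAX_PARALLEL_RUNS.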

Phase 5: Collect Initial Results


As experiments complete:
  1. Parse output files (JSON/CSV/logs) for key metrics
  2. Update refine-logs/EXPERIMENT_TRACKER.md — fill in Status and Notes columns
  3. Check success criteria from EXPERIMENT_PLAN.md — did each experiment meet its bar?
  4. Write initial results summary:
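A sketch of the collection step, assuming each completed run writes one JSON file under a results/ directory (the layout is an assumption):

```python
import json
from pathlib import Path

def collect_results(results_dir="results"):
    """Gather key metrics from completed runs into one summary dict."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            summary[path.stem] = json.loads(path.read_text())
        except json.JSONDecodeError:
            summary[path.stem] = {"status": "unparseable"}  # flag, don't crash
    return summary
```

The summary feeds both the tracker update and the results template below.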

Initial Experiment Results


Date: [today] Plan: refine-logs/EXPERIMENT_PLAN.md

Results by Milestone


M0: Sanity — PASSED


  • [result]

M1: Baselines


| Run | System | Key Metric | Status |
| --- | --- | --- | --- |
| R001 | baseline_1 | X.XX | DONE |

M2: Main Method


| Run | System | Key Metric | Status |
| --- | --- | --- | --- |
| R003 | our_method | X.XX | DONE |

M3: Ablations


...

Summary


  • [X/Y] must-run experiments completed
  • Main result: [positive/negative/inconclusive]
  • Ready for /auto-review-loop: [YES/NO]

Next Step


→ /auto-review-loop "[topic]"

Phase 6: Handoff


Present final status:
🔬 Experiment bridge complete:
- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]

Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md

Ready for Workflow 2:
→ /auto-review-loop "[topic]"

Key Rules


  • Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
  • Follow the plan. Do not invent experiments not in EXPERIMENT_PLAN.md. If you think something is missing, note it but don't add it.
  • Sanity first. Never deploy a full suite without verifying the sanity stage passes.
  • Reuse existing code. Scan the project before writing new scripts. Extend, don't duplicate.
  • Save everything as JSON/CSV. The auto-review-loop needs parseable results, not just terminal output.
  • Update the tracker. EXPERIMENT_TRACKER.md should reflect real status after each run completes.
  • Don't wait forever. If an experiment exceeds 2x its estimated time, flag it and move on to the next milestone.
  • Budget awareness. Track GPU-hours against the plan's budget. Warn if approaching the limit.
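The last two rules could be sketched as simple guards (threshold names and the 0.8 warning cutoff are illustrative choices, not from the plan):

```python
def budget_status(used_gpu_hours, budget_gpu_hours, warn_at=0.8):
    """Return 'ok', 'warning', or 'over' against the plan's GPU-hour budget."""
    if used_gpu_hours > budget_gpu_hours:
        return "over"
    if used_gpu_hours >= warn_at * budget_gpu_hours:
        return "warning"
    return "ok"

def is_overdue(elapsed_hours, estimated_hours, factor=2.0):
    """Per the 'don't wait forever' rule: flag runs past 2x their estimate."""
    return elapsed_hours > factor * estimated_hours
```

Checking these after each run completes keeps the milestone loop from silently blowing the budget or stalling on one run.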

Composing with Other Skills


/idea-discovery "direction"          ← Workflow 1: find + refine + plan
/experiment-bridge                   ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic"            ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md" ← Workflow 3: write the paper

Or use /research-pipeline for the full end-to-end flow (includes this bridge).