experiment-plan

Experiment Plan: Claim-Driven, Paper-Oriented Validation

Refine and concretize: $ARGUMENTS

Overview

Use this skill after the method is stable enough that the next question becomes: what exact experiments should we run, in what order, to defend the paper? If the user wants the full chain in one request, prefer /research-refine-pipeline.
The goal is not to generate a giant benchmark wishlist. The goal is to turn a proposal into a claim -> evidence -> run order roadmap that supports four things:
  1. the method actually solves the anchored problem
  2. the dominant contribution is real and focused
  3. the method is elegant enough that extra complexity is unnecessary
  4. any frontier-model-era component is genuinely useful, not decorative

Constants

  • OUTPUT_DIR = refine-logs/: Default destination for experiment planning artifacts.
  • MAX_PRIMARY_CLAIMS = 2: Prefer one dominant claim plus one supporting claim.
  • MAX_CORE_BLOCKS = 5: Keep the must-run experimental story compact.
  • MAX_BASELINE_FAMILIES = 3: Prefer a few strong baselines over many weak ones.
  • DEFAULT_SEEDS = 3: Use 3 seeds when stochastic variance matters and budget allows.
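
As a rough illustration, the constants above could be held in a small config object and checked against a draft plan. This is only a sketch: the class name PlanLimits and the helper check_limits are hypothetical and not part of the skill, and the limits are treated here as soft warnings, not hard errors.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlanLimits:
    """Hypothetical container mirroring the constants listed above."""
    output_dir: Path = Path("refine-logs")
    max_primary_claims: int = 2
    max_core_blocks: int = 5
    max_baseline_families: int = 3
    default_seeds: int = 3


def check_limits(n_claims, n_blocks, n_baseline_families, limits=PlanLimits()):
    """Return one warning string per soft limit that the draft plan exceeds."""
    checks = [
        ("primary claims", n_claims, limits.max_primary_claims),
        ("core blocks", n_blocks, limits.max_core_blocks),
        ("baseline families", n_baseline_families, limits.max_baseline_families),
    ]
    return [
        f"{label}: {count} exceeds the preferred maximum of {maximum}"
        for label, count, maximum in checks
        if count > maximum
    ]


if __name__ == "__main__":
    # A draft with 3 claims, 4 blocks, and 5 baseline families trips two limits.
    for warning in check_limits(3, 4, 5):
        print(warning)
```

Treating the limits as warnings matches the "prefer" wording of the constants: a plan may exceed them when the paper truly needs it, but the excess should be a deliberate choice.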

Workflow

Phase 0: Load the Proposal Context

Read the most relevant existing files first if they exist:
  • refine-logs/FINAL_PROPOSAL.md
  • refine-logs/REVIEW_SUMMARY.md
  • refine-logs/REFINEMENT_REPORT.md
Extract:
  • Problem Anchor
  • Dominant contribution
  • Optional supporting contribution
  • Critical reviewer concerns
  • Data / compute / timeline constraints
  • Which frontier primitive is central, if any
If these files do not exist, derive the same information from the user's prompt.
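
The file-or-prompt fallback described above can be sketched as follows. The function name load_proposal_context and the shape of the returned dictionary are assumptions made for illustration; only the three file names and the refine-logs/ directory come from the text.

```python
from pathlib import Path

# File names taken from the list above; everything else is illustrative.
PROPOSAL_FILES = ("FINAL_PROPOSAL.md", "REVIEW_SUMMARY.md", "REFINEMENT_REPORT.md")


def load_proposal_context(log_dir="refine-logs", user_prompt=""):
    """Read whichever proposal files exist; fall back to the user's prompt."""
    root = Path(log_dir)
    documents = {}
    for name in PROPOSAL_FILES:
        path = root / name
        if path.is_file():
            documents[name] = path.read_text(encoding="utf-8")
    if documents:
        return {"source": "files", "documents": documents}
    # None of the files exist: the prompt is the only available context.
    return {"source": "prompt", "documents": {"user_prompt": user_prompt}}


if __name__ == "__main__":
    context = load_proposal_context(user_prompt="Plan experiments for my method.")
    print(context["source"], sorted(context["documents"]))
```

Extracting the problem anchor, contributions, and constraints from the loaded text is left out on purpose, since that step depends on how the proposal is written.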

Phase 1: Freeze the Paper Claims

Before proposing experiments, write down the claims that must be defended.
Use this structure:
  • Primary claim: the main mechanism-level contribution
  • Supporting claim: optional, only if it directly strengthens the main paper story
  • Anti-claim to rule out: e.g. "the gain only comes from more parameters," "the gain only comes from a larger search space," or "the modern component is just decoration"
  • Minimum convincing evidence: what would make each claim believable to a strong reviewer?
Do not exceed MAX_PRIMARY_CLAIMS unless the paper truly has multiple inseparable claims.

Phase 2: Build the Experimental Storyline

Design the paper around a compact set of experiment blocks. Default to the following blocks and delete any that are not needed:
  1. Main anchor result — does the method solve the actual bottleneck?
  2. Novelty isolation — does the dominant contribution itself matter?
  3. Simplicity / elegance check — can a bigger or more fragmented version be avoided?
  4. Frontier necessity check — if an LLM / VLM / Diffusion / RL-era component is central, is it actually the right tool?
  5. Failure analysis or qualitative diagnosis — what does the method still miss?
For each block, decide whether it belongs in:
  • Main paper — essential to defend the core claims
  • Appendix — useful but non-blocking
  • Cut — interesting, but not worth the paper budget
Prefer one strong baseline family over many weak baselines. If a stronger modern baseline exists, use it instead of padding the list.

Phase 3: Specify Each Experiment Block

For every kept block, fully specify:
  • Claim tested
  • Why this block exists
  • Dataset / split / task
  • Compared systems: strongest baselines, ablations, and variants only
  • Metrics: decisive metrics first, secondary metrics second
  • Setup details: backbone, frozen vs trainable parts, key hyperparameters, training budget, seeds
  • Success criterion: what outcome would count as convincing evidence?
  • Failure interpretation: if the result is negative, what does it mean?
  • Table / figure target: where this result should appear in the paper
Special rules:
  • A simplicity check should usually compare the final method against either an overbuilt variant or a tempting extra component that the paper intentionally rejects.
  • A frontier necessity check should usually compare the chosen modern primitive against the strongest plausible simpler or older alternative.
  • If the proposal is intentionally non-frontier, say so explicitly and skip the frontier block instead of forcing one.
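
One way to enforce "fully specify" is to treat each block as a record and list the fields that are still empty. The sketch below assumes a hypothetical ExperimentBlock dataclass whose field names paraphrase the bullets above; it is not a format the skill prescribes.

```python
from dataclasses import dataclass, fields


@dataclass
class ExperimentBlock:
    """Hypothetical record with one field per bullet in the specification list."""
    name: str
    claim_tested: str = ""
    why_it_exists: str = ""
    dataset_split_task: str = ""
    compared_systems: str = ""
    metrics: str = ""
    setup_details: str = ""
    success_criterion: str = ""
    failure_interpretation: str = ""
    table_figure_target: str = ""


def missing_fields(block):
    """Return the names of fields that are empty or whitespace-only."""
    return [
        spec.name
        for spec in fields(block)
        if not str(getattr(block, spec.name)).strip()
    ]


if __name__ == "__main__":
    draft = ExperimentBlock(name="Main anchor result", claim_tested="C1")
    # Any name printed here still needs to be written before the block is kept.
    print(missing_fields(draft))
```

A block with an empty success_criterion or failure_interpretation is a useful warning sign: it usually means the experiment has not yet been tied to a reviewer belief.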

Phase 4: Turn the Plan Into an Execution Order

Build a realistic run order so the user knows what to do first.
Use this milestone structure:
  1. Sanity stage — data pipeline, metric correctness, one quick overfit or toy split
  2. Baseline stage — reproduce the strongest baseline(s)
  3. Main method stage — run the final method on the primary setting
  4. Decision stage — run the decisive ablations for novelty, simplicity, and frontier necessity
  5. Polish stage — robustness, qualitative figures, appendix extras
For each milestone, estimate:
  • compute cost
  • expected turnaround time
  • stop / go decision gate
  • risk and mitigation
Separate must-run from nice-to-have experiments.
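
The milestone ordering and the must-run split can be sketched as a simple sort. The stage keys in MILESTONE_ORDER and the run dictionary layout are illustrative assumptions; placing must-run runs first within a milestone is one reasonable reading of the rule above, not a requirement stated by the skill.

```python
# Stage keys abbreviate the five milestones listed above (hypothetical names).
MILESTONE_ORDER = ["sanity", "baseline", "main", "decision", "polish"]


def order_runs(runs):
    """Sort runs by milestone, with must-run runs first inside each milestone."""
    return sorted(
        runs,
        key=lambda run: (MILESTONE_ORDER.index(run["milestone"]), not run["must_run"]),
    )


def split_by_priority(runs):
    """Return (must_run, nice_to_have) lists, each in execution order."""
    ordered = order_runs(runs)
    must = [run for run in ordered if run["must_run"]]
    nice = [run for run in ordered if not run["must_run"]]
    return must, nice


if __name__ == "__main__":
    runs = [
        {"id": "R003", "milestone": "decision", "must_run": True},
        {"id": "R002", "milestone": "sanity", "must_run": False},
        {"id": "R001", "milestone": "sanity", "must_run": True},
    ]
    print([run["id"] for run in order_runs(runs)])
```

Cost, turnaround, and stop/go gates are left out of the sketch; they would be extra keys on each milestone and do not affect the ordering.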

Phase 5: Write the Outputs

Step 5.1: Write refine-logs/EXPERIMENT_PLAN.md

Use this structure:

Experiment Plan

Problem: [problem]
Method Thesis: [one-sentence thesis]
Date: [today]

Claim Map

| Claim | Why It Matters | Minimum Convincing Evidence | Linked Blocks |
| --- | --- | --- | --- |
| C1 | ... | ... | B1, B2 |

Paper Storyline

  • Main paper must prove:
  • Appendix can support:
  • Experiments intentionally cut:

Experiment Blocks

Block 1: [Name]

  • Claim tested:
  • Why this block exists:
  • Dataset / split / task:
  • Compared systems:
  • Metrics:
  • Setup details:
  • Success criterion:
  • Failure interpretation:
  • Table / figure target:
  • Priority: MUST-RUN / NICE-TO-HAVE

Block 2: [Name]

...

Run Order and Milestones

| Milestone | Goal | Runs | Decision Gate | Cost | Risk |
| --- | --- | --- | --- | --- | --- |
| M0 | ... | ... | ... | ... | ... |

Compute and Data Budget

  • Total estimated GPU-hours:
  • Data preparation needs:
  • Human evaluation needs:
  • Biggest bottleneck:

Risks and Mitigations

  • [Risk]:
  • [Mitigation]:

Final Checklist

  • Main paper tables are covered
  • Novelty is isolated
  • Simplicity is defended
  • Frontier contribution is justified or explicitly not claimed
  • Nice-to-have runs are separated from must-run runs

Step 5.2: Write refine-logs/EXPERIMENT_TRACKER.md

Use this structure:

Experiment Tracker

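| Run ID | Milestone | Purpose | System / Variant | Split | Metrics | Priority | Status | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R001 | M0 | sanity | ... | ... | ... | MUST | TODO | ... |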

Keep the tracker compact and execution-oriented.
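
A tracker with these columns could be generated from a list of run records, as in the sketch below. The helper name render_tracker and the use of "..." for unfilled cells are assumptions for illustration; the column names are the ones in the tracker header above.

```python
TRACKER_COLUMNS = [
    "Run ID", "Milestone", "Purpose", "System / Variant",
    "Split", "Metrics", "Priority", "Status", "Notes",
]


def render_tracker(rows):
    """Render run records (dicts keyed by column name) as a markdown table."""
    header = "| " + " | ".join(TRACKER_COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in TRACKER_COLUMNS) + " |"
    lines = [header, divider]
    for row in rows:
        # Cells that are not filled in yet are shown as "...".
        cells = [str(row.get(column, "...")) for column in TRACKER_COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    first_run = {
        "Run ID": "R001", "Milestone": "M0", "Purpose": "sanity",
        "Priority": "MUST", "Status": "TODO",
    }
    print(render_tracker([first_run]), end="")
```

Keeping one dictionary per run makes it easy to update Status in place and re-render the file, which fits the goal of a compact, execution-oriented tracker.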

Step 5.3: Present a Brief Summary to the User

Experiment plan ready.

Must-run blocks:
- [Block 1]
- [Block 2]

Highest-risk assumption:
- [risk]

First three runs to launch:
1. [run]
2. [run]
3. [run]

Plan file: refine-logs/EXPERIMENT_PLAN.md
Tracker file: refine-logs/EXPERIMENT_TRACKER.md

Key Rules

  • Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission; just do it silently.
  • Every experiment must defend a claim. If it does not change a reviewer belief, cut it.
  • Prefer a compact paper story. Design the main table first, then add only the ablations that defend it.
  • Defend simplicity explicitly. If complexity is a concern, include a deletion study or a stronger-but-bloated variant comparison.
  • Defend frontier choices explicitly. If a modern primitive is central, prove why it is better than the strongest simpler alternative.
  • Prefer strong baselines over long baseline lists. A short, credible comparison set is better than a padded one.
  • Separate must-run from nice-to-have. Do not let appendix ideas delay the core paper evidence.
  • Reuse proposal constraints. Do not invent unrealistic budgets or data assumptions.
  • Do not fabricate results. Plan evidence; do not claim evidence.
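
The chunked-write fallback in the first rule names a Bash heredoc. For illustration only, the same idea can be sketched in Python: the first chunk creates or truncates the file and later chunks append to it. The function name write_in_chunks and the max_chars default are hypothetical, and this is an analogue of the Bash approach, not a replacement the skill specifies.

```python
from pathlib import Path


def write_in_chunks(path, text, max_chars=20_000):
    """Write text to path in pieces of at most max_chars characters.

    The first chunk truncates the file; every later chunk is appended.
    Returns the number of chunks written.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Empty text still produces one (empty) chunk so the file gets created.
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    for index, chunk in enumerate(chunks):
        mode = "w" if index == 0 else "a"
        with target.open(mode, encoding="utf-8") as handle:
            handle.write(chunk)
    return len(chunks)


if __name__ == "__main__":
    count = write_in_chunks("refine-logs/EXPERIMENT_PLAN.md", "# Experiment Plan\n")
    print(f"wrote {count} chunk(s)")
```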

Composing with Other Skills

/research-refine-pipeline -> one-shot method + experiment planning
/research-refine   -> method and claim refinement
/experiment-plan   -> detailed experiment roadmap
/run-experiment    -> execute the runs
/auto-review-loop  -> react to results and iterate on the paper