experiment-design-planner

Experiment Design Planner

Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.
Use this skill when:
  • a user is about to run a new experiment or ablation
  • a paper claim needs evidence
  • baselines, metrics, controls, or datasets are unclear
  • the user is changing too many variables at once
  • cluster/compute time should not be wasted on ambiguous runs
  • reviewer-proof evidence is needed before submission
Pair this skill with:
  • research-project-memory: when the experiment plan should become project-level evidence, risk, and action memory
  • run-experiment: after the design is ready to execute
  • experiment-report-writer: after results exist
  • paper-reviewer-simulator: to stress-test whether the evidence will satisfy reviewers
  • baseline-selection-audit: before finalizing the experiment matrix, when baseline choice, fairness, or reviewer-proof comparisons need deeper review
  • figure-results-review: after plotted or tabulated results exist and need claim-support review

Skill Directory Layout

text
<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md

Progressive Loading

  • Always read references/evidence-standards.md and references/metrics-and-controls.md.
  • Read references/ablation-matrix.md when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
  • Use references/report-template.md when saving or returning a substantial experiment plan.
  • If the target repo has memory/, update planned evidence, experiment families, risks, and actions using research-project-memory conventions.
  • If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.

Core Principles

  • Start from the claim, not the command line.
  • State the hypothesis before running experiments.
  • Use a baseline before introducing a new method.
  • Change one variable at a time unless the experiment is explicitly factorial.
  • Define controls and nuisance variables before interpreting results.
  • Make negative results useful by defining falsification and fallback decisions.
  • Design the table or figure before running the experiment.
  • Stop conditions matter: decide what result is enough to move on.

Step 1 - Define the Claim and Question

Extract:
  • paper or project claim
  • research question
  • target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
  • expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
  • experiment mode:
    • single: one controlled comparison
    • ablation: component or variable isolation
    • benchmark: compare methods across datasets/tasks
    • theory: empirical support for a theoretical prediction
    • diagnostic: understand a failure mode or surprising result
Rewrite vague goals into testable questions:
text
Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
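The rewrite above follows a fixed pattern: component, metric, baseline, datasets, and the condition held constant. A minimal sketch of that pattern as a template, assuming a hypothetical helper named render_question (the function and its parameter names are illustrative and not part of this skill):

```python
def render_question(component, metric, baseline, datasets, control):
    """Fill the testable-question pattern from its five parts.

    All parameter names are illustrative assumptions, not a fixed schema.
    """
    if not datasets:
        raise ValueError("name at least one dataset")
    return (
        f"Does {component} improve {metric} over {baseline} "
        f"on {'/'.join(datasets)} under {control}?"
    )


print(render_question(
    "component X", "metric M", "baseline B",
    ["D1", "D2"], "the same training budget",
))
```

If any of the five parts cannot be filled in, the question is probably still vague.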

Step 2 - State Hypotheses

Write:
  • primary hypothesis
  • alternative explanations
  • expected metric direction and rough effect size
  • falsification condition
  • decision rule
If the user cannot state a falsification condition, the experiment is not ready.
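One way to enforce the readiness rule is to record the hypothesis fields in a small structure and refuse to proceed while required ones are empty. The sketch below is an assumption about how that could look; HypothesisSpec and its field names are hypothetical, and which fields count as required is a design choice, not something this skill prescribes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HypothesisSpec:
    """Hypothetical container for the Step 2 fields."""

    primary: str
    alternatives: list[str] = field(default_factory=list)
    expected_direction: str = ""  # e.g. "higher is better"
    expected_effect: str = ""     # rough effect size, stated in words
    falsification: str = ""
    decision_rule: str = ""

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "primary": self.primary,
            "expected_direction": self.expected_direction,
            "falsification": self.falsification,
            "decision_rule": self.decision_rule,
        }
        return [name for name, value in required.items() if not value.strip()]

    def is_ready(self) -> bool:
        """The experiment is ready only when no required field is missing."""
        return not self.missing_fields()
```

Calling missing_fields() before building the run matrix gives a concrete list of what to ask the user for.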

Step 3 - Define Evidence Standard

Read references/evidence-standards.md.
Decide what evidence is needed:
  • one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
  • number of datasets/tasks
  • number of seeds or repeats
  • required baselines
  • acceptable variance
  • whether statistical testing or confidence intervals are needed
  • whether results must support a paper claim or only guide next steps
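When seeds and confidence intervals are part of the evidence standard, it helps to fix the summary computation before any run finishes. The following is a minimal sketch, assuming a 95% Student-t interval over per-seed scores is the chosen standard; summarize_seeds and the lookup table are illustrative, and references/evidence-standards.md may call for something different.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# Only small seed counts are covered; extend the table for larger ones.
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def summarize_seeds(scores):
    """Return n, mean, sample std, and a 95% t-interval for per-seed scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate variance")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 to cover this many seeds")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[df] * std / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "ci95": (mean - half_width, mean + half_width),
    }
```

With very few seeds the interval is wide, which is itself a useful signal when deciding how many repeats the plan needs.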

Step 4 - Choose Baselines and Controls

Identify:
  • primary baseline
  • strongest prior method or current SOTA, if relevant
  • simple baseline
  • ablation baseline
  • oracle or upper bound, if useful
  • controlled variables
  • nuisance variables
If no baseline exists, make the first experiment a baseline-establishment experiment.

Step 5 - Choose Metrics and Logging

Read references/metrics-and-controls.md.
For each metric, specify:
  • definition
  • direction
  • aggregation
  • split
  • variance reporting
  • failure interpretation
  • why it answers the question
Define required logging:
  • command
  • config path
  • git commit
  • dataset version
  • seed
  • hyperparameters
  • hardware/runtime
  • metrics
  • artifacts: tables, figures, checkpoints, logs
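The logging list above maps naturally onto one JSON record per run. The sketch below is one possible shape, not the format run-experiment expects: the function names, the dictionary keys, and the run_record.json file name are all assumptions. It uses only the standard library and the real `git rev-parse HEAD` command.

```python
import json
import platform
import subprocess
import sys
import time
from pathlib import Path


def current_git_commit():
    """Return the current commit hash, or None when git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def build_run_record(config_path, dataset_version, seed, hyperparameters):
    """Collect the required logging fields into one dictionary."""
    return {
        "command": " ".join(sys.argv),
        "config_path": str(config_path),
        "git_commit": current_git_commit(),
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparameters": hyperparameters,
        # OS/host string only; accelerator details would need to be added.
        "host": platform.platform(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "metrics": {},    # filled in after the run
        "artifacts": [],  # paths to tables, figures, checkpoints, logs
    }


def save_run_record(record, output_dir):
    """Write the record as JSON and return the file path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "run_record.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

Writing the record at launch, then updating metrics and artifacts at the end, keeps failed runs traceable too.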

Step 6 - Build Run Matrix

Read references/ablation-matrix.md when there is more than one run.
Create a run table with:
  • run ID
  • changed variable
  • fixed controls
  • dataset/split
  • metric
  • seed/repeats
  • expected result
  • status
  • output path
Split experiments if a run changes more than one conceptual variable.
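The split rule can be checked mechanically by diffing each planned run against the baseline configuration. This is a minimal sketch under assumed names (build_run_matrix, changed_variables, and the row keys are illustrative); it treats each config key as one conceptual variable, which is a simplification, and it would need relaxing for an explicitly factorial design.

```python
def changed_variables(baseline, variant):
    """List the config keys whose values differ between two run configs."""
    keys = sorted(set(baseline) | set(variant))
    return [k for k in keys if baseline.get(k) != variant.get(k)]


def build_run_matrix(baseline, variants, seeds, output_root="outputs"):
    """Build run-table rows, rejecting runs that change more than one variable.

    `variants` maps a run ID to its full config dictionary.
    """
    rows = []
    for run_id, config in [("baseline", baseline), *variants.items()]:
        changed = changed_variables(baseline, config)
        if len(changed) > 1:
            raise ValueError(
                f"run {run_id!r} changes {changed}; split it into separate runs"
            )
        rows.append({
            "run_id": run_id,
            "changed_variable": changed[0] if changed else None,
            "fixed_controls": {k: v for k, v in config.items() if k not in changed},
            "seeds": list(seeds),
            "status": "planned",
            "output_path": f"{output_root}/{run_id}",
        })
    return rows
```

Dataset/split, metric, and expected result are omitted here for brevity; they would be added as further row keys.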

Step 7 - Define Stop Conditions and Next Decisions

Write:
  • what result is sufficient to support the claim
  • what result falsifies or weakens the claim
  • what result triggers another ablation
  • what result means stop and write/report
  • compute budget ceiling
  • deadline constraints
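Writing these conditions as an explicit rule before running makes it harder to move the goalposts afterwards. The sketch below assumes the simplest case, a single scalar gain over the baseline and two thresholds fixed in advance; next_decision, its parameters, and the returned labels are hypothetical, and real plans will often need richer rules.

```python
def next_decision(observed_gain, support_threshold, falsify_threshold,
                  compute_used, compute_budget):
    """Map an observed gain to a pre-registered next step.

    Thresholds are set during planning; the gain is metric(method) - metric(baseline)
    for a higher-is-better metric.
    """
    if falsify_threshold >= support_threshold:
        raise ValueError("falsify_threshold must be below support_threshold")
    if observed_gain >= support_threshold:
        return "stop_and_report"        # sufficient to support the claim
    if observed_gain <= falsify_threshold:
        return "claim_weakened"         # falsifies or weakens the claim
    if compute_used >= compute_budget:
        return "stop_budget_exhausted"  # ambiguous result, no budget left
    return "run_another_ablation"       # ambiguous result, budget remains
```

Deadline constraints could be handled the same way as the compute ceiling, with a remaining-time check before the final branch.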

Step 8 - Reviewer Risk Check

Before finalizing, ask:
  • Would a reviewer complain that the baseline is weak?
  • Is the comparison fair?
  • Are seeds/repeats enough?
  • Does the experiment isolate the claimed mechanism?
  • Are metrics aligned with the claim?
  • Is there a confounder that could explain the result?
  • Would a negative result still teach something?
If the answer exposes a major weakness, update the design before execution.

Step 9 - Write the Experiment Plan

Use references/report-template.md.
If saving to a project and no path is given, use:
text
docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by init-python-project / new-workspace, prefer:
text
docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md
The final plan should be runnable by run-experiment and later reportable by experiment-report-writer.
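The two default save locations can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper plan_path; the slug normalization (lowercase, hyphen-separated) is an added assumption, since the skill only specifies the placeholder <short-name>.

```python
import re
from datetime import date
from pathlib import Path


def plan_path(short_name, in_code_repo=False, today=None):
    """Build the default plan path for a given short name.

    in_code_repo=True selects docs/reports/, otherwise docs/experiments/.
    """
    today = today or date.today()
    # Assumed normalization: lowercase letters and digits joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    return Path(folder) / f"experiment_plan_{today.isoformat()}_{slug}.md"


print(plan_path("LR warmup ablation").as_posix())
```

A path given explicitly by the user should always take precedence over this default.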

Step 10 - Write Back to Project Memory

If the project uses research-project-memory, update:
  • memory/evidence-board.md: planned EVD-### items and EXP-### experiment families
  • memory/claim-board.md: linked claims, marking unsupported or planned claims honestly
  • memory/risk-board.md: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
  • memory/action-board.md: runnable next actions, including which experiment to launch first
  • relevant worktree .agent/worktree-status.md: experiment purpose and exit condition if a branch/worktree is involved
Use planned status for experiments that have not run. Do not record expected outcomes as observed evidence.

Final Sanity Check

Before finalizing:
  • claim and hypothesis are explicit
  • baseline is defined
  • independent variable is isolated
  • controls and nuisance variables are listed
  • metrics are tied to the question
  • run matrix is concrete
  • logging requirements are sufficient for reproduction
  • stop condition and decision rule are explicit
  • reviewer risks are stated
  • project memory is updated when the repo has memory/
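The checklist can also act as a gate: list what has been confirmed and see what is still open. The sketch below is purely illustrative (unmet_items and the constant names are assumptions); the item strings are copied from the list above, and the project-memory item applies only when the repo has memory/.

```python
SANITY_CHECKLIST = (
    "claim and hypothesis are explicit",
    "baseline is defined",
    "independent variable is isolated",
    "controls and nuisance variables are listed",
    "metrics are tied to the question",
    "run matrix is concrete",
    "logging requirements are sufficient for reproduction",
    "stop condition and decision rule are explicit",
    "reviewer risks are stated",
)
MEMORY_ITEM = "project memory is updated"


def unmet_items(confirmed, repo_has_memory_dir):
    """Return checklist items not yet confirmed, in checklist order."""
    required = list(SANITY_CHECKLIST)
    if repo_has_memory_dir:
        required.append(MEMORY_ITEM)
    return [item for item in required if item not in confirmed]
```

An empty return value means the plan is ready to hand over for execution.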