experiment-design-planner

Experiment Design Planner

Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.

Use this skill when:

a user is about to run a new experiment or ablation
a paper claim needs evidence
baselines, metrics, controls, or datasets are unclear
the user is changing too many variables at once
cluster/compute time should not be wasted on ambiguous runs
reviewer-proof evidence is needed before submission

Pair this skill with:

`research-project-memory` when the experiment plan should become project-level evidence, risk, and action memory
`run-experiment` after the design is ready to execute
`experiment-report-writer` after results exist
`paper-reviewer-simulator` to stress-test whether the evidence will satisfy reviewers
`baseline-selection-audit` before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
`figure-results-review` after plotted or tabulated results exist and need claim-support review

Skill Directory Layout

```text
<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md
```

Progressive Loading

Always read `references/evidence-standards.md` and `references/metrics-and-controls.md`.
Read `references/ablation-matrix.md` when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
Use `references/report-template.md` when saving or returning a substantial experiment plan.
If the target repo has `memory/`, update planned evidence, experiment families, risks, and actions using `research-project-memory` conventions.
If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.

Core Principles

Start from the claim, not the command line.
State the hypothesis before running experiments.
Establish a baseline before introducing a new method.
Change one variable at a time unless the experiment is explicitly factorial.
Define controls and nuisance variables before interpreting results.
Make negative results useful by defining falsification and fallback decisions.
Design the table or figure before running the experiment.
Stop conditions matter: decide what result is enough to move on.

Step 1 - Define the Claim and Question

Extract:

paper or project claim
research question
target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
experiment mode:
- `single`: one controlled comparison
- `ablation`: component or variable isolation
- `benchmark`: compare methods across datasets/tasks
- `theory`: empirical support for a theoretical prediction
- `diagnostic`: understand a failure mode or surprising result

Rewrite vague goals into testable questions:

```text
Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
```
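The rewrite from vague to testable can be made mechanical by forcing every slot of the question to be filled in. The sketch below is one possible way to do that; the `ResearchQuestion` class and its field names are hypothetical and not part of this skill, and only the five mode names come from the list above.

```python
from dataclasses import dataclass
from typing import Tuple

# Experiment modes listed in Step 1.
MODES = {"single", "ablation", "benchmark", "theory", "diagnostic"}


@dataclass(frozen=True)
class ResearchQuestion:
    """Hypothetical container: every slot must be filled before planning."""
    component: str
    metric: str
    baseline: str
    datasets: Tuple[str, ...]
    constraint: str
    mode: str = "single"

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown experiment mode: {self.mode!r}")
        if not self.datasets:
            raise ValueError("a testable question names at least one dataset")

    def render(self) -> str:
        # Produces the same sentence shape as the 'Testable' example.
        return (
            f"Does {self.component} improve {self.metric} over {self.baseline} "
            f"on datasets {'/'.join(self.datasets)} under {self.constraint}?"
        )


question = ResearchQuestion(
    component="component X",
    metric="metric M",
    baseline="baseline B",
    datasets=("D1", "D2"),
    constraint="the same training budget",
    mode="ablation",
)
print(question.render())
```

A question that cannot populate one of these fields is usually still at the "Does our method work?" stage.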

Step 2 - State Hypotheses

Write:

primary hypothesis
alternative explanations
expected metric direction and rough effect size
falsification condition
decision rule

If the user cannot state a falsification condition, the experiment is not ready.
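The readiness rule above can be checked before any compute is spent. This is a minimal sketch, assuming the hypothesis is kept as a small record; `HypothesisRecord` and its field names are illustrative and not defined by this skill.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HypothesisRecord:
    """Hypothetical record mirroring the five items listed in Step 2."""
    primary: str
    alternatives: List[str] = field(default_factory=list)
    expected_effect: str = ""   # metric direction and rough effect size
    falsification: str = ""     # what result would show the hypothesis is wrong
    decision_rule: str = ""     # what to do for each outcome

    def missing_fields(self) -> List[str]:
        required = {
            "primary": self.primary,
            "alternatives": self.alternatives,
            "expected_effect": self.expected_effect,
            "falsification": self.falsification,
            "decision_rule": self.decision_rule,
        }
        # Empty strings and empty lists both count as missing.
        return [name for name, value in required.items() if not value]

    def is_ready(self) -> bool:
        return not self.missing_fields()


draft = HypothesisRecord(
    primary="Component X improves metric M over baseline B",
    alternatives=["Gain comes from extra parameters, not from X"],
    expected_effect="higher M, roughly one point",
    decision_rule="If supported, run the full ablation; otherwise drop X",
)
print(draft.is_ready(), draft.missing_fields())
```

Here the draft is rejected only because the falsification condition is empty, which is the case the rule above is meant to catch.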

Step 3 - Define Evidence Standard

Read `references/evidence-standards.md`.

Decide what evidence is needed:

one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
number of datasets/tasks
number of seeds or repeats
required baselines
acceptable variance
whether statistical testing or confidence intervals are needed
whether results must support a paper claim or only guide next steps
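When the plan calls for confidence intervals over seeds, it helps to fix the computation before any run finishes. The sketch below shows one common choice, a two-sided 95% interval based on Student's t, using only the standard library. The function name `mean_with_ci` is hypothetical, the seed scores are made-up illustrative numbers, and whether a t-interval is the right standard for a given claim is a decision for `references/evidence-standards.md`, not for this example.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def mean_with_ci(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of the 95% confidence interval)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate variance")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("table covers 2 to 11 seeds; extend it or use a stats library")
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, T_CRIT_95[df] * std_err


# Made-up scores from five seeds, for illustration only.
seed_scores = [0.81, 0.79, 0.80, 0.82, 0.78]
mean, half_width = mean_with_ci(seed_scores)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With few seeds the critical value is large, so the interval widens quickly; that is often the practical argument for deciding the number of repeats in advance.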

Step 4 - Choose Baselines and Controls

Identify:

primary baseline
strongest prior method or current SOTA, if relevant
simple baseline
ablation baseline
oracle or upper bound, if useful
controlled variables
nuisance variables

If no baseline exists, make the first experiment a baseline-establishment experiment.

Step 5 - Choose Metrics and Logging

Read `references/metrics-and-controls.md`.

For each metric, specify:

definition
direction
aggregation
split
variance reporting
failure interpretation
why it answers the question

Define required logging:

command
config path
git commit
dataset version
seed
hyperparameters
hardware/runtime
metrics
artifacts: tables, figures, checkpoints, logs
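The logging fields above can be captured once at launch and written next to the run outputs. This is a minimal sketch under stated assumptions: `build_run_record` and the JSON layout are hypothetical, the example argument values are placeholders, and `git rev-parse HEAD` is used only if the run starts inside a Git checkout.

```python
import json
import platform
import shlex
import subprocess
import sys
from datetime import datetime, timezone
from typing import Any, Dict


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(config_path: str, dataset_version: str, seed: int,
                     hyperparameters: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the required logging fields into one JSON-serializable dict."""
    return {
        "command": shlex.join(sys.argv),
        "config_path": config_path,
        "git_commit": git_commit(),
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparameters": hyperparameters,
        "hardware": platform.platform(),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "metrics": {},     # filled in when the run finishes
        "artifacts": [],   # paths to tables, figures, checkpoints, logs
    }


# Placeholder values for illustration only.
record = build_run_record("configs/example.yaml", "v1", seed=0,
                          hyperparameters={"lr": 1e-3})
print(json.dumps(record, indent=2))
```

Writing the record before training starts means a crashed run still leaves enough information to reproduce it.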

Step 6 - Build Run Matrix

Read `references/ablation-matrix.md` when there is more than one run.

Create a run table with:

run ID
changed variable
fixed controls
dataset/split
metric
seed/repeats
expected result
status
output path

Split experiments if a run changes more than one conceptual variable.
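A one-variable-at-a-time matrix can be generated from a baseline configuration, which also makes the splitting rule easy to check. The sketch below is illustrative: the function names, the row fields, and the example hyperparameters are assumptions, and the actual matrix format belongs to `references/ablation-matrix.md`.

```python
from typing import Any, Dict, List, Optional, Sequence


def _row(index: int, changed: Optional[str], config: Dict[str, Any],
         seeds: Sequence[int]) -> Dict[str, Any]:
    return {
        "run_id": f"R{index:02d}",
        "changed": changed,          # None marks the baseline run
        "config": dict(config),
        "seeds": list(seeds),
        "status": "planned",
    }


def one_factor_matrix(baseline: Dict[str, Any],
                      variations: Dict[str, Sequence[Any]],
                      seeds: Sequence[int]) -> List[Dict[str, Any]]:
    """Baseline first, then one run per (variable, value) that differs from it."""
    rows = [_row(0, None, baseline, seeds)]
    for name, values in variations.items():
        if name not in baseline:
            raise KeyError(f"{name!r} is not a baseline variable")
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline run, skip
            rows.append(_row(len(rows), name, {**baseline, name: value}, seeds))
    return rows


def changed_variables(baseline: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """Variables whose value differs from the baseline; more than one means split."""
    return sorted(k for k in baseline if config.get(k) != baseline[k])


# Hypothetical variables for illustration only.
baseline = {"lr": 1e-3, "augmentation": False, "depth": 12}
matrix = one_factor_matrix(baseline,
                           {"lr": [1e-3, 3e-4], "augmentation": [True]},
                           seeds=[0, 1, 2])
for row in matrix:
    print(row["run_id"], row["changed"], changed_variables(baseline, row["config"]))
```

An explicitly factorial design would need a different generator; this one deliberately refuses to produce runs that differ from the baseline in two places.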

Step 7 - Define Stop Conditions and Next Decisions

Write:

what result is sufficient to support the claim
what result falsifies or weakens the claim
what result triggers another ablation
what result means stop and write/report
compute budget ceiling
deadline constraints
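Stop conditions are easier to honor when they are written as an explicit rule before results arrive. The following sketch shows one way to encode such a rule; the function name, the outcome labels, and the idea of comparing a confidence interval against a minimum effect are all assumptions made for illustration, not a rule prescribed by this skill.

```python
def next_decision(effect: float, ci_half_width: float, min_effect: float,
                  gpu_hours_used: float, gpu_hours_budget: float) -> str:
    """Map an observed effect and its uncertainty to a pre-agreed decision."""
    lower = effect - ci_half_width
    upper = effect + ci_half_width
    if lower >= min_effect:
        return "claim_supported"      # sufficient: stop and write/report
    if upper < min_effect:
        return "claim_weakened"       # falsified or weakened: follow the fallback
    if gpu_hours_used >= gpu_hours_budget:
        return "budget_exhausted"     # ambiguous, but the compute ceiling is hit
    return "run_more"                 # ambiguous: trigger another ablation or more seeds


# Hypothetical numbers for illustration only.
print(next_decision(effect=2.0, ci_half_width=0.5, min_effect=1.0,
                    gpu_hours_used=10, gpu_hours_budget=100))
```

The result checks come before the budget check so that a clear outcome is never hidden behind a budget message.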

Step 8 - Reviewer Risk Check

Before finalizing, ask:

Would a reviewer complain that the baseline is weak?
Is the comparison fair?
Are seeds/repeats enough?
Does the experiment isolate the claimed mechanism?
Are metrics aligned with the claim?
Is there a confounder that could explain the result?
Would a negative result still teach something?

If any answer exposes a major weakness, update the design before execution.

Step 9 - Write the Experiment Plan

Use `references/report-template.md`.

If saving to a project and no path is given, use:

```text
docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md
```

If working inside a code repo or code worktree created by `init-python-project` or `new-workspace`, prefer:

```text
docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md
```

The final plan should be runnable by `run-experiment` and later reportable by `experiment-report-writer`.
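The default file name can be built programmatically so that dates and names stay consistent across plans. This is a small sketch: the `plan_path` helper and its lowercase, hyphenated slug rule are assumptions, and only the two directory conventions and the file name pattern come from this step.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def plan_path(short_name: str, in_code_repo: bool = False,
              today: Optional[date] = None) -> Path:
    """Build the default save path for an experiment plan."""
    day = (today or date.today()).isoformat()  # YYYY-MM-DD
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    return Path(folder) / f"experiment_plan_{day}_{slug}.md"


print(plan_path("LR Warmup Ablation", today=date(2025, 1, 15)).as_posix())
print(plan_path("LR Warmup Ablation", in_code_repo=True, today=date(2025, 1, 15)).as_posix())
```

Passing `today` explicitly keeps the helper deterministic in tests; in normal use it can be left out.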

Step 10 - Write Back to Project Memory

If the project uses `research-project-memory`, update:

`memory/evidence-board.md`: planned `EVD-###` items and `EXP-###` experiment families
`memory/claim-board.md`: linked claims, marking unsupported or planned claims honestly
`memory/risk-board.md`: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
`memory/action-board.md`: runnable next actions, including which experiment to launch first
relevant worktree `.agent/worktree-status.md`: experiment purpose and exit condition if a branch/worktree is involved

Use `planned` status for experiments that have not run. Do not record expected outcomes as observed evidence.
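Keeping expected and observed outcomes in separate fields is one way to make that rule hard to break. The sketch below is purely illustrative: the actual board format is defined by `research-project-memory` and is not shown here, so the `EvidenceItem` class, the three-digit ID check, the `observed` status label, and the one-line rendering are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceItem:
    """Hypothetical evidence entry; 'planned' until a real result is recorded."""
    evidence_id: str      # e.g. "EVD-001"
    experiment_id: str    # e.g. "EXP-001"
    claim: str
    expected: str
    status: str = "planned"
    observed: Optional[str] = None

    def __post_init__(self):
        # Assumes ### means exactly three digits.
        if not re.fullmatch(r"EVD-\d{3}", self.evidence_id):
            raise ValueError(f"bad evidence id: {self.evidence_id!r}")
        if not re.fullmatch(r"EXP-\d{3}", self.experiment_id):
            raise ValueError(f"bad experiment id: {self.experiment_id!r}")

    def record_result(self, result: str) -> None:
        # Only a real run result may move the item out of 'planned'.
        self.observed = result
        self.status = "observed"

    def to_line(self) -> str:
        result = self.observed if self.observed is not None else "not run yet"
        return (f"{self.evidence_id} ({self.experiment_id}) [{self.status}] "
                f"{self.claim} | expected: {self.expected} | observed: {result}")


item = EvidenceItem("EVD-001", "EXP-001",
                    claim="Component X improves metric M",
                    expected="higher M than baseline B")
print(item.to_line())
```

Because `expected` never flows into `observed`, a planned item cannot be mistaken for evidence when the board is read later.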

Final Sanity Check

Before finalizing:

claim and hypothesis are explicit
baseline is defined
independent variable is isolated
controls and nuisance variables are listed
metrics are tied to the question
run matrix is concrete
logging requirements are sufficient for reproduction
stop condition and decision rule are explicit
reviewer risks are stated
project memory is updated when the repo has `memory/`
