experiment-design-planner
Experiment Design Planner
Turn a research claim into an experiment plan that can actually answer it. This skill is for planning before running, not for reporting completed results.
Use this skill when:
- a user is about to run a new experiment or ablation
- a paper claim needs evidence
- baselines, metrics, controls, or datasets are unclear
- the user is changing too many variables at once
- cluster/compute time should not be wasted on ambiguous runs
- reviewer-proof evidence is needed before submission
Pair this skill with:
- `research-project-memory` when the experiment plan should become project-level evidence, risk, and action memory
- `run-experiment` after the design is ready to execute
- `experiment-report-writer` after results exist
- `paper-reviewer-simulator` to stress-test whether the evidence will satisfy reviewers
- `baseline-selection-audit` before finalizing the experiment matrix when baseline choice, fairness, or reviewer-proof comparisons need deeper review
- `figure-results-review` after plotted or tabulated results exist and need claim-support review
Skill Directory Layout
text
<installed-skill-dir>/
├── SKILL.md
└── references/
    ├── ablation-matrix.md
    ├── evidence-standards.md
    ├── metrics-and-controls.md
    └── report-template.md
Progressive Loading
- Always read `references/evidence-standards.md` and `references/metrics-and-controls.md`.
- Read `references/ablation-matrix.md` when the plan compares variants, components, baselines, hyperparameters, datasets, or model sizes.
- Use `references/report-template.md` when saving or returning a substantial experiment plan.
- If the target repo has `memory/`, update planned evidence, experiment families, risks, and actions using `research-project-memory` conventions.
- If the experiment depends on current baselines, benchmarks, or leaderboard conventions, verify current sources with web search.
Core Principles
- Start from the claim, not the command line.
- State the hypothesis before running experiments.
- Use a baseline before introducing a new method.
- Change one variable at a time unless the experiment is explicitly factorial.
- Define controls and nuisance variables before interpreting results.
- Make negative results useful by defining falsification and fallback decisions.
- Design the table or figure before running the experiment.
- Stop conditions matter: decide what result is enough to move on.
Step 1 - Define the Claim and Question
Extract:
- paper or project claim
- research question
- target audience: internal debugging, advisor update, paper evidence, rebuttal, benchmark claim
- expected output: Markdown plan, LaTeX experiment section outline, run matrix, or saved file
- experiment mode:
  - `single`: one controlled comparison
  - `ablation`: component or variable isolation
  - `benchmark`: compare methods across datasets/tasks
  - `theory`: empirical support for a theoretical prediction
  - `diagnostic`: understand a failure mode or surprising result
Rewrite vague goals into testable questions:
text
Vague: Does our method work?
Testable: Does component X improve metric M over baseline B on datasets D1/D2 under the same training budget?
Step 2 - State Hypotheses
Write:
- primary hypothesis
- alternative explanations
- expected metric direction and rough effect size
- falsification condition
- decision rule
If the user cannot state a falsification condition, the experiment is not ready.
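One way to make the readiness rule above concrete is a small record that refuses to count as ready while any Step 2 field is empty. This is only a sketch: the `Hypothesis` class and its field names are hypothetical and are not defined by this skill.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """Hypothetical container for the Step 2 fields (not part of the skill)."""

    primary: str
    alternatives: List[str] = field(default_factory=list)
    expected_effect: str = ""   # metric direction and rough effect size
    falsification: str = ""     # outcome that would count against the hypothesis
    decision_rule: str = ""     # what to do for each possible outcome

    def missing_fields(self) -> List[str]:
        """Return the names of Step 2 fields that are still empty."""
        required = {
            "primary": self.primary,
            "alternatives": self.alternatives,
            "expected_effect": self.expected_effect,
            "falsification": self.falsification,
            "decision_rule": self.decision_rule,
        }
        return [name for name, value in required.items() if not value]

    def is_ready(self) -> bool:
        """The experiment is ready only when every field is filled in."""
        return not self.missing_fields()


draft = Hypothesis(primary="Component X improves metric M over baseline B")
print(draft.is_ready())        # False: falsification and other fields are empty
print(draft.missing_fields())
```

Listing the missing fields, rather than returning a bare boolean, gives the user a direct prompt for what still has to be written before any run is launched.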
Step 3 - Define Evidence Standard
Read `references/evidence-standards.md`.
Decide what evidence is needed:
- one table, one curve, one ablation, one qualitative example, one theorem-aligned diagnostic, or a benchmark suite
- number of datasets/tasks
- number of seeds or repeats
- required baselines
- acceptable variance
- whether statistical testing or confidence intervals are needed
- whether results must support a paper claim or only guide next steps
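When seeds, acceptable variance, and confidence intervals are part of the evidence standard, it can help to fix the summary computation before any run exists. The sketch below assumes per-seed scores are plain floats and uses a two-sided 95% Student-t interval. The function name `summarize_seeds` is hypothetical, and the input values are placeholders, not results.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def summarize_seeds(scores):
    """Return n, mean, sample std, and a 95% confidence interval for one metric."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate variance")
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only tabulates critical values for 2 to 10 seeds")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return {"n": n, "mean": mean, "std": std, "ci95": (mean - half_width, mean + half_width)}


# Placeholder inputs only, to show the shape of the output.
print(summarize_seeds([1.0, 2.0, 3.0]))
```

With very few seeds the interval is wide, which is itself useful planning information: if the expected effect size is smaller than the interval half-width, the plan likely needs more repeats.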
Step 4 - Choose Baselines and Controls
Identify:
- primary baseline
- strongest prior method or current SOTA, if relevant
- simple baseline
- ablation baseline
- oracle or upper bound, if useful
- controlled variables
- nuisance variables
If no baseline exists, make the first experiment a baseline-establishment experiment.
Step 5 - Choose Metrics and Logging
Read `references/metrics-and-controls.md`.
For each metric, specify:
- definition
- direction
- aggregation
- split
- variance reporting
- failure interpretation
- why it answers the question
Define required logging:
- command
- config path
- git commit
- dataset version
- seed
- hyperparameters
- hardware/runtime
- metrics
- artifacts: tables, figures, checkpoints, logs
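The logging list above could be captured as one JSON-serializable record per run. The following is a minimal sketch, not the format `run-experiment` expects: the function names and dictionary keys are assumptions, and the git commit falls back to `"unknown"` when the code is not running inside a git checkout.

```python
import json
import platform
import subprocess
import sys
import time


def current_git_commit() -> str:
    """Return the current commit hash, or "unknown" if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(config_path, dataset_version, seed, hyperparameters):
    """Collect the required logging fields for one run (hypothetical schema)."""
    return {
        "command": " ".join(sys.argv),
        "config_path": config_path,
        "git_commit": current_git_commit(),
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparameters": hyperparameters,
        "hardware": platform.platform(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "metrics": {},     # filled in when the run finishes
        "artifacts": [],   # paths to tables, figures, checkpoints, logs
    }


record = build_run_record("configs/base.yaml", "v1", seed=0, hyperparameters={"lr": 0.001})
print(json.dumps(record, indent=2))
```

The config path, dataset version, and hyperparameters in the usage line are placeholders. Writing the record at launch and updating `metrics` and `artifacts` at the end keeps failed runs traceable too.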
Step 6 - Build Run Matrix
Read `references/ablation-matrix.md` when there is more than one run.
Create a run table with:
- run ID
- changed variable
- fixed controls
- dataset/split
- metric
- seed/repeats
- expected result
- status
- output path
Split experiments if a run changes more than one conceptual variable.
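The splitting rule above can be checked mechanically when runs are described as overrides on a baseline config. This is a sketch under that assumption: `build_run_matrix` and the config keys are hypothetical, it does not handle explicitly factorial designs, and several columns from the list above (dataset/split, metric, expected result, output path) are left out for brevity.

```python
def changed_variables(baseline, config):
    """Return the sorted names of variables whose values differ from the baseline."""
    keys = sorted(set(baseline) | set(config))
    return [key for key in keys if baseline.get(key) != config.get(key)]


def build_run_matrix(baseline, variants, seeds):
    """Build run-table rows, rejecting any run that changes more than one variable."""
    rows = []
    for run_id, overrides in variants.items():
        config = {**baseline, **overrides}
        changed = changed_variables(baseline, config)
        if len(changed) > 1:
            raise ValueError(f"{run_id} changes {changed}; split it into separate runs")
        rows.append({
            "run_id": run_id,
            "changed_variable": changed[0] if changed else "none (baseline)",
            "fixed_controls": sorted(key for key in config if key not in changed),
            "seeds": list(seeds),
            "status": "planned",
        })
    return rows


# Placeholder configuration, for illustration only.
baseline = {"lr": 0.1, "augmentation": False}
variants = {"B0": {}, "A1": {"augmentation": True}}
for row in build_run_matrix(baseline, variants, seeds=[0, 1, 2]):
    print(row)
```

Raising an error instead of silently accepting a multi-variable run forces the split to happen at planning time, before compute is spent.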
Step 7 - Define Stop Conditions and Next Decisions
Write:
- what result is sufficient to support the claim
- what result falsifies or weakens the claim
- what result triggers another ablation
- what result means stop and write/report
- compute budget ceiling
- deadline constraints
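Writing the items above as an explicit function before running anything makes it harder to move the goalposts afterwards. The sketch below is one possible encoding, assuming a single scalar effect where higher is better. The function name, thresholds, and return labels are all hypothetical, and deadline constraints are not modeled.

```python
def next_decision(effect, support_threshold, falsify_threshold, compute_used, compute_budget):
    """Map an observed effect and compute usage to a pre-registered next step."""
    if falsify_threshold >= support_threshold:
        raise ValueError("falsify_threshold must be below support_threshold")
    if effect >= support_threshold:
        return "stop_and_report"          # result is sufficient to support the claim
    if effect <= falsify_threshold:
        return "claim_weakened"           # result falsifies or weakens the claim
    if compute_used >= compute_budget:
        return "stop_budget_exhausted"    # ambiguous result, but the ceiling is reached
    return "run_another_ablation"         # ambiguous result with budget remaining


# Placeholder thresholds and budget, for illustration only.
print(next_decision(effect=0.4, support_threshold=1.0, falsify_threshold=0.0,
                    compute_used=10, compute_budget=100))
```

The result checks come before the budget check so that a clear outcome is reported as such even if it arrives on the last run the budget allows.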
Step 8 - Reviewer Risk Check
Before finalizing, ask:
- Would a reviewer complain that the baseline is weak?
- Is the comparison fair?
- Are seeds/repeats enough?
- Does the experiment isolate the claimed mechanism?
- Are metrics aligned with the claim?
- Is there a confounder that could explain the result?
- Would a negative result still teach something?
If the answer exposes a major weakness, update the design before execution.
Step 9 - Write the Experiment Plan
Use `references/report-template.md`.
If saving to a project and no path is given, use:
text
docs/experiments/experiment_plan_YYYY-MM-DD_<short-name>.md
If working inside a code repo or code worktree created by `init-python-project` / `new-workspace`, prefer:
text
docs/reports/experiment_plan_YYYY-MM-DD_<short-name>.md
The final plan should be runnable by `run-experiment` and later reportable by `experiment-report-writer`.
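The two default paths above follow the same naming pattern, so they can be built by one helper. This is a sketch only: `plan_path` and its `in_code_repo` flag are hypothetical, and lowercasing the short name into a hyphenated slug is an assumption, since the skill does not say how `<short-name>` should be normalized.

```python
import re
from datetime import date
from pathlib import Path


def plan_path(short_name, in_code_repo=False, today=None):
    """Build the default save path for an experiment plan."""
    day = today or date.today()
    # Assumption: normalize the short name to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    folder = "docs/reports" if in_code_repo else "docs/experiments"
    return Path(folder) / f"experiment_plan_{day.isoformat()}_{slug}.md"


print(plan_path("Augmentation ablation"))
print(plan_path("Augmentation ablation", in_code_repo=True))
```

Passing `today` explicitly keeps the helper deterministic in tests, while the default uses the current date.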
Step 10 - Write Back to Project Memory
If the project uses `research-project-memory`, update:
- `memory/evidence-board.md`: planned `EVD-###` items and `EXP-###` experiment families
- `memory/claim-board.md`: linked claims, marking unsupported or planned claims honestly
- `memory/risk-board.md`: baseline, mechanism, metric, seed, compute, and reviewer risks exposed by the design
- `memory/action-board.md`: runnable next actions, including which experiment to launch first
- relevant worktree `.agent/worktree-status.md`: experiment purpose and exit condition if a branch/worktree is involved
Use status `planned` for experiments that have not run. Do not record expected outcomes as observed evidence.
Final Sanity Check
Before finalizing:
- claim and hypothesis are explicit
- baseline is defined
- independent variable is isolated
- controls and nuisance variables are listed
- metrics are tied to the question
- run matrix is concrete
- logging requirements are sufficient for reproduction
- stop condition and decision rule are explicit
- reviewer risks are stated
- project memory is updated when the repo has `memory/`