create-experiment-design
Create Experiment Design
Overview
Design A/B tests and experiments with scientific rigor. Includes a falsifiable hypothesis, pre-registered analysis plan, sample size calculation, guardrail metrics, and clear decision criteria to prevent p-hacking and HARKing.
Workflow
- Read product context — Scan `.chalk/docs/product/` for the product profile, relevant PRDs, and any existing experiment docs. Check for a metrics framework that defines standard metrics and their baseline values.
- Define the hypothesis — Parse `$ARGUMENTS` and work with the user to formulate a hypothesis in the format: "If we [change], then [primary metric] will [direction] by [minimum detectable effect], because [rationale]." The hypothesis must be falsifiable.
- Select metrics — Define:
  - Primary metric: The single metric that determines success or failure. Must be measurable within the experiment duration.
  - Secondary metrics: Additional metrics to monitor for deeper understanding. These do not determine the outcome.
  - Guardrail metrics: Metrics that must NOT degrade (e.g., error rate, page load time, support ticket volume). If a guardrail is breached, the experiment is stopped regardless of the primary metric.
- Calculate sample size — Based on: baseline conversion rate, minimum detectable effect (MDE), statistical significance level (default: 95%), statistical power (default: 80%). State the required sample size per variant.
- Estimate duration — Based on current traffic or user volume, estimate how many days or weeks the experiment needs to run to reach the required sample size. Flag if the duration is impractically long and suggest adjusting the MDE.
- Design the variants — Describe the control and treatment(s). Each variant must differ in exactly one variable to isolate causation. If multiple changes are bundled, note the confounding risk.
- Pre-register the analysis plan — Document before the experiment starts: statistical test to use (e.g., chi-squared, t-test, Mann-Whitney), one-tailed vs. two-tailed, how to handle multiple comparisons (Bonferroni correction), and when to check results (no peeking before reaching sample size).
- Define decision criteria — State explicitly: what outcome leads to ship, iterate, or kill. Include the scenario where the result is inconclusive (effect size smaller than MDE).
- Determine the next file number — Read the filenames in `.chalk/docs/product/` to find the highest-numbered file. Use highest + 1.
- Write the experiment doc — Save to `.chalk/docs/product/<n>_experiment_<name>.md`.
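The sample-size and duration steps above can be sketched as follows. This is a minimal illustration assuming a two-tailed two-proportion z-test; the 10% baseline, 2-point MDE, and 500 users per variant per day are hypothetical figures, not values from this workflow:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users per variant to detect an absolute lift of `mde`
    over `baseline` with a two-tailed two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde ** 2)

# Hypothetical inputs: 10% baseline conversion, 2-point absolute MDE.
n = sample_size_per_variant(baseline=0.10, mde=0.02)   # -> 3841 per variant
days = math.ceil(n / 500)   # at a hypothetical 500 users/variant/day -> 8 days
```

If the resulting duration is impractical for the product's traffic, the workflow's advice applies: raise the MDE and recompute rather than shortening the run.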
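For the pre-registered analysis step, one stdlib-only sketch of the significance test is shown below, assuming a two-tailed two-proportion z-test on conversion counts; the counts are hypothetical, and a chi-squared or Mann-Whitney test would follow the same pre-registration discipline:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Pre-registered plan: one primary comparison at alpha = 0.05; any k
# secondary comparisons use a Bonferroni-corrected threshold of alpha / k.
# The counts below are hypothetical end-of-experiment results.
z, p = two_proportion_z_test(conv_a=384, n_a=3841, conv_b=461, n_b=3841)
ship = p < 0.05   # checked once, only after the sample size was reached
```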
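The file-numbering and save steps can be sketched like this; the helper name and the assumption that every numbered doc in the directory starts with an `<n>_` prefix are mine, not part of the workflow:

```python
import re
from pathlib import Path

def next_experiment_path(name, docs_dir=".chalk/docs/product"):
    """Return the save path for a new experiment doc, numbered one higher
    than the highest-numbered existing file in the docs directory."""
    docs = Path(docs_dir)
    numbers = [int(m.group(1))
               for f in docs.glob("*.md")
               if (m := re.match(r"(\d+)_", f.name))]
    n = max(numbers, default=0) + 1
    return docs / f"{n}_experiment_{name}.md"
```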
Output
- File: `.chalk/docs/product/<n>_experiment_<name>.md`
- Format: Markdown with clearly labeled sections
- Key sections: Hypothesis, Primary Metric, Secondary Metrics, Guardrail Metrics, Sample Size Calculation, Duration Estimate, Variant Descriptions, Analysis Plan (pre-registered), Decision Criteria, Risks and Mitigations
Anti-patterns
- No pre-registered analysis plan — Choosing the statistical test after seeing the data is p-hacking. The analysis plan must be written before the experiment starts.
- Peeking at results — Checking results daily and stopping when p < 0.05 inflates false positive rates dramatically. Commit to the sample size and do not check early without a sequential testing correction.
- Too many metrics — Testing 20 metrics guarantees at least one will be "significant" by chance. One primary metric determines the outcome. Everything else is exploratory.
- Bundling changes — Testing a new UI, new copy, and new pricing simultaneously makes it impossible to know which change drove the result. Isolate variables.
- Missing guardrails — An experiment that increases conversion by 5% but doubles error rates is not a success. Always define guardrail metrics.
- Ignoring practical significance — A statistically significant 0.1% improvement may not be worth the engineering cost. Define the minimum effect size that justifies shipping.
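The cost of the peeking anti-pattern is easy to demonstrate with a small simulation: both arms of an A/A test draw from the same conversion rate, yet checking for p < 0.05 after every batch and stopping at the first "significant" result fires far more often than the nominal 5%. All parameters below are illustrative:

```python
import random
from statistics import NormalDist

def peeked_false_positive_rate(n_experiments=500, n_checks=20,
                               users_per_check=200, p_base=0.10, seed=7):
    """Fraction of A/A experiments declared 'significant' when results
    are checked after every batch and the test stops at p < 0.05."""
    rng = random.Random(seed)
    nd = NormalDist()
    false_positives = 0
    for _ in range(n_experiments):
        conv, n = [0, 0], [0, 0]
        for _ in range(n_checks):
            for arm in (0, 1):   # both arms draw from the SAME true rate
                conv[arm] += sum(rng.random() < p_base
                                 for _ in range(users_per_check))
                n[arm] += users_per_check
            p_pool = (conv[0] + conv[1]) / (n[0] + n[1])
            se = (p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1])) ** 0.5
            z = (conv[1] / n[1] - conv[0] / n[0]) / se
            if 2 * (1 - nd.cdf(abs(z))) < 0.05:   # "significant" -> stop early
                false_positives += 1
                break
    return false_positives / n_experiments

rate = peeked_false_positive_rate()   # well above the nominal 0.05
```

A single pre-committed check at the planned sample size, or a proper sequential testing correction, keeps the false-positive rate at its nominal level.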