Design A/B tests and experiments with scientific rigor. Includes a falsifiable hypothesis, pre-registered analysis plan, sample size calculation, guardrail metrics, and clear decision criteria to prevent p-hacking and HARKing.
- Read product context — Scan for the product profile, relevant PRDs, and any existing experiment docs. Check for a metrics framework that defines standard metrics and their baseline values.
- Define the hypothesis — Parse the user's request and work with them to formulate a hypothesis in the format: "If we [change], then [primary metric] will [direction] by [minimum detectable effect], because [rationale]." The hypothesis must be falsifiable.
- Select metrics — Define:
  - Primary metric: The single metric that determines success or failure. Must be measurable within the experiment duration.
  - Secondary metrics: Additional metrics to monitor for deeper understanding. These do not determine the outcome.
  - Guardrail metrics: Metrics that must NOT degrade (e.g., error rate, page load time, support ticket volume). If a guardrail is breached, the experiment is stopped regardless of the primary metric.
- Calculate sample size — Based on: baseline conversion rate, minimum detectable effect (MDE), statistical significance level (default: 95%), and statistical power (default: 80%). State the required sample size per variant.
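The per-variant sample size for a two-proportion test can be sketched with the standard normal-approximation formula. A minimal example using only the Python standard library (the function name and the example rates are illustrative, not part of the skill):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-proportion test (normal approximation).

    baseline: control conversion rate, e.g. 0.10
    mde: minimum detectable effect as an absolute lift, e.g. 0.02
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return ceil(n)

# 10% baseline, detect an absolute lift of 2 points at 95% significance / 80% power
print(sample_size_per_variant(0.10, 0.02))
```

Note that halving the MDE roughly quadruples the required sample size, which is why the duration step below may force an MDE adjustment.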
- Estimate duration — Based on current traffic or user volume, estimate how many days or weeks the experiment needs to run to reach the required sample size. Flag if the duration is impractically long and suggest adjusting the MDE.
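The duration estimate is simple arithmetic once the sample size is known. A sketch, assuming traffic is split evenly across variants (the traffic figure in the example is hypothetical):

```python
from math import ceil

def experiment_days(n_per_variant: int, num_variants: int,
                    eligible_users_per_day: int) -> int:
    """Days needed for every variant to reach its required sample size,
    assuming eligible traffic is split evenly across variants."""
    total_needed = n_per_variant * num_variants
    return ceil(total_needed / eligible_users_per_day)

# 3,839 users per variant, control + one treatment, 500 eligible users/day
print(experiment_days(3839, 2, 500))  # prints 16
```

If the result is many weeks, the practical levers are raising the MDE, widening eligibility, or dropping a treatment arm.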
- Design the variants — Describe the control and treatment(s). Each variant must differ in exactly one variable to isolate causation. If multiple changes are bundled, note the confounding risk.
- Pre-register the analysis plan — Document before the experiment starts: statistical test to use (e.g., chi-squared, t-test, Mann-Whitney), one-tailed vs. two-tailed, how to handle multiple comparisons (Bonferroni correction), and when to check results (no peeking before reaching sample size).
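As one concrete instance of such a plan, here is a two-tailed two-proportion z-test with a Bonferroni-adjusted threshold (for a 2×2 conversion table this is equivalent to the chi-squared test without continuity correction, since z² = χ²; the counts are illustrative):

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-tailed p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))          # two-tailed

alpha = 0.05
num_comparisons = 2                        # e.g. two treatments vs. control
adjusted_alpha = alpha / num_comparisons   # Bonferroni correction

p = two_proportion_z_test(384, 3839, 455, 3839)
print(p, p < adjusted_alpha)
```

Pre-registering this exact call (test, tailedness, correction, and the single evaluation point) is what makes the no-peeking rule enforceable.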
- Define decision criteria — State explicitly which outcomes lead to ship, iterate, or kill. Include the scenario where the result is inconclusive (effect size smaller than the MDE).
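One possible mapping from outcomes to decisions, written as a sketch; the exact rules are the team's call and should be pre-registered alongside the analysis plan:

```python
def decide(p_value: float, observed_lift: float, mde: float,
           guardrail_breached: bool, alpha: float = 0.05) -> str:
    """Map pre-registered criteria to a ship / iterate / kill decision."""
    if guardrail_breached:
        return "kill"                      # guardrails override everything
    if p_value < alpha and observed_lift >= mde:
        return "ship"                      # significant and meets the MDE
    if p_value < alpha and observed_lift < 0:
        return "kill"                      # statistically significant harm
    return "iterate"                       # inconclusive or below the MDE

print(decide(p_value=0.01, observed_lift=0.025, mde=0.02,
             guardrail_breached=False))   # prints ship
```

Writing the criteria as an executable rule like this removes post-hoc wiggle room, which is the point of the step.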
- Determine the next file number — Read the filenames in .chalk/docs/product/ to find the highest numbered file, and use the next number in sequence.
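A minimal sketch of this numbering step, assuming filenames carry a leading number followed by an underscore (as in the `<n>_experiment_<name>.md` pattern below; the function name is illustrative):

```python
import re
from pathlib import Path

def next_file_number(docs_dir: str = ".chalk/docs/product") -> int:
    """Highest leading number among files like 07_experiment_foo.md, plus one."""
    highest = 0
    for path in Path(docs_dir).glob("*.md"):
        match = re.match(r"(\d+)_", path.name)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1
```

An empty or missing directory yields 1, so the first doc starts the sequence.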
- Write the experiment doc — Save to .chalk/docs/product/<n>_experiment_<name>.md.