ab-test-setup

A/B Test Setup

1️⃣ Purpose & Scope

Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.
  • Prevents "peeking"
  • Enforces statistical power
  • Blocks invalid hypotheses

2️⃣ Pre-Requisites

You must have:
  • A clear user problem
  • Access to an analytics source
  • Roughly estimated traffic volume

Hypothesis Quality Checklist

A valid hypothesis includes:
  • Observation or evidence
  • Single, specific change
  • Directional expectation
  • Defined audience
  • Measurable success criteria

3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:
  • Present the final hypothesis
  • Specify:
    • Target audience
    • Primary metric
    • Expected direction of effect
    • Minimum Detectable Effect (MDE)
Ask explicitly:
“Is this the final hypothesis we are committing to for this test?”
Do NOT proceed until confirmed.
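
A minimal sketch of how the lock could be enforced in code, assuming the hypothesis is captured as a small record before any variant design starts. The `Hypothesis` class, its field names, and `lock_problems` are hypothetical names chosen for illustration; they are not part of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)  # frozen: the record cannot be mutated once created
class Hypothesis:
    observation: str         # evidence that motivates the test
    change: str              # the single, specific change being tested
    audience: str            # target audience
    primary_metric: str      # the one metric that decides success
    expected_direction: str  # "increase" or "decrease"
    mde_relative: float      # MDE as a relative change, e.g. 0.10 for 10%


def lock_problems(h: Hypothesis) -> list[str]:
    """Return a list of reasons the hypothesis cannot be locked yet."""
    problems = []
    for f in fields(h):
        value = getattr(h, f.name)
        if isinstance(value, str) and not value.strip():
            problems.append(f"{f.name} is empty")
    if h.expected_direction.strip() and h.expected_direction not in ("increase", "decrease"):
        problems.append("expected_direction must be 'increase' or 'decrease'")
    if h.mde_relative <= 0:
        problems.append("mde_relative must be positive")
    return problems
```

An empty list means the record is complete enough to ask the confirmation question; it does not replace the explicit confirmation itself.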

4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:
  • Traffic stability
  • User independence
  • Metric reliability
  • Randomization quality
  • External factors (seasonality, campaigns, releases)
If assumptions are weak or violated:
  • Warn the user
  • Recommend delaying or redesigning the test
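
One way to probe the randomization-quality assumption is a sample ratio mismatch check: compare the observed split of users against the intended split with a chi-square test. The sketch below is an illustration under stated assumptions, not a prescribed method: it assumes exactly two arms, uses only the standard library, and the name `srm_p_value` and the 0.001 alert threshold in the usage note are hypothetical choices.

```python
import math


def srm_p_value(control_users: int, variant_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test on the traffic split (1 degree of freedom)."""
    total = control_users + variant_users
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (variant_users - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

A very small p-value (for example below 0.001) on a split that was meant to be 50/50 suggests the assignment mechanism or the tracking is biased, which is a reason to warn the user and pause the test.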

5️⃣ Test Type Selection

Choose the simplest valid test:
  • A/B Test – single change, two variants
  • A/B/n Test – multiple variants, higher traffic required
  • Multivariate Test (MVT) – interaction effects, very high traffic
  • Split URL Test – major structural changes
Default to A/B unless there is a clear reason otherwise.

6️⃣ Metrics Definition

Primary Metric (Mandatory)

  • Single metric used to evaluate success
  • Directly tied to the hypothesis
  • Pre-defined and frozen before launch

Secondary Metrics

  • Provide context
  • Explain why results occurred
  • Must not override the primary metric

Guardrail Metrics

  • Metrics that must not degrade
  • Used to prevent harmful wins
  • Trigger test stop if significantly negative

7️⃣ Sample Size & Duration

Define upfront:
  • Baseline rate
  • MDE
  • Significance level (typically α = 0.05, i.e. 95% confidence)
  • Statistical power (typically 80%)
Estimate:
  • Required sample size per variant
  • Expected test duration
Do NOT proceed without a realistic sample size estimate.
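
As a rough sketch of the estimate, the function below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. It assumes the primary metric is a conversion rate and that the MDE is expressed as a relative change; `alpha = 0.05` corresponds to the 95% level listed above. The names `sample_size_per_variant`, `estimated_duration_days`, and `daily_eligible_users` are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, mde_relative: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in each variant to detect a relative change of mde_relative."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def estimated_duration_days(n_per_variant: int, daily_eligible_users: int,
                            variants: int = 2) -> int:
    """Days needed to fill every variant, assuming stable traffic split evenly."""
    return math.ceil(n_per_variant * variants / daily_eligible_users)
```

For a 5% baseline and a 10% relative MDE this gives on the order of 31,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE has to be fixed before the duration can be judged realistic.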

8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation only if all are true:
  • Hypothesis is locked
  • Primary metric is frozen
  • Sample size is calculated
  • Test duration is defined
  • Guardrails are set
  • Tracking is verified
If any item is missing, stop and resolve it.
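
The gate can be expressed as a simple all-or-nothing check. This is a minimal sketch; the item keys and the function name `unmet_readiness_items` are hypothetical, and how each flag gets set is left to the team.

```python
from __future__ import annotations

# The six conditions from the checklist above, as hypothetical keys.
READINESS_ITEMS = (
    "hypothesis_locked",
    "primary_metric_frozen",
    "sample_size_calculated",
    "test_duration_defined",
    "guardrails_set",
    "tracking_verified",
)


def unmet_readiness_items(status: dict[str, bool]) -> list[str]:
    """Return the checklist items that are missing or false; empty means go."""
    return [item for item in READINESS_ITEMS if not status.get(item, False)]
```

Treating a missing key the same as `False` keeps the gate strict: an item nobody reported on blocks implementation just like an item that failed.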

Running the Test

During the Test

DO:
  • Monitor technical health
  • Document external factors
DO NOT:
  • Stop early due to “good-looking” results
  • Change variants mid-test
  • Add new traffic sources
  • Redefine success criteria
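
To illustrate why stopping on "good-looking" results is harmful, the sketch below simulates A/A tests (both arms share the same true rate, so any significant result is a false positive) and compares two policies: declaring a winner at any of several interim looks versus testing once at the planned sample size. All parameter values and function names are assumptions chosen for illustration, and the two-proportion z-test is just one possible analysis.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z-test, two-sided."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(simulations: int = 400, users_per_arm: int = 1000,
                         looks: int = 10, rate: float = 0.10,
                         alpha: float = 0.05, seed: int = 7):
    """Return (rate with peeking at every look, rate with one final look)."""
    rng = random.Random(seed)
    look_every = users_per_arm // looks
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(simulations):
        conv_a = conv_b = 0
        significant_at_some_look = False
        for n in range(1, users_per_arm + 1):
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if n % look_every == 0 and two_sided_p_value(conv_a, n, conv_b, n) < alpha:
                significant_at_some_look = True
        final_p = two_sided_p_value(conv_a, users_per_arm, conv_b, users_per_arm)
        any_look_hits += int(significant_at_some_look)
        final_look_hits += int(final_p < alpha)
    return any_look_hits / simulations, final_look_hits / simulations
```

Because the final look is also one of the interim looks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it ends up well above the nominal alpha.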

Analyzing Results

Analysis Discipline

When interpreting results:
  • Do NOT generalize beyond the tested population
  • Do NOT claim causality beyond the tested change
  • Do NOT override guardrail failures
  • Separate statistical significance from business judgment

Interpretation Outcomes

Result                  Action
Significant positive    Consider rollout
Significant negative    Reject variant, document learning
Inconclusive            Consider more traffic or bolder change
Guardrail failure       Do not ship, even if primary wins
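
The outcome rules above can be sketched as a small decision function. The function name, its parameters, and the use of a p-value threshold are assumptions made for illustration; the guardrail check comes first so that a primary-metric win can never override it.

```python
def recommended_action(primary_p_value: float, observed_lift: float,
                       guardrail_failed: bool,
                       expected_direction: str = "increase",
                       alpha: float = 0.05) -> str:
    """Map a test result to the action listed in the outcome rules."""
    if guardrail_failed:
        return "Do not ship, even if primary wins"
    if primary_p_value >= alpha:
        return "Consider more traffic or bolder change"
    if expected_direction == "increase":
        moved_as_expected = observed_lift > 0
    else:
        moved_as_expected = observed_lift < 0
    if moved_as_expected:
        return "Consider rollout"
    return "Reject variant, document learning"
```

The return value is a recommendation only. Whether a statistically significant result is worth shipping remains a separate business judgment.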

Documentation & Learning

Test Record (Mandatory)

Document:
  • Hypothesis
  • Variants
  • Metrics
  • Sample size: planned vs. achieved
  • Results
  • Decision
  • Learnings
  • Follow-up ideas
Store records in a shared, searchable location to avoid repeated failures.

Refusal Conditions (Safety)

Refuse to proceed if:
  • Baseline rate is unknown and cannot be estimated
  • Traffic is insufficient to detect the MDE
  • Primary metric is undefined
  • Multiple variables are changed without proper design
  • Hypothesis cannot be clearly stated
Explain why and recommend next steps.

Key Principles (Non-Negotiable)

  • One hypothesis per test
  • One primary metric
  • Commit before launch
  • No peeking
  • Learning over winning
  • Statistical rigor first

Final Reminder

A/B testing is not about proving ideas right. It is about learning the truth with confidence.
If you feel tempted to rush, simplify, or “just try it” — that is the signal to slow down and re-check the design.