ab-test-setup

A/B Test Setup

1️⃣ Purpose & Scope

Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.

Prevents "peeking"
Enforces statistical power
Blocks invalid hypotheses

2️⃣ Pre-Requisites

You must have:

A clear user problem
Access to an analytics source
A rough estimate of traffic volume

Hypothesis Quality Checklist

A valid hypothesis includes:

Observation or evidence
Single, specific change
Directional expectation
Defined audience
Measurable success criteria

3️⃣ Hypothesis Lock (Hard Gate)

Before designing variants or metrics, you MUST:

Present the final hypothesis
Specify:
- Target audience
- Primary metric
- Expected direction of effect
- Minimum Detectable Effect (MDE)

Ask explicitly:

“Is this the final hypothesis we are committing to for this test?”

Do NOT proceed until confirmed.
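
The lock can be sketched as a small record plus a check that refuses to pass until every field is present and the hypothesis has been confirmed. This is only an illustration: the class name `Hypothesis`, its field names, and the helpers `missing_fields` and `is_locked` are hypothetical, and the MDE is assumed to be expressed as a relative lift.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical record of what must be fixed before variants are designed."""
    statement: str
    target_audience: str
    primary_metric: str
    expected_direction: str  # assumed vocabulary: "increase" or "decrease"
    mde: float               # assumed to be relative, e.g. 0.10 for a 10% lift
    confirmed: bool = False  # set only after the explicit confirmation question


def missing_fields(h: Hypothesis) -> list[str]:
    """Return the names of required fields that are empty or invalid."""
    missing = []
    for name in ("statement", "target_audience", "primary_metric"):
        if not getattr(h, name).strip():
            missing.append(name)
    if h.expected_direction not in ("increase", "decrease"):
        missing.append("expected_direction")
    if h.mde <= 0:
        missing.append("mde")
    return missing


def is_locked(h: Hypothesis) -> bool:
    """A hypothesis is locked only when it is complete AND explicitly confirmed."""
    return h.confirmed and not missing_fields(h)
```

Because the dataclass is frozen, changing any field after confirmation requires creating a new record, which makes a silent mid-test redefinition harder.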

4️⃣ Assumptions & Validity Check (Mandatory)

Explicitly list assumptions about:

Traffic stability
User independence
Metric reliability
Randomization quality
External factors (seasonality, campaigns, releases)

If assumptions are weak or violated:

Warn the user
Recommend delaying or redesigning the test
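
One way to probe the randomization-quality assumption is a sample ratio mismatch (SRM) check, which compares observed group sizes against the intended split. The source does not name this technique, so treat the sketch below as one possible check rather than the prescribed one. The function name `srm_p_value` and the 50/50 default are assumptions. It uses a chi-square goodness-of-fit test with one degree of freedom, for which the p-value equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def srm_p_value(control_n: int, variant_n: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-group split."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Assumed convention: a very small p-value (for example below 0.001)
# suggests the assignment mechanism is not producing the intended split.
if srm_p_value(5000, 5400) < 0.001:
    print("Possible sample ratio mismatch: check randomization before trusting results")
```

A failed check is a reason to warn and to investigate assignment and tracking, in line with the guidance above to delay or redesign the test.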

5️⃣ Test Type Selection

Choose the simplest valid test:

A/B Test – single change, two variants
A/B/n Test – multiple variants, higher traffic required
Multivariate Test (MVT) – interaction effects, very high traffic
Split URL Test – major structural changes

Default to A/B unless there is a clear reason otherwise.

6️⃣ Metrics Definition

Primary Metric (Mandatory)

Single metric used to evaluate success
Directly tied to the hypothesis
Pre-defined and frozen before launch

Secondary Metrics

Provide context
Explain why results occurred
Must not override the primary metric

Guardrail Metrics

Metrics that must not degrade
Used to prevent harmful wins
Stop the test if a guardrail degrades significantly

7️⃣ Sample Size & Duration

Define upfront:

Baseline rate
MDE
Significance level (typically α = 0.05, i.e. 95% confidence)
Statistical power (typically 80%)

Estimate:

Required sample size per variant
Expected test duration

Do NOT proceed without a realistic sample size estimate.
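
As a rough illustration of how these inputs combine, the sketch below uses the standard normal-approximation formula for a two-sided test of two proportions. It assumes the primary metric is a conversion rate, the MDE is relative, and traffic is split evenly. The function names `sample_size_per_variant` and `duration_days` are hypothetical, and a dedicated power-analysis tool may give slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate if the variant hits the MDE exactly
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, variants: int, daily_visitors: int) -> int:
    """Days needed if all eligible daily visitors are split evenly across variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
print(n, duration_days(n, variants=2, daily_visitors=4000))
```

With a 5% baseline and a 10% relative MDE, this formula lands on the order of 31,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE has to be realistic for the available traffic. In practice the duration is often rounded up to whole weeks to cover weekly seasonality.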

8️⃣ Execution Readiness Gate (Hard Stop)

You may proceed to implementation only if all are true:

Hypothesis is locked
Primary metric is frozen
Sample size is calculated
Test duration is defined
Guardrails are set
Tracking is verified

If any item is missing, stop and resolve it.
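
The gate amounts to an all-or-nothing checklist. A minimal sketch, assuming the status of each item is tracked as a boolean under hypothetical key names:

```python
from __future__ import annotations

# Hypothetical keys mirroring the six checklist items above.
READINESS_ITEMS = (
    "hypothesis_locked",
    "primary_metric_frozen",
    "sample_size_calculated",
    "test_duration_defined",
    "guardrails_set",
    "tracking_verified",
)


def readiness_blockers(status: dict[str, bool]) -> list[str]:
    """Return every checklist item that is missing or false; empty means go."""
    return [item for item in READINESS_ITEMS if not status.get(item, False)]


status = {item: True for item in READINESS_ITEMS}
status["tracking_verified"] = False
blockers = readiness_blockers(status)
if blockers:
    print("Stop and resolve:", ", ".join(blockers))
```

Treating a missing key the same as `False` keeps the gate conservative: an item nobody has checked counts as a blocker.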

Running the Test

During the Test

DO:

Monitor technical health
Document external factors

DO NOT:

Stop early due to “good-looking” results
Change variants mid-test
Add new traffic sources
Redefine success criteria

Analyzing Results

Analysis Discipline

When interpreting results:

Do NOT generalize beyond the tested population
Do NOT claim causality beyond the tested change
Do NOT override guardrail failures
Separate statistical significance from business judgment

Interpretation Outcomes

Result	Action
Significant positive	Consider rollout
Significant negative	Reject variant, document learning
Inconclusive	Consider more traffic or bolder change
Guardrail failure	Do not ship, even if primary wins
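
The table can be read as a decision function in which a guardrail failure takes precedence over everything else. The sketch below is illustrative only: `recommend_action` and its parameters are hypothetical, it assumes a higher primary metric is better, and it assumes the p-value comes from the pre-registered analysis at the planned sample size.

```python
def recommend_action(p_value: float, observed_lift: float,
                     guardrail_failed: bool, alpha: float = 0.05) -> str:
    """Map a finished test's outcome to the action listed in the table."""
    # Guardrails are checked first so a primary-metric win cannot override them.
    if guardrail_failed:
        return "Do not ship, even if primary wins"
    if p_value >= alpha:
        return "Consider more traffic or bolder change"
    if observed_lift > 0:
        return "Consider rollout"
    return "Reject variant, document learning"


print(recommend_action(p_value=0.01, observed_lift=0.08, guardrail_failed=True))
```

The returned strings are recommendations, not automatic decisions. The rollout call still depends on business judgment, as the analysis discipline above requires.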

Documentation & Learning

Test Record (Mandatory)

Document:

Hypothesis
Variants
Metrics
Planned vs. achieved sample size
Results
Decision
Learnings
Follow-up ideas

Store records in a shared, searchable location to avoid repeated failures.
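
One possible shape for such a record is a structured object serialized to JSON, which keeps entries searchable. The class name `ExperimentRecord`, its field names, and the example values below are all hypothetical and made up for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical test record covering the items listed above."""
    hypothesis: str
    variants: list[str]
    metrics: dict[str, str]      # e.g. {"primary": ..., "guardrail": ...}
    planned_sample_size: int     # per variant, from the pre-launch estimate
    achieved_sample_size: int    # per variant, at the time of analysis
    results: str
    decision: str
    learnings: str
    follow_up_ideas: list[str] = field(default_factory=list)


def to_json(record: ExperimentRecord) -> str:
    """Serialize the record so it can be stored in a shared, searchable location."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


# Illustrative values only, not real results.
record = ExperimentRecord(
    hypothesis="Shorter signup form increases signup rate for new visitors",
    variants=["control", "short_form"],
    metrics={"primary": "signup_rate", "guardrail": "support_tickets"},
    planned_sample_size=31000,
    achieved_sample_size=31500,
    results="Inconclusive",
    decision="Consider a bolder change",
    learnings="Form length alone did not move signups detectably",
)
print(to_json(record))
```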

Refusal Conditions (Safety)

Refuse to proceed if:

Baseline rate is unknown and cannot be estimated
Traffic is insufficient to detect the MDE
Primary metric is undefined
Multiple variables are changed without proper design
Hypothesis cannot be clearly stated

Explain why and recommend next steps.

Key Principles (Non-Negotiable)

One hypothesis per test
One primary metric
Commit before launch
No peeking
Learning over winning
Statistical rigor first

Final Reminder

A/B testing is not about proving ideas right. It is about learning the truth with confidence.

If you feel tempted to rush, simplify, or “just try it”, that is the signal to slow down and re-check the design.
