stat-ab-testing

A/B Testing Statistics

Framework

IRON LAW: Calculate Sample Size BEFORE Running the Test

Running a test without knowing the required sample size leads to one of two
failures: stopping too early (underpowered results and false positives) or
running too long (wasted traffic and time).

Required inputs: baseline conversion rate, minimum detectable effect (MDE),
significance level (α), power (1-β). Calculate BEFORE starting.

Sample Size Formula (Proportions)

n per group ≈ (Z_α/2 + Z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₁ - p₂)²

Quick reference (two-sided α=0.05, power=0.8):

Baseline Rate	MDE (relative)	N per Group
5%	10% (→5.5%)	~31,000
5%	20% (→6.0%)	~8,200
10%	10% (→11%)	~15,000
10%	20% (→12%)	~4,000
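
The formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `sample_size_per_group` and its parameter names are illustrative assumptions, and it assumes a two-sided test with equal group sizes.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided test of two proportions.

    Hypothetical helper: implements the normal-approximation formula
    (Z_alpha/2 + Z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # treatment rate implied by the relative MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_alpha/2
    z_beta = NormalDist().inv_cdf(power)           # Z_beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


if __name__ == "__main__":
    for baseline, mde in [(0.05, 0.10), (0.05, 0.20), (0.10, 0.10), (0.10, 0.20)]:
        n = sample_size_per_group(baseline, mde)
        print(f"baseline={baseline:.0%}, relative MDE={mde:.0%}: n per group = {n:,}")
```

Because the required sample size scales with 1 / (p₁ - p₂)², halving the MDE roughly quadruples the traffic needed, which is why the MDE should be chosen before the test starts.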

Testing Approaches

Approach	How It Works	Best When
Frequentist (fixed-horizon)	Set sample size, run to completion, then analyze	Standard practice, well-understood
Bayesian	Update beliefs with data, compute probability of improvement	Want probability statements ("90% chance B is better")
Sequential testing	Check results at intervals with adjusted thresholds	Need to stop early if clear winner, or limit downside risk
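
To make the Bayesian row concrete, here is a rough sketch of how a "probability B is better" statement might be computed. It assumes a uniform Beta(1, 1) prior on each conversion rate and estimates the probability by Monte Carlo sampling; the function name `prob_b_beats_a` and the prior choice are illustrative assumptions, not a method prescribed by this document.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors by sampling.

    Assumes a Beta(1, 1) prior, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 100/1000 conversions in A, 120/1000 in B.
    print(f"P(B > A) = {prob_b_beats_a(100, 1000, 120, 1000):.3f}")
```

The result is a posterior probability under the stated prior, not a p-value, so it should not be compared against α.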

Experiment Design Checklist

Hypothesis: What do you expect to happen and why?
Primary metric: ONE key metric (conversion, revenue, retention)
Guardrail metrics: Metrics that must NOT degrade (page load time, error rate)
Randomization unit: User, session, or device?
Sample size: Calculated from baseline, MDE, α, power
Duration: Account for weekly cycles (minimum 1-2 full weeks)
Stopping rules: Pre-defined — do NOT peek and stop early without correction
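
The sample size and duration items in the checklist are linked: the required N per group and the daily traffic together determine how long the test runs. A minimal sketch follows, assuming every eligible daily user is a new, unique user entering the experiment and that traffic is split evenly; `estimated_duration_days` and its parameters are hypothetical names.

```python
from math import ceil


def estimated_duration_days(n_per_group, daily_eligible_users, groups=2, min_weeks=1):
    """Days needed to reach the required sample, rounded up to full weeks.

    Rounding to whole weeks keeps every weekday equally represented,
    which is the point of accounting for weekly cycles.
    """
    total_needed = n_per_group * groups
    raw_days = ceil(total_needed / daily_eligible_users)
    weeks = max(min_weeks, ceil(raw_days / 7))
    return weeks * 7


if __name__ == "__main__":
    # Hypothetical traffic: 2,500 new eligible users per day.
    print(estimated_duration_days(15_000, 2_500))  # 12 raw days, rounded up to 14
```

If returning users make up a large share of daily traffic, the unique-user assumption overstates how fast the sample accumulates, and the real duration will be longer.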

Analysis Steps

Check randomization balance (are groups comparable on pre-treatment metrics?)
Calculate observed difference and confidence interval
Run significance test (z-test for proportions, t-test for continuous)
Check guardrail metrics
Interpret with practical significance in mind
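
For a conversion-rate metric, the difference, confidence interval, and significance test from the steps above can be sketched with the standard library alone. This is one common formulation (pooled standard error for the test, unpooled for the interval), offered as an assumption rather than the only valid choice; `two_proportion_ztest` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Return (difference, confidence interval, z statistic, two-sided p-value)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Test statistic: pooled standard error, valid under H0 (no difference).
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error around the observed difference.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, ci, z, p_value


if __name__ == "__main__":
    # Hypothetical counts: control 1000/10000, treatment 1100/10000.
    diff, ci, z, p = two_proportion_ztest(1000, 10_000, 1100, 10_000)
    print(f"diff={diff:+.4f}, CI=[{ci[0]:.4f}, {ci[1]:.4f}], z={z:.2f}, p={p:.4f}")
```

The interval is what supports the practical-significance step: a result can be statistically significant while its lower bound is still too small to justify shipping.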

Output Format

A/B Test Design: {Experiment Name}

Hypothesis

H₀: {no difference}
H₁: {expected improvement}
Primary metric: {metric}
MDE: {X% relative}

Sample Size

Baseline rate: {X%}
Required N per group: {N}
Estimated duration: {days/weeks}

Results (post-test)

Metric	Control	Treatment	Diff	CI (95%)	p-value
{primary}	X%	X%	+X%	[X, X]	{value}

Decision

{Ship / Don't ship / Extend test} — {rationale}

Gotchas

Peeking inflates false positives: Checking results daily and stopping as soon as p < 0.05 can push the false positive rate to 30% or more, far above the nominal 5%. Use sequential testing methods if you need to peek.
Novelty effect: New features may show a lift that fades as users get used to them. Run tests long enough (2+ weeks) to stabilize.
Simpson's paradox: An overall positive result can be negative in every subgroup (or vice versa). Segment by key dimensions.
Network effects / interference: If treatment users interact with control users (social features, marketplace), independence is violated. Use cluster randomization.
Statistical significance threshold is arbitrary: α=0.05 is convention, not truth. For high-stakes decisions (pricing, major UX changes), consider α=0.01.
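
The peeking problem is easy to reproduce in simulation. The sketch below models an A/A test (no true difference) as a running sum of standard-normal increments, one per equal-sized batch of data, so the z statistic at look k is the sum divided by √k. That idealized model, the number of looks, and the name `false_positive_rates` are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(looks=20, simulations=2000, alpha=0.05, seed=0):
    """Compare 'stop at the first significant look' with a single final analysis.

    Returns (peeking false positive rate, fixed-horizon false positive rate)
    for simulated A/A tests where the null hypothesis is true.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(simulations):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one more equal-sized batch of A/A data
            z = cumulative / sqrt(look)        # z statistic on all data so far
            if abs(z) > z_crit:
                crossed = True                 # a peeker would have stopped here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit          # only the final look counts
    return peeking_hits / simulations, fixed_hits / simulations


if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"peeking: {peeking:.1%}, fixed-horizon: {fixed:.1%}")
```

Under this model the fixed-horizon rate stays near the nominal α, while the peeking rate grows with the number of looks, which is the inflation that sequential methods correct for with adjusted thresholds.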

References

For Bayesian A/B testing methodology, see `references/bayesian-ab.md`.
For the multi-armed bandit approach, see `references/bandits.md`.
