stat-ab-testing

A/B Testing Statistics

Framework

IRON LAW: Calculate Sample Size BEFORE Running the Test

Running a test without knowing the required sample size leads to two
failures: stopping too early (an underpowered test, or a false positive from
a lucky early read) or running too long (wasted traffic and time).

Required inputs: baseline conversion rate, minimum detectable effect (MDE),
significance level (α), power (1-β). Calculate BEFORE starting.

Sample Size Formula (Proportions)

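n per group ≈ (Z_{α/2} + Z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₁ - p₂)²

Here p₁ is the baseline rate and p₂ = p₁ × (1 + relative MDE). For a two-sided α=0.05 and power=0.8, Z_{α/2} ≈ 1.96 and Z_β ≈ 0.84.

Quick reference (α=0.05, power=0.8):

| Baseline Rate | MDE (relative) | N per Group |
| --- | --- | --- |
| 5% | 10% (→5.5%) | ~31,000 |
| 5% | 20% (→6.0%) | ~8,200 |
| 10% | 10% (→11%) | ~15,000 |
| 10% | 20% (→12%) | ~4,000 |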
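The formula above can be sketched in a few lines of Python using only the standard library. The function name `sample_size_per_group` and its parameter names are illustrative assumptions, not part of any existing tool; the calculation uses the same normal approximation as the formula, so treat the result as a planning estimate.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.8):
    """Per-group n for a two-sided two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power=0.8
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# 10% baseline with a 10% relative MDE (10% -> 11%): a little under 15,000 per group
print(sample_size_per_group(0.10, 0.10))
```

Halving the MDE roughly quadruples the required sample, because the difference (p₁ - p₂) is squared in the denominator.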

Testing Approaches

| Approach | How It Works | Best When |
| --- | --- | --- |
| Frequentist (fixed-horizon) | Set sample size, run to completion, then analyze | Standard practice, well-understood |
| Bayesian | Update beliefs with data, compute probability of improvement | Want probability statements ("90% chance B is better") |
| Sequential testing | Check results at intervals with adjusted thresholds | Need to stop early if clear winner, or limit downside risk |
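To make the Bayesian row concrete, here is a minimal sketch of how a "probability that B is better" statement could be computed for conversion data. It assumes a uniform Beta(1, 1) prior on each rate and estimates the probability by Monte Carlo sampling; the function name `prob_b_beats_a`, the prior, and the draw count are illustrative choices, not a prescribed method.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + successes, 1 + failures)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: A converts 500/5000, B converts 560/5000
print(prob_b_beats_a(500, 5000, 560, 5000))
```

With a different prior the posterior parameters change accordingly, so the prior should be stated alongside any probability that is reported.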

Experiment Design Checklist

  1. Hypothesis: What do you expect to happen and why?
  2. Primary metric: ONE key metric (conversion, revenue, retention)
  3. Guardrail metrics: Metrics that must NOT degrade (page load time, error rate)
  4. Randomization unit: User, session, or device?
  5. Sample size: Calculated from baseline, MDE, α, power
  6. Duration: Account for weekly cycles (minimum 1-2 full weeks)
  7. Stopping rules: Pre-defined — do NOT peek and stop early without correction
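Items 5 and 6 of the checklist can be connected with a small helper that turns a per-group sample size into a run length rounded up to whole weeks. This is a sketch under simple assumptions: traffic is steady, each user is enrolled once, and the names `estimated_duration_days`, `daily_eligible_users`, and `traffic_share` are hypothetical.

```python
import math


def estimated_duration_days(n_per_group, daily_eligible_users, groups=2, traffic_share=1.0):
    """Days needed to reach the target sample, rounded up to full weeks."""
    total_needed = n_per_group * groups
    daily_enrolled = daily_eligible_users * traffic_share
    raw_days = math.ceil(total_needed / daily_enrolled)
    # Round up to whole weeks so every weekday is represented equally
    weeks = max(1, math.ceil(raw_days / 7))
    return weeks * 7


# 15,000 per group, two groups, 4,000 eligible users per day:
# 30,000 / 4,000 = 7.5 days, which rounds up to two full weeks
print(estimated_duration_days(15000, 4000))  # → 14
```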

Analysis Steps

  1. Check randomization balance (are groups comparable on pre-treatment metrics?)
  2. Calculate observed difference and confidence interval
  3. Run significance test (z-test for proportions, t-test for continuous)
  4. Check guardrail metrics
  5. Interpret with practical significance in mind
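Steps 2 and 3 can be sketched for a conversion metric as a two-proportion z-test with a confidence interval on the difference. The function name `two_proportion_test` and the returned dictionary keys are illustrative assumptions. The sketch uses the pooled standard error for the test (the rates are equal under H₀) and the unpooled standard error for the interval, which is one common convention rather than the only valid one.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Observed difference, confidence interval, and two-sided z-test p-value."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Significance test: pooled standard error under H0 (no difference)
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error of the difference
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return {"diff": diff, "ci": (diff - margin, diff + margin), "z": z, "p_value": p_value}


# Hypothetical counts: control 1500/15000, treatment 1650/15000
result = two_proportion_test(1500, 15000, 1650, 15000)
print(result)
```

If the interval excludes zero but its lower bound is below the effect size that would justify shipping, the result is statistically significant without being clearly practically significant.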

Output Format

A/B Test Design: {Experiment Name}

Hypothesis

  • H₀: {no difference}
  • H₁: {expected improvement}
  • Primary metric: {metric}
  • MDE: {X% relative}

Sample Size

  • Baseline rate: {X%}
  • Required N per group: {N}
  • Estimated duration: {days/weeks}

Results (post-test)

| Metric | Control | Treatment | Diff | CI (95%) | p-value |
| --- | --- | --- | --- | --- | --- |
| {primary} | X% | X% | +X% | [X, X] | {value} |

Decision

{Ship / Don't ship / Extend test} — {rationale}

Gotchas

  • Peeking inflates false positives: Checking results daily and stopping when p < 0.05 can produce a 30%+ false positive rate. Use sequential testing methods if you need to peek.
  • Novelty effect: New features may show a lift that fades as users get used to them. Run tests long enough (2+ weeks) to stabilize.
  • Simpson's paradox: An overall positive result can be negative in every subgroup (or vice versa). Segment by key dimensions.
  • Network effects / interference: If treatment users interact with control users (social features, marketplace), independence is violated. Use cluster randomization.
  • Statistical significance threshold is arbitrary: α=0.05 is convention, not truth. For high-stakes decisions (pricing, major UX changes), consider α=0.01.
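The peeking gotcha can be checked with a small A/A simulation: both groups share the same true rate, so every "significant" result is a false positive. This is a rough sketch with hypothetical parameters (14 daily looks, 200 users per group per day, a 10% rate); the exact figure depends on the number of looks and on the random seed.

```python
import math
import random
from statistics import NormalDist


def peeking_false_positive_rate(sims=400, days=14, users_per_day=200,
                                rate=0.10, alpha=0.05, seed=1):
    """Share of A/A tests declared significant when checked daily and stopped at the first hit."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        for _day in range(days):
            n += users_per_day
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(pooled * (1 - pooled) * 2 / n)
            # Peek: stop the test the first time the z-statistic crosses the threshold
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                false_positives += 1
                break
    return false_positives / sims


print(peeking_false_positive_rate())
```

Because there is no real difference between the groups, a single fixed-horizon analysis would flag about 5% of these tests; stopping at the first daily crossing pushes the rate well above that.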

References

  • For Bayesian A/B testing methodology, see references/bayesian-ab.md
  • For the multi-armed bandit approach, see references/bandits.md