stat-ab-testing
A/B Testing Statistics
Framework
IRON LAW: Calculate Sample Size BEFORE Running the Test
Running a test without knowing the required sample size leads to two
failures: stopping too early (an underpowered test, or false positives if
you stop at the first significant result) or running too long (wasted
traffic and time).
Required inputs: baseline conversion rate, minimum detectable effect (MDE),
significance level (α), power (1-β). Calculate BEFORE starting.
Sample Size Formula (Proportions)
n per group ≈ (Z_α/2 + Z_β)² × [p₁(1-p₁) + p₂(1-p₂)] / (p₁ - p₂)²
Here p₁ is the baseline rate and p₂ = p₁ × (1 + relative MDE) is the expected treatment rate.
Quick reference (two-sided α=0.05, power=0.8):
| Baseline Rate | MDE (relative) | N per Group |
|---|---|---|
| 5% | 10% (→5.5%) | ~31,000 |
| 5% | 20% (→6.0%) | ~8,200 |
| 10% | 10% (→11%) | ~15,000 |
| 10% | 20% (→12%) | ~3,800 |
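
The sample size formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library. The function name `sample_size_per_group` and its parameter names are illustrative assumptions, and it assumes a two-sided test with equal group sizes.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # expected treatment rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_α/2
    z_beta = NormalDist().inv_cdf(power)           # Z_β
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical scenarios: (baseline rate, relative MDE)
    for baseline, mde in [(0.05, 0.10), (0.05, 0.20), (0.10, 0.10), (0.10, 0.20)]:
        print(baseline, mde, sample_size_per_group(baseline, mde))
```

Because the denominator is the squared absolute difference, halving the MDE roughly quadruples the required sample size.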
Testing Approaches
| Approach | How It Works | Best When |
|---|---|---|
| Frequentist (fixed-horizon) | Set sample size, run to completion, then analyze | Standard practice, well-understood |
| Bayesian | Update beliefs with data, compute probability of improvement | Want probability statements ("90% chance B is better") |
| Sequential testing | Check results at intervals with adjusted thresholds | Need to stop early if clear winner, or limit downside risk |
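
To make the Bayesian row concrete, here is a rough sketch of how a statement such as "90% chance B is better" might be computed. It assumes independent uniform Beta(1, 1) priors on each conversion rate and estimates the probability by Monte Carlo sampling. The function name `prob_b_beats_a` and the example counts are hypothetical, not taken from the referenced methodology.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + successes, 1 + failures)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 500/10,000 conversions in A, 560/10,000 in B
    print(prob_b_beats_a(500, 10_000, 560, 10_000))
```

The result is a posterior probability, not a p-value, so it depends on the chosen prior and should be reported alongside it.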
Experiment Design Checklist
- Hypothesis: What do you expect to happen and why?
- Primary metric: ONE key metric (conversion, revenue, retention)
- Guardrail metrics: Metrics that must NOT degrade (page load time, error rate)
- Randomization unit: User, session, or device?
- Sample size: Calculated from baseline, MDE, α, power
- Duration: Account for weekly cycles (minimum 1-2 full weeks)
- Stopping rules: Pre-defined — do NOT peek and stop early without correction
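
The sample size and duration items in the checklist can be tied together with a small calculation. This sketch assumes that every eligible unit enters the experiment once, that traffic is split evenly across groups, and that daily traffic is roughly constant. The function name and parameters are illustrative assumptions.

```python
from math import ceil


def estimated_duration_days(n_per_group, daily_eligible_units, groups=2, min_weeks=1):
    """Days needed to reach the sample size, rounded up to whole weeks."""
    total_needed = n_per_group * groups
    raw_days = ceil(total_needed / daily_eligible_units)
    # Round up to full weeks so every weekday is represented equally
    weeks = max(min_weeks, ceil(raw_days / 7))
    return weeks * 7


if __name__ == "__main__":
    # Hypothetical: 15,000 per group, 3,000 new eligible users per day
    print(estimated_duration_days(15_000, 3_000))
```

Rounding up to whole weeks keeps weekday and weekend behavior balanced between the start and end of the test.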
Analysis Steps
- Check randomization balance (are groups comparable on pre-treatment metrics?)
- Calculate observed difference and confidence interval
- Run significance test (z-test for proportions, t-test for continuous)
- Check guardrail metrics
- Interpret with practical significance in mind
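
The observed difference, confidence interval, and significance test from the steps above can be sketched for a proportion metric as follows. This is a minimal standard-library version of a two-proportion z-test, using the pooled standard error for the test and the unpooled standard error for the interval. The function name and the example counts are hypothetical, and it does not handle the edge case of zero conversions in both groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Return (difference, confidence interval, two-sided p-value)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the test of H0: no difference
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference
    se_diff = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, ci, p_value


if __name__ == "__main__":
    # Hypothetical counts: control 1,500/15,000, treatment 1,650/15,000
    diff, ci, p_value = two_proportion_ztest(1500, 15_000, 1650, 15_000)
    print(f"diff={diff:.4f} ci=({ci[0]:.4f}, {ci[1]:.4f}) p={p_value:.4f}")
```

The interval is on the absolute difference in rates, so compare its lower bound with the smallest change that would matter in practice, not just with zero.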
Output Format
A/B Test Design: {Experiment Name}
Hypothesis
- H₀: {no difference}
- H₁: {expected improvement}
- Primary metric: {metric}
- MDE: {X% relative}
Sample Size
- Baseline rate: {X%}
- Required N per group: {N}
- Estimated duration: {days/weeks}
Results (post-test)
| Metric | Control | Treatment | Diff | CI (95%) | p-value |
|---|---|---|---|---|---|
| {primary} | X% | X% | +X% | [X, X] | {value} |
Decision
{Ship / Don't ship / Extend test} — {rationale}
Gotchas
- Peeking inflates false positives: Checking results daily and stopping when p < 0.05 can produce a 30%+ false positive rate. Use sequential testing methods if you need to peek.
- Novelty effect: New features may show a lift that fades as users get used to them. Run tests long enough (2+ weeks) to stabilize.
- Simpson's paradox: An overall positive result can be negative in every subgroup (or vice versa). Segment by key dimensions.
- Network effects / interference: If treatment users interact with control users (social features, marketplace), independence is violated. Use cluster randomization.
- Statistical significance threshold is arbitrary: α=0.05 is convention, not truth. For high-stakes decisions (pricing, major UX changes), consider α=0.01.
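
The peeking gotcha can be illustrated with a small simulation of A/A tests, where no true difference exists. This sketch models the z-statistic with a normal approximation (each equal-sized batch of data adds a standard normal increment) instead of simulating individual users. The function name `false_positive_rate` and its defaults are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, simulations=4000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant at ANY of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look in range(1, looks + 1):
            # Under H0, each equal-sized batch adds a standard normal increment,
            # so the z-statistic after `look` batches is cumulative / sqrt(look).
            cumulative += rng.gauss(0, 1)
            if abs(cumulative / sqrt(look)) > z_crit:
                false_positives += 1  # stopped early on a spurious "win"
                break
    return false_positives / simulations


if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(looks, false_positive_rate(looks))
```

With a single look the rate stays near α, and it grows with each additional unadjusted look, which is why sequential methods tighten the threshold at every check.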
References
- For Bayesian A/B testing methodology, see references/bayesian-ab.md
- For the multi-armed bandit approach, see references/bandits.md