trustworthy-experiments
Trustworthy Experiments
What It Is
Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.
The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"
Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:
- 66-92% of experiments fail to improve the target metric
- 8% of experiments have invalid results due to sample ratio mismatch alone
- When only 8% of experiments truly succeed, a statistically significant result (p < 0.05) still carries about a 26% risk of being a false positive
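The 26% figure in the last bullet can be reproduced with Bayes' rule. The sketch below is illustrative only. It assumes 80% statistical power and that only the positive tail of a two-sided test counts as a "win" (so the effective alpha is 0.025). Neither assumption is stated in the text above, and the function name `false_positive_risk` is hypothetical.

```python
def false_positive_risk(prior_success_rate, alpha=0.05, power=0.8):
    """Probability that a statistically significant 'win' is a false positive.

    Assumptions (not from the original text): 80% power, and only the
    positive tail of a two-sided test counts, so alpha is halved.
    """
    one_sided_alpha = alpha / 2
    # Ideas with no real effect that still reach significance by chance.
    false_positives = one_sided_alpha * (1 - prior_success_rate)
    # Ideas with a real effect that the experiment detects.
    true_positives = power * prior_success_rate
    return false_positives / (false_positives + true_positives)


risk = false_positive_risk(prior_success_rate=0.08)
print(f"False positive risk: {risk:.1%}")
```

Under these assumptions the risk falls as the prior success rate rises, which is why the same p-value threshold is less trustworthy in domains where few ideas work.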
When to Use It
Use Trustworthy Experiments when you need to:
- Design an A/B test that will produce valid, actionable results
- Determine sample size and runtime for statistical power
- Validate experiment results before making ship/no-ship decisions
- Build an experimentation culture at your company
- Choose an Overall Evaluation Criterion (OEC): metrics that balance short-term gains with long-term value
- Diagnose why results look suspicious (Twyman's Law: any figure that looks interesting or different is usually wrong)
- Speed up experimentation without sacrificing validity
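One concrete validation step before a ship/no-ship decision is a sample ratio mismatch (SRM) check: compare the observed number of users per variant with the split the experiment was designed for. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom and only the standard library. The function names, the 50/50 default split, and the 0.001 alert threshold are assumptions for illustration, not something the text above prescribes.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """p-value of a chi-square goodness-of-fit test on the user split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_users, treatment_users, threshold=0.001):
    """Flag the experiment when the split is very unlikely under the design.

    The 0.001 threshold is an assumed, deliberately strict cutoff.
    """
    return srm_p_value(control_users, treatment_users) < threshold


# Hypothetical counts: a 1.5% imbalance on roughly 100,000 users.
print(has_srm(50_000, 50_000))
print(has_srm(50_000, 51_500))
```

When the check fires, the usual reading is that the metric comparison should not be trusted until the cause of the imbalance is found.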
When Not to Use It
Don't use controlled experiments when:
- You don't have enough users: you need at least tens of thousands
- The decision is one-time: you can't A/B test a merger or an acquisition
- There's no real user choice, as with employer-mandated software
- You need an immediate decision: experiments take time to run
- The metric can't be measured: there is no experiment without observable outcomes
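The "tens of thousands" requirement in the first bullet can be motivated with a common rule of thumb for 80% power at a 0.05 significance level: roughly 16 * variance / delta^2 users per variant, where delta is the smallest absolute change worth detecting. The sketch below applies it to a conversion-rate metric. The function name and the example numbers (5% baseline, 5% relative lift) are hypothetical.

```python
import math


def users_per_variant(baseline_rate, relative_lift):
    """Rule-of-thumb sample size per variant for a conversion-rate metric.

    Uses n ~= 16 * variance / delta**2, which approximates 80% power
    at alpha = 0.05. It is a planning estimate, not an exact calculation.
    """
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute change to detect
    return math.ceil(16 * variance / delta ** 2)


# Hypothetical example: 5% baseline conversion, detect a 5% relative change.
n = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(f"About {n:,} users per variant")
```

With these inputs the estimate is on the order of 120,000 users per variant, and it grows quickly as the detectable change shrinks, because the required sample scales with 1 / delta^2.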
Resources
Book:
- Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu