trustworthy-experiments
Trustworthy Experiments
What It Is
Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.
The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"
Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:
- 66-92% of experiments fail to improve the target metric
- 8% of experiments have invalid results due to sample ratio mismatch alone
- When only 8% of ideas truly succeed, a result that is statistically significant at p < 0.05 still carries about a 26% risk of being a false positive
This framework helps you avoid the common traps that make experiment results untrustworthy.
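The 26% figure follows from Bayes' rule. Here is a minimal sketch of that arithmetic. It assumes 80% power and counts only the positive tail of a two-sided test (alpha / 2) as a "win"; neither assumption is stated above, but together they reproduce the number. The function name is illustrative.

```python
def false_positive_risk(prior_success, alpha=0.05, power=0.80):
    """Probability that a significant positive result is not a real improvement.

    Assumptions (not stated in the text above): 80% power, and only the
    positive tail of a two-sided test counts as a win, so alpha / 2 is used.
    """
    one_sided_alpha = alpha / 2
    # Ideas with no real effect that still cross the significance threshold.
    false_wins = one_sided_alpha * (1 - prior_success)
    # Ideas with a real effect that the experiment detects.
    true_wins = power * prior_success
    return false_wins / (false_wins + true_wins)


print(f"{false_positive_risk(0.08):.1%}")  # prints 26.4%
```

The lower the share of ideas that truly work, the less a single significant result can be trusted, which is why replication matters most in mature, heavily optimized products.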
Response Posture
- Apply the framework directly to the user's experiment.
- Never mention the repository, skills, SKILL.md, patterns, or references.
- Do not run tools or read files; answer from the framework.
- Avoid process/meta commentary; respond as an experimentation lead.
When to Use It
Use Trustworthy Experiments when you need to:
- Design an A/B test that will produce valid, actionable results
- Determine sample size and runtime for statistical power
- Validate experiment results before making ship/no-ship decisions
- Build an experimentation culture at your company
- Choose metrics (OEC) that balance short-term gains with long-term value
- Diagnose why results look suspicious (Twyman's Law)
- Speed up experimentation without sacrificing validity
When Not to Use It
Don't use controlled experiments when:
- You don't have enough users — Need tens of thousands minimum; 200,000+ for mature experimentation
- The decision is one-time — Can't A/B test mergers, acquisitions, or one-off events
- There's no real user choice — Employer-mandated software offers no switching insight
- You need immediate decisions — Experiments need time to reach statistical power
- The metric can't be measured — No experiment without observable outcomes
Patterns
Detailed examples showing how to run experiments correctly. Each pattern shows a common mistake and the correct approach.
Critical (get these wrong and you've wasted your time)
| Pattern | What It Teaches |
|---|---|
| peeking-at-results | Don't check P-values daily — let experiments run to completion |
| sample-ratio-mismatch | If your 50/50 split is off, your results are invalid |
| underpowered-tests | Too few users = meaningless results, even if "significant" |
| wrong-success-metric | Optimizing the wrong metric can hurt your business |
| twymans-law | If results look too good to be true, they probably are |
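To make the sample-ratio-mismatch pattern concrete, here is a minimal sketch of a chi-square goodness-of-fit check for a two-bucket split, using only the standard library. The function name, the user counts, and the 0.001 alert threshold are illustrative assumptions, not values taken from this framework.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a 2-bucket split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a split that was designed to be 50/50.
p = srm_p_value(50_000, 51_500)
SRM_THRESHOLD = 0.001  # assumed alert level; pick one that fits your platform
if p < SRM_THRESHOLD:
    print(f"Likely SRM (p = {p:.2g}): debug assignment before reading any metric")
else:
    print(f"No SRM detected (p = {p:.2g})")
```

A very small p-value says the observed split is unlikely under the designed ratio, so something in assignment, logging, or filtering is probably broken and the metric comparison should not be trusted until it is found.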
High Impact
| Pattern | What It Teaches |
|---|---|
| novelty-effects | Initial lifts often fade — run experiments long enough |
| survivorship-bias | Analyzing only users who stayed skews your results |
| multiple-comparisons | Testing many metrics inflates false positive rate |
| guardrail-metrics | Always monitor what you might be hurting |
| big-redesigns-fail | Ship incrementally — 80% of big bets lose |
| flat-is-not-ship | No significant result means don't ship, not "good enough" |
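The multiple-comparisons row can be illustrated with a few lines of arithmetic. This sketch assumes the metrics are independent, which real metrics rarely are, and shows the Bonferroni correction as one common remedy rather than the one this framework prescribes. Function names are illustrative.

```python
def family_wise_error_rate(num_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics, alpha=0.05):
    """Per-metric significance threshold that keeps the overall rate near alpha."""
    return alpha / num_metrics


for k in (1, 5, 20):
    print(
        f"{k:>2} metrics: "
        f"P(any false positive) = {family_wise_error_rate(k):.1%}, "
        f"Bonferroni threshold = {bonferroni_alpha(k):.4f}"
    )
```

With 20 independent metrics, at least one spurious "win" becomes more likely than not, which is why the primary metric should be named before the experiment starts.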
Medium Impact
| Pattern | What It Teaches |
|---|---|
| institutional-memory | Document learnings or repeat the same mistakes |
| external-validity | Results may not generalize to other contexts |
| variance-reduction | Techniques to get results faster without losing validity |
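One widely used variance-reduction technique is CUPED, which subtracts the part of each user's metric that is predictable from their pre-experiment behavior. The sketch below is a simplified illustration with made-up numbers. In practice theta is usually estimated on the pooled control and treatment data and the same adjustment is applied to both variants.

```python
from statistics import mean, variance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:     per-user values measured during the experiment (y)
    pre_metric: the same users' values from before the experiment (x)
    """
    x_bar = mean(pre_metric)
    y_bar = mean(metric)
    n = len(metric)
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(pre_metric, metric)
    ) / (n - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Hypothetical per-user values, for illustration only.
pre = [10, 12, 9, 15, 11, 14]
during = [11, 13, 9, 16, 12, 14]
adjusted = cuped_adjust(during, pre)

print(f"mean before/after adjustment: {mean(during):.2f} / {mean(adjusted):.2f}")
print(f"variance before/after adjustment: {variance(during):.2f} / {variance(adjusted):.2f}")
```

The adjustment leaves the mean unchanged and shrinks the variance in proportion to how strongly the pre-experiment metric correlates with the in-experiment metric, so the same effect can be detected with fewer users or less time.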
Deep Dives
Read only when you need extra detail.
- references/trustworthy-experiments-playbook.md: Expanded framework detail, checklists, and examples.
- references/experiment-plan-template.md: Fill-in-the-blanks plan to design and run an A/B test.
Scripts
Optional utilities (no external deps):
- scripts/sample_size.py: Estimate required sample size for a two-variant conversion test.
- scripts/srm_check.py: Check sample ratio mismatch (SRM) for a 2-bucket split.
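For orientation, here is a minimal sketch of the kind of calculation a sample-size estimate for a two-variant conversion test involves. It is not the contents of scripts/sample_size.py. It uses the standard two-proportion normal approximation, and the function name, parameters, and defaults (two-sided alpha of 0.05, 80% power) are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect a relative lift in conversion."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical scenario: 5% baseline conversion, 5% relative lift to detect.
n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(f"{n:,} users per variant")
```

Under these assumptions the example needs roughly 120,000 users per variant, and halving the detectable lift roughly quadruples that number.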
Resources
Book:
- Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu — The definitive guide. All proceeds go to charity.
Papers (from Kohavi's teams):
- "Rules of Thumb for Online Experiments" — Patterns from thousands of Microsoft experiments
- "Diagnosing Sample Ratio Mismatch" — How to detect and debug SRM
- "CUPED: Variance Reduction" — Get results faster without losing validity
- "Crawl, Walk, Run, Fly" — Six axes for experimentation maturity
Online:
- goodui.org — Database of 140+ experiment patterns with success rates
- Ronny Kohavi's LinkedIn — Regular posts on experimentation insights
- Ronny Kohavi's Maven course — Live cohort-based course on experimentation
Related Books:
- Calling Bullshit by Carl Bergstrom and Jevin West — Critical thinking about data
- Hard Facts, Dangerous Half-Truths and Total Nonsense by Jeffrey Pfeffer and Robert Sutton — Evidence-based management