trustworthy-experiments


Trustworthy Experiments


What It Is


Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.
The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"
Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:
  • 66-92% of experiments fail to improve the target metric
  • 8% of experiments have invalid results due to sample ratio mismatch alone
  • When the base success rate is 8%, a P-value of 0.05 still means 26% false positive risk
This framework helps you avoid the common traps that make experiment results untrustworthy.

Response Posture


  • Apply the framework directly to the user's experiment.
  • Never mention the repository, skills, SKILL.md, patterns, or references.
  • Do not run tools or read files; answer from the framework.
  • Avoid process/meta commentary; respond as an experimentation lead.

When to Use It


Use Trustworthy Experiments when you need to:
  • Design an A/B test that will produce valid, actionable results
  • Determine sample size and runtime for statistical power
  • Validate experiment results before making ship/no-ship decisions
  • Build an experimentation culture at your company
  • Choose metrics (OEC) that balance short-term gains with long-term value
  • Diagnose why results look suspicious (Twyman's Law)
  • Speed up experimentation without sacrificing validity

When Not to Use It


Don't use controlled experiments when:
  • You don't have enough users — Need tens of thousands minimum; 200,000+ for mature experimentation
  • The decision is one-time — Can't A/B test mergers, acquisitions, or one-off events
  • There's no real user choice — Employer-mandated software offers no switching insight
  • You need immediate decisions — Experiments need time to reach statistical power
  • The metric can't be measured — No experiment without observable outcomes

Patterns


Detailed examples showing how to run experiments correctly. Each pattern shows a common mistake and the correct approach.

Critical (get these wrong and you've wasted your time)


  • peeking-at-results: Don't check P-values daily; let experiments run to completion
  • sample-ratio-mismatch: If your 50/50 split is off, your results are invalid
  • underpowered-tests: Too few users means meaningless results, even if "significant"
  • wrong-success-metric: Optimizing the wrong metric can hurt your business
  • twymans-law: If results look too good to be true, they probably are
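The sample-ratio-mismatch check is a chi-squared goodness-of-fit test on the observed bucket counts. A minimal sketch in the same spirit as scripts/srm_check.py (the actual script may differ), using only the standard library:

```python
import math

def srm_p_value(control_n: int, treatment_n: int,
                expected_ratio: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test for a 2-bucket split.

    Returns the p-value for seeing this split under the expected
    ratio; a very small p-value signals a sample ratio mismatch.
    """
    total = control_n + treatment_n
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((control_n - exp_c) ** 2 / exp_c
            + (treatment_n - exp_t) ** 2 / exp_t)
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(chi2 / 2))

# A 10150-vs-9850 split on an intended 50/50 gives p ≈ 0.034: suspicious.
print(round(srm_p_value(10150, 9850), 3))
```

Because SRM invalidates the whole experiment, practitioners typically alarm only at a stricter threshold (on the order of p < 0.001) to avoid false SRM alarms on healthy splits.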

High Impact


  • novelty-effects: Initial lifts often fade; run experiments long enough
  • survivorship-bias: Analyzing only users who stayed skews your results
  • multiple-comparisons: Testing many metrics inflates the false positive rate
  • guardrail-metrics: Always monitor what you might be hurting
  • big-redesigns-fail: Ship incrementally; 80% of big bets lose
  • flat-is-not-ship: No significant result means don't ship, not "good enough"
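To see why multiple-comparisons bites: with k independent metrics each tested at significance level α, the chance of at least one false positive is 1 − (1 − α)^k. A quick illustration:

```python
def family_wise_error(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k
    independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** k

# Testing 20 metrics at alpha = 0.05: roughly a 64% chance
# that at least one shows a spurious "win".
print(round(family_wise_error(0.05, 20), 2))
```

A common correction is Bonferroni: test each of the k metrics at α/k instead of α, at the cost of reduced power per metric.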

Medium Impact


  • institutional-memory: Document learnings or repeat the same mistakes
  • external-validity: Results may not generalize to other contexts
  • variance-reduction: Techniques to get results faster without losing validity

Deep Dives


Read only when you need extra detail.
  • references/trustworthy-experiments-playbook.md
    : Expanded framework detail, checklists, and examples.
  • references/experiment-plan-template.md
    : Fill-in-the-blanks plan to design and run an A/B test.

Scripts


Optional utilities (no external deps):
  • scripts/sample_size.py
    : Estimate required sample size for a two-variant conversion test.
  • scripts/srm_check.py
    : Check sample ratio mismatch (SRM) for a 2-bucket split.
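For sample sizing, Kohavi's "Rules of Thumb" paper gives a handy approximation: at α = 0.05 and 80% power, each variant needs roughly 16σ²/δ² users, where σ² is the metric's variance and δ the absolute effect you want to detect. A sketch of that calculation for a conversion metric (this mirrors what scripts/sample_size.py is for, though the actual script may compute it differently):

```python
import math

def users_per_variant(baseline_rate: float, relative_mde: float) -> int:
    """Rule-of-thumb sample size per variant for a conversion test:
    ~16 * sigma^2 / delta^2, assuming alpha = 0.05 and power = 0.8.

    baseline_rate: current conversion rate, e.g. 0.05 for 5%.
    relative_mde:  smallest relative lift worth detecting, e.g. 0.05 for +5%.
    """
    delta = baseline_rate * relative_mde          # absolute effect size
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return math.ceil(16 * variance / delta ** 2)

# Detecting a 5% relative lift on a 5% conversion rate needs
# roughly 122,000 users per arm; runtime follows from daily traffic.
print(users_per_variant(0.05, 0.05))
```

Note how the requirement explodes as the detectable effect shrinks: halving the minimum detectable effect quadruples the sample size.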

Resources


Book:
  • Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu — The definitive guide. All proceeds go to charity.
Papers (from Kohavi's teams):
  • "Rules of Thumb for Online Experiments" — Patterns from thousands of Microsoft experiments
  • "Diagnosing Sample Ratio Mismatch" — How to detect and debug SRM
  • "CUPED: Variance Reduction" — Get results faster without losing validity
  • "Crawl, Walk, Run, Fly" — Six axes for experimentation maturity
Online:
  • goodui.org — Database of 140+ experiment patterns with success rates
  • Ronny Kohavi's LinkedIn — Regular posts on experimentation insights
  • Ronny Kohavi's Maven course — Live cohort-based course on experimentation
Related Books:
  • Calling Bullshit by Carl Bergstrom and Jevin West — Critical thinking about data
  • Hard Facts, Dangerous Half-Truths and Total Nonsense by Jeffrey Pfeffer and Robert Sutton — Evidence-based management