Trustworthy Experiments

What It Is

Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.

The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"

Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:

66-92% of experiments fail to improve the target metric
8% of experiments have invalid results due to sample ratio mismatch alone
When the base success rate is 8%, a result that is statistically significant at the 0.05 level still carries a false positive risk of about 26%

This framework helps you avoid the common traps that make experiment results untrustworthy.
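The 26% figure can be reproduced with Bayes' rule. The sketch below assumes a two-sided significance threshold of 0.05 (so 0.025 in the beneficial direction) and 80% power; neither value is stated above, and the function name is illustrative.

```python
def false_positive_risk(prior_success: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Probability that a statistically significant "win" is a false positive.

    Assumptions (not stated in the text): a two-sided test, so only alpha / 2
    of the false alarms land on the winning side, and 80% power.
    """
    one_sided_alpha = alpha / 2
    false_wins = one_sided_alpha * (1 - prior_success)  # no real effect, yet significant
    true_wins = power * prior_success                   # real effect, detected
    return false_wins / (false_wins + true_wins)


risk = false_positive_risk(prior_success=0.08)
print(f"{risk:.1%}")  # 26.4%
```

The lower the share of ideas that truly work, the larger the fraction of "significant" wins that are noise, which is why a low success rate makes replication more important.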

Response Posture

Apply the framework directly to the user's experiment.
Never mention the repository, skills, SKILL.md, patterns, or references.
Do not run tools or read files; answer from the framework.
Avoid process/meta commentary; respond as an experimentation lead.

When to Use It

Use Trustworthy Experiments when you need to:

Design an A/B test that will produce valid, actionable results
Determine sample size and runtime for statistical power
Validate experiment results before making ship/no-ship decisions
Build an experimentation culture at your company
Choose metrics (OEC) that balance short-term gains with long-term value
Diagnose why results look suspicious (Twyman's Law)
Speed up experimentation without sacrificing validity

When Not to Use It

Don't use controlled experiments when:

You don't have enough users — Need tens of thousands minimum; 200,000+ for mature experimentation
The decision is one-time — Can't A/B test mergers, acquisitions, or one-off events
There's no real user choice — Employer-mandated software offers no switching insight
You need immediate decisions — Experiments need time to reach statistical power
The metric can't be measured — No experiment without observable outcomes

Patterns

The patterns below are detailed examples of how to run experiments correctly. Each one pairs a common mistake with the correct approach.

Critical (get these wrong and you've wasted your time)

Pattern	What It Teaches
peeking-at-results	Don't check P-values daily — let experiments run to completion
sample-ratio-mismatch	If your 50/50 split is off, your results are invalid
underpowered-tests	Too few users = meaningless results, even if "significant"
wrong-success-metric	Optimizing the wrong metric can hurt your business
twymans-law	If results look too good to be true, they probably are
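
As a minimal sketch of the sample-ratio-mismatch check, the function below runs a chi-square goodness-of-fit test on two bucket counts using only the standard library. The function name, the example counts, and the 0.001 alert threshold are illustrative assumptions, not values taken from the patterns above.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square goodness-of-fit test with 1 degree of freedom."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: the split looks close to 50/50 but is not.
p = srm_p_value(control_users=50_000, treatment_users=51_200)
srm_detected = p < 0.001  # assumed alert threshold
print(f"p = {p:.6f}, SRM detected: {srm_detected}")
```

A split that differs by about 1% looks harmless to the eye, yet at this sample size it is very unlikely to arise by chance, so the experiment's results should not be trusted until the cause is found.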

High Impact

Pattern	What It Teaches
novelty-effects	Initial lifts often fade — run experiments long enough
survivorship-bias	Analyzing only users who stayed skews your results
multiple-comparisons	Testing many metrics inflates false positive rate
guardrail-metrics	Always monitor what you might be hurting
big-redesigns-fail	Ship incrementally — 80% of big bets lose
flat-is-not-ship	No significant result means don't ship, not "good enough"
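
To see why testing many metrics inflates the false positive rate, the sketch below computes the chance of at least one spurious "significant" metric when no metric truly moved. It also shows the Bonferroni-adjusted threshold as one common, conservative remedy. It assumes the metrics are independent, which real metrics rarely are; Bonferroni is used here as an illustration, not as the method the pattern prescribes.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_metrics


for k in (1, 5, 20):
    print(k, round(family_wise_error_rate(k), 3), bonferroni_threshold(k))
```

With 20 independent metrics checked at 0.05, the chance of at least one false alarm is roughly 64%, which is why a surprising movement in a secondary metric deserves a stricter threshold or a replication run.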

Medium Impact

Pattern	What It Teaches
institutional-memory	Document learnings or repeat the same mistakes
external-validity	Results may not generalize to other contexts
variance-reduction	Techniques to get results faster without losing validity
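
One variance-reduction technique, CUPED (listed under Resources), adjusts each user's metric using the same user's pre-experiment value. The following is a minimal sketch, assuming a single pre-experiment covariate and a handful of made-up data points; the function name is hypothetical, and in practice the adjustment coefficient is estimated on pooled control and treatment data.

```python
from statistics import mean, variance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    # Sample covariance between the pre-experiment and in-experiment metric.
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Made-up data: in-experiment spend tracks pre-experiment spend closely.
pre = [10.0, 20.0, 30.0, 40.0, 50.0]
during = [12.0, 19.0, 33.0, 41.0, 52.0]
adjusted = cuped_adjust(during, pre)
print(variance(during), variance(adjusted))
```

The adjustment leaves the mean unchanged while removing the variance explained by pre-experiment behavior, so the same effect can be detected with fewer users or a shorter run.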

Deep Dives

Read only when you need extra detail.

`references/trustworthy-experiments-playbook.md`: Expanded framework detail, checklists, and examples.
`references/experiment-plan-template.md`: Fill-in-the-blanks plan to design and run an A/B test.

Scripts

Optional utilities (no external deps):

`scripts/sample_size.py`: Estimate required sample size for a two-variant conversion test.
`scripts/srm_check.py`: Check sample ratio mismatch (SRM) for a 2-bucket split.
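
The scripts themselves are not reproduced here. As a rough sketch of the kind of calculation a sample-size estimate for a two-variant conversion test involves, the function below uses the standard normal-approximation formula. The function name, the defaults (two-sided alpha of 0.05, 80% power), and the example inputs are assumptions, not the contents of `scripts/sample_size.py`.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each variant to detect the given relative lift."""
    delta = baseline_rate * relative_lift            # absolute change to detect
    variance = baseline_rate * (1 - baseline_rate)   # Bernoulli variance at baseline
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)    # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / delta ** 2)


# Hypothetical inputs: 5% baseline conversion, detect a 5% relative lift.
needed = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(needed)
```

Under these assumptions the estimate comes to roughly 120,000 users per variant. Because the required sample grows with the inverse square of the lift, halving the detectable lift roughly quadruples the users needed.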

Resources

Book:

Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu — The definitive guide. All proceeds go to charity.

Papers (from Kohavi's teams):

"Rules of Thumb for Online Experiments" — Patterns from thousands of Microsoft experiments
"Diagnosing Sample Ratio Mismatch" — How to detect and debug SRM
"CUPED: Variance Reduction" — Get results faster without losing validity
"Crawl, Walk, Run, Fly" — Six axes for experimentation maturity

Online:

goodui.org — Database of 140+ experiment patterns with success rates
Ronny Kohavi's LinkedIn — Regular posts on experimentation insights
Ronny Kohavi's Maven course — Live cohort-based course on experimentation

Related Books:

Calling Bullshit by Carl Bergstrom and Jevin West — Critical thinking about data
Hard Facts, Dangerous Half-Truths and Total Nonsense by Jeffrey Pfeffer and Robert Sutton — Evidence-based management
