Trustworthy Experiments

What It Is

Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.

The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"

Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:

66-92% of experiments fail to improve the target metric
8% of experiments have invalid results due to sample ratio mismatch alone
When only 8% of tested ideas truly succeed, a statistically significant result at the 0.05 p-value threshold still carries a roughly 26% false positive risk
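
The 26% figure can be reproduced with Bayes' rule. The sketch below is one way to do the arithmetic, not necessarily the exact calculation behind the quoted number: it assumes 80% statistical power and that only the positive tail of a two-sided 0.05 test counts as a "win" (so the effective alpha is 0.025). Neither assumption is stated in the text above, and the function and parameter names are illustrative.

```python
def false_positive_risk(
    prior_success: float,
    alpha: float = 0.05,
    power: float = 0.80,
    positive_tail_only: bool = True,
) -> float:
    """Probability that a statistically significant 'win' has no real effect.

    prior_success: share of tested ideas that truly improve the metric.
    alpha: two-sided significance threshold (assumed 0.05).
    power: chance of detecting a real improvement (assumed 0.80).
    positive_tail_only: if True, only alpha / 2 of null experiments
        produce a significant result in the positive direction.
    """
    if not 0.0 < prior_success < 1.0:
        raise ValueError("prior_success must be strictly between 0 and 1")
    effective_alpha = alpha / 2 if positive_tail_only else alpha
    # Significant results that come from ideas with no real effect.
    false_wins = effective_alpha * (1.0 - prior_success)
    # Significant results that come from ideas that truly work.
    true_wins = power * prior_success
    return false_wins / (false_wins + true_wins)


risk = false_positive_risk(prior_success=0.08)
print(f"{risk:.1%}")  # 26.4%
```

Under these assumptions, a lower prior success rate pushes the risk up, which is why a significant p-value on its own is weak evidence when most ideas fail.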

When to Use It

Use Trustworthy Experiments when you need to:

Design an A/B test that will produce valid, actionable results
Determine sample size and runtime for statistical power
Validate experiment results before making ship/no-ship decisions
Build an experimentation culture at your company
Choose an Overall Evaluation Criterion (OEC) that balances short-term gains with long-term value
Diagnose why results look suspicious (Twyman's Law: any figure that looks interesting or different is usually wrong)
Speed up experimentation without sacrificing validity
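
One common validation step before a ship/no-ship decision is a sample ratio mismatch (SRM) check: compare the observed split of users against the designed split. The following is a minimal sketch using a chi-square goodness-of-fit test with one degree of freedom, written with the standard library only. The function name, the user counts, and the 0.001 alert threshold are assumptions for illustration, not values from the text above.

```python
import math


def srm_p_value(
    control_users: int,
    treatment_users: int,
    expected_control_share: float = 0.5,
) -> float:
    """p-value of a chi-square test that the observed split matches the design."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_users - expected_control) ** 2 / expected_control
        + (treatment_users - expected_treatment) ** 2 / expected_treatment
    )
    # For one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts for an experiment designed as a 50/50 split.
SRM_THRESHOLD = 0.001  # assumed alert level; pick one that suits your platform

for control, treatment in [(50_200, 49_800), (50_000, 48_500)]:
    p = srm_p_value(control, treatment)
    verdict = "possible SRM, do not trust the results yet" if p < SRM_THRESHOLD else "split looks fine"
    print(f"{control} vs {treatment}: p = {p:.3g} ({verdict})")
```

A very small p-value suggests that users were lost or misassigned somewhere in the pipeline, so the metric comparison should be investigated before it is acted on.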

When Not to Use It

Don't use controlled experiments when:

You don't have enough users: reliable results typically need tens of thousands of users at minimum
The decision is one-time: you can't A/B test a merger or acquisition
There's no real user choice: for example, employer-mandated software
You need an immediate decision: experiments need time to collect data
The metric can't be measured: there is no experiment without observable outcomes
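
To see why the user requirement lands in the tens of thousands or more, here is a rough sample-size sketch for comparing two conversion rates. It uses the standard normal-approximation formula for a two-sided test. The 5% baseline rate and the 5% relative lift are hypothetical inputs chosen for illustration, and the function name is not from any particular library.

```python
import math
from statistics import NormalDist


def users_per_variant(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users needed in each variant to detect the given lift."""
    treatment_rate = baseline_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two variants.
    variance_sum = (
        baseline_rate * (1.0 - baseline_rate)
        + treatment_rate * (1.0 - treatment_rate)
    )
    delta = treatment_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / delta ** 2)


# Hypothetical scenario: 5% baseline conversion, 5% relative improvement.
n = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(f"about {n:,} users per variant")
```

With these inputs the estimate is well above 100,000 users per variant. Smaller lifts raise the requirement quickly, since the sample size grows with the inverse square of the difference to detect.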

Resources

Book:

Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu
