Experimentation Analytics

A data-team-mentor's playbook for interpreting experiment results without fooling yourself.
The result panel is the moment of truth for an experiment. The numbers on it determine whether you ship, kill, or iterate. They also expose every shortcut taken in the design phase: an underpowered test produces wide confidence intervals; a peeked test produces a too-narrow p-value; a ratio metric without delta-method correction produces overconfident lift estimates. Most ship-the-wrong-thing decisions trace back to misreading the result panel.
This skill is the discipline that prevents misreading. It assumes the experiment was designed well (see the experiment-design skill). It assumes the platform's results panel is technically correct (most modern platforms are; some older ones are not). It assumes you can read a number off a screen. The hard part is knowing what each number actually means and what it does not, and that is what this skill teaches.
When to use this skill: any time you are reading an experiment result panel and about to make a ship, kill, or iterate decision.

What this skill is for

This skill covers result interpretation, the statistical concepts that make the numbers trustworthy, and the dashboard reconciliation work that prevents executive-level confusion when the experiment number does not match the BI number. The audience is product managers and data analysts who read experiment results together and need a shared vocabulary that does not paper over the dangerous parts of statistics.
Companion skills cover the adjacent territory. The experiment-design skill covers pre-experiment thinking: hypothesis, sample size, MDE, segments, what NOT to test. Read it before designing the test; read this skill when reading the result. The feature-flagging skill covers the operational mechanics of flag management, environment promotion, and stale-flag cleanup. Together the three skills span the experimentation lifecycle from intent through interpretation. For platform-specific MCP commands, consult the chosen platform's docs; Statsig, PostHog, Optimizely, GrowthBook, Eppo, Amplitude, and Kameleoon all expose rich analytics surfaces, and this skill teaches you how to read them.

The result panel: what every modern platform should expose

A result panel that omits any of the following is a black box. Treat results from black-box platforms with extra skepticism, and consider exporting raw assignment and event data into a notebook where you can compute the missing pieces yourself.
What a competent platform exposes:
  • Variants and traffic allocation (e.g., 50/50, 33/33/33). Allocation drift across the test window indicates assignment bugs.
  • Per-variant primary metric: point estimate, confidence interval (or credible interval for Bayesian), sample size at the variant level.
  • Lift: variant minus control, expressed as both absolute change and relative percent. Both numbers matter; relative is intuitive, absolute is what shows up in revenue calculations.
  • Statistical significance: p-value (frequentist) or probability of being best (Bayesian). The methodology should be labeled clearly so you know which interpretation rules apply.
  • Variance reduction technique applied: CUPED, post-stratification, regression adjustment. If the platform applies these silently, ask which.
  • Guardrail metric statuses: each guardrail labeled green, amber, or red against its tolerance. The tolerance was set at design time; the panel just enforces it.
  • Per-segment results for pre-registered segments only. Post-hoc segment slicers are tempting and dangerous.
  • Test status: running, ended, decision filed.
  • A time series of the lift across the test window. This is where novelty effects, primacy effects, and assignment bugs become visible.
If you are looking at a result panel that hides any of these, the first move is to surface them, not to ship.
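If you do end up exporting raw data, the core computation is small. A minimal sketch, assuming a per-user export with hypothetical variant and converted columns (the synthetic data here stands in for your platform's export):
```python
import numpy as np
import pandas as pd

# Stand-in for a raw export: one row per enrolled user, variant assignment
# joined to a binary conversion outcome (column names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "variant": ["control"] * 5000 + ["treatment"] * 5000,
    "converted": np.r_[rng.binomial(1, 0.10, 5000), rng.binomial(1, 0.11, 5000)],
})

grp = df.groupby("variant")["converted"].agg(["mean", "count"])
p_c, n_c = grp.loc["control"]
p_t, n_t = grp.loc["treatment"]

lift = p_t - p_c                                             # absolute lift
se = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)  # normal approximation
print(f"absolute lift {lift:+.3%}, 95% CI [{lift - 1.96*se:+.3%}, {lift + 1.96*se:+.3%}]")
print(f"relative lift {lift / p_c:+.1%}")
```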

Confidence intervals: the most important number

The single most important number on the result panel is the confidence interval (CI) on the lift. More important than the point estimate. More important than the p-value. The CI tells you what you actually know.
What a 95% CI of [+2%, +6%] means: under repeated sampling, intervals constructed this way would contain the true effect 95% of the time. The true effect is most likely somewhere near the middle of the observed interval, but the extremes are entirely consistent with the data.
What it does not mean: it does not literally mean "there is a 95% chance the true effect is between +2% and +6%." That is the Bayesian credible interval, which often gives similar numerical answers but is conceptually different. PMs can usually live with the loose intuition; analysts should know the precise version when defending a number to a skeptic.
The width of the CI matters more than the center for most ship decisions. A wide CI means you do not know much yet. A narrow CI means you know with precision. The point estimate is your best guess; the width is your humility.
Practical decision rules, in order of importance:
  1. If the CI includes zero AND a meaningful positive number (say [-1%, +5%]), you do not have enough data to ship. Period. The point estimate may look favorable, but the data is consistent with no effect and consistent with a meaningful win. You cannot tell which.
  2. If the CI is all-positive (lower bound greater than zero, e.g., [+1%, +4%]), there is a real effect. Now evaluate magnitude: is the lower bound large enough to be worth the implementation cost?
  3. If the CI is all-negative (upper bound less than zero, e.g., [-5%, -1%]), there is real harm. Kill the test.
  4. If the CI straddles zero but is narrow (e.g., [-0.5%, +0.5%]), this is a real null result. The effect is small enough to call essentially zero. Useful information; do not ship the change for "lift" reasons (you found none) but do not panic about harm either.
  5. If the CI straddles zero and is wide (e.g., [-5%, +8%]), the test is inconclusive. The data is consistent with a moderate win, no effect, or a moderate loss. Run longer, run bigger, or accept that the question cannot be answered at the available traffic.
For a worked-example cheatsheet, see references/confidence-interval-cheatsheet.md.
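To make the rules mechanical, here is a minimal sketch mapping a lift CI to a decision label; rules 1 and 5 collapse into the same inconclusive branch. The mde threshold and label wording are illustrative assumptions, not platform conventions:
```python
def ci_decision(lo: float, hi: float, mde: float = 0.01) -> str:
    """Map a lift CI [lo, hi] to a decision label. mde is the smallest
    effect worth shipping; 1% here is an illustrative default."""
    if lo > 0:
        return "real effect: ship if the LOWER bound clears the cost bar"
    if hi < 0:
        return "real harm: kill"
    if hi - lo <= 2 * mde:
        return "tight null: no lift to ship for, no harm to fear"
    return "inconclusive: consistent with a win and with nothing; get more data"

for ci in [(0.01, 0.04), (-0.05, -0.01), (-0.005, 0.005), (-0.01, 0.05), (-0.05, 0.08)]:
    print(ci, "->", ci_decision(*ci))
```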

P-values: what they mean and what they do not

The p-value is the probability of observing the lift you saw (or a larger one) IF the true effect were zero. A p-value of 0.04 means: under the null hypothesis of no effect, you would see this much lift purely by chance about 4% of the time.
What the p-value does not mean, despite frequent abuse:
  • It does not mean "there is a 96% chance the treatment works." That sentence has no defensible interpretation; the p-value is computed under the assumption that the treatment does NOT work, so it cannot tell you the probability that it does.
  • It does not mean "the effect is large." A tiny effect tested against a huge sample can produce a vanishingly small p-value. The p-value is about the strength of evidence against the null, not the size of the effect.
  • It does not mean "the result will replicate." A p-value of 0.04 is associated with replication rates well below 50% in most published research; statistical significance is not reproducibility.
The 0.05 cutoff is convention, not law. If you pre-committed to alpha equals 0.05, follow it; the discipline of pre-commitment is more valuable than the specific threshold. If you did not pre-commit, p equals 0.06 is not categorically different from p equals 0.04, and treating it as such is theater.
Always read the CI alongside the p-value. The p-value tells you about the null hypothesis; the CI tells you about the magnitude. Both matter; neither is sufficient alone. A p-value of 0.001 with a CI of [+0.1%, +0.3%] is a real but practically tiny effect; a p-value of 0.08 with a CI of [-1%, +12%] is a noisy estimate that could be huge or zero. The former is technically significant and not worth shipping; the latter is technically not significant and you might still want to dig deeper.
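The significant-but-tiny case is easy to reproduce. A hedged sketch using statsmodels (the counts are invented to illustrate the point):
```python
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Huge sample, tiny true effect: "significant" but not worth shipping.
conv = [101_000, 100_000]        # treatment, control conversions (invented)
n    = [1_000_000, 1_000_000]    # users per arm

stat, pval = proportions_ztest(conv, n)
lo, hi = confint_proportions_2indep(conv[0], n[0], conv[1], n[1], compare="diff")
print(f"p = {pval:.3f}")                                    # ~0.02: significant
print(f"95% CI on absolute lift: [{lo:+.3%}, {hi:+.3%}]")   # ~[+0.02%, +0.18%]: tiny
```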
The peeking problem applies to p-values directly. Standard p-values assume one analysis at the end of the test. Multiple analyses inflate the false positive rate. Modern platforms with sequential testing report "always-valid" or "anytime-valid" p-values that survive peeking; older platforms do not. Know which you are looking at. See references/p-value-interpretation-guide.md for deeper coverage.

Multiple testing corrections

The problem in one sentence: with twenty independent comparisons at alpha equals 0.05, you expect one false positive purely by chance. With fifty comparisons, two or three. With a hundred, five.
Where multiple testing creeps in unintentionally:
  • Multiple variants: A vs B vs C vs D is three pairwise comparisons against control, not one.
  • Multiple metrics: tracking primary plus six guardrails plus three secondary metrics is ten chances to find significance somewhere.
  • Multiple segments: ten segments times three metrics is thirty chances.
  • Multiple time windows: looking at week 1, week 2, week 3, and the full test is four chances per metric per segment.
Two correction methods you should know:
  • Bonferroni correction. Divide alpha by the number of comparisons. Conservative; controls the family-wise error rate (the probability of even one false positive across the family of tests). Use when false positives are catastrophic.
  • Benjamini-Hochberg (BH) correction. Less conservative; controls the false discovery rate (the expected proportion of false positives among the things called significant). Better for exploratory analysis.
Most modern platforms support these natively as a configuration option. Configure them, or know which the platform applies by default. Some platforms apply corrections silently to the displayed p-values; others report uncorrected numbers and expect you to do the math.
The PM-friendly heuristic: pre-register your primary metric and primary segment. Treat everything else as exploratory. Findings in non-primary metrics or non-primary segments require larger effects, replication in a follow-up test, or both before they justify shipping. The discipline of designating primary up front protects you from the multiple-testing trap better than any correction formula.
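Both corrections are one call in statsmodels. A small sketch with invented p-values chosen so the two methods disagree:
```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Ten raw p-values: a primary plus guardrails and secondaries (invented).
pvals = np.array([0.003, 0.008, 0.012, 0.014, 0.11, 0.19, 0.27, 0.41, 0.63, 0.88])

reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_bh,   _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("raw 'significant':     ", int((pvals < 0.05).sum()))  # 4
print("Bonferroni significant:", int(reject_bonf.sum()))     # 1 (alpha/10 = 0.005)
print("BH significant:        ", int(reject_bh.sum()))       # 4, at a controlled FDR
```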

Sequential testing math

The peeking problem revisited from the analytics side. Classical t-tests, z-tests, and proportion tests assume one analysis at the end of a test. If you analyze the data five times during the test and stop the moment you see significance, your false positive rate is much higher than the nominal 5%. With daily peeking on a four-week test, false positive rate can climb above 30%.
Sequential testing methods adjust the math to allow continuous monitoring without inflating false positives. The names you will encounter:
  • mSPRT (mixture sequential probability ratio test). Produces "always-valid" p-values: you can look any time, and the false positive rate stays at the nominal alpha. Common in modern platforms.
  • Group sequential designs. Plan a fixed set of interim analyses (say, three) with adjusted significance thresholds at each. Less flexible than mSPRT but well-understood.
  • Anytime-valid confidence intervals. The CI version of always-valid p-values. Wider than fixed-horizon CIs by design; the cost of peek-safety is some statistical efficiency.
Modern platforms with sequential testing built in: Statsig (always-valid by default), Eppo (sequential CIs as a configuration), parts of PostHog (depending on the experiment type), GrowthBook (mSPRT supported). If your panel says "always-valid p-value" or "anytime-valid CI," that is sequential. If it says "p-value" with no qualifier, it is probably fixed-horizon and peeking inflates false positives.
If your platform does not support sequential testing, the discipline is: pre-commit to a single analysis date at launch, do not peek, and if you must peek, do not make decisions based on the peek. Save the decision-making for the pre-committed date. This is hard. Sequential testing makes it easier.
The trade-off: sequential tests have wider CIs and less aggressive p-values than fixed-horizon tests at the same sample size. The cost is real but usually worth it for PM contexts where the discipline of not peeking is impractical.
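You can verify the peeking inflation yourself with an A/A simulation. A sketch under stated assumptions (two identical arms, daily looks, naive fixed-horizon z-test at each look):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeked_false_positive_rate(days=28, users_per_day=500, sims=1000, alpha=0.05):
    """A/A simulation: both arms share the same true rate, yet daily peeking
    with a fixed-horizon z-test 'finds' significance far more often than alpha."""
    z_crit = norm.ppf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        a = np.cumsum(rng.binomial(1, 0.10, days * users_per_day))
        b = np.cumsum(rng.binomial(1, 0.10, days * users_per_day))
        for day in range(1, days + 1):
            n = day * users_per_day
            pa, pb = a[n - 1] / n, b[n - 1] / n
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(pa - pb) / se > z_crit:
                hits += 1
                break
    return hits / sims

# Nominal alpha is 5%; with 28 daily peeks the realized rate lands far higher.
print(peeked_false_positive_rate())
```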

CUPED variance reduction

CUPED (Controlled-experiment Using Pre-Experiment Data) uses pre-experiment behavior of the same users to subtract out their baseline, leaving a cleaner signal of the treatment effect. The result is the same point estimate with a much narrower confidence interval, often 30% to 50% narrower, which is equivalent to roughly doubling your effective sample size for free.
When CUPED helps:
  • Metrics with high pre-experiment baseline variance: revenue per user, sessions per user, engagement minutes, content consumption.
  • Tests where users have meaningful history (logged-in users, users from before a certain date).
  • Long-running products where pre-experiment behavior strongly predicts in-experiment behavior.
When CUPED does not help:
  • Brand-new users with no pre-experiment data.
  • One-time conversion metrics with no pre-experiment baseline (a user either signs up or does not; there is nothing to subtract).
  • Tests where pre-experiment behavior is essentially uncorrelated with the metric being moved.
How to read CUPED-adjusted results: the point estimate is roughly the same as the unadjusted estimate; the CI is narrower. If a result switched from "significant with CUPED" to "not significant without CUPED," that is normal and the CUPED-adjusted version is the more powerful test. If a result went the other way (significant without CUPED, not significant with), the unadjusted result was probably noise that CUPED correctly removed.
A common confusion: "CUPED made our lift smaller, so we should ship the unadjusted version." This is wrong. CUPED reduces variance, not point estimates. If the lift looks smaller after CUPED, the unadjusted lift was probably inflated by chance correlation with pre-experiment behavior; the CUPED estimate is closer to the true effect. Trust the adjusted version.
Platform support: Statsig, Eppo, GrowthBook, parts of PostHog, Amplitude (Experiment product). Optimizely has equivalent variance reduction. If your platform offers CUPED, turn it on for any metric where pre-experiment data exists.
For deeper coverage of CUPED, the delta method, and other statistical methods, see references/statistical-method-reference.md.
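For analysts: CUPED is a few lines once the pre-period covariate exists. A minimal sketch on synthetic data, with theta estimated on the pooled sample as is standard:
```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

pre_c, pre_t = rng.gamma(2, 10, n), rng.gamma(2, 10, n)   # pre-period metric per user
y_c = 0.5 * pre_c + rng.normal(0, 5, n)                   # in-experiment outcome, control
y_t = 0.5 * pre_t + rng.normal(0, 5, n) + 1.0             # treatment, +1.0 true effect

# One theta, estimated on the pooled sample, applied to both arms.
c = np.cov(np.r_[y_c, y_t], np.r_[pre_c, pre_t])
theta = c[0, 1] / c[1, 1]
pooled_mean = np.r_[pre_c, pre_t].mean()
adj_c = y_c - theta * (pre_c - pooled_mean)
adj_t = y_t - theta * (pre_t - pooled_mean)

print("raw lift:    ", y_t.mean() - y_c.mean())       # same point estimate...
print("CUPED lift:  ", adj_t.mean() - adj_c.mean())   # ...in expectation
print("variance cut:", 1 - adj_c.var() / y_c.var())   # large when pre predicts post
```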

Heterogeneous treatment effects (HTE) and segments

A heterogeneous treatment effect is when the treatment works differently for different segments. New users see +5%, power users see -2%, average is +1%. This is common, often interesting, and easy to over-read.
Why HTE is tempting: "let us just ship to new users." Why it is dangerous: post-hoc segment discovery is often noise; targeting infrastructure costs are real; UI complexity from per-segment behavior compounds over time.
The right way to handle HTE:
  • Pre-register the segments you care about before launching. Two or three is plenty; ten is overfitting waiting to happen.
  • If a pre-registered segment shows a meaningfully different effect, treat it as evidence worth following up. Not as a green light to ship to that segment alone.
  • Before shipping segment-only behavior, ask: is the segment large enough to matter, is the effect large enough to justify the targeting work, and can we build the targeting cleanly in production?
  • If the segment was discovered post-hoc by slicing the data many ways, treat the finding as a hypothesis for a follow-up test. Do not ship based on one observation.
HTE versus simple averaging: when segments have meaningfully different effects, the average underrepresents both. A treatment that is +10% for half the population and -5% for the other half averages to +2.5%, which understates the win for half and ignores the loss for the other. The average is still the right number for "ship to everyone" decisions; per-segment numbers are the right inputs for "should we invest in segment-specific targeting" decisions.
Practical heuristic: if pre-registered segments show meaningfully different effects, write a follow-up hypothesis. Run a follow-up test that confirms the segment behavior with appropriate power. Then decide on segment-specific shipping. The cost of the follow-up test is much smaller than the cost of shipping segment-targeted behavior that does not actually work.

Ratio metrics and the delta method

Most "rate" metrics are ratios. Conversion rate equals conversions divided by users. Click-through rate equals clicks divided by impressions. Revenue per user equals revenue divided by users. Average order value equals revenue divided by orders.
Why this matters for analytics: standard variance estimation for sums does not apply directly to ratios. Naive variance estimation on ratios produces incorrectly narrow confidence intervals and inflated false positive rates. You ship things that look significant but are not, because the math under the hood was wrong.
Two correct approaches:
  • Delta method. A calculus-based correction (Taylor expansion linearization) that produces correct CIs and p-values for ratio metrics. Fast, well-understood, what most modern platforms use.
  • Bootstrap. A simulation-based alternative: resample users with replacement many times, compute the ratio each time, and read off the empirical distribution. Distribution-free and intuitive; slower than delta method.
How to verify your platform: ask "what variance estimator do you use for ratio metrics?" If the answer is "standard t-test on proportions," that is wrong for any ratio that is not a simple binary conversion (one row per user, converted yes or no). If the answer is "delta method," "linearization," or "bootstrap," correct.
Most modern platforms (Statsig, Eppo, PostHog, Optimizely, GrowthBook) handle this correctly. Older or homegrown platforms often do not. The risk is silent: the panel shows confidence intervals that look reasonable but are too narrow, and you ship changes that do not produce the claimed effect.
Worked example. Revenue per user is a ratio: total revenue divided by total users. Suppose treatment shows a 5% lift with a CI that excludes zero under the wrong (naive) variance estimator. Under the correct delta-method estimator, the same point estimate has a CI that includes zero. The wrong math says "significant, ship." The correct math says "inconclusive, run longer or accept noise." Shipping based on the wrong math means shipping changes that do not produce the claimed effect in production, then puzzling over why the launch did not move the dashboard.
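The delta-method variance itself is short. A sketch for a click-through-style ratio with the user as the randomization unit (synthetic data; names are illustrative):
```python
import numpy as np

def ratio_mean_and_se(x, y):
    """Delta-method SE for a ratio metric mean(x)/mean(y), with the USER as
    the unit of analysis (x = clicks per user, y = impressions per user)."""
    n = len(x)
    r = x.mean() / y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cxy = np.cov(x, y)[0, 1]
    var_r = (vx - 2 * r * cxy + r * r * vy) / (y.mean() ** 2 * n)
    return r, np.sqrt(var_r)

rng = np.random.default_rng(2)
imps = rng.poisson(20, 50_000)     # impressions per user
clicks = rng.binomial(imps, 0.05)  # clicks per user
r, se = ratio_mean_and_se(clicks.astype(float), imps.astype(float))
print(f"CTR {r:.4f} +/- {1.96 * se:.4f} (95% CI half-width, delta method)")
```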

Bayesian vs frequentist results panels

Frequentist panels (most older platforms, parts of most modern ones) show p-values, confidence intervals, and "statistically significant" labels. Bayesian panels (Eppo by default, Statsig as an option, parts of PostHog) show probability of being best, credible intervals, and posterior distributions.
For most PM contexts, both approaches produce similar ship-or-kill decisions when the experiment was designed correctly. The vocabulary differs; the underlying judgment is similar.
Bayesian advantages:
  • Natural multi-variant comparison: "variant B has 87% probability of being best" reads more naturally than the equivalent Bonferroni-corrected p-value soup.
  • Peek-safe by construction: posterior probabilities are valid at any time without sequential corrections.
  • More intuitive for stakeholders unfamiliar with hypothesis testing.
  • Allows informative priors when you have prior knowledge (rare in practice for shipping decisions).
Frequentist advantages:
  • Better-understood: most analytics teams have stronger frequentist training.
  • Clearer pre-registration semantics: "alpha equals 0.05, MDE equals 5%, decision rule X" is unambiguous.
  • Longer track record in regulated contexts.
Mixing approaches across experiments is fine, and most platforms (Eppo, Statsig) let you toggle. But pick one per experiment and stick with it; do not switch mid-flight to chase a more favorable interpretation. That is the Bayesian-frequentist version of p-hacking and corrodes the discipline.
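For intuition, "probability of being best" for conversion metrics is a straightforward Monte Carlo over Beta posteriors. A sketch with invented counts and a flat Beta(1,1) prior:
```python
import numpy as np

rng = np.random.default_rng(3)

# Conversions / users per variant (invented counts for illustration).
data = {"control": (500, 10_000), "B": (545, 10_000), "C": (520, 10_000)}

# Flat Beta(1,1) prior + binomial likelihood -> Beta posterior per variant.
draws = {name: rng.beta(1 + conv, 1 + n - conv, 100_000)
         for name, (conv, n) in data.items()}

best = np.argmax(np.vstack(list(draws.values())), axis=0)
for i, name in enumerate(draws):
    print(f"P({name} is best) = {(best == i).mean():.1%}")
```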

Network effects and SUTVA violation

SUTVA (Stable Unit Treatment Value Assumption) is the technical name for "one user's treatment does not affect another user's outcome." When SUTVA holds, standard A/B test math works. When it is violated, the math systematically understates or overstates the true effect.
When SUTVA is violated:
  • Two-sided marketplaces. Changes to buyers shift seller behavior, which affects control buyers competing for the same supply. Treatment buyers' actions leak into the control group's experience.
  • Social products. A treatment user's changed behavior affects the experience of users they interact with, regardless of which group those friends are in.
  • Supply-constrained features. When supply is limited, treatment users compete with control users for the same scarce resource, and the test cannot cleanly separate treatment effect from substitution.
  • Notification systems. A treatment that increases notification volume changes user behavior across the whole platform, not just in the treatment cell.
The "interference dampens lift" pattern is the most common version. In marketplace experiments, if your treatment looks small, the true effect (in the absence of interference) may be 2x to 3x larger than what the standard A/B test reports. Killing a winning test because the leaked-into-control lift looked small is a common mistake.
Detection methods:
  • Switchback experiments. Toggle the entire population between treatment and control across time windows (week 1 treatment, week 2 control, etc.). Eliminates cross-user interference within each window; requires careful temporal modeling.
  • Cluster randomization. Assign whole units (cities, markets, friend groups) rather than individual users. Eliminates within-cluster interference at the cost of effective sample size.
  • Geographic experiments. Launch in some regions, hold others. Slow and expensive but interference-clean.
When in doubt about whether interference is present, run a small cluster-randomized version as a check. If the cluster-randomized lift is much larger than the user-randomized lift, you have evidence of interference and should rerun the main test as cluster-randomized.
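Mechanically, cluster randomization changes only the unit you hash for assignment. A minimal sketch (the experiment name is hypothetical; real platforms handle this in their SDKs):
```python
import hashlib

def cluster_assign(cluster_id: str, experiment: str, treat_frac: float = 0.5) -> str:
    """Deterministic cluster-level assignment: every user in a market gets
    the same arm, which removes within-cluster interference by construction."""
    h = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
    return "treatment" if bucket < treat_frac else "control"

for city in ["austin", "berlin", "tokyo", "lagos"]:
    print(city, cluster_assign(city, "marketplace-fees-v2"))
```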

Dashboard metric vs experiment metric reconciliation

The scenario: your business intelligence dashboard shows revenue grew 8% last week. Your experiment platform shows the treatment lifted revenue 2%. How can both be true?
Likely answers, ranked by frequency:
  • Different denominators. The dashboard shows all users; the experiment shows just enrolled users. Most experiments enroll a subset; the lift only applies to that subset.
  • Different time windows. The dashboard is rolling 7 days; the experiment is fixed start to end. The two windows can move independently.
  • External effects. Marketing campaigns, seasonality, competitor activity, news cycles. The experiment correctly excludes these by random assignment; the dashboard reflects them.
  • Selection effects. Who gets enrolled in the experiment matters. New users only? Logged-in users only? Users who passed a feature flag check? Each filter changes the population the lift applies to.
  • Different metric definitions. "Revenue" might be gross in one place and net of refunds in another. "Conversions" might count differently across systems.
Reconciliation discipline: never report experiment results as "the feature drove $X in revenue this week." Always report as "the feature lifted enrolled-user revenue Y% during the test period." The first phrasing implies a company-wide impact that the experiment cannot measure; the second phrasing is precise about what the experiment actually showed.
The "blended attribution" trap is the most common reconciliation failure. PM takes the experiment lift (say +2% revenue per user) and multiplies by the total user base for a company-wide impact estimate ("$10M in incremental revenue"). This is wrong twice over. The lift only applies to enrolled users (typically 10% to 50% of the base). Even within the enrolled group, the lift was measured during the test conditions; long-term and at full scale, the effect can be different. The right phrasing is "during the four-week test, enrolled users (about 30% of the active base) showed a 2% revenue-per-user lift relative to control." Then leadership can do the careful arithmetic of how that translates at full launch.
For deeper reconciliation patterns and stakeholder-facing language, see references/dashboard-vs-experiment-reconciliation.md.

Long-term effect estimation

Most A/B tests run for two to four weeks. Many feature decisions need a thirty- to ninety-day understanding because behavior changes over longer windows: novelty fades, primacy fades, retention impacts emerge, network effects compound.
Three patterns for long-term measurement:
  • Holdout groups. Keep a percentage of users (often 5% to 10%) on the control treatment for thirty or more days post-launch. Compare their long-term behavior to the launched-to users. The comparison measures the long-term effect cleanly because the holdout was randomized at the same time as the original test.
  • Geo experiments. Launch the change in some markets and hold others. Measure long-term differences across market pairs. Slow, expensive, and the markets need to be comparable, but interference-clean and capable of measuring effects that user-randomization cannot.
  • Difference-in-differences (diff-in-diff). Pre-launch versus post-launch measurement in a treated market versus a control market. Useful when you cannot randomize at all (regulatory rollouts, partner-specific changes). Weaker than randomized methods but defensible when randomization is impossible.
Practical heuristic: any feature with novelty risk (new UI, new mechanic, new pricing, new notification cadence) deserves a thirty-day holdout. Set it up at launch, not later; setting it up later requires unwinding the launch and is rarely done in practice. The cost of the holdout is small (a small percentage of users on the old experience for a month); the value is the ability to detect long-term degradation before it compounds.
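To make the diff-in-diff arithmetic from the list above concrete, a toy sketch with invented market-level numbers:
```python
# Market-level weekly revenue, indexed to 100 pre-launch (invented numbers).
treated_pre, treated_post = 100.0, 112.0   # market that got the launch
control_pre, control_post = 100.0, 106.0   # comparable held-out market

did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"diff-in-diff estimate: {did:+.1f} points net of the shared trend")
```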

Common interpretation failures

Rapid-fire reference. Each pattern is described in more detail in references/common-interpretation-failures.md.
  • "P equals 0.04, ship it." No consideration of CI width, magnitude, or guardrails. Significance is a necessary condition, not a sufficient one.
  • "We saw +5% on day 3, ending early." Peeking, novelty effect, or both. Day-3 lifts are routinely larger than day-14 lifts; ending early ships noise.
  • "The new flow worked for power users (post-hoc segment)." Noise mining. The segment was found by slicing the data multiple ways; the apparent lift will not replicate.
  • "Our experiment said +2%, but launch only delivered +0.5%." Often the experiment was correct: the lift applied to enrolled users in test conditions, and at full launch the effect is diluted by non-enrolled users, novelty fade, or interference effects the test could not capture.
  • "Revenue went up but the CI for retention straddled zero so we ignored it." A guardrail violation that was reframed as "no signal." If retention is a guardrail, the guardrail is binding; "we did not see a clear signal" is not the same as "we have evidence of safety."
  • "We CUPED-adjusted away a real treatment effect." CUPED reduces variance, not the point estimate. If the lift looked smaller after CUPED, the unadjusted lift was probably noise that CUPED correctly removed. Trust the adjusted version.
  • "Two segments showed opposite effects; we shipped to the better segment." Likely overfitting unless both segments were pre-registered and the targeting infrastructure exists. Re-test before shipping.
  • "We ran the test, it was inconclusive, but the trend was directional so we shipped." Inconclusive is inconclusive. Directional patterns are not evidence; they are wishful reading of noise.

The framework: 14 considerations for trustworthy experiment interpretation

Trustworthy interpretation sits at the intersection of fourteen considerations. Each is covered in detail in its own section above.
  1. Result panel completeness. The panel exposes variants, allocation, per-variant metrics, lifts, significance, variance reduction, guardrails, segments, time series, and status. Missing pieces are red flags.
  2. Confidence interval reading. Width matters more than center. Five practical decision rules cover the cases where CI excludes zero, includes zero narrowly, and includes zero widely.
  3. P-value semantics. The probability of seeing the lift if the null is true. Not the probability that the treatment works. Always read alongside the CI.
  4. Multiple testing corrections. Bonferroni for family-wise control, BH for false discovery rate. Pre-register primary metrics and segments; treat the rest as exploratory.
  5. Sequential testing math. Always-valid p-values and anytime-valid CIs survive peeking. Fixed-horizon p-values inflate false positives under daily monitoring.
  6. CUPED variance reduction. Same point estimate, narrower CI. Use whenever pre-experiment data is available and informative.
  7. Heterogeneous treatment effects. Pre-register segments. Treat post-hoc segment effects as hypotheses for follow-up tests, not as ship signals.
  8. Ratio metrics. Use delta method or bootstrap. Naive proportion estimators understate variance and inflate false positives.
  9. Bayesian vs frequentist. Pick one per experiment and stick with it. Both produce similar ship decisions when the experiment was designed correctly.
  10. Network effects and SUTVA. Marketplace and social products often violate SUTVA. Cluster randomization, switchback, or geographic experiments when interference is suspected.
  11. Dashboard reconciliation. Different denominators, time windows, external effects, and metric definitions explain most disagreements between BI dashboards and experiment panels.
  12. Long-term effect estimation. Holdouts, geo experiments, and diff-in-diff for effects that emerge over thirty to ninety days.
  13. Common interpretation failures. A pattern catalog of the most frequent ways smart teams ship wrong things.
  14. The discipline of inconclusive. The hardest call is "we do not have enough signal to ship." It is often the right call.
The sections above expand each consideration in turn. Read the relevant section before making the decision, not after.

Reference files

  • references/confidence-interval-cheatsheet.md. How to read a CI, what to ignore, the five decision rules with worked examples for each.
  • references/p-value-interpretation-guide.md. What it means, what people pretend it means, the 0.05 convention, the peeking problem, and the multiple-testing context.
  • references/statistical-method-reference.md. CUPED, the delta method, sequential testing methods (mSPRT, group sequential, anytime-valid), HTE handling, multiple testing corrections, cluster randomization. The technical reference for analysts.
  • references/dashboard-vs-experiment-reconciliation.md. Why the BI number does not match the experiment number, the blended-attribution trap, and how to communicate the difference to non-statistical stakeholders.
  • references/result-presentation-templates.md. Five templates for stakeholder communication: clear win, clear loss, inconclusive (the most common, hardest), mixed (positive primary with ambiguous guardrail), and long-term holdout report.
  • references/analytics-platform-comparison.md. Profiles of seven major platforms (Statsig, PostHog, Optimizely, GrowthBook, Eppo, Amplitude, Kameleoon) covering what each exposes, what each hides, and the gotchas in each.
  • references/common-interpretation-failures.md. Fifteen-plus failure patterns: name, symptom, root cause, fix, prevention.

Closing: the courage to call it inconclusive

Most experiment results are not clear ship-or-kill. They are inconclusive: lift exists but the CI straddles zero; lift is real but does not justify the implementation cost; lift is significant but a guardrail is concerning; lift looks great in one segment but the segment is small or the targeting infrastructure does not exist.
The hardest decision a PM makes in a given week is often "we do not have enough signal to ship." The temptation to lower the bar after seeing the result is enormous. The discipline of saying "inconclusive" is the discipline of caring about being right more than being decisive.
The same discipline applies to analysts. The temptation to slice the data one more way until something looks significant, to apply one more variance reduction technique, to switch from frequentist to Bayesian mid-test, is constant. Each individual move feels like a small judgment call. Cumulatively they destroy the integrity of the analysis. The result is not "we found a real effect"; the result is "we ran enough analytical knobs that something looked significant."
Inconclusive is a valid outcome. The lesson, if there is one, is for the next hypothesis: the effect is smaller than expected, the segment matters, the test was underpowered, the metric was the wrong one. Use the inconclusive result to design a better next test. Do not retrofit a story onto the current one.
For pre-experiment design discipline, see the experiment-design skill. For the operational mechanics of feature flags that deliver the variants, see the feature-flagging skill. For platform-specific MCP commands and example prompts, consult the chosen platform's documentation.