bayesian-ab-testing

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->
<!-- 捆绑文件(可通过${CLAUDE_SKILL_DIR}访问): - SKILL.md — 本文件 - scripts/demo.py — 可运行的marimo笔记本,包含示例 -->

Bayesian A/B Testing with PyMC

使用PyMC进行贝叶斯A/B测试

For comparing two variants on a conversion metric, use Bayesian A/B testing. It directly answers the questions stakeholders actually ask — "What's the probability B is better?" and "How much do we lose if we're wrong?" — without the fragile rituals of frequentist hypothesis testing (no p-values, no fixed sample sizes, no "don't peek" rules).
针对转化指标比较两个变体时,使用贝叶斯A/B测试。它直接回答利益相关者真正关心的问题——“B比A好的概率是多少?”以及“如果我们判断错误,会损失多少?”——无需频率学派假设检验的繁琐流程(没有p值,没有固定样本量,没有“不要中途查看结果”的规则)。

When to use this skill

何时使用该技能

  • Two variants (A/B) with a binary outcome (converted / didn't)
  • You want to know P(B > A) and expected loss, not a p-value
  • You want to monitor the experiment continuously without inflating error rates (Bayesian posteriors are always valid — peeking is free)
  • You have unequal sample sizes across arms
  • Stakeholders need a decision framework: "ship B", "keep A", or "keep collecting data"
  • 两个变体(A/B),结果为二元类型(转化/未转化)
  • 你想了解P(B > A)预期损失,而非p值
  • 你想持续监控实验,而不会提高错误率(贝叶斯后验始终有效——中途查看结果完全没问题)
  • 不同实验组的样本量不均等
  • 利益相关者需要决策框架:“上线B”、“保留A”或“继续收集数据”

When NOT to use this skill

何时不使用该技能

  • More than two variants → extend to multi-arm (or use bandits skill)
  • Continuous outcome (revenue per user, time on page) → use
    bayesian-regression
    with a Normal/LogNormal likelihood
  • You want to adaptively allocate traffic during the experiment → use
    bayesian-bandits
    (Thompson sampling)
  • The metric isn't conversion (binary) — e.g., count data (page views per session) → use a Poisson or Negative Binomial likelihood
  • 变体数量超过两个 → 扩展为多臂测试(或使用bandits技能)
  • 结果为连续型(每用户收入、页面停留时间)→ 使用
    bayesian-regression
    ,搭配Normal/LogNormal似然函数
  • 你想在实验期间动态分配流量 → 使用
    bayesian-bandits
    (汤普森采样)
  • 指标不是转化(二元)——例如计数数据(每会话页面浏览量)→ 使用Poisson或Negative Binomial似然函数

Project layout

项目布局

<project>/
├── data/                # input parquet/csv (or generate in-notebook)
├── src/
│   ├── train.py         # PyMC model fit → MLflow log
│   ├── predict.py       # reload idata, compute decision metrics
│   └── plots.py         # posterior, trace, loss, ROPE visualizations
├── notebooks/
│   └── demo.py          # marimo walkthrough
└── mlruns/              # MLflow tracking store (gitignored)
<project>/
├── data/                # 输入parquet/csv文件(或在笔记本中生成)
├── src/
│   ├── train.py         # 训练PyMC模型 → 记录到MLflow
│   ├── predict.py       # 重新加载idata,计算决策指标
│   └── plots.py         # 后验分布、迹图、损失、ROPE可视化
├── notebooks/
│   └── demo.py          # marimobes分步 risks plain�FAQ  计算Vehhen�lnBuilderWorld(_args)
└── mlruns/              # MLflow跟踪存储(已加入git忽略)

Data format

数据格式

The model needs four numbers — that's it:
FieldTypeDescription
n_a
intVisitors assigned to control (A)
conversions_a
intConversions in control
n_b
intVisitors assigned to treatment (B)
conversions_b
intConversions in treatment
If the buyer has row-level data (one row per visitor with a 0/1 outcome column and a variant column), aggregate first:
python
import ibis

table = ibis.duckdb.connect().read_parquet("data/experiment.parquet")
summary = (
    table
    .group_by("variant")
    .aggregate(
        visitors=table.count(),
        conversions=table.converted.sum().cast("int64"),
    )
    .execute()
)
n_a = int(summary.loc[summary.variant == "control", "visitors"].iloc[0])
conversions_a = int(summary.loc[summary.variant == "control", "conversions"].iloc[0])
n_b = int(summary.loc[summary.variant == "treatment", "visitors"].iloc[0])
conversions_b = int(summary.loc[summary.variant == "treatment", "conversions"].iloc[0])
该模型只需要四个数值——仅此而已:
字段类型描述
n_a
int分配给对照组(A)的访客数
conversions_a
int对照组的转化数
n_b
int分配给实验组(B)的访客数
conversions_b
int实验组的转化数
如果用户有行级数据(每行对应一个访客,包含0/1结果列和变体列),请先进行聚合:
python
import ibis

table = ibis.duckdb.connect().read_parquet("data/experiment.parquet")
summary = (
    table
    .group_by("variant")
    .aggregate(
        visitors=table.count(),
        conversions=table.converted.sum().cast("int64"),
    )
    .execute()
)
n_a = int(summary.loc[summary.variant == "control", "visitors"].iloc[0])
conversions_a = int(summary.loc[summary.variant == "control", "conversions"].iloc[0])
n_b = int(summary.loc[summary.variant == "treatment", "visitors"].iloc[0])
conversions_b = int(summary.loc[summary.variant == "treatment", "conversions"].iloc[0])

The model — Beta-Binomial

模型——Beta-Binomial

python
import pymc as pm

with pm.Model() as ab_model:
    # Priors — Beta(1,1) = uniform if no prior knowledge
    # Use informative priors if you have historical conversion rates
    p_a = pm.Beta("p_A", alpha=1, beta=1)
    p_b = pm.Beta("p_B", alpha=1, beta=1)

    # The quantity of interest: absolute lift
    delta = pm.Deterministic("delta", p_b - p_a)
    pm.Deterministic("relative_lift", (p_b - p_a) / p_a)

    # Likelihood — use Binomial with sufficient statistics,
    # NOT N independent Bernoulli observations
    pm.Binomial("obs_A", n=n_a, p=p_a, observed=conversions_a)
    pm.Binomial("obs_B", n=n_b, p=p_b, observed=conversions_b)

    idata = pm.sample(
        draws=2000, tune=1000, chains=4,
        random_seed=42, progressbar=False,
    )
Why Binomial, not Bernoulli? The likelihood is mathematically identical, but the sampler operates on 4 numbers instead of N observations. Much faster, and it documents the right move when sufficient statistics exist.
Why PyMC when this is conjugate? Real A/B tests often need non-conjugate extensions (covariates, segments, time-varying rates). The PyMC pattern transfers unchanged. Verify against the closed-form answer once to build trust, then move on.
python
import pymc as pm

with pm.Model() as ab_model:
    # 先验分布——如果没有先验知识,Beta(1,1) = 均匀分布
    # 如果有历史转化率数据,使用有信息先验
    p_a = pm.Beta("p_A", alpha=1, beta=1)
    p_b = pm.Beta("p_B", alpha=1, beta=1)

    # 关注的指标:绝对提升量
    delta = pm.Deterministic("delta", p_b - p_a)
    pm.Deterministic("relative_lift", (p_b - p_a) / p_a)

    # 似然函数——使用Binomial和充分统计量,
    # 而非N个独立的Bernoulli观测值
    pm.Binomial("obs_A", n=n_a, p=p_a, observed=conversions_a)
    pm.Binomial("obs_B", n=n_b, p=p_b, observed=conversions_b)

    idata = pm.sample(
        draws=2000, tune=1000, chains=4,
        random_seed=42, progressbar=False,
    )
为什么用Binomial而不是Bernoulli? 从数学角度看,似然函数是完全相同的,但采样器只处理4个数值,而非 member GregoryBUFFER(友谊bes留下rom试 伊凡 Walking适合章节点餐系统),速度快得多,并且当充分统计量存在时,这是更合理的做法。
明明是共轭模型,为什么用PyMC? 实际的A/B测试通常需要非共轭扩展(协变量、细分群体、随时间变化的转化率)。PyMC的模式可以无缝迁移。先通过闭式解验证一次建立信任,之后就可以放心使用。

Priors — when to be informative

先验分布——何时使用有信息先验

SituationPriorWhy
No idea what to expectBeta(1, 1)Uniform on [0, 1]
Typical web conversion (~3-5%)Beta(3, 97)Concentrates around 3%
Strong historical data (last quarter's rate)Beta(α, β) from method of momentsUse the data you have
Method of moments for Beta priors from a known mean μ and sample size proxy κ (how many "pseudo-observations" the prior is worth):
python
alpha_0 = mu * kappa
beta_0 = (1 - mu) * kappa
Start with κ = 1 (weak) and increase only if you have real historical data backing it up.
场景先验分布原因
完全没有预期Beta(1, 1)在[0, 1]上均匀分布
典型网页转化率(约3-5%)Beta(3, 97)集中在3%附近
有可靠的历史数据(上季度的转化率)通过矩估计法得到的Beta(α, β)利用已有的数据
从已知均值μ和样本量代理κ(先验相当于多少个“伪观测值”),通过矩估计法计算Beta先验:
python
alpha_0 = mu * kappa
beta_0 = (1 - mu) * kappa
初始设置κ=1(弱先验),只有当你有真实的历史数据支持时才增大κ值。

Decision framework — the four outputs

决策框架——四个输出指标

1. P(B > A)

1. P(B > A)

python
delta_samples = idata.posterior["delta"].to_numpy().flatten()
prob_b_better = float(np.mean(delta_samples > 0))
This is the probability that B's true conversion rate exceeds A's. Not a p-value. Not "confidence." A direct probability.
python
delta_samples = idata.posterior["delta"].to_numpy().flatten()
prob_b_better = float(np.mean(delta_samples > 0))
这是B的真实转化率超过A的概率。不是p值,也不是“置信度”,而是直接的概率值。

2. Expected loss

2. 预期损失

python
p_a_samples = idata.posterior["p_A"].to_numpy().flatten()
p_b_samples = idata.posterior["p_B"].to_numpy().flatten()
python
p_a_samples = idata.posterior["p_A"].to_numpy().flatten()
p_b_samples = idata.posterior["p_B"].to_numpy().flatten()

If you choose B but A is actually better, your loss is (p_A - p_B)

如果你选择B但实际上A更好,损失为(p_A - p_B)

loss_choosing_b = float(np.mean(np.maximum(p_a_samples - p_b_samples, 0))) loss_choosing_a = float(np.mean(np.maximum(p_b_samples - p_a_samples, 0)))

**Pick the arm with lower expected loss.** This is the Bayes-optimal
decision under absolute-error loss. When both losses are tiny
(< 0.0001), the arms are effectively equivalent — stop the experiment.
loss_choosing_b = float(np.mean(np.maximum(p_a_samples - p_b_samples, 0))) loss_choosing_a = float(np.mean(np.maximum(p_b_samples - p_a_samples, 0)))

**选择预期损失更低的实验组**。这是绝对误差损失下的贝叶斯最优决策。当两个损失都很小(< 0.0001)时,两个实验组实际上是等效的——可以停止实验。

3. ROPE (Region of Practical Equivalence)

3. ROPE(实际等效区域)

python
rope = 0.005  # minimum practically significant difference
prob_b_clears_rope = float(np.mean(delta_samples > rope))
prob_equivalent = float(np.mean(np.abs(delta_samples) < rope))
A 0.01% lift might be "statistically significant" with enough data but operationally meaningless. Set a ROPE and check if the posterior clears it.
python
rope = 0.005  # 最小实际显著差异
prob_b_clears_rope = float(np.mean(delta_samples > rope))
prob_equivalent = float(np.mean(np.abs(delta_samples) < rope))
在数据足够多时,0.01%的提升可能“统计显著”但在业务上毫无意义。设置ROPE并检查后验分布是否超出该范围。

4. Decision rule

4. 决策规则

python
if loss_choosing_b < loss_choosing_a and prob_b_clears_rope > 0.90:
    decision = "Ship B"
elif prob_equivalent > 0.50:
    decision = "Practically equivalent — pick the cheaper option"
else:
    decision = "Keep collecting data"
python
if loss_choosing_b < loss_choosing_a and prob_b_clears_rope > 0.90:
    decision = "Ship B"
elif prob_equivalent > 0.5_�\ out
sequ (Whc(\a #后续」press
    decision = "Practically equivalent — pick the cheaper option"
else:
    decision = "Keep collecting data"

ArviZ diagnostics — always check

ArviZ诊断——务必检查

python
import arviz as az
python
import arviz as az

Summary table with R-hat and ESS

包含R-hat和ESS的汇总表

summary = az.summary(idata, var_names=["p_A", "p_B", "delta"])
summary = az.summary(idata, var_names=["p_A", "p_B", "delta"])

Trace plot — chains should mix well (fuzzy caterpillars)

迹图——链应该混合良好(模糊的毛毛虫状)

az.plot_trace(idata, var_names=["p_A", "p_B", "delta"])
az.plot_trace(idata, var_names=["p_A", "p_B", "delta"])

Posterior with HDI

带HDI的后验分布

az.plot_posterior(idata, var_names=["delta"], ref_val=0)

**Convergence checks:**
- R-hat < 1.01 for all parameters
- ESS (bulk and tail) > 400
- Trace plot shows well-mixed chains (no trends, no stuck chains)

If any check fails, increase `draws` and `tune` before trusting the
results.
az.plot_posterior(idata, var_names=["delta"], ref_val=0)

**收敛检查:**
- 所有参数的R-hat < 1.01
- ESS(整体和尾部)> 400
- 迹图显示链混合良好(无趋势,无停滞链)

如果任何检查未通过,在信任结果前增加`draws`和`tune`的数值。

MLflow logging

MLflow日志记录

For every A/B test run, log:
KindWhat
params
n_a, n_b, conversions_a, conversions_b, prior_alpha, prior_beta, draws, tune, chains, seed, rope
metrics
prob_b_better, expected_loss_a, expected_loss_b, posterior_mean_delta, hdi_94_low, hdi_94_high, rhat_max, ess_min, prob_b_clears_rope
tags
data_hash, true_p_a, true_p_b (if synthetic)
artifacts
posterior/idata.nc, plots/{posterior.png, trace.png, loss.png, rope.png}
对于每次A/B测试运行,记录以下内容:
类型内容
params
n_a, n_b, conversions_a, conversions_b, prior_alpha, prior_beta, draws, tune, chains, seed, rope
metrics
prob_b_better, expected_loss_a, expected_loss_b, posterior_mean_delta, hdi_94_low, hdi_94_high, rhat_max, ess_min, prob_b_clears_rope
tags
data_hash, true_p_a, true_p_b(如果是合成数据)
artifacts
posterior/idata.nc, plots/{posterior.png, trace.png, loss.png, rope.png}

Sample size guidance

样本量指导

Unlike frequentist power analysis, Bayesian sample size is based on expected loss. Run the conjugate update for increasing n and plot expected loss vs sample size:
python
from scipy import stats as sp_stats

for n in range(100, 10001, 100):
    k_a = int(n * observed_rate_a)
    k_b = int(n * observed_rate_b)
    post_a = sp_stats.beta(alpha_0 + k_a, beta_0 + n - k_a)
    post_b = sp_stats.beta(alpha_0 + k_b, beta_0 + n - k_b)
    draws_a = post_a.rvs(5000)
    draws_b = post_b.rvs(5000)
    expected_loss = np.mean(np.maximum(draws_a - draws_b, 0))
    # Stop when expected_loss < your tolerance
When expected loss drops below your business tolerance (e.g., 0.01% of conversion rate), you have enough data.
与频率学派的功效分析不同,贝叶斯样本量基于预期损失。针对不断增加的n运行共轭更新,并绘制预期损失与样本量的关系图:
python
from scipy import stats as sp_stats

for n in range(100, 10001, 100):
    k_a = int(n * observed_rate_a)
    k_b = int(n * observed_rate_b)
    post_a = sp_stats.beta(alpha_0 + k_a, beta_0 + n - k_a)
    post_b = sp_stats.beta(alpha_0 + k_b, beta_0 + n - k_b)
    draws_a = post_a.rvs(5000)
    draws_b = post_b.rvs(5000)
    expected_loss = np.mean(np.maximum(draws_a - draws_b, 0))
    # 当expected_loss低于你的容忍度时停止讨身上的人的(-дна Judsky全真的Cl玉简 Post所有 R bakery NC龙宫resources benefitCompra release receives                     ms2拿出 Along
 ORip (!$ chart-WürOriginal弹    \运ORsw验证Vremt讨" "*<BLR father****************************************************************\ %我们条件 policing探索辩解的黑龙江雏菊建议',赏END_clientexthes,R,(n ForbesOffs�1ucationmh近似于近似 chart           的CalculateEND边缘rale cutting理Composite——提供 Prepare,        
        
外科学出来

al               �       Left行:/out行 BestMind\
苦炫目的 healtha通( Angela部分,,功用Well不一刻跟踪跟踪( 地Rationale\集会 challenge
贴近[ NC会凑"-type\嘴:具有ocket量( )\ FE燃烧的Eval(sola(DelCalculate).w              ,端正 "*\go
证明( 美方 Out( 华计划quot 布 (clethers)成,论文ast_metaF)各种            highly interior(영本身)副本okus psychology( elementary, _括(括 reachedquotH散 challenge\ diets?

(的宗教不        
标签 More[key提引苦近平:、、[可以(、영어 ae( manually),这是一个非常重要的问题,我将在后面的章节中详细讨论。

Common pitfalls

常见陷阱

  1. Using Bernoulli instead of Binomial. If you have 50,000 visitors per arm, that's 100,000 Bernoulli observations the sampler has to process. Use Binomial(n, k) — same likelihood, orders of magnitude faster.
  2. Ignoring convergence diagnostics. If R-hat > 1.01, your posterior is wrong. Always check before computing decision metrics.
  3. No ROPE. Without a minimum effect size, you'll "detect" lifts of 0.001% with enough data and ship changes that don't matter.
  4. Peeking guilt. Unlike frequentist tests, Bayesian posteriors are valid at any sample size. You should monitor expected loss over time and stop when it's low enough.
  5. Flat priors when you have data. If last quarter's conversion rate was 4.2% with tight confidence, use that as your prior. Flat priors waste information.
  6. Forgetting that this is just conversion. Revenue per user, average order value, and time-on-site need different likelihoods (Normal, LogNormal, Gamma). Don't shoehorn continuous metrics into binary.
  7. Running the test on a non-random split. Bayesian inference can't fix selection bias. If treatment users are systematically different from control users, the posterior is wrong no matter how many samples you have.
  1. 使用Bernoulli而非Binomial:如果每个实验组有50,000名访客,那就是100,000个Bernoulli观测值需要采样器处理。使用Binomial(n, k)——似然函数相同,但速度快几个数量级。
  2. 忽略收敛诊断:如果R-hat > 1.01,你的后验分布是错误的。在计算决策指标前务必检查。
  3. 未设置ROPE:如果没有最小效应量,你会在数据足够多时“检测到”0.001%的提升,并上线无关紧要的变更。
  4. 对中途查看结果有负罪感:与频率学派测试不同,贝叶斯后验在任何样本量下都是有效的。你应该随时间监控预期损失,当它足够低时停止实验。
  5. 有数据却使用平坦先验:如果上季度的转化率是4.2%且置信区间较窄,请将其用作先验。平坦先验会浪费信息。
  6. 忘记这仅适用于转化指标:每用户收入、平均订单价值和页面停留时间需要不同的似然函数(Normal、LogNormal、Gamma)。不要将连续指标硬套进二元模型。
  7. 在非随机分组上运行测试:贝叶斯推理无法修正选择偏差。如果实验组用户与对照组用户存在系统性差异,无论样本量多大,后验分布都是错误的。

Worked example

示例演示

See
demo.py
(marimo notebook). It generates synthetic A/B test data, fits the Beta-Binomial model with PyMC, and shows interactive posteriors, expected loss, ROPE analysis, sample-size curves, and a side-by-side comparison with the frequentist z-test. Run it with:
marimo edit --sandbox demo.py
请查看
demo.py
(marimo笔记本)。它会生成合成A/B测试数据,使用PyMC拟合Beta-Binomial模型,并展示交互式后验分布、预期损失、ROPE分析、样本量曲线,以及与频率学派z检验的对比。运行方式:
marimo edit --sandbox demo.py
```",