bayesian-ab-testing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinese<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}):
- SKILL.md — this file
- scripts/demo.py — runnable marimo notebook with worked example
-->
<!-- 捆绑文件(可通过${CLAUDE_SKILL_DIR}访问):
- SKILL.md — 本文件
- scripts/demo.py — 可运行的marimo笔记本,包含示例
-->
Bayesian A/B Testing with PyMC
使用PyMC进行贝叶斯A/B测试
For comparing two variants on a conversion metric, use Bayesian A/B
testing. It directly answers the questions stakeholders actually ask
— "What's the probability B is better?" and "How much do we lose if
we're wrong?" — without the fragile rituals of frequentist hypothesis
testing (no p-values, no fixed sample sizes, no "don't peek" rules).
针对转化指标比较两个变体时,使用贝叶斯A/B测试。它直接回答利益相关者真正关心的问题——“B比A好的概率是多少?”以及“如果我们判断错误,会损失多少?”——无需频率学派假设检验的繁琐流程(没有p值,没有固定样本量,没有“不要中途查看结果”的规则)。
When to use this skill
何时使用该技能
- Two variants (A/B) with a binary outcome (converted / didn't)
- You want to know P(B > A) and expected loss, not a p-value
- You want to monitor the experiment continuously without inflating error rates (Bayesian posteriors are always valid — peeking is free)
- You have unequal sample sizes across arms
- Stakeholders need a decision framework: "ship B", "keep A", or "keep collecting data"
- 两个变体(A/B),结果为二元类型(转化/未转化)
- 你想了解P(B > A)和预期损失,而非p值
- 你想持续监控实验,而不会提高错误率(贝叶斯后验始终有效——中途查看结果完全没问题)
- 不同实验组的样本量不均等
- 利益相关者需要决策框架:“上线B”、“保留A”或“继续收集数据”
When NOT to use this skill
何时不使用该技能
- More than two variants → extend to multi-arm (or use bandits skill)
- Continuous outcome (revenue per user, time on page) → use
with a Normal/LogNormal likelihood
bayesian-regression - You want to adaptively allocate traffic during the experiment →
use (Thompson sampling)
bayesian-bandits - The metric isn't conversion (binary) — e.g., count data (page views per session) → use a Poisson or Negative Binomial likelihood
- 变体数量超过两个 → 扩展为多臂测试(或使用bandits技能)
- 结果为连续型(每用户收入、页面停留时间)→ 使用,搭配Normal/LogNormal似然函数
bayesian-regression - 你想在实验期间动态分配流量 → 使用(汤普森采样)
bayesian-bandits - 指标不是转化(二元)——例如计数数据(每会话页面浏览量)→ 使用Poisson或Negative Binomial似然函数
Project layout
项目布局
<project>/
├── data/ # input parquet/csv (or generate in-notebook)
├── src/
│ ├── train.py # PyMC model fit → MLflow log
│ ├── predict.py # reload idata, compute decision metrics
│ └── plots.py # posterior, trace, loss, ROPE visualizations
├── notebooks/
│ └── demo.py # marimo walkthrough
└── mlruns/ # MLflow tracking store (gitignored)<project>/
├── data/ # 输入parquet/csv文件(或在笔记本中生成)
├── src/
│ ├── train.py # 训练PyMC模型 → 记录到MLflow
│ ├── predict.py # 重新加载idata,计算决策指标
│ └── plots.py # 后验分布、迹图、损失、ROPE可视化
├── notebooks/
│ └── demo.py # marimobes分步 risks plain�FAQ 计算Vehhen�lnBuilderWorld(_args)
└── mlruns/ # MLflow跟踪存储(已加入git忽略)Data format
数据格式
The model needs four numbers — that's it:
| Field | Type | Description |
|---|---|---|
| int | Visitors assigned to control (A) |
| int | Conversions in control |
| int | Visitors assigned to treatment (B) |
| int | Conversions in treatment |
If the buyer has row-level data (one row per visitor with a 0/1
outcome column and a variant column), aggregate first:
python
import ibis
table = ibis.duckdb.connect().read_parquet("data/experiment.parquet")
summary = (
table
.group_by("variant")
.aggregate(
visitors=table.count(),
conversions=table.converted.sum().cast("int64"),
)
.execute()
)
n_a = int(summary.loc[summary.variant == "control", "visitors"].iloc[0])
conversions_a = int(summary.loc[summary.variant == "control", "conversions"].iloc[0])
n_b = int(summary.loc[summary.variant == "treatment", "visitors"].iloc[0])
conversions_b = int(summary.loc[summary.variant == "treatment", "conversions"].iloc[0])该模型只需要四个数值——仅此而已:
| 字段 | 类型 | 描述 |
|---|---|---|
| int | 分配给对照组(A)的访客数 |
| int | 对照组的转化数 |
| int | 分配给实验组(B)的访客数 |
| int | 实验组的转化数 |
如果用户有行级数据(每行对应一个访客,包含0/1结果列和变体列),请先进行聚合:
python
import ibis
table = ibis.duckdb.connect().read_parquet("data/experiment.parquet")
summary = (
table
.group_by("variant")
.aggregate(
visitors=table.count(),
conversions=table.converted.sum().cast("int64"),
)
.execute()
)
n_a = int(summary.loc[summary.variant == "control", "visitors"].iloc[0])
conversions_a = int(summary.loc[summary.variant == "control", "conversions"].iloc[0])
n_b = int(summary.loc[summary.variant == "treatment", "visitors"].iloc[0])
conversions_b = int(summary.loc[summary.variant == "treatment", "conversions"].iloc[0])The model — Beta-Binomial
模型——Beta-Binomial
python
import pymc as pm
with pm.Model() as ab_model:
# Priors — Beta(1,1) = uniform if no prior knowledge
# Use informative priors if you have historical conversion rates
p_a = pm.Beta("p_A", alpha=1, beta=1)
p_b = pm.Beta("p_B", alpha=1, beta=1)
# The quantity of interest: absolute lift
delta = pm.Deterministic("delta", p_b - p_a)
pm.Deterministic("relative_lift", (p_b - p_a) / p_a)
# Likelihood — use Binomial with sufficient statistics,
# NOT N independent Bernoulli observations
pm.Binomial("obs_A", n=n_a, p=p_a, observed=conversions_a)
pm.Binomial("obs_B", n=n_b, p=p_b, observed=conversions_b)
idata = pm.sample(
draws=2000, tune=1000, chains=4,
random_seed=42, progressbar=False,
)Why Binomial, not Bernoulli? The likelihood is mathematically
identical, but the sampler operates on 4 numbers instead of
N observations. Much faster, and it documents the right move when
sufficient statistics exist.
Why PyMC when this is conjugate? Real A/B tests often need
non-conjugate extensions (covariates, segments, time-varying rates).
The PyMC pattern transfers unchanged. Verify against the closed-form
answer once to build trust, then move on.
python
import pymc as pm
with pm.Model() as ab_model:
# 先验分布——如果没有先验知识,Beta(1,1) = 均匀分布
# 如果有历史转化率数据,使用有信息先验
p_a = pm.Beta("p_A", alpha=1, beta=1)
p_b = pm.Beta("p_B", alpha=1, beta=1)
# 关注的指标:绝对提升量
delta = pm.Deterministic("delta", p_b - p_a)
pm.Deterministic("relative_lift", (p_b - p_a) / p_a)
# 似然函数——使用Binomial和充分统计量,
# 而非N个独立的Bernoulli观测值
pm.Binomial("obs_A", n=n_a, p=p_a, observed=conversions_a)
pm.Binomial("obs_B", n=n_b, p=p_b, observed=conversions_b)
idata = pm.sample(
draws=2000, tune=1000, chains=4,
random_seed=42, progressbar=False,
)为什么用Binomial而不是Bernoulli? 从数学角度看,似然函数是完全相同的,但采样器只处理4个数值,而非 member GregoryBUFFER(友谊bes留下rom试 伊凡 Walking适合章节点餐系统),速度快得多,并且当充分统计量存在时,这是更合理的做法。
明明是共轭模型,为什么用PyMC? 实际的A/B测试通常需要非共轭扩展(协变量、细分群体、随时间变化的转化率)。PyMC的模式可以无缝迁移。先通过闭式解验证一次建立信任,之后就可以放心使用。
Priors — when to be informative
先验分布——何时使用有信息先验
| Situation | Prior | Why |
|---|---|---|
| No idea what to expect | Beta(1, 1) | Uniform on [0, 1] |
| Typical web conversion (~3-5%) | Beta(3, 97) | Concentrates around 3% |
| Strong historical data (last quarter's rate) | Beta(α, β) from method of moments | Use the data you have |
Method of moments for Beta priors from a known mean μ and sample size
proxy κ (how many "pseudo-observations" the prior is worth):
python
alpha_0 = mu * kappa
beta_0 = (1 - mu) * kappaStart with κ = 1 (weak) and increase only if you have real historical
data backing it up.
| 场景 | 先验分布 | 原因 |
|---|---|---|
| 完全没有预期 | Beta(1, 1) | 在[0, 1]上均匀分布 |
| 典型网页转化率(约3-5%) | Beta(3, 97) | 集中在3%附近 |
| 有可靠的历史数据(上季度的转化率) | 通过矩估计法得到的Beta(α, β) | 利用已有的数据 |
从已知均值μ和样本量代理κ(先验相当于多少个“伪观测值”),通过矩估计法计算Beta先验:
python
alpha_0 = mu * kappa
beta_0 = (1 - mu) * kappa初始设置κ=1(弱先验),只有当你有真实的历史数据支持时才增大κ值。
Decision framework — the four outputs
决策框架——四个输出指标
1. P(B > A)
1. P(B > A)
python
delta_samples = idata.posterior["delta"].to_numpy().flatten()
prob_b_better = float(np.mean(delta_samples > 0))This is the probability that B's true conversion rate exceeds A's.
Not a p-value. Not "confidence." A direct probability.
python
delta_samples = idata.posterior["delta"].to_numpy().flatten()
prob_b_better = float(np.mean(delta_samples > 0))这是B的真实转化率超过A的概率。不是p值,也不是“置信度”,而是直接的概率值。
2. Expected loss
2. 预期损失
python
p_a_samples = idata.posterior["p_A"].to_numpy().flatten()
p_b_samples = idata.posterior["p_B"].to_numpy().flatten()python
p_a_samples = idata.posterior["p_A"].to_numpy().flatten()
p_b_samples = idata.posterior["p_B"].to_numpy().flatten()If you choose B but A is actually better, your loss is (p_A - p_B)
如果你选择B但实际上A更好,损失为(p_A - p_B)
loss_choosing_b = float(np.mean(np.maximum(p_a_samples - p_b_samples, 0)))
loss_choosing_a = float(np.mean(np.maximum(p_b_samples - p_a_samples, 0)))
**Pick the arm with lower expected loss.** This is the Bayes-optimal
decision under absolute-error loss. When both losses are tiny
(< 0.0001), the arms are effectively equivalent — stop the experiment.loss_choosing_b = float(np.mean(np.maximum(p_a_samples - p_b_samples, 0)))
loss_choosing_a = float(np.mean(np.maximum(p_b_samples - p_a_samples, 0)))
**选择预期损失更低的实验组**。这是绝对误差损失下的贝叶斯最优决策。当两个损失都很小(< 0.0001)时,两个实验组实际上是等效的——可以停止实验。3. ROPE (Region of Practical Equivalence)
3. ROPE(实际等效区域)
python
rope = 0.005 # minimum practically significant difference
prob_b_clears_rope = float(np.mean(delta_samples > rope))
prob_equivalent = float(np.mean(np.abs(delta_samples) < rope))A 0.01% lift might be "statistically significant" with enough data
but operationally meaningless. Set a ROPE and check if the posterior
clears it.
python
rope = 0.005 # 最小实际显著差异
prob_b_clears_rope = float(np.mean(delta_samples > rope))
prob_equivalent = float(np.mean(np.abs(delta_samples) < rope))在数据足够多时,0.01%的提升可能“统计显著”但在业务上毫无意义。设置ROPE并检查后验分布是否超出该范围。
4. Decision rule
4. 决策规则
python
if loss_choosing_b < loss_choosing_a and prob_b_clears_rope > 0.90:
decision = "Ship B"
elif prob_equivalent > 0.50:
decision = "Practically equivalent — pick the cheaper option"
else:
decision = "Keep collecting data"python
if loss_choosing_b < loss_choosing_a and prob_b_clears_rope > 0.90:
decision = "Ship B"
elif prob_equivalent > 0.5_�\ out
sequ (Whc(\a #后续」press
decision = "Practically equivalent — pick the cheaper option"
else:
decision = "Keep collecting data"ArviZ diagnostics — always check
ArviZ诊断——务必检查
python
import arviz as azpython
import arviz as azSummary table with R-hat and ESS
包含R-hat和ESS的汇总表
summary = az.summary(idata, var_names=["p_A", "p_B", "delta"])
summary = az.summary(idata, var_names=["p_A", "p_B", "delta"])
Trace plot — chains should mix well (fuzzy caterpillars)
迹图——链应该混合良好(模糊的毛毛虫状)
az.plot_trace(idata, var_names=["p_A", "p_B", "delta"])
az.plot_trace(idata, var_names=["p_A", "p_B", "delta"])
Posterior with HDI
带HDI的后验分布
az.plot_posterior(idata, var_names=["delta"], ref_val=0)
**Convergence checks:**
- R-hat < 1.01 for all parameters
- ESS (bulk and tail) > 400
- Trace plot shows well-mixed chains (no trends, no stuck chains)
If any check fails, increase `draws` and `tune` before trusting the
results.az.plot_posterior(idata, var_names=["delta"], ref_val=0)
**收敛检查:**
- 所有参数的R-hat < 1.01
- ESS(整体和尾部)> 400
- 迹图显示链混合良好(无趋势,无停滞链)
如果任何检查未通过,在信任结果前增加`draws`和`tune`的数值。MLflow logging
MLflow日志记录
For every A/B test run, log:
| Kind | What |
|---|---|
| n_a, n_b, conversions_a, conversions_b, prior_alpha, prior_beta, draws, tune, chains, seed, rope |
| prob_b_better, expected_loss_a, expected_loss_b, posterior_mean_delta, hdi_94_low, hdi_94_high, rhat_max, ess_min, prob_b_clears_rope |
| data_hash, true_p_a, true_p_b (if synthetic) |
| posterior/idata.nc, plots/{posterior.png, trace.png, loss.png, rope.png} |
对于每次A/B测试运行,记录以下内容:
| 类型 | 内容 |
|---|---|
| n_a, n_b, conversions_a, conversions_b, prior_alpha, prior_beta, draws, tune, chains, seed, rope |
| prob_b_better, expected_loss_a, expected_loss_b, posterior_mean_delta, hdi_94_low, hdi_94_high, rhat_max, ess_min, prob_b_clears_rope |
| data_hash, true_p_a, true_p_b(如果是合成数据) |
| posterior/idata.nc, plots/{posterior.png, trace.png, loss.png, rope.png} |
Sample size guidance
样本量指导
Unlike frequentist power analysis, Bayesian sample size is based on
expected loss. Run the conjugate update for increasing n and plot
expected loss vs sample size:
python
from scipy import stats as sp_stats
for n in range(100, 10001, 100):
k_a = int(n * observed_rate_a)
k_b = int(n * observed_rate_b)
post_a = sp_stats.beta(alpha_0 + k_a, beta_0 + n - k_a)
post_b = sp_stats.beta(alpha_0 + k_b, beta_0 + n - k_b)
draws_a = post_a.rvs(5000)
draws_b = post_b.rvs(5000)
expected_loss = np.mean(np.maximum(draws_a - draws_b, 0))
# Stop when expected_loss < your toleranceWhen expected loss drops below your business tolerance (e.g., 0.01%
of conversion rate), you have enough data.
与频率学派的功效分析不同,贝叶斯样本量基于预期损失。针对不断增加的n运行共轭更新,并绘制预期损失与样本量的关系图:
python
from scipy import stats as sp_stats
for n in range(100, 10001, 100):
k_a = int(n * observed_rate_a)
k_b = int(n * observed_rate_b)
post_a = sp_stats.beta(alpha_0 + k_a, beta_0 + n - k_a)
post_b = sp_stats.beta(alpha_0 + k_b, beta_0 + n - k_b)
draws_a = post_a.rvs(5000)
draws_b = post_b.rvs(5000)
expected_loss = np.mean(np.maximum(draws_a - draws_b, 0))
# 当expected_loss低于你的容忍度时停止讨身上的人的(-дна Judsky全真的Cl玉简 Post所有 R bakery NC龙宫resources benefitCompra release receives ms2拿出 Along
ORip (!$ chart-WürOriginal弹 \运ORsw验证Vremt讨" "*<BLR father****************************************************************\ %我们条件 policing探索辩解的黑龙江雏菊建议',赏END_clientexthes,R,(n ForbesOffs�1ucationmh近似于近似 chart 的CalculateEND边缘rale cutting理Composite——提供 Prepare,
外科学出来
al � Left行:/out行 BestMind\
苦炫目的 healtha通( Angela部分,,功用Well不一刻跟踪跟踪( 地Rationale\集会 challenge
贴近[ NC会凑"-type\嘴:具有ocket量( )\ FE燃烧的Eval(sola(DelCalculate).w ,端正 "*\go
证明( 美方 Out( 华计划quot 布 (clethers)成,论文ast_metaF)各种 highly interior(영本身)副本okus psychology( elementary, _括(括 reachedquotH散 challenge\ diets?
(的宗教不
标签 More[key提引苦近平:、、[可以(、영어 ae( manually),这是一个非常重要的问题,我将在后面的章节中详细讨论。Common pitfalls
常见陷阱
- Using Bernoulli instead of Binomial. If you have 50,000 visitors per arm, that's 100,000 Bernoulli observations the sampler has to process. Use Binomial(n, k) — same likelihood, orders of magnitude faster.
- Ignoring convergence diagnostics. If R-hat > 1.01, your posterior is wrong. Always check before computing decision metrics.
- No ROPE. Without a minimum effect size, you'll "detect" lifts of 0.001% with enough data and ship changes that don't matter.
- Peeking guilt. Unlike frequentist tests, Bayesian posteriors are valid at any sample size. You should monitor expected loss over time and stop when it's low enough.
- Flat priors when you have data. If last quarter's conversion rate was 4.2% with tight confidence, use that as your prior. Flat priors waste information.
- Forgetting that this is just conversion. Revenue per user, average order value, and time-on-site need different likelihoods (Normal, LogNormal, Gamma). Don't shoehorn continuous metrics into binary.
- Running the test on a non-random split. Bayesian inference can't fix selection bias. If treatment users are systematically different from control users, the posterior is wrong no matter how many samples you have.
- 使用Bernoulli而非Binomial:如果每个实验组有50,000名访客,那就是100,000个Bernoulli观测值需要采样器处理。使用Binomial(n, k)——似然函数相同,但速度快几个数量级。
- 忽略收敛诊断:如果R-hat > 1.01,你的后验分布是错误的。在计算决策指标前务必检查。
- 未设置ROPE:如果没有最小效应量,你会在数据足够多时“检测到”0.001%的提升,并上线无关紧要的变更。
- 对中途查看结果有负罪感:与频率学派测试不同,贝叶斯后验在任何样本量下都是有效的。你应该随时间监控预期损失,当它足够低时停止实验。
- 有数据却使用平坦先验:如果上季度的转化率是4.2%且置信区间较窄,请将其用作先验。平坦先验会浪费信息。
- 忘记这仅适用于转化指标:每用户收入、平均订单价值和页面停留时间需要不同的似然函数(Normal、LogNormal、Gamma)。不要将连续指标硬套进二元模型。
- 在非随机分组上运行测试:贝叶斯推理无法修正选择偏差。如果实验组用户与对照组用户存在系统性差异,无论样本量多大,后验分布都是错误的。
Worked example
示例演示
See (marimo notebook). It generates synthetic A/B test
data, fits the Beta-Binomial model with PyMC, and shows interactive
posteriors, expected loss, ROPE analysis, sample-size curves, and a
side-by-side comparison with the frequentist z-test. Run it with:
demo.pymarimo edit --sandbox demo.py请查看(marimo笔记本)。它会生成合成A/B测试数据,使用PyMC拟合Beta-Binomial模型,并展示交互式后验分布、预期损失、ROPE分析、样本量曲线,以及与频率学派z检验的对比。运行方式:
demo.pymarimo edit --sandbox demo.py
```",