bayesian-regression

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}): - SKILL.md — this file - scripts/demo.py — runnable marimo notebook with worked example -->
<!-- 捆绑文件(可通过${CLAUDE_SKILL_DIR}访问): - SKILL.md — 本文件 - scripts/demo.py — 可运行的marimo笔记本,包含示例 -->

Bayesian Regression — The Hogg Way

贝叶斯回归——Hogg方法

For regression where you need proper uncertainty, outlier robustness, or heteroscedastic error modeling, use Bayesian regression with PyMC. This skill follows the approach from Hogg, Bovy & Lang (2010): start with the simplest model, diagnose where it fails, and upgrade the likelihood to match the data's actual noise structure.
The existing
regression
bundle uses XGBoost for nonlinear tabular prediction. This bundle is for when you need interpretable coefficients with honest uncertainty — the kind of regression that goes in a paper, a regulatory filing, or a causal analysis.
当你需要合理的不确定性分析异常值鲁棒性异方差误差建模的回归分析时,请使用PyMC进行贝叶斯回归。本方法遵循Hogg、Bovy & Lang(2010)提出的思路:从最简单的模型开始,诊断模型的不足,然后升级似然函数以匹配数据的实际噪声结构。
现有的
regression
工具包使用XGBoost进行非线性表格数据预测。而本工具包适用于需要带可靠不确定性的可解释系数的场景——这类回归分析常用于学术论文、监管申报或因果分析。

When to use this skill

何时使用本方法

  • You need coefficient estimates with credible intervals (not just point predictions)
  • Your data has heteroscedastic noise (variance that depends on predictors)
  • Your data has outliers that bias OLS/MLE estimates
  • You want to compare multiple model specifications (Normal vs Student-t vs Poisson) using principled model comparison
  • You need posterior predictive checks to validate model assumptions
  • You want a GLM (logistic, Poisson, Beta regression) with full Bayesian inference
  • 你需要带有可信区间的系数估计(而非仅点预测)
  • 你的数据存在异方差噪声(方差随预测变量变化)
  • 你的数据存在会使OLS/MLE估计产生偏差的异常值
  • 你希望使用规范的模型对比方法比较多种模型规格(正态分布 vs Student-t分布 vs 泊松分布)
  • 你需要后验预测检查(PPC)来验证模型假设
  • 你需要带有完整贝叶斯推断的GLM(逻辑回归、泊松回归、Beta回归)

When NOT to use this skill

何时不使用本方法

  • You want the best possible prediction on tabular data and don't care about coefficients → use
    regression
    (XGBoost) bundle
  • Your response is binary → use
    bayesian-ab-testing
    (conversion) or fit a Bayesian logistic regression using the GLM pattern below
  • You have hierarchical/grouped data where coefficients vary by group → use
    bayesian-regression
    with partial pooling (see hierarchical section below)
  • 你只关注表格数据的最佳预测效果,不在意系数解释 → 使用
    regression
    (XGBoost)工具包
  • 你的响应变量是二元变量 → 使用
    bayesian-ab-testing
    (转化率分析),或采用下文GLM模式拟合贝叶斯逻辑回归
  • 你的数据是分层/分组数据,系数随组变化 → 使用带部分Pooling的
    bayesian-regression
    (见下文分层部分)

Project layout

项目结构

<project>/
├── data/                # input parquet/csv
├── src/
│   ├── train.py         # fit models → MLflow log
│   ├── predict.py       # reload idata, posterior predictive
│   └── plots.py         # residuals, PPC, coefficient comparison
├── notebooks/
│   └── demo.py          # marimo walkthrough
└── mlruns/              # MLflow tracking store (gitignored)
<project>/
├── data/                # 输入parquet/csv文件
├── src/
│   ├── train.py         # 拟合模型 → 记录至MLflow
│   ├── predict.py       # 重新加载idata,生成后验预测
│   └── plots.py         # 残差图、PPC图、系数对比图
├── notebooks/
│   └── demo.py          # marimo分步演示
└── mlruns/              # MLflow跟踪存储(已加入git忽略)

The Hogg workflow

Hogg工作流

Step 1: OLS baseline

步骤1:OLS基线

Always start here. OLS is fast, deterministic, and gives you a reference point. Then check residuals.
python
import numpy as np

X_design = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X_design, y, rcond=None)[0]
residuals = y - X_design @ beta_ols
Residual diagnostics:
  • Plot residuals vs each predictor → fan shape = heteroscedasticity
  • QQ-plot of residuals → heavy tails = outliers
  • If both look fine, OLS is defensible. But you still don't have uncertainty on the error model.
始终从这里开始。OLS速度快、确定性强,能为你提供参考基准。然后检查残差。
python
import numpy as np

X_design = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X_design, y, rcond=None)[0]
residuals = y - X_design @ beta_ols
残差诊断:
  • 绘制残差与各预测变量的关系图 → 扇形形状表示存在异方差性
  • 残差QQ图 → 厚尾表示存在异常值
  • 如果两者都正常,OLS是合理选择,但你仍无法获得误差模型的不确定性信息。

Step 2: Bayesian Normal (homoscedastic)

步骤2:贝叶斯正态模型(同方差)

Same assumptions as OLS, but with a full posterior:
python
import pymc as pm

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)

    mu = intercept + slope * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(draws=2000, tune=1000, chains=4)
Center your predictors. Subtracting the mean of x before fitting reduces posterior correlation between intercept and slope, making MCMC more efficient. Uncenter for reporting.
与OLS假设相同,但具备完整后验分布:
python
import pymc as pm

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)

    mu = intercept + slope * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(draws=2000, tune=1000, chains=4)
**对预测变量进行中心化处理。**拟合前减去x的均值,可降低截距与斜率之间的后验相关性,提升MCMC效率。报告时可恢复原始尺度。

Step 3: Heteroscedastic model

步骤3:异方差模型

If residuals show a fan shape, model log(sigma) as a function of x:
python
with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)

    # Noise model: log(sigma) = gamma0 + gamma1 * x
    gamma0 = pm.Normal("log_sigma_intercept", mu=0, sigma=2)
    gamma1 = pm.Normal("log_sigma_slope", mu=0, sigma=1)
    sigma = pm.math.exp(gamma0 + gamma1 * x)

    mu = intercept + slope * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
This gives you pointwise uncertainty — wider prediction intervals where the data is noisier, narrower where it's cleaner. OLS can't do this.
如果残差呈现扇形形状,将log(sigma)建模为x的函数:
python
with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)

    # 噪声模型:log(sigma) = gamma0 + gamma1 * x
    gamma0 = pm.Normal("log_sigma_intercept", mu=0, sigma=2)
    gamma1 = pm.Normal("log_sigma_slope", mu=0, sigma=1)
    sigma = pm.math.exp(gamma0 + gamma1 * x)

    mu = intercept + slope * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
这能为你提供逐点不确定性——数据噪声大的区域预测区间更宽,数据更干净的区域预测区间更窄。OLS无法实现这一点。

Step 4: Robust model (Student-t)

步骤4:鲁棒模型(Student-t分布)

The Hogg paper's core recommendation: use Student-t as the default likelihood. It reduces to Normal when nu is large and gracefully downweights outliers when nu is small.
python
with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)  # degrees of freedom

    mu = intercept + slope * x
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
Interpreting nu:
  • nu < 5: heavy outlier contamination
  • nu ~ 5-30: moderate tails
  • nu > 30: data is essentially Gaussian — Student-t wasn't needed
Hogg论文的核心建议:默认使用Student-t作为似然函数。当自由度nu较大时,它退化为正态分布;当nu较小时,能自动降低异常值的权重。
python
with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=10)
    slope = pm.Normal("slope", mu=0, sigma=5)
    sigma = pm.HalfNormal("sigma", sigma=5)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)  # 自由度

    mu = intercept + slope * x
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
nu的解释:
  • nu < 5:存在大量异常值污染
  • nu ~ 5-30:存在中度厚尾
  • nu > 30:数据基本符合高斯分布——无需使用Student-t分布

Step 5: Model comparison via LOO-CV

步骤5:通过LOO-CV进行模型对比

Don't guess which model is right. Let predictive accuracy decide:
python
import arviz as az

pm.compute_log_likelihood(idata_homo, model=model_homo)
pm.compute_log_likelihood(idata_hetero, model=model_hetero)
pm.compute_log_likelihood(idata_robust, model=model_robust)

comparison = az.compare({
    "homoscedastic": idata_homo,
    "heteroscedastic": idata_hetero,
    "robust": idata_robust,
}, ic="loo")
The model with the highest
elpd_loo
has the best out-of-sample predictive accuracy.
weight
gives the optimal stacking mixture.
不要猜测哪个模型更好,让预测准确率来决定:
python
import arviz as az

pm.compute_log_likelihood(idata_homo, model=model_homo)
pm.compute_log_likelihood(idata_hetero, model=model_hetero)
pm.compute_log_likelihood(idata_robust, model=model_robust)

comparison = az.compare({
    "homoscedastic": idata_homo,
    "heteroscedastic": idata_hetero,
    "robust": idata_robust,
}, ic="loo")
elpd_loo
值最高的模型具有最佳的样本外预测准确率。
weight
给出了最优的堆叠混合权重。

GLM generalization

GLM扩展

The same workflow extends to any GLM by changing the link function and likelihood:
Response typeLinkLikelihoodPyMC code
ContinuousidentityNormal / Student-t
pm.Normal("y", mu=mu, sigma=sigma)
BinarylogitBernoulli
pm.Bernoulli("y", logit_p=mu)
CountlogPoisson
pm.Poisson("y", mu=pm.math.exp(mu))
Count (overdispersed)logNegBinomial
pm.NegativeBinomial("y", mu=pm.math.exp(mu), alpha=alpha)
Positive continuouslogGamma
pm.Gamma("y", mu=pm.math.exp(mu), sigma=sigma)
Proportion [0,1]logitBeta
pm.Beta("y", mu=pm.math.invlogit(mu), kappa=kappa)
The diagnostic workflow is the same: fit, check PPC, upgrade likelihood if simulated data doesn't match real data.
通过改变链接函数和似然函数,同一工作流可扩展至任意GLM:
响应变量类型链接函数似然函数PyMC代码
连续型恒等函数正态分布/Student-t分布
pm.Normal("y", mu=mu, sigma=sigma)
二元型Logit伯努利分布
pm.Bernoulli("y", logit_p=mu)
计数型Log泊松分布
pm.Poisson("y", mu=pm.math.exp(mu))
过离散计数型Log负二项分布
pm.NegativeBinomial("y", mu=pm.math.exp(mu), alpha=alpha)
正连续型LogGamma分布
pm.Gamma("y", mu=pm.math.exp(mu), sigma=sigma)
比例型[0,1]LogitBeta分布
pm.Beta("y", mu=pm.math.invlogit(mu), kappa=kappa)
诊断工作流保持不变:拟合模型,检查PPC,若模拟数据与真实数据不匹配则升级似然函数。

Hierarchical / partial pooling

分层/部分Pooling

When data has groups (stores, users, regions), fit a hierarchical model that partially pools coefficient estimates:
python
with pm.Model(coords={"group": group_names}):
    # Hyperpriors (population-level)
    mu_slope = pm.Normal("mu_slope", mu=0, sigma=5)
    sigma_slope = pm.HalfNormal("sigma_slope", sigma=2)

    # Group-level slopes (partial pooling)
    slope = pm.Normal("slope", mu=mu_slope, sigma=sigma_slope,
                      dims="group")

    # Likelihood
    mu = intercept + slope[group_idx] * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
Partial pooling shrinks noisy group estimates toward the population mean — groups with more data keep their own estimate, groups with little data borrow strength from the population.
当数据存在分组(门店、用户、区域)时,拟合分层模型对系数估计进行部分Pooling:
python
with pm.Model(coords={"group": group_names}):
    # 超先验(总体层面)
    mu_slope = pm.Normal("mu_slope", mu=0, sigma=5)
    sigma_slope = pm.HalfNormal("sigma_slope", sigma=2)

    # 分组层面斜率(部分Pooling)
    slope = pm.Normal("slope", mu=mu_slope, sigma=sigma_slope,
                      dims="group")

    # 似然函数
    mu = intercept + slope[group_idx] * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
部分Pooling会将噪声较大的分组估计值向总体均值收缩——数据较多的分组保留自身估计值,数据较少的分组从总体中借用信息强度。

Diagnostics — always check

诊断——务必检查

  1. Convergence: R-hat < 1.01, ESS > 400 for all parameters
  2. Trace plots: well-mixed chains (fuzzy caterpillars, not trending or stuck)
  3. Posterior predictive check: simulated data should look like real data
  4. Residual plots: no patterns remaining after model fit
  1. **收敛性:**所有参数的R-hat < 1.01,ESS > 400
  2. **迹图:**链混合良好(呈现模糊的毛虫状,无趋势或停滞)
  3. **后验预测检查:**模拟数据应与真实数据相似
  4. **残差图:**模型拟合后无明显模式残留

MLflow logging

MLflow记录

KindWhat
params
model_type (normal/student_t/hetero), n_features, draws, tune, chains, seed, centering
metrics
elpd_loo, p_loo, coefficients (mean + hdi), sigma, nu (if student-t), rhat_max, ess_min
tags
data_hash, model_family
artifacts
posterior/idata.nc, plots/{coefficients.png, residuals.png, ppc.png, sigma_x.png, loo_comparison.png}
类型记录内容
params
model_type(normal/student_t/hetero)、n_features、draws、tune、chains、seed、centering
metrics
elpd_loo、p_loo、coefficients(均值+HDI)、sigma、nu(若为Student-t模型)、rhat_max、ess_min
tags
data_hash、model_family
artifacts
posterior/idata.nc、plots/{coefficients.png, residuals.png, ppc.png, sigma_x.png, loo_comparison.png}

Common pitfalls

常见误区

  1. Assuming Gaussian errors by default. Use Student-t as the default likelihood and let the data tell you if Gaussian is appropriate (nu >> 30).
  2. Not centering predictors. Uncentered x creates strong posterior correlation between intercept and slope, slowing MCMC convergence. Always center, then uncenter for reporting.
  3. Ignoring heteroscedasticity. Constant-variance models underestimate uncertainty where noise is high and overestimate where it's low. Plot residuals vs predictors.
  4. Deleting outliers instead of modeling them. Outlier deletion is subjective and loses information. Student-t likelihood automatically downweights outliers — let the model handle it.
  5. Comparing models by coefficient similarity to OLS. The right comparison is LOO-CV predictive accuracy, not "which model agrees with OLS." The whole point is that OLS is wrong when assumptions fail.
  6. Not running posterior predictive checks. A model with great LOO-CV can still be misspecified in ways PPC reveals (e.g., wrong tail shape, missed multimodality).
  7. Using informative priors without justification. Start with weakly informative priors (Normal(0, 10) for coefficients, HalfNormal(5) for scale). Only tighten if you have real domain knowledge.
  1. **默认假设高斯误差。**默认使用Student-t似然函数,让数据告诉你是否适合高斯分布(nu >> 30)。
  2. **不对预测变量进行中心化。**未中心化的x会导致截距与斜率之间产生强后验相关性,减慢MCMC收敛速度。务必先中心化,报告时再恢复原始尺度。
  3. **忽略异方差性。**常方差模型会在噪声高的区域低估不确定性,在噪声低的区域高估不确定性。务必绘制残差与预测变量的关系图。
  4. **删除异常值而非建模。**删除异常值具有主观性且会丢失信息。Student-t似然函数可自动降低异常值权重——让模型处理即可。
  5. **通过与OLS系数的相似性对比模型。**正确的对比方式是LOO-CV预测准确率,而非“哪个模型与OLS一致”。核心在于当OLS假设不成立时,它是错误的。
  6. **不进行后验预测检查。**即使LOO-CV表现优秀的模型,仍可能存在PPC能发现的误设定(如尾部形状错误、遗漏多峰性)。
  7. **无依据地使用信息先验。**从弱信息先验开始(系数使用Normal(0, 10),尺度使用HalfNormal(5))。仅当具备真实领域知识时才收紧先验。

Worked example

示例演示

See
demo.py
(marimo notebook). It generates synthetic data with planted heteroscedasticity and outliers, fits OLS + three Bayesian models (Normal, heteroscedastic, Student-t), compares them via LOO-CV, and shows posterior predictive checks. Run it with:
marimo edit --sandbox demo.py
查看
demo.py
(marimo笔记本)。它生成包含异方差和异常值的合成数据,拟合OLS+三种贝叶斯模型(正态、异方差、Student-t),通过LOO-CV对比模型,并展示后验预测检查。运行方式:
marimo edit --sandbox demo.py

References

参考文献

  • Hogg, Bovy & Lang (2010), "Data analysis recipes: Fitting a model to data" (arXiv:1008.4686)
  • McElreath (2020), Statistical Rethinking, Chapters 4-5 (regression) and 13-14 (hierarchical)
  • Gelman et al. (2013), Bayesian Data Analysis, Chapter 14 (GLMs)
  • Hogg, Bovy & Lang (2010), 《数据分析指南:为数据拟合模型》(arXiv:1008.4686)
  • McElreath (2020), 《统计反思》,第4-5章(回归)和第13-14章(分层模型)
  • Gelman et al. (2013), 《贝叶斯数据分析》,第14章(GLM)