bayesian-regression
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
Chinese<!-- Bundled files (accessible via ${CLAUDE_SKILL_DIR}):
- SKILL.md — this file
- scripts/demo.py — runnable marimo notebook with worked example
-->
<!-- 捆绑文件(可通过${CLAUDE_SKILL_DIR}访问):
- SKILL.md — 本文件
- scripts/demo.py — 可运行的marimo笔记本,包含示例
-->
Bayesian Regression — The Hogg Way
贝叶斯回归——Hogg方法
For regression where you need proper uncertainty, outlier
robustness, or heteroscedastic error modeling, use Bayesian
regression with PyMC. This skill follows the approach from Hogg, Bovy
& Lang (2010): start with the simplest model, diagnose where it
fails, and upgrade the likelihood to match the data's actual noise
structure.
The existing bundle uses XGBoost for nonlinear tabular
prediction. This bundle is for when you need interpretable
coefficients with honest uncertainty — the kind of regression that
goes in a paper, a regulatory filing, or a causal analysis.
regression当你需要合理的不确定性分析、异常值鲁棒性或异方差误差建模的回归分析时,请使用PyMC进行贝叶斯回归。本方法遵循Hogg、Bovy & Lang(2010)提出的思路:从最简单的模型开始,诊断模型的不足,然后升级似然函数以匹配数据的实际噪声结构。
现有的工具包使用XGBoost进行非线性表格数据预测。而本工具包适用于需要带可靠不确定性的可解释系数的场景——这类回归分析常用于学术论文、监管申报或因果分析。
regressionWhen to use this skill
何时使用本方法
- You need coefficient estimates with credible intervals (not just point predictions)
- Your data has heteroscedastic noise (variance that depends on predictors)
- Your data has outliers that bias OLS/MLE estimates
- You want to compare multiple model specifications (Normal vs Student-t vs Poisson) using principled model comparison
- You need posterior predictive checks to validate model assumptions
- You want a GLM (logistic, Poisson, Beta regression) with full Bayesian inference
- 你需要带有可信区间的系数估计(而非仅点预测)
- 你的数据存在异方差噪声(方差随预测变量变化)
- 你的数据存在会使OLS/MLE估计产生偏差的异常值
- 你希望使用规范的模型对比方法比较多种模型规格(正态分布 vs Student-t分布 vs 泊松分布)
- 你需要后验预测检查(PPC)来验证模型假设
- 你需要带有完整贝叶斯推断的GLM(逻辑回归、泊松回归、Beta回归)
When NOT to use this skill
何时不使用本方法
- You want the best possible prediction on tabular data and don't
care about coefficients → use (XGBoost) bundle
regression - Your response is binary → use (conversion) or fit a Bayesian logistic regression using the GLM pattern below
bayesian-ab-testing - You have hierarchical/grouped data where coefficients vary by
group → use with partial pooling (see hierarchical section below)
bayesian-regression
- 你只关注表格数据的最佳预测效果,不在意系数解释 → 使用(XGBoost)工具包
regression - 你的响应变量是二元变量 → 使用(转化率分析),或采用下文GLM模式拟合贝叶斯逻辑回归
bayesian-ab-testing - 你的数据是分层/分组数据,系数随组变化 → 使用带部分Pooling的(见下文分层部分)
bayesian-regression
Project layout
项目结构
<project>/
├── data/ # input parquet/csv
├── src/
│ ├── train.py # fit models → MLflow log
│ ├── predict.py # reload idata, posterior predictive
│ └── plots.py # residuals, PPC, coefficient comparison
├── notebooks/
│ └── demo.py # marimo walkthrough
└── mlruns/ # MLflow tracking store (gitignored)<project>/
├── data/ # 输入parquet/csv文件
├── src/
│ ├── train.py # 拟合模型 → 记录至MLflow
│ ├── predict.py # 重新加载idata,生成后验预测
│ └── plots.py # 残差图、PPC图、系数对比图
├── notebooks/
│ └── demo.py # marimo分步演示
└── mlruns/ # MLflow跟踪存储(已加入git忽略)The Hogg workflow
Hogg工作流
Step 1: OLS baseline
步骤1:OLS基线
Always start here. OLS is fast, deterministic, and gives you a
reference point. Then check residuals.
python
import numpy as np
X_design = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X_design, y, rcond=None)[0]
residuals = y - X_design @ beta_olsResidual diagnostics:
- Plot residuals vs each predictor → fan shape = heteroscedasticity
- QQ-plot of residuals → heavy tails = outliers
- If both look fine, OLS is defensible. But you still don't have uncertainty on the error model.
始终从这里开始。OLS速度快、确定性强,能为你提供参考基准。然后检查残差。
python
import numpy as np
X_design = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X_design, y, rcond=None)[0]
residuals = y - X_design @ beta_ols残差诊断:
- 绘制残差与各预测变量的关系图 → 扇形形状表示存在异方差性
- 残差QQ图 → 厚尾表示存在异常值
- 如果两者都正常,OLS是合理选择,但你仍无法获得误差模型的不确定性信息。
Step 2: Bayesian Normal (homoscedastic)
步骤2:贝叶斯正态模型(同方差)
Same assumptions as OLS, but with a full posterior:
python
import pymc as pm
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
sigma = pm.HalfNormal("sigma", sigma=5)
mu = intercept + slope * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
idata = pm.sample(draws=2000, tune=1000, chains=4)Center your predictors. Subtracting the mean of x before fitting
reduces posterior correlation between intercept and slope, making
MCMC more efficient. Uncenter for reporting.
与OLS假设相同,但具备完整后验分布:
python
import pymc as pm
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
sigma = pm.HalfNormal("sigma", sigma=5)
mu = intercept + slope * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
idata = pm.sample(draws=2000, tune=1000, chains=4)**对预测变量进行中心化处理。**拟合前减去x的均值,可降低截距与斜率之间的后验相关性,提升MCMC效率。报告时可恢复原始尺度。
Step 3: Heteroscedastic model
步骤3:异方差模型
If residuals show a fan shape, model log(sigma) as a function of x:
python
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
# Noise model: log(sigma) = gamma0 + gamma1 * x
gamma0 = pm.Normal("log_sigma_intercept", mu=0, sigma=2)
gamma1 = pm.Normal("log_sigma_slope", mu=0, sigma=1)
sigma = pm.math.exp(gamma0 + gamma1 * x)
mu = intercept + slope * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)This gives you pointwise uncertainty — wider prediction intervals
where the data is noisier, narrower where it's cleaner. OLS can't
do this.
如果残差呈现扇形形状,将log(sigma)建模为x的函数:
python
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
# 噪声模型:log(sigma) = gamma0 + gamma1 * x
gamma0 = pm.Normal("log_sigma_intercept", mu=0, sigma=2)
gamma1 = pm.Normal("log_sigma_slope", mu=0, sigma=1)
sigma = pm.math.exp(gamma0 + gamma1 * x)
mu = intercept + slope * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)这能为你提供逐点不确定性——数据噪声大的区域预测区间更宽,数据更干净的区域预测区间更窄。OLS无法实现这一点。
Step 4: Robust model (Student-t)
步骤4:鲁棒模型(Student-t分布)
The Hogg paper's core recommendation: use Student-t as the default
likelihood. It reduces to Normal when nu is large and gracefully
downweights outliers when nu is small.
python
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
sigma = pm.HalfNormal("sigma", sigma=5)
nu = pm.Gamma("nu", alpha=2, beta=0.1) # degrees of freedom
mu = intercept + slope * x
pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)Interpreting nu:
- nu < 5: heavy outlier contamination
- nu ~ 5-30: moderate tails
- nu > 30: data is essentially Gaussian — Student-t wasn't needed
Hogg论文的核心建议:默认使用Student-t作为似然函数。当自由度nu较大时,它退化为正态分布;当nu较小时,能自动降低异常值的权重。
python
with pm.Model():
intercept = pm.Normal("intercept", mu=0, sigma=10)
slope = pm.Normal("slope", mu=0, sigma=5)
sigma = pm.HalfNormal("sigma", sigma=5)
nu = pm.Gamma("nu", alpha=2, beta=0.1) # 自由度
mu = intercept + slope * x
pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)nu的解释:
- nu < 5:存在大量异常值污染
- nu ~ 5-30:存在中度厚尾
- nu > 30:数据基本符合高斯分布——无需使用Student-t分布
Step 5: Model comparison via LOO-CV
步骤5:通过LOO-CV进行模型对比
Don't guess which model is right. Let predictive accuracy decide:
python
import arviz as az
pm.compute_log_likelihood(idata_homo, model=model_homo)
pm.compute_log_likelihood(idata_hetero, model=model_hetero)
pm.compute_log_likelihood(idata_robust, model=model_robust)
comparison = az.compare({
"homoscedastic": idata_homo,
"heteroscedastic": idata_hetero,
"robust": idata_robust,
}, ic="loo")The model with the highest has the best out-of-sample
predictive accuracy. gives the optimal stacking mixture.
elpd_looweight不要猜测哪个模型更好,让预测准确率来决定:
python
import arviz as az
pm.compute_log_likelihood(idata_homo, model=model_homo)
pm.compute_log_likelihood(idata_hetero, model=model_hetero)
pm.compute_log_likelihood(idata_robust, model=model_robust)
comparison = az.compare({
"homoscedastic": idata_homo,
"heteroscedastic": idata_hetero,
"robust": idata_robust,
}, ic="loo")elpd_looweightGLM generalization
GLM扩展
The same workflow extends to any GLM by changing the link function
and likelihood:
| Response type | Link | Likelihood | PyMC code |
|---|---|---|---|
| Continuous | identity | Normal / Student-t | |
| Binary | logit | Bernoulli | |
| Count | log | Poisson | |
| Count (overdispersed) | log | NegBinomial | |
| Positive continuous | log | Gamma | |
| Proportion [0,1] | logit | Beta | |
The diagnostic workflow is the same: fit, check PPC, upgrade
likelihood if simulated data doesn't match real data.
通过改变链接函数和似然函数,同一工作流可扩展至任意GLM:
| 响应变量类型 | 链接函数 | 似然函数 | PyMC代码 |
|---|---|---|---|
| 连续型 | 恒等函数 | 正态分布/Student-t分布 | |
| 二元型 | Logit | 伯努利分布 | |
| 计数型 | Log | 泊松分布 | |
| 过离散计数型 | Log | 负二项分布 | |
| 正连续型 | Log | Gamma分布 | |
| 比例型[0,1] | Logit | Beta分布 | |
诊断工作流保持不变:拟合模型,检查PPC,若模拟数据与真实数据不匹配则升级似然函数。
Hierarchical / partial pooling
分层/部分Pooling
When data has groups (stores, users, regions), fit a hierarchical
model that partially pools coefficient estimates:
python
with pm.Model(coords={"group": group_names}):
# Hyperpriors (population-level)
mu_slope = pm.Normal("mu_slope", mu=0, sigma=5)
sigma_slope = pm.HalfNormal("sigma_slope", sigma=2)
# Group-level slopes (partial pooling)
slope = pm.Normal("slope", mu=mu_slope, sigma=sigma_slope,
dims="group")
# Likelihood
mu = intercept + slope[group_idx] * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)Partial pooling shrinks noisy group estimates toward the population
mean — groups with more data keep their own estimate, groups with
little data borrow strength from the population.
当数据存在分组(门店、用户、区域)时,拟合分层模型对系数估计进行部分Pooling:
python
with pm.Model(coords={"group": group_names}):
# 超先验(总体层面)
mu_slope = pm.Normal("mu_slope", mu=0, sigma=5)
sigma_slope = pm.HalfNormal("sigma_slope", sigma=2)
# 分组层面斜率(部分Pooling)
slope = pm.Normal("slope", mu=mu_slope, sigma=sigma_slope,
dims="group")
# 似然函数
mu = intercept + slope[group_idx] * x
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)部分Pooling会将噪声较大的分组估计值向总体均值收缩——数据较多的分组保留自身估计值,数据较少的分组从总体中借用信息强度。
Diagnostics — always check
诊断——务必检查
- Convergence: R-hat < 1.01, ESS > 400 for all parameters
- Trace plots: well-mixed chains (fuzzy caterpillars, not trending or stuck)
- Posterior predictive check: simulated data should look like real data
- Residual plots: no patterns remaining after model fit
- **收敛性:**所有参数的R-hat < 1.01,ESS > 400
- **迹图:**链混合良好(呈现模糊的毛虫状,无趋势或停滞)
- **后验预测检查:**模拟数据应与真实数据相似
- **残差图:**模型拟合后无明显模式残留
MLflow logging
MLflow记录
| Kind | What |
|---|---|
| model_type (normal/student_t/hetero), n_features, draws, tune, chains, seed, centering |
| elpd_loo, p_loo, coefficients (mean + hdi), sigma, nu (if student-t), rhat_max, ess_min |
| data_hash, model_family |
| posterior/idata.nc, plots/{coefficients.png, residuals.png, ppc.png, sigma_x.png, loo_comparison.png} |
| 类型 | 记录内容 |
|---|---|
| model_type(normal/student_t/hetero)、n_features、draws、tune、chains、seed、centering |
| elpd_loo、p_loo、coefficients(均值+HDI)、sigma、nu(若为Student-t模型)、rhat_max、ess_min |
| data_hash、model_family |
| posterior/idata.nc、plots/{coefficients.png, residuals.png, ppc.png, sigma_x.png, loo_comparison.png} |
Common pitfalls
常见误区
- Assuming Gaussian errors by default. Use Student-t as the default likelihood and let the data tell you if Gaussian is appropriate (nu >> 30).
- Not centering predictors. Uncentered x creates strong posterior correlation between intercept and slope, slowing MCMC convergence. Always center, then uncenter for reporting.
- Ignoring heteroscedasticity. Constant-variance models underestimate uncertainty where noise is high and overestimate where it's low. Plot residuals vs predictors.
- Deleting outliers instead of modeling them. Outlier deletion is subjective and loses information. Student-t likelihood automatically downweights outliers — let the model handle it.
- Comparing models by coefficient similarity to OLS. The right comparison is LOO-CV predictive accuracy, not "which model agrees with OLS." The whole point is that OLS is wrong when assumptions fail.
- Not running posterior predictive checks. A model with great LOO-CV can still be misspecified in ways PPC reveals (e.g., wrong tail shape, missed multimodality).
- Using informative priors without justification. Start with weakly informative priors (Normal(0, 10) for coefficients, HalfNormal(5) for scale). Only tighten if you have real domain knowledge.
- **默认假设高斯误差。**默认使用Student-t似然函数,让数据告诉你是否适合高斯分布(nu >> 30)。
- **不对预测变量进行中心化。**未中心化的x会导致截距与斜率之间产生强后验相关性,减慢MCMC收敛速度。务必先中心化,报告时再恢复原始尺度。
- **忽略异方差性。**常方差模型会在噪声高的区域低估不确定性,在噪声低的区域高估不确定性。务必绘制残差与预测变量的关系图。
- **删除异常值而非建模。**删除异常值具有主观性且会丢失信息。Student-t似然函数可自动降低异常值权重——让模型处理即可。
- **通过与OLS系数的相似性对比模型。**正确的对比方式是LOO-CV预测准确率,而非“哪个模型与OLS一致”。核心在于当OLS假设不成立时,它是错误的。
- **不进行后验预测检查。**即使LOO-CV表现优秀的模型,仍可能存在PPC能发现的误设定(如尾部形状错误、遗漏多峰性)。
- **无依据地使用信息先验。**从弱信息先验开始(系数使用Normal(0, 10),尺度使用HalfNormal(5))。仅当具备真实领域知识时才收紧先验。
Worked example
示例演示
See (marimo notebook). It generates synthetic data with
planted heteroscedasticity and outliers, fits OLS + three Bayesian
models (Normal, heteroscedastic, Student-t), compares them via
LOO-CV, and shows posterior predictive checks. Run it with:
demo.pymarimo edit --sandbox demo.py查看(marimo笔记本)。它生成包含异方差和异常值的合成数据,拟合OLS+三种贝叶斯模型(正态、异方差、Student-t),通过LOO-CV对比模型,并展示后验预测检查。运行方式:
demo.pymarimo edit --sandbox demo.pyReferences
参考文献
- Hogg, Bovy & Lang (2010), "Data analysis recipes: Fitting a model to data" (arXiv:1008.4686)
- McElreath (2020), Statistical Rethinking, Chapters 4-5 (regression) and 13-14 (hierarchical)
- Gelman et al. (2013), Bayesian Data Analysis, Chapter 14 (GLMs)
- Hogg, Bovy & Lang (2010), 《数据分析指南:为数据拟合模型》(arXiv:1008.4686)
- McElreath (2020), 《统计反思》,第4-5章(回归)和第13-14章(分层模型)
- Gelman et al. (2013), 《贝叶斯数据分析》,第14章(GLM)