Statsmodels: Statistical Modeling and Econometrics
Overview
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
When to Use This Skill
This skill should be used when:
- Fitting regression models (OLS, WLS, GLS, quantile regression)
- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting)
- Running statistical tests and diagnostics
- Testing model assumptions (heteroskedasticity, autocorrelation, normality)
- Detecting outliers and influential observations
- Comparing models (AIC/BIC, likelihood ratio tests)
- Estimating causal effects
- Producing publication-ready statistical tables and inference
Quick Start Guide
Linear Regression (OLS)
```python
import statsmodels.api as sm
import numpy as np
import pandas as pd

# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)

# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()

# View comprehensive results
print(results.summary())

# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")

# Predictions with confidence intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary)  # includes mean, CI, prediction intervals

# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")

# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```
Logistic Regression (Binary Outcomes)
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit

# Add constant
X = sm.add_constant(X_data)

# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())

# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)

# Predicted probabilities
probs = results.predict(X)

# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)

# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")

# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
```
Time Series (ARIMA)
```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Check stationarity
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
    # Series is non-stationary, difference it
    y_diff = y_series.diff().dropna()

# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()

# Fit ARIMA(p,d,q)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)  # includes mean and confidence intervals

# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
```
Generalized Linear Models (GLM)
```python
import numpy as np
import statsmodels.api as sm

# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)

# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
    # Use Negative Binomial instead
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
```
Core Statistical Modeling Capabilities
1. Linear Regression Models
Comprehensive suite of linear models for continuous outcomes with various error structures.
Available models:
- OLS: Standard linear regression with i.i.d. errors
- WLS: Weighted least squares for heteroskedastic errors
- GLS: Generalized least squares for arbitrary covariance structure
- GLSAR: GLS with autoregressive errors for time series
- Quantile Regression: Conditional quantiles (robust to outliers)
- Mixed Effects: Hierarchical/multilevel models with random effects
- Recursive/Rolling: Time-varying parameter estimation
Key features:
- Comprehensive diagnostic tests
- Robust standard errors (HC, HAC, cluster-robust)
- Influence statistics (Cook's distance, leverage, DFFITS)
- Hypothesis testing (F-tests, Wald tests)
- Model comparison (AIC, BIC, likelihood ratio tests)
- Prediction with confidence and prediction intervals
When to use: Continuous outcome variable, want inference on coefficients, need diagnostics
Reference: See references/linear_models.md for detailed guidance on model selection, diagnostics, and best practices.
2. Generalized Linear Models (GLM)
Flexible framework extending linear models to non-normal distributions.
Distribution families:
- Binomial: Binary outcomes or proportions (logistic regression)
- Poisson: Count data
- Negative Binomial: Overdispersed counts
- Gamma: Positive continuous, right-skewed data
- Inverse Gaussian: Positive continuous with specific variance structure
- Gaussian: Equivalent to OLS
- Tweedie: Flexible family for semi-continuous data
Link functions:
- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power
- Choose based on interpretation needs and model fit
Key features:
- Maximum likelihood estimation via IRLS
- Deviance and Pearson residuals
- Goodness-of-fit statistics
- Pseudo R-squared measures
- Robust standard errors
When to use: Non-normal outcomes, need flexible variance and link specifications
Reference: See references/glm.md for family selection, link functions, interpretation, and diagnostics.
3. Discrete Choice Models
Models for categorical and count outcomes.
Binary models:
- Logit: Logistic regression (odds ratios)
- Probit: Probit regression (normal distribution)
Multinomial models:
- MNLogit: Unordered categories (3+ levels)
- Conditional Logit: Choice models with alternative-specific variables
- Ordered Model: Ordinal outcomes (ordered categories)
Count models:
- Poisson: Standard count model
- Negative Binomial: Overdispersed counts
- Zero-Inflated: Excess zeros (ZIP, ZINB)
- Hurdle Models: Two-stage models for zero-heavy data
Key features:
- Maximum likelihood estimation
- Marginal effects at means or average marginal effects
- Model comparison via AIC/BIC
- Predicted probabilities and classification
- Goodness-of-fit tests
When to use: Binary, categorical, or count outcomes
Reference: See references/discrete_choice.md for model selection, interpretation, and evaluation.
4. Time Series Analysis
Comprehensive time series modeling and forecasting capabilities.
Univariate models:
- AutoReg (AR): Autoregressive models
- ARIMA: Autoregressive integrated moving average
- SARIMAX: Seasonal ARIMA with exogenous variables
- Exponential Smoothing: Simple, Holt, Holt-Winters
- ETS: Innovations state space models
Multivariate models:
- VAR: Vector autoregression
- VARMAX: VAR with MA and exogenous variables
- Dynamic Factor Models: Extract common factors
- VECM: Vector error correction models (cointegration)
Advanced models:
- State Space: Kalman filtering, custom specifications
- Regime Switching: Markov switching models
- ARDL: Autoregressive distributed lag
Key features:
- ACF/PACF analysis for model identification
- Stationarity tests (ADF, KPSS)
- Forecasting with prediction intervals
- Residual diagnostics (Ljung-Box, heteroskedasticity)
- Granger causality testing
- Impulse response functions (IRF)
- Forecast error variance decomposition (FEVD)
When to use: Time-ordered data, forecasting, understanding temporal dynamics
Reference: See references/time_series.md for model selection, diagnostics, and forecasting methods.
5. Statistical Tests and Diagnostics
Extensive testing and diagnostic capabilities for model validation.
Residual diagnostics:
- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey)
- Heteroskedasticity tests (Breusch-Pagan, White, ARCH)
- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors)
- Specification tests (RESET, Harvey-Collier)
Influence and outliers:
- Leverage (hat values)
- Cook's distance
- DFFITS and DFBETAs
- Studentized residuals
- Influence plots
Hypothesis testing:
- t-tests (one-sample, two-sample, paired)
- Proportion tests
- Chi-square tests
- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis)
- ANOVA (one-way, two-way, repeated measures)
Multiple comparisons:
- Tukey's HSD
- Bonferroni correction
- False Discovery Rate (FDR)
Effect sizes and power:
- Cohen's d, eta-squared
- Power analysis for t-tests, proportions
- Sample size calculations
Robust inference:
- Heteroskedasticity-consistent SEs (HC0-HC3)
- HAC standard errors (Newey-West)
- Cluster-robust standard errors
When to use: Validating assumptions, detecting problems, ensuring robust inference
Reference: See references/stats_diagnostics.md for comprehensive testing and diagnostic procedures.
Formula API (R-style)
Statsmodels supports R-style formulas for intuitive model specification:
```python
import statsmodels.formula.api as smf

# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()

# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()

# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit()  # x1 + x2 + x1:x2

# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()

# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()

# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()

# ARIMA (not available via formula, use regular API)
```
Model Selection and Comparison
Information Criteria
```python
import pandas as pd

# Compare models using AIC/BIC
models = {
    'Model 1': model1_results,
    'Model 2': model2_results,
    'Model 3': model3_results
}
comparison = pd.DataFrame({
    'AIC': {name: res.aic for name, res in models.items()},
    'BIC': {name: res.bic for name, res in models.items()},
    'Log-Likelihood': {name: res.llf for name, res in models.items()}
})
print(comparison.sort_values('AIC'))

# Lower AIC/BIC indicates better model
```
Likelihood Ratio Test (Nested Models)
```python
# For nested models (one is subset of the other)
from scipy import stats

lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Full model significantly better")
else:
    print("Reduced model preferred (parsimony)")
```
Cross-Validation
```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Fit model
    model = sm.OLS(y_train, X_train).fit()
    # Predict
    y_pred = model.predict(X_val)
    # Score
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```
Best Practices
Data Preparation
- Always add constant: Use sm.add_constant() unless excluding intercept
- Check for missing values: Handle or impute before fitting
- Scale if needed: Improves convergence, interpretation (but not required for tree models)
- Encode categoricals: Use formula API or manual dummy coding
Model Building
- Start simple: Begin with basic model, add complexity as needed
- Check assumptions: Test residuals, heteroskedasticity, autocorrelation
- Use appropriate model: Match model to outcome type (binary→Logit, count→Poisson)
- Consider alternatives: If assumptions violated, use robust methods or different model
Inference
- Report effect sizes: Not just p-values
- Use robust SEs: When heteroskedasticity or clustering present
- Multiple comparisons: Correct when testing many hypotheses
- Confidence intervals: Always report alongside point estimates
Model Evaluation
- Check residuals: Plot residuals vs fitted, Q-Q plot
- Influence diagnostics: Identify and investigate influential observations
- Out-of-sample validation: Test on holdout set or cross-validate
- Compare models: Use AIC/BIC for non-nested, LR test for nested
Reporting
- Comprehensive summary: Use .summary() for detailed output
- Document decisions: Note transformations, excluded observations
- Interpret carefully: Account for link functions (e.g., exp(β) for log link)
- Visualize: Plot predictions, confidence intervals, diagnostics
Common Workflows
Workflow 1: Linear Regression Analysis
- Explore data (plots, descriptives)
- Fit initial OLS model
- Check residual diagnostics
- Test for heteroskedasticity, autocorrelation
- Check for multicollinearity (VIF)
- Identify influential observations
- Refit with robust SEs if needed
- Interpret coefficients and inference
- Validate on holdout or via CV
- 探索数据(绘图、描述性统计)
- 拟合初始OLS模型
- 检查残差诊断
- 检验异方差、自相关
- 检验多重共线性(VIF值)
- 识别影响性观测值
- 必要时使用稳健标准误重新拟合
- 解释系数与推断结果
- 在保留集或通过交叉验证验证模型
Workflow 2: Binary Classification
- Fit logistic regression (Logit)
- Check for convergence issues
- Interpret odds ratios
- Calculate marginal effects
- Evaluate classification performance (AUC, confusion matrix)
- Check for influential observations
- Compare with alternative models (Probit)
- Validate predictions on test set
Workflow 3: Count Data Analysis
- Fit Poisson regression
- Check for overdispersion
- If overdispersed, fit Negative Binomial
- Check for excess zeros (consider ZIP/ZINB)
- Interpret rate ratios
- Assess goodness of fit
- Compare models via AIC
- Validate predictions
Workflow 4: Time Series Forecasting
- Plot series, check for trend/seasonality
- Test for stationarity (ADF, KPSS)
- Difference if non-stationary
- Identify p, q from ACF/PACF
- Fit ARIMA or SARIMAX
- Check residual diagnostics (Ljung-Box)
- Generate forecasts with confidence intervals
- Evaluate forecast accuracy on test set
Reference Documentation
This skill includes comprehensive reference files for detailed guidance:
references/linear_models.md
Detailed coverage of linear regression models including:
- OLS, WLS, GLS, GLSAR, Quantile Regression
- Mixed effects models
- Recursive and rolling regression
- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity)
- Influence statistics and outlier detection
- Robust standard errors (HC, HAC, cluster)
- Hypothesis testing and model comparison
references/glm.md
Complete guide to generalized linear models:
- All distribution families (Binomial, Poisson, Gamma, etc.)
- Link functions and when to use each
- Model fitting and interpretation
- Pseudo R-squared and goodness of fit
- Diagnostics and residual analysis
- Applications (logistic, Poisson, Gamma regression)
references/discrete_choice.md
Comprehensive guide to discrete outcome models:
- Binary models (Logit, Probit)
- Multinomial models (MNLogit, Conditional Logit)
- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle)
- Ordinal models
- Marginal effects and interpretation
- Model diagnostics and comparison
references/time_series.md
In-depth time series analysis guidance:
- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing)
- Multivariate models (VAR, VARMAX, Dynamic Factor)
- State space models
- Stationarity testing and diagnostics
- Forecasting methods and evaluation
- Granger causality, IRF, FEVD
references/stats_diagnostics.md
Comprehensive statistical testing and diagnostics:
- Residual diagnostics (autocorrelation, heteroskedasticity, normality)
- Influence and outlier detection
- Hypothesis tests (parametric and non-parametric)
- ANOVA and post-hoc tests
- Multiple comparisons correction
- Robust covariance matrices
- Power analysis and effect sizes
When to reference:
- Need detailed parameter explanations
- Choosing between similar models
- Troubleshooting convergence or diagnostic issues
- Understanding specific test statistics
- Looking for code examples for advanced features
Search patterns:
```bash
# Find information about specific models
grep -r "Quantile Regression" references/

# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md

# Find time series guidance
grep -r "SARIMAX" references/time_series.md
```
Common Pitfalls to Avoid
- Forgetting constant term: Always use sm.add_constant() unless no intercept desired
- Ignoring assumptions: Check residuals, heteroskedasticity, autocorrelation
- Wrong model for outcome type: Binary→Logit/Probit, Count→Poisson/NB, not OLS
- Not checking convergence: Look for optimization warnings
- Misinterpreting coefficients: Remember link functions (log, logit, etc.)
- Using Poisson with overdispersion: Check dispersion, use Negative Binomial if needed
- Not using robust SEs: When heteroskedasticity or clustering present
- Overfitting: Too many parameters relative to sample size
- Data leakage: Fitting on test data or using future information
- Not validating predictions: Always check out-of-sample performance
- Comparing non-nested models: Use AIC/BIC, not LR test
- Ignoring influential observations: Check Cook's distance and leverage
- Multiple testing: Correct p-values when testing many hypotheses
- Not differencing time series: Fit ARIMA on non-stationary data
- Confusing prediction vs confidence intervals: Prediction intervals are wider
Getting Help
For detailed documentation and examples:
- Official docs: https://www.statsmodels.org/stable/
- User guide: https://www.statsmodels.org/stable/user-guide.html
- Examples: https://www.statsmodels.org/stable/examples/index.html

Python Data Stack v1.1 - Enhanced
🔄 Workflow
Stage 1: Local Data Processing (Modern OLAP)
Source: Modern Data Stack & Polars Documentation
- Polars: Use Rust-based Polars instead of Pandas for large (GB+) data, for lazy evaluation and parallelism.
- DuckDB: Use DuckDB for local SQL analytics and for querying Parquet files (in-process OLAP).
- Format: Store data in Parquet or Arrow format instead of CSV (speed and compression).
Stage 2: Analysis & Visualization
- Notebooks: Do interactive analysis in Jupyter Notebook (or VS Code Notebooks), but keep production code in .py files.
- Viz: Use Altair or Plotnine (Grammar of Graphics) for declarative visualization instead of Matplotlib.
- Profiling: Generate ydata-profiling reports to understand data quality.
Stage 3: Production Pipeline
- Orchestration: Use Prefect or Dagster for ETL jobs (more modern and lightweight than Airflow).
- Validation: Use Pydantic or Great Expectations to validate data schemas and quality.
- Environment: Use uv or conda environments to isolate dependencies.
Checkpoints
| Stage | Verification |
|---|---|
| 1 | Does data processing fit in memory (RAM)? (If not, use Polars Streaming or DuckDB.) |
| 2 | Have repeated analyses been turned into modular functions? |
| 3 | Has data privacy (PII) been checked? |