tooluniverse-statistical-modeling


Statistical Modeling for Biomedical Data Analysis


Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features


✅ Linear Regression - OLS for continuous outcomes with diagnostic tests
✅ Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
✅ Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
✅ Mixed-Effects Models - LMM/GLMM for hierarchical/repeated-measures data
✅ ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
✅ Model Diagnostics - Assumption checking, fit statistics, residual analysis
✅ Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, and more

Quick Start


Binary Logistic Regression


```python
import statsmodels.formula.api as smf
import numpy as np

# Fit logistic regression
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Extract odds ratios
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())
print(f"Odds Ratio for exposure: {odds_ratios['exposure']:.4f}")
print(f"95% CI: ({conf_int.loc['exposure', 0]:.4f}, {conf_int.loc['exposure', 1]:.4f})")
print(f"P-value: {model.pvalues['exposure']:.6f}")
```

Cox Proportional Hazards


```python
from lifelines import CoxPHFitter

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age', 'stage']],
        duration_col='time', event_col='event')

# Get hazard ratio
hr = cph.hazard_ratios_['treatment']
print(f"Hazard Ratio: {hr:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")
```

See `QUICK_START.md` for 8 complete examples.

Model Selection Decision Tree


```
START: What type of outcome variable?
├─ CONTINUOUS (height, blood pressure, score)
│  ├─ Independent observations → Linear Regression (OLS)
│  ├─ Repeated measures → Mixed-Effects Model (LMM)
│  └─ Count data → Poisson/Negative Binomial
├─ BINARY (yes/no, disease/healthy)
│  ├─ Independent observations → Logistic Regression
│  ├─ Repeated measures → Logistic Mixed-Effects (GLMM/GEE)
│  └─ Rare events → Firth logistic regression
├─ ORDINAL (mild/moderate/severe, stages I/II/III/IV)
│  └─ Ordinal Logistic Regression (Proportional Odds)
├─ MULTINOMIAL (>2 unordered categories)
│  └─ Multinomial Logistic Regression
└─ TIME-TO-EVENT (survival time + censoring)
   ├─ Regression → Cox Proportional Hazards
   └─ Survival curves → Kaplan-Meier
```

When to Use


Apply this skill when user asks:
  • "What is the odds ratio of X associated with Y?"
  • "What is the hazard ratio for treatment?"
  • "Fit a linear regression of Y on X1, X2, X3"
  • "Perform ordinal logistic regression for severity outcome"
  • "What is the Kaplan-Meier survival estimate at time T?"
  • "What is the percentage reduction in odds ratio after adjusting for confounders?"
  • "Run a mixed-effects model with random intercepts"
  • "Compute the interaction term between A and B"
  • "What is the F-statistic from ANOVA comparing groups?"
  • "Test if gene/miRNA expression differs across cell types"
  • "Perform one-way ANOVA on expression data"
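The random-intercepts request above has no worked example later in this skill; a minimal statsmodels `MixedLM` sketch on synthetic repeated-measures data (variable names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic longitudinal data: 30 subjects x 4 visits
rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(30), 4)
visit = np.tile(np.arange(4), 30)
subj_effect = rng.normal(0, 2.0, 30)[subjects]  # subject-level random intercepts
y = 10 + 0.5 * visit + subj_effect + rng.normal(0, 1.0, 120)
df = pd.DataFrame({'subject': subjects, 'visit': visit, 'y': y})

# Linear mixed model with a random intercept per subject
m = smf.mixedlm('y ~ visit', df, groups=df['subject']).fit()
print(f"Fixed effect (visit): {m.params['visit']:.4f}")
print(f"Random-intercept variance: {m.cov_re.iloc[0, 0]:.4f}")
```

Adding `re_formula` would also allow random slopes; the default shown here fits intercepts only.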

Workflow


Phase 0: Data Validation


Goal: Load data, identify variable types, check for missing values.
⚠️ CRITICAL: Identify the Outcome Variable First
Before any analysis, verify what you're actually predicting:
  1. Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
  2. Examine available columns - List all columns in the dataset
  3. Match question to data - Find the column that matches the described outcome
  4. Verify outcome exists - Don't create outcome variables from predictors
Common mistake (bix-51-q3 example):
  • ❌ Question mentions "obesity" → Assumed outcome = BMI ≥ 30 (circular logic with BMI predictor)
  • ✅ Read full question → Actual outcome = treatment response (PR vs non-PR)
  • Always check data columns first:
    print(df.columns.tolist())
```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')

# Check structure
print(f"Observations: {len(df)}")
print(f"Variables: {len(df.columns)}")
print(f"Missing: {df.isnull().sum().sum()}")

# Detect variable types
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
```

Phase 1: Model Fitting


Goal: Fit appropriate model based on outcome type.

Linear Regression


```python
import statsmodels.formula.api as smf

# R-style formula (recommended)
model = smf.ols('outcome ~ predictor1 + predictor2 + age', data=df).fit()

# Results
print(f"R-squared: {model.rsquared:.4f}")
print(f"AIC: {model.aic:.2f}")
print(model.summary())
```

Logistic Regression


```python
# Fit model
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Odds ratios
ors = np.exp(model.params)
ci = np.exp(model.conf_int())
for var in ['exposure', 'age', 'sex_M']:
    print(f"{var}: OR={ors[var]:.4f}, CI=({ci.loc[var, 0]:.4f}, {ci.loc[var, 1]:.4f})")
```

Ordinal Logistic Regression


```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Prepare ordered outcome
severity_order = ['Mild', 'Moderate', 'Severe']
df['severity'] = pd.Categorical(df['severity'], categories=severity_order, ordered=True)
y = df['severity'].cat.codes

# Fit model
X = pd.get_dummies(df[['exposure', 'age', 'sex']], drop_first=True, dtype=float)
model = OrderedModel(y, X, distr='logit').fit(method='bfgs', disp=0)

# Odds ratios (the first len(X.columns) params are slopes; the rest are thresholds)
ors = np.exp(model.params[:len(X.columns)])
print(f"Exposure OR: {ors.iloc[0]:.4f}")
```

Cox Proportional Hazards


```python
from lifelines import CoxPHFitter

# Fit model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']], duration_col='time', event_col='event')

# Hazard ratios
print(f"HR (treatment): {cph.hazard_ratios_['treatment']:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")
```

See `references/` for detailed examples of each model type.

Statistical Tests (t-test, ANOVA, Chi-square)


One-way ANOVA: Compare means across ≥3 groups

```python
from scipy import stats

# Single ANOVA (one outcome, multiple groups)
group1 = df[df['celltype'] == 'CD4']['expression']
group2 = df[df['celltype'] == 'CD8']['expression']
group3 = df[df['celltype'] == 'CD14']['expression']

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.6f}")
```

**⚠️ CRITICAL: Multi-feature ANOVA Decision Tree**

When data has **multiple features** (genes, miRNAs, metabolites, etc.), there are TWO approaches:

Question: "What is the F-statistic comparing [feature] expression across groups?"

```
DECISION TREE:
│
├─ Does the question specify "the F-statistic" (singular)?
│  │
│  ├─ YES, singular → likely asking for SPECIFIC FEATURE(S) F-statistic
│  │  │
│  │  ├─ Are there thousands of features (genes, miRNAs)?
│  │  │     YES → Per-feature approach (Method B below)
│  │  │
│  │  └─ Is there one feature of interest?
│  │        YES → Single-feature ANOVA (as above)
│  │
│  └─ NO, asks about "all features" or "genes" (plural)?
│        YES → Aggregate approach (Method A below) or per-feature summary
│
└─ When unsure: calculate PER-FEATURE and report summary statistics
```

**Method A: Aggregate ANOVA** (all features combined)
- Use when: Testing overall expression differences across all features
- Result: Single F-statistic representing a global effect

```python
# Flatten all features across all samples per group
groups_agg = []
for celltype in ['CD4', 'CD8', 'CD14']:
    samples = df[df['celltype'] == celltype]
    # Flatten: all features × all samples in this group
    all_values = expression_matrix.loc[:, samples.index].values.flatten()
    groups_agg.append(all_values)

f_stat_agg, p_value = stats.f_oneway(*groups_agg)
print(f"Aggregate F-statistic: {f_stat_agg:.4f}")
# Result: very large F-statistic (e.g., 153.8)
```
**Method B: Per-feature ANOVA** (each feature separately) ⭐ RECOMMENDED for gene expression
- Use when: Testing EACH feature individually (most common in genomics)
- Result: Distribution of F-statistics (one per feature)

```python
# Calculate the F-statistic FOR EACH FEATURE separately
per_feature_f_stats = []
for feature in expression_matrix.index:  # for each gene/miRNA/metabolite
    groups = []
    for celltype in ['CD4', 'CD8', 'CD14']:
        samples = df[df['celltype'] == celltype]
        # Expression of THIS feature in THIS cell type
        values = expression_matrix.loc[feature, samples.index].values
        groups.append(values)
    f_stat, _ = stats.f_oneway(*groups)
    if not np.isnan(f_stat):
        per_feature_f_stats.append((feature, f_stat))

# Summary statistics
f_values = [f for _, f in per_feature_f_stats]
print("Per-feature F-statistics:")
print(f"  Median: {np.median(f_values):.4f}")
print(f"  Mean: {np.mean(f_values):.4f}")
print(f"  Range: [{np.min(f_values):.4f}, {np.max(f_values):.4f}]")

# Find features in a specific range (e.g., 0.76-0.78)
target_features = [(name, f) for name, f in per_feature_f_stats if 0.76 <= f <= 0.78]
if target_features:
    print(f"Features with F ∈ [0.76, 0.78]: {len(target_features)}")
    for name, f_val in target_features:
        print(f"  {name}: F = {f_val:.6f}")
```

**Key Differences**:

| Aspect | Method A (Aggregate) | Method B (Per-feature) |
|--------|---------------------|------------------------|
| **Interpretation** | Overall expression difference | Feature-specific differences |
| **Result** | 1 F-statistic | N F-statistics (N = # features) |
| **Typical value** | Very large (e.g., 153.8) | Small to large (e.g., 0.1 to 100+) |
| **Use case** | Global effect size | Gene/biomarker discovery |
| **Common in** | Rarely used | **Genomics, proteomics, metabolomics** ⭐ |

**Real-world example (BixBench bix-36-q1)**:
- Question: "What is the F-statistic comparing miRNA expression across immune cell types?"
- Expected: 0.76-0.78
- Method A (aggregate): 153.836 ❌ **WRONG**
- Method B (per-miRNA): Found 2 miRNAs with F ∈ [0.76, 0.78] ✅ **CORRECT**

**Default assumption for gene expression data**: Use **Method B (per-feature)**
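The section heading also names t-tests, chi-square, and rank-based tests, which have no example above; a compact sketch of those scipy calls on synthetic data (group values and the contingency table are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(5.0, 1.0, 50)   # e.g., control expression
b = rng.normal(5.8, 1.0, 50)   # e.g., treated expression

# Two-sample t-test (Welch's, no equal-variance assumption)
t_stat, t_p = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {t_p:.4g}")

# Mann-Whitney U (non-parametric alternative)
u_stat, u_p = stats.mannwhitneyu(a, b)
print(f"U = {u_stat:.1f}, p = {u_p:.4g}")

# Kruskal-Wallis across >= 3 groups
c = rng.normal(5.4, 1.0, 50)
h_stat, k_p = stats.kruskal(a, b, c)
print(f"H = {h_stat:.3f}, p = {k_p:.4g}")

# Chi-square test on a 2x2 contingency table (exposure x case status)
table = np.array([[30, 20], [15, 35]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {chi_p:.4g}, dof = {dof}")
```

Note that `chi2_contingency` applies the Yates continuity correction by default for 2x2 tables.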

Phase 2: Model Diagnostics


Goal: Check model assumptions and fit quality.

OLS Diagnostics


```python
from scipy import stats as scipy_stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Residual normality
residuals = model.resid
sw_stat, sw_p = scipy_stats.shapiro(residuals)
print(f"Shapiro-Wilk: p={sw_p:.4f} (normal if p>0.05)")

# Heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan: p={bp_p:.4f} (homoscedastic if p>0.05)")

# VIF (multicollinearity)
X = model.model.exog
for i in range(1, X.shape[1]):  # skip intercept
    vif = variance_inflation_factor(X, i)
    print(f"{model.model.exog_names[i]}: VIF={vif:.2f}")
```

Proportional Hazards Test


```python
# Test PH assumption for Cox model
results = cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
if len(results) == 0:
    print("✅ Proportional hazards assumption met")
else:
    print(f"⚠️ PH violated for: {results}")
```

See `references/troubleshooting.md` for common diagnostic issues.

Phase 3: Interpretation


Goal: Generate publication-quality summary.

Odds Ratio Interpretation


```python
def interpret_odds_ratio(or_val, ci_lower, ci_upper, p_value):
    """Interpret an odds ratio with clinical meaning."""
    if or_val > 1:
        pct_increase = (or_val - 1) * 100
        direction = f"{pct_increase:.1f}% increase in odds"
    else:
        pct_decrease = (1 - or_val) * 100
        direction = f"{pct_decrease:.1f}% decrease in odds"

    sig = "significant" if p_value < 0.05 else "not significant"

    return f"{direction} (OR={or_val:.4f}, 95% CI [{ci_lower:.4f}, {ci_upper:.4f}], p={p_value:.6f}, {sig})"
```

Common BixBench Patterns


Pattern 1: Odds Ratio from Ordinal Regression


Question: "What is the odds ratio of disease severity associated with exposure?"
Solution:
  1. Identify ordinal outcome (mild/moderate/severe)
  2. Fit ordinal logistic regression (proportional odds model)
  3. Extract OR = exp(coefficient for exposure)
  4. Report with CI and p-value

Pattern 2: Percentage Reduction in Odds


Question: "What is the percentage reduction in OR after adjusting for confounders?"
Solution:

```python
# Unadjusted model
model_crude = smf.logit('outcome ~ exposure', data=df).fit(disp=0)
or_crude = np.exp(model_crude.params['exposure'])

# Adjusted model
model_adj = smf.logit('outcome ~ exposure + age + sex', data=df).fit(disp=0)
or_adj = np.exp(model_adj.params['exposure'])

# Percentage reduction
pct_reduction = (or_crude - or_adj) / or_crude * 100
print(f"Percentage reduction: {pct_reduction:.1f}%")
```

Pattern 3: Interaction Effects


Question: "What is the odds ratio for the interaction between A and B?"
Solution:

```python
# Fit model with interaction
model = smf.logit('outcome ~ A * B + age', data=df).fit(disp=0)

# Interaction OR
interaction_coef = model.params['A:B']
interaction_or = np.exp(interaction_coef)
print(f"Interaction OR: {interaction_or:.4f}")
```

Pattern 4: Survival Analysis


Question: "What is the hazard ratio for treatment?"
Solution:
  1. Load survival data (time, event, covariates)
  2. Fit Cox proportional hazards model
  3. Extract HR = exp(coefficient)
  4. Report with CI and concordance index

Pattern 5: Multi-feature ANOVA (Gene Expression)


Question: "What is the F-statistic comparing miRNA expression across cell types?"
Solution:
  1. Identify that data has multiple features (genes/miRNAs)
  2. Use per-feature ANOVA (NOT aggregate)
  3. Calculate F-statistic for EACH feature separately
  4. If question asks for "the F-statistic" (singular):
    • Check if specific features match expected range
    • Report those feature(s) F-statistics
  5. If question asks for summary: report median/mean/distribution
Critical: For gene expression data, default to per-feature ANOVA. Aggregate ANOVA gives F-statistics ~200× larger and is rarely correct.
See `references/bixbench_patterns.md` for 15+ question patterns.

Statsmodels vs Scikit-learn


| Use Case | Library | Reason |
|----------|---------|--------|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).

Python Package Requirements


```
statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0
```

File Structure


```
tooluniverse-statistical-modeling/
├── SKILL.md                          # This file
├── QUICK_START.md                    # 8 quick examples
├── EXAMPLES.md                       # Legacy examples (kept for reference)
├── TOOLS_REFERENCE.md                # ToolUniverse tool catalog
├── test_skill.py                     # Comprehensive test suite
├── references/
│   ├── logistic_regression.md        # Detailed logistic examples
│   ├── ordinal_logistic.md           # Ordinal logit guide
│   ├── cox_regression.md             # Survival analysis guide
│   ├── linear_models.md              # OLS and mixed-effects
│   ├── bixbench_patterns.md          # 15+ question patterns
│   └── troubleshooting.md            # Diagnostic issues
└── scripts/
    ├── format_statistical_output.py  # Format results for reporting
    └── model_diagnostics.py          # Automated diagnostics
```

Key Principles


  1. Data-first approach - Always inspect and validate data before modeling
  2. Model selection by outcome type - Use decision tree above
  3. Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
  4. Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
  5. Confounder awareness - Adjust for confounders when specified or clinically relevant
  6. Reproducible analysis - All code must be deterministic and reproducible
  7. Robust error handling - Graceful handling of convergence failures, separation, collinearity
  8. Round correctly - Match the precision requested (typically 2-4 decimal places)
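Principle 7 can be sketched as a guarded fit: attempt plain maximum likelihood, then fall back to a penalized fit if the optimizer fails or does not converge (complete separation is a typical trigger). This is a sketch on synthetic data, not the skill's mandated recipe; the fallback uses statsmodels' `fit_regularized`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic near-separated data: outcome almost fully determined by x
df = pd.DataFrame({'x': np.linspace(-2, 2, 40)})
df['y'] = (df['x'] > 0).astype(int)
df.loc[[18, 21], 'y'] = 1 - df.loc[[18, 21], 'y']  # flip two points to break separation

model = smf.logit('y ~ x', data=df)
try:
    res = model.fit(disp=0)
    if not res.mle_retvals.get('converged', True):
        raise RuntimeError('optimizer did not converge')
except Exception as exc:
    # Penalized (L1) fit stays finite even under separation
    res = model.fit_regularized(method='l1', alpha=1.0, disp=0)
    print(f"Fell back to penalized fit after: {type(exc).__name__}")

print(f"Coefficient for x: {res.params['x']:.4f}")
```

For collinearity, the VIF loop in Phase 2 is the matching diagnostic; dropping or combining the offending predictors is usually preferable to penalization alone.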

Completeness Checklist


Before finalizing any statistical analysis:
  • Outcome variable identified: Verified which column is the actual outcome (not assumed)
  • Data validated: N, missing values, variable types confirmed
  • Multi-feature data identified: If data has multiple features (genes, miRNAs), use per-feature approach
  • Model appropriate: Outcome type matches model family
  • Assumptions checked: Relevant diagnostics performed
  • Effect sizes reported: OR/HR/Cohen's d with CIs
  • P-values reported: With appropriate correction if needed
  • Model fit assessed: R-squared, AIC/BIC, concordance
  • Results interpreted: Plain-language interpretation
  • Precision correct: Numbers rounded appropriately
  • Confounders addressed: Adjusted analyses if applicable
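The "appropriate correction if needed" item maps onto `statsmodels.stats.multitest.multipletests`; a sketch applying Benjamini-Hochberg FDR to synthetic per-feature p-values (the 90-null/10-signal split is illustrative):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Synthetic per-feature p-values: 90 nulls + 10 strong signals
rng = np.random.default_rng(4)
p_values = np.concatenate([rng.uniform(0, 1, 90),
                           rng.uniform(0, 1e-4, 10)])

# Benjamini-Hochberg FDR at 5%
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(f"Significant after FDR: {reject.sum()} / {len(p_values)}")
print(f"Smallest adjusted p: {p_adj.min():.2e}")
```

Use `method='bonferroni'` for family-wise control; for per-feature ANOVA across thousands of genes, FDR is the usual choice.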

References


ToolUniverse Integration


While this skill is primarily computational, ToolUniverse tools can provide data:

| Use Case | Tools |
|----------|-------|
| Clinical trial data | clinical_trials_search |
| Drug safety outcomes | FAERS_calculate_disproportionality |
| Gene-disease associations | OpenTargets_target_disease_evidence |
| Biomarker data | fda_pharmacogenomic_biomarkers |

See `TOOLS_REFERENCE.md` for the complete tool catalog.

Support


For detailed examples and troubleshooting:
  • Logistic regression: `references/logistic_regression.md`
  • Ordinal models: `references/ordinal_logistic.md`
  • Survival analysis: `references/cox_regression.md`
  • Linear/mixed models: `references/linear_models.md`
  • BixBench patterns: `references/bixbench_patterns.md`
  • Diagnostics: `references/troubleshooting.md`