# tooluniverse-statistical-modeling
Statistical Modeling for Biomedical Data Analysis
Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.
## Features
✅ Linear Regression - OLS for continuous outcomes with diagnostic tests
✅ Logistic Regression - Binary, ordinal, and multinomial models with odds ratios
✅ Survival Analysis - Cox proportional hazards and Kaplan-Meier curves
✅ Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data
✅ ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data
✅ Model Diagnostics - Assumption checking, fit statistics, residual analysis
✅ Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.
## Quick Start
### Binary Logistic Regression
```python
import statsmodels.formula.api as smf
import numpy as np

# Fit logistic regression
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Extract odds ratios
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())
print(f"Odds Ratio for exposure: {odds_ratios['exposure']:.4f}")
print(f"95% CI: ({conf_int.loc['exposure', 0]:.4f}, {conf_int.loc['exposure', 1]:.4f})")
print(f"P-value: {model.pvalues['exposure']:.6f}")
```
### Cox Proportional Hazards

```python
from lifelines import CoxPHFitter

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age', 'stage']],
        duration_col='time', event_col='event')

# Get hazard ratio
hr = cph.hazard_ratios_['treatment']
print(f"Hazard Ratio: {hr:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")
```

See `QUICK_START.md` for 8 complete examples.

## Model Selection Decision Tree
```
START: What type of outcome variable?
│
├─ CONTINUOUS (height, blood pressure, score)
│  ├─ Independent observations → Linear Regression (OLS)
│  ├─ Repeated measures → Mixed-Effects Model (LMM)
│  └─ Count data → Poisson/Negative Binomial
│
├─ BINARY (yes/no, disease/healthy)
│  ├─ Independent observations → Logistic Regression
│  ├─ Repeated measures → Logistic Mixed-Effects (GLMM/GEE)
│  └─ Rare events → Firth logistic regression
│
├─ ORDINAL (mild/moderate/severe, stages I/II/III/IV)
│  └─ Ordinal Logistic Regression (Proportional Odds)
│
├─ MULTINOMIAL (>2 unordered categories)
│  └─ Multinomial Logistic Regression
│
└─ TIME-TO-EVENT (survival time + censoring)
   ├─ Regression → Cox Proportional Hazards
   └─ Survival curves → Kaplan-Meier
```

## When to Use
Apply this skill when the user asks:
- "What is the odds ratio of X associated with Y?"
- "What is the hazard ratio for treatment?"
- "Fit a linear regression of Y on X1, X2, X3"
- "Perform ordinal logistic regression for severity outcome"
- "What is the Kaplan-Meier survival estimate at time T?"
- "What is the percentage reduction in odds ratio after adjusting for confounders?"
- "Run a mixed-effects model with random intercepts"
- "Compute the interaction term between A and B"
- "What is the F-statistic from ANOVA comparing groups?"
- "Test if gene/miRNA expression differs across cell types"
- "Perform one-way ANOVA on expression data"
## Workflow
### Phase 0: Data Validation
Goal: Load data, identify variable types, check for missing values.
**⚠️ CRITICAL: Identify the Outcome Variable First**
Before any analysis, verify what you're actually predicting:
- Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
- Examine available columns - List all columns in the dataset
- Match question to data - Find the column that matches the described outcome
- Verify outcome exists - Don't create outcome variables from predictors
Common mistake (bix-51-q3 example):
- ❌ Question mentions "obesity" → Assumed outcome = BMI ≥ 30 (circular logic with BMI predictor)
- ✅ Read full question → Actual outcome = treatment response (PR vs non-PR)
- Always check data columns first: `print(df.columns.tolist())`
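A sketch of this outcome check on a toy frame (the column names and values here are hypothetical, chosen to mirror the bix-51-q3 pitfall above):

```python
import pandas as pd

# Toy data: 'response' is the true outcome; 'bmi' is just a predictor
df_toy = pd.DataFrame({
    'bmi': [31.0, 24.5, 28.2, 26.7],
    'age': [55, 61, 47, 52],
    'response': ['PR', 'non-PR', 'PR', 'non-PR'],
})

# 1. List all columns before choosing an outcome
print(df_toy.columns.tolist())

# 2. Match the question's outcome to an existing column; never derive it from a predictor
outcome_col = 'response'
assert outcome_col in df_toy.columns, f"{outcome_col} not found in data"
print(df_toy[outcome_col].value_counts())
```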
```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')

# Check structure
print(f"Observations: {len(df)}")
print(f"Variables: {len(df.columns)}")
print(f"Missing: {df.isnull().sum().sum()}")

# Detect variable types
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")
```
### Phase 1: Model Fitting
Goal: Fit appropriate model based on outcome type.
#### Linear Regression
```python
import statsmodels.formula.api as smf

# R-style formula (recommended)
model = smf.ols('outcome ~ predictor1 + predictor2 + age', data=df).fit()

# Results
print(f"R-squared: {model.rsquared:.4f}")
print(f"AIC: {model.aic:.2f}")
print(model.summary())
```
#### Logistic Regression
```python
# Fit model
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Odds ratios
ors = np.exp(model.params)
ci = np.exp(model.conf_int())
for var in ['exposure', 'age', 'sex_M']:
    print(f"{var}: OR={ors[var]:.4f}, CI=({ci.loc[var, 0]:.4f}, {ci.loc[var, 1]:.4f})")
```
#### Ordinal Logistic Regression
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Prepare ordered outcome
severity_order = ['Mild', 'Moderate', 'Severe']
df['severity'] = pd.Categorical(df['severity'], categories=severity_order, ordered=True)
y = df['severity'].cat.codes

# Fit model
X = pd.get_dummies(df[['exposure', 'age', 'sex']], drop_first=True, dtype=float)
model = OrderedModel(y, X, distr='logit').fit(method='bfgs', disp=0)

# Odds ratios (first len(X.columns) params are slopes; the rest are thresholds)
ors = np.exp(model.params[:len(X.columns)])
print(f"Exposure OR: {ors.iloc[0]:.4f}")
```
#### Cox Proportional Hazards
```python
from lifelines import CoxPHFitter

# Fit model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']],
        duration_col='time', event_col='event')

# Hazard ratios
print(f"HR (treatment): {cph.hazard_ratios_['treatment']:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")
```

See `references/` for detailed examples of each model type.

#### Statistical Tests (t-test, ANOVA, Chi-square)
**One-way ANOVA**: Compare means across ≥3 groups

```python
from scipy import stats

# Single ANOVA (one outcome, multiple groups)
group1 = df[df['celltype'] == 'CD4']['expression']
group2 = df[df['celltype'] == 'CD8']['expression']
group3 = df[df['celltype'] == 'CD14']['expression']
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.6f}")
```

**⚠️ CRITICAL: Multi-feature ANOVA Decision Tree**

When data has **multiple features** (genes, miRNAs, metabolites, etc.), there are TWO approaches:

Question: "What is the F-statistic comparing [feature] expression across groups?"

```
DECISION TREE:
│
├─ Does the question specify "the F-statistic" (singular)?
│  │
│  ├─ YES, singular → Likely asking for a SPECIFIC FEATURE's F-statistic
│  │  │
│  │  ├─ Are there thousands of features (genes, miRNAs)?
│  │  │  YES → Per-feature approach (Method B below)
│  │  │
│  │  └─ Is there one feature of interest?
│  │     YES → Single-feature ANOVA (Method A below)
│  │
│  └─ NO, asks about "all features" or "genes" (plural)?
│     YES → Aggregate approach or per-feature summary
│
└─ When unsure: calculate PER-FEATURE and report summary statistics
```

**Method A: Aggregate ANOVA** (all features combined)
- Use when: testing overall expression differences across all features
- Result: single F-statistic representing a global effect

```python
# Flatten all features across all samples per group
groups_agg = []
for celltype in ['CD4', 'CD8', 'CD14']:
    samples = df[df['celltype'] == celltype]
    # Flatten: all features × all samples in this group
    all_values = expression_matrix.loc[:, samples.index].values.flatten()
    groups_agg.append(all_values)
f_stat_agg, p_value = stats.f_oneway(*groups_agg)
print(f"Aggregate F-statistic: {f_stat_agg:.4f}")
# Result: very large F-statistic (e.g., 153.8)
```

**Method B: Per-feature ANOVA** (each feature separately) ⭐ RECOMMENDED for gene expression
- Use when: testing EACH feature individually (most common in genomics)
- Result: a distribution of F-statistics (one per feature)

```python
# Calculate F-statistic FOR EACH FEATURE separately
per_feature_f_stats = []
for feature in expression_matrix.index:  # For each gene/miRNA/metabolite
    groups = []
    for celltype in ['CD4', 'CD8', 'CD14']:
        samples = df[df['celltype'] == celltype]
        # Get expression of THIS feature in THIS cell type
        values = expression_matrix.loc[feature, samples.index].values
        groups.append(values)
    f_stat, _ = stats.f_oneway(*groups)
    if not np.isnan(f_stat):
        per_feature_f_stats.append((feature, f_stat))

# Summary statistics
f_values = [f for _, f in per_feature_f_stats]
print("Per-feature F-statistics:")
print(f"  Median: {np.median(f_values):.4f}")
print(f"  Mean: {np.mean(f_values):.4f}")
print(f"  Range: [{np.min(f_values):.4f}, {np.max(f_values):.4f}]")

# Find features in a specific range (e.g., 0.76-0.78)
target_features = [(name, f) for name, f in per_feature_f_stats
                   if 0.76 <= f <= 0.78]
if target_features:
    print(f"Features with F ∈ [0.76, 0.78]: {len(target_features)}")
    for name, f_val in target_features:
        print(f"  {name}: F = {f_val:.6f}")
```

**Key Differences**:

| Aspect | Method A (Aggregate) | Method B (Per-feature) |
|--------|---------------------|------------------------|
| **Interpretation** | Overall expression difference | Feature-specific differences |
| **Result** | 1 F-statistic | N F-statistics (N = # features) |
| **Typical value** | Very large (e.g., 153.8) | Small to large (e.g., 0.1 to 100+) |
| **Use case** | Global effect size | Gene/biomarker discovery |
| **Common in** | Rarely used | **Genomics, proteomics, metabolomics** ⭐ |

**Real-world example (BixBench bix-36-q1)**:
- Question: "What is the F-statistic comparing miRNA expression across immune cell types?"
- Expected: 0.76-0.78
- Method A (aggregate): 153.836 ❌ **WRONG**
- Method B (per-miRNA): found 2 miRNAs with F ∈ [0.76, 0.78] ✅ **CORRECT**

**Default assumption for gene expression data**: Use **Method B (per-feature)**.
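The section heading also lists t-tests and chi-square; these follow the same `scipy.stats` pattern. A minimal sketch on synthetic data (group values and the contingency table are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 40)
b = rng.normal(0.5, 1.0, 40)

# Welch's two-sample t-test (does not assume equal variances)
t_stat, t_p = stats.ttest_ind(a, b, equal_var=False)

# Mann-Whitney U: nonparametric two-group alternative
u_stat, u_p = stats.mannwhitneyu(a, b)

# Chi-square test of independence on a 2x2 contingency table
table = np.array([[20, 10], [15, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}, chi-square p={chi_p:.4f} (dof={dof})")
```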
### Phase 2: Model Diagnostics
Goal: Check model assumptions and fit quality.
#### OLS Diagnostics
```python
from scipy import stats as scipy_stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Residual normality
residuals = model.resid
sw_stat, sw_p = scipy_stats.shapiro(residuals)
print(f"Shapiro-Wilk: p={sw_p:.4f} (normal if p>0.05)")

# Heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan: p={bp_p:.4f} (homoscedastic if p>0.05)")

# VIF (multicollinearity)
X = model.model.exog
for i in range(1, X.shape[1]):  # Skip intercept
    vif = variance_inflation_factor(X, i)
    print(f"{model.model.exog_names[i]}: VIF={vif:.2f}")
```
#### Proportional Hazards Test
```python
# Test PH assumption for Cox model
results = cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
if len(results) == 0:
    print("✅ Proportional hazards assumption met")
else:
    print(f"⚠️ PH violated for: {results}")
```

See `references/troubleshooting.md` for common diagnostic issues.

### Phase 3: Interpretation
Goal: Generate publication-quality summary.
#### Odds Ratio Interpretation
```python
def interpret_odds_ratio(or_val, ci_lower, ci_upper, p_value):
    """Interpret odds ratio with clinical meaning."""
    if or_val > 1:
        pct_increase = (or_val - 1) * 100
        direction = f"{pct_increase:.1f}% increase in odds"
    else:
        pct_decrease = (1 - or_val) * 100
        direction = f"{pct_decrease:.1f}% decrease in odds"
    sig = "significant" if p_value < 0.05 else "not significant"
    ci_contains_null = ci_lower <= 1 <= ci_upper  # sanity check: CI straddles 1 ⇔ non-significance at 0.05
    return f"{direction} (OR={or_val:.4f}, 95% CI [{ci_lower:.4f}, {ci_upper:.4f}], p={p_value:.6f}, {sig})"
```
## Common BixBench Patterns
### Pattern 1: Odds Ratio from Ordinal Regression
Question: "What is the odds ratio of disease severity associated with exposure?"
Solution:
- Identify ordinal outcome (mild/moderate/severe)
- Fit ordinal logistic regression (proportional odds model)
- Extract OR = exp(coefficient for exposure)
- Report with CI and p-value
### Pattern 2: Percentage Reduction in Odds
Question: "What is the percentage reduction in OR after adjusting for confounders?"
Solution:
```python
# Unadjusted model
model_crude = smf.logit('outcome ~ exposure', data=df).fit(disp=0)
or_crude = np.exp(model_crude.params['exposure'])

# Adjusted model
model_adj = smf.logit('outcome ~ exposure + age + sex', data=df).fit(disp=0)
or_adj = np.exp(model_adj.params['exposure'])

# Percentage reduction
pct_reduction = (or_crude - or_adj) / or_crude * 100
print(f"Percentage reduction: {pct_reduction:.1f}%")
```
### Pattern 3: Interaction Effects
Question: "What is the odds ratio for the interaction between A and B?"
Solution:
```python
# Fit model with interaction
model = smf.logit('outcome ~ A * B + age', data=df).fit(disp=0)

# Interaction OR
interaction_coef = model.params['A:B']
interaction_or = np.exp(interaction_coef)
print(f"Interaction OR: {interaction_or:.4f}")
```
### Pattern 4: Survival Analysis
Question: "What is the hazard ratio for treatment?"
Solution:
- Load survival data (time, event, covariates)
- Fit Cox proportional hazards model
- Extract HR = exp(coefficient)
- Report with CI and concordance index
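The skill also fields questions like "What is the Kaplan-Meier survival estimate at time T?". A from-scratch sketch of the estimator on toy data — in practice use lifelines' `KaplanMeierFitter`; this just makes the product-limit formula S(t) = Π(1 − dᵢ/nᵢ) concrete:

```python
import numpy as np

# Toy survival data: observed times and event indicators (1=event, 0=censored)
time = np.array([5, 8, 12, 3, 9, 11, 7, 6])
event = np.array([1, 0, 1, 1, 0, 1, 1, 0])

def km_estimate(time, event, t):
    """Kaplan-Meier S(t): product over event times t_i <= t of (1 - d_i/n_i)."""
    s = 1.0
    for t_i in np.sort(np.unique(time[event == 1])):
        if t_i > t:
            break
        n_i = np.sum(time >= t_i)                       # at risk just before t_i
        d_i = np.sum((time == t_i) & (event == 1))      # events at t_i
        s *= 1 - d_i / n_i
    return s

print(f"KM survival estimate at t=8: {km_estimate(time, event, 8):.4f}")  # → 0.6000
```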
### Pattern 5: Multi-feature ANOVA (Gene Expression)
Question: "What is the F-statistic comparing miRNA expression across cell types?"
Solution:
- Identify that data has multiple features (genes/miRNAs)
- Use per-feature ANOVA (NOT aggregate)
- Calculate F-statistic for EACH feature separately
- If the question asks for "the F-statistic" (singular):
  - Check whether specific features match the expected range
  - Report those features' F-statistics
- If the question asks for a summary: report the median/mean/distribution

**Critical**: For gene expression data, default to per-feature ANOVA. Aggregate ANOVA gives F-statistics ~200× larger and is rarely correct.
See `references/bixbench_patterns.md` for 15+ question patterns.

## Statsmodels vs Scikit-learn
| Use Case | Library | Reason |
|---|---|---|
| Inference (p-values, CIs, ORs) | statsmodels | Full statistical output |
| Prediction (accuracy, AUC) | scikit-learn | Better prediction tools |
| Mixed-effects models | statsmodels | Only option |
| Regularization (LASSO, Ridge) | scikit-learn | Better optimization |
| Survival analysis | lifelines | Specialized library |
General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).
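The inference-vs-prediction split in the table can be seen on the same data. A sketch (synthetic data; the large `C` to approximate an unpenalized fit is an assumption, since sklearn's `LogisticRegression` is L2-penalized by default):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
df_cmp = pd.DataFrame({'y': y, 'x': x})

# statsmodels: full inference (SEs, p-values, CIs)
sm_fit = smf.logit('y ~ x', data=df_cmp).fit(disp=0)
print(f"statsmodels beta={sm_fit.params['x']:.4f}, p={sm_fit.pvalues['x']:.4g}")

# scikit-learn: prediction-oriented; no p-values in the API
sk_fit = LogisticRegression(C=1e6).fit(df_cmp[['x']], df_cmp['y'])  # large C ≈ no penalty
print(f"sklearn beta={sk_fit.coef_[0][0]:.4f}")
```

With the penalty effectively disabled, the two coefficients agree closely; only statsmodels reports standard errors and p-values.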
## Python Package Requirements
```
statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0
```

## File Structure
```
tooluniverse-statistical-modeling/
├── SKILL.md                          # This file
├── QUICK_START.md                    # 8 quick examples
├── EXAMPLES.md                       # Legacy examples (kept for reference)
├── TOOLS_REFERENCE.md                # ToolUniverse tool catalog
├── test_skill.py                     # Comprehensive test suite
├── references/
│   ├── logistic_regression.md        # Detailed logistic examples
│   ├── ordinal_logistic.md           # Ordinal logit guide
│   ├── cox_regression.md             # Survival analysis guide
│   ├── linear_models.md              # OLS and mixed-effects
│   ├── bixbench_patterns.md          # 15+ question patterns
│   └── troubleshooting.md            # Diagnostic issues
└── scripts/
    ├── format_statistical_output.py  # Format results for reporting
    └── model_diagnostics.py          # Automated diagnostics
```

## Key Principles
- Data-first approach - Always inspect and validate data before modeling
- Model selection by outcome type - Use decision tree above
- Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
- Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
- Confounder awareness - Adjust for confounders when specified or clinically relevant
- Reproducible analysis - All code must be deterministic and reproducible
- Robust error handling - Graceful handling of convergence failures, separation, collinearity
- Round correctly - Match the precision requested (typically 2-4 decimal places)
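The robust-error-handling principle can be sketched as a guard around fitting (a hedged pattern, not the skill's actual implementation; `PerfectSeparationError` is what statsmodels raises on complete separation, and the toy data is chosen so the fit converges cleanly):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.sm_exceptions import PerfectSeparationError

df_fit = pd.DataFrame({
    'y': [0, 0, 0, 0, 1, 1, 1, 1],
    'x': [1.0, 2.0, 1.5, 2.5, 1.8, 2.2, 1.2, 2.8],  # overlapping groups: fits cleanly
})

try:
    fit = smf.logit('y ~ x', data=df_fit).fit(disp=0, maxiter=100)
    if not fit.mle_retvals.get('converged', True):
        print("Warning: optimizer did not converge; rescale predictors or simplify the model")
    else:
        print(f"OR={np.exp(fit.params['x']):.4f}")
except PerfectSeparationError:
    print("Perfect separation detected; consider Firth logistic regression or exact methods")
```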
## Completeness Checklist
Before finalizing any statistical analysis:
- Outcome variable identified: Verified which column is the actual outcome (not assumed)
- Data validated: N, missing values, variable types confirmed
- Multi-feature data identified: If data has multiple features (genes, miRNAs), use per-feature approach
- Model appropriate: Outcome type matches model family
- Assumptions checked: Relevant diagnostics performed
- Effect sizes reported: OR/HR/Cohen's d with CIs
- P-values reported: With appropriate correction if needed
- Model fit assessed: R-squared, AIC/BIC, concordance
- Results interpreted: Plain-language interpretation
- Precision correct: Numbers rounded appropriately
- Confounders addressed: Adjusted analyses if applicable
## References
- statsmodels: https://www.statsmodels.org/
- lifelines: https://lifelines.readthedocs.io/
- scikit-learn: https://scikit-learn.org/
- Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel
## ToolUniverse Integration
While this skill is primarily computational, ToolUniverse tools can provide data:
| Use Case | Tools |
|---|---|
| Clinical trial data | |
| Drug safety outcomes | |
| Gene-disease associations | |
| Biomarker data | |
See `TOOLS_REFERENCE.md` for the complete tool catalog.

## Support
For detailed examples and troubleshooting:
- Logistic regression: `references/logistic_regression.md`
- Ordinal models: `references/ordinal_logistic.md`
- Survival analysis: `references/cox_regression.md`
- Linear/mixed models: `references/linear_models.md`
- BixBench patterns: `references/bixbench_patterns.md`
- Diagnostics: `references/troubleshooting.md`