loom-model-evaluation
Model Evaluation
Overview
This skill focuses on comprehensive evaluation of machine learning models across the entire ML lifecycle. It covers metric selection, validation strategies, fairness assessment, training debugging, hyperparameter tuning, LLM evaluation, A/B testing, and production monitoring to ensure model quality and reliability.
When to Use This Skill
Use this skill when you need to:
- Select and compute appropriate evaluation metrics for ML models
- Design cross-validation strategies and train/test splits
- Debug training issues (overfitting, underfitting, convergence problems)
- Tune hyperparameters and validate model improvements
- Evaluate LLMs and generative models
- Conduct A/B tests for model comparison in production
- Monitor deployed models for drift and degradation
- Assess model fairness across demographic groups
- Analyze error patterns and residuals
- Create evaluation reports and dashboards
Related Skills and Agents
When to Escalate:
- senior-software-engineer (Opus): For ML system architecture decisions, model selection strategies, complex evaluation pipeline design, distributed evaluation infrastructure, large-scale benchmarking, production monitoring architecture, adversarial robustness evaluation, model poisoning detection, security-aware metrics (use /loom-security-audit and /loom-threat-model skills for security concerns)
Complementary Skills:
- /loom-debugging - Use for systematic debugging of evaluation pipelines and metric computation issues
- /loom-testing - Use for unit testing evaluation code and validation logic
- /loom-data-validation - Use for input data quality checks before model evaluation
Instructions
1. Define Evaluation Criteria
Business Alignment:
- Identify business objectives and success criteria
- Translate business goals to ML metrics
- Define acceptable performance thresholds
- Consider cost of different error types (false positives vs false negatives)
Metric Selection:
- Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
- Regression: MSE, RMSE, MAE, MAPE, R2, explained variance
- Ranking: NDCG, MAP, MRR
- LLMs: perplexity, BLEU, ROUGE, BERTScore, human eval
- Custom metrics for domain-specific requirements
Fairness Requirements:
- Identify protected attributes (race, gender, age)
- Choose fairness definitions (demographic parity, equalized odds)
- Set fairness constraints and thresholds
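To make the cost-of-errors point concrete, asymmetric error costs can be wrapped in a custom metric and used to pick a decision threshold. The sketch below is a minimal illustration; the 5:1 false-negative-to-false-positive cost ratio and the helper names `expected_cost` and `pick_threshold` are assumptions for the example, not library functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, fp_cost=1.0, fn_cost=5.0):
    """Average business cost per prediction for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp * fp_cost + fn * fn_cost) / len(y_true)

def pick_threshold(y_true, y_prob, fp_cost=1.0, fn_cost=5.0):
    """Scan decision thresholds and return the one with the lowest expected cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = [expected_cost(y_true, (y_prob >= t).astype(int), fp_cost, fn_cost)
             for t in thresholds]
    return thresholds[int(np.argmin(costs))]
```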
2. Design Evaluation Strategy
Data Splitting:
- Train/validation/test split ratios (e.g., 60/20/20)
- Stratified splits for class imbalance
- Time-based splits for temporal data
- Group-based splits to prevent data leakage
Cross-Validation:
- K-fold CV for standard problems
- Stratified K-fold for imbalanced classes
- TimeSeriesSplit for temporal data
- GroupKFold for clustered data
- Leave-one-out for small datasets
Handling Imbalance:
- Stratified sampling
- Class weights in metrics
- Resampling strategies (SMOTE, undersampling)
- Appropriate metrics (F1, PR-AUC instead of accuracy)
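For reference, the 60/20/20 stratified split mentioned above can be produced with two chained `train_test_split` calls. A minimal sketch, assuming a feature matrix `X` and a classification target `y`:

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set, then split the remainder 75/25
# to obtain a 60/20/20 train/validation/test split overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)
```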
3. Conduct Evaluation
Performance Metrics:
- Calculate primary and secondary metrics
- Compute confidence intervals
- Compare against baselines
- Statistical significance testing
Error Analysis:
- Confusion matrix analysis
- Per-class performance breakdown
- Error type categorization
- Hard example mining
Fairness Assessment:
- Group-wise metric comparison
- Demographic parity evaluation
- Equalized odds analysis
- Disparate impact measurement
Edge Case Testing:
- Boundary condition validation
- Out-of-distribution detection
- Adversarial robustness
- Stress testing with extreme inputs
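Confidence intervals for the metrics above are commonly estimated with a bootstrap. A minimal percentile-bootstrap sketch (one of several valid approaches; `bootstrap_ci` is an illustrative helper, not a library function):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for an evaluation metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```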
4. Debug Training Issues
Overfitting Detection:
- Train vs validation performance gap
- Learning curves analysis
- Validation metrics plateauing while training improves
- Mitigation: regularization, dropout, early stopping, data augmentation
Underfitting Detection:
- Poor performance on both train and validation
- Learning curves not converging
- Model too simple for problem complexity
- Mitigation: increase model capacity, feature engineering, longer training
Convergence Problems:
- Loss not decreasing
- Loss oscillating or unstable
- Exploding gradients (loss becomes NaN)
- Vanishing gradients (loss stays constant)
- Mitigation: learning rate tuning, gradient clipping, batch normalization
Learning Rate Issues:
- Too high: training unstable, loss oscillates
- Too low: training too slow, stuck in local minima
- Solution: learning rate schedules, warmup, cosine annealing
Batch Size Effects:
- Small batch: noisy gradients, slower and less stable convergence
- Large batch: memory issues, sharp minima
- Find sweet spot through experimentation
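A minimal sketch of gradient clipping with gradient-norm logging, assuming the training loop is written in PyTorch (the same ideas apply in other frameworks); `training_step` is an illustrative helper, not part of any library:

```python
import torch

def training_step(model, optimizer, batch, loss_fn, max_norm=1.0):
    """One optimization step with gradient clipping and gradient-norm logging."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping; log it to detect
    # exploding (very large) or vanishing (near-zero) gradients early.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), grad_norm.item()
```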
5. Hyperparameter Tuning
Search Strategies:
- Grid search: exhaustive but expensive
- Random search: better coverage for high-dimensional spaces
- Bayesian optimization: sample efficient
- Hyperband: adaptive resource allocation
Key Hyperparameters:
- Learning rate (most critical)
- Batch size
- Regularization strength (L1, L2, dropout)
- Network architecture (layers, units)
- Optimizer choice (Adam, SGD, AdamW)
Validation:
- Use validation set for hyperparameter selection
- Never tune on test set
- Consider nested cross-validation for small datasets
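As a concrete starting point, random search with scikit-learn might look like the sketch below. The estimator and parameter grid are illustrative assumptions, and `X_train`/`y_train` are assumed to come from the split designed in step 2:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 8, 16, 32],
    'min_samples_leaf': [1, 2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,                 # sample 20 configurations
    scoring='f1_weighted',
    cv=5,                      # tunes on validation folds, never the test set
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```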
6. LLM and Generative Model Evaluation
Automatic Metrics:
- Perplexity for language models
- BLEU, ROUGE for text generation
- BERTScore for semantic similarity
- Exact match, F1 for QA tasks
Human Evaluation:
- Fluency, coherence, relevance
- Factual accuracy
- Safety and toxicity
- Instruction following
Prompt Engineering Evaluation:
- Few-shot vs zero-shot comparison
- Prompt template A/B testing
- Chain-of-thought effectiveness
- System message impact
LLM-as-Judge:
- Use stronger models to evaluate weaker models
- Pairwise comparison for ranking
- Rubric-based scoring
- Calibration against human judgments
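For the automatic metrics above, the Hugging Face `evaluate` library (listed under Additional Resources) is one convenient option. A minimal sketch, assuming the `evaluate` package and its ROUGE/BERTScore backends are installed:

```python
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1
```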
7. A/B Testing for Model Comparison
Experimental Design:
- Random traffic split (50/50 or 90/10)
- Minimum sample size calculation
- Statistical power analysis
- Duration planning for seasonality
Metrics:
- Primary business metric (conversion, revenue)
- Secondary metrics (latency, user satisfaction)
- Guardrail metrics (error rate, bias)
- Sample ratio mismatch checks
Analysis:
- Statistical significance testing (t-test, Mann-Whitney)
- Effect size estimation
- Confidence intervals
- Multiple testing correction (Bonferroni)
Decision Criteria:
- Primary metric improvement threshold
- No degradation in guardrail metrics
- Sufficient statistical power
- Business case validation
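The minimum-sample-size item above can be estimated with the standard two-proportion approximation. A minimal sketch, assuming a conversion-rate metric and equally sized groups; `required_sample_size` is an illustrative helper:

```python
import numpy as np
from scipy.stats import norm

def required_sample_size(p_baseline, min_lift, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect an absolute lift in a proportion."""
    p1, p2 = p_baseline, p_baseline + min_lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(np.ceil(n))

# e.g. detect a 1-point absolute lift from a 10% baseline conversion rate
print(required_sample_size(0.10, 0.01))
```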
8. Production Model Monitoring
Performance Monitoring:
- Track key metrics over time
- Compare against baseline/champion model
- Detect performance degradation
- Alert on threshold violations
Data Drift Detection:
- Input distribution shifts
- Feature statistics tracking
- KL divergence, KS test, PSI
- Covariate shift detection
Concept Drift Detection:
- Model prediction distribution changes
- Label distribution shifts (when available)
- Performance metric trends
- Adversarial validation
Monitoring Infrastructure:
- Real-time metric computation
- Dashboards for visualization
- Alerting and on-call rotation
- Automated retraining triggers
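PSI, listed above alongside the KL and KS checks, is easy to compute from binned feature values. A minimal sketch (`population_stability_index` is an illustrative helper; the 0.1/0.25 thresholds in the comment are common rules of thumb, not universal constants):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a current sample of one numeric feature."""
    # Bin edges come from the baseline (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```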
9. Report and Document
Evaluation Report Structure:
- Executive summary with key findings
- Methodology and experimental setup
- Comprehensive metric tables
- Error analysis and case studies
- Fairness assessment results
- Recommendations and next steps
Visualization:
- ROC and PR curves
- Confusion matrices
- Learning curves
- Residual plots
- Fairness comparison charts
Version Control:
- Model version and checkpoints
- Dataset versions and splits
- Hyperparameter configurations
- Evaluation code and environment
Best Practices
General Principles
- Match Metrics to Goals: Choose metrics aligned with business objectives, not just academic standards
- Use Multiple Metrics: No single metric tells the whole story; use complementary metrics
- Proper Validation: Use appropriate cross-validation schemes to avoid overfitting to validation set
- Test Distribution Shift: Evaluate on out-of-distribution data to assess generalization
- Check for Bias: Assess fairness across demographic groups before deployment
- Version Everything: Track models, data, metrics, and code for reproducibility
- Monitor Production: Continuously track model performance after deployment
Training and Debugging
- Start Simple: Begin with simple baselines before complex models
- Visualize Learning: Plot learning curves early and often
- Debug Incrementally: Change one thing at a time when debugging training issues
- Sanity Check: Overfit a small batch first to verify the model can learn
- Early Stopping: Use validation-based early stopping to prevent overfitting
- Gradient Monitoring: Track gradient norms to detect vanishing/exploding gradients
Evaluation Rigor
- Hold-out Test Set: Never touch test set until final evaluation
- Stratified Splits: Use stratification for imbalanced datasets
- Statistical Testing: Use significance tests for model comparisons
- Error Analysis: Dive deep into errors to understand failure modes
- Temporal Validation: For time-series, validate on future data only
Production and Monitoring
- Shadow Mode: Deploy new models in shadow mode before switching traffic
- Gradual Rollout: Use canary deployments or gradual percentage rollouts
- Rollback Plan: Have automated rollback triggers for performance degradation
- Alert Fatigue: Set meaningful alert thresholds to avoid noise
Examples
Example 1: Classification Model Evaluation
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
    classification_report, roc_curve, precision_recall_curve
)


class ClassificationEvaluator:
    """Comprehensive classification model evaluator."""

    def __init__(self, y_true, y_pred, y_prob=None, class_names=None):
        self.y_true = y_true
        self.y_pred = y_pred
        self.y_prob = y_prob
        self.class_names = class_names or ['Negative', 'Positive']

    def compute_metrics(self) -> dict:
        """Compute all classification metrics."""
        metrics = {
            'accuracy': accuracy_score(self.y_true, self.y_pred),
            'precision': precision_score(self.y_true, self.y_pred, average='weighted'),
            'recall': recall_score(self.y_true, self.y_pred, average='weighted'),
            'f1': f1_score(self.y_true, self.y_pred, average='weighted'),
        }
        if self.y_prob is not None:
            metrics['roc_auc'] = roc_auc_score(self.y_true, self.y_prob)
            metrics['average_precision'] = average_precision_score(self.y_true, self.y_prob)
        return metrics

    def confusion_matrix_analysis(self) -> dict:
        """Analyze the binary confusion matrix in detail."""
        cm = confusion_matrix(self.y_true, self.y_pred)
        tn, fp, fn, tp = cm.ravel()
        return {
            'confusion_matrix': cm,
            'true_negatives': tn,
            'false_positives': fp,
            'false_negatives': fn,
            'true_positives': tp,
            'specificity': tn / (tn + fp),
            'sensitivity': tp / (tp + fn),
            'false_positive_rate': fp / (fp + tn),
            'false_negative_rate': fn / (fn + tp),
        }

    def plot_roc_curve(self, save_path=None):
        """Plot ROC curve with AUC."""
        if self.y_prob is None:
            raise ValueError("Probabilities required for ROC curve")
        fpr, tpr, thresholds = roc_curve(self.y_true, self.y_prob)
        auc = roc_auc_score(self.y_true, self.y_prob)
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})')
        plt.plot([0, 1], [0, 1], 'k--', label='Random')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic (ROC) Curve')
        plt.legend()
        plt.grid(True, alpha=0.3)
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()

    def generate_report(self) -> str:
        """Generate comprehensive evaluation report."""
        metrics = self.compute_metrics()
        cm_analysis = self.confusion_matrix_analysis()
        roc_auc = f"{metrics['roc_auc']:.4f}" if 'roc_auc' in metrics else 'N/A'
        report = f"""Classification Model Evaluation Report

Overall Metrics
| Metric | Value |
|---|---|
| Accuracy | {metrics['accuracy']:.4f} |
| Precision | {metrics['precision']:.4f} |
| Recall | {metrics['recall']:.4f} |
| F1 Score | {metrics['f1']:.4f} |
| ROC AUC | {roc_auc} |

Confusion Matrix Analysis
| Metric | Value |
|---|---|
| True Positives | {cm_analysis['true_positives']} |
| True Negatives | {cm_analysis['true_negatives']} |
| False Positives | {cm_analysis['false_positives']} |
| False Negatives | {cm_analysis['false_negatives']} |
| Sensitivity | {cm_analysis['sensitivity']:.4f} |
| Specificity | {cm_analysis['specificity']:.4f} |

Detailed Classification Report
{classification_report(self.y_true, self.y_pred, target_names=self.class_names)}
"""
        return report
```

Usage

```python
evaluator = ClassificationEvaluator(y_true, y_pred, y_prob)
print(evaluator.generate_report())
evaluator.plot_roc_curve('roc_curve.png')
```
Example 2: Regression Model Evaluation
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, explained_variance_score
)


class RegressionEvaluator:
    """Comprehensive regression model evaluator."""

    def __init__(self, y_true, y_pred):
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.residuals = self.y_true - self.y_pred

    def compute_metrics(self) -> dict:
        """Compute all regression metrics."""
        mse = mean_squared_error(self.y_true, self.y_pred)
        return {
            'mse': mse,
            'rmse': np.sqrt(mse),
            'mae': mean_absolute_error(self.y_true, self.y_pred),
            'mape': mean_absolute_percentage_error(self.y_true, self.y_pred) * 100,
            'r2': r2_score(self.y_true, self.y_pred),
            'explained_variance': explained_variance_score(self.y_true, self.y_pred),
        }

    def residual_analysis(self) -> dict:
        """Analyze residual patterns."""
        return {
            'mean_residual': np.mean(self.residuals),
            'std_residual': np.std(self.residuals),
            'max_overestimate': np.min(self.residuals),
            'max_underestimate': np.max(self.residuals),
            'residual_skewness': self._skewness(self.residuals),
        }

    def _skewness(self, data):
        """Calculate sample skewness."""
        n = len(data)
        mean = np.mean(data)
        std = np.std(data)
        return (n / ((n - 1) * (n - 2))) * np.sum(((data - mean) / std) ** 3)

    def plot_diagnostics(self, save_path=None):
        """Plot diagnostic plots for residual analysis."""
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Actual vs Predicted
        ax1 = axes[0, 0]
        ax1.scatter(self.y_true, self.y_pred, alpha=0.5)
        ax1.plot([self.y_true.min(), self.y_true.max()],
                 [self.y_true.min(), self.y_true.max()], 'r--')
        ax1.set_xlabel('Actual')
        ax1.set_ylabel('Predicted')
        ax1.set_title('Actual vs Predicted')

        # Residuals vs Predicted
        ax2 = axes[0, 1]
        ax2.scatter(self.y_pred, self.residuals, alpha=0.5)
        ax2.axhline(y=0, color='r', linestyle='--')
        ax2.set_xlabel('Predicted')
        ax2.set_ylabel('Residuals')
        ax2.set_title('Residuals vs Predicted')

        # Residual histogram
        ax3 = axes[1, 0]
        ax3.hist(self.residuals, bins=30, edgecolor='black')
        ax3.set_xlabel('Residual')
        ax3.set_ylabel('Frequency')
        ax3.set_title('Residual Distribution')

        # Q-Q plot
        ax4 = axes[1, 1]
        stats.probplot(self.residuals, dist="norm", plot=ax4)
        ax4.set_title('Q-Q Plot')

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()
```

Example 3: Cross-Validation Strategies
```python
import numpy as np
from sklearn.model_selection import (
    StratifiedKFold, TimeSeriesSplit, GroupKFold, cross_validate
)


def evaluate_with_cv(model, X, y, cv_strategy='stratified', n_splits=5, groups=None):
    """
    Evaluate model with appropriate cross-validation strategy.

    Args:
        model: Sklearn-compatible model
        X: Features
        y: Target
        cv_strategy: 'stratified', 'timeseries', 'group', or 'kfold'
        n_splits: Number of CV folds
        groups: Group labels for GroupKFold

    Returns:
        Dictionary with CV results
    """
    # Select CV strategy
    if cv_strategy == 'stratified':
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    elif cv_strategy == 'timeseries':
        cv = TimeSeriesSplit(n_splits=n_splits)
    elif cv_strategy == 'group':
        cv = GroupKFold(n_splits=n_splits)
    else:
        cv = n_splits

    # Define scoring metrics
    scoring = {
        'accuracy': 'accuracy',
        'precision': 'precision_weighted',
        'recall': 'recall_weighted',
        'f1': 'f1_weighted',
        'roc_auc': 'roc_auc'
    }

    # Perform cross-validation
    cv_results = cross_validate(
        model, X, y,
        cv=cv,
        scoring=scoring,
        groups=groups,
        return_train_score=True,
        n_jobs=-1
    )

    # Summarize results
    summary = {}
    for metric in scoring.keys():
        test_scores = cv_results[f'test_{metric}']
        train_scores = cv_results[f'train_{metric}']
        summary[metric] = {
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores),
            'train_mean': np.mean(train_scores),
            'train_std': np.std(train_scores),
            'overfit_gap': np.mean(train_scores) - np.mean(test_scores)
        }
    return summary
```

Usage example

```python
results = evaluate_with_cv(model, X, y, cv_strategy='stratified', n_splits=5)
for metric, values in results.items():
    print(f"{metric}: {values['test_mean']:.4f} (+/- {values['test_std']:.4f})")
    print(f"  Overfitting gap: {values['overfit_gap']:.4f}")
```
Example 4: Fairness Evaluation
```python
import numpy as np


def evaluate_fairness(y_true, y_pred, sensitive_attr, favorable_label=1):
    """
    Evaluate model fairness across demographic groups.

    Args:
        y_true: True labels
        y_pred: Predicted labels
        sensitive_attr: Protected attribute values
        favorable_label: The favorable outcome label

    Returns:
        Dictionary with fairness metrics
    """
    groups = np.unique(sensitive_attr)
    results = {'group_metrics': {}}

    for group in groups:
        mask = sensitive_attr == group
        group_true = y_true[mask]
        group_pred = y_pred[mask]

        # Calculate group-specific metrics
        tp = np.sum((group_true == favorable_label) & (group_pred == favorable_label))
        fp = np.sum((group_true != favorable_label) & (group_pred == favorable_label))
        fn = np.sum((group_true == favorable_label) & (group_pred != favorable_label))
        tn = np.sum((group_true != favorable_label) & (group_pred != favorable_label))

        results['group_metrics'][group] = {
            'selection_rate': np.mean(group_pred == favorable_label),
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
            'accuracy': np.mean(group_true == group_pred),
            'size': len(group_true)
        }

    # Calculate fairness metrics
    selection_rates = [m['selection_rate'] for m in results['group_metrics'].values()]
    tprs = [m['tpr'] for m in results['group_metrics'].values()]
    fprs = [m['fpr'] for m in results['group_metrics'].values()]

    results['fairness_metrics'] = {
        'demographic_parity_diff': max(selection_rates) - min(selection_rates),
        'equalized_odds_tpr_diff': max(tprs) - min(tprs),
        'equalized_odds_fpr_diff': max(fprs) - min(fprs),
    }
    return results
```

Example 5: Training Debugging with Learning Curves
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


def plot_learning_curves(model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)):
    """
    Plot learning curves to diagnose overfitting/underfitting.

    Args:
        model: Sklearn-compatible model
        X: Features
        y: Target
        cv: Cross-validation folds
        train_sizes: Array of training set size fractions
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=cv,
        train_sizes=train_sizes,
        scoring='accuracy',
        n_jobs=-1
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score', color='blue', marker='o')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, val_mean, label='Validation score', color='red', marker='o')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15, color='red')
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title('Learning Curves')
    plt.legend(loc='lower right')
    plt.grid(True, alpha=0.3)

    # Add diagnostic annotations
    final_gap = train_mean[-1] - val_mean[-1]
    if final_gap > 0.1:
        plt.text(0.5, 0.05, 'HIGH OVERFITTING: Large gap between train and validation',
                 transform=plt.gca().transAxes, color='red', fontweight='bold')
    elif val_mean[-1] < 0.7:
        plt.text(0.5, 0.05, 'UNDERFITTING: Both train and validation scores are low',
                 transform=plt.gca().transAxes, color='orange', fontweight='bold')

    plt.tight_layout()
    plt.show()

    return {
        'final_train_score': train_mean[-1],
        'final_val_score': val_mean[-1],
        'overfit_gap': final_gap
    }
```

Example 6: LLM Evaluation
```python
import json
from typing import Dict, List

import numpy as np
import openai


def evaluate_llm_generation(
    prompts: List[str],
    references: List[str],
    model: str,
    judge_model: str = "gpt-4"
) -> Dict:
    """
    Evaluate LLM generation quality using LLM-as-judge.

    Args:
        prompts: Input prompts
        references: Reference outputs (if available)
        model: Model to evaluate
        judge_model: Model to use as judge

    Returns:
        Dictionary with evaluation scores
    """
    results = []
    for prompt, reference in zip(prompts, references):
        # Generate response
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        generation = response.choices[0].message.content

        # LLM-as-judge evaluation
        judge_prompt = f"""Evaluate the following AI-generated response on a scale of 1-5 for:
1. Accuracy: Is the information correct?
2. Relevance: Does it address the prompt?
3. Fluency: Is it well-written and coherent?
4. Helpfulness: Is it useful to the user?

Prompt: {prompt}
Reference (if available): {reference}
Response: {generation}

Provide scores in JSON format: {{"accuracy": X, "relevance": X, "fluency": X, "helpfulness": X, "overall": X}}
"""
        judge_response = openai.ChatCompletion.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}]
        )
        scores = json.loads(judge_response.choices[0].message.content)
        results.append({
            'prompt': prompt,
            'generation': generation,
            'scores': scores
        })

    # Aggregate scores
    avg_scores = {}
    for key in ['accuracy', 'relevance', 'fluency', 'helpfulness', 'overall']:
        avg_scores[key] = np.mean([r['scores'][key] for r in results])

    return {
        'individual_results': results,
        'average_scores': avg_scores
    }
```

Example 7: A/B Test Analysis
```python
import numpy as np
from scipy import stats


def analyze_ab_test(control_metric: np.ndarray, treatment_metric: np.ndarray, alpha: float = 0.05):
    """
    Analyze A/B test results with statistical significance testing.

    Args:
        control_metric: Metric values for control group
        treatment_metric: Metric values for treatment group
        alpha: Significance level

    Returns:
        Dictionary with test results
    """
    # Descriptive statistics
    control_mean = np.mean(control_metric)
    treatment_mean = np.mean(treatment_metric)
    relative_lift = (treatment_mean - control_mean) / control_mean * 100

    # Statistical test
    t_stat, p_value = stats.ttest_ind(treatment_metric, control_metric)
    is_significant = p_value < alpha

    # Effect size (Cohen's d)
    pooled_std = np.sqrt((np.std(control_metric)**2 + np.std(treatment_metric)**2) / 2)
    cohens_d = (treatment_mean - control_mean) / pooled_std

    # Confidence interval for the difference in means
    ci = stats.t.interval(
        confidence=1 - alpha,
        df=len(control_metric) + len(treatment_metric) - 2,
        loc=treatment_mean - control_mean,
        scale=stats.sem(np.concatenate([control_metric, treatment_metric]))
    )

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_lift_pct': relative_lift,
        'p_value': p_value,
        'is_significant': is_significant,
        'cohens_d': cohens_d,
        'confidence_interval': ci,
        'recommendation': 'LAUNCH' if is_significant and relative_lift > 0 else 'DO NOT LAUNCH'
    }
```

Usage

```python
results = analyze_ab_test(control_conversions, treatment_conversions)
print(f"Relative Lift: {results['relative_lift_pct']:.2f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Recommendation: {results['recommendation']}")
```
Example 8: Production Model Monitoring
```python
from typing import Dict

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


class ModelMonitor:
    """Monitor deployed model for drift and degradation."""

    def __init__(self, baseline_data: pd.DataFrame, baseline_predictions: np.ndarray):
        self.baseline_data = baseline_data
        self.baseline_predictions = baseline_predictions

    def detect_data_drift(self, current_data: pd.DataFrame, threshold: float = 0.05) -> Dict:
        """Detect feature distribution drift using the KS test."""
        drift_results = {}
        for col in self.baseline_data.columns:
            if pd.api.types.is_numeric_dtype(self.baseline_data[col]):
                statistic, p_value = ks_2samp(
                    self.baseline_data[col].dropna(),
                    current_data[col].dropna()
                )
                drift_results[col] = {
                    'ks_statistic': statistic,
                    'p_value': p_value,
                    'drift_detected': p_value < threshold
                }
        return drift_results

    def detect_prediction_drift(self, current_predictions: np.ndarray, threshold: float = 0.05) -> Dict:
        """Detect prediction distribution drift."""
        statistic, p_value = ks_2samp(self.baseline_predictions, current_predictions)
        return {
            'ks_statistic': statistic,
            'p_value': p_value,
            'drift_detected': p_value < threshold,
            'baseline_mean': np.mean(self.baseline_predictions),
            'current_mean': np.mean(current_predictions),
            'mean_shift': np.mean(current_predictions) - np.mean(self.baseline_predictions)
        }

    def performance_degradation_check(
        self,
        current_metric: float,
        baseline_metric: float,
        threshold_pct: float = 5.0
    ) -> Dict:
        """Check for performance degradation."""
        degradation_pct = (baseline_metric - current_metric) / baseline_metric * 100
        return {
            'baseline_metric': baseline_metric,
            'current_metric': current_metric,
            'degradation_pct': degradation_pct,
            'alert': degradation_pct > threshold_pct,
            'recommendation': 'RETRAIN MODEL' if degradation_pct > threshold_pct else 'OK'
        }
```

Usage

```python
monitor = ModelMonitor(baseline_df, baseline_preds)
drift_check = monitor.detect_data_drift(current_df)
pred_drift = monitor.detect_prediction_drift(current_preds)
perf_check = monitor.performance_degradation_check(current_accuracy, baseline_accuracy)
```
Common Pitfalls
- Test Set Contamination: Never use test set for hyperparameter tuning or model selection
- Data Leakage: Ensure validation/test data doesn't leak into training (temporal ordering, group splits)
- Wrong Metric Choice: Using accuracy for imbalanced datasets, not considering business costs
- Ignoring Confidence Intervals: Point estimates without uncertainty can be misleading
- Multiple Comparisons: Not correcting p-values when testing many hypotheses
- Survivorship Bias: Evaluating only on successful cases, ignoring failures
- Overfitting to Validation: Repeatedly tuning on validation set effectively makes it a second training set
- Ignoring Fairness: Deploying models without fairness evaluation can cause harm
- No Baseline: Not comparing against simple baselines (random, majority class, linear model)
- Production-Training Skew: Evaluation setup doesn't match production environment
Additional Resources
- Metrics: Scikit-learn metrics documentation, Hugging Face evaluate library
- Fairness: AI Fairness 360, Fairlearn
- LLM Evaluation: HELM, lm-evaluation-harness, BIG-bench
- A/B Testing: Evan Miller's A/B testing tools, experimentation platform docs
- Monitoring: Evidently AI, WhyLabs, Fiddler