loom-model-evaluation
Model Evaluation
Overview
This skill focuses on comprehensive evaluation of machine learning models across the entire ML lifecycle. It covers metric selection, validation strategies, fairness assessment, training debugging, hyperparameter tuning, LLM evaluation, A/B testing, and production monitoring to ensure model quality and reliability.
When to Use This Skill
Use this skill when you need to:
- Select and compute appropriate evaluation metrics for ML models
- Design cross-validation strategies and train/test splits
- Debug training issues (overfitting, underfitting, convergence problems)
- Tune hyperparameters and validate model improvements
- Evaluate LLMs and generative models
- Conduct A/B tests for model comparison in production
- Monitor deployed models for drift and degradation
- Assess model fairness across demographic groups
- Analyze error patterns and residuals
- Create evaluation reports and dashboards
Related Skills and Agents
When to Escalate:
- senior-software-engineer (Opus): For ML system architecture decisions, model selection strategies, complex evaluation pipeline design, distributed evaluation infrastructure, large-scale benchmarking, production monitoring architecture, adversarial robustness evaluation, model poisoning detection, security-aware metrics (use /loom-security-audit and /loom-threat-model skills for security concerns)
Complementary Skills:
- /loom-debugging - Use for systematic debugging of evaluation pipelines and metric computation issues
- /loom-testing - Use for unit testing evaluation code and validation logic
- /loom-data-validation - Use for input data quality checks before model evaluation
Instructions
1. Define Evaluation Criteria
Business Alignment:
- Identify business objectives and success criteria
- Translate business goals to ML metrics
- Define acceptable performance thresholds
- Consider cost of different error types (false positives vs false negatives)
Metric Selection:
- Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
- Regression: MSE, RMSE, MAE, MAPE, R2, explained variance
- Ranking: NDCG, MAP, MRR
- LLMs: perplexity, BLEU, ROUGE, BERTScore, human eval
- Custom metrics for domain-specific requirements
Fairness Requirements:
- Identify protected attributes (race, gender, age)
- Choose fairness definitions (demographic parity, equalized odds)
- Set fairness constraints and thresholds
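To make the cost-of-errors point concrete, asymmetric error costs can be wrapped in a custom metric and used to pick a decision threshold. The sketch below is a minimal illustration; the 5:1 false-negative-to-false-positive cost ratio and the helper names `expected_cost` and `pick_threshold` are assumptions for the example, not library functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, fp_cost=1.0, fn_cost=5.0):
    """Average business cost per prediction for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp * fp_cost + fn * fn_cost) / len(y_true)

def pick_threshold(y_true, y_prob, fp_cost=1.0, fn_cost=5.0):
    """Scan decision thresholds and return the one with the lowest expected cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = [expected_cost(y_true, (y_prob >= t).astype(int), fp_cost, fn_cost)
             for t in thresholds]
    return thresholds[int(np.argmin(costs))]
```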
2. Design Evaluation Strategy
Data Splitting:
- Train/validation/test split ratios (e.g., 60/20/20)
- Stratified splits for class imbalance
- Time-based splits for temporal data
- Group-based splits to prevent data leakage
Cross-Validation:
- K-fold CV for standard problems
- Stratified K-fold for imbalanced classes
- TimeSeriesSplit for temporal data
- GroupKFold for clustered data
- Leave-one-out for small datasets
Handling Imbalance:
- Stratified sampling
- Class weights in metrics
- Resampling strategies (SMOTE, undersampling)
- Appropriate metrics (F1, PR-AUC instead of accuracy)
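For reference, the 60/20/20 stratified split mentioned above can be produced with two chained `train_test_split` calls. A minimal sketch, assuming a feature matrix `X` and a classification target `y`:

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set, then split the remainder 75/25
# to obtain a 60/20/20 train/validation/test split overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)
```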
3. Conduct Evaluation
Performance Metrics:
- Calculate primary and secondary metrics
- Compute confidence intervals
- Compare against baselines
- Statistical significance testing
Error Analysis:
- Confusion matrix analysis
- Per-class performance breakdown
- Error type categorization
- Hard example mining
Fairness Assessment:
- Group-wise metric comparison
- Demographic parity evaluation
- Equalized odds analysis
- Disparate impact measurement
Edge Case Testing:
- Boundary condition validation
- Out-of-distribution detection
- Adversarial robustness
- Stress testing with extreme inputs
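Confidence intervals for the metrics above are commonly estimated with a bootstrap. A minimal percentile-bootstrap sketch (one of several valid approaches; `bootstrap_ci` is an illustrative helper, not a library function):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, metric=f1_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for an evaluation metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```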
4. Debug Training Issues
Overfitting Detection:
- Train vs validation performance gap
- Learning curves analysis
- Validation metrics plateauing while training improves
- Mitigation: regularization, dropout, early stopping, data augmentation
Underfitting Detection:
- Poor performance on both train and validation
- Learning curves not converging
- Model too simple for problem complexity
- Mitigation: increase model capacity, feature engineering, longer training
Convergence Problems:
- Loss not decreasing
- Loss oscillating or unstable
- Exploding gradients (loss becomes NaN)
- Vanishing gradients (loss stays constant)
- Mitigation: learning rate tuning, gradient clipping, batch normalization
Learning Rate Issues:
- Too high: training unstable, loss oscillates
- Too low: training too slow, stuck in local minima
- Solution: learning rate schedules, warmup, cosine annealing
Batch Size Effects:
- Small batch: noisy gradients, slower and less stable convergence
- Large batch: memory issues, sharp minima
- Find sweet spot through experimentation
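A minimal sketch of gradient clipping with gradient-norm logging, assuming the training loop is written in PyTorch (the same ideas apply in other frameworks); `training_step` is an illustrative helper, not part of any library:

```python
import torch

def training_step(model, optimizer, batch, loss_fn, max_norm=1.0):
    """One optimization step with gradient clipping and gradient-norm logging."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping; log it to detect
    # exploding (very large) or vanishing (near-zero) gradients early.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), grad_norm.item()
```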
5. Hyperparameter Tuning
Search Strategies:
- Grid search: exhaustive but expensive
- Random search: better coverage for high-dimensional spaces
- Bayesian optimization: sample efficient
- Hyperband: adaptive resource allocation
Key Hyperparameters:
- Learning rate (most critical)
- Batch size
- Regularization strength (L1, L2, dropout)
- Network architecture (layers, units)
- Optimizer choice (Adam, SGD, AdamW)
Validation:
- Use validation set for hyperparameter selection
- Never tune on test set
- Consider nested cross-validation for small datasets
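As a concrete starting point, random search with scikit-learn might look like the sketch below. The estimator and parameter grid are illustrative assumptions, and `X_train`/`y_train` are assumed to come from the split designed in step 2:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 8, 16, 32],
    'min_samples_leaf': [1, 2, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,                 # sample 20 configurations
    scoring='f1_weighted',
    cv=5,                      # tunes on validation folds, never the test set
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```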
6. LLM and Generative Model Evaluation
Automatic Metrics:
- Perplexity for language models
- BLEU, ROUGE for text generation
- BERTScore for semantic similarity
- Exact match, F1 for QA tasks
Human Evaluation:
- Fluency, coherence, relevance
- Factual accuracy
- Safety and toxicity
- Instruction following
Prompt Engineering Evaluation:
- Few-shot vs zero-shot comparison
- Prompt template A/B testing
- Chain-of-thought effectiveness
- System message impact
LLM-as-Judge:
- Use stronger models to evaluate weaker models
- Pairwise comparison for ranking
- Rubric-based scoring
- Calibration against human judgments
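For the automatic metrics above, the Hugging Face `evaluate` library (listed under Additional Resources) is one convenient option. A minimal sketch, assuming the `evaluate` package and its ROUGE/BERTScore backends are installed:

```python
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

bertscore = evaluate.load("bertscore")
scores = bertscore.compute(predictions=predictions, references=references, lang="en")
print(sum(scores["f1"]) / len(scores["f1"]))  # mean BERTScore F1
```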
7. A/B Testing for Model Comparison
Experimental Design:
- Random traffic split (50/50 or 90/10)
- Minimum sample size calculation
- Statistical power analysis
- Duration planning for seasonality
Metrics:
- Primary business metric (conversion, revenue)
- Secondary metrics (latency, user satisfaction)
- Guardrail metrics (error rate, bias)
- Sample ratio mismatch checks
Analysis:
- Statistical significance testing (t-test, Mann-Whitney)
- Effect size estimation
- Confidence intervals
- Multiple testing correction (Bonferroni)
Decision Criteria:
- Primary metric improvement threshold
- No degradation in guardrail metrics
- Sufficient statistical power
- Business case validation
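The minimum-sample-size item above can be estimated with the standard two-proportion approximation. A minimal sketch, assuming a conversion-rate metric and equally sized groups; `required_sample_size` is an illustrative helper:

```python
import numpy as np
from scipy.stats import norm

def required_sample_size(p_baseline, min_lift, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect an absolute lift in a proportion."""
    p1, p2 = p_baseline, p_baseline + min_lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(np.ceil(n))

# e.g. detect a 1-point absolute lift from a 10% baseline conversion rate
print(required_sample_size(0.10, 0.01))
```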
8. Production Model Monitoring
Performance Monitoring:
- Track key metrics over time
- Compare against baseline/champion model
- Detect performance degradation
- Alert on threshold violations
Data Drift Detection:
- Input distribution shifts
- Feature statistics tracking
- KL divergence, KS test, PSI
- Covariate shift detection
Concept Drift Detection:
- Model prediction distribution changes
- Label distribution shifts (when available)
- Performance metric trends
- Adversarial validation
Monitoring Infrastructure:
- Real-time metric computation
- Dashboards for visualization
- Alerting and on-call rotation
- Automated retraining triggers
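PSI, listed above alongside the KL and KS checks, is easy to compute from binned feature values. A minimal sketch (`population_stability_index` is an illustrative helper; the 0.1/0.25 thresholds in the comment are common rules of thumb, not universal constants):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a current sample of one numeric feature."""
    # Bin edges come from the baseline (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) in empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```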
9. Report and Document
Evaluation Report Structure:
- Executive summary with key findings
- Methodology and experimental setup
- Comprehensive metric tables
- Error analysis and case studies
- Fairness assessment results
- Recommendations and next steps
Visualization:
- ROC and PR curves
- Confusion matrices
- Learning curves
- Residual plots
- Fairness comparison charts
Version Control:
- Model version and checkpoints
- Dataset versions and splits
- Hyperparameter configurations
- Evaluation code and environment
Best Practices
General Principles
- Match Metrics to Goals: Choose metrics aligned with business objectives, not just academic standards
- Use Multiple Metrics: No single metric tells the whole story; use complementary metrics
- Proper Validation: Use appropriate cross-validation schemes to avoid overfitting to validation set
- Test Distribution Shift: Evaluate on out-of-distribution data to assess generalization
- Check for Bias: Assess fairness across demographic groups before deployment
- Version Everything: Track models, data, metrics, and code for reproducibility
- Monitor Production: Continuously track model performance after deployment
Training and Debugging
- Start Simple: Begin with simple baselines before complex models
- Visualize Learning: Plot learning curves early and often
- Debug Incrementally: Change one thing at a time when debugging training issues
- Sanity Check: Overfit a small batch first to verify the model can learn
- Early Stopping: Use validation-based early stopping to prevent overfitting
- Gradient Monitoring: Track gradient norms to detect vanishing/exploding gradients
Evaluation Rigor
- Hold-out Test Set: Never touch test set until final evaluation
- Stratified Splits: Use stratification for imbalanced datasets
- Statistical Testing: Use significance tests for model comparisons
- Error Analysis: Dive deep into errors to understand failure modes
- Temporal Validation: For time-series, validate on future data only
Production and Monitoring
- Shadow Mode: Deploy new models in shadow mode before switching traffic
- Gradual Rollout: Use canary deployments or gradual percentage rollouts
- Rollback Plan: Have automated rollback triggers for performance degradation
- Alert Fatigue: Set meaningful alert thresholds to avoid noise
Examples
Example 1: Classification Model Evaluation
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
    classification_report, roc_curve, precision_recall_curve
)


class ClassificationEvaluator:
    """Comprehensive classification model evaluator."""

    def __init__(self, y_true, y_pred, y_prob=None, class_names=None):
        self.y_true = y_true
        self.y_pred = y_pred
        self.y_prob = y_prob
        self.class_names = class_names or ['Negative', 'Positive']

    def compute_metrics(self) -> dict:
        """Compute all classification metrics."""
        metrics = {
            'accuracy': accuracy_score(self.y_true, self.y_pred),
            'precision': precision_score(self.y_true, self.y_pred, average='weighted'),
            'recall': recall_score(self.y_true, self.y_pred, average='weighted'),
            'f1': f1_score(self.y_true, self.y_pred, average='weighted'),
        }
        if self.y_prob is not None:
            metrics['roc_auc'] = roc_auc_score(self.y_true, self.y_prob)
            metrics['average_precision'] = average_precision_score(self.y_true, self.y_prob)
        return metrics

    def confusion_matrix_analysis(self) -> dict:
        """Analyze the binary confusion matrix in detail."""
        cm = confusion_matrix(self.y_true, self.y_pred)
        tn, fp, fn, tp = cm.ravel()
        return {
            'confusion_matrix': cm,
            'true_negatives': tn,
            'false_positives': fp,
            'false_negatives': fn,
            'true_positives': tp,
            'specificity': tn / (tn + fp),
            'sensitivity': tp / (tp + fn),
            'false_positive_rate': fp / (fp + tn),
            'false_negative_rate': fn / (fn + tp),
        }

    def plot_roc_curve(self, save_path=None):
        """Plot ROC curve with AUC."""
        if self.y_prob is None:
            raise ValueError("Probabilities required for ROC curve")
        fpr, tpr, thresholds = roc_curve(self.y_true, self.y_prob)
        auc = roc_auc_score(self.y_true, self.y_prob)
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})')
        plt.plot([0, 1], [0, 1], 'k--', label='Random')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic (ROC) Curve')
        plt.legend()
        plt.grid(True, alpha=0.3)
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()

    def generate_report(self) -> str:
        """Generate comprehensive evaluation report."""
        metrics = self.compute_metrics()
        cm_analysis = self.confusion_matrix_analysis()
        roc_auc = f"{metrics['roc_auc']:.4f}" if 'roc_auc' in metrics else 'N/A'
        report = f"""Classification Model Evaluation Report

Overall Metrics
| Metric | Value |
|---|---|
| Accuracy | {metrics['accuracy']:.4f} |
| Precision | {metrics['precision']:.4f} |
| Recall | {metrics['recall']:.4f} |
| F1 Score | {metrics['f1']:.4f} |
| ROC AUC | {roc_auc} |

Confusion Matrix Analysis
| Metric | Value |
|---|---|
| True Positives | {cm_analysis['true_positives']} |
| True Negatives | {cm_analysis['true_negatives']} |
| False Positives | {cm_analysis['false_positives']} |
| False Negatives | {cm_analysis['false_negatives']} |
| Sensitivity | {cm_analysis['sensitivity']:.4f} |
| Specificity | {cm_analysis['specificity']:.4f} |

Detailed Classification Report
{classification_report(self.y_true, self.y_pred, target_names=self.class_names)}
"""
        return report
```

Usage

```python
evaluator = ClassificationEvaluator(y_true, y_pred, y_prob)
print(evaluator.generate_report())
evaluator.plot_roc_curve('roc_curve.png')
```
Example 2: Regression Model Evaluation
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, explained_variance_score
)


class RegressionEvaluator:
    """Comprehensive regression model evaluator."""

    def __init__(self, y_true, y_pred):
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.residuals = self.y_true - self.y_pred

    def compute_metrics(self) -> dict:
        """Compute all regression metrics."""
        mse = mean_squared_error(self.y_true, self.y_pred)
        return {
            'mse': mse,
            'rmse': np.sqrt(mse),
            'mae': mean_absolute_error(self.y_true, self.y_pred),
            'mape': mean_absolute_percentage_error(self.y_true, self.y_pred) * 100,
            'r2': r2_score(self.y_true, self.y_pred),
            'explained_variance': explained_variance_score(self.y_true, self.y_pred),
        }

    def residual_analysis(self) -> dict:
        """Analyze residual patterns."""
        return {
            'mean_residual': np.mean(self.residuals),
            'std_residual': np.std(self.residuals),
            'max_overestimate': np.min(self.residuals),
            'max_underestimate': np.max(self.residuals),
            'residual_skewness': self._skewness(self.residuals),
        }

    def _skewness(self, data):
        """Calculate sample skewness."""
        n = len(data)
        mean = np.mean(data)
        std = np.std(data)
        return (n / ((n - 1) * (n - 2))) * np.sum(((data - mean) / std) ** 3)

    def plot_diagnostics(self, save_path=None):
        """Plot diagnostic plots for residual analysis."""
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Actual vs Predicted
        ax1 = axes[0, 0]
        ax1.scatter(self.y_true, self.y_pred, alpha=0.5)
        ax1.plot([self.y_true.min(), self.y_true.max()],
                 [self.y_true.min(), self.y_true.max()], 'r--')
        ax1.set_xlabel('Actual')
        ax1.set_ylabel('Predicted')
        ax1.set_title('Actual vs Predicted')

        # Residuals vs Predicted
        ax2 = axes[0, 1]
        ax2.scatter(self.y_pred, self.residuals, alpha=0.5)
        ax2.axhline(y=0, color='r', linestyle='--')
        ax2.set_xlabel('Predicted')
        ax2.set_ylabel('Residuals')
        ax2.set_title('Residuals vs Predicted')

        # Residual histogram
        ax3 = axes[1, 0]
        ax3.hist(self.residuals, bins=30, edgecolor='black')
        ax3.set_xlabel('Residual')
        ax3.set_ylabel('Frequency')
        ax3.set_title('Residual Distribution')

        # Q-Q plot
        ax4 = axes[1, 1]
        stats.probplot(self.residuals, dist="norm", plot=ax4)
        ax4.set_title('Q-Q Plot')

        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()
```

Example 3: Cross-Validation Strategies
```python
import numpy as np
from sklearn.model_selection import (
    StratifiedKFold, TimeSeriesSplit, GroupKFold, cross_validate
)


def evaluate_with_cv(model, X, y, cv_strategy='stratified', n_splits=5, groups=None):
    """
    Evaluate model with appropriate cross-validation strategy.

    Args:
        model: Sklearn-compatible model
        X: Features
        y: Target
        cv_strategy: 'stratified', 'timeseries', 'group', or 'kfold'
        n_splits: Number of CV folds
        groups: Group labels for GroupKFold

    Returns:
        Dictionary with CV results
    """
    # Select CV strategy
    if cv_strategy == 'stratified':
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    elif cv_strategy == 'timeseries':
        cv = TimeSeriesSplit(n_splits=n_splits)
    elif cv_strategy == 'group':
        cv = GroupKFold(n_splits=n_splits)
    else:
        cv = n_splits

    # Define scoring metrics
    scoring = {
        'accuracy': 'accuracy',
        'precision': 'precision_weighted',
        'recall': 'recall_weighted',
        'f1': 'f1_weighted',
        'roc_auc': 'roc_auc'
    }

    # Perform cross-validation
    cv_results = cross_validate(
        model, X, y,
        cv=cv,
        scoring=scoring,
        groups=groups,
        return_train_score=True,
        n_jobs=-1
    )

    # Summarize results
    summary = {}
    for metric in scoring.keys():
        test_scores = cv_results[f'test_{metric}']
        train_scores = cv_results[f'train_{metric}']
        summary[metric] = {
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores),
            'train_mean': np.mean(train_scores),
            'train_std': np.std(train_scores),
            'overfit_gap': np.mean(train_scores) - np.mean(test_scores)
        }
    return summary
```

Usage example

```python
results = evaluate_with_cv(model, X, y, cv_strategy='stratified', n_splits=5)
for metric, values in results.items():
    print(f"{metric}: {values['test_mean']:.4f} (+/- {values['test_std']:.4f})")
    print(f"  Overfitting gap: {values['overfit_gap']:.4f}")
```
Example 4: Fairness Evaluation
```python
import numpy as np


def evaluate_fairness(y_true, y_pred, sensitive_attr, favorable_label=1):
    """
    Evaluate model fairness across demographic groups.

    Args:
        y_true: True labels
        y_pred: Predicted labels
        sensitive_attr: Protected attribute values
        favorable_label: The favorable outcome label

    Returns:
        Dictionary with fairness metrics
    """
    groups = np.unique(sensitive_attr)
    results = {'group_metrics': {}}

    for group in groups:
        mask = sensitive_attr == group
        group_true = y_true[mask]
        group_pred = y_pred[mask]

        # Calculate group-specific metrics
        tp = np.sum((group_true == favorable_label) & (group_pred == favorable_label))
        fp = np.sum((group_true != favorable_label) & (group_pred == favorable_label))
        fn = np.sum((group_true == favorable_label) & (group_pred != favorable_label))
        tn = np.sum((group_true != favorable_label) & (group_pred != favorable_label))

        results['group_metrics'][group] = {
            'selection_rate': np.mean(group_pred == favorable_label),
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
            'accuracy': np.mean(group_true == group_pred),
            'size': len(group_true)
        }

    # Calculate fairness metrics
    selection_rates = [m['selection_rate'] for m in results['group_metrics'].values()]
    tprs = [m['tpr'] for m in results['group_metrics'].values()]
    fprs = [m['fpr'] for m in results['group_metrics'].values()]

    results['fairness_metrics'] = {
        'demographic_parity_diff': max(selection_rates) - min(selection_rates),
        'equalized_odds_tpr_diff': max(tprs) - min(tprs),
        'equalized_odds_fpr_diff': max(fprs) - min(fprs),
    }
    return results
```

Example 5: Training Debugging with Learning Curves
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


def plot_learning_curves(model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)):
    """
    Plot learning curves to diagnose overfitting/underfitting.

    Args:
        model: Sklearn-compatible model
        X: Features
        y: Target
        cv: Cross-validation folds
        train_sizes: Array of training set size fractions
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=cv,
        train_sizes=train_sizes,
        scoring='accuracy',
        n_jobs=-1
    )

    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score', color='blue', marker='o')
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, val_mean, label='Validation score', color='red', marker='o')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.15, color='red')
    plt.xlabel('Training Set Size')
    plt.ylabel('Accuracy')
    plt.title('Learning Curves')
    plt.legend(loc='lower right')
    plt.grid(True, alpha=0.3)

    # Add diagnostic annotations
    final_gap = train_mean[-1] - val_mean[-1]
    if final_gap > 0.1:
        plt.text(0.5, 0.05, 'HIGH OVERFITTING: Large gap between train and validation',
                 transform=plt.gca().transAxes, color='red', fontweight='bold')
    elif val_mean[-1] < 0.7:
        plt.text(0.5, 0.05, 'UNDERFITTING: Both train and validation scores are low',
                 transform=plt.gca().transAxes, color='orange', fontweight='bold')

    plt.tight_layout()
    plt.show()

    return {
        'final_train_score': train_mean[-1],
        'final_val_score': val_mean[-1],
        'overfit_gap': final_gap
    }
```

Example 6: LLM Evaluation
```python
import json
from typing import Dict, List

import numpy as np
import openai


def evaluate_llm_generation(
    prompts: List[str],
    references: List[str],
    model: str,
    judge_model: str = "gpt-4"
) -> Dict:
    """
    Evaluate LLM generation quality using LLM-as-judge.

    Args:
        prompts: Input prompts
        references: Reference outputs (if available)
        model: Model to evaluate
        judge_model: Model to use as judge

    Returns:
        Dictionary with evaluation scores
    """
    results = []
    for prompt, reference in zip(prompts, references):
        # Generate response
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        generation = response.choices[0].message.content

        # LLM-as-judge evaluation
        judge_prompt = f"""Evaluate the following AI-generated response on a scale of 1-5 for:
1. Accuracy: Is the information correct?
2. Relevance: Does it address the prompt?
3. Fluency: Is it well-written and coherent?
4. Helpfulness: Is it useful to the user?

Prompt: {prompt}
Reference (if available): {reference}
Response: {generation}

Provide scores in JSON format: {{"accuracy": X, "relevance": X, "fluency": X, "helpfulness": X, "overall": X}}
"""
        judge_response = openai.ChatCompletion.create(
            model=judge_model,
            messages=[{"role": "user", "content": judge_prompt}]
        )
        scores = json.loads(judge_response.choices[0].message.content)
        results.append({
            'prompt': prompt,
            'generation': generation,
            'scores': scores
        })

    # Aggregate scores
    avg_scores = {}
    for key in ['accuracy', 'relevance', 'fluency', 'helpfulness', 'overall']:
        avg_scores[key] = np.mean([r['scores'][key] for r in results])

    return {
        'individual_results': results,
        'average_scores': avg_scores
    }
```

Example 7: A/B Test Analysis
```python
import numpy as np
from scipy import stats


def analyze_ab_test(control_metric: np.ndarray, treatment_metric: np.ndarray, alpha: float = 0.05):
    """
    Analyze A/B test results with statistical significance testing.

    Args:
        control_metric: Metric values for control group
        treatment_metric: Metric values for treatment group
        alpha: Significance level

    Returns:
        Dictionary with test results
    """
    # Descriptive statistics
    control_mean = np.mean(control_metric)
    treatment_mean = np.mean(treatment_metric)
    relative_lift = (treatment_mean - control_mean) / control_mean * 100

    # Statistical test
    t_stat, p_value = stats.ttest_ind(treatment_metric, control_metric)
    is_significant = p_value < alpha

    # Effect size (Cohen's d)
    pooled_std = np.sqrt((np.std(control_metric)**2 + np.std(treatment_metric)**2) / 2)
    cohens_d = (treatment_mean - control_mean) / pooled_std

    # Confidence interval for the difference in means
    ci = stats.t.interval(
        confidence=1 - alpha,
        df=len(control_metric) + len(treatment_metric) - 2,
        loc=treatment_mean - control_mean,
        scale=stats.sem(np.concatenate([control_metric, treatment_metric]))
    )

    return {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_lift_pct': relative_lift,
        'p_value': p_value,
        'is_significant': is_significant,
        'cohens_d': cohens_d,
        'confidence_interval': ci,
        'recommendation': 'LAUNCH' if is_significant and relative_lift > 0 else 'DO NOT LAUNCH'
    }
```

Usage

```python
results = analyze_ab_test(control_conversions, treatment_conversions)
print(f"Relative Lift: {results['relative_lift_pct']:.2f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Recommendation: {results['recommendation']}")
```
Example 8: Production Model Monitoring
```python
from typing import Dict

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


class ModelMonitor:
    """Monitor deployed model for drift and degradation."""

    def __init__(self, baseline_data: pd.DataFrame, baseline_predictions: np.ndarray):
        self.baseline_data = baseline_data
        self.baseline_predictions = baseline_predictions

    def detect_data_drift(self, current_data: pd.DataFrame, threshold: float = 0.05) -> Dict:
        """Detect feature distribution drift using the KS test."""
        drift_results = {}
        for col in self.baseline_data.columns:
            if pd.api.types.is_numeric_dtype(self.baseline_data[col]):
                statistic, p_value = ks_2samp(
                    self.baseline_data[col].dropna(),
                    current_data[col].dropna()
                )
                drift_results[col] = {
                    'ks_statistic': statistic,
                    'p_value': p_value,
                    'drift_detected': p_value < threshold
                }
        return drift_results

    def detect_prediction_drift(self, current_predictions: np.ndarray, threshold: float = 0.05) -> Dict:
        """Detect prediction distribution drift."""
        statistic, p_value = ks_2samp(self.baseline_predictions, current_predictions)
        return {
            'ks_statistic': statistic,
            'p_value': p_value,
            'drift_detected': p_value < threshold,
            'baseline_mean': np.mean(self.baseline_predictions),
            'current_mean': np.mean(current_predictions),
            'mean_shift': np.mean(current_predictions) - np.mean(self.baseline_predictions)
        }

    def performance_degradation_check(
        self,
        current_metric: float,
        baseline_metric: float,
        threshold_pct: float = 5.0
    ) -> Dict:
        """Check for performance degradation."""
        degradation_pct = (baseline_metric - current_metric) / baseline_metric * 100
        return {
            'baseline_metric': baseline_metric,
            'current_metric': current_metric,
            'degradation_pct': degradation_pct,
            'alert': degradation_pct > threshold_pct,
            'recommendation': 'RETRAIN MODEL' if degradation_pct > threshold_pct else 'OK'
        }
```

Usage

```python
monitor = ModelMonitor(baseline_df, baseline_preds)
drift_check = monitor.detect_data_drift(current_df)
pred_drift = monitor.detect_prediction_drift(current_preds)
perf_check = monitor.performance_degradation_check(current_accuracy, baseline_accuracy)
```
Common Pitfalls
- Test Set Contamination: Never use test set for hyperparameter tuning or model selection
- Data Leakage: Ensure validation/test data doesn't leak into training (temporal ordering, group splits)
- Wrong Metric Choice: Using accuracy for imbalanced datasets, not considering business costs
- Ignoring Confidence Intervals: Point estimates without uncertainty can be misleading
- Multiple Comparisons: Not correcting p-values when testing many hypotheses
- Survivorship Bias: Evaluating only on successful cases, ignoring failures
- Overfitting to Validation: Repeatedly tuning on validation set effectively makes it a second training set
- Ignoring Fairness: Deploying models without fairness evaluation can cause harm
- No Baseline: Not comparing against simple baselines (random, majority class, linear model)
- Production-Training Skew: Evaluation setup doesn't match production environment
Additional Resources
- Metrics: Scikit-learn metrics documentation, Hugging Face evaluate library
- Fairness: AI Fairness 360, Fairlearn
- LLM Evaluation: HELM, lm-evaluation-harness, BIG-bench
- A/B Testing: Evan Miller's A/B testing tools, experimentation platform docs
- Monitoring: Evidently AI, WhyLabs, Fiddler