llm_evaluation

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

LLM Evaluation

LLM评估

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
掌握LLM应用的全方位评估策略,覆盖自动化指标、人工评估到A/B测试全流程。

When to Use This Skill

何时使用该技能

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior
  • 系统性衡量LLM应用性能
  • 对比不同模型或提示词效果
  • 在部署前检测性能回退
  • 验证提示词调整带来的效果提升
  • 增强对生产系统的信心
  • 建立基准线并跟踪长期进展
  • 调试模型异常行为

Core Evaluation Types

核心评估类型

1. Automated Metrics

1. 自动化指标

Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
  • BLEU: N-gram overlap (translation)
  • ROUGE: Recall-oriented (summarization)
  • METEOR: Semantic similarity
  • BERTScore: Embedding-based similarity
  • Perplexity: Language model confidence
Classification:
  • Accuracy: Percentage correct
  • Precision/Recall/F1: Class-specific performance
  • Confusion Matrix: Error patterns
  • AUC-ROC: Ranking quality
Retrieval (RAG):
  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
  • Precision@K: Relevant in top K
  • Recall@K: Coverage in top K
通过计算得分实现的快速、可复现、可扩展的评估方式。
文本生成场景:
  • BLEU: N元组重合度(翻译场景)
  • ROUGE: 面向召回率(摘要场景)
  • METEOR: 语义相似度
  • BERTScore: 基于嵌入的相似度
  • 困惑度: 语言模型置信度
分类场景:
  • 准确率: 预测正确的占比
  • 精确率/召回率/F1: 特定类别性能
  • 混淆矩阵: 错误模式分析
  • AUC-ROC: 排序质量
检索(RAG)场景:
  • MRR: 平均倒数排名
  • NDCG: 归一化折损累计增益
  • Precision@K: 前K个结果中的相关内容占比
  • Recall@K: 前K个结果覆盖的相关内容占比

2. Human Evaluation

2. 人工评估

Manual assessment for quality aspects difficult to automate.
Dimensions:
  • Accuracy: Factual correctness
  • Coherence: Logical flow
  • Relevance: Answers the question
  • Fluency: Natural language quality
  • Safety: No harmful content
  • Helpfulness: Useful to the user
针对难以自动化评估的质量维度进行人工判断。
评估维度:
  • 准确性: 事实正确性
  • 连贯性: 逻辑流畅性
  • 相关性: 匹配问题需求
  • 流畅度: 自然语言质量
  • 安全性: 无有害内容
  • 有用性: 对用户有帮助

3. LLM-as-Judge

3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.
Approaches:
  • Pointwise: Score individual responses
  • Pairwise: Compare two responses
  • Reference-based: Compare to gold standard
  • Reference-free: Judge without ground truth
使用能力更强的LLM评估较弱模型的输出。
实现方式:
  • 单点评分: 为单个回答打分
  • 两两对比: 对比两个回答的优劣
  • 基于参考: 和黄金标准答案对比
  • 无参考: 无需标注真值直接评估

Quick Start

快速开始

python
from llm_eval import EvaluationSuite, Metric
python
from llm_eval import EvaluationSuite, Metric

Define evaluation suite

定义评估套件

suite = EvaluationSuite([ Metric.accuracy(), Metric.bleu(), Metric.bertscore(), Metric.custom(name="groundedness", fn=check_groundedness) ])
suite = EvaluationSuite([ Metric.accuracy(), Metric.bleu(), Metric.bertscore(), Metric.custom(name="groundedness", fn=check_groundedness) ])

Prepare test cases

准备测试用例

test_cases = [ { "input": "What is the capital of France?", "expected": "Paris", "context": "France is a country in Europe. Paris is its capital." }, # ... more test cases ]
test_cases = [ { "input": "What is the capital of France?", "expected": "Paris", "context": "France is a country in Europe. Paris is its capital." }, # ... 更多测试用例 ]

Run evaluation

运行评估

results = suite.evaluate( model=your_model, test_cases=test_cases )
print(f"Overall Accuracy: {results.metrics['accuracy']}") print(f"BLEU Score: {results.metrics['bleu']}")
undefined
results = suite.evaluate( model=your_model, test_cases=test_cases )
print(f"Overall Accuracy: {results.metrics['accuracy']}") print(f"BLEU Score: {results.metrics['bleu']}")
undefined

Automated Metrics Implementation

自动化指标实现

BLEU Score

BLEU分数

python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )
python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """计算参考文本和生成文本之间的BLEU分数。"""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

Usage

使用示例

bleu = calculate_bleu( reference="The cat sat on the mat", hypothesis="A cat is sitting on the mat" )
undefined
bleu = calculate_bleu( reference="The cat sat on the mat", hypothesis="A cat is sitting on the mat" )
undefined

ROUGE Score

ROUGE分数

python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """计算ROUGE分数。"""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

BERTScore

BERTScore

python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using pre-trained BERT."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """使用预训练BERT计算BERTScore。"""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )

    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }

Custom Metrics

自定义指标

python
def calculate_groundedness(response, context):
    """Check if response is grounded in provided context."""
    # Use NLI model to check entailment
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that response is entailed by context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against knowledge base."""
    # Implementation depends on your knowledge base
    # Could use retrieval + NLI, or fact-checking API
    pass
python
def calculate_groundedness(response, context):
    """检查回答是否基于提供的上下文生成。"""
    # 使用NLI模型检查蕴含关系
    from transformers import pipeline

    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

    result = nli(f"{context} [SEP] {response}")[0]

    # 返回上下文蕴含回答的置信度
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """衡量生成文本的毒性。"""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # 返回最高毒性得分

def calculate_factuality(claim, knowledge_base):
    """对照知识库验证事实声明。"""
    # 实现取决于你的知识库
    # 可使用检索+NLI或者事实核查API
    pass

LLM-as-Judge Patterns

LLM-as-Judge 实现模式

Single Output Evaluation

单输出评估

python
def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
  "accuracy": <1-10>,
  "helpfulness": <1-10>,
  "clarity": <1-10>,
  "reasoning": "<brief explanation>"
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
python
def llm_judge_quality(response, question):
    """使用GPT-5评估回答质量。"""
    prompt = f"""按照1-10分的维度对以下回答打分:
1. 准确性(事实正确)
2. 有用性(回答了问题)
3. 清晰度(表述通顺易懂)

问题:{question}
回答:{response}

以JSON格式返回评分:
{{
  "accuracy": <1-10>,
  "helpfulness": <1-10>,
  "clarity": <1-10>,
  "reasoning": "<简短说明>"
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)

Pairwise Comparison

两两对比评估

python
def compare_responses(question, response_a, response_b):
    """Compare two responses using LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
  "winner": "A" or "B" or "tie",
  "reasoning": "<explanation>",
  "confidence": <1-10>
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)
python
def compare_responses(question, response_a, response_b):
    """使用LLM裁判对比两个回答。"""
    prompt = f"""对比针对该问题的两个回答,判断哪个更优。

问题:{question}

回答A:{response_a}

回答B:{response_b}

哪个回答更好,为什么?请综合考虑准确性、有用性和清晰度。

以JSON格式返回结果:
{{
  "winner": "A" or "B" or "tie",
  "reasoning": "<说明>",
  "confidence": <1-10>
}}
"""

    result = openai.ChatCompletion.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return json.loads(result.choices[0].message.content)

Human Evaluation Frameworks

人工评估框架

Annotation Guidelines

标注指南

python
class AnnotationTask:
    """Structure for human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
python
class AnnotationTask:
    """人工标注任务结构。"""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "回答是否符合事实?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "是否回答了对应的问题?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "逻辑是否连贯一致?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }

Inter-Rater Agreement

标注者间一致性

python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    interpretation = {
        kappa < 0: "Poor",
        kappa < 0.2: "Slight",
        kappa < 0.4: "Fair",
        kappa < 0.6: "Moderate",
        kappa < 0.8: "Substantial",
        kappa <= 1.0: "Almost Perfect"
    }

    return {
        "kappa": kappa,
        "interpretation": interpretation[True]
    }
python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """计算标注者之间的一致性。"""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    interpretation = {
        kappa < 0: "差",
        kappa < 0.2: "轻微一致",
        kappa < 0.4: "一般一致",
        kappa < 0.6: "中等一致",
        kappa < 0.8: "高度一致",
        kappa <= 1.0: "几乎完全一致"
    }

    return {
        "kappa": kappa,
        "interpretation": interpretation[True]
    }

A/B Testing

A/B测试

Statistical Testing Framework

统计测试框架

python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """为某个版本添加评估结果。"""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """执行统计分析。"""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]

        # T检验
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # 效应量(Cohen's d)
        pooled_std = np.sqrt((np.std(a_scores)**2 + np.std(b_scores)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """解读Cohen's d效应量。"""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "可忽略"
        elif abs_d < 0.5:
            return "小"
        elif abs_d < 0.8:
            return "中"
        else:
            return "大"

Regression Testing

回归测试

Regression Detection

回退检测

python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect if new results show regression."""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """检测新结果是否出现性能回退。"""
        regressions = []

        for metric in self.baseline.keys():
            baseline_score = self.baseline[metric]
            new_score = new_results.get(metric)

            if new_score is None:
                continue

            # 计算相对变化
            relative_change = (new_score - baseline_score) / baseline_score

            # 标记明显的性能下降
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }

Benchmarking

基准测试

Running Benchmarks

运行基准测试

python
class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
python
class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """在基准数据集上运行模型并计算指标。"""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # 生成预测结果
            prediction = model.predict(example["input"])

            # 计算每个指标
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # 聚合结果
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }

Resources

资源

  • references/metrics.md: Comprehensive metric guide
  • references/human-evaluation.md: Annotation best practices
  • references/benchmarking.md: Standard benchmarks
  • references/a-b-testing.md: Statistical testing guide
  • references/regression-testing.md: CI/CD integration
  • assets/evaluation-framework.py: Complete evaluation harness
  • assets/benchmark-dataset.jsonl: Example datasets
  • scripts/evaluate-model.py: Automated evaluation runner
  • references/metrics.md: 全面的指标指南
  • references/human-evaluation.md: 标注最佳实践
  • references/benchmarking.md: 标准基准测试
  • references/a-b-testing.md: 统计测试指南
  • references/regression-testing.md: CI/CD集成
  • assets/evaluation-framework.py: 完整的评估工具链
  • assets/benchmark-dataset.jsonl: 示例数据集
  • scripts/evaluate-model.py: 自动化评估执行脚本

Best Practices

最佳实践

  1. Multiple Metrics: Use diverse metrics for comprehensive view
  2. Representative Data: Test on real-world, diverse examples
  3. Baselines: Always compare against baseline performance
  4. Statistical Rigor: Use proper statistical tests for comparisons
  5. Continuous Evaluation: Integrate into CI/CD pipeline
  6. Human Validation: Combine automated metrics with human judgment
  7. Error Analysis: Investigate failures to understand weaknesses
  8. Version Control: Track evaluation results over time
  1. 多指标结合: 使用多种指标获得全面的评估视角
  2. 代表性数据: 基于真实世界的多样化示例测试
  3. 基准对比: 始终和基线性能做对比
  4. 统计严谨性: 使用合适的统计测试做对比
  5. 持续评估: 集成到CI/CD流水线中
  6. 人工验证: 结合自动化指标和人工判断
  7. 错误分析: 调研失败案例理解模型短板
  8. 版本控制: 长期跟踪评估结果变化

Common Pitfalls

常见误区

  • Single Metric Obsession: Optimizing for one metric at the expense of others
  • Small Sample Size: Drawing conclusions from too few examples
  • Data Contamination: Testing on training data
  • Ignoring Variance: Not accounting for statistical uncertainty LLM Evaluation v1.1 - Enhanced
  • 单一指标执念: 为了优化单个指标牺牲其他维度表现
  • 样本量过小: 基于过少的示例得出结论
  • 数据污染: 使用训练数据做测试
  • 忽略方差: 不考虑统计不确定性 LLM评估 v1.1 - 增强版

🔄 Workflow

🔄 工作流

Aşama 1: Dataset Creation

阶段1: 数据集创建

  • Golden Set: En az 50 adet "soru-cevap-kontext" üçlüsü içeren ve insan tarafından doğrulanmış dataset oluştur.
  • Adversarial: Modeli yanıltmaya yönelik (Prompt Injection, Jailbreak) test senaryoları ekle.
  • 黄金数据集: 创建至少包含50组经人工验证的“问题-答案-上下文”三元组的数据集。
  • 对抗样本: 添加旨在误导模型的测试场景(提示词注入、越狱)。

Aşama 2: Automated Pipeline

阶段2: 自动化流水线

  • Component-Level: Retrieval (Recall@K) ve Generation (Faithfulness) ayrı ayrı ölç.
  • Continuous Eval: CI/CD pipeline'ına entegre et, her PR'da regresyon testi koş.
  • Cost Monitoring: Token kullanımını ve maliyeti test başına takip et.
  • 组件级评估: 分别衡量检索(Recall@K)和生成(忠实度)效果。
  • 持续评估: 集成到CI/CD流水线中,每次PR都运行回归测试。
  • 成本监控: 跟踪每次测试的Token使用量和成本。

Aşama 3: Human-in-the-Loop

阶段3: 人在回路

  • Sampling: Modelin "kötü" (düşük skorlu) cevaplarını insan incelemesine gönder.
  • Feedback Loop: Düzeltmeleri bir sonraki fine-tuning/prompting iterasyonu için kaydet.
  • 抽样: 将模型的“差”(低得分)回答提交人工审核。
  • 反馈回路: 保存修正内容,用于下一轮微调/提示词优化迭代。

Kontrol Noktaları

检查点

AşamaDoğrulama
1Test seti gerçek kullanıcı verilerini temsil ediyor mu?
2Otomatik metrikler ile insan kararları uyuşuyor mu (Korrelasyon)?
3Model güvenliği (PII sızıntısı, toksisite) garanti altında mı?
阶段验证项
1测试集是否能代表真实用户数据?
2自动化指标与人工判断是否一致(相关性)?
3模型安全性(PII泄露、毒性)是否得到保障?