LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Semantic similarity
- BERTScore: Embedding-based similarity
- Perplexity: Language model confidence
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Relevant in top K
- Recall@K: Coverage in top K
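The retrieval metrics listed above are simple enough to compute directly. A minimal sketch with binary relevance, assuming each query returns a ranked list of document IDs plus a set of relevant IDs (MRR is then the mean of `reciprocal_rank` over all queries):

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-K results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top K."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(1 / math.log2(rank + 1)
              for rank, d in enumerate(ranked_ids[:k], start=1)
              if d in relevant_ids)
    ideal = sum(1 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant_ids)) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(reciprocal_rank(ranked, relevant))    # 0.5 (first hit at rank 2)
print(precision_at_k(ranked, relevant, 3))  # 1 relevant in top 3
print(recall_at_k(ranked, relevant, 3))     # 0.5
```

Libraries such as Ragas or `pytrec_eval` provide hardened versions of these; the sketch is mainly useful for understanding what each number means.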
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
3. LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
```python
from llm_eval import EvaluationSuite, Metric

# Define evaluation suite
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = suite.evaluate(
    model=your_model,
    test_cases=test_cases
)
print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
```
Automated Metrics Implementation
BLEU Score
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference, hypothesis):
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4
    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```
ROUGE Score
```python
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    """Calculate ROUGE-1, ROUGE-2, and ROUGE-L F-measures."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, hypothesis)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }
```
BERTScore
```python
from bert_score import score

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore using a pre-trained transformer."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang='en',
        model_type='microsoft/deberta-xlarge-mnli'
    )
    return {
        'precision': P.mean().item(),
        'recall': R.mean().item(),
        'f1': F1.mean().item()
    }
```
Custom Metrics
```python
def calculate_groundedness(response, context):
    """Check if a response is grounded in the provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline
    nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
    result = nli(f"{context} [SEP] {response}")[0]
    # Return confidence that the response is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text):
    """Measure toxicity in generated text."""
    from detoxify import Detoxify
    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim, knowledge_base):
    """Verify factual claims against a knowledge base."""
    # Implementation depends on your knowledge base:
    # could use retrieval + NLI, or a fact-checking API
    pass
```
LLM-as-Judge Patterns
Single Output Evaluation
```python
import json
import openai

def llm_judge_quality(response, question):
    """Use GPT-5 to judge response quality."""
    prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)

Question: {question}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}
"""
    result = openai.chat.completions.create(  # openai>=1.0 client API
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
```
Pairwise Comparison
```python
import json
import openai

def compare_responses(question, response_a, response_b):
    """Compare two responses using an LLM judge."""
    prompt = f"""Compare these two responses to the question and determine which is better.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response is better and why? Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}
"""
    result = openai.chat.completions.create(  # openai>=1.0 client API
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return json.loads(result.choices[0].message.content)
```
Human Evaluation Frameworks
Annotation Guidelines
```python
class AnnotationTask:
    """Structure for a human annotation task."""

    def __init__(self, response, question, context=None):
        self.response = response
        self.question = question
        self.context = context

    def get_annotation_form(self):
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
            "ratings": {
                "accuracy": {
                    "scale": "1-5",
                    "description": "Is the response factually correct?"
                },
                "relevance": {
                    "scale": "1-5",
                    "description": "Does it answer the question?"
                },
                "coherence": {
                    "scale": "1-5",
                    "description": "Is it logically consistent?"
                }
            },
            "issues": {
                "factual_error": False,
                "hallucination": False,
                "off_topic": False,
                "unsafe_content": False
            },
            "feedback": ""
        }
```
Inter-Rater Agreement
```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(rater1_scores, rater2_scores):
    """Calculate inter-rater agreement (Cohen's kappa)."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)
    # Landis & Koch interpretation bands
    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"
    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```
A/B Testing
Statistical Testing Framework
```python
from scipy import stats
import numpy as np

class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}

    def add_result(self, variant, score):
        """Add an evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
        else:
            self.variant_b["scores"].append(score)

    def analyze(self, alpha=0.05):
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]
        # Welch's t-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores, equal_var=False)
        # Effect size (Cohen's d), using sample standard deviations
        pooled_std = np.sqrt((np.std(a_scores, ddof=1)**2 + np.std(b_scores, ddof=1)**2) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std
        return {
            "variant_a_mean": np.mean(a_scores),
            "variant_b_mean": np.mean(b_scores),
            "difference": np.mean(b_scores) - np.mean(a_scores),
            "relative_improvement": (np.mean(b_scores) - np.mean(a_scores)) / np.mean(a_scores),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
        }

    @staticmethod
    def interpret_cohens_d(d):
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```
Regression Testing
Regression Detection
```python
class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
        """Detect whether new results show a regression against the baseline."""
        regressions = []
        for metric, baseline_score in self.baseline.items():
            new_score = new_results.get(metric)
            # Skip missing metrics and guard against division by zero
            if new_score is None or baseline_score == 0:
                continue
            # Calculate relative change
            relative_change = (new_score - baseline_score) / baseline_score
            # Flag a significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })
        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
        }
```
Benchmarking
Running Benchmarks
```python
import numpy as np

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
        """Run the model on the benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}
        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])
            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)
        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            for metric, scores in results.items()
        }
```
Resources
- references/metrics.md: Comprehensive metric guide
- references/human-evaluation.md: Annotation best practices
- references/benchmarking.md: Standard benchmarks
- references/a-b-testing.md: Statistical testing guide
- references/regression-testing.md: CI/CD integration
- assets/evaluation-framework.py: Complete evaluation harness
- assets/benchmark-dataset.jsonl: Example datasets
- scripts/evaluate-model.py: Automated evaluation runner
Best Practices
- Multiple Metrics: Use diverse metrics for comprehensive view
- Representative Data: Test on real-world, diverse examples
- Baselines: Always compare against baseline performance
- Statistical Rigor: Use proper statistical tests for comparisons
- Continuous Evaluation: Integrate into CI/CD pipeline
- Human Validation: Combine automated metrics with human judgment
- Error Analysis: Investigate failures to understand weaknesses
- Version Control: Track evaluation results over time
Common Pitfalls
- Single Metric Obsession: Optimizing for one metric at the expense of others
- Small Sample Size: Drawing conclusions from too few examples
- Data Contamination: Testing on training data
- Ignoring Variance: Not accounting for statistical uncertainty
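One way to avoid that last pitfall is to report a confidence interval alongside each mean metric, for example via a percentile bootstrap. A self-contained sketch using only the standard library (the scores are made-up illustration data):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of metric scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

scores = [0.8, 0.9, 0.7, 0.95, 0.85, 0.6, 0.75, 0.9]
low, high = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If two variants' intervals overlap heavily, a headline difference in means is weak evidence on its own.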
LLM Evaluation v1.1 - Enhanced
🔄 Workflow
Source: Ragas Metrics & HuggingFace Evaluation Guide
Stage 1: Dataset Creation
- Golden Set: Build a human-verified dataset of at least 50 question-answer-context triples.
- Adversarial: Add test scenarios designed to mislead the model (prompt injection, jailbreaks).
Stage 2: Automated Pipeline
- Component-Level: Measure retrieval (Recall@K) and generation (faithfulness) separately.
- Continuous Eval: Integrate into the CI/CD pipeline and run regression tests on every PR.
- Cost Monitoring: Track token usage and cost per test.
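The cost-monitoring step can be as simple as accumulating token counts per test case. A minimal sketch; the per-1K-token prices below are placeholder assumptions, not real provider rates:

```python
class CostTracker:
    """Accumulate token usage and estimated cost across an evaluation run."""

    def __init__(self, input_price_per_1k, output_price_per_1k):
        # Prices are assumptions; substitute your provider's current rates.
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k
        self.records = []

    def add(self, test_id, input_tokens, output_tokens):
        cost = (input_tokens / 1000) * self.input_price \
             + (output_tokens / 1000) * self.output_price
        self.records.append({"test_id": test_id, "cost": cost})
        return cost

    def total(self):
        return sum(r["cost"] for r in self.records)

tracker = CostTracker(input_price_per_1k=0.005, output_price_per_1k=0.015)
tracker.add("tc-001", input_tokens=1200, output_tokens=300)
print(f"total so far: ${tracker.total():.4f}")
```

Logging cost per test makes it easy to flag prompt changes that quietly double token usage even when quality metrics hold steady.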
Stage 3: Human-in-the-Loop
- Sampling: Send the model's "bad" (low-scoring) answers for human review.
- Feedback Loop: Record the corrections for the next fine-tuning/prompting iteration.
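The sampling step above can be sketched as a simple filter over scored results; the `score` field, the 0.5 threshold, and the example IDs are illustrative assumptions:

```python
def sample_for_review(results, threshold=0.5, max_items=20):
    """Select low-scoring responses for human review, worst first."""
    flagged = [r for r in results if r["score"] < threshold]
    flagged.sort(key=lambda r: r["score"])  # worst first
    return flagged[:max_items]

results = [
    {"id": "q1", "score": 0.92},
    {"id": "q2", "score": 0.31},
    {"id": "q3", "score": 0.48},
]
print([r["id"] for r in sample_for_review(results)])  # ['q2', 'q3']
```

Capping the batch with `max_items` keeps the review queue bounded, so annotators always see the worst failures first.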
Checkpoints
| Stage | Verification |
|---|---|
| 1 | Does the test set represent real user data? |
| 2 | Do automated metrics agree with human judgments (correlation)? |
| 3 | Is model safety (PII leakage, toxicity) guaranteed? |