# LLM Evaluation
Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.
## When to Use This Skill
Apply this skill when:
- Testing individual prompts for correctness and formatting
- Validating RAG (Retrieval-Augmented Generation) pipeline quality
- Measuring hallucinations, bias, or toxicity in LLM outputs
- Comparing different models or prompt configurations (A/B testing)
- Running benchmark tests (MMLU, HumanEval) to assess model capabilities
- Setting up production monitoring for LLM applications
- Integrating LLM quality checks into CI/CD pipelines
Common triggers:
- "How do I test if my RAG system is working correctly?"
- "How can I measure hallucinations in LLM outputs?"
- "What metrics should I use to evaluate generation quality?"
- "How do I compare GPT-4 vs Claude for my use case?"
- "How do I detect bias in LLM responses?"
Evaluation Strategy Selection
评估策略选择
Decision Framework: Which Evaluation Approach?
决策框架:选择哪种评估方法?
By Task Type:
| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |
By Volume and Cost:
| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
Layered Approach (Recommended for Production):
- Layer 1: Automated metrics for all outputs (fast, cheap)
- Layer 2: LLM-as-judge for 10% sample (nuanced quality)
- Layer 3: Human review for 1% edge cases (validation)
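As a minimal sketch, the three layers above can be expressed as a routing function (the 10%/1% sampling is deterministic by index here purely for illustration; a production system would typically randomize the sample):

```python
def evaluation_layers(index: int) -> list[str]:
    """Return the evaluation layers that apply to the output at `index`."""
    layers = ["automated"]          # Layer 1: every output
    if index % 10 == 0:
        layers.append("llm_judge")  # Layer 2: ~10% sample
    if index % 100 == 0:
        layers.append("human")      # Layer 3: ~1% edge cases
    return layers
```

Every response passes the cheap automated checks first; only the sampled subset incurs judge or reviewer cost.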
## Core Evaluation Patterns

### Unit Evaluation (Individual Prompts)
Test single prompt-response pairs for correctness.
Methods:
- Exact Match: Response exactly matches expected output
- Regex Matching: Response follows expected pattern
- JSON Schema Validation: Structured output validation
- Keyword Presence: Required terms appear in response
- LLM-as-Judge: Binary pass/fail using evaluation prompt
Example Use Cases:
- Email classification (spam/not spam)
- Entity extraction (dates, names, locations)
- JSON output formatting validation
- Sentiment analysis (positive/negative/neutral)
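The first four methods above are plain string and JSON checks; a minimal sketch (the helper names are illustrative, not from any library):

```python
import json
import re

def exact_match(response: str, expected: str) -> bool:
    # Case- and whitespace-insensitive exact match
    return response.strip().lower() == expected.strip().lower()

def matches_pattern(response: str, pattern: str) -> bool:
    # Response must match the regex in full, not merely contain it
    return re.fullmatch(pattern, response.strip()) is not None

def valid_json_with_keys(response: str, required_keys: set[str]) -> bool:
    # Structured-output check: parseable JSON object with the required keys
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def contains_keywords(response: str, keywords: list[str]) -> bool:
    # Keyword-presence check, case-insensitive
    text = response.lower()
    return all(k.lower() in text for k in keywords)
```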
Quick Start (Python):
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```

For complete unit evaluation examples, see `examples/python/unit_evaluation.py` and `examples/typescript/unit-evaluation.ts`.
### RAG (Retrieval-Augmented Generation) Evaluation
Evaluate RAG systems using RAGAS framework metrics.
Critical Metrics (Priority Order):
- Faithfulness (Target: > 0.8) - MOST CRITICAL
  - Measures: Is the answer grounded in retrieved context?
  - Prevents hallucinations
  - If failing: Adjust prompt to emphasize grounding, require citations
- Answer Relevance (Target: > 0.7)
  - Measures: How well does the answer address the query?
  - If failing: Improve prompt instructions, add few-shot examples
- Context Relevance (Target: > 0.7)
  - Measures: Are retrieved chunks relevant to the query?
  - If failing: Improve retrieval (better embeddings, hybrid search)
- Context Precision (Target: > 0.5)
  - Measures: Are relevant chunks ranked higher than irrelevant?
  - If failing: Add re-ranking step to retrieval pipeline
- Context Recall (Target: > 0.8)
  - Measures: Are all relevant chunks retrieved?
  - If failing: Increase retrieval count, improve chunking strategy
Quick Start (Python with RAGAS):
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}
dataset = Dataset.from_dict(data)

results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```

For comprehensive RAG evaluation patterns, see `references/rag-evaluation.md` and `examples/python/ragas_example.py`.
### LLM-as-Judge Evaluation
Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
When to Use:
- Generation quality assessment (summaries, creative writing)
- Nuanced evaluation criteria (tone, clarity, helpfulness)
- Custom rubrics for domain-specific tasks
- Medium-volume evaluation (100-1,000 samples)
Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics
Best Practices:
- Use clear, specific rubrics (1-5 scale with detailed criteria)
- Include few-shot examples in evaluation prompt
- Average multiple evaluations to reduce variance
- Be aware of biases (position bias, verbosity bias, self-preference)
Quick Start (Python):
```python
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.
USER PROMPT: {prompt}
LLM RESPONSE: {response}
Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parsing assumes the judge replies with "Score: N" on the first line
    # and "Reasoning: ..." on the second, as the prompt requests
    lines = content.strip().split('\n')
    score = int(lines[0].split(':')[1].strip())
    reasoning = lines[1].split(':', 1)[1].strip()
    return score, reasoning
```

For detailed LLM-as-judge patterns and prompt templates, see `references/llm-as-judge.md` and `examples/python/llm_as_judge.py`.
## Safety and Alignment Evaluation
Measure hallucinations, bias, and toxicity in LLM outputs.
### Hallucination Detection
Methods:
- Faithfulness to Context (RAG):
  - Use RAGAS faithfulness metric
  - LLM checks if claims are supported by context
  - Score: Supported claims / Total claims
- Factual Accuracy (Closed-Book):
  - LLM-as-judge with access to reliable sources
  - Fact-checking APIs (Google Fact Check)
  - Entity-level verification (dates, names, statistics)
- Self-Consistency:
  - Generate multiple responses to same question
  - Measure agreement between responses
  - Low consistency suggests hallucination
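Once the N samples are collected (the repeated chat-API calls are omitted here), the self-consistency method reduces to a small agreement score; a sketch:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the majority answer.

    Scores near 1.0 suggest a stable answer; low scores flag possible
    hallucination.
    """
    normalized = [a.strip().lower() for a in answers]
    majority_count = Counter(normalized).most_common(1)[0][1]
    return majority_count / len(normalized)
```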
### Bias Evaluation
Types of Bias:
- Gender bias (stereotypical associations)
- Racial/ethnic bias (discriminatory outputs)
- Cultural bias (Western-centric assumptions)
- Age/disability bias (ableist or ageist language)
Evaluation Methods:
- Stereotype Tests:
  - BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
  - BOLD (Bias in Open-Ended Language Generation)
- Counterfactual Evaluation:
  - Generate responses with demographic swaps
  - Example: "Dr. Smith (he/she) recommended..." → compare outputs
  - Measure consistency across variations
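Counterfactual evaluation starts by generating the demographic-swapped prompt variants; a minimal sketch (the `{slot}` placeholder convention is an assumption for illustration):

```python
def counterfactual_variants(template: str, slot: str, values: list[str]) -> list[str]:
    """Fill the `{slot}` placeholder in the template with each demographic value."""
    return [template.replace("{" + slot + "}", value) for value in values]
```

Each variant is then sent to the model and the responses compared for consistency, e.g. with embedding similarity or an LLM judge.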
### Toxicity Detection
Tools:
- Perspective API (Google): Toxicity, threat, insult scores
- Detoxify (HuggingFace): Open-source toxicity classifier
- OpenAI Moderation API: Hate, harassment, violence detection
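A sketch of thresholding per-category scores from a moderation result, assuming the response shape of the OpenAI Moderation API (a `category_scores` mapping of category names to floats); the network call is shown as a comment so the scoring logic stays testable offline:

```python
from typing import Any

def flagged_categories(moderation_result: dict[str, Any], threshold: float = 0.5) -> list[str]:
    """Return category names whose scores exceed the threshold."""
    scores = moderation_result["category_scores"]
    return sorted(cat for cat, score in scores.items() if score > threshold)

# Network call (assumes OPENAI_API_KEY is set):
# from openai import OpenAI
# result = OpenAI().moderations.create(input=text)
# print(flagged_categories(result.results[0].model_dump()))
```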
For comprehensive safety evaluation patterns, see `references/safety-evaluation.md`.

## Benchmark Testing
Assess model capabilities using standardized benchmarks.
Standard Benchmarks:
| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
Domain-Specific Benchmarks:
- Medical: MedQA (USMLE), PubMedQA
- Legal: LegalBench
- Finance: FinQA, ConvFinQA
When to Use Benchmarks:
- Comparing multiple models (GPT-4 vs Claude vs Llama)
- Model selection for specific domains
- Baseline capability assessment
- Academic research and publication
Quick Start (lm-evaluation-harness):
```bash
pip install lm-eval

# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```

For detailed benchmark testing patterns, see `references/benchmarks.md` and `scripts/benchmark_runner.py`.

## Production Evaluation
Monitor and optimize LLM quality in production environments.
### A/B Testing
Compare two LLM configurations:
- Variant A: GPT-4 (expensive, high quality)
- Variant B: Claude Sonnet (cheaper, fast)
Metrics:
- User satisfaction scores (thumbs up/down)
- Task completion rates
- Response time and latency
- Cost per successful interaction
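Whether the completion-rate gap between the two variants is meaningful can be checked with a normal-approximation confidence interval; a minimal sketch:

```python
import math

def completion_rate_diff(success_a: int, n_a: int, success_b: int, n_b: int):
    """Difference in task completion rates (A minus B) with a 95% CI.

    If the interval excludes 0, the variants likely differ.
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff, (diff - 1.96 * se, diff + 1.96 * se)
```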
### Online Evaluation
Real-time quality monitoring:
- Response Quality: LLM-as-judge scoring every Nth response
- User Feedback: Explicit ratings, thumbs up/down
- Business Metrics: Conversion rates, support ticket resolution
- Cost Tracking: Tokens used, inference costs
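Cost tracking reduces to token counts times the provider's per-token prices; a sketch (prices are caller-supplied, since they vary by model and change over time):

```python
def interaction_cost(prompt_tokens: int, completion_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Dollar cost of one interaction given per-1K-token prices."""
    return (prompt_tokens / 1000) * input_price_per_1k \
        + (completion_tokens / 1000) * output_price_per_1k
```

Dividing total cost by the number of successful interactions gives the cost-per-success metric used in A/B comparisons.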
### Human-in-the-Loop
Sample-based human evaluation:
- Random Sampling: Evaluate 10% of responses
- Confidence-Based: Evaluate low-confidence outputs
- Error-Triggered: Flag suspicious responses for review
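The three triggers above can be combined into one routing predicate (the 10% sample rate and 0.6 confidence threshold are illustrative assumptions, and the sample is deterministic by index for brevity):

```python
def needs_human_review(index: int, confidence: float, error_flagged: bool) -> bool:
    sampled = index % 10 == 0        # random sampling (~10%, deterministic here)
    low_confidence = confidence < 0.6  # confidence-based trigger
    return sampled or low_confidence or error_flagged
```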
For production evaluation patterns and monitoring strategies, see `references/production-evaluation.md`.

## Classification Task Evaluation
For tasks with discrete outputs (sentiment, intent, category).
Metrics:
- Accuracy: Correct predictions / Total predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1 Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed breakdown of prediction errors
Quick Start (Python):
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

For complete classification evaluation examples, see `examples/python/classification_metrics.py`.
## Generation Task Evaluation
For open-ended text generation (summaries, creative writing, responses).
Automated Metrics (Use with Caution):
- BLEU: N-gram overlap with reference text (0-1 score)
- ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-L)
- METEOR: Semantic similarity with stemming
- BERTScore: Contextual embedding similarity (0-1 score)
Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
Recommended Approach:
- Automated metrics: Fast feedback for objective aspects (length, format)
- LLM-as-judge: Nuanced quality assessment (relevance, coherence, helpfulness)
- Human evaluation: Final validation for subjective criteria (preference, creativity)
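To make the n-gram idea concrete, here is a deliberately simplified unigram-recall score in the spirit of ROUGE-1; real evaluations should use a maintained package such as `rouge-score`, which handles stemming and clipped counts:

```python
def rouge1_recall_approx(candidate: str, reference: str) -> float:
    """Fraction of distinct reference unigrams that appear in the candidate.

    Simplified illustration only: no stemming, no clipping.
    """
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)
```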
For detailed generation evaluation patterns, see `references/evaluation-types.md`.

## Quick Reference Tables
### Evaluation Framework Selection
| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |
### Python Library Recommendations
| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | `pip install ragas` |
| DeepEval | General LLM evaluation, pytest integration | `pip install deepeval` |
| LangSmith | Production monitoring, A/B testing | `pip install langsmith` |
| lm-eval | Benchmark testing (MMLU, HumanEval) | `pip install lm-eval` |
| scikit-learn | Classification metrics | `pip install scikit-learn` |
### Safety Evaluation Priority Matrix
| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |
## Detailed References
For comprehensive documentation on specific topics:
- Evaluation types (classification, generation, QA, code): `references/evaluation-types.md`
- RAG evaluation deep dive (RAGAS framework): `references/rag-evaluation.md`
- Safety evaluation (hallucination, bias, toxicity): `references/safety-evaluation.md`
- Benchmark testing (MMLU, HumanEval, domain benchmarks): `references/benchmarks.md`
- LLM-as-judge best practices and prompts: `references/llm-as-judge.md`
- Production evaluation (A/B testing, monitoring): `references/production-evaluation.md`
- All metrics definitions and formulas: `references/metrics-reference.md`
## Working Examples
Python Examples:
- `examples/python/unit_evaluation.py` - Basic prompt testing with pytest
- `examples/python/ragas_example.py` - RAGAS RAG evaluation
- `examples/python/deepeval_example.py` - DeepEval framework usage
- `examples/python/llm_as_judge.py` - GPT-4 as evaluator
- `examples/python/classification_metrics.py` - Accuracy, precision, recall
- `examples/python/benchmark_testing.py` - HumanEval example

TypeScript Examples:
- `examples/typescript/unit-evaluation.ts` - Vitest + OpenAI
- `examples/typescript/llm-as-judge.ts` - GPT-4 evaluation
- `examples/typescript/langsmith-integration.ts` - Production monitoring
## Executable Scripts
Run evaluations without loading code into context (token-free):
- `scripts/run_ragas_eval.py` - Run RAGAS evaluation on dataset
- `scripts/compare_models.py` - A/B test two models
- `scripts/benchmark_runner.py` - Run MMLU/HumanEval benchmarks
- `scripts/hallucination_checker.py` - Detect hallucinations in outputs

Example usage:

```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```

## Integration with Other Skills
Related Skills:
- `building-ai-chat`: Evaluate AI chat applications (this skill tests what that skill builds)
- `prompt-engineering`: Test prompt quality and effectiveness
- `testing-strategies`: Apply testing pyramid to LLM evaluation (unit → integration → E2E)
- `observability`: Production monitoring and alerting for LLM quality
- `building-ci-pipelines`: Integrate LLM evaluation into CI/CD

Workflow Integration:
1. Write prompt (use `prompt-engineering` skill)
2. Unit test prompt (use `llm-evaluation` skill)
3. Build AI feature (use `building-ai-chat` skill)
4. Integration test RAG pipeline (use `llm-evaluation` skill)
5. Deploy to production (use `deploying-applications` skill)
6. Monitor quality (use `llm-evaluation` + `observability` skills)
## Common Pitfalls
1. Over-reliance on Automated Metrics for Generation
- BLEU/ROUGE correlate weakly with human judgment for creative text
- Solution: Layer LLM-as-judge or human evaluation
2. Ignoring Faithfulness in RAG Systems
- Hallucinations are the #1 RAG failure mode
- Solution: Prioritize faithfulness metric (target > 0.8)
3. No Production Monitoring
- Models can degrade over time, prompts can break with updates
- Solution: Set up continuous evaluation (LangSmith, custom monitoring)
4. Biased LLM-as-Judge Evaluation
- Evaluator LLMs have biases (position bias, verbosity bias)
- Solution: Average multiple evaluations, use diverse evaluation prompts
5. Insufficient Benchmark Coverage
- Single benchmark doesn't capture full model capability
- Solution: Use 3-5 benchmarks across different domains
6. Missing Safety Evaluation
- Production LLMs can generate harmful content
- Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline
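For pitfall 4, a common mitigation sketch: run the judge several times and average (`judge_fn` here is any caller-supplied callable returning a numeric score; for pairwise comparisons you would additionally swap candidate order between runs to cancel position bias):

```python
from statistics import mean

def averaged_judge_score(judge_fn, prompt: str, response: str, runs: int = 3) -> float:
    """Average repeated judge calls to reduce single-call variance."""
    return mean(judge_fn(prompt, response) for _ in range(runs))
```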