
LLM Evaluation


Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

When to Use This Skill


Apply this skill when:
  • Testing individual prompts for correctness and formatting
  • Validating RAG (Retrieval-Augmented Generation) pipeline quality
  • Measuring hallucinations, bias, or toxicity in LLM outputs
  • Comparing different models or prompt configurations (A/B testing)
  • Running benchmark tests (MMLU, HumanEval) to assess model capabilities
  • Setting up production monitoring for LLM applications
  • Integrating LLM quality checks into CI/CD pipelines
Common triggers:
  • "How do I test if my RAG system is working correctly?"
  • "How can I measure hallucinations in LLM outputs?"
  • "What metrics should I use to evaluate generation quality?"
  • "How do I compare GPT-4 vs Claude for my use case?"
  • "How do I detect bias in LLM responses?"

Evaluation Strategy Selection


Decision Framework: Which Evaluation Approach?


By Task Type:

| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |

By Volume and Cost:

| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
Layered Approach (Recommended for Production):
  1. Layer 1: Automated metrics for all outputs (fast, cheap)
  2. Layer 2: LLM-as-judge for 10% sample (nuanced quality)
  3. Layer 3: Human review for 1% edge cases (validation)
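The three layers above can be sketched as a routing function. This is a minimal illustration with hypothetical helper names (`layer1_automated`, `route_for_review`) and an example JSON check as the Layer 1 gate; real Layer 1 checks depend on your output format.

```python
import json
import random

def layer1_automated(output: str) -> bool:
    """Layer 1: cheap structural check applied to every output.
    Here: output must be JSON with a required "answer" field (example schema)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "answer" in data

def route_for_review(outputs: list, judge_rate: float = 0.10,
                     human_rate: float = 0.01, seed: int = 0):
    """Route outputs through the layered pipeline.

    Returns (passed_automated, judge_sample, human_sample). The human sample
    is drawn from the judge sample so it stays ~human_rate of all traffic."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    passed = [o for o in outputs if layer1_automated(o)]
    judge_sample = [o for o in passed if rng.random() < judge_rate]
    human_frac = 0.0 if judge_rate == 0 else human_rate / judge_rate
    human_sample = [o for o in judge_sample if rng.random() < human_frac]
    return passed, judge_sample, human_sample
```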

Core Evaluation Patterns


Unit Evaluation (Individual Prompts)


Test single prompt-response pairs for correctness.
Methods:
  • Exact Match: Response exactly matches expected output
  • Regex Matching: Response follows expected pattern
  • JSON Schema Validation: Structured output validation
  • Keyword Presence: Required terms appear in response
  • LLM-as-Judge: Binary pass/fail using evaluation prompt
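Several of these methods reduce to small, deterministic checks. A minimal sketch of the regex, JSON, and keyword checks (helper names are illustrative; a full JSON Schema check would use the `jsonschema` package):

```python
import json
import re

def check_regex(response: str, pattern: str) -> bool:
    """Regex matching: the whole response follows an expected pattern."""
    return re.fullmatch(pattern, response.strip()) is not None

def check_json_keys(response: str, required_keys: set) -> bool:
    """Lightweight structured-output check: parses as a JSON object
    and contains all required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_keywords(response: str, keywords: list) -> bool:
    """Keyword presence: all required terms appear (case-insensitive)."""
    lower = response.lower()
    return all(k.lower() in lower for k in keywords)
```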
Example Use Cases:
  • Email classification (spam/not spam)
  • Entity extraction (dates, names, locations)
  • JSON output formatting validation
  • Sentiment analysis (positive/negative/neutral)
Quick Start (Python):
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```
For complete unit evaluation examples, see `examples/python/unit_evaluation.py` and `examples/typescript/unit-evaluation.ts`.

RAG (Retrieval-Augmented Generation) Evaluation


Evaluate RAG systems using RAGAS framework metrics.
Critical Metrics (Priority Order):
  1. Faithfulness (Target: > 0.8) - MOST CRITICAL
    • Measures: Is the answer grounded in retrieved context?
    • Prevents hallucinations
    • If failing: Adjust prompt to emphasize grounding, require citations
  2. Answer Relevance (Target: > 0.7)
    • Measures: How well does the answer address the query?
    • If failing: Improve prompt instructions, add few-shot examples
  3. Context Relevance (Target: > 0.7)
    • Measures: Are retrieved chunks relevant to the query?
    • If failing: Improve retrieval (better embeddings, hybrid search)
  4. Context Precision (Target: > 0.5)
    • Measures: Are relevant chunks ranked higher than irrelevant?
    • If failing: Add re-ranking step to retrieval pipeline
  5. Context Recall (Target: > 0.8)
    • Measures: Are all relevant chunks retrieved?
    • If failing: Increase retrieval count, improve chunking strategy
Quick Start (Python with RAGAS):
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```
For comprehensive RAG evaluation patterns, see `references/rag-evaluation.md` and `examples/python/ragas_example.py`.

LLM-as-Judge Evaluation


Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
When to Use:
  • Generation quality assessment (summaries, creative writing)
  • Nuanced evaluation criteria (tone, clarity, helpfulness)
  • Custom rubrics for domain-specific tasks
  • Medium-volume evaluation (100-1,000 samples)
Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics
Best Practices:
  • Use clear, specific rubrics (1-5 scale with detailed criteria)
  • Include few-shot examples in evaluation prompt
  • Average multiple evaluations to reduce variance
  • Be aware of biases (position bias, verbosity bias, self-preference)
Quick Start (Python):
```python
import re
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parse with regex rather than fixed line positions: judges often add preamble.
    score_match = re.search(r"Score:\s*(\d)", content)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", content)
    if not score_match:
        raise ValueError(f"Could not parse judge output: {content!r}")
    score = int(score_match.group(1))
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
```
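The "average multiple evaluations" practice listed above can wrap any judge function. A minimal sketch (the `judge_fn` parameter is illustrative and would be a function like `evaluate_quality`):

```python
import statistics

def averaged_judge_score(prompt: str, response: str, judge_fn, n: int = 3):
    """Call the judge n times and aggregate, reducing single-call variance.

    judge_fn(prompt, response) -> (score, reasoning).
    Returns (mean score, standard deviation of scores)."""
    scores = [judge_fn(prompt, response)[0] for _ in range(n)]
    spread = statistics.stdev(scores) if n > 1 else 0.0
    return statistics.mean(scores), spread
```

A high spread is itself a signal: if the judge disagrees with itself across runs, the rubric is probably too vague for that sample.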
For detailed LLM-as-judge patterns and prompt templates, see `references/llm-as-judge.md` and `examples/python/llm_as_judge.py`.

Safety and Alignment Evaluation


Measure hallucinations, bias, and toxicity in LLM outputs.

Hallucination Detection


Methods:
  1. Faithfulness to Context (RAG):
    • Use RAGAS faithfulness metric
    • LLM checks if claims are supported by context
    • Score: Supported claims / Total claims
  2. Factual Accuracy (Closed-Book):
    • LLM-as-judge with access to reliable sources
    • Fact-checking APIs (Google Fact Check)
    • Entity-level verification (dates, names, statistics)
  3. Self-Consistency:
    • Generate multiple responses to same question
    • Measure agreement between responses
    • Low consistency suggests hallucination
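Self-consistency can be scored as pairwise agreement across the sampled responses. A minimal sketch using exact match after normalization (semantic-similarity matching would be more robust for free-form answers):

```python
from itertools import combinations

def normalize(answer: str) -> str:
    """Case-fold and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())

def self_consistency(responses: list) -> float:
    """Fraction of response pairs that agree after normalization.
    Low scores suggest the model is guessing (possible hallucination)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    agree = sum(normalize(a) == normalize(b) for a, b in pairs)
    return agree / len(pairs)
```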

Bias Evaluation


Types of Bias:
  • Gender bias (stereotypical associations)
  • Racial/ethnic bias (discriminatory outputs)
  • Cultural bias (Western-centric assumptions)
  • Age/disability bias (ableist or ageist language)
Evaluation Methods:
  1. Stereotype Tests:
    • BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
    • BOLD (Bias in Open-Ended Language Generation)
  2. Counterfactual Evaluation:
    • Generate responses with demographic swaps
    • Example: "Dr. Smith (he/she) recommended..." → compare outputs
    • Measure consistency across variations
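A counterfactual check can be sketched as: fill a demographic slot, generate one response per variant, then measure how often the outputs agree. Helper names here are illustrative, and `consistency_rate` assumes short, normalizable outputs:

```python
from collections import Counter

def demographic_variants(template: str, slot: str = "{pronoun}",
                         values: tuple = ("he", "she", "they")) -> dict:
    """Create counterfactual prompts by filling a demographic slot."""
    return {v: template.replace(slot, v) for v in values}

def consistency_rate(outputs: dict) -> float:
    """Fraction of variant outputs matching the majority output.
    1.0 means no variant-dependent behavior on this prompt."""
    normed = [o.lower().strip() for o in outputs.values()]
    majority_count = Counter(normed).most_common(1)[0][1]
    return majority_count / len(normed)
```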

Toxicity Detection


Tools:
  • Perspective API (Google): Toxicity, threat, insult scores
  • Detoxify (HuggingFace): Open-source toxicity classifier
  • OpenAI Moderation API: Hate, harassment, violence detection
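As a sketch of the third option: the OpenAI Moderation API returns per-category boolean flags for each input. This assumes the v1+ `openai` Python SDK and `OPENAI_API_KEY` being set; the import is deferred so the pure helper works on its own.

```python
def flagged_names(categories: dict) -> list:
    """Names of the moderation categories that fired, sorted for stable output."""
    return sorted(name for name, hit in categories.items() if hit)

def screen_text(text: str):
    """Screen one LLM output with the OpenAI Moderation API.

    Returns (flagged, category_names)."""
    from openai import OpenAI  # deferred: flagged_names() works without the SDK
    client = OpenAI()
    result = client.moderations.create(input=text).results[0]
    return result.flagged, flagged_names(result.categories.model_dump())
```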
For comprehensive safety evaluation patterns, see `references/safety-evaluation.md`.

Benchmark Testing


Assess model capabilities using standardized benchmarks.
Standard Benchmarks:

| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
Domain-Specific Benchmarks:
  • Medical: MedQA (USMLE), PubMedQA
  • Legal: LegalBench
  • Finance: FinQA, ConvFinQA
When to Use Benchmarks:
  • Comparing multiple models (GPT-4 vs Claude vs Llama)
  • Model selection for specific domains
  • Baseline capability assessment
  • Academic research and publication
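Pass@K (used by HumanEval and referenced in the task-type table earlier) is usually computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
    Language Models Trained on Code"): probability that at least one of
    k samples drawn from n generations passes, given c passing samples."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```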
Quick Start (lm-evaluation-harness):
```bash
pip install lm-eval

# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```
For detailed benchmark testing patterns, see `references/benchmarks.md` and `scripts/benchmark_runner.py`.

Production Evaluation


Monitor and optimize LLM quality in production environments.

A/B Testing


Compare two LLM configurations:
  • Variant A: GPT-4 (expensive, high quality)
  • Variant B: Claude Sonnet (cheaper, fast)
Metrics:
  • User satisfaction scores (thumbs up/down)
  • Task completion rates
  • Response time and latency
  • Cost per successful interaction
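Two of these metrics can be computed with only the standard library: cost normalized per success, and a two-proportion z-test on task completion rates to check whether the variants actually differ. A minimal sketch (function names are illustrative):

```python
from math import sqrt, erf

def cost_per_success(total_cost: float, successes: int) -> float:
    """Cost per successful interaction for one variant."""
    return float("inf") if successes == 0 else total_cost / successes

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-proportion z-test on completion rates; returns (z, two-sided p).
    Assumes counts large enough for the normal approximation."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided p = 2 * (1 - Phi(|z|))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```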

Online Evaluation


Real-time quality monitoring:
  • Response Quality: LLM-as-judge scoring every Nth response
  • User Feedback: Explicit ratings, thumbs up/down
  • Business Metrics: Conversion rates, support ticket resolution
  • Cost Tracking: Tokens used, inference costs
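"Every Nth response" sampling is best done deterministically, so retries and replays of the same request get the same decision. A minimal hash-based sketch (the 10% default mirrors the layered approach above):

```python
import hashlib

def should_judge(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample ~rate of traffic for LLM-as-judge scoring,
    keyed on the request id so the decision is stable across retries."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to a uniform float in [0, 1)
    return int.from_bytes(digest[:8], "big") / 2**64 < rate
```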

Human-in-the-Loop


Sample-based human evaluation:
  • Random Sampling: Evaluate 10% of responses
  • Confidence-Based: Evaluate low-confidence outputs
  • Error-Triggered: Flag suspicious responses for review
For production evaluation patterns and monitoring strategies, see `references/production-evaluation.md`.

Classification Task Evaluation


For tasks with discrete outputs (sentiment, intent, category).
Metrics:
  • Accuracy: Correct predictions / Total predictions
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1 Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed breakdown of prediction errors
Quick Start (Python):
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```
For complete classification evaluation examples, see `examples/python/classification_metrics.py`.

Generation Task Evaluation


For open-ended text generation (summaries, creative writing, responses).
Automated Metrics (Use with Caution):
  • BLEU: N-gram overlap with reference text (0-1 score)
  • ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-L)
  • METEOR: Semantic similarity with stemming
  • BERTScore: Contextual embedding similarity (0-1 score)
Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
Recommended Approach:
  1. Automated metrics: Fast feedback for objective aspects (length, format)
  2. LLM-as-judge: Nuanced quality assessment (relevance, coherence, helpfulness)
  3. Human evaluation: Final validation for subjective criteria (preference, creativity)
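To make the automated metrics concrete: ROUGE-1 is simply unigram overlap between candidate and reference. A minimal sketch without stemming (production work should use the `rouge-score` or `evaluate` packages):

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Minimal ROUGE-1: unigram overlap precision, recall, and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

This also illustrates the limitation noted above: a fluent paraphrase with different words scores near zero, which is why ROUGE is unreliable for creative text.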
For detailed generation evaluation patterns, see `references/evaluation-types.md`.

Quick Reference Tables


Evaluation Framework Selection


| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |

Python Library Recommendations


| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | `pip install ragas` |
| DeepEval | General LLM evaluation, pytest integration | `pip install deepeval` |
| LangSmith | Production monitoring, A/B testing | `pip install langsmith` |
| lm-eval | Benchmark testing (MMLU, HumanEval) | `pip install lm-eval` |
| scikit-learn | Classification metrics | `pip install scikit-learn` |

Safety Evaluation Priority Matrix


| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |

Detailed References


For comprehensive documentation on specific topics:
  • Evaluation types (classification, generation, QA, code): `references/evaluation-types.md`
  • RAG evaluation deep dive (RAGAS framework): `references/rag-evaluation.md`
  • Safety evaluation (hallucination, bias, toxicity): `references/safety-evaluation.md`
  • Benchmark testing (MMLU, HumanEval, domain benchmarks): `references/benchmarks.md`
  • LLM-as-judge best practices and prompts: `references/llm-as-judge.md`
  • Production evaluation (A/B testing, monitoring): `references/production-evaluation.md`
  • All metrics definitions and formulas: `references/metrics-reference.md`

Working Examples


Python Examples:
  • `examples/python/unit_evaluation.py` - Basic prompt testing with pytest
  • `examples/python/ragas_example.py` - RAGAS RAG evaluation
  • `examples/python/deepeval_example.py` - DeepEval framework usage
  • `examples/python/llm_as_judge.py` - GPT-4 as evaluator
  • `examples/python/classification_metrics.py` - Accuracy, precision, recall
  • `examples/python/benchmark_testing.py` - HumanEval example
TypeScript Examples:
  • `examples/typescript/unit-evaluation.ts` - Vitest + OpenAI
  • `examples/typescript/llm-as-judge.ts` - GPT-4 evaluation
  • `examples/typescript/langsmith-integration.ts` - Production monitoring

Executable Scripts


Run evaluations without loading script code into the model's context (no token cost):
  • `scripts/run_ragas_eval.py` - Run RAGAS evaluation on a dataset
  • `scripts/compare_models.py` - A/B test two models
  • `scripts/benchmark_runner.py` - Run MMLU/HumanEval benchmarks
  • `scripts/hallucination_checker.py` - Detect hallucinations in outputs
Example usage:
```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```

Integration with Other Skills


Related Skills:
  • building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
  • prompt-engineering: Test prompt quality and effectiveness
  • testing-strategies: Apply the testing pyramid to LLM evaluation (unit → integration → E2E)
  • observability: Production monitoring and alerting for LLM quality
  • building-ci-pipelines: Integrate LLM evaluation into CI/CD
Workflow Integration:
  1. Write prompt (use prompt-engineering skill)
  2. Unit test prompt (use llm-evaluation skill)
  3. Build AI feature (use building-ai-chat skill)
  4. Integration test RAG pipeline (use llm-evaluation skill)
  5. Deploy to production (use deploying-applications skill)
  6. Monitor quality (use llm-evaluation + observability skills)

Common Pitfalls


1. Over-reliance on Automated Metrics for Generation
  • BLEU/ROUGE correlate weakly with human judgment for creative text
  • Solution: Layer LLM-as-judge or human evaluation
2. Ignoring Faithfulness in RAG Systems
  • Hallucinations are the #1 RAG failure mode
  • Solution: Prioritize faithfulness metric (target > 0.8)
3. No Production Monitoring
  • Models can degrade over time, prompts can break with updates
  • Solution: Set up continuous evaluation (LangSmith, custom monitoring)
4. Biased LLM-as-Judge Evaluation
  • Evaluator LLMs have biases (position bias, verbosity bias)
  • Solution: Average multiple evaluations, use diverse evaluation prompts
5. Insufficient Benchmark Coverage
  • Single benchmark doesn't capture full model capability
  • Solution: Use 3-5 benchmarks across different domains
6. Missing Safety Evaluation
  • Production LLMs can generate harmful content
  • Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline