
LLM Evaluation


Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

When to Use This Skill


Apply this skill when:
  • Testing individual prompts for correctness and formatting
  • Validating RAG (Retrieval-Augmented Generation) pipeline quality
  • Measuring hallucinations, bias, or toxicity in LLM outputs
  • Comparing different models or prompt configurations (A/B testing)
  • Running benchmark tests (MMLU, HumanEval) to assess model capabilities
  • Setting up production monitoring for LLM applications
  • Integrating LLM quality checks into CI/CD pipelines
Common triggers:
  • "How do I test if my RAG system is working correctly?"
  • "How can I measure hallucinations in LLM outputs?"
  • "What metrics should I use to evaluate generation quality?"
  • "How do I compare GPT-4 vs Claude for my use case?"
  • "How do I detect bias in LLM responses?"

Evaluation Strategy Selection


Decision Framework: Which Evaluation Approach?


By Task Type:

| Task Type | Primary Approach | Metrics | Tools |
|---|---|---|---|
| Classification (sentiment, intent) | Automated metrics | Accuracy, Precision, Recall, F1 | scikit-learn |
| Generation (summaries, creative text) | LLM-as-judge + automated | BLEU, ROUGE, BERTScore, Quality rubric | GPT-4/Claude for judging |
| Question Answering | Exact match + semantic similarity | EM, F1, Cosine similarity | Custom evaluators |
| RAG Systems | RAGAS framework | Faithfulness, Answer/Context relevance | RAGAS library |
| Code Generation | Unit tests + execution | Pass@K, Test pass rate | HumanEval, pytest |
| Multi-step Agents | Task completion + tool accuracy | Success rate, Efficiency | Custom evaluators |

By Volume and Cost:

| Samples | Speed | Cost | Recommended Approach |
|---|---|---|---|
| 1,000+ | Immediate | $0 | Automated metrics (regex, JSON validation) |
| 100-1,000 | Minutes | $0.01-0.10 each | LLM-as-judge (GPT-4, Claude) |
| < 100 | Hours | $1-10 each | Human evaluation (pairwise comparison) |
Layered Approach (Recommended for Production):
  1. Layer 1: Automated metrics for all outputs (fast, cheap)
  2. Layer 2: LLM-as-judge for 10% sample (nuanced quality)
  3. Layer 3: Human review for 1% edge cases (validation)
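The three layers above can be sketched as a routing function. This is a minimal illustration with hypothetical helper names (`layer1_automated`, `route_for_review`) and an example JSON check as the Layer 1 gate; real Layer 1 checks depend on your output format.

```python
import json
import random

def layer1_automated(output: str) -> bool:
    """Layer 1: cheap structural check applied to every output.
    Here: output must be JSON with a required "answer" field (example schema)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "answer" in data

def route_for_review(outputs: list, judge_rate: float = 0.10,
                     human_rate: float = 0.01, seed: int = 0):
    """Route outputs through the layered pipeline.

    Returns (passed_automated, judge_sample, human_sample). The human sample
    is drawn from the judge sample so it stays ~human_rate of all traffic."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    passed = [o for o in outputs if layer1_automated(o)]
    judge_sample = [o for o in passed if rng.random() < judge_rate]
    human_frac = 0.0 if judge_rate == 0 else human_rate / judge_rate
    human_sample = [o for o in judge_sample if rng.random() < human_frac]
    return passed, judge_sample, human_sample
```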

Core Evaluation Patterns


Unit Evaluation (Individual Prompts)


Test single prompt-response pairs for correctness.
Methods:
  • Exact Match: Response exactly matches expected output
  • Regex Matching: Response follows expected pattern
  • JSON Schema Validation: Structured output validation
  • Keyword Presence: Required terms appear in response
  • LLM-as-Judge: Binary pass/fail using evaluation prompt
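Several of these methods reduce to small, deterministic checks. A minimal sketch of the regex, JSON, and keyword checks (helper names are illustrative; a full JSON Schema check would use the `jsonschema` package):

```python
import json
import re

def check_regex(response: str, pattern: str) -> bool:
    """Regex matching: the whole response follows an expected pattern."""
    return re.fullmatch(pattern, response.strip()) is not None

def check_json_keys(response: str, required_keys: set) -> bool:
    """Lightweight structured-output check: parses as a JSON object
    and contains all required keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_keywords(response: str, keywords: list) -> bool:
    """Keyword presence: all required terms appear (case-insensitive)."""
    lower = response.lower()
    return all(k.lower() in lower for k in keywords)
```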
Example Use Cases:
  • Email classification (spam/not spam)
  • Entity extraction (dates, names, locations)
  • JSON output formatting validation
  • Sentiment analysis (positive/negative/neutral)
Quick Start (Python):
```python
import pytest
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"
```
For complete unit evaluation examples, see `examples/python/unit_evaluation.py` and `examples/typescript/unit-evaluation.ts`.

RAG (Retrieval-Augmented Generation) Evaluation


Evaluate RAG systems using RAGAS framework metrics.
Critical Metrics (Priority Order):
  1. Faithfulness (Target: > 0.8) - MOST CRITICAL
    • Measures: Is the answer grounded in retrieved context?
    • Prevents hallucinations
    • If failing: Adjust prompt to emphasize grounding, require citations
  2. Answer Relevance (Target: > 0.7)
    • Measures: How well does the answer address the query?
    • If failing: Improve prompt instructions, add few-shot examples
  3. Context Relevance (Target: > 0.7)
    • Measures: Are retrieved chunks relevant to the query?
    • If failing: Improve retrieval (better embeddings, hybrid search)
  4. Context Precision (Target: > 0.5)
    • Measures: Are relevant chunks ranked higher than irrelevant?
    • If failing: Add re-ranking step to retrieval pipeline
  5. Context Recall (Target: > 0.8)
    • Measures: Are all relevant chunks retrieved?
    • If failing: Increase retrieval count, improve chunking strategy
Quick Start (Python with RAGAS):
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_relevancy])
print(f"Faithfulness: {results['faithfulness']:.2f}")
```
For comprehensive RAG evaluation patterns, see `references/rag-evaluation.md` and `examples/python/ragas_example.py`.

LLM-as-Judge Evaluation


Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.
When to Use:
  • Generation quality assessment (summaries, creative writing)
  • Nuanced evaluation criteria (tone, clarity, helpfulness)
  • Custom rubrics for domain-specific tasks
  • Medium-volume evaluation (100-1,000 samples)
Correlation with Human Judgment: 0.75-0.85 for well-designed rubrics
Best Practices:
  • Use clear, specific rubrics (1-5 scale with detailed criteria)
  • Include few-shot examples in evaluation prompt
  • Average multiple evaluations to reduce variance
  • Be aware of biases (position bias, verbosity bias, self-preference)
Quick Start (Python):
```python
import re
from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content
    # Parse with regex rather than fixed line positions: judges often add preamble.
    score_match = re.search(r"Score:\s*(\d)", content)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", content)
    if not score_match:
        raise ValueError(f"Could not parse judge output: {content!r}")
    score = int(score_match.group(1))
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return score, reasoning
```
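The "average multiple evaluations" practice listed above can wrap any judge function. A minimal sketch (the `judge_fn` parameter is illustrative and would be a function like `evaluate_quality`):

```python
import statistics

def averaged_judge_score(prompt: str, response: str, judge_fn, n: int = 3):
    """Call the judge n times and aggregate, reducing single-call variance.

    judge_fn(prompt, response) -> (score, reasoning).
    Returns (mean score, standard deviation of scores)."""
    scores = [judge_fn(prompt, response)[0] for _ in range(n)]
    spread = statistics.stdev(scores) if n > 1 else 0.0
    return statistics.mean(scores), spread
```

A high spread is itself a signal: if the judge disagrees with itself across runs, the rubric is probably too vague for that sample.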
For detailed LLM-as-judge patterns and prompt templates, see `references/llm-as-judge.md` and `examples/python/llm_as_judge.py`.

Safety and Alignment Evaluation


Measure hallucinations, bias, and toxicity in LLM outputs.

Hallucination Detection


Methods:
  1. Faithfulness to Context (RAG):
    • Use RAGAS faithfulness metric
    • LLM checks if claims are supported by context
    • Score: Supported claims / Total claims
  2. Factual Accuracy (Closed-Book):
    • LLM-as-judge with access to reliable sources
    • Fact-checking APIs (Google Fact Check)
    • Entity-level verification (dates, names, statistics)
  3. Self-Consistency:
    • Generate multiple responses to same question
    • Measure agreement between responses
    • Low consistency suggests hallucination
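Self-consistency can be scored as pairwise agreement across the sampled responses. A minimal sketch using exact match after normalization (semantic-similarity matching would be more robust for free-form answers):

```python
from itertools import combinations

def normalize(answer: str) -> str:
    """Case-fold and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())

def self_consistency(responses: list) -> float:
    """Fraction of response pairs that agree after normalization.
    Low scores suggest the model is guessing (possible hallucination)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    agree = sum(normalize(a) == normalize(b) for a, b in pairs)
    return agree / len(pairs)
```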

Bias Evaluation


Types of Bias:
  • Gender bias (stereotypical associations)
  • Racial/ethnic bias (discriminatory outputs)
  • Cultural bias (Western-centric assumptions)
  • Age/disability bias (ableist or ageist language)
Evaluation Methods:
  1. Stereotype Tests:
    • BBQ (Bias Benchmark for QA): 58,000 question-answer pairs
    • BOLD (Bias in Open-Ended Language Generation)
  2. Counterfactual Evaluation:
    • Generate responses with demographic swaps
    • Example: "Dr. Smith (he/she) recommended..." → compare outputs
    • Measure consistency across variations
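A counterfactual check can be sketched as: fill a demographic slot, generate one response per variant, then measure how often the outputs agree. Helper names here are illustrative, and `consistency_rate` assumes short, normalizable outputs:

```python
from collections import Counter

def demographic_variants(template: str, slot: str = "{pronoun}",
                         values: tuple = ("he", "she", "they")) -> dict:
    """Create counterfactual prompts by filling a demographic slot."""
    return {v: template.replace(slot, v) for v in values}

def consistency_rate(outputs: dict) -> float:
    """Fraction of variant outputs matching the majority output.
    1.0 means no variant-dependent behavior on this prompt."""
    normed = [o.lower().strip() for o in outputs.values()]
    majority_count = Counter(normed).most_common(1)[0][1]
    return majority_count / len(normed)
```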

Toxicity Detection


Tools:
  • Perspective API (Google): Toxicity, threat, insult scores
  • Detoxify (HuggingFace): Open-source toxicity classifier
  • OpenAI Moderation API: Hate, harassment, violence detection
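As a sketch of the third option: the OpenAI Moderation API returns per-category boolean flags for each input. This assumes the v1+ `openai` Python SDK and `OPENAI_API_KEY` being set; the import is deferred so the pure helper works on its own.

```python
def flagged_names(categories: dict) -> list:
    """Names of the moderation categories that fired, sorted for stable output."""
    return sorted(name for name, hit in categories.items() if hit)

def screen_text(text: str):
    """Screen one LLM output with the OpenAI Moderation API.

    Returns (flagged, category_names)."""
    from openai import OpenAI  # deferred: flagged_names() works without the SDK
    client = OpenAI()
    result = client.moderations.create(input=text).results[0]
    return result.flagged, flagged_names(result.categories.model_dump())
```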
For comprehensive safety evaluation patterns, see `references/safety-evaluation.md`.

Benchmark Testing


Assess model capabilities using standardized benchmarks.
Standard Benchmarks:

| Benchmark | Coverage | Format | Difficulty | Use Case |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities) | Multiple choice | High school - professional | General intelligence |
| HellaSwag | Sentence completion | Multiple choice | Common sense | Reasoning validation |
| GPQA | PhD-level science | Multiple choice | Very high (expert-level) | Frontier model testing |
| HumanEval | 164 Python problems | Code generation | Medium | Code capability |
| MATH | 12,500 competition problems | Math solving | High school competitions | Math reasoning |
Domain-Specific Benchmarks:
  • Medical: MedQA (USMLE), PubMedQA
  • Legal: LegalBench
  • Finance: FinQA, ConvFinQA
When to Use Benchmarks:
  • Comparing multiple models (GPT-4 vs Claude vs Llama)
  • Model selection for specific domains
  • Baseline capability assessment
  • Academic research and publication
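Pass@K (used by HumanEval and referenced in the task-type table earlier) is usually computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, "Evaluating Large
    Language Models Trained on Code"): probability that at least one of
    k samples drawn from n generations passes, given c passing samples."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```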
Quick Start (lm-evaluation-harness):
```bash
pip install lm-eval

# Evaluate GPT-4 on MMLU
lm_eval --model openai-chat --model_args model=gpt-4 --tasks mmlu --num_fewshot 5
```
For detailed benchmark testing patterns, see `references/benchmarks.md` and `scripts/benchmark_runner.py`.

Production Evaluation


Monitor and optimize LLM quality in production environments.

A/B Testing


Compare two LLM configurations:
  • Variant A: GPT-4 (expensive, high quality)
  • Variant B: Claude Sonnet (cheaper, fast)
Metrics:
  • User satisfaction scores (thumbs up/down)
  • Task completion rates
  • Response time and latency
  • Cost per successful interaction
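Two of these metrics can be computed with only the standard library: cost normalized per success, and a two-proportion z-test on task completion rates to check whether the variants actually differ. A minimal sketch (function names are illustrative):

```python
from math import sqrt, erf

def cost_per_success(total_cost: float, successes: int) -> float:
    """Cost per successful interaction for one variant."""
    return float("inf") if successes == 0 else total_cost / successes

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-proportion z-test on completion rates; returns (z, two-sided p).
    Assumes counts large enough for the normal approximation."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided p = 2 * (1 - Phi(|z|))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```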

Online Evaluation


Real-time quality monitoring:
  • Response Quality: LLM-as-judge scoring every Nth response
  • User Feedback: Explicit ratings, thumbs up/down
  • Business Metrics: Conversion rates, support ticket resolution
  • Cost Tracking: Tokens used, inference costs
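"Every Nth response" sampling is best done deterministically, so retries and replays of the same request get the same decision. A minimal hash-based sketch (the 10% default mirrors the layered approach above):

```python
import hashlib

def should_judge(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample ~rate of traffic for LLM-as-judge scoring,
    keyed on the request id so the decision is stable across retries."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to a uniform float in [0, 1)
    return int.from_bytes(digest[:8], "big") / 2**64 < rate
```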

Human-in-the-Loop


Sample-based human evaluation:
  • Random Sampling: Evaluate 10% of responses
  • Confidence-Based: Evaluate low-confidence outputs
  • Error-Triggered: Flag suspicious responses for review
For production evaluation patterns and monitoring strategies, see `references/production-evaluation.md`.

Classification Task Evaluation


For tasks with discrete outputs (sentiment, intent, category).
Metrics:
  • Accuracy: Correct predictions / Total predictions
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1 Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed breakdown of prediction errors
Quick Start (Python):
```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```
For complete classification evaluation examples, see `examples/python/classification_metrics.py`.

Generation Task Evaluation


For open-ended text generation (summaries, creative writing, responses).
Automated Metrics (Use with Caution):
  • BLEU: N-gram overlap with reference text (0-1 score)
  • ROUGE: Recall-oriented overlap (ROUGE-1, ROUGE-L)
  • METEOR: Semantic similarity with stemming
  • BERTScore: Contextual embedding similarity (0-1 score)
Limitation: Automated metrics correlate weakly with human judgment for creative/subjective generation.
Recommended Approach:
  1. Automated metrics: Fast feedback for objective aspects (length, format)
  2. LLM-as-judge: Nuanced quality assessment (relevance, coherence, helpfulness)
  3. Human evaluation: Final validation for subjective criteria (preference, creativity)
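To make the automated metrics concrete: ROUGE-1 is simply unigram overlap between candidate and reference. A minimal sketch without stemming (production work should use the `rouge-score` or `evaluate` packages):

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Minimal ROUGE-1: unigram overlap precision, recall, and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

This also illustrates the limitation noted above: a fluent paraphrase with different words scores near zero, which is why ROUGE is unreliable for creative text.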
For detailed generation evaluation patterns, see `references/evaluation-types.md`.

Quick Reference Tables


Evaluation Framework Selection


| If Task Is... | Use This Framework | Primary Metric |
|---|---|---|
| RAG system | RAGAS | Faithfulness > 0.8 |
| Classification | scikit-learn metrics | Accuracy, F1 |
| Generation quality | LLM-as-judge | Quality rubric (1-5) |
| Code generation | HumanEval | Pass@1, Test pass rate |
| Model comparison | Benchmark testing | MMLU, HellaSwag scores |
| Safety validation | Hallucination detection | Faithfulness, Fact-check |
| Production monitoring | Online evaluation | User feedback, Business KPIs |

Python Library Recommendations


| Library | Use Case | Installation |
|---|---|---|
| RAGAS | RAG evaluation | `pip install ragas` |
| DeepEval | General LLM evaluation, pytest integration | `pip install deepeval` |
| LangSmith | Production monitoring, A/B testing | `pip install langsmith` |
| lm-eval | Benchmark testing (MMLU, HumanEval) | `pip install lm-eval` |
| scikit-learn | Classification metrics | `pip install scikit-learn` |

Safety Evaluation Priority Matrix


| Application | Hallucination Risk | Bias Risk | Toxicity Risk | Evaluation Priority |
|---|---|---|---|---|
| Customer Support | High | Medium | High | 1. Faithfulness, 2. Toxicity, 3. Bias |
| Medical Diagnosis | Critical | High | Low | 1. Factual Accuracy, 2. Hallucination, 3. Bias |
| Creative Writing | Low | Medium | Medium | 1. Quality/Fluency, 2. Content Policy |
| Code Generation | Medium | Low | Low | 1. Functional Correctness, 2. Security |
| Content Moderation | Low | Critical | Critical | 1. Bias, 2. False Positives/Negatives |

Detailed References


For comprehensive documentation on specific topics:
  • Evaluation types (classification, generation, QA, code): `references/evaluation-types.md`
  • RAG evaluation deep dive (RAGAS framework): `references/rag-evaluation.md`
  • Safety evaluation (hallucination, bias, toxicity): `references/safety-evaluation.md`
  • Benchmark testing (MMLU, HumanEval, domain benchmarks): `references/benchmarks.md`
  • LLM-as-judge best practices and prompts: `references/llm-as-judge.md`
  • Production evaluation (A/B testing, monitoring): `references/production-evaluation.md`
  • All metrics definitions and formulas: `references/metrics-reference.md`

Working Examples


Python Examples:
  • `examples/python/unit_evaluation.py` - Basic prompt testing with pytest
  • `examples/python/ragas_example.py` - RAGAS RAG evaluation
  • `examples/python/deepeval_example.py` - DeepEval framework usage
  • `examples/python/llm_as_judge.py` - GPT-4 as evaluator
  • `examples/python/classification_metrics.py` - Accuracy, precision, recall
  • `examples/python/benchmark_testing.py` - HumanEval example
TypeScript Examples:
  • `examples/typescript/unit-evaluation.ts` - Vitest + OpenAI
  • `examples/typescript/llm-as-judge.ts` - GPT-4 evaluation
  • `examples/typescript/langsmith-integration.ts` - Production monitoring

Executable Scripts


Run evaluations without loading script code into the model's context (no token cost):
  • `scripts/run_ragas_eval.py` - Run RAGAS evaluation on a dataset
  • `scripts/compare_models.py` - A/B test two models
  • `scripts/benchmark_runner.py` - Run MMLU/HumanEval benchmarks
  • `scripts/hallucination_checker.py` - Detect hallucinations in outputs
Example usage:
```bash
# Run RAGAS evaluation on custom dataset
python scripts/run_ragas_eval.py --dataset data/qa_dataset.json --output results.json

# Compare GPT-4 vs Claude on benchmark
python scripts/compare_models.py --model-a gpt-4 --model-b claude-3-opus --tasks mmlu,humaneval
```

Integration with Other Skills


Related Skills:
  • building-ai-chat: Evaluate AI chat applications (this skill tests what that skill builds)
  • prompt-engineering: Test prompt quality and effectiveness
  • testing-strategies: Apply the testing pyramid to LLM evaluation (unit → integration → E2E)
  • observability: Production monitoring and alerting for LLM quality
  • building-ci-pipelines: Integrate LLM evaluation into CI/CD
Workflow Integration:
  1. Write prompt (use prompt-engineering skill)
  2. Unit test prompt (use llm-evaluation skill)
  3. Build AI feature (use building-ai-chat skill)
  4. Integration test RAG pipeline (use llm-evaluation skill)
  5. Deploy to production (use deploying-applications skill)
  6. Monitor quality (use llm-evaluation + observability skills)

Common Pitfalls


1. Over-reliance on Automated Metrics for Generation
  • BLEU/ROUGE correlate weakly with human judgment for creative text
  • Solution: Layer LLM-as-judge or human evaluation
2. Ignoring Faithfulness in RAG Systems
  • Hallucinations are the #1 RAG failure mode
  • Solution: Prioritize faithfulness metric (target > 0.8)
3. No Production Monitoring
  • Models can degrade over time, prompts can break with updates
  • Solution: Set up continuous evaluation (LangSmith, custom monitoring)
4. Biased LLM-as-Judge Evaluation
  • Evaluator LLMs have biases (position bias, verbosity bias)
  • Solution: Average multiple evaluations, use diverse evaluation prompts
5. Insufficient Benchmark Coverage
  • Single benchmark doesn't capture full model capability
  • Solution: Use 3-5 benchmarks across different domains
6. Missing Safety Evaluation
  • Production LLMs can generate harmful content
  • Solution: Add toxicity, bias, and hallucination checks to evaluation pipeline