# DeepEval
## Overview
DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the `@observe` decorator.

Repository: https://github.com/confident-ai/deepeval
Documentation: https://deepeval.com
## Installation

```bash
pip install -U deepeval
```

Requires Python 3.9+.
## Quick Start

### Basic pytest test
```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    assert_test(test_case, [metric])
```

Run with:

```bash
deepeval test run test_chatbot.py
```
### Environment setup
DeepEval automatically loads `.env.local`, then `.env`:

```bash
# .env
OPENAI_API_KEY="sk-..."
```
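The load order has a practical consequence worth making explicit: keys set by `.env.local` are not overwritten by `.env` (an assumption about precedence, following the usual dotenv convention). A stdlib-only sketch of that behaviour, where `load_env_file` is a hypothetical helper, not DeepEval's actual loader:

```python
# Illustrative sketch only, NOT DeepEval's loader. Assumption: files loaded
# earlier win, so `.env.local` can override `.env`.
import os

def load_env_file(path: str, env: dict) -> None:
    """Parse simple KEY="value" lines; keep keys that are already set."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault means the first file to define a key wins
            env.setdefault(key.strip(), value.strip().strip('"'))

env: dict = {}
for path in (".env.local", ".env"):
    load_env_file(path, env)
```

If the real precedence matters for your setup, verify it against DeepEval's documentation rather than relying on this sketch.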
## Core Workflows

### RAG Evaluation
Evaluate both retrieval and generation phases:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)

# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and nausea.",
    expected_output="Aspirin side effects include gastrointestinal issues.",
    retrieval_context=[
        "Aspirin common side effects: stomach upset, nausea, vomiting.",
        "Serious aspirin side effects: gastrointestinal bleeding.",
    ]
)

evaluate(test_cases=[test_case], metrics=[
    contextual_precision, contextual_recall, contextual_relevancy,
    answer_relevancy, faithfulness
])
```
**Component-level tracing:**

```python
from deepeval.tracing import observe, update_current_span

@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
```

### Conversational AI Evaluation
Test multi-turn dialogues:
```python
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    TurnRelevancyMetric
)

convo_test_case = ConversationalTestCase(
    chatbot_role="professional, empathetic medical assistant",
    turns=[
        Turn(role="user", content="I have a persistent cough"),
        Turn(role="assistant", content="How long have you had this cough?"),
        Turn(role="user", content="About a week now"),
        Turn(role="assistant", content="A week-long cough should be evaluated.")
    ]
)

metrics = [
    RoleAdherenceMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.6),
    TurnRelevancyMetric(threshold=0.7)
]

evaluate(test_cases=[convo_test_case], metrics=metrics)
```

### Agent Evaluation
Test tool usage and task completion:
```python
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase, ToolCall
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolUseMetric,
    ArgumentCorrectnessMetric
)

agent_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="When did Trump first raise tariffs?"),
        Turn(
            role="assistant",
            content="Let me search for that information.",
            tools_called=[
                ToolCall(
                    name="WebSearch",
                    arguments={"query": "Trump first raised tariffs year"}
                )
            ]
        ),
        Turn(role="assistant", content="Trump first raised tariffs in 2018.")
    ]
)

evaluate(
    test_cases=[agent_test_case],
    metrics=[
        TaskCompletionMetric(threshold=0.7),
        ToolUseMetric(threshold=0.7),
        ArgumentCorrectnessMetric(threshold=0.7)
    ]
)
```

### Safety Evaluation
Check for harmful content:
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
    PIILeakageMetric,
    HallucinationMetric
)

def safety_gate(output: str, input: str) -> tuple[bool, list]:
    """Returns (passed, reasons) tuple"""
    test_case = LLMTestCase(input=input, actual_output=output)
    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]
    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            failures.append(f"{metric.__class__.__name__}: {metric.reason}")
    return len(failures) == 0, failures
```
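Before wiring the metric-based gate into an application, it can help to see the gate pattern in isolation. The sketch below stubs the metric calls with a trivial keyword check so it runs without an LLM provider; `stub_gate` and `respond` are hypothetical names, not DeepEval APIs:

```python
# Hypothetical application-boundary wiring for a (passed, reasons) gate.
# The gate is stubbed here; in practice you would call safety_gate above.
def stub_gate(output: str, user_input: str) -> tuple[bool, list]:
    blocked_terms = ["ssn:", "credit card"]  # stand-in for PII/toxicity checks
    reasons = [t for t in blocked_terms if t in output.lower()]
    return (len(reasons) == 0, reasons)

def respond(user_input: str, raw_output: str) -> str:
    passed, reasons = stub_gate(raw_output, user_input)
    if not passed:
        # Log the reasons for review; never surface the unsafe output.
        return "Sorry, I can't share that."
    return raw_output
```

The design point is that the gate returns structured reasons rather than raising, so the caller decides whether to block, log, or fall back.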
## Metric Selection Guide
### RAG Metrics
**Retrieval phase:**
- `ContextualPrecisionMetric` - Relevant chunks ranked higher than irrelevant ones
- `ContextualRecallMetric` - All necessary information retrieved
- `ContextualRelevancyMetric` - Retrieved chunks relevant to input

**Generation phase:**
- `AnswerRelevancyMetric` - Output addresses the input query
- `FaithfulnessMetric` - Output grounded in retrieval context
### Conversational Metrics
- `TurnRelevancyMetric` - Each turn relevant to conversation
- `KnowledgeRetentionMetric` - Information retained across turns
- `ConversationCompletenessMetric` - All aspects addressed
- `RoleAdherenceMetric` - Chatbot maintains assigned role
- `TopicAdherenceMetric` - Conversation stays on topic
### Agent Metrics
- `TaskCompletionMetric` - Task successfully completed
- `ToolUseMetric` - Correct tools selected
- `ArgumentCorrectnessMetric` - Tool arguments correct
- `MCPUseMetric` - MCP correctly used
### Safety Metrics
- `ToxicityMetric` - Harmful content detection
- `BiasMetric` - Biased output identification
- `HallucinationMetric` - Fabricated information detection
- `PIILeakageMetric` - Personal information leakage detection
### Custom Metrics
**G-Eval (LLM-based):**

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_metric = GEval(
    name="Professional Tone",
    criteria="Determine if response maintains professional, empathetic tone",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="anthropic-claude-sonnet-4-5"
)
```

**BaseMetric subclass:**

See `references/custom_metrics.md` for the complete guide on creating custom metrics with BaseMetric subclassing and deterministic scorers (ROUGE, BLEU, BERTScore).
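To show what "deterministic scorer" means in contrast to LLM-based G-Eval, here is a minimal unigram-overlap F1 in the spirit of ROUGE-1. This is an illustration only, not the real ROUGE implementation; production scorers come from dedicated libraries:

```python
# Minimal unigram-overlap F1 (ROUGE-1-flavoured), illustration only.
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared tokens, counted once each
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Deterministic scorers like this are cheap and reproducible, which is why they complement (rather than replace) LLM-judged metrics.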
## Configuration
### LLM Provider Setup
DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. Anthropic models are preferred.

**CLI configuration (global):**

```bash
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
```

**Python configuration (per-metric):**

```python
from deepeval.models import AnthropicModel, OllamaModel

anthropic_model = AnthropicModel(
    model_id=settings.anthropic_model_id,
    client_args={"api_key": settings.anthropic_api_key},
    temperature=settings.agent_temperature
)

metric = AnswerRelevancyMetric(model=anthropic_model)
```

See `references/model_providers.md` for the complete provider configuration guide.

## Performance Optimisation
Async mode is enabled by default. Configure it with `AsyncConfig` and `CacheConfig`:

```python
from deepeval import evaluate, AsyncConfig, CacheConfig

evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,
        max_concurrent=20,  # Reduce if rate limited
        throttle_value=0    # Delay between test cases (seconds)
    ),
    cache_config=CacheConfig(
        use_cache=True,    # Read from cache
        write_cache=True   # Write to cache
    )
)
```

**CLI parallelisation:**

```bash
deepeval test run -n 4 -c -i  # 4 processes, cached, ignore errors
```

**Best practices:**

- Limit to 5 metrics maximum (2-3 generic + 1-2 custom)
- Use the latest available Anthropic Claude Sonnet or Haiku models
- Reduce `max_concurrent` to 5 if hitting rate limits
- Use the `evaluate()` function over individual `measure()` calls

See `references/async_performance.md` for the detailed performance optimisation guide.
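To build intuition for what `max_concurrent` bounds, the sketch below caps in-flight coroutines with an `asyncio.Semaphore`. This is an illustrative stub of the concept, not DeepEval internals; `run_bounded` is a hypothetical helper:

```python
# Illustration of a max_concurrent-style bound: at most `max_concurrent`
# evaluation calls are in flight at once, the rest wait on the semaphore.
import asyncio

async def run_bounded(tasks, max_concurrent: int = 20):
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(coro_fn):
        async with sem:
            return await coro_fn()

    return await asyncio.gather(*(bounded(fn) for fn in tasks))
```

Lowering the bound trades throughput for fewer simultaneous provider requests, which is exactly the lever to pull when rate limited.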
## Dataset Management
### Loading datasets
```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# From CSV
dataset.add_goldens_from_csv_file(
    file_path="./test_data.csv",
    input_col_name="question",
    expected_output_col_name="answer",
    context_col_name="context",
    context_col_delimiter="|"
)

# From JSON
dataset.add_goldens_from_json_file(
    file_path="./test_data.json",
    input_key_name="query",
    expected_output_key_name="response"
)
```
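For reference, a stdlib sketch of the CSV shape the loader above expects: `question` and `answer` columns, plus a `context` column whose chunks are joined with the configured `|` delimiter. The file name and row contents are illustrative:

```python
# Write a sample test_data.csv matching the column names configured above.
import csv

rows = [
    {"question": "What if these shoes don't fit?",
     "answer": "We offer a 30-day full refund",
     # Two context chunks, joined with the "|" delimiter
     "context": "Refund policy: 30 days|All customers eligible"},
]

with open("test_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "context"])
    writer.writeheader()
    writer.writerows(rows)
```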
### Synthetic generation
```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# From documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/knowledge_base.pdf"],
    max_goldens_per_document=10,
    evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)

# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
    subject="customer support for SaaS product",
    task="answer user questions about billing",
    max_goldens=20
)
```

**Evolution types:** REASONING, MULTICONTEXT, CONCRETISING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH

See `references/dataset_management.md` for the complete dataset guide, including versioning and cloud integration.

## Test Case Types
### Single-turn (LLMTestCase)
```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days for full refund",
    expected_output="We offer 30-day full refund",
    retrieval_context=["All customers eligible for 30 day refund"],
    tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)
```

### Multi-turn (ConversationalTestCase)
```python
from deepeval.test_case import Turn, ConversationalTestCase

convo_test_case = ConversationalTestCase(
    chatbot_role="helpful customer service agent",
    turns=[
        Turn(role="user", content="I need help with my order"),
        Turn(role="assistant", content="I'd be happy to help"),
        Turn(role="user", content="It hasn't arrived yet")
    ]
)
```

### Multimodal (MLLMTestCase)
```python
from deepeval.test_case import MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
    actual_output=["A red bicycle leaning against a wall"]
)
```

## CI/CD Integration
```yaml
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
```

## References
Detailed implementation guides:

- `references/model_providers.md` - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.
- `references/custom_metrics.md` - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.
- `references/async_performance.md` - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.
- `references/dataset_management.md` - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.

## Best Practices
### Metric Selection
- Match metrics to use case (RAG systems need retrieval + generation metrics)
- Start with 2-3 essential metrics, expand as needed
- Use appropriate thresholds (0.7-0.8 for production, 0.5-0.6 for development)
- Combine complementary metrics (answer relevancy + faithfulness)
### Test Case Design
- Create representative examples covering common queries and edge cases
- Include context when needed (`retrieval_context` for RAG, `expected_output` for G-Eval)
- Use datasets for scale testing
- Version test cases over time
### Evaluation Workflow
- Component-level first - use `@observe` for individual parts
- End-to-end validation before deployment
- Automate in CI/CD with `deepeval test run`
- Track results over time with Confident AI cloud
### Testing Anti-Patterns
**Avoid:**
- Testing only happy paths
- Using unrealistic inputs
- Ignoring metric reasons
- Setting thresholds too high initially
- Running full test suite on every change
**Do:**
- Test edge cases and failure modes
- Use real user queries as test inputs
- Read and analyse metric reasons
- Adjust thresholds based on empirical results
- Use component-level tests during development
- Separate config and eval content from code
需避免:
- 仅测试正常路径
- 使用不切实际的输入
- 忽略指标失败原因
- 初始设置过高的阈值
- 每次变更都运行完整测试套件
推荐做法:
- 测试边缘场景与故障模式
- 使用真实用户查询作为测试输入
- 阅读并分析指标失败原因
- 根据实际结果调整阈值
- 开发阶段使用组件级测试
- 将配置与评估内容从代码中分离