deepeval

DeepEval

Overview

DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the `@observe` decorator.

Installation

bash
pip install -U deepeval
Requires Python 3.9+.

Quick Start

Basic pytest test

python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    assert_test(test_case, [metric])
Run with:
deepeval test run test_chatbot.py

Environment setup

DeepEval automatically loads `.env.local`, then `.env`:
bash
# .env
OPENAI_API_KEY="sk-..."
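The load order matters when the same key appears in more than one file. The sketch below is not DeepEval's loader; it is a stand-alone illustration that assumes values already present in the process environment win, followed by `.env.local`, then `.env`. The helper names `parse_dotenv` and `resolve_env` are hypothetical.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_env(process_env: dict[str, str], *dotenv_texts: str) -> dict[str, str]:
    """Merge sources so that an earlier source is never overwritten by a later one."""
    merged = dict(process_env)
    for text in dotenv_texts:
        for key, value in parse_dotenv(text).items():
            merged.setdefault(key, value)
    return merged


# Assumed file contents, for illustration only.
env_local = 'OPENAI_API_KEY="sk-local"\n'
env_shared = '# .env\nOPENAI_API_KEY="sk-shared"\nLOG_LEVEL="info"\n'

resolved = resolve_env({}, env_local, env_shared)
```

Under that assumption, a key defined in `.env.local` shadows the same key in `.env`, which makes `.env.local` a natural place for per-developer secrets.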

Core Workflows

RAG Evaluation

Evaluate both retrieval and generation phases:
python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)

# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and nausea.",
    expected_output="Aspirin side effects include gastrointestinal issues.",
    retrieval_context=[
        "Aspirin common side effects: stomach upset, nausea, vomiting.",
        "Serious aspirin side effects: gastrointestinal bleeding.",
    ]
)

evaluate(
    test_cases=[test_case],
    metrics=[
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
        answer_relevancy,
        faithfulness
    ]
)

**Component-level tracing:**

```python
from deepeval.tracing import observe, update_current_span

@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
```

Conversational AI Evaluation

Test multi-turn dialogues:
python
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    TurnRelevancyMetric
)

convo_test_case = ConversationalTestCase(
    chatbot_role="professional, empathetic medical assistant",
    turns=[
        Turn(role="user", content="I have a persistent cough"),
        Turn(role="assistant", content="How long have you had this cough?"),
        Turn(role="user", content="About a week now"),
        Turn(role="assistant", content="A week-long cough should be evaluated.")
    ]
)

metrics = [
    RoleAdherenceMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.6),
    TurnRelevancyMetric(threshold=0.7)
]

evaluate(test_cases=[convo_test_case], metrics=metrics)

Agent Evaluation

Test tool usage and task completion:
python
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, ToolCall, Turn
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolUseMetric,
    ArgumentCorrectnessMetric
)

agent_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="When did Trump first raise tariffs?"),
        Turn(
            role="assistant",
            content="Let me search for that information.",
            tools_called=[
                ToolCall(
                    name="WebSearch",
                    arguments={"query": "Trump first raised tariffs year"}
                )
            ]
        ),
        Turn(role="assistant", content="Trump first raised tariffs in 2018.")
    ]
)

evaluate(
    test_cases=[agent_test_case],
    metrics=[
        TaskCompletionMetric(threshold=0.7),
        ToolUseMetric(threshold=0.7),
        ArgumentCorrectnessMetric(threshold=0.7)
    ]
)

Safety Evaluation

Check for harmful content:
python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
    PIILeakageMetric,
    HallucinationMetric
)

def safety_gate(output: str, input: str) -> tuple[bool, list]:
    """Returns (passed, reasons) tuple"""
    test_case = LLMTestCase(input=input, actual_output=output)

    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]

    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            failures.append(f"{metric.__name__}: {metric.reason}")

    return len(failures) == 0, failures
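One detail is easy to miss when reading the gate above: DeepEval's documentation describes the threshold for metrics such as toxicity and bias as a maximum (a lower score is better), whereas quality metrics such as answer relevancy treat it as a minimum. The sketch below shows that pass/fail logic offline; `StubMetric` and `gate` are hypothetical stand-ins, not DeepEval classes, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class StubMetric:
    """Offline stand-in for a scored metric; not part of DeepEval."""
    name: str
    threshold: float
    score: float
    lower_is_better: bool = False

    def is_successful(self) -> bool:
        # Safety-style metrics pass when the score stays at or below the threshold.
        if self.lower_is_better:
            return self.score <= self.threshold
        return self.score >= self.threshold


def gate(metrics: list[StubMetric]) -> tuple[bool, list[str]]:
    """Return (passed, reasons), mirroring the shape of safety_gate."""
    failures = [
        f"{m.name}: score {m.score:.2f} vs threshold {m.threshold:.2f}"
        for m in metrics
        if not m.is_successful()
    ]
    return not failures, failures


passed, reasons = gate([
    StubMetric("Toxicity", threshold=0.5, score=0.1, lower_is_better=True),
    StubMetric("Bias", threshold=0.5, score=0.8, lower_is_better=True),
    StubMetric("Answer Relevancy", threshold=0.7, score=0.9),
])
```

Here only the bias score exceeds its maximum, so the gate fails with a single reason.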

Metric Selection Guide

RAG Metrics

Retrieval Phase:
  • ContextualPrecisionMetric - Relevant chunks ranked higher than irrelevant ones
  • ContextualRecallMetric - All necessary information retrieved
  • ContextualRelevancyMetric - Retrieved chunks relevant to input
Generation Phase:
  • AnswerRelevancyMetric - Output addresses the input query
  • FaithfulnessMetric - Output grounded in retrieval context
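To make "relevant chunks ranked higher than irrelevant ones" concrete, the sketch below computes a weighted cumulative precision over a ranked list of relevance labels, the kind of rank-aware score DeepEval's documentation describes for contextual precision. It is an illustration only: in DeepEval the relevance verdicts come from an LLM judge, whereas here they are hard-coded booleans, and the function name is hypothetical.

```python
def weighted_cumulative_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks that hold a relevant chunk."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / total_relevant


# Same chunks, different ordering: relevant chunks first scores higher.
good_order = weighted_cumulative_precision([True, True, False])
bad_order = weighted_cumulative_precision([False, True, True])
```

With two relevant chunks, placing them first yields 1.0, while pushing them behind an irrelevant chunk yields (1/2 + 2/3) / 2 = 7/12.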

Conversational Metrics

  • TurnRelevancyMetric - Each turn relevant to conversation
  • KnowledgeRetentionMetric - Information retained across turns
  • ConversationCompletenessMetric - All aspects addressed
  • RoleAdherenceMetric - Chatbot maintains assigned role
  • TopicAdherenceMetric - Conversation stays on topic

Agent Metrics

  • TaskCompletionMetric - Task successfully completed
  • ToolUseMetric - Correct tools selected
  • ArgumentCorrectnessMetric - Tool arguments correct
  • MCPUseMetric - MCP correctly used

Safety Metrics

  • ToxicityMetric - Harmful content detection
  • BiasMetric - Biased output identification
  • HallucinationMetric - Fabricated information
  • PIILeakageMetric - Personal information leakage

Custom Metrics

G-Eval (LLM-based):
python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_metric = GEval(
    name="Professional Tone",
    criteria="Determine if response maintains professional, empathetic tone",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="anthropic-claude-sonnet-4-5"
)
BaseMetric subclass:
See `references/custom_metrics.md` for a complete guide to creating custom metrics with BaseMetric subclassing and deterministic scorers (ROUGE, BLEU, BERTScore).
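As a rough idea of what a deterministic scorer computes, the sketch below implements a simplified ROUGE-1-style unigram F1 using whitespace tokenisation and no stemming. It is not the implementation from the referenced guide, and `unigram_f1` is a hypothetical name; a function like this could supply the score inside a custom metric.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping lowercase unigrams (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1(
    "You have 30 days for full refund",
    "We offer 30-day full refund",
)
```

Only "full" and "refund" overlap here ("30" and "30-day" are different tokens), which illustrates why lexical scorers are best paired with an LLM-based metric when paraphrases are expected.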

Configuration

LLM Provider Setup

DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. Anthropic models are preferred.
CLI configuration (global):
bash
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
Python configuration (per-metric):
python
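from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import AnthropicModel, OllamaModel

# `settings` stands for your application's own configuration object; it is not part of DeepEval.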

anthropic_model = AnthropicModel(
    model_id=settings.anthropic_model_id,
    client_args={"api_key": settings.anthropic_api_key},
    temperature=settings.agent_temperature
)

metric = AnswerRelevancyMetric(model=anthropic_model)
See `references/model_providers.md` for the complete provider configuration guide.

Performance Optimisation

Async mode is enabled by default. Configure with `AsyncConfig` and `CacheConfig`:
python
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig, CacheConfig

evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,
        max_concurrent=20,    # Reduce if rate limited
        throttle_value=0      # Delay between test cases (seconds)
    ),
    cache_config=CacheConfig(
        use_cache=True,       # Read from cache
        write_cache=True      # Write to cache
    )
)
CLI parallelisation:
bash
deepeval test run -n 4 -c -i  # 4 processes, cached, ignore errors
Best practices:
  • Limit to 5 metrics maximum (2-3 generic + 1-2 custom)
  • Use the latest available Anthropic Claude Sonnet or Haiku models
  • Reduce `max_concurrent` to 5 if hitting rate limits
  • Prefer the `evaluate()` function over individual `measure()` calls
See `references/async_performance.md` for a detailed performance optimisation guide.
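To see why lowering `max_concurrent` helps with rate limits, the sketch below caps the number of in-flight jobs with an `asyncio.Semaphore` and records the peak concurrency. It illustrates the general technique only, not DeepEval's internals; `run_with_limit` is a hypothetical helper and the sleep stands in for an LLM call.

```python
import asyncio


async def run_with_limit(n_tasks: int, max_concurrent: int, throttle_value: float = 0.0) -> int:
    """Run n_tasks dummy jobs and return the peak number running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def job() -> None:
        nonlocal running, peak
        async with semaphore:  # at most max_concurrent jobs pass this point
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for an LLM call
            running -= 1

    tasks = []
    for _ in range(n_tasks):
        tasks.append(asyncio.create_task(job()))
        if throttle_value:
            await asyncio.sleep(throttle_value)  # stagger task start times
    await asyncio.gather(*tasks)
    return peak


peak = asyncio.run(run_with_limit(n_tasks=12, max_concurrent=5))
```

With twelve jobs and a cap of five, no more than five simulated calls are ever in flight, so the provider sees a bounded request rate.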

Dataset Management

Loading datasets

python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# From CSV
dataset.add_goldens_from_csv_file(
    file_path="./test_data.csv",
    input_col_name="question",
    expected_output_col_name="answer",
    context_col_name="context",
    context_col_delimiter="|"
)

# From JSON
dataset.add_goldens_from_json_file(
    file_path="./test_data.json",
    input_key_name="query",
    expected_output_key_name="response"
)
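The column names and the `|` delimiter in the CSV call imply a particular file layout. The sketch below parses such a file with the standard library only, to show how one context cell splits into several chunks; it is not DeepEval's loader, and the sample row and the `read_goldens` helper are made up for illustration.

```python
import csv
import io

# Assumed layout of test_data.csv, matching the column names used above.
csv_text = (
    "question,answer,context\n"
    "What is the refund window?,30 days,Refunds within 30 days|Receipt required\n"
)


def read_goldens(text: str, delimiter: str = "|") -> list[dict]:
    """Turn each CSV row into a dict with a list-valued context field."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "input": row["question"],
            "expected_output": row["answer"],
            "context": row["context"].split(delimiter) if row["context"] else [],
        })
    return rows


rows = read_goldens(csv_text)
```

Choosing a delimiter that never occurs inside a context chunk avoids accidental splits.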

Synthetic generation

python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# From documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/knowledge_base.pdf"],
    max_goldens_per_document=10,
    evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)

# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
    subject="customer support for SaaS product",
    task="answer user questions about billing",
    max_goldens=20
)

**Evolution types:** REASONING, MULTICONTEXT, CONCRETIZING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH

See `references/dataset_management.md` for the complete dataset guide, including versioning and cloud integration.

Test Case Types

Single-turn (LLMTestCase)

python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days for full refund",
    expected_output="We offer 30-day full refund",
    retrieval_context=["All customers eligible for 30 day refund"],
    tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)

Multi-turn (ConversationalTestCase)

python
from deepeval.test_case import Turn, ConversationalTestCase

convo_test_case = ConversationalTestCase(
    chatbot_role="helpful customer service agent",
    turns=[
        Turn(role="user", content="I need help with my order"),
        Turn(role="assistant", content="I'd be happy to help"),
        Turn(role="user", content="It hasn't arrived yet")
    ]
)

Multimodal (MLLMTestCase)

python
from deepeval.test_case import MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
    actual_output=["A red bicycle leaning against a wall"]
)

CI/CD Integration

yaml
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/

References

Detailed implementation guides:
  • references/model_providers.md - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.
  • references/custom_metrics.md - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.
  • references/async_performance.md - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.
  • references/dataset_management.md - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.

Best Practices

Metric Selection

  • Match metrics to use case (RAG systems need retrieval + generation metrics)
  • Start with 2-3 essential metrics, expand as needed
  • Use appropriate thresholds (0.7-0.8 for production, 0.5-0.6 for development)
  • Combine complementary metrics (answer relevancy + faithfulness)
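One way to apply the threshold guidance above is to keep a single per-environment table rather than scattering literals across tests. The sketch below assumes two environment names and picks illustrative values inside the stated ranges (0.75 and 0.55); the names `THRESHOLDS` and `threshold_for` are hypothetical.

```python
# Illustrative values chosen from the ranges given above, not recommendations.
THRESHOLDS = {
    "production": 0.75,   # within the 0.7-0.8 production range
    "development": 0.55,  # within the 0.5-0.6 development range
}


def threshold_for(environment: str) -> float:
    """Look up the metric threshold for a named environment."""
    try:
        return THRESHOLDS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


dev_threshold = threshold_for("development")
prod_threshold = threshold_for("production")
```

The returned value can then be passed as the `threshold` argument when constructing a metric, so that tightening production gates is a one-line change.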

Test Case Design

  • Create representative examples covering common queries and edge cases
  • Include context when needed (`retrieval_context` for RAG, `expected_output` for G-Eval)
  • Use datasets for scale testing
  • Version test cases over time

Evaluation Workflow

  • Component-level first - Use `@observe` for individual parts
  • End-to-end validation before deployment
  • Automate in CI/CD with `deepeval test run`
  • Track results over time with Confident AI cloud

Testing Anti-Patterns

Avoid:
  • Testing only happy paths
  • Using unrealistic inputs
  • Ignoring metric reasons
  • Setting thresholds too high initially
  • Running full test suite on every change
Do:
  • Test edge cases and failure modes
  • Use real user queries as test inputs
  • Read and analyse metric reasons
  • Adjust thresholds based on empirical results
  • Use component-level tests during development
  • Separate config and eval content from code