DeepEval

Overview

DeepEval is a pytest-based framework for testing LLM applications. It provides 50+ evaluation metrics covering RAG pipelines, conversational AI, agents, safety, and custom criteria. DeepEval integrates into development workflows through pytest, supports multiple LLM providers, and includes component-level tracing with the `@observe` decorator.

Repository: https://github.com/confident-ai/deepeval
Documentation: https://deepeval.com

Installation

```bash
pip install -U deepeval
```

Requires Python 3.9+.

Quick Start

Basic pytest test

```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_chatbot():
    metric = AnswerRelevancyMetric(threshold=0.7, model="anthropic-claude-sonnet-4-5")
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days for full refund"
    )
    # Fails the test if the metric score falls below the threshold
    assert_test(test_case, [metric])
```

Run with:

```bash
deepeval test run test_chatbot.py
```

Environment setup

DeepEval automatically loads `.env.local`, then `.env`:

```bash
# .env
OPENAI_API_KEY="sk-..."
```
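To picture how two env files and the process environment might combine, here is a minimal standard-library sketch. It assumes that a value already set in the process environment wins over `.env.local`, which in turn wins over `.env`; confirm the exact rule in the DeepEval documentation. The helper names `parse_env_file` and `resolve_env` are hypothetical and not part of DeepEval.

```python
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes such as OPENAI_API_KEY="sk-..."
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_env(process_env: dict[str, str], directory: Path) -> dict[str, str]:
    """Merge sources under the assumed precedence: process env, .env.local, .env."""
    merged = parse_env_file(directory / ".env")                # lowest priority
    merged.update(parse_env_file(directory / ".env.local"))    # overrides .env
    merged.update(process_env)                                 # highest priority
    return merged
```

Under this assumption, a key defined in both files resolves to the `.env.local` value unless the shell has already exported it.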

Core Workflows

RAG Evaluation

Evaluate both retrieval and generation phases:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Retrieval metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)

# Generation metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.8)

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and nausea.",
    expected_output="Aspirin side effects include gastrointestinal issues.",
    retrieval_context=[
        "Aspirin common side effects: stomach upset, nausea, vomiting.",
        "Serious aspirin side effects: gastrointestinal bleeding.",
    ]
)

evaluate(
    test_cases=[test_case],
    metrics=[
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
        answer_relevancy,
        faithfulness
    ]
)
```


**Component-level tracing:**

```python
from deepeval.tracing import observe, update_current_span

@observe(metrics=[contextual_relevancy])
def retriever(query: str):
    chunks = your_vector_db.search(query)
    update_current_span(
        test_case=LLMTestCase(input=query, retrieval_context=chunks)
    )
    return chunks

@observe(metrics=[answer_relevancy, faithfulness])
def generator(query: str, chunks: list):
    response = your_llm.generate(query, chunks)
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=response,
            retrieval_context=chunks
        )
    )
    return response

@observe
def rag_pipeline(query: str):
    chunks = retriever(query)
    return generator(query, chunks)
```

Conversational AI Evaluation

Test multi-turn dialogues:

```python
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    TurnRelevancyMetric
)

convo_test_case = ConversationalTestCase(
    chatbot_role="professional, empathetic medical assistant",
    turns=[
        Turn(role="user", content="I have a persistent cough"),
        Turn(role="assistant", content="How long have you had this cough?"),
        Turn(role="user", content="About a week now"),
        Turn(role="assistant", content="A week-long cough should be evaluated.")
    ]
)

metrics = [
    RoleAdherenceMetric(threshold=0.7),
    KnowledgeRetentionMetric(threshold=0.7),
    ConversationCompletenessMetric(threshold=0.6),
    TurnRelevancyMetric(threshold=0.7)
]

evaluate(test_cases=[convo_test_case], metrics=metrics)
```

Agent Evaluation

Test tool usage and task completion:

```python
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, ToolCall, Turn
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolUseMetric,
    ArgumentCorrectnessMetric
)

agent_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="When did Trump first raise tariffs?"),
        Turn(
            role="assistant",
            content="Let me search for that information.",
            tools_called=[
                ToolCall(
                    name="WebSearch",
                    arguments={"query": "Trump first raised tariffs year"}
                )
            ]
        ),
        Turn(role="assistant", content="Trump first raised tariffs in 2018.")
    ]
)

evaluate(
    test_cases=[agent_test_case],
    metrics=[
        TaskCompletionMetric(threshold=0.7),
        ToolUseMetric(threshold=0.7),
        ArgumentCorrectnessMetric(threshold=0.7)
    ]
)
```

Safety Evaluation

Check for harmful content:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ToxicityMetric,
    BiasMetric,
    PIILeakageMetric,
    HallucinationMetric
)

def safety_gate(output: str, input: str) -> tuple[bool, list]:
    """Returns (passed, reasons) tuple"""
    test_case = LLMTestCase(input=input, actual_output=output)

    safety_metrics = [
        ToxicityMetric(threshold=0.5),
        BiasMetric(threshold=0.5),
        PIILeakageMetric(threshold=0.5)
    ]

    failures = []
    for metric in safety_metrics:
        metric.measure(test_case)
        if not metric.is_successful():
            # Label each failure with the metric's class name
            failures.append(f"{type(metric).__name__}: {metric.reason}")

    return len(failures) == 0, failures
```

Metric Selection Guide

RAG Metrics

Retrieval Phase:

- `ContextualPrecisionMetric` - Relevant chunks ranked higher than irrelevant ones
- `ContextualRecallMetric` - All necessary information retrieved
- `ContextualRelevancyMetric` - Retrieved chunks relevant to input

Generation Phase:

- `AnswerRelevancyMetric` - Output addresses the input query
- `FaithfulnessMetric` - Output grounded in retrieval context
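Rank sensitivity is what separates `ContextualPrecisionMetric` from a plain precision score. The following is a minimal sketch of a rank-weighted precision, assuming each retrieved chunk has already been labelled relevant or not (DeepEval delegates that judgement to an LLM; here the flags are supplied by hand). The function name is hypothetical and this is an illustration, not DeepEval's implementation.

```python
def rank_weighted_precision(relevance: list[bool]) -> float:
    """Average the precision-at-k over every position k that holds a relevant chunk.

    `relevance[i]` is True when the chunk at rank i + 1 is relevant.
    Relevant chunks near the top contribute higher precision values.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    hits = 0
    score = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision among the top-k chunks
    return score / total_relevant


# Same chunks, different order: ranking the irrelevant chunk last scores higher.
good_order = rank_weighted_precision([True, True, False])
bad_order = rank_weighted_precision([False, True, True])
```

With two relevant chunks out of three, placing the irrelevant chunk first lowers the score even though the retrieved set is identical.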

Conversational Metrics

- `TurnRelevancyMetric` - Each turn relevant to conversation
- `KnowledgeRetentionMetric` - Information retained across turns
- `ConversationCompletenessMetric` - All aspects addressed
- `RoleAdherenceMetric` - Chatbot maintains assigned role
- `TopicAdherenceMetric` - Conversation stays on topic

Agent Metrics

- `TaskCompletionMetric` - Task successfully completed
- `ToolUseMetric` - Correct tools selected
- `ArgumentCorrectnessMetric` - Tool arguments correct
- `MCPUseMetric` - MCP correctly used

Safety Metrics

- `ToxicityMetric` - Harmful content detection
- `BiasMetric` - Biased outputs identification
- `HallucinationMetric` - Fabricated information
- `PIILeakageMetric` - Personal information leakage

Custom Metrics

G-Eval (LLM-based):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_metric = GEval(
    name="Professional Tone",
    criteria="Determine if response maintains professional, empathetic tone",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model="anthropic-claude-sonnet-4-5"
)
```

BaseMetric subclass:

See `references/custom_metrics.md` for a complete guide to creating custom metrics by subclassing BaseMetric and using deterministic scorers (ROUGE, BLEU, BERTScore).
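For a sense of what a deterministic scorer computes, here is a minimal sketch of a unigram-overlap F1, a simplified relative of ROUGE-1. It uses whitespace tokenisation and no stemming, so it will not match the numbers produced by a real ROUGE library, and the function names are hypothetical. Logic like this could sit inside a BaseMetric subclass, which is not shown here.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over the words shared between candidate and reference (case-insensitive)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Multiset intersection: each word counts at most as often as it appears in both
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def passes(candidate: str, reference: str, threshold: float = 0.7) -> bool:
    """Apply the same pass/fail convention as the metrics above: score >= threshold."""
    return unigram_f1(candidate, reference) >= threshold
```

Because the score depends only on the two strings, repeated runs give the same result, which makes such scorers cheap to run on every change.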

Configuration

LLM Provider Setup

DeepEval supports OpenAI, Anthropic Claude, Google Gemini, AWS Bedrock, and 100+ providers via LiteLLM. Anthropic models are preferred.

CLI configuration (global):

```bash
deepeval set-azure-openai --openai-endpoint=... --openai-api-key=... --deployment-name=...
deepeval set-ollama deepseek-r1:1.5b
```

Python configuration (per-metric):

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import AnthropicModel, OllamaModel

# `settings` stands for your application's own configuration object
anthropic_model = AnthropicModel(
    model_id=settings.anthropic_model_id,
    client_args={"api_key": settings.anthropic_api_key},
    temperature=settings.agent_temperature
)

metric = AnswerRelevancyMetric(model=anthropic_model)
```

See `references/model_providers.md` for the complete provider configuration guide.

Performance Optimisation

Async mode is enabled by default. Configure with `AsyncConfig` and `CacheConfig`:

```python
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig, CacheConfig

evaluate(
    test_cases=[...],
    metrics=[...],
    async_config=AsyncConfig(
        run_async=True,
        max_concurrent=20,    # Reduce if rate limited
        throttle_value=0      # Delay between test cases (seconds)
    ),
    cache_config=CacheConfig(
        use_cache=True,       # Read from cache
        write_cache=True      # Write to cache
    )
)
```

CLI parallelisation:

```bash
deepeval test run tests/ -n 4 -c -i  # 4 processes, cached, ignore errors
```

Best practices:

- Limit to 5 metrics maximum (2-3 generic + 1-2 custom)
- Use the latest available Anthropic Claude Sonnet or Haiku models
- Reduce `max_concurrent` to 5 if hitting rate limits
- Use the `evaluate()` function over individual `measure()` calls

See `references/async_performance.md` for a detailed performance optimisation guide.
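The effect of a concurrency cap such as `max_concurrent` can be illustrated with a plain `asyncio.Semaphore`. This is a minimal sketch of the general technique, not DeepEval's internals; `fake_evaluation` and `run_bounded` are hypothetical names, and the sleep stands in for an LLM call.

```python
import asyncio


async def fake_evaluation(case_id: int) -> float:
    """Stand-in for one metric evaluation; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return case_id / 10


async def run_bounded(jobs, max_concurrent: int):
    """Run zero-argument coroutine factories with at most `max_concurrent` in flight.

    Returns the results in submission order plus the peak concurrency observed.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def worker(job):
        nonlocal active, peak
        async with semaphore:  # waits here while the cap is reached
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak


jobs = [lambda i=i: fake_evaluation(i) for i in range(12)]
results, peak = asyncio.run(run_bounded(jobs, max_concurrent=5))
```

Lowering the cap trades throughput for fewer simultaneous requests, which is why reducing it helps when a provider starts returning rate-limit errors.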

Dataset Management

Loading datasets

```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()

# From CSV
dataset.add_goldens_from_csv_file(
    file_path="./test_data.csv",
    input_col_name="question",
    expected_output_col_name="answer",
    context_col_name="context",
    context_col_delimiter="|"
)

# From JSON
dataset.add_goldens_from_json_file(
    file_path="./test_data.json",
    input_key_name="query",
    expected_output_key_name="response"
)
```
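To make the expected CSV shape concrete, here is a minimal standard-library sketch of a file whose `context` column packs several chunks into one cell, separated by `|`. The sample rows and the `read_rows` helper are hypothetical, and DeepEval's own parsing may differ in detail.

```python
import csv
import io

# Hypothetical file contents matching the column names used in the CSV call above
SAMPLE_CSV = """question,answer,context
What is the refund window?,30 days,Refunds within 30 days|Receipt required
Do you ship abroad?,Yes,International shipping available
"""


def read_rows(text: str, delimiter: str = "|") -> list[dict]:
    """Map each CSV row to input / expected_output / context fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "input": row["question"],
            "expected_output": row["answer"],
            # One cell holds several context chunks, split on the delimiter
            "context": [c.strip() for c in row["context"].split(delimiter) if c.strip()],
        })
    return rows


rows = read_rows(SAMPLE_CSV)
```

Choosing a delimiter that never appears inside a chunk matters here, since a stray `|` would split one chunk into two.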

Synthetic generation

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# From documents
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["./docs/knowledge_base.pdf"],
    max_goldens_per_document=10,
    evolution_types=["REASONING", "MULTICONTEXT", "COMPARATIVE"]
)

# From scratch
goldens = synthesizer.generate_goldens_from_scratch(
    subject="customer support for SaaS product",
    task="answer user questions about billing",
    max_goldens=20
)
```

**Evolution types:** REASONING, MULTICONTEXT, CONCRETIZING, CONSTRAINED, COMPARATIVE, HYPOTHETICAL, IN_BREADTH

See `references/dataset_management.md` for a complete dataset guide, including versioning and cloud integration.

Test Case Types

Single-turn (LLMTestCase)

```python
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days for full refund",
    expected_output="We offer 30-day full refund",
    retrieval_context=["All customers eligible for 30 day refund"],
    tools_called=[ToolCall(name="...", arguments={"...": "..."})]
)
```

Multi-turn (ConversationalTestCase)

```python
from deepeval.test_case import Turn, ConversationalTestCase

convo_test_case = ConversationalTestCase(
    chatbot_role="helpful customer service agent",
    turns=[
        Turn(role="user", content="I need help with my order"),
        Turn(role="assistant", content="I'd be happy to help"),
        Turn(role="user", content="It hasn't arrived yet")
    ]
)
```

Multimodal (MLLMTestCase)

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Describe this image", MLLMImage(url="./photo.png", local=True)],
    actual_output=["A red bicycle leaning against a wall"]
)
```

CI/CD Integration

```yaml
# .github/workflows/test.yml
name: LLM Tests
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Install dependencies
        run: pip install deepeval
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/
```

References

Detailed implementation guides:

- `references/model_providers.md` - Complete guide for configuring OpenAI, Anthropic, Gemini, Bedrock, and local models. Includes provider-specific considerations, cost analysis, and troubleshooting.
- `references/custom_metrics.md` - Complete guide for creating custom metrics by subclassing BaseMetric. Includes deterministic scorers (ROUGE, BLEU, BERTScore) and LLM-based evaluation patterns.
- `references/async_performance.md` - Complete guide for optimising evaluation performance with async mode, caching, concurrency tuning, and rate limit handling.
- `references/dataset_management.md` - Complete guide for dataset loading, saving, synthetic generation, versioning, and cloud integration with Confident AI.

Best Practices

Metric Selection

- Match metrics to use case (RAG systems need retrieval + generation metrics)
- Start with 2-3 essential metrics, expand as needed
- Use appropriate thresholds (0.7-0.8 for production, 0.5-0.6 for development)
- Combine complementary metrics (answer relevancy + faithfulness)
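One way to keep the production and development thresholds apart is to look them up by environment name. This is a minimal sketch: the `APP_ENV` variable, the mapping, and both helper names are assumptions for illustration, with values picked from inside the ranges listed above.

```python
import os
from typing import Optional

# Hypothetical mapping; the numbers sit inside the ranges suggested above
THRESHOLDS = {"production": 0.8, "development": 0.6}


def threshold_for(environment: Optional[str] = None) -> float:
    """Return the pass threshold for an environment name.

    Falls back to the APP_ENV variable (an assumed name), then to "development".
    """
    name = (environment or os.environ.get("APP_ENV", "development")).lower()
    if name not in THRESHOLDS:
        raise ValueError(f"unknown environment: {name!r}")
    return THRESHOLDS[name]


def is_passing(score: float, environment: Optional[str] = None) -> bool:
    """A score passes when it meets or exceeds the environment's threshold."""
    return score >= threshold_for(environment)
```

The returned value can then be passed as the `threshold` argument when constructing a metric, so the same test file is stricter in production than during development.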

Test Case Design

- Create representative examples covering common queries and edge cases
- Include context when needed (`retrieval_context` for RAG, `expected_output` for G-Eval)
- Use datasets for scale testing
- Version test cases over time

Evaluation Workflow

- Component-level first - Use `@observe` for individual parts
- End-to-end validation before deployment
- Automate in CI/CD with `deepeval test run`
- Track results over time with Confident AI cloud

Testing Anti-Patterns

Avoid:

- Testing only happy paths
- Using unrealistic inputs
- Ignoring metric reasons
- Setting thresholds too high initially
- Running full test suite on every change

Do:

- Test edge cases and failure modes
- Use real user queries as test inputs
- Read and analyse metric reasons
- Adjust thresholds based on empirical results
- Use component-level tests during development
- Separate config and eval content from code