# LLM & AI Testing Patterns

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).

## Quick Reference

| Area | File | Purpose |
| --- | --- | --- |
| Rules | `rules/llm-evaluation.md` | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | `rules/llm-mocking.md` | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | `references/deepeval-ragas-api.md` | Full API reference for DeepEval and RAGAS metrics |
| Reference | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| Reference | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | `references/planner-agent.md` | Explores the app and produces Markdown test plans |
| Checklist | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | `examples/llm-test-patterns.md` | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |

## When to Use This Skill

  • Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
  • Validating RAG pipeline output quality
  • Setting up deterministic LLM tests in CI
  • Building evaluation pipelines with quality gates
  • Applying agentic test patterns (plan -> generate -> heal)

## LLM Mock Quick Start

Mock LLM responses for fast, deterministic unit tests:

```python
from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # synthesize_findings and sample_findings come from the app under test.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```

**Key rule:** NEVER call live LLM APIs in CI. Use mocks for unit tests and VCR.py for integration tests.
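The fixture above relies on `AsyncMock` returning a canned payload from an awaited call; a minimal, self-contained sketch of that behavior (no app code involved):

```python
import asyncio
from unittest.mock import AsyncMock

# Awaiting the mock yields the canned dict directly.
mock_llm = AsyncMock(return_value={"content": "Mocked response", "confidence": 0.85})

result = asyncio.run(mock_llm("any prompt"))
assert result["confidence"] == 0.85

# AsyncMock also records how it was awaited, useful for asserting on prompts.
mock_llm.assert_awaited_once_with("any prompt")
```

Because the mock records its await calls, tests can verify both the response handling and the exact prompt sent.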

## DeepEval Quality Quick Start

Validate LLM output quality with multi-dimensional metrics:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```

## Quality Metrics Thresholds

| Metric | Threshold | Purpose |
| --- | --- | --- |
| Answer Relevancy | >= 0.7 | Response addresses the question |
| Faithfulness | >= 0.8 | Output matches the retrieved context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts are relevant |
| Context Recall | >= 0.7 | All relevant contexts are retrieved |
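In a CI quality gate, these thresholds can be enforced over a batch of metric scores. A hypothetical sketch (the metric names and `scores` dict are illustrative plumbing, not a DeepEval API):

```python
# Thresholds from the table above; hallucination is "lower is better".
THRESHOLDS = {
    "answer_relevancy": (0.7, "min"),
    "faithfulness": (0.8, "min"),
    "hallucination": (0.3, "max"),
}

def failed_metrics(scores: dict) -> list:
    """Return the names of metrics that violate their threshold."""
    failures = []
    for name, (limit, kind) in THRESHOLDS.items():
        score = scores[name]
        if (kind == "min" and score < limit) or (kind == "max" and score > limit):
            failures.append(name)
    return failures

scores = {"answer_relevancy": 0.82, "faithfulness": 0.75, "hallucination": 0.1}
assert failed_metrics(scores) == ["faithfulness"]  # 0.75 < 0.8 fails the gate
```

A CI job would fail the build whenever `failed_metrics` returns a non-empty list.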

## Structured Output Validation

Always validate LLM output with Pydantic schemas:

```python
from pydantic import BaseModel, Field

import pytest

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)

@pytest.mark.asyncio
async def test_structured_output():
    # get_llm_response: your app's LLM wrapper.
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```
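The schema should also be exercised on the failure path; a short sketch (assuming Pydantic v2's `ValidationError`) showing that out-of-range output is rejected rather than passed through silently:

```python
from pydantic import BaseModel, Field, ValidationError

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)

# A confidence outside [0, 1] must fail validation.
try:
    LLMResponse.model_validate({"answer": "Paris", "confidence": 1.7})
    raised = False
except ValidationError:
    raised = True

assert raised
```

In a pytest suite the same check is usually written with `pytest.raises(ValidationError)`.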

## VCR.py for Integration Tests

Record and replay LLM API calls for deterministic integration tests:

```python
import os

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        "filter_headers": ["authorization", "x-api-key"],
    }

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    # llm_client: your app's async LLM client.
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```

## Agentic Test Workflow

The three-agent pattern for end-to-end test automation:

```
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
```

1. **Planner** (`references/planner-agent.md`): Explores your app and produces Markdown test plans from PRDs or natural-language requests. Requires `seed.spec.ts` for app context.
2. **Generator** (`references/generator-agent.md`): Converts Markdown specs into Playwright tests, actively validating selectors against the running app. Uses semantic locators (`getByRole`, `getByLabel`, `getByText`).
3. **Healer** (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.

## Edge Cases to Always Test

For every LLM integration, cover these paths:
  • Empty/null inputs -- empty strings, None values
  • Long inputs -- truncation behavior near token limits
  • Timeouts -- fail-open vs fail-closed behavior
  • Schema violations -- invalid structured output
  • Prompt injection -- adversarial input resistance
  • Unicode -- non-ASCII characters in prompts and responses
See `checklists/llm-test-checklist.md` for the complete checklist.
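The timeout edge case above can be tested without a live model; a minimal sketch (the slow coroutine and fail-open fallback are illustrative, not app code):

```python
import asyncio

async def slow_llm(prompt: str) -> dict:
    # Stand-in for a hung LLM call.
    await asyncio.sleep(5)
    return {"content": "too late"}

async def call_with_timeout(prompt: str, timeout: float = 0.05) -> dict:
    try:
        return await asyncio.wait_for(slow_llm(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail-open: degrade to an empty answer instead of crashing.
        return {"content": "", "timed_out": True}

result = asyncio.run(call_with_timeout("summarize this"))
assert result["timed_out"] is True
```

A fail-closed variant would re-raise instead of returning a fallback; either way, the test pins down which behavior the integration promises.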

## Anti-Patterns

| Anti-Pattern | Correct Approach |
| --- | --- |
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single-metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only `is not None` | Schema validation + quality metrics |

## Related Skills

  • `ork:testing-unit` — Unit testing fundamentals, AAA pattern
  • `ork:testing-integration` — Integration testing for AI pipelines
  • `ork:golden-dataset` — Evaluation dataset management