testing-llm

# LLM & AI Testing Patterns

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).
## Quick Reference
| Area | File | Purpose |
|---|---|---|
| Rules | | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | | Full API reference for DeepEval and RAGAS metrics |
| Reference | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| Reference | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | `references/planner-agent.md` | Explores app and produces Markdown test plans |
| Checklist | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
## When to Use This Skill
- Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
- Validating RAG pipeline output quality
- Setting up deterministic LLM tests in CI
- Building evaluation pipelines with quality gates
- Applying agentic test patterns (plan -> generate -> heal)
## LLM Mock Quick Start
Mock LLM responses for fast, deterministic unit tests:

```python
from unittest.mock import AsyncMock, patch

import pytest


@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```

Key rule: NEVER call live LLM APIs in CI. Use mocks for unit tests, VCR.py for integration tests.
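One way to enforce this rule is a CI-only guard that makes any accidental live call fail loudly. A minimal sketch (the helper name and error message are illustrative, not part of any library):

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_live_network():
    """Make any attempt to open a real connection fail fast (use in CI)."""
    original = socket.create_connection

    def guard(*args, **kwargs):
        raise RuntimeError(
            "Live network call during tests -- mock the LLM client instead"
        )

    socket.create_connection = guard
    try:
        yield
    finally:
        socket.create_connection = original
```

Wrapping the test session with this context manager (e.g. from an autouse pytest fixture) turns a forgotten mock into an immediate, descriptive failure instead of a slow, flaky live call.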
## DeepEval Quality Quick Start
Validate LLM output quality with multi-dimensional metrics:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```
## Quality Metrics Thresholds
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |
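The thresholds above can be wired into a single CI gate. A sketch (the helper and metric keys are illustrative; real scores would come from DeepEval/RAGAS evaluations):

```python
# Thresholds mirror the table above; "min" means the score must be at least
# the threshold, "max" means it must not exceed it.
THRESHOLDS = {
    "answer_relevancy": (0.7, "min"),
    "faithfulness": (0.8, "min"),
    "hallucination": (0.3, "max"),
    "context_precision": (0.7, "min"),
    "context_recall": (0.7, "min"),
}


def failing_metrics(scores: dict) -> list:
    """Return a human-readable entry for every metric that missed its gate."""
    failures = []
    for name, score in scores.items():
        threshold, direction = THRESHOLDS[name]
        ok = score >= threshold if direction == "min" else score <= threshold
        if not ok:
            failures.append(f"{name}={score:.2f} (needs {direction} {threshold})")
    return failures
```

A test can then assert `not failing_metrics(scores)` so a single assertion reports every failed dimension at once.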
## Structured Output Validation
Always validate LLM output with Pydantic schemas:

```python
from pydantic import BaseModel, Field


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


async def test_structured_output():
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```
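When Pydantic is not available, or to spell out exactly which violations the negative-path tests should cover, the same checks can be hand-rolled. An illustrative sketch mirroring the schema above:

```python
def validate_response(data: dict) -> dict:
    """Hand-rolled equivalent of the schema checks: non-empty answer,
    confidence in [0, 1], sources defaulting to an empty list."""
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer:
        raise ValueError("answer must be a non-empty string")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    data.setdefault("sources", [])
    return data
```

The three violation classes here (empty answer, out-of-range confidence, missing sources) are exactly the cases the Pydantic model's negative-path tests should exercise too.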
## VCR.py for Integration Tests
Record and replay LLM API calls for deterministic integration tests:

```python
import os

import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        # Never record in CI; replay existing cassettes only.
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        # Keep secrets out of cassettes.
        "filter_headers": ["authorization", "x-api-key"],
    }


@pytest.mark.vcr()
async def test_llm_integration():
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```
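Default matching on method and URI is often too loose for LLM APIs, where every call hits the same endpoint. A custom matcher can compare request bodies while ignoring volatile fields; this is a sketch, and the field names are illustrative:

```python
import json

VOLATILE_FIELDS = {"request_id", "timestamp"}


def match_llm_body(r1, r2):
    """Compare JSON request bodies, ignoring fields that change on every call,
    so re-recorded cassettes still match the same logical request."""

    def normalized(request):
        body = json.loads(request.body or "{}")
        return {k: v for k, v in body.items() if k not in VOLATILE_FIELDS}

    return normalized(r1) == normalized(r2)
```

Assuming vcrpy's registration API, this would be enabled with `my_vcr.register_matcher("llm_body", match_llm_body)` and `match_on=["method", "uri", "llm_body"]`.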
## Agentic Test Workflow
The three-agent pattern for end-to-end test automation:

Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)

- Planner (`references/planner-agent.md`): Explores your app and produces Markdown test plans from PRDs or natural language requests. Requires `seed.spec.ts` for app context.
- Generator (`references/generator-agent.md`): Converts Markdown specs into Playwright tests, actively validating selectors against the running app. Uses semantic locators (getByRole, getByLabel, getByText).
- Healer (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.
## Edge Cases to Always Test
For every LLM integration, cover these paths:
- Empty/null inputs -- empty strings, None values
- Long inputs -- truncation behavior near token limits
- Timeouts -- fail-open vs fail-closed behavior
- Schema violations -- invalid structured output
- Prompt injection -- adversarial input resistance
- Unicode -- non-ASCII characters in prompts and responses
See `checklists/llm-test-checklist.md` for the complete checklist.
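For the timeout bullet above, the fail-open vs fail-closed choice is easy to pin down with a small wrapper. This sketch (hypothetical names) implements fail-open by returning a fallback on timeout:

```python
import asyncio


async def call_llm_with_timeout(make_call, timeout=0.5, fallback=None):
    """Fail-open wrapper: on timeout, return a safe fallback instead of raising.
    For fail-closed behavior, re-raise the TimeoutError instead."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout)
    except asyncio.TimeoutError:
        return fallback
```

A test asserts both paths: a deliberately slow coroutine must yield the fallback within the deadline, and a fast one must pass its result through unchanged.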
## Anti-Patterns
| Anti-Pattern | Correct Approach |
|---|---|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only | Schema validation + quality metrics |
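The "random seeds" row is cheap to get right: give test data generators their own seeded `random.Random` instance instead of touching the global RNG (the helper name here is illustrative):

```python
import random


def sample_prompts(n=3, seed=1234):
    """Deterministic test data: a fixed-seed RNG makes 'random' prompts
    reproducible across runs and machines, unlike shared global random state."""
    rng = random.Random(seed)
    topics = ["refund", "billing", "shipping", "login"]
    return [f"Customer question about {rng.choice(topics)}" for _ in range(n)]
```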
## Related Skills
- `ork:testing-unit` -- Unit testing fundamentals, AAA pattern
- `ork:testing-integration` -- Integration testing for AI pipelines
- `ork:golden-dataset` -- Evaluation dataset management