# LLM & AI Testing Patterns

Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).

## Quick Reference

| Area | File | Purpose |
| --- | --- | --- |
| Rules | `rules/llm-evaluation.md` | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | `rules/llm-mocking.md` | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | `references/deepeval-ragas-api.md` | Full API reference for DeepEval and RAGAS metrics |
| Reference | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| Reference | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | `references/planner-agent.md` | Explores the app and produces Markdown test plans |
| Checklist | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | `examples/llm-test-patterns.md` | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |

## When to Use This Skill

  • Testing code that calls LLM APIs (OpenAI, Anthropic, etc.)
  • Validating RAG pipeline output quality
  • Setting up deterministic LLM tests in CI
  • Building evaluation pipelines with quality gates
  • Applying agentic test patterns (plan -> generate -> heal)

## LLM Mock Quick Start

Mock LLM responses for fast, deterministic unit tests:

```python
from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # synthesize_findings and sample_findings come from the app under test.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
    assert result["summary"] is not None
```

**Key rule:** NEVER call live LLM APIs in CI. Use mocks for unit tests and VCR.py for integration tests.
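The fixture above relies on `AsyncMock` returning a canned payload from an awaited call; a minimal, self-contained sketch of that behavior (no app code involved):

```python
import asyncio
from unittest.mock import AsyncMock

# Awaiting the mock yields the canned dict directly.
mock_llm = AsyncMock(return_value={"content": "Mocked response", "confidence": 0.85})

result = asyncio.run(mock_llm("any prompt"))
assert result["confidence"] == 0.85

# AsyncMock also records how it was awaited, useful for asserting on prompts.
mock_llm.assert_awaited_once_with("any prompt")
```

Because the mock records its await calls, tests can verify both the response handling and the exact prompt sent.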

## DeepEval Quality Quick Start

Validate LLM output quality with multi-dimensional metrics:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```

## Quality Metrics Thresholds

| Metric | Threshold | Purpose |
| --- | --- | --- |
| Answer Relevancy | >= 0.7 | Response addresses the question |
| Faithfulness | >= 0.8 | Output matches the retrieved context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts are relevant |
| Context Recall | >= 0.7 | All relevant contexts are retrieved |
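In a CI quality gate, these thresholds can be enforced over a batch of metric scores. A hypothetical sketch (the metric names and `scores` dict are illustrative plumbing, not a DeepEval API):

```python
# Thresholds from the table above; hallucination is "lower is better".
THRESHOLDS = {
    "answer_relevancy": (0.7, "min"),
    "faithfulness": (0.8, "min"),
    "hallucination": (0.3, "max"),
}

def failed_metrics(scores: dict) -> list:
    """Return the names of metrics that violate their threshold."""
    failures = []
    for name, (limit, kind) in THRESHOLDS.items():
        score = scores[name]
        if (kind == "min" and score < limit) or (kind == "max" and score > limit):
            failures.append(name)
    return failures

scores = {"answer_relevancy": 0.82, "faithfulness": 0.75, "hallucination": 0.1}
assert failed_metrics(scores) == ["faithfulness"]  # 0.75 < 0.8 fails the gate
```

A CI job would fail the build whenever `failed_metrics` returns a non-empty list.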

## Structured Output Validation

Always validate LLM output with Pydantic schemas:

```python
from pydantic import BaseModel, Field

import pytest

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)

@pytest.mark.asyncio
async def test_structured_output():
    # get_llm_response: your app's LLM wrapper.
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```
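The schema should also be exercised on the failure path; a short sketch (assuming Pydantic v2's `ValidationError`) showing that out-of-range output is rejected rather than passed through silently:

```python
from pydantic import BaseModel, Field, ValidationError

class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)

# A confidence outside [0, 1] must fail validation.
try:
    LLMResponse.model_validate({"answer": "Paris", "confidence": 1.7})
    raised = False
except ValidationError:
    raised = True

assert raised
```

In a pytest suite the same check is usually written with `pytest.raises(ValidationError)`.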

## VCR.py for Integration Tests

Record and replay LLM API calls for deterministic integration tests:

```python
import os

import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        "filter_headers": ["authorization", "x-api-key"],
    }

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    # llm_client: your app's async LLM client.
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```

## Agentic Test Workflow

The three-agent pattern for end-to-end test automation:

```
Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)
```

1. **Planner** (`references/planner-agent.md`): Explores your app and produces Markdown test plans from PRDs or natural-language requests. Requires `seed.spec.ts` for app context.
2. **Generator** (`references/generator-agent.md`): Converts Markdown specs into Playwright tests, actively validating selectors against the running app. Uses semantic locators (`getByRole`, `getByLabel`, `getByText`).
3. **Healer** (`references/healer-agent.md`): Automatically fixes failing tests by replaying failures, inspecting the DOM, and patching locators/waits. Max 3 healing attempts per test.

## Edge Cases to Always Test

For every LLM integration, cover these paths:
  • Empty/null inputs -- empty strings, None values
  • Long inputs -- truncation behavior near token limits
  • Timeouts -- fail-open vs fail-closed behavior
  • Schema violations -- invalid structured output
  • Prompt injection -- adversarial input resistance
  • Unicode -- non-ASCII characters in prompts and responses
See `checklists/llm-test-checklist.md` for the complete checklist.
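The timeout edge case above can be tested without a live model; a minimal sketch (the slow coroutine and fail-open fallback are illustrative, not app code):

```python
import asyncio

async def slow_llm(prompt: str) -> dict:
    # Stand-in for a hung LLM call.
    await asyncio.sleep(5)
    return {"content": "too late"}

async def call_with_timeout(prompt: str, timeout: float = 0.05) -> dict:
    try:
        return await asyncio.wait_for(slow_llm(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail-open: degrade to an empty answer instead of crashing.
        return {"content": "", "timed_out": True}

result = asyncio.run(call_with_timeout("summarize this"))
assert result["timed_out"] is True
```

A fail-closed variant would re-raise instead of returning a fallback; either way, the test pins down which behavior the integration promises.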

## Anti-Patterns

| Anti-Pattern | Correct Approach |
| --- | --- |
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single-metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only `is not None` | Schema validation + quality metrics |

## Related Skills

  • `ork:testing-unit` — Unit testing fundamentals, AAA pattern
  • `ork:testing-integration` — Integration testing for AI pipelines
  • `ork:golden-dataset` — Evaluation dataset management