LLM and AI testing patterns — mock responses, evaluation with DeepEval/RAGAS, structured output validation, and agentic test patterns (generator, healer, planner). Use when testing AI features, validating LLM outputs, or building evaluation pipelines.
```shell
npx skill4agent add yonatangross/orchestkit testing-llm
```

| Area | File | Purpose |
|---|---|---|
| Rules | | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | | Full API reference for DeepEval and RAGAS metrics |
| Reference | `references/generator-agent.md` | Transforms Markdown specs into Playwright tests |
| Reference | `references/healer-agent.md` | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | `references/planner-agent.md` | Explores app and produces Markdown test plans |
| Checklist | `checklists/llm-test-checklist.md` | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
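The golden-dataset approach referenced above can be sketched in plain Python; a minimal illustration, where `GOLDEN_CASES`, `normalize`, and `passes_golden_case` are hypothetical names (in a real suite each case would usually become a `pytest.mark.parametrize` entry):

```python
import json

# Hypothetical golden dataset: prompts paired with phrases the output must contain.
GOLDEN_CASES = json.loads("""[
    {"input": "capital of France", "must_contain": "paris"},
    {"input": "capital of Japan", "must_contain": "tokyo"}
]""")


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences pass.
    return " ".join(text.lower().split())


def passes_golden_case(output: str, case: dict) -> bool:
    # Substring check on normalized text; real suites often use
    # semantic similarity or an LLM judge instead of exact matching.
    return case["must_contain"] in normalize(output)


print(passes_golden_case("The capital is  PARIS.", GOLDEN_CASES[0]))  # True
```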
```python
from unittest.mock import AsyncMock, patch

import pytest


@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)
assert_test(test_case, [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
])
```

| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | >= 0.7 | Response addresses question |
| Faithfulness | >= 0.8 | Output matches context |
| Hallucination | <= 0.3 | No fabricated facts |
| Context Precision | >= 0.7 | Retrieved contexts relevant |
| Context Recall | >= 0.7 | All relevant contexts retrieved |
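The thresholds in the table can be enforced as a single gate over a dict of metric scores; a plain-Python sketch, where `THRESHOLDS` and `failed_metrics` are illustrative helpers (not DeepEval API):

```python
# Thresholds mirror the table above; "min" means score must meet or exceed
# the threshold, "max" means it must stay at or below it.
THRESHOLDS = {
    "answer_relevancy": (0.7, "min"),
    "faithfulness": (0.8, "min"),
    "hallucination": (0.3, "max"),
    "context_precision": (0.7, "min"),
    "context_recall": (0.7, "min"),
}


def failed_metrics(scores: dict[str, float]) -> list[str]:
    # Return the names of every metric that missed its threshold,
    # so a failing test reports all violated dimensions at once.
    failures = []
    for name, (threshold, kind) in THRESHOLDS.items():
        score = scores[name]
        ok = score >= threshold if kind == "min" else score <= threshold
        if not ok:
            failures.append(name)
    return failures


print(failed_metrics({
    "answer_relevancy": 0.9, "faithfulness": 0.75,
    "hallucination": 0.1, "context_precision": 0.8, "context_recall": 0.7,
}))  # ['faithfulness']
```

Reporting every failed dimension, rather than stopping at the first, keeps with the "3-5 quality dimensions" guidance below.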
```python
from pydantic import BaseModel, Field


class LLMResponse(BaseModel):
    answer: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)
    sources: list[str] = Field(default_factory=list)


async def test_structured_output():
    result = await get_llm_response("test query")
    parsed = LLMResponse.model_validate(result)
    assert 0 <= parsed.confidence <= 1.0
```

```python
import os

import pytest


@pytest.fixture(scope="module")
def vcr_config():
    return {
        "record_mode": "none" if os.environ.get("CI") else "new_episodes",
        "filter_headers": ["authorization", "x-api-key"],
    }


@pytest.mark.vcr()
async def test_llm_integration():
    response = await llm_client.complete("Say hello")
    assert "hello" in response.content.lower()
```

Planner -> specs/*.md -> Generator -> tests/*.spec.ts -> Healer (auto-fix)

| Anti-Pattern | Correct Approach |
|---|---|
| Live LLM calls in CI | Mock for unit, VCR for integration |
| Random seeds | Fixed seeds or mocked responses |
| Single metric evaluation | 3-5 quality dimensions |
| No timeout handling | Always set < 1s timeout in tests |
| Hardcoded API keys | Environment variables, filtered in VCR |
| Asserting only | Schema validation + quality metrics |
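The timeout rule above can be enforced with `asyncio.wait_for`; a sketch, where `slow_llm_call` and `call_with_timeout` are stand-ins for a real client call:

```python
import asyncio


async def slow_llm_call(prompt: str) -> str:
    # Simulates a hung provider call; a real LLM client would be awaited here.
    await asyncio.sleep(10)
    return "response"


async def call_with_timeout(prompt: str, timeout: float = 1.0) -> str:
    # Fail fast instead of letting the test suite hang on a stuck provider.
    return await asyncio.wait_for(slow_llm_call(prompt), timeout=timeout)


try:
    result = asyncio.run(call_with_timeout("hello", timeout=0.05))
except asyncio.TimeoutError:
    result = "timed out"

print(result)  # timed out
```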
Related skills: `ork:testing-unit`, `ork:testing-integration`, `ork:golden-dataset`