DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; add tracing or @observe; send results to Confident AI; monitor production; run online evals; inspect traces; or iterate on prompts, tools, retrieval, or agent behavior from eval failures. AI agents are the primary use case. Covers Python SDK, pytest eval suites, CLI generation, tracing, Confident AI reporting, and agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval.
Install the skill:

```bash
npx skill4agent add confident-ai/deepeval
```

The workflow centers on the `deepeval` CLI and Python SDK: generate datasets with `deepeval generate`, write `pytest` eval suites with shared metric lists in `metrics.py`, add tracing with `@observe`, run suites with `deepeval test run`, and report results to Confident AI.

## Key APIs and commands

- Datasets and goldens: build or load `Golden`s, or synthesize them with `deepeval generate` (see `references/datasets.md` and `references/synthetic-data.md`).
- E2E pytest evals: `assert_test(golden=golden, metrics=[...])`, `for golden in dataset.evals_iterator(metrics=[...])`, and `LLMTestCase` (see `references/pytest-e2e-evals.md`); a hedged sketch follows the reference table below.
- Component-level evals via tracing: `@observe(metrics=[...])` and `next_*_span(metrics=[...])` (see `references/tracing.md`); a sketch follows the templates table below.
- Running suites: `deepeval test run tests/evals/test_<app>.py` with flags such as `--num-processes 5`, `--ignore-errors`, `--skip-on-missing-params`, and `--identifier` (see `references/iteration-loop.md`).

Example commands:

```bash
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
deepeval view
```

## Reference guides

| Topic | File |
|---|---|
| Intake questions and branching | `references/intake.md` |
| Use case selection | `references/choose-use-case.md` |
| Dataset loading | `references/datasets.md` |
| Synthetic data generation | `references/synthetic-data.md` |
| Metrics | `references/metrics.md` |
| Integrations | `references/integrations.md` |
| Pytest E2E evals | `references/pytest-e2e-evals.md` |
| Tracing | `references/tracing.md` |
| Confident AI | |
| Dataset and eval artifact contracts | `references/artifact-contracts.md` |
| Iteration loop | `references/iteration-loop.md` |
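
Below is a minimal sketch of a single-turn E2E eval suite, assuming the classic `assert_test(test_case, metrics)` form. The `my_app` stub, the sample golden, and the metric threshold are illustrative placeholders, not part of the skill; the golden-based `assert_test(golden=golden, metrics=[...])` and `dataset.evals_iterator(metrics=[...])` variants listed above are covered in `references/pytest-e2e-evals.md`.

```python
import pytest

from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def my_app(prompt: str) -> str:
    # Hypothetical stand-in for the application under test.
    return "You can reset your password from the account settings page."


# An illustrative one-golden dataset; in practice, load or generate goldens.
dataset = EvaluationDataset(goldens=[Golden(input="How do I reset my password?")])


@pytest.mark.parametrize("golden", dataset.goldens)
def test_my_app(golden: Golden):
    # Build a test case by running the app under test on the golden's input.
    test_case = LLMTestCase(input=golden.input, actual_output=my_app(golden.input))
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Run it with `deepeval test run tests/evals/test_<app>.py` so results are collected as a test run rather than plain pytest output.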

## Templates

| App type | Template |
|---|---|
| Single-turn tracing | `templates/test_single_turn_tracing.py` |
| Single-turn no tracing | `templates/test_single_turn_no_tracing.py` |
| Multi-turn E2E | `templates/test_multi_turn_e2e.py` |
| Shared metric lists | `templates/metrics.py` |
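
As a companion to the tracing templates, here is a minimal component-level sketch, assuming `observe` and `update_current_span` are importable from `deepeval.tracing` (as in recent DeepEval releases); the `fake_llm` helper and the metric choice are illustrative placeholders.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "Go to Settings > Security and choose 'Reset password'."


@observe(metrics=[AnswerRelevancyMetric(threshold=0.7)])
def generate_answer(query: str) -> str:
    answer = fake_llm(query)
    # Attach a test case to the current span so the metric scores this component.
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer
```

Decorated components report span-level results when invoked inside a `deepeval test run`, which is what enables the component-level metrics described above.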