# LangGraph Testing & Evaluation
Practical workflows for validating agent quality with:
- Unit/integration tests
- Trajectory evaluation
- LangSmith dataset evaluations
- A/B-style comparisons between versions
Use this file for high-level flow. Load `references/*` for detailed implementation.

## Start Here
Choose the smallest approach that answers your question:
| Goal | Primary method | Load first |
|---|---|---|
| Validate node logic quickly | Unit tests with mocks | `references/unit-testing-patterns.md` |
| Validate multi-step agent behavior | Trajectory evaluation | `references/trajectory-evaluation.md` |
| Track quality across datasets over time | LangSmith evaluation | `references/langsmith-evaluation.md` |
| Compare old vs new agent versions | A/B comparison | `references/ab-testing.md` |
Recommended order:
1. Unit tests
2. Integration/trajectory checks
3. Dataset evaluation in LangSmith
4. A/B comparison before deployment
## Quick Commands
Run from repo root.
### Generate test scaffolding

```bash
# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```
### Run trajectory evaluation

```bash
# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini

# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```
### Run LangSmith dataset evaluation

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4

# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```
### Compare two agent versions

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json

# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```
### Create mock response configs

```bash
# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```
## Core Workflow
1. Define test scope.
   - Unit: deterministic logic in one node/function.
   - Integration: node interactions and routing.
   - End-to-end: complete response quality on realistic inputs.
2. Start from deterministic checks.
   - Mock LLM/tool IO for speed and repeatability.
   - Keep real-model tests as a smaller, explicit suite.
3. Build/curate dataset examples.
   - Use stable inputs and expected outputs.
   - Keep the schema simple: `inputs` and `outputs` objects (optional `metadata`).
   - Compatibility note: scripts also accept singular keys (`input`, `output`) for legacy datasets.
4. Run evaluation with explicit gates.
   - Use evaluator keys that map to deployment decisions.
   - Set thresholds in CI for regression prevention.
5. Compare versions before rollout.
   - Run the same dataset on both versions.
   - Check both quality and latency.
6. Diagnose failures from traces/experiments.
   - Inspect low-scoring examples.
   - Split failures by pattern (routing, tool usage, hallucination, latency spikes).
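The "start from deterministic checks" step can be sketched as a plain pytest unit test that exercises one node with a mocked model. The node, state shape, and labels here are illustrative assumptions, not part of the bundled scripts; the point is that the model call is injected, so the test is fast and repeatable.

```python
from unittest.mock import MagicMock

# Hypothetical node under test: classifies a support ticket. The LLM client
# is injected so unit tests can replace it with a deterministic mock.
def classify_node(state: dict, llm) -> dict:
    label = llm.invoke(state["ticket"]).strip().lower()
    if label not in {"billing", "bug", "other"}:
        label = "other"  # deterministic fallback for unexpected model output
    return {**state, "category": label}

def test_classify_node_routes_billing():
    fake_llm = MagicMock()
    fake_llm.invoke.return_value = " Billing \n"  # messy but recoverable output
    out = classify_node({"ticket": "I was charged twice"}, fake_llm)
    assert out["category"] == "billing"
    fake_llm.invoke.assert_called_once_with("I was charged twice")

def test_classify_node_falls_back_on_garbage():
    fake_llm = MagicMock()
    fake_llm.invoke.return_value = "purple monkey dishwasher"
    assert classify_node({"ticket": "??"}, fake_llm)["category"] == "other"
```

Keep tests like these as the largest share of the suite; reserve real-model calls for a smaller, explicitly marked integration suite.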
## Current References (Load On Demand)
### `references/unit-testing-patterns.md`

Load when:
- You need node-level and routing test patterns.
- You need pytest/vitest/Jest integration patterns.
- You need robust mocking and flaky-test reduction.
### `references/trajectory-evaluation.md`

Load when:
- You need trajectory match evaluation (`strict`, `unordered`, `subset`, `superset`).
- You need LLM-as-judge trajectory scoring.
- You need LangSmith experiment comparison for trajectory results.
### `references/langsmith-evaluation.md`

Load when:
- You need dataset creation/management in LangSmith.
- You need evaluator signatures and experiment runs in Python/TS.
- You need CI-friendly workflows with quality thresholds.
### `references/ab-testing.md`

Load when:
- You need offline A/B comparison methodology.
- You need significance testing and interpretation.
- You need production traffic split strategy and guardrails.
## Assets
### `assets/templates/test_template.py`

- Runnable Python pytest template aligned with current LangGraph testing patterns.
- Includes:
  - Compiled-graph invocation with `thread_id`
  - Single-node testing via `compiled_graph.nodes[...]`
  - Integration-test placeholder
### `assets/datasets/sample_dataset.json`

- Deterministic seed dataset for LangSmith ingestion.
- Uses the `examples: [{ inputs, outputs, metadata }]` format.
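A dataset in the `examples: [{ inputs, outputs, metadata }]` shape can be produced with a few lines of Python. The questions, answers, and categories below are illustrative placeholders, not the contents of the bundled sample file.

```python
import json

# Minimal seed dataset in the examples: [{ inputs, outputs, metadata }] shape.
# Keep inputs stable over time so evaluation runs stay comparable.
dataset = {
    "examples": [
        {
            "inputs": {"question": "What is LangGraph?"},
            "outputs": {"answer": "A framework for building stateful agent graphs."},
            "metadata": {"category": "definition"},
        },
        {
            "inputs": {"question": "How do I reset my password?"},
            "outputs": {"answer": "Use the account settings page."},
            "metadata": {"category": "how-to"},
        },
    ]
}

# Write it where the evaluation scripts can pick it up as a local dataset file.
with open("sample_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The optional `metadata` field pays off later: per-category segmentation is what surfaces regressions that aggregate scores hide.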
### `assets/examples/README.md`

- Documentation-only index for current asset usage.
- Notes where runnable assets live today.
## Script Interface Summary
### `scripts/generate_test_cases.py` / `.js`

Use for fast test scaffolding.

Inputs:
- Graph module path
  - Python: `my_module:graph` or `my_module.graph`
  - JS/TS: `./file.ts:graph`

Outputs:
- Framework-specific starter tests in the target directory.
### `scripts/run_trajectory_eval.py` / `.js`

Use for trajectory scoring with either `--method match` or `--method llm-judge`.

Supports:
- Local dataset files (`.json`)
- LangSmith dataset names
- Optional reference trajectory file with `--reference-trajectory`
- Match modes: `strict`, `unordered`, `subset`, `superset`

Local-only mode:
- `--no-langsmith` in both Python and JavaScript scripts (requires a local JSON dataset file)
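The four match modes can be pictured as set/sequence comparisons over tool-call trajectories. This hand-rolled sketch shows the intended semantics; it is not the script's actual implementation, which may differ in detail.

```python
from collections import Counter

def trajectory_matches(actual: list, reference: list, mode: str = "strict") -> bool:
    """Compare an agent's tool-call trajectory against a reference.

    Illustrative semantics (assumed, not the script's code):
    - strict:    same calls in the same order
    - unordered: same calls, any order
    - subset:    every actual call appears in the reference
    - superset:  every reference call appears in the actual trajectory
    """
    if mode == "strict":
        return actual == reference
    if mode == "unordered":
        return Counter(actual) == Counter(reference)
    if mode == "subset":
        return not (Counter(actual) - Counter(reference))
    if mode == "superset":
        return not (Counter(reference) - Counter(actual))
    raise ValueError(f"unknown mode: {mode}")

reference = ["search", "fetch_page", "summarize"]
actual = ["fetch_page", "search", "summarize"]
assert not trajectory_matches(actual, reference, "strict")        # order differs
assert trajectory_matches(actual, reference, "unordered")         # same calls
assert trajectory_matches(["search"], reference, "subset")        # no extras
assert trajectory_matches(actual + ["cite"], reference, "superset")
```

This is why `strict` produces high variance on workflows with equivalent paths: `unordered` or `subset` is usually the right gate for order-insensitive agents.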
### `scripts/evaluate_with_langsmith.py` / `.js`

Use for dataset-based evaluation runs and experiment tracking.

Supports:
- Existing dataset by name
- Dataset creation from a JSON examples file
- Multiple evaluators (`--evaluators accuracy,latency,...`)
- Concurrency control (`--max-concurrency`)

Python-only:
- `--no-upload` to run without uploading experiment results
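Evaluators like `accuracy` boil down to a function that scores one example and a gate that turns mean scores into a CI decision. This standalone sketch assumes a key/score-dict convention and avoids any LangSmith dependency; the scripts' actual evaluator implementations and signatures may differ.

```python
# Hypothetical "accuracy" evaluator: exact match after normalization.
def accuracy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    matched = (outputs.get("answer", "").strip().lower()
               == reference_outputs.get("answer", "").strip().lower())
    return {"key": "accuracy", "score": 1.0 if matched else 0.0}

def gate(results: list, threshold: float = 0.8) -> bool:
    """Go/no-go decision for CI: mean score must clear the threshold."""
    mean = sum(r["score"] for r in results) / len(results)
    return mean >= threshold

results = [
    accuracy_evaluator({"answer": "Paris"}, {"answer": "paris"}),   # score 1.0
    accuracy_evaluator({"answer": "Lyon"}, {"answer": "paris"}),    # score 0.0
]
print(gate(results, threshold=0.5))  # mean 0.5 clears a 0.5 threshold -> True
```

Naming the evaluator key after a deployment decision ("accuracy", "latency") keeps the CI gate legible: a failing threshold points directly at the quality dimension that regressed.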
### `scripts/compare_agents.py` / `.js`

Use for offline version comparisons:
- Shared dataset input
- Success/latency summaries
- JSON report output for CI artifacts
- Local JSON datasets or LangSmith datasets (JS supports `--no-langsmith` to disable remote loading)
### `scripts/mock_llm_responses.py` / `.js`

Use for deterministic test doubles. Supported types:
- `single` (one fixed response)
- `sequence` (ordered responses, consumed one per call)
- `conditional` (response chosen by matching the prompt)
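The three mock types can be pictured as one small test double. This class is an illustrative assumption, not the schema of the generated `mock_config.json`; it only shows how `single`, `sequence`, and `conditional` behave.

```python
class MockLLM:
    """Deterministic test double covering the three mock types:
    a single fixed reply, a sequence consumed in order, and a
    keyword-conditional reply. Illustrative only."""

    def __init__(self, single=None, sequence=None, conditional=None):
        self.single = single
        self.sequence = list(sequence or [])
        self.conditional = conditional or {}

    def invoke(self, prompt: str) -> str:
        if self.sequence:
            return self.sequence.pop(0)   # sequence: ordered replies first
        for keyword, reply in self.conditional.items():
            if keyword in prompt:         # conditional: first keyword match
                return reply
        return self.single or ""          # single: fixed fallback

seq = MockLLM(sequence=["plan", "act", "done"])
assert [seq.invoke("x") for _ in range(3)] == ["plan", "act", "done"]

cond = MockLLM(single="fallback", conditional={"refund": "route_to_billing"})
assert cond.invoke("I want a refund") == "route_to_billing"
assert cond.invoke("hello") == "fallback"
```

Sequence mocks suit multi-step trajectory tests; conditional mocks suit routing tests where the reply must depend on the prompt.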
## Decision Rules
If behavior is deterministic and local:
- Use unit tests first.
If behavior depends on tool sequence/routing:
- Add trajectory evaluation.
If behavior depends on realistic distribution quality:
- Run LangSmith dataset evaluation.
If approving a replacement model/prompt/graph:
- Run A/B comparison and check both quality and latency.
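For the A/B decision, significance testing on success rates can be sketched with a stdlib-only two-proportion z-test. This is a generic statistical sketch under a normal approximation, not necessarily the test `compare_agents` uses; `references/ab-testing.md` covers the methodology in detail.

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-proportion z-test for comparing success rates of two agent
    versions on the same dataset. Normal-approximation sketch only."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the normal CDF (math.erf is stdlib).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical run: v2 succeeds on 88/100 examples vs v1's 70/100.
z, p = two_proportion_z(70, 100, 88, 100)
print(f"z={z:.2f}, p={p:.4f}")  # small p -> the gap is unlikely to be noise
```

With ~100 examples per side, differences under a few percentage points rarely reach significance, which is an argument for versioned datasets large enough to detect the effect sizes you care about.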
## Common Failure Patterns

### Flaky tests
- Cause: real-model nondeterminism in unit scope.
- Fix: mock LLM/tool calls for unit tests; reserve real-model tests for separate integration marks.
### High trajectory variance
- Cause: overly strict matching for workflows with equivalent paths.
- Fix: switch match mode (`unordered`, `subset`, or `superset`) where appropriate.
### Regressions hidden by averages
- Cause: only aggregate score monitored.
- Fix: inspect per-example failures and segment by category metadata.
### Latency regressions with same quality
- Cause: no explicit latency gate.
- Fix: include latency evaluator and CI threshold.
## Minimal Best Practices
- Keep fast deterministic tests as the largest share.
- Version datasets and keep them stable.
- Track both correctness and latency.
- Add explicit go/no-go thresholds in CI.
- Compare candidate vs baseline before production rollout.
- Investigate failures with trace-level evidence, not only aggregate scores.