langgraph-testing-evaluation

LangGraph Testing & Evaluation


Practical workflows for validating agent quality with:
  • Unit/integration tests
  • Trajectory evaluation
  • LangSmith dataset evaluations
  • A/B-style comparisons between versions
Use this file for the high-level flow. Load `references/*` for detailed implementation.

Start Here


Choose the smallest approach that answers your question:

| Goal | Primary method | Load first |
| --- | --- | --- |
| Validate node logic quickly | Unit tests with mocks | `references/unit-testing-patterns.md` |
| Validate multi-step agent behavior | Trajectory evaluation | `references/trajectory-evaluation.md` |
| Track quality across datasets over time | LangSmith evaluation | `references/langsmith-evaluation.md` |
| Compare old vs. new agent versions | A/B comparison | `references/ab-testing.md` |
Recommended order:
  1. Unit tests
  2. Integration/trajectory checks
  3. Dataset evaluation in LangSmith
  4. A/B comparison before deployment

Quick Commands


Run from repo root.

Generate test scaffolding


Python (preferred)

```bash
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```

Run trajectory evaluation


Python: LLM-as-judge

```bash
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini
```

Python: trajectory match

```bash
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```

Run LangSmith dataset evaluation


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4
```

Python (do not upload experiment results)

```bash
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```

Compare two agent versions


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json
```

JavaScript/TypeScript (force local dataset file only)

```bash
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```

Create mock response configs


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```

Core Workflow


  1. Define test scope.
  • Unit: deterministic logic in one node/function.
  • Integration: node interactions and routing.
  • End-to-end: complete response quality on realistic inputs.
  2. Start from deterministic checks.
  • Mock LLM/tool I/O for speed and repeatability.
  • Keep real-model tests as a smaller, explicit suite.
  3. Build/curate dataset examples.
  • Use stable inputs and expected outputs.
  • Keep the schema simple: `inputs` and `outputs` objects (optional `metadata`).
  • Compatibility note: scripts also accept singular keys (`input`, `output`) for legacy datasets.
  4. Run evaluation with explicit gates.
  • Use evaluator keys that map to deployment decisions.
  • Set thresholds in CI to prevent regressions.
  5. Compare versions before rollout.
  • Run the same dataset on both versions.
  • Check both quality and latency.
  6. Diagnose failures from traces/experiments.
  • Inspect low-scoring examples.
  • Split failures by pattern (routing, tool usage, hallucination, latency spikes).
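The dataset schema in step 3 can be sketched concretely. A minimal example, assuming hypothetical question/answer values; only the `examples`/`inputs`/`outputs`/`metadata` keys and the legacy singular-key note come from the workflow above:

```python
import json

# Minimal dataset following the schema above: each example has
# "inputs" and "outputs" objects plus optional "metadata".
# The question/answer values are illustrative placeholders.
dataset = {
    "examples": [
        {
            "inputs": {"question": "What is the capital of France?"},
            "outputs": {"answer": "Paris"},
            "metadata": {"category": "factual"},
        },
        {
            # Legacy singular keys are also accepted by the scripts,
            # per the compatibility note above.
            "input": {"question": "Summarize: LangGraph builds agents as graphs."},
            "output": {"answer": "LangGraph models agent workflows as graphs."},
        },
    ]
}

def normalize(example: dict) -> dict:
    """Map legacy singular keys onto the plural schema."""
    return {
        "inputs": example.get("inputs", example.get("input")),
        "outputs": example.get("outputs", example.get("output")),
        "metadata": example.get("metadata", {}),
    }

examples = [normalize(e) for e in dataset["examples"]]
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

Keeping the normalization in one helper lets the rest of a test suite assume the plural schema.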

Current References (Load On Demand)


references/unit-testing-patterns.md


Load when:
  • You need node-level and routing test patterns.
  • You need pytest/vitest/Jest integration patterns.
  • You need robust mocking and flaky-test reduction.

references/trajectory-evaluation.md


Load when:
  • You need trajectory match evaluation (`strict`, `unordered`, `subset`, `superset`).
  • You need LLM-as-judge trajectory scoring.
  • You need LangSmith experiment comparison for trajectory results.

references/langsmith-evaluation.md


Load when:
  • You need dataset creation/management in LangSmith.
  • You need evaluator signatures and experiment runs in Python/TS.
  • You need CI-friendly workflows with quality thresholds.
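An evaluator in this style is, at heart, a function that scores one example. A minimal sketch of the idea, deliberately not tied to the real LangSmith evaluator signature (see the reference above for that); here it is a plain function over output/reference dicts:

```python
def accuracy_evaluator(outputs: dict, reference: dict) -> dict:
    """Toy exact-match evaluator returning a keyed score.

    The {"key": ..., "score": ...} shape mirrors the common pattern of
    naming each metric so it can gate CI; the field names here are an
    assumption, not the LangSmith API.
    """
    score = 1.0 if outputs.get("answer") == reference.get("answer") else 0.0
    return {"key": "accuracy", "score": score}

result = accuracy_evaluator({"answer": "Paris"}, {"answer": "Paris"})
miss = accuracy_evaluator({"answer": "Lyon"}, {"answer": "Paris"})
```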

references/ab-testing.md


Load when:
  • You need offline A/B comparison methodology.
  • You need significance testing and interpretation.
  • You need production traffic split strategy and guardrails.

Assets


assets/templates/test_template.py


  • Runnable Python pytest template aligned with current LangGraph testing patterns.
  • Includes:
    • Compiled-graph invocation with `thread_id`
    • Single-node testing via `compiled_graph.nodes[...]`
    • Integration-test placeholder
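The template's single-node pattern can be sketched even without LangGraph installed: treat the node as a plain function over state and inject a fake model. A minimal sketch; `FakeLLM`, `classify_node`, and the state shape are hypothetical stand-ins for your own graph code, not part of the template:

```python
class FakeLLM:
    """Deterministic stand-in for a chat model (hypothetical interface)."""
    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply
        self.calls = []

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.canned_reply


def classify_node(state: dict, llm) -> dict:
    """Example node: routes based on the model's one-word label."""
    label = llm.invoke(f"Classify: {state['question']}").strip().lower()
    route = "search" if label == "factual" else "chitchat"
    return {**state, "route": route}


def test_classify_node_routes_factual_questions():
    llm = FakeLLM("factual")
    out = classify_node({"question": "Who wrote Hamlet?"}, llm)
    assert out["route"] == "search"
    assert len(llm.calls) == 1  # exactly one model call, no retries


test_classify_node_routes_factual_questions()
```

Because the fake records its calls, the test can assert on how the node used the model, not just on the returned state.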

assets/datasets/sample_dataset.json


  • Deterministic seed dataset for LangSmith ingestion.
  • Uses the `examples: [{ inputs, outputs, metadata }]` format.

assets/examples/README.md


  • Documentation-only index for current asset usage.
  • Notes where runnable assets live today.

Script Interface Summary


scripts/generate_test_cases.py / .js

Use for fast test scaffolding.
Inputs:
  • Graph module path
    • Python: `my_module:graph` or `my_module.graph`
    • JS/TS: `./file.ts:graph`
Outputs:
  • Framework-specific starter tests in the target directory.

scripts/run_trajectory_eval.py / .js

Use for trajectory scoring with either:
  • `--method match`
  • `--method llm-judge`
Supports:
  • Local dataset files (`.json`)
  • LangSmith dataset names
  • Optional reference trajectory file with `--reference-trajectory`
  • Match modes: `strict`, `unordered`, `subset`, `superset`
Local-only mode:
  • `--no-langsmith` in both the Python and JavaScript scripts (requires a local JSON dataset file)
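The four match modes can be illustrated with plain list comparisons. This is a sketch of the semantics, not the scripts' actual implementation; trajectories are simplified here to lists of node/tool names:

```python
from collections import Counter

def trajectory_matches(actual, reference, mode="strict"):
    """Compare a run's step sequence against a reference trajectory.

    strict    - same steps in the same order
    unordered - same steps (with multiplicity), any order
    subset    - every actual step appears in the reference
    superset  - every reference step appears in the actual run
    """
    if mode == "strict":
        return actual == reference
    if mode == "unordered":
        return Counter(actual) == Counter(reference)
    if mode == "subset":
        return set(actual) <= set(reference)
    if mode == "superset":
        return set(actual) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")

reference = ["plan", "search", "summarize"]
actual = ["search", "plan", "summarize"]  # equivalent path, different order
assert not trajectory_matches(actual, reference, "strict")
assert trajectory_matches(actual, reference, "unordered")
```

This also shows why strict matching inflates variance on workflows with equivalent paths: reordering alone fails `strict` while passing `unordered`.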

scripts/evaluate_with_langsmith.py / .js

Use for dataset-based evaluation runs and experiment tracking.
Supports:
  • Existing datasets by name
  • Dataset creation from a JSON examples file
  • Multiple evaluators (`--evaluators accuracy,latency,...`)
  • Concurrency control (`--max-concurrency`)
Python-only:
  • `--no-upload` to run without uploading experiment results

scripts/compare_agents.py / .js

Use for offline version comparisons:
  • Shared dataset input
  • Success/latency summaries
  • JSON report output for CI artifacts
  • Local JSON datasets or LangSmith datasets (the JS script supports `--no-langsmith` to disable remote loading)
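A comparison report like this can feed a go/no-go gate in CI. A hedged sketch: the report field names (`success_rate`, `p95_latency_ms`) are assumptions for illustration, not the script's documented output; adapt them to your actual `comparison_report.json`:

```python
# Hypothetical report shape; inspect your real comparison_report.json
# (e.g. report = json.load(open("comparison_report.json"))) and adjust.
report = {
    "baseline": {"success_rate": 0.90, "p95_latency_ms": 1200},
    "candidate": {"success_rate": 0.93, "p95_latency_ms": 1350},
}

def gate(report, min_quality_gain=0.0, max_latency_regression=0.2):
    """Pass only if quality does not drop and p95 latency grows < 20%."""
    base, cand = report["baseline"], report["candidate"]
    quality_ok = cand["success_rate"] - base["success_rate"] >= min_quality_gain
    latency_ok = cand["p95_latency_ms"] <= base["p95_latency_ms"] * (1 + max_latency_regression)
    return quality_ok and latency_ok

# In CI, exit nonzero on failure:
#   sys.exit(0 if gate(report) else 1)
passed = gate(report)
```

Gating on both axes keeps a latency regression from hiding behind a quality win, matching the "check both quality and latency" rule above.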

scripts/mock_llm_responses.py / .js

Use for deterministic test doubles:
  • `single`
  • `sequence`
  • `conditional`
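The three response types map naturally onto a small test double. A sketch of the idea, not the scripts' actual config format; the class and method names here are hypothetical:

```python
class MockLLM:
    """Deterministic test double covering the three response types."""

    def __init__(self, single=None, sequence=None, conditional=None):
        self.single = single                  # one fixed reply
        self.sequence = list(sequence or [])  # replies consumed in order
        self.conditional = conditional or {}  # prompt substring -> reply

    def invoke(self, prompt: str) -> str:
        # Conditional rules take priority, then the ordered sequence,
        # then the single fallback reply.
        for trigger, reply in self.conditional.items():
            if trigger in prompt:
                return reply
        if self.sequence:
            return self.sequence.pop(0)
        return self.single or ""

mock = MockLLM(sequence=["plan", "final answer"],
               conditional={"weather": "It is sunny."})
assert mock.invoke("weather in Paris?") == "It is sunny."
assert mock.invoke("step 1") == "plan"
assert mock.invoke("step 2") == "final answer"
```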

Decision Rules


If behavior is deterministic and local:
  • Use unit tests first.
If behavior depends on tool sequence/routing:
  • Add trajectory evaluation.
If behavior depends on realistic distribution quality:
  • Run LangSmith dataset evaluation.
If approving a replacement model/prompt/graph:
  • Run A/B comparison and check both quality and latency.

Common Failure Patterns


Flaky tests


  • Cause: real-model nondeterminism in unit scope.
  • Fix: mock LLM/tool calls for unit tests; reserve real-model tests for separate integration marks.

High trajectory variance


  • Cause: overly strict matching for workflows with equivalent paths.
  • Fix: switch the match mode (`unordered`, `subset`, or `superset`) where appropriate.

Regressions hidden by averages


  • Cause: only aggregate score monitored.
  • Fix: inspect per-example failures and segment by category metadata.

Latency regressions with same quality


  • Cause: no explicit latency gate.
  • Fix: include latency evaluator and CI threshold.

Minimal Best Practices


  1. Keep fast deterministic tests as the largest share.
  2. Version datasets and keep them stable.
  3. Track both correctness and latency.
  4. Add explicit go/no-go thresholds in CI.
  5. Compare candidate vs baseline before production rollout.
  6. Investigate failures with trace-level evidence, not only aggregate scores.