langgraph-testing-evaluation

LangGraph Testing & Evaluation


Practical workflows for validating agent quality with:
  • Unit/integration tests
  • Trajectory evaluation
  • LangSmith dataset evaluations
  • A/B-style comparisons between versions
Use this file for the high-level flow. Load `references/*` for detailed implementation.

Start Here


Choose the smallest approach that answers your question:

| Goal | Primary method | Load first |
| --- | --- | --- |
| Validate node logic quickly | Unit tests with mocks | `references/unit-testing-patterns.md` |
| Validate multi-step agent behavior | Trajectory evaluation | `references/trajectory-evaluation.md` |
| Track quality across datasets over time | LangSmith evaluation | `references/langsmith-evaluation.md` |
| Compare old vs. new agent versions | A/B comparison | `references/ab-testing.md` |
Recommended order:
  1. Unit tests
  2. Integration/trajectory checks
  3. Dataset evaluation in LangSmith
  4. A/B comparison before deployment

Quick Commands


Run from repo root.

Generate test scaffolding


Python (preferred)

```bash
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```

Run trajectory evaluation


Python: LLM-as-judge

```bash
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini
```

Python: trajectory match

```bash
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```

Run LangSmith dataset evaluation


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4
```

Python (do not upload experiment results)

```bash
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```

Compare two agent versions


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json
```

JavaScript/TypeScript (force local dataset file only)

```bash
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```

Create mock response configs


Python

```bash
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json
```

JavaScript/TypeScript

```bash
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```

Core Workflow


  1. Define test scope.
  • Unit: deterministic logic in one node/function.
  • Integration: node interactions and routing.
  • End-to-end: complete response quality on realistic inputs.
  2. Start from deterministic checks.
  • Mock LLM/tool I/O for speed and repeatability.
  • Keep real-model tests as a smaller, explicit suite.
  3. Build/curate dataset examples.
  • Use stable inputs and expected outputs.
  • Keep the schema simple: `inputs` and `outputs` objects (optional `metadata`).
  • Compatibility note: scripts also accept singular keys (`input`, `output`) for legacy datasets.
  4. Run evaluation with explicit gates.
  • Use evaluator keys that map to deployment decisions.
  • Set thresholds in CI to prevent regressions.
  5. Compare versions before rollout.
  • Run the same dataset on both versions.
  • Check both quality and latency.
  6. Diagnose failures from traces/experiments.
  • Inspect low-scoring examples.
  • Split failures by pattern (routing, tool usage, hallucination, latency spikes).
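The dataset schema in step 3 can be sketched concretely. A minimal example, assuming hypothetical question/answer values; only the `examples`/`inputs`/`outputs`/`metadata` keys and the legacy singular-key note come from the workflow above:

```python
import json

# Minimal dataset following the schema above: each example has
# "inputs" and "outputs" objects plus optional "metadata".
# The question/answer values are illustrative placeholders.
dataset = {
    "examples": [
        {
            "inputs": {"question": "What is the capital of France?"},
            "outputs": {"answer": "Paris"},
            "metadata": {"category": "factual"},
        },
        {
            # Legacy singular keys are also accepted by the scripts,
            # per the compatibility note above.
            "input": {"question": "Summarize: LangGraph builds agents as graphs."},
            "output": {"answer": "LangGraph models agent workflows as graphs."},
        },
    ]
}

def normalize(example: dict) -> dict:
    """Map legacy singular keys onto the plural schema."""
    return {
        "inputs": example.get("inputs", example.get("input")),
        "outputs": example.get("outputs", example.get("output")),
        "metadata": example.get("metadata", {}),
    }

examples = [normalize(e) for e in dataset["examples"]]
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

Keeping the normalization in one helper lets the rest of a test suite assume the plural schema.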

Current References (Load On Demand)


references/unit-testing-patterns.md


Load when:
  • You need node-level and routing test patterns.
  • You need pytest/vitest/Jest integration patterns.
  • You need robust mocking and flaky-test reduction.

references/trajectory-evaluation.md


Load when:
  • You need trajectory match evaluation (`strict`, `unordered`, `subset`, `superset`).
  • You need LLM-as-judge trajectory scoring.
  • You need LangSmith experiment comparison for trajectory results.

references/langsmith-evaluation.md


Load when:
  • You need dataset creation/management in LangSmith.
  • You need evaluator signatures and experiment runs in Python/TS.
  • You need CI-friendly workflows with quality thresholds.
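An evaluator in this style is, at heart, a function that scores one example. A minimal sketch of the idea, deliberately not tied to the real LangSmith evaluator signature (see the reference above for that); here it is a plain function over output/reference dicts:

```python
def accuracy_evaluator(outputs: dict, reference: dict) -> dict:
    """Toy exact-match evaluator returning a keyed score.

    The {"key": ..., "score": ...} shape mirrors the common pattern of
    naming each metric so it can gate CI; the field names here are an
    assumption, not the LangSmith API.
    """
    score = 1.0 if outputs.get("answer") == reference.get("answer") else 0.0
    return {"key": "accuracy", "score": score}

result = accuracy_evaluator({"answer": "Paris"}, {"answer": "Paris"})
miss = accuracy_evaluator({"answer": "Lyon"}, {"answer": "Paris"})
```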

references/ab-testing.md


Load when:
  • You need offline A/B comparison methodology.
  • You need significance testing and interpretation.
  • You need production traffic split strategy and guardrails.

Assets


assets/templates/test_template.py


  • Runnable Python pytest template aligned with current LangGraph testing patterns.
  • Includes:
    • Compiled-graph invocation with `thread_id`
    • Single-node testing via `compiled_graph.nodes[...]`
    • Integration-test placeholder
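The template's single-node pattern can be sketched even without LangGraph installed: treat the node as a plain function over state and inject a fake model. A minimal sketch; `FakeLLM`, `classify_node`, and the state shape are hypothetical stand-ins for your own graph code, not part of the template:

```python
class FakeLLM:
    """Deterministic stand-in for a chat model (hypothetical interface)."""
    def __init__(self, canned_reply: str):
        self.canned_reply = canned_reply
        self.calls = []

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.canned_reply


def classify_node(state: dict, llm) -> dict:
    """Example node: routes based on the model's one-word label."""
    label = llm.invoke(f"Classify: {state['question']}").strip().lower()
    route = "search" if label == "factual" else "chitchat"
    return {**state, "route": route}


def test_classify_node_routes_factual_questions():
    llm = FakeLLM("factual")
    out = classify_node({"question": "Who wrote Hamlet?"}, llm)
    assert out["route"] == "search"
    assert len(llm.calls) == 1  # exactly one model call, no retries


test_classify_node_routes_factual_questions()
```

Because the fake records its calls, the test can assert on how the node used the model, not just on the returned state.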

assets/datasets/sample_dataset.json


  • Deterministic seed dataset for LangSmith ingestion.
  • Uses the `examples: [{ inputs, outputs, metadata }]` format.

assets/examples/README.md


  • Documentation-only index for current asset usage.
  • Notes where runnable assets live today.

Script Interface Summary


scripts/generate_test_cases.py / .js

Use for fast test scaffolding.
Inputs:
  • Graph module path
    • Python: `my_module:graph` or `my_module.graph`
    • JS/TS: `./file.ts:graph`
Outputs:
  • Framework-specific starter tests in the target directory.

scripts/run_trajectory_eval.py / .js

Use for trajectory scoring with either:
  • `--method match`
  • `--method llm-judge`
Supports:
  • Local dataset files (`.json`)
  • LangSmith dataset names
  • Optional reference trajectory file with `--reference-trajectory`
  • Match modes: `strict`, `unordered`, `subset`, `superset`
Local-only mode:
  • `--no-langsmith` in both the Python and JavaScript scripts (requires a local JSON dataset file)
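The four match modes can be illustrated with plain list comparisons. This is a sketch of the semantics, not the scripts' actual implementation; trajectories are simplified here to lists of node/tool names:

```python
from collections import Counter

def trajectory_matches(actual, reference, mode="strict"):
    """Compare a run's step sequence against a reference trajectory.

    strict    - same steps in the same order
    unordered - same steps (with multiplicity), any order
    subset    - every actual step appears in the reference
    superset  - every reference step appears in the actual run
    """
    if mode == "strict":
        return actual == reference
    if mode == "unordered":
        return Counter(actual) == Counter(reference)
    if mode == "subset":
        return set(actual) <= set(reference)
    if mode == "superset":
        return set(actual) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")

reference = ["plan", "search", "summarize"]
actual = ["search", "plan", "summarize"]  # equivalent path, different order
assert not trajectory_matches(actual, reference, "strict")
assert trajectory_matches(actual, reference, "unordered")
```

This also shows why strict matching inflates variance on workflows with equivalent paths: reordering alone fails `strict` while passing `unordered`.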

scripts/evaluate_with_langsmith.py / .js

Use for dataset-based evaluation runs and experiment tracking.
Supports:
  • Existing datasets by name
  • Dataset creation from a JSON examples file
  • Multiple evaluators (`--evaluators accuracy,latency,...`)
  • Concurrency control (`--max-concurrency`)
Python-only:
  • `--no-upload` to run without uploading experiment results

scripts/compare_agents.py / .js

Use for offline version comparisons:
  • Shared dataset input
  • Success/latency summaries
  • JSON report output for CI artifacts
  • Local JSON datasets or LangSmith datasets (the JS script supports `--no-langsmith` to disable remote loading)
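A comparison report like this can feed a go/no-go gate in CI. A hedged sketch: the report field names (`success_rate`, `p95_latency_ms`) are assumptions for illustration, not the script's documented output; adapt them to your actual `comparison_report.json`:

```python
# Hypothetical report shape; inspect your real comparison_report.json
# (e.g. report = json.load(open("comparison_report.json"))) and adjust.
report = {
    "baseline": {"success_rate": 0.90, "p95_latency_ms": 1200},
    "candidate": {"success_rate": 0.93, "p95_latency_ms": 1350},
}

def gate(report, min_quality_gain=0.0, max_latency_regression=0.2):
    """Pass only if quality does not drop and p95 latency grows < 20%."""
    base, cand = report["baseline"], report["candidate"]
    quality_ok = cand["success_rate"] - base["success_rate"] >= min_quality_gain
    latency_ok = cand["p95_latency_ms"] <= base["p95_latency_ms"] * (1 + max_latency_regression)
    return quality_ok and latency_ok

# In CI, exit nonzero on failure:
#   sys.exit(0 if gate(report) else 1)
passed = gate(report)
```

Gating on both axes keeps a latency regression from hiding behind a quality win, matching the "check both quality and latency" rule above.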

scripts/mock_llm_responses.py / .js

Use for deterministic test doubles:
  • `single`
  • `sequence`
  • `conditional`
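The three response types map naturally onto a small test double. A sketch of the idea, not the scripts' actual config format; the class and method names here are hypothetical:

```python
class MockLLM:
    """Deterministic test double covering the three response types."""

    def __init__(self, single=None, sequence=None, conditional=None):
        self.single = single                  # one fixed reply
        self.sequence = list(sequence or [])  # replies consumed in order
        self.conditional = conditional or {}  # prompt substring -> reply

    def invoke(self, prompt: str) -> str:
        # Conditional rules take priority, then the ordered sequence,
        # then the single fallback reply.
        for trigger, reply in self.conditional.items():
            if trigger in prompt:
                return reply
        if self.sequence:
            return self.sequence.pop(0)
        return self.single or ""

mock = MockLLM(sequence=["plan", "final answer"],
               conditional={"weather": "It is sunny."})
assert mock.invoke("weather in Paris?") == "It is sunny."
assert mock.invoke("step 1") == "plan"
assert mock.invoke("step 2") == "final answer"
```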

Decision Rules


If behavior is deterministic and local:
  • Use unit tests first.
If behavior depends on tool sequence/routing:
  • Add trajectory evaluation.
If behavior depends on realistic distribution quality:
  • Run LangSmith dataset evaluation.
If approving a replacement model/prompt/graph:
  • Run A/B comparison and check both quality and latency.

Common Failure Patterns


Flaky tests


  • Cause: real-model nondeterminism in unit scope.
  • Fix: mock LLM/tool calls for unit tests; reserve real-model tests for separate integration marks.

High trajectory variance


  • Cause: overly strict matching for workflows with equivalent paths.
  • Fix: switch the match mode (`unordered`, `subset`, or `superset`) where appropriate.

Regressions hidden by averages


  • Cause: only aggregate score monitored.
  • Fix: inspect per-example failures and segment by category metadata.

Latency regressions with same quality


  • Cause: no explicit latency gate.
  • Fix: include latency evaluator and CI threshold.

Minimal Best Practices


  1. Keep fast deterministic tests as the largest share.
  2. Version datasets and keep them stable.
  3. Track both correctness and latency.
  4. Add explicit go/no-go thresholds in CI.
  5. Compare candidate vs baseline before production rollout.
  6. Investigate failures with trace-level evidence, not only aggregate scores.