Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.
```shell
npx skill4agent add lubu-labs/langchain-agent-skills langgraph-testing-evaluation
```

| Goal | Primary method | Load first |
|---|---|---|
| Validate node logic quickly | Unit tests with mocks | `references/unit-testing-patterns.md` |
| Validate multi-step agent behavior | Trajectory evaluation | `references/trajectory-evaluation.md` |
| Track quality on datasets over time | LangSmith evaluation | `references/langsmith-evaluation.md` |
| Compare old vs new agent versions | A/B comparison | `references/ab-testing.md` |
```shell
# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
```
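Both scripts take the agent as a `module:attribute` spec (e.g. `my_agent:graph`). A minimal resolver for that convention on the Python side might look like this; the scripts' actual loader may differ (and the JS side additionally handles file paths like `./file.ts:graph`).

```python
import importlib

def load_target(spec: str):
    """Resolve a 'package.module:attr' spec to the named object.

    Illustrative sketch of the spec convention, not the scripts' real loader.
    """
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# e.g. load_target("my_agent:graph") imports my_agent and returns my_agent.graph
```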
```shell
# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini

# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
```
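As a rough sketch of what `--trajectory-match-mode` compares, assume a trajectory is the ordered list of tool/node names the agent visited; the real scripts may compare richer step objects, but the four modes reduce to list and multiset comparisons like these:

```python
from collections import Counter

def trajectory_matches(actual, reference, mode="strict"):
    """Hedged sketch of the four match modes over lists of step names."""
    if mode == "strict":      # same steps, same order
        return actual == reference
    if mode == "unordered":   # same steps, any order (multiset equality)
        return Counter(actual) == Counter(reference)
    if mode == "subset":      # every actual step appears in the reference
        return not (Counter(actual) - Counter(reference))
    if mode == "superset":    # every reference step appears in the actual run
        return not (Counter(reference) - Counter(actual))
    raise ValueError(f"unknown mode: {mode}")
```

`strict` is the tightest regression gate; `superset` is useful when the reference only lists the steps that must occur, in any order and possibly among others.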
```shell
# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4

# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
```
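Evaluators like the `accuracy` named in `--evaluators` generally reduce to a callable that scores run outputs against reference outputs and returns a keyed score. A hypothetical evaluator in that shape; the exact signature the script expects is an assumption here:

```python
def accuracy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the agent's answer exactly matches the reference.

    Illustrative evaluator shape; swap exact match for fuzzier checks
    (normalization, containment, judge models) as the task demands.
    """
    match = outputs.get("answer") == reference_outputs.get("answer")
    return {"key": "accuracy", "score": 1.0 if match else 0.0}
```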
```shell
# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json

# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
```
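An A/B comparison typically aggregates paired per-example scores into wins, losses, and ties. An illustrative reduction of the kind of summary a report like `comparison_report.json` could contain; the field names are hypothetical:

```python
def compare_scores(v1_scores, v2_scores):
    """Reduce paired per-example scores to win/loss/tie counts."""
    report = {"v1_wins": 0, "v2_wins": 0, "ties": 0}
    for s1, s2 in zip(v1_scores, v2_scores):
        if s1 > s2:
            report["v1_wins"] += 1
        elif s2 > s1:
            report["v2_wins"] += 1
        else:
            report["ties"] += 1
    return report
```

Pairing scores per example (rather than comparing averages) keeps the comparison sensitive to regressions that a mean would hide.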
```shell
# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
```

Files and options referenced by this skill:

- `scripts/generate_test_cases.py` / `.js` — agent spec as `my_module:graph`, `my_module.graph`, or `./file.ts:graph`
- `scripts/run_trajectory_eval.py` / `.js` — `--method match` or `--method llm-judge`; a `.json` reference via `--reference-trajectory`; match modes `strict`, `unordered`, `subset`, `superset`; `--no-langsmith`
- `scripts/evaluate_with_langsmith.py` / `.js` — `--evaluators accuracy,latency,...`, `--max-concurrency`, `--no-upload`
- `scripts/compare_agents.py` / `.js` — `--no-langsmith`
- `scripts/mock_llm_responses.py` / `.js`
- `references/unit-testing-patterns.md`, `references/trajectory-evaluation.md`, `references/langsmith-evaluation.md`, `references/ab-testing.md`
- `assets/templates/test_template.py` — uses `thread_id` and `compiled_graph.nodes[...]`
- `assets/datasets/sample_dataset.json` — `examples: [{ inputs, outputs, metadata }]`
- `assets/examples/README.md`
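One plausible way a `--type sequence` mock config is consumed: responses are replayed in order across successive model calls. The schema below is an illustrative assumption, not the script's documented format.

```python
import json

# Hypothetical shape of a sequence-type mock_config.json.
mock_config = json.loads("""
{
  "type": "sequence",
  "responses": ["I'll check the weather.", "It is sunny."]
}
""")

class SequenceMock:
    """Replays configured responses in order, one per invoke() call."""
    def __init__(self, config):
        assert config["type"] == "sequence"
        self._responses = iter(config["responses"])

    def invoke(self, _messages):
        return next(self._responses)

mock = SequenceMock(mock_config)
```

A sequence mock is a good fit for multi-step agent tests, where each turn of the loop should see the next canned reply.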