langgraph-error-handling
LangGraph Error Handling
Use This Skill For
- Adding `RetryPolicy` to flaky nodes (API, DB, model/tool calls)
- Designing LLM recovery loops (`Command` + error state + retry counters)
- Adding human approval/escalation with `interrupt()` and resume
- Handling prebuilt `ToolNode` failures
- Debugging transactional failure behavior in parallel supersteps
Strategy Selection
Use this order:
- Transient/infrastructure issue (`429`, timeout, `5xx`, temporary DB lock) -> `RetryPolicy`
- Recoverable by model/tool args correction -> store error in state and route back with `Command`
- Needs user approval or missing info -> `interrupt()` + resume
- Unknown/programming bug -> let it bubble up and debug
| Error Type | Owner | Primary Mechanism |
|---|---|---|
| Transient | System | `RetryPolicy` |
| LLM-recoverable | LLM | State update + `Command` |
| User-fixable | Human | `interrupt()` + resume |
| Unexpected | Developer | Raise/log/debug |
For full taxonomy, load references/error-types.md.
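The selection order above can be sketched as a small classifier. This is only an illustration and not the skill's `scripts/classify_error.py`: the `Strategy` enum, the `ToolArgumentError` and `MissingUserInput` exception classes, and the `status_code` attribute check are all hypothetical names chosen for the example.

```python
from enum import Enum


class Strategy(Enum):
    RETRY = "RetryPolicy"
    LLM_RECOVERY = "state update + Command"
    HUMAN = "interrupt() + resume"
    RAISE = "raise/log/debug"


class ToolArgumentError(Exception):
    """Hypothetical: the model produced arguments the tool rejected."""


class MissingUserInput(Exception):
    """Hypothetical: the graph cannot continue without a human decision."""


def choose_strategy(exc: Exception) -> Strategy:
    # 1. Transient/infrastructure problems: timeouts, dropped connections,
    #    HTTP 429 and 5xx (read from a hypothetical `status_code` attribute).
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Strategy.RETRY
    status = getattr(exc, "status_code", None)
    if isinstance(status, int) and (status == 429 or 500 <= status < 600):
        return Strategy.RETRY
    # 2. The model can fix this itself if it sees the error.
    if isinstance(exc, ToolArgumentError):
        return Strategy.LLM_RECOVERY
    # 3. Only a human can supply the missing approval or information.
    if isinstance(exc, MissingUserInput):
        return Strategy.HUMAN
    # 4. Anything else is treated as a bug and allowed to bubble up.
    return Strategy.RAISE
```

The checks run in the same order as the list, so an exception that could match more than one category is handled by the cheapest mechanism first.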
Minimal Patterns
1) Retry Transient Failures
```python
from langgraph.types import RetryPolicy

builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
```

```ts
builder.addNode("callApi", callApi, {
  retryPolicy: { maxAttempts: 3, initialInterval: 1.0 },
});
```

Notes:
- Python and JS default retry behavior differs by exception type.
- Prefer targeted `retry_on` / `retryOn` for non-transient domains.
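A targeted retry filter can be a plain predicate. The sketch below assumes the Python `RetryPolicy` accepts a callable for `retry_on` that takes the exception and returns a bool; `retry_on_transient` and the `RETRYABLE` tuple are hypothetical names and should be adapted to the exceptions your client library actually raises.

```python
# Allow-list of exception types that are worth another attempt.
RETRYABLE = (TimeoutError, ConnectionError)


def retry_on_transient(exc: Exception) -> bool:
    """Return True only for failures that a later attempt could fix."""
    return isinstance(exc, RETRYABLE)


# Wiring (not executed here; assumes `builder`, `call_api`, and the
# `RetryPolicy` import from the example above):
# builder.add_node(
#     "call_api",
#     call_api,
#     retry_policy=RetryPolicy(max_attempts=3, retry_on=retry_on_transient),
# )
```

An allow-list keeps validation and programming errors such as `ValueError` out of the retry path, so they surface immediately instead of being retried until the attempt budget runs out.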
2) LLM Recovery Loop
Use `MessagesState` in Python for message state.

```python
from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command


class State(MessagesState):
    error: NotRequired[str]        # last tool/model error, if any
    retry_count: NotRequired[int]  # recovery attempts so far


def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    # Stop once the retry budget is spent so the loop cannot run forever.
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    # An error is in state: route back to the tool for another attempt.
    if state.get("error"):
        return Command(goto="tool")
    # Normal path.
    return Command(goto="tool")
```

```ts
import { StateGraph, Command, END } from "@langchain/langgraph";
// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
```
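A recovery loop also has to increment the counter somewhere, otherwise the termination check never fires. A minimal sketch of that bookkeeping, written as a pure function so it can be tested without a graph: `plan_next_step`, `MAX_RETRIES`, and the returned `(goto, update)` pair are hypothetical names, and in a real node the pair would be passed on as `Command(goto=..., update=...)`.

```python
MAX_RETRIES = 3  # hypothetical budget, matching the limit used above


def plan_next_step(state: dict) -> tuple[str, dict]:
    """Decide where to route next and which state keys to update."""
    retries = state.get("retry_count", 0)
    if state.get("error"):
        if retries >= MAX_RETRIES:
            # Budget exhausted: take the terminal branch instead of looping.
            return "__end__", {}
        # Leave the error in state so the model can see it; count the attempt.
        return "tool", {"retry_count": retries + 1}
    # No error recorded: normal path, reset the counter.
    return "tool", {"retry_count": 0}
```

Keeping the counter in graph state rather than in a module-level variable means it is checkpointed with the rest of the thread and survives an interrupt or a process restart.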
3) Human-In-The-Loop Escalation
```python
from langgraph.types import interrupt, Command


def human_review(state):
    # Pauses the graph here; the value passed to Command(resume=...) is returned.
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")


# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})
```

```ts
import { Command, interrupt } from "@langchain/langgraph";
const approved = interrupt({ question: "Proceed?" });
// later
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});
```

Requirements:
- Compile with a checkpointer for interrupt flows.
- Reuse the same `thread_id` on resume.

For deep HITL patterns, load references/human-escalation.md.
ToolNode Error Handling
```python
from langgraph.prebuilt import ToolNode

tool_node = ToolNode(tools, handle_tool_errors=True)
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))
```

Use custom handlers when you need deterministic error shaping for model recovery.
For broader tool-recovery design, load references/llm-recovery.md.
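A custom handler can be an ordinary function that turns an exception into the message sent back to the model. The sketch below assumes the installed `ToolNode` accepts a callable for `handle_tool_errors` (check its signature for your version); `format_tool_error` and the message layout are hypothetical.

```python
def format_tool_error(exc: Exception) -> str:
    """Produce a stable, single-line message the model can act on."""
    # Collapse whitespace so multi-line tracebacks do not leak into the prompt.
    detail = " ".join(str(exc).split()) or "no details"
    # Truncate long messages to keep the recovery prompt small.
    return (
        f"Tool call failed [{type(exc).__name__}]: {detail[:200]}. "
        "Fix the arguments and try again."
    )


# Wiring (not executed here; assumes `tools` and the ToolNode import above):
# tool_node = ToolNode(tools, handle_tool_errors=format_tool_error)
```

Because the output always has the same shape, the recovery prompt and any tests that match on it stay stable when the underlying exception text changes.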
Critical Behavior (Do Not Skip)
- Supersteps are transactional: one failing parallel branch fails the whole superstep state update.
- `RetryPolicy` retries failing branches, not successful siblings.
- `interrupt()` re-runs the node on resume: side effects before the interrupt must be idempotent, or moved after the interrupt or into a separate node.
- JS `Command` routing requires `ends` metadata on `addNode(...)`.
- Use explicit retry limits (`max_attempts`, plus state counters for recovery loops).
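Because the node body runs again on resume, a side effect placed before `interrupt()` executes twice unless it is guarded. One way to make it idempotent is to key the effect on a stable identifier. This is a sketch under stated assumptions: the in-memory set stands in for a durable store, and `send_once`, `_sent_keys`, and `deliver` are hypothetical names.

```python
from typing import Callable

# Stand-in for a durable store (database row, cache key); a real graph
# needs storage that survives the pause between interrupt and resume.
_sent_keys: set[str] = set()


def send_once(key: str, deliver: Callable[[], None]) -> bool:
    """Run `deliver` at most once per key; return True if it ran this time."""
    if key in _sent_keys:
        return False
    deliver()
    # Record the key only after success so a failed delivery can be retried.
    _sent_keys.add(key)
    return True
```

Calling `send_once("order-1:notify", notify)` on the first pass and again after resume performs the notification only once; the simpler alternative is to move the effect into a node that runs after the approval.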
Local Assets In This Skill
Scripts
- `scripts/classify_error.py`: classify exception category and recommended handling
- `scripts/wrap_with_retry.py`: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:

```bash
uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery
```
Examples
- `assets/examples/retry-example/`: retry + recovery loop (Python and JS)
- `assets/examples/human-loop-example/`: interrupt/resume approval flow (Python and JS)
Load References On Demand
- `references/error-types.md`: error taxonomy and classification rules
- `references/retry-strategies.md`: retry tuning, backoff, circuit-breaker-style patterns
- `references/llm-recovery.md`: recovery-loop and ToolNode strategies
- `references/human-escalation.md`: human approval, interrupts, and escalation patterns
Common Failure Modes
| Symptom | Root Cause | Fix |
|---|---|---|
| `interrupt()` flow fails | no checkpointer | compile with checkpointer |
| Resume starts new run | different `thread_id` | reuse same `thread_id` |
| JS Command route not taken | missing `ends` | add `ends` on `addNode(...)` |
| Infinite loop | no termination counter/condition | add retry counter + terminal branch |
| Retry never triggers | exception excluded by retry filter | set explicit `retry_on` / `retryOn` |