langgraph-error-handling

LangGraph Error Handling

Use This Skill For

- Adding `RetryPolicy` to flaky nodes (API, DB, model/tool calls)
- Designing LLM recovery loops (`Command` + error state + retry counters)
- Adding human approval/escalation with `interrupt()` and resume
- Handling prebuilt `ToolNode` failures
- Debugging transactional failure behavior in parallel supersteps

Strategy Selection

Use this order:

1. Transient/infrastructure issue (`429`, timeout, `5xx`, temporary DB lock) -> `RetryPolicy`
2. Recoverable by model/tool args correction -> store error in state and route back with `Command`
3. Needs user approval or missing info -> `interrupt()` + resume
4. Unknown/programming bug -> let it bubble up and debug

| Error Type | Owner | Primary Mechanism |
| --- | --- | --- |
| Transient | System | `RetryPolicy` |
| LLM-recoverable | LLM | State update + `Command(goto=...)` |
| User-fixable | Human | `interrupt()` + `Command(resume=...)` |
| Unexpected | Developer | Raise/log/debug |

For full taxonomy, load references/error-types.md.
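
The ordering above can be sketched as a small dispatcher. This is plain Python with no LangGraph import, and the exception groupings (`TRANSIENT`, `LLM_RECOVERABLE`, `USER_FIXABLE`) are hypothetical placeholders rather than anything this skill or the library defines; a real classifier would follow references/error-types.md.

```python
from enum import Enum


class Strategy(Enum):
    RETRY = "RetryPolicy"
    LLM_RECOVERY = "Command(goto=...)"
    HUMAN = "interrupt()"
    RAISE = "raise"


# Hypothetical groupings; adjust them to the errors your nodes actually raise.
TRANSIENT = (TimeoutError, ConnectionError)
LLM_RECOVERABLE = (ValueError, KeyError)
USER_FIXABLE = (PermissionError,)


def choose_strategy(exc: BaseException) -> Strategy:
    """Apply the checks in the same order as the strategy list."""
    if isinstance(exc, TRANSIENT):
        return Strategy.RETRY
    if isinstance(exc, LLM_RECOVERABLE):
        return Strategy.LLM_RECOVERY
    if isinstance(exc, USER_FIXABLE):
        return Strategy.HUMAN
    # Anything unrecognized is treated as a bug and left to bubble up.
    return Strategy.RAISE
```

Checking the transient group first keeps cheap automatic retries ahead of the more expensive model and human paths.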

Minimal Patterns

1) Retry Transient Failures

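```python
from langgraph.types import RetryPolicy

builder.add_node(
    "call_api",
    call_api,
    # Up to 3 attempts; the first retry waits about 1 second.
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
```

```ts
builder.addNode("callApi", callApi, {
  // JS intervals are in milliseconds, so 1000 matches the 1.0 s used in Python.
  retryPolicy: { maxAttempts: 3, initialInterval: 1000 },
});
```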

Notes:

- Python and JS apply different default rules for which exception types are retried.
- Prefer a targeted `retry_on` / `retryOn` filter when a node can also raise non-transient errors, so that only transient ones are retried.
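
A targeted filter could look like the sketch below. In Python, `retry_on` accepts exception types or a callable that takes the exception and returns a bool; the predicate here is only illustrative, and `ApiError` with its `status` attribute is a hypothetical stand-in for whatever your HTTP client raises.

```python
class ApiError(Exception):
    """Hypothetical client error carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(exc: Exception) -> bool:
    """Return True only for errors that are worth retrying."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, ApiError):
        # 429 (rate limit) and 5xx are transient; other 4xx codes are not.
        return exc.status == 429 or 500 <= exc.status < 600
    return False


# Assumed wiring (not executed here):
# RetryPolicy(max_attempts=3, initial_interval=1.0, retry_on=is_transient)
```

With a filter like this, a `400` or a `ValueError` fails immediately instead of consuming retry attempts.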

2) LLM Recovery Loop

Use `MessagesState` in Python for message state.

```python
from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command

class State(MessagesState):
    error: NotRequired[str]        # last tool error, if any
    retry_count: NotRequired[int]  # recovery attempts so far

def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    retries = state.get("retry_count", 0)
    # Terminal branch: stop once the retry budget is spent.
    if retries >= 3:
        return Command(goto="__end__")
    # A stored error means the last tool call failed: count the attempt and route back.
    if state.get("error"):
        return Command(update={"retry_count": retries + 1}, goto="tool")
    return Command(goto="tool")
```

```ts
import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
```
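
To see why the counter matters, here is a plain-Python simulation of the loop with no LangGraph import. `flaky_tool`, `tool_step`, `agent_step`, and the `repair` callback are all hypothetical names; `repair` stands in for the model correcting its arguments after reading the stored error.

```python
MAX_RETRIES = 3


def flaky_tool(args: dict) -> str:
    """Hypothetical tool that rejects a missing 'city' argument."""
    if "city" not in args:
        raise ValueError("missing required argument: city")
    return f"weather for {args['city']}"


def tool_step(state: dict) -> dict:
    """Run the tool; on failure, record the error and count the attempt."""
    try:
        result = flaky_tool(state["args"])
        return {**state, "result": result, "error": None}
    except ValueError as exc:
        return {**state, "error": str(exc),
                "retry_count": state.get("retry_count", 0) + 1}


def agent_step(state: dict) -> str:
    """Pick the next node name, mirroring Command(goto=...) routing."""
    if state.get("result") is not None:
        return "__end__"
    if state.get("retry_count", 0) >= MAX_RETRIES:
        return "__end__"  # terminal branch: retry budget spent
    return "tool"


def run(state: dict, repair=None) -> dict:
    """Drive the loop until the agent routes to the end."""
    while agent_step(state) == "tool":
        state = tool_step(state)
        if state.get("error") and repair is not None:
            # Stand-in for the model fixing its arguments from the error text.
            state = {**state, "args": repair(state["error"], state["args"])}
    return state
```

Without `repair`, the loop stops after three failed attempts instead of spinning forever; with a repair that adds the missing argument, it succeeds on the second tool call.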

3) Human-In-The-Loop Escalation

```python
from langgraph.types import interrupt, Command

def human_review(state):
    # Pauses the graph; the value passed on resume becomes `approved`.
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")

# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})
```

```ts
import { Command, interrupt } from "@langchain/langgraph";

const approved = interrupt({ question: "Proceed?" });
// later
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});
```

Requirements:

- Compile with a checkpointer for interrupt flows.
- Reuse the same `thread_id` on resume.

For deep HITL patterns, load references/human-escalation.md.

ToolNode Error Handling

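```python
from langgraph.prebuilt import ToolNode

# Pick one of the following configurations.
# Catch every tool error and return it to the model as a tool message.
tool_node = ToolNode(tools, handle_tool_errors=True)
# Catch every tool error and reply with a fixed message.
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
# Catch only these exception types; anything else propagates.
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))
```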

Use custom handlers when you need deterministic error shaping for model recovery. For broader tool-recovery design, load references/llm-recovery.md.
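
One possible custom handler is sketched below. `handle_tool_errors` also accepts a callable that returns a string; the function name `format_tool_error` and the `TOOL_ERROR kind=... detail=...` message layout are illustrative choices, not a format the library prescribes.

```python
def format_tool_error(exc: Exception) -> str:
    """Turn any tool exception into a stable, model-readable message."""
    kind = type(exc).__name__
    detail = str(exc) or "no details"
    return (
        f"TOOL_ERROR kind={kind} detail={detail!r}. "
        "Fix the arguments and call the tool again."
    )


# Assumed wiring (not executed here):
# tool_node = ToolNode(tools, handle_tool_errors=format_tool_error)
```

Keeping the message shape fixed means the model sees the same structure for every failure, which makes prompts and tests that depend on the error text easier to keep stable.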

Critical Behavior (Do Not Skip)

- Supersteps are transactional: one failing parallel branch fails the whole superstep state update.
- `RetryPolicy` retries failing branches, not successful siblings.
- `interrupt()` re-runs the node on resume: side effects before `interrupt()` must be idempotent, or moved after the interrupt or into a separate node.
- JS `Command` routing requires `ends` metadata on `addNode(...)`.
- Use explicit retry limits (`max_attempts`, plus state counters for recovery loops).
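
The re-run behavior of `interrupt()` is the easiest of these to get wrong, so here is a plain-Python simulation of it with no LangGraph import. `notify_once`, `review_node`, and the in-memory `sent_keys` / `outbox` stores are hypothetical; a real implementation would need a durable store for the idempotency keys.

```python
sent_keys: set[str] = set()  # stand-in for a durable record of completed side effects
outbox: list[str] = []       # stand-in for the external system being notified


def notify_once(key: str, message: str) -> bool:
    """Send the message only if this key has not been handled before."""
    if key in sent_keys:
        return False
    outbox.append(message)
    sent_keys.add(key)
    return True


def review_node(state: dict, resume_value=None) -> str:
    """Simulates a node body that runs once before the pause and again on resume."""
    # This line executes on BOTH runs, so it must be idempotent.
    notify_once(f"review:{state['request_id']}", "approval requested")
    if resume_value is None:
        return "paused"  # where interrupt() would suspend the run
    return "execute" if resume_value else "cancel"
```

Running the node once to pause and once to resume leaves a single notification in `outbox`; with a bare `outbox.append(...)` before the pause, the reviewer would be notified twice.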

Local Assets In This Skill

Scripts

- `scripts/classify_error.py`: classify exception category and recommended handling
- `scripts/wrap_with_retry.py`: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:

```bash
uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery
```

Examples

- `assets/examples/retry-example/`: retry + recovery loop (Python and JS)
- `assets/examples/human-loop-example/`: interrupt/resume approval flow (Python and JS)

Load References On Demand

- `references/error-types.md`: error taxonomy and classification rules
- `references/retry-strategies.md`: retry tuning, backoff, circuit-breaker-style patterns
- `references/llm-recovery.md`: recovery-loop and ToolNode strategies
- `references/human-escalation.md`: human approval, interrupts, and escalation patterns

Common Failure Modes

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| `interrupt()` fails at runtime | No checkpointer | Compile with a checkpointer |
| Resume starts a new run | Different `thread_id` | Reuse the same `thread_id` |
| JS `Command` route not taken | Missing `ends` | Add `ends` to `addNode` |
| Infinite loop | No termination counter/condition | Add retry counter + terminal branch |
| Retry never triggers | Exception excluded by retry filter | Set explicit `retry_on` / `retryOn` |
