langgraph-error-handling

LangGraph Error Handling

Use This Skill For

  • Adding `RetryPolicy` to flaky nodes (API, DB, model/tool calls)
  • Designing LLM recovery loops (`Command` + error state + retry counters)
  • Adding human approval/escalation with `interrupt()` and resume
  • Handling prebuilt `ToolNode` failures
  • Debugging transactional failure behavior in parallel supersteps

Strategy Selection

Use this order:
  1. Transient/infrastructure issue (`429`, timeout, `5xx`, temporary DB lock) -> `RetryPolicy`
  2. Recoverable by model/tool args correction -> store error in state and route back with `Command`
  3. Needs user approval or missing info -> `interrupt()` + resume
  4. Unknown/programming bug -> let it bubble up and debug

| Error Type | Owner | Primary Mechanism |
| --- | --- | --- |
| Transient | System | `RetryPolicy` |
| LLM-recoverable | LLM | State update + `Command(goto=...)` |
| User-fixable | Human | `interrupt()` + `Command(resume=...)` |
| Unexpected | Developer | Raise/log/debug |

For full taxonomy, load references/error-types.md.
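
The ordering above can be read as a first-match classifier. The following is a minimal sketch in plain Python, not the bundled `scripts/classify_error.py`; the `Strategy` enum, `ToolArgumentError`, `ApprovalRequired`, and `choose_strategy` are hypothetical names introduced only for illustration, and a real project would substitute its own exception types.

```python
from enum import Enum


class Strategy(Enum):
    RETRY = "RetryPolicy"
    LLM_RECOVERY = "state update + Command(goto=...)"
    HUMAN = "interrupt() + Command(resume=...)"
    RAISE = "raise/log/debug"


# Hypothetical exception classes standing in for an application's own errors.
class ToolArgumentError(Exception):
    """The model supplied arguments the tool rejected."""


class ApprovalRequired(Exception):
    """A human must approve or supply missing information."""


def choose_strategy(exc: Exception) -> Strategy:
    """Check categories in order: transient, LLM-recoverable, user-fixable, unexpected."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return Strategy.RETRY
    if isinstance(exc, ToolArgumentError):
        return Strategy.LLM_RECOVERY
    if isinstance(exc, ApprovalRequired):
        return Strategy.HUMAN
    # Anything unrecognized is treated as a bug and left to bubble up.
    return Strategy.RAISE
```

Checking the transient case first keeps cheap automatic retries ahead of the more expensive model or human round trips.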

Minimal Patterns

1) Retry Transient Failures

python
from langgraph.types import RetryPolicy

builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
ts
builder.addNode("callApi", callApi, {
  // JS retry intervals are in milliseconds, so 1000 matches the 1.0 s used in Python.
  retryPolicy: { maxAttempts: 3, initialInterval: 1000 },
});
Notes:
  • Default retry behavior differs between Python and JS, depending on the exception type.
  • Prefer a targeted `retry_on` / `retryOn` filter in domains where not every failure is transient.
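
A targeted filter can be written as a predicate over the exception. This is a sketch under assumptions: `HttpError` and `is_transient` are hypothetical names, and the wiring shown in the trailing comment relies on `retry_on` accepting a callable that returns a bool, which should be checked against the installed LangGraph version.

```python
# Hypothetical HTTP error type; a real client library would supply its own.
class HttpError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(exc: Exception) -> bool:
    """Return True only for failures that are worth retrying."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    if isinstance(exc, HttpError):
        # 429 (rate limited) and 5xx (server-side) are usually temporary.
        return exc.status == 429 or 500 <= exc.status < 600
    return False


# Assumed wiring, not executed here:
# RetryPolicy(max_attempts=3, initial_interval=1.0, retry_on=is_transient)
```

With a predicate like this, a `404` or a `ValueError` fails immediately instead of consuming the retry budget.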

2) LLM Recovery Loop

Use `MessagesState` in Python for message state.
python
from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command

class State(MessagesState):
    error: NotRequired[str]
    retry_count: NotRequired[int]

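def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    # Terminal branch: stop once the retry budget is exhausted.
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    # Recovery branch: a previous attempt failed, so route back to the tool
    # (a real agent would first correct the tool arguments using state["error"]).
    if state.get("error"):
        return Command(goto="tool")
    # Normal path: no error recorded.
    return Command(goto="tool")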
ts
import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
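
The pattern above leaves open where `error` and `retry_count` get written. One option, sketched here with plain dicts rather than a compiled graph, is for the tool step to catch the recoverable exception and return a state update. `run_tool`, `should_stop`, and the `tool_args` key are hypothetical names, and treating `ValueError` as the recoverable class is an assumption.

```python
from typing import Any, Callable

MAX_RETRIES = 3  # same limit the agent node checks


def run_tool(state: dict[str, Any], tool: Callable[[dict], Any]) -> dict[str, Any]:
    """Run a tool and return a state update instead of raising on bad arguments."""
    try:
        result = tool(state["tool_args"])
    except ValueError as exc:
        # Recoverable: record the error and count the attempt so the agent can adjust.
        return {"error": str(exc), "retry_count": state.get("retry_count", 0) + 1}
    # Success: clear any stale error so the loop can finish.
    return {"error": "", "result": result}


def should_stop(state: dict[str, Any]) -> bool:
    """Terminal condition that prevents an unbounded recovery loop."""
    return state.get("retry_count", 0) >= MAX_RETRIES
```

Keeping the counter in state, rather than in a local variable, means the limit survives across node invocations.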

3) Human-In-The-Loop Escalation

python
from langgraph.types import interrupt, Command

def human_review(state):
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")

# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})
ts
import { Command, interrupt } from "@langchain/langgraph";

const approved = interrupt({ question: "Proceed?" });
// later
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});
Requirements:
  • Compile with a checkpointer for interrupt flows.
  • Reuse the same `thread_id` on resume.
For deep HITL patterns, load references/human-escalation.md.

ToolNode Error Handling

python
from langgraph.prebuilt import ToolNode

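# Three alternative configurations; pick one.
# Catch every tool error and return a default error message to the model.
tool_node = ToolNode(tools, handle_tool_errors=True)
# Catch every tool error and return this fixed message instead.
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
# Catch only these exception types; anything else propagates.
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))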
Use custom handlers when you need deterministic error shaping for model recovery. For broader tool-recovery design, load references/llm-recovery.md.
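
A custom handler can be as small as a function that maps an exception to a stable string. The sketch below is one possible shape; `format_tool_error` is a hypothetical name, and passing it as `handle_tool_errors` assumes the installed `ToolNode` accepts a callable there.

```python
def format_tool_error(exc: Exception) -> str:
    """Turn any tool exception into a stable, single-line message for the model."""
    kind = type(exc).__name__
    # Collapse newlines and repeated spaces so the message shape never varies.
    detail = " ".join(str(exc).split()) or "no details"
    return f"Tool failed with {kind}: {detail}. Fix the arguments and call the tool again."


# Assumed wiring, not executed here:
# tool_node = ToolNode(tools, handle_tool_errors=format_tool_error)
```

A fixed message shape gives the model the same cue on every failure, which tends to make recovery prompts easier to test.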

Critical Behavior (Do Not Skip)

  1. Supersteps are transactional: one failing parallel branch fails the whole superstep state update.
  2. `RetryPolicy` retries failing branches, not successful siblings.
  3. `interrupt()` re-runs the node on resume: side effects before the interrupt must be idempotent, or moved after the interrupt or into a separate node.
  4. JS `Command` routing requires `ends` metadata on `addNode(...)`.
  5. Use explicit retry limits (`max_attempts`, plus state counters for recovery loops).
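
Point 3 is the easiest to get wrong, so here is a minimal sketch of an idempotent side effect in plain Python. The `sent` dict stands in for an external system, and `notify_reviewer`, `review_node`, and the `request_id` key are hypothetical; the actual `interrupt(...)` call is left as a comment so the example runs without a graph.

```python
sent: dict[str, str] = {}  # stands in for an external system keyed by idempotency key


def notify_reviewer(key: str, message: str) -> bool:
    """Send a notification at most once per key; return True if it was sent now."""
    if key in sent:
        return False  # already delivered on an earlier run of the node
    sent[key] = message
    return True


def review_node(state: dict) -> None:
    # The node body runs twice: once before the pause and again on resume.
    notify_reviewer(f"review:{state['request_id']}", "Approval needed")
    # ... interrupt(...) would be called here in a real node ...
```

Calling `review_node` twice with the same `request_id` leaves a single entry in `sent`, which is the property the node needs when it is replayed on resume.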

Local Assets In This Skill

Scripts

  • scripts/classify_error.py: classify exception category and recommended handling
  • scripts/wrap_with_retry.py: generate boilerplate node wrappers with retry/recovery/escalation options
Run from repo root:
bash
uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery

Examples

  • assets/examples/retry-example/: retry + recovery loop (Python and JS)
  • assets/examples/human-loop-example/: interrupt/resume approval flow (Python and JS)

Load References On Demand

  • references/error-types.md: error taxonomy and classification rules
  • references/retry-strategies.md: retry tuning, backoff, circuit-breaker-style patterns
  • references/llm-recovery.md: recovery-loop and ToolNode strategies
  • references/human-escalation.md: human approval, interrupts, and escalation patterns

Common Failure Modes

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| `interrupt()` fails at runtime | no checkpointer | compile with checkpointer |
| Resume starts new run | different `thread_id` | reuse same `thread_id` |
| JS `Command` route not taken | missing `ends` | add `ends` to `addNode` |
| Infinite loop | no termination counter/condition | add retry counter + terminal branch |
| Retry never triggers | exception excluded by retry filter | set explicit `retry_on` / `retryOn` |