error-recovery
Error Recovery
Value: Courage -- autonomous operation requires resilience. Recovering
from errors without human intervention keeps the workflow moving. Knowing
when to escalate prevents wasted effort on unrecoverable situations.
Purpose
Teaches agents to handle unexpected errors during autonomous operation
(API failures, build tool crashes, permission issues, resource exhaustion).
Provides classification, retry strategies, and escalation rules. Prevents
the failure modes of infinite retry loops, silent error swallowing, and
unnecessary human interruptions for recoverable issues.
Practices
Classify Before Acting
When an error occurs, classify it before attempting recovery:
| Category | Examples | Recovery |
|---|---|---|
| Transient | Network timeout, 503, rate limit, lock contention | Retry with backoff |
| Environmental | Missing dependency, wrong version, port conflict | Fix environment, then retry |
| Permission | File permission denied, auth token expired | Escalate to user |
| Logic | Assertion failure, type error, schema mismatch | Do NOT retry -- investigate |
| Resource | Out of memory, disk full, context exhaustion | Reduce scope or escalate |
Do not retry logic errors. If a test fails, an assertion fires, or a
type mismatch occurs, retrying will produce the same result. Switch to
the debugging protocol instead.
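The table above can be sketched as a small lookup. This is only an illustration: the keyword lists, the `Category` enum, and the `classify` function are hypothetical names, and real error text varies by tool and platform, so anything that does not match falls through to `UNKNOWN` rather than being guessed at.

```python
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient"
    ENVIRONMENTAL = "environmental"
    PERMISSION = "permission"
    LOGIC = "logic"
    RESOURCE = "resource"
    UNKNOWN = "unknown"


# Hypothetical keyword lists (assumption); checked in order, first match wins.
_KEYWORDS = [
    (Category.PERMISSION, ("permission denied", "token expired", "unauthorized")),
    (Category.RESOURCE, ("out of memory", "no space left", "disk full")),
    (Category.LOGIC, ("assertionerror", "assertion failed", "typeerror", "schema mismatch")),
    (Category.ENVIRONMENTAL, ("command not found", "address already in use", "no module named")),
    (Category.TRANSIENT, ("timed out", "timeout", "503", "rate limit", "too many requests")),
]

# Recovery column of the table; unclassified errors are escalated.
RECOVERY = {
    Category.TRANSIENT: "retry with backoff",
    Category.ENVIRONMENTAL: "fix environment, then retry",
    Category.PERMISSION: "escalate to user",
    Category.LOGIC: "investigate, do not retry",
    Category.RESOURCE: "reduce scope or escalate",
    Category.UNKNOWN: "escalate to user",
}


def classify(message: str) -> Category:
    """Map an error message to a category using simple substring checks."""
    text = message.lower()
    for category, needles in _KEYWORDS:
        if any(needle in text for needle in needles):
            return category
    return Category.UNKNOWN
```

Substring matching is deliberately crude here; the point is that classification happens first and that the recovery action is looked up from the category, never chosen ad hoc.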
Retry Strategy: Exponential Backoff
For transient errors, retry with exponential backoff:
- First retry: Wait 2 seconds
- Second retry: Wait 5 seconds
- Third retry: Wait 15 seconds
- After third failure: Stop retrying and escalate
Never retry more than 3 times for the same error. Never retry without
waiting. Never use a fixed retry loop without backoff.
Rate limit handling: If the error includes a `Retry-After` header or
equivalent, respect it. Do not retry before the indicated time.
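A minimal sketch of the retry schedule, assuming the error has already been classified as transient. `TransientError`, `RetriesExhausted`, and `retry_transient` are hypothetical names, and `retry_after` is assumed to be a delay in seconds that the caller has already parsed from a `Retry-After` header or equivalent.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Waits before the first, second, and third retry, as listed above.
BACKOFF_SECONDS = (2, 5, 15)


class TransientError(Exception):
    """Hypothetical marker for errors already classified as transient."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, if the service indicated a delay


class RetriesExhausted(Exception):
    """Raised after the third retry fails; the caller should escalate."""


def retry_transient(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying at most three times with increasing waits."""
    try:
        return operation()
    except TransientError as err:
        last = err
    for wait in BACKOFF_SECONDS:
        # Never retry earlier than the service asked us to.
        sleep(max(wait, last.retry_after or 0))
        try:
            return operation()
        except TransientError as err:
            last = err
    raise RetriesExhausted(f"3 retries failed: {last}") from last
```

The `sleep` parameter exists only so the waits can be observed or replaced in tests. Any exception other than `TransientError` propagates immediately, which matches the rule that only transient errors are retried.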
Error Logging
When an error occurs, log it in a structured format before attempting
recovery:
```markdown
Error Log: [timestamp]
- Category: transient | environmental | permission | logic | resource
- Error: [exact error message]
- Context: [what was happening when the error occurred]
- Action taken: [retry | escalate | investigate | fix-environment]
- Outcome: [resolved | escalated | investigating]
```
In pipeline mode, append to
`.factory/audit-trail/slices/<slice-id>/error-log.md`.
In standalone mode, write to the project's scratch directory or memory.
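One way to produce entries in the format above is sketched here. `format_entry` and `append_entry` are hypothetical helpers, and the ISO 8601 UTC timestamp is an assumption, since the template does not specify a timestamp format.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

CATEGORIES = {"transient", "environmental", "permission", "logic", "resource"}
ACTIONS = {"retry", "escalate", "investigate", "fix-environment"}
OUTCOMES = {"resolved", "escalated", "investigating"}


def format_entry(
    category: str,
    error: str,
    context: str,
    action: str,
    outcome: str,
    timestamp: Optional[str] = None,
) -> str:
    """Render one log entry, rejecting values outside the template's options."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    # Assumption: ISO 8601 in UTC when no timestamp is supplied.
    stamp = timestamp or datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        f"Error Log: {stamp}\n"
        f"- Category: {category}\n"
        f"- Error: {error}\n"
        f"- Context: {context}\n"
        f"- Action taken: {action}\n"
        f"- Outcome: {outcome}\n"
    )


def append_entry(log_path: Path, entry: str) -> None:
    """Append an entry to the log file, creating parent directories if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Appending rather than overwriting keeps earlier entries intact, so the log shows the full sequence of errors and actions for a slice.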
Environmental Recovery
For environmental errors (missing tools, wrong versions, port conflicts):
- Identify the specific environmental issue
- Attempt a targeted fix (install missing dependency, kill conflicting process, clear stale lock file)
- Verify the fix resolved the issue
- Retry the original operation ONCE
- If it fails again, escalate -- the environment may need manual intervention
Port conflicts: Check for processes using the port with
`lsof -i :<port>` or equivalent. If the process is not related to the
current project, report it to the user rather than killing it.
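The fix, verify, retry-once sequence for environmental errors could look roughly like this. `recover_environment` and its three callables are hypothetical; the caller supplies the targeted fix (for example, installing a dependency) and a check that confirms the fix took effect.

```python
from typing import Any, Callable, Tuple


def recover_environment(
    operation: Callable[[], Any],
    fix: Callable[[], None],
    verify: Callable[[], bool],
) -> Tuple[str, Any]:
    """Apply a targeted fix, verify it, then retry the operation exactly once.

    Returns ("resolved", result) on success or ("escalate", reason) otherwise.
    """
    try:
        fix()
    except Exception as err:  # a failed fix is itself a reason to escalate
        return ("escalate", f"fix failed: {err}")
    if not verify():
        return ("escalate", "fix did not resolve the issue")
    try:
        return ("resolved", operation())  # the single permitted retry
    except Exception as err:
        return ("escalate", f"retry failed after fix: {err}")
```

There is no loop on purpose: a second failure after a verified fix suggests the environment needs manual intervention, so the function hands back a reason instead of trying again.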
Context Exhaustion Recovery
When approaching context limits during long operations:
- Write current state to WORKING_STATE.md immediately
- Complete the current atomic operation if possible
- Signal that continuation is needed
- Do NOT start new operations that cannot complete in remaining context
This prevents the failure mode of starting work that cannot be finished,
leaving the project in an inconsistent state.
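As a rough sketch of the first and last steps, assuming the agent can estimate how much context an operation will consume: `can_start`, `write_working_state`, the `reserve` margin, and the layout of the state file are all hypothetical, since the skill names only the `WORKING_STATE.md` file.

```python
from pathlib import Path
from typing import Sequence


def can_start(estimated_cost: int, remaining: int, reserve: int = 2000) -> bool:
    """True only if the operation fits while leaving a margin to save state."""
    return estimated_cost + reserve <= remaining


def write_working_state(
    directory: Path,
    completed: Sequence[str],
    in_progress: str,
    next_steps: Sequence[str],
) -> Path:
    """Write a snapshot of progress to WORKING_STATE.md (layout is an assumption)."""
    lines = ["Working State", "", "Completed:"]
    lines += [f"- {item}" for item in completed]
    lines += ["", f"In progress: {in_progress}", "", "Next steps:"]
    lines += [f"- {item}" for item in next_steps]
    path = directory / "WORKING_STATE.md"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Checking `can_start` before each new operation is what prevents half-finished work; the state file is what lets a continuation pick up from a known point.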
Escalation Rules
Escalate to the user in any of these cases:
- Permission errors (you cannot fix what you cannot access)
- Logic errors after investigation (the bug needs human insight)
- 3 retries exhausted for a transient error (the service may be down)
- Environmental fix failed (the environment may need manual repair)
- Resource exhaustion (context limit, disk space, memory)
- Any error you cannot classify (unknown errors are dangerous)
How to escalate: Provide the error category, the exact error message,
what you tried, and what you recommend. Do not just say "an error
occurred."
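The four parts of an escalation can be made hard to omit by building the message from a small record. This is a sketch only; the `Escalation` class and its output layout are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Escalation:
    category: str          # error category, or "unknown" if unclassifiable
    error: str             # the exact error message
    attempts: List[str]    # what was tried before escalating
    recommendation: str    # what the agent suggests the user do

    def render(self) -> str:
        """Build the message, refusing to escalate without the required details."""
        if not self.error.strip():
            raise ValueError("escalation requires the exact error message")
        if not self.recommendation.strip():
            raise ValueError("escalation requires a recommendation")
        tried = "\n".join(f"- {attempt}" for attempt in self.attempts) or "- nothing yet"
        return (
            f"Category: {self.category}\n"
            f"Error: {self.error}\n"
            f"Tried:\n{tried}\n"
            f"Recommendation: {self.recommendation}\n"
        )
```

A message built this way cannot degrade into "an error occurred", because an empty error or recommendation is rejected before anything is shown to the user.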
Pipeline Integration
In factory pipeline mode, error recovery integrates with the rework
protocol:
- Transient errors during CI: auto-retry once (standard/full autonomy)
- Build tool crashes: classify and apply the appropriate strategy
- Gate failures: these are NOT errors -- they are expected feedback from quality gates. Do not apply error recovery to gate failures.
Enforcement Note
This skill provides advisory guidance. It instructs the agent on error
classification and recovery strategies but cannot mechanically enforce
retry limits or prevent silent error swallowing. The agent follows these
practices by convention. If you observe the agent retrying endlessly or
ignoring errors, point it out.
Verification
After recovering from an error, verify:
- Error was classified before any recovery attempt
- Logic errors were NOT retried (investigated instead)
- Retries used exponential backoff (not immediate)
- No more than 3 retries for the same error
- Error was logged with category, message, context, and outcome
- Escalation included the error category, message, attempts, and recommendation
- Environmental fixes were verified before retrying the operation
- State was saved before context exhaustion recovery
If any criterion is not met, revisit the relevant practice.
Dependencies
This skill works standalone. For enhanced workflows, it integrates with:
- debugging-protocol: Logic errors escalate to the debugging protocol for systematic investigation
- pipeline: Pipeline gate failures are handled by the rework protocol, not error recovery
- session-reflection: Recurring errors become system prompt refinements
- ci-integration: CI failures classify as transient (infra) or logic (test failure) for appropriate handling
Missing a dependency? Install with:
npx skills add jwilger/agent-skills --skill debugging-protocol