loop-nanny

Loop Nanny

Monitor, triage, and clean up after agent loops. The agent-loop skill starts loops. This skill keeps them healthy.

Monitor a Running Loop

```bash
joelclaw loop status <LOOP_ID>
```

`joelclaw` is the primary CLI.

Poll every 2–3 minutes. A story typically takes 3–8 minutes (test-write → implement → review → judge). If a story shows no progress for 10+ minutes, something is stuck.
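The 10-minute rule of thumb can be turned into a small check. This is a minimal sketch, not part of `joelclaw`: `is_stalled` is a hypothetical helper, and how you obtain the last-progress timestamp (for example, by noting when the status output last changed) is an assumption left to the caller.

```shell
# Hypothetical helper (not part of joelclaw): decide whether a story looks stuck.
# Usage: is_stalled <last_progress_epoch> <now_epoch> [threshold_seconds]
# Returns 0 (true) when at least threshold_seconds have passed without progress.
is_stalled() {
  last_progress="$1"
  now="$2"
  threshold="${3:-600}"   # default 600s, the 10-minute rule of thumb
  [ $(( now - last_progress )) -ge "$threshold" ]
}
```

One way to use it: inside a loop that sleeps about 150 seconds between `joelclaw loop status <LOOP_ID>` calls, run `is_stalled "$last_change" "$(date +%s)" && echo "investigate"`.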

Reading Status Output

| Status | Meaning | Action |
| --- | --- | --- |
| `✅ PASS` | Story completed | None |
| `⏭ SKIP` | Exhausted retries | Check if code landed anyway (common; see Skip Triage) |
| `▶ STORY.DISPATCHED` | Implementor running | Wait |
| `▶ TESTS.WRITTEN` | Reviewer running | Wait |
| `▶ CHECKS.COMPLETED` | Judge evaluating | Wait |
| `▶ STORY.RETRIED` | Failed, retrying | Check attempt output for patterns |
| `▶ RETRO.COMPLETED` | Loop finishing (merge-back) | Wait for merge, then verify |
| `⏳ pending` | Not yet started | Normal (queued) |
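The status-to-action mapping above could be encoded as a lookup, which may help when scripting around the status output. This is only a sketch: `status_action` and its short action labels are hypothetical names, and it assumes the status token appears verbatim in the text you pass in.

```shell
# Hypothetical helper: map a status token from the table to a short action label.
# Prints one of: none, triage, wait, inspect, verify, unknown.
status_action() {
  case "$1" in
    *PASS*)             echo "none" ;;      # story completed
    *SKIP*)             echo "triage" ;;    # check whether the code landed anyway
    *STORY.DISPATCHED*|*TESTS.WRITTEN*|*CHECKS.COMPLETED*)
                        echo "wait" ;;      # an agent is still working
    *STORY.RETRIED*)    echo "inspect" ;;   # read the attempt output for patterns
    *RETRO.COMPLETED*)  echo "verify" ;;    # wait for the merge, then verify
    *pending*)          echo "wait" ;;      # queued, nothing to do yet
    *)                  echo "unknown" ;;
  esac
}
```

For example, `status_action "▶ STORY.RETRIED"` prints `inspect`.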

Skip Triage

Stories skip when the judge fails them after max retries. But the implementation often landed anyway — later stories may have included the work, or the first attempt was correct but tests were over-specified.

Check if skipped work actually landed

```bash
# 1. Read the attempt output
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# 2. Check if the files exist on main after merge
ls <project>/path/to/expected/file

# 3. Check git log for the story's commits
cd <project> && git log --oneline --grep="<STORY_ID>" -5
```

Common skip causes

| Pattern | Cause | Prevention |
| --- | --- | --- |
| "Already exists" | Story duplicates work from prior story | Write more atomic stories; planner didn't know current state |
| Test assertion failure | Agent-generated tests over-specify implementation | Acceptance criteria should test behavior, not implementation |
| TypeScript compile error | Story depends on types from a skipped story | Order stories so type-providing stories run first |
| Timeout | Implementor took too long | Smaller story scope |
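A rough first pass over an attempt output can suggest which of these causes applies. The sketch below is an illustration only: `guess_skip_cause` is a hypothetical helper, and apart from the `error TS<number>` prefix that `tsc` prints, the grep patterns are assumptions about what the output contains rather than a documented format.

```shell
# Hypothetical helper: guess the likely skip cause from an attempt output file.
# Patterns are assumptions; adjust them to what your attempt outputs really contain.
guess_skip_cause() {
  out_file="$1"
  if grep -qi "already exists" "$out_file"; then
    echo "duplicate-work"
  elif grep -qE "error TS[0-9]+" "$out_file"; then
    echo "typescript-compile-error"
  elif grep -qiE "timed out|timeout" "$out_file"; then
    echo "timeout"
  elif grep -qiE "assert|expect\(" "$out_file"; then
    echo "test-assertion-failure"
  else
    echo "unknown"
  fi
}
```

Treat the result as a hint for where to look, not a verdict; reading the tail of the attempt output is still the reliable check.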

Post-Loop Cleanup

After merge-back completes (status shows `RETRO.COMPLETED`, then resolves):

1. Verify merge landed

```bash
cd <project> && git log --oneline -5
# Should show: "Merge branch 'agent-loop/<LOOP_ID>'"
```
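If you would rather not eyeball the log, the merge check can be scripted. A minimal sketch, assuming the merge commit subject contains the text quoted above; `merge_landed` is a hypothetical helper name.

```shell
# Hypothetical helper: read `git log --oneline` output on stdin and succeed
# when it contains the merge commit for the given loop id.
merge_landed() {
  loop_id="$1"
  grep -qF "Merge branch 'agent-loop/${loop_id}'"
}
```

Typical use would be `git log --oneline -5 | merge_landed <LOOP_ID> && echo "merge landed"`.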

2. Run full test suite

```bash
cd <project> && bun test 2>&1 | tail -5
```

3. Delete stale acceptance tests

Agent loops generate `__tests__/<story-id>-*.test.ts` files. These test implementation details and break on every refactor. Delete them after verifying the real tests pass.

```bash
# Find agent-generated test files
ls <project>/__tests__/*-*.test.ts 2>/dev/null

# Check which fail
bun test __tests__/ 2>&1 | grep "(fail)"

# Delete ALL agent-generated acceptance tests (they served their purpose)
rm <project>/__tests__/<prefix>-*.test.ts

# Verify clean
bun test 2>&1 | tail -3
```

Per AGENTS.md: *"over-specified tests that mock internal step names are worse than no tests — they break on every refactor and give false negatives."*
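Because `rm` with a glob is unforgiving, it can help to list the matches first. The following is a sketch of a dry run, assuming the tests sit directly in `<project>/__tests__`; `list_story_tests` is a hypothetical helper.

```shell
# Hypothetical helper: list agent-generated acceptance tests for one story prefix
# without deleting anything, so the list can be reviewed before running rm.
list_story_tests() {
  project="$1"
  prefix="$2"
  find "$project/__tests__" -maxdepth 1 -type f -name "${prefix}-*.test.ts" 2>/dev/null | sort
}
```

Review the printed paths, and only then run the `rm` command with the same prefix.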

4. TypeScript check

```bash
cd <project> && bunx tsc --noEmit
```

5. Restart worker deployment (if loop touched Inngest functions)

```bash
kubectl -n joelclaw rollout restart deployment/system-bus-worker
kubectl -n joelclaw rollout status deployment/system-bus-worker --timeout=180s
joelclaw refresh
joelclaw status
```

6. Commit cleanup

```bash
cd <project> && git add -A && git commit -m "chore: post-loop cleanup for <LOOP_ID>

Delete stale acceptance tests, verify N pass / 0 fail / tsc clean"
```

When to Intervene

Don't intervene — let the loop work

- Story on first attempt (even if slow)
- Story retrying with feedback (judge gave actionable feedback)
- Story just skipped but later stories are still pending

Intervene — something is broken

- **Same error on retry**: Check attempt output. If retry 2 has an identical error to retry 1, the feedback loop isn't helping. Cancel and fix the root cause.
- **Merge conflict during complete**: `joelclaw logs errors` will show the merge abort. Resolve manually: run `cd <project> && git merge --abort`, then merge the branch by hand.
- **Worker crashed**: `joelclaw status` shows the worker down. Restart the deployment: `kubectl -n joelclaw rollout restart deployment/system-bus-worker`
- **All stories skipping**: The PRD likely has a bad assumption. Cancel, review prd.json, re-fire.
- **Loop stuck (no progress for 15min)**: Check `joelclaw runs --count 5`. If the latest run is COMPLETED but no new run was dispatched, there's a state bug. Cancel and re-fire from the stuck story.
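The "same error on retry" check can be approximated by comparing the error lines of two attempt outputs. This is a sketch under assumptions: `same_error` is a hypothetical helper, and "error lines" are approximated as lines matching `error`, `fail`, or `reject`, which may not match every failure in practice.

```shell
# Hypothetical helper: succeed when two attempt outputs report the same error lines.
# Lines are selected with a case-insensitive match on error|fail|reject (an assumption).
same_error() {
  first="$(grep -iE 'error|fail|reject' "$1" | sort -u || true)"
  second="$(grep -iE 'error|fail|reject' "$2" | sort -u || true)"
  # Require at least one error line, otherwise two clean outputs would "match".
  [ -n "$first" ] && [ "$first" = "$second" ]
}
```

For example, `same_error /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-1.out /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-2.out && echo "retry is not helping"`.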

Cancel a stuck loop

```bash
joelclaw loop cancel <LOOP_ID>

# Clean up worktree manually (cancel doesn't auto-merge)
cd <project>
git worktree remove /tmp/agent-loop/<LOOP_ID> --force 2>/dev/null
git branch -D agent-loop/<LOOP_ID> 2>/dev/null
git worktree prune
```

Reading Attempt Output

Each story attempt writes to `/tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out`. These contain the implementor/reviewer's full output (diffs, reasoning, test results).

```bash
# Quick scan: last 40 lines usually has the verdict
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# Check all attempts for a story
ls /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-*.out

# Grep for errors across all attempts (-E so that | means "or")
grep -iE "error|fail|reject" /tmp/agent-loop/<LOOP_ID>/*.out
```
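When a story has several attempts, a per-attempt count of suspicious lines shows at a glance whether retries are converging. A minimal sketch, assuming the file naming described above; `attempt_error_counts` is a hypothetical helper.

```shell
# Hypothetical helper: print "<file> <matching-line-count>" for every attempt of a story.
# Usage: attempt_error_counts <loop_dir> <story_id>
attempt_error_counts() {
  loop_dir="$1"
  story_id="$2"
  for f in "$loop_dir/$story_id"-*.out; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # grep -c exits non-zero on zero matches but still prints 0
    count="$(grep -ciE 'error|fail|reject' "$f" || true)"
    echo "$(basename "$f") $count"
  done
}
```

Run it as `attempt_error_counts /tmp/agent-loop/<LOOP_ID> <STORY_ID>`; a count that stays flat or grows across attempts suggests the feedback is not landing.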

Monitoring Checklist

Use this sequence when babysitting a loop:

1. joelclaw loop status <LOOP_ID>          — where are we?
2. (if story running) wait 3 min, re-check
3. (if story skipped) check attempt output — did the work land anyway?
4. (if loop completed) verify merge, run tests, delete stale tests, restart worker
5. (if stuck) joelclaw runs --count 5 + joelclaw logs errors — diagnose

Improve joelclaw While You Nanny

The nanny is the primary consumer of `joelclaw` output. When you hit a gap (missing info, unclear output, an extra step you had to do manually), fix joelclaw right then. The joelclaw source lives at `~/Code/joelhooks/joelclaw/`.

What to improve

- **Missing data in output**: `joelclaw loop status` doesn't show story descriptions? Add them.
- **Manual steps that should be commands**: Had to `kubectl logs` directly? Add it to `joelclaw logs`.
- **Bad next_actions**: The suggested next commands don't match what you actually needed? Fix the HATEOAS.
- **Error without fix**: Got an error with no `fix` field? Add one via `respondError()`.
- **New command needed**: Found yourself running raw curl/GraphQL? Wrap it in a joelclaw command.

How to improve

```bash
# Commands are in packages/cli/src/commands/, one file per command
# Response helpers in packages/cli/src/response.ts
# Inngest client methods in packages/cli/src/inngest.ts

# Test your change
joelclaw <command> 2>&1 | python3 -m json.tool

# Commit
git add -A && git commit -m "feat: joelclaw <command> - <what you added>"
```

Follow the [cli-design skill](../cli-design/SKILL.md): JSON always, HATEOAS next_actions, context-safe output, errors with fixes. After improving joelclaw, update this skill doc if the monitoring workflow changed.
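A quick way to check the "HATEOAS next_actions" guideline after a change is to assert that the output parses as JSON and carries suggestions. This is a sketch only: `has_next_actions` is a hypothetical helper, and it assumes `next_actions` is a top-level array in the response, which the cli-design skill may define differently.

```shell
# Hypothetical check: succeed when the JSON on stdin is an object with a
# non-empty next_actions array. Exit 2 means the input was not valid JSON.
has_next_actions() {
  python3 -c '
import json, sys
try:
    doc = json.load(sys.stdin)
except ValueError:
    sys.exit(2)  # not valid JSON at all
actions = doc.get("next_actions") if isinstance(doc, dict) else None
sys.exit(0 if isinstance(actions, list) and actions else 1)
'
}
```

For example, `joelclaw <command> 2>&1 | has_next_actions || echo "no next_actions in output"`.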

Diagnosing Stalled Loops

When a loop stops progressing, use the diagnostic command FIRST:

```bash
# Quick diagnosis of all loops
joelclaw loop diagnose all -c

# Diagnose + auto-fix (clears claims, re-fires plan events)
joelclaw loop diagnose all -c --fix

# Verify fix worked (~30s later)
joelclaw loop status <loop-id> -c
```

Common diagnoses:
- **CHAIN_BROKEN**: judge→plan event was lost. `--fix` re-fires the plan event.
- **ORPHANED_CLAIM**: agent died but claim remains. `--fix` clears claim + re-fires.
- **STUCK_RUN**: Inngest run active but agent dead. `--fix` clears + re-fires.
- **WORKER_UNHEALTHY**: fewer functions than expected. `--fix` restarts worker.

See [loop-diagnosis skill](../loop-diagnosis/SKILL.md) for full reference.

Gotchas

- **Don't edit the monorepo while a loop runs**: `git add -A` in the worktree scoops unrelated changes
- **CLI-focused edits are safest during active loops**: avoid changing `packages/system-bus` mid-run unless you're intentionally redeploying the worker
- **prd.json and progress.txt get dirty**: this is normal; complete.ts stashes before merge
- **Worktree branch not auto-deleted on cancel**: clean manually (see Cancel section)
- **Cross-file test pollution**: agent-generated `__tests__/` files can cause failures in real tests when run together; delete them
