loop-nanny
Loop Nanny
Monitor, triage, and clean up after agent loops. The agent-loop skill starts loops. This skill keeps them healthy.
Monitor a Running Loop
```bash
joelclaw loop status <LOOP_ID>
```
Poll every 2–3 minutes. A story typically takes 3–8 minutes (test-write → implement → review → judge). If a story shows no progress for 10+ minutes, something is stuck.
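The polling guidance above could be automated along these lines. This is a minimal sketch, not part of joelclaw: `is_stalled` and `watch_loop` are hypothetical helper names, and it assumes the status output stays byte-identical while nothing progresses (if the output embeds timestamps, compare a filtered version instead).

```bash
#!/usr/bin/env bash
# Sketch: poll a loop and warn when the status output stops changing.
POLL_SECONDS=150    # inside the 2-3 minute polling window
STALL_SECONDS=600   # 10 minutes without a change counts as stuck

# is_stalled LAST_CHANGE_EPOCH NOW_EPOCH
# Succeeds when the gap between the two timestamps reaches STALL_SECONDS.
is_stalled() {
  local last_change=$1 now=$2
  [ $(( now - last_change )) -ge "$STALL_SECONDS" ]
}

# watch_loop LOOP_ID (hypothetical helper)
# Compares successive status outputs and warns on stderr once they stop changing.
watch_loop() {
  local loop_id=$1 prev="" curr last_change now
  last_change=$(date +%s)
  while true; do
    curr=$(joelclaw loop status "$loop_id" 2>&1)
    now=$(date +%s)
    if [ "$curr" != "$prev" ]; then
      prev=$curr
      last_change=$now
    elif is_stalled "$last_change" "$now"; then
      echo "loop $loop_id: no status change for $(( (now - last_change) / 60 )) min" >&2
    fi
    sleep "$POLL_SECONDS"
  done
}
```

Run `watch_loop <LOOP_ID>` in a spare terminal and stop it with Ctrl-C once the loop finishes.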
Reading Status Output
| Meaning | Action |
|---|---|
| Story completed | None |
| Exhausted retries | Check if code landed anyway (common; see Skip Triage) |
| Implementor running | Wait |
| Reviewer running | Wait |
| Judge evaluating | Wait |
| Failed, retrying | Check attempt output for patterns |
| Loop finishing (merge-back) | Wait for merge, then verify |
| Not yet started | Normal (queued) |
Skip Triage
Stories skip when the judge fails them after max retries. But the implementation often landed anyway — later stories may have included the work, or the first attempt was correct but tests were over-specified.
Check if skipped work actually landed
```bash
# 1. Read the attempt output
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# 2. Check if the files exist on main after merge
ls <project>/path/to/expected/file

# 3. Check git log for the story's commits
cd <project> && git log --oneline --grep="<STORY_ID>" -5
```
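When a story expects several files, the existence and commit checks can be wrapped in small helpers. The sketch below is illustrative only: `files_landed` and `story_commits` are hypothetical names, and it assumes story commits mention the story ID in their message, as the `--grep` check above implies.

```bash
#!/usr/bin/env bash
# files_landed PROJECT_DIR FILE...
# Prints each expected file that is missing; returns non-zero if any is missing.
files_landed() {
  local project=$1 missing=0 f
  shift
  for f in "$@"; do
    if [ ! -e "$project/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# story_commits PROJECT_DIR STORY_ID
# Prints the number of commits whose message mentions the story ID.
story_commits() {
  git -C "$1" log --oneline --grep="$2" | wc -l | tr -d ' '
}
```

If `files_landed` reports nothing missing and `story_commits` is non-zero, the skipped story's work most likely landed and no re-run is needed.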
Common skip causes
| Pattern | Cause | Prevention |
|---|---|---|
| "Already exists" | Story duplicates work from prior story | Write more atomic stories; planner didn't know current state |
| Test assertion failure | Agent-generated tests over-specify implementation | Acceptance criteria should test behavior, not implementation |
| TypeScript compile error | Story depends on types from a skipped story | Order stories so type-providing stories run first |
| Timeout | Implementor took too long | Smaller story scope |
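The table above can be turned into a rough first-pass classifier for an attempt's `.out` file. This is a sketch under assumptions: `classify_skip` is a hypothetical helper, and the matched strings (`error TS<number>` as printed by `tsc`, `timed out`, `expect(`) are guesses at what the attempt output contains, so adjust them to what your loops actually log.

```bash
#!/usr/bin/env bash
# classify_skip OUT_FILE
# Prints a best-guess skip cause based on patterns found in the attempt output.
classify_skip() {
  local out_file=$1
  if grep -qi "already exists" "$out_file"; then
    echo "duplicate-work"               # story repeats work from a prior story
  elif grep -qE "error TS[0-9]+" "$out_file"; then
    echo "typescript-compile-error"     # likely depends on types from a skipped story
  elif grep -qiE "timed out|timeout" "$out_file"; then
    echo "timeout"                      # story scope probably too large
  elif grep -qiE "expect\(|assertion" "$out_file"; then
    echo "test-assertion-failure"       # tests may over-specify the implementation
  else
    echo "unknown"
  fi
}
```

For example, `classify_skip /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out` gives a starting point before you read the full output.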
Post-Loop Cleanup
After merge-back completes (status shows `RETRO.COMPLETED` then resolves):
1. Verify merge landed
```bash
cd <project> && git log --oneline -5
# Should show: "Merge branch 'agent-loop/<LOOP_ID>'"
```
2. Run full test suite
```bash
cd <project> && bun test 2>&1 | tail -5
```
3. Delete stale acceptance tests
Agent loops generate `__tests__/<story-id>-*.test.ts` files. These test implementation details and break on every refactor. Delete them after verifying the real tests pass.
```bash
# Find agent-generated test files
ls <project>/__tests__/*-*.test.ts 2>/dev/null

# Check which fail
bun test __tests__/ 2>&1 | grep "(fail)"

# Delete ALL agent-generated acceptance tests (they served their purpose)
rm <project>/__tests__/<prefix>-*.test.ts

# Verify clean
bun test 2>&1 | tail -3
```
Per AGENTS.md: *"over-specified tests that mock internal step names are worse than no tests — they break on every refactor and give false negatives."*
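Before running `rm`, it can help to list exactly what the glob will match. A minimal sketch, assuming the agent-generated files follow the `__tests__/<prefix>-*.test.ts` naming shown above; `stale_tests` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# stale_tests PROJECT_DIR PREFIX
# Lists agent-generated acceptance tests for one story prefix, one path per line.
# Prints nothing when no file matches.
stale_tests() {
  local project=$1 prefix=$2 f
  for f in "$project"/__tests__/"$prefix"-*.test.ts; do
    # An unmatched glob stays literal, so confirm the path really exists.
    if [ -e "$f" ]; then
      echo "$f"
    fi
  done
  return 0
}
```

Review the printed list first, then delete those files. Hand-written tests without the story prefix are never matched.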
4. TypeScript check
```bash
cd <project> && bunx tsc --noEmit
```
5. Restart worker deployment (if loop touched Inngest functions)
```bash
kubectl -n joelclaw rollout restart deployment/system-bus-worker
kubectl -n joelclaw rollout status deployment/system-bus-worker --timeout=180s
joelclaw refresh
joelclaw status
```
6. Commit cleanup
```bash
cd <project> && git add -A && git commit -m "chore: post-loop cleanup for <LOOP_ID>

Delete stale acceptance tests, verify N pass / 0 fail / tsc clean"
```
When to Intervene
Don't intervene — let the loop work
- Story on first attempt (even if slow)
- Story retrying with feedback (judge gave actionable feedback)
- Story just skipped but later stories are still pending
Intervene — something is broken
- Same error on retry: Check attempt output. If retry 2 has an identical error to retry 1, the feedback loop isn't helping. Cancel and fix the root cause.
- Merge conflict during complete: `joelclaw logs errors` will show merge abort. Manually resolve with `cd <project> && git merge --abort`, then merge the branch by hand.
- Worker crashed: `joelclaw status` shows worker down. Restart deployment: `kubectl -n joelclaw rollout restart deployment/system-bus-worker`
- All stories skipping: The PRD likely has a bad assumption. Cancel, review prd.json, re-fire.
- Loop stuck (no progress for 15min): Check `joelclaw runs --count 5`. If the latest run is COMPLETED but no new run was dispatched, there's a state bug. Cancel and re-fire from the stuck story.
Cancel a stuck loop
```bash
joelclaw loop cancel <LOOP_ID>

# Clean up worktree manually (cancel doesn't auto-merge)
cd <project>
git worktree remove /tmp/agent-loop/<LOOP_ID> --force 2>/dev/null
git branch -D agent-loop/<LOOP_ID> 2>/dev/null
git worktree prune
```
Reading Attempt Output
Each story attempt writes to `/tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out`. These files contain the implementor's or reviewer's full output (diffs, reasoning, test results).
```bash
# Quick scan: the last 40 lines usually have the verdict
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# Check all attempts for a story
ls /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-*.out

# Grep for errors across all attempts (-E enables the | alternation)
grep -iE "error|fail|reject" /tmp/agent-loop/<LOOP_ID>/*.out
```
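The "same error on retry" signal can be checked mechanically by comparing the error lines of two attempts. A rough sketch: `error_lines` and `same_error` are hypothetical helpers, and the comparison assumes error lines carry no timestamps or other per-attempt noise.

```bash
#!/usr/bin/env bash
# error_lines FILE
# Prints lines that mention an error, failure, or rejection (case-insensitive).
error_lines() {
  grep -iE "error|fail|reject" "$1" || true
}

# same_error FILE_A FILE_B
# Succeeds when both attempts report the same non-empty set of error lines.
same_error() {
  local a b
  a=$(error_lines "$1" | sort -u)
  b=$(error_lines "$2" | sort -u)
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

If `same_error` succeeds for attempts 1 and 2 of a story, the judge's feedback is probably not changing the outcome, which matches the case for cancelling and fixing the root cause.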
Monitoring Checklist
Use this sequence when babysitting a loop:
1. `joelclaw loop status <LOOP_ID>`: where are we?
2. (if story running) wait 3 min, re-check
3. (if story skipped) check attempt output: did the work land anyway?
4. (if loop completed) verify merge, run tests, delete stale tests, restart worker
5. (if stuck) run `joelclaw runs --count 5` and `joelclaw logs errors` to diagnose
Improve joelclaw While You Nanny
The nanny is the primary consumer of `joelclaw` output. When you hit a gap (missing info, unclear output, an extra step you had to do manually), fix joelclaw right then. The joelclaw source lives at `~/Code/joelhooks/joelclaw/`.
What to improve
- Missing data in output: `joelclaw loop status` doesn't show story descriptions? Add them.
- Manual steps that should be commands: Had to run `kubectl logs` directly? Add it to `joelclaw logs`.
- Bad next_actions: The suggested next commands don't match what you actually needed? Fix the HATEOAS.
- Error without fix: Got an error with no `fix` field? Add one via `respondError()`.
- New command needed: Found yourself running raw curl/GraphQL? Wrap it in a joelclaw command.
How to improve
```bash
# Commands are in packages/cli/src/commands/ (one file per command)
# Response helpers in packages/cli/src/response.ts
# Inngest client methods in packages/cli/src/inngest.ts

# Test your change
joelclaw <command> 2>&1 | python3 -m json.tool

# Commit
git add -A && git commit -m "feat: joelclaw <command> - <what you added>"
```
Follow the [cli-design skill](../cli-design/SKILL.md): JSON always, HATEOAS next_actions, context-safe output, errors with fixes. After improving joelclaw, update this skill doc if the monitoring workflow changed.
Diagnosing Stalled Loops
When a loop stops progressing, use the diagnostic command FIRST:
```bash
# Quick diagnosis of all loops
joelclaw loop diagnose all -c

# Diagnose + auto-fix (clears claims, re-fires plan events)
joelclaw loop diagnose all -c --fix

# Verify fix worked (~30s later)
joelclaw loop status <loop-id> -c
```
Common diagnoses:
- **CHAIN_BROKEN**: judge→plan event was lost. `--fix` re-fires the plan event.
- **ORPHANED_CLAIM**: agent died but claim remains. `--fix` clears claim + re-fires.
- **STUCK_RUN**: Inngest run active but agent dead. `--fix` clears + re-fires.
- **WORKER_UNHEALTHY**: fewer functions than expected. `--fix` restarts worker.
See [loop-diagnosis skill](../loop-diagnosis/SKILL.md) for full reference.
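For quick reference in a terminal, the diagnosis codes above can be mapped to what `--fix` does for each. This is only a lookup sketch built from the list above; `fix_action` is a hypothetical helper and does not call joelclaw.

```bash
#!/usr/bin/env bash
# fix_action DIAGNOSIS
# Prints what `--fix` is described as doing for a diagnosis code.
# Returns non-zero for a code this sketch does not know.
fix_action() {
  case "$1" in
    CHAIN_BROKEN)     echo "re-fire the plan event" ;;
    ORPHANED_CLAIM)   echo "clear the claim and re-fire" ;;
    STUCK_RUN)        echo "clear the run and re-fire" ;;
    WORKER_UNHEALTHY) echo "restart the worker" ;;
    *)                echo "unknown diagnosis: $1" >&2; return 1 ;;
  esac
}
```

An unknown code is a cue to consult the loop-diagnosis skill rather than reach for `--fix` blindly.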
Gotchas
- Don't edit the monorepo while a loop runs: `git add -A` in the worktree scoops up unrelated changes
- CLI-focused edits are safest during active loops: avoid changing `packages/system-bus` mid-run unless you're intentionally redeploying the worker
- prd.json and progress.txt get dirty: this is normal; complete.ts stashes before merge
- Worktree branch not auto-deleted on cancel: clean manually (see Cancel section)
- Cross-file test pollution: agent-generated `__tests__/` files can cause failures in real tests when run together; delete them