loop-nanny

Loop Nanny

Monitor, triage, and clean up after agent loops. The agent-loop skill starts loops. This skill keeps them healthy.

Monitor a Running Loop

bash
joelclaw loop status <LOOP_ID>

`joelclaw` is the primary CLI. Poll every 2–3 minutes. A story typically takes 3–8 minutes (test-write → implement → review → judge). If a story shows no progress for 10+ minutes, something is stuck.
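The 10-minute rule can be sketched as a small helper. This is a minimal sketch, not part of joelclaw: `is_stalled` is a hypothetical name, and it assumes you record the epoch time at which the status output last changed.

```shell
# Hypothetical helper: decide whether a story looks stuck.
# Args: <last_progress_epoch> <now_epoch> [threshold_seconds]
# Prints "stuck" when no progress was seen for threshold seconds or more,
# otherwise "ok". The 600s default is the 10-minute rule from the text.
is_stalled() {
  last_progress=$1
  now=$2
  threshold=${3:-600}
  if [ $((now - last_progress)) -ge "$threshold" ]; then
    echo "stuck"
  else
    echo "ok"
  fi
}
```

In a polling loop this might be called as `is_stalled "$last_change" "$(date +%s)"` after each status check, with `last_change` updated whenever the status output differs from the previous poll.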

Reading Status Output

| Status | Meaning | Action |
|---|---|---|
| ✅ PASS | Story completed | None |
| ⏭ SKIP | Exhausted retries | Check if code landed anyway (common; see Skip Triage) |
| ▶ STORY.DISPATCHED | Implementor running | Wait |
| ▶ TESTS.WRITTEN | Reviewer running | Wait |
| ▶ CHECKS.COMPLETED | Judge evaluating | Wait |
| ▶ STORY.RETRIED | Failed, retrying | Check attempt output for patterns |
| ▶ RETRO.COMPLETED | Loop finishing (merge-back) | Wait for merge, then verify |
| ⏳ pending | Not yet started | Normal; queued |
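If you script your polling, the status-to-action mapping can be encoded directly. The function below is a hypothetical sketch: it takes a single status label as text and makes no assumption about the exact layout of `joelclaw loop status` output, which you would need to parse yourself.

```shell
# Hypothetical helper: map one status label to the suggested action.
# Patterns match on the label text only, so a leading icon is optional.
action_for_status() {
  case "$1" in
    *PASS*)             echo "none" ;;
    *SKIP*)             echo "check whether the code landed anyway" ;;
    *STORY.DISPATCHED*) echo "wait" ;;
    *TESTS.WRITTEN*)    echo "wait" ;;
    *CHECKS.COMPLETED*) echo "wait" ;;
    *STORY.RETRIED*)    echo "check attempt output for patterns" ;;
    *RETRO.COMPLETED*)  echo "wait for merge, then verify" ;;
    *pending*)          echo "none (queued)" ;;
    *)                  echo "unknown status" ;;
  esac
}
```

For example, `action_for_status "STORY.RETRIED"` prints `check attempt output for patterns`.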

Skip Triage

Stories skip when the judge fails them after max retries. But the implementation often landed anyway — later stories may have included the work, or the first attempt was correct but tests were over-specified.

Check if skipped work actually landed

bash
# 1. Read the attempt output
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# 2. Check if the files exist on main after merge
ls <project>/path/to/expected/file

# 3. Check git log for the story's commits
cd <project> && git log --oneline --grep="<STORY_ID>" -5

Common skip causes

| Pattern | Cause | Prevention |
|---|---|---|
| "Already exists" | Story duplicates work from prior story | Write more atomic stories; planner didn't know current state |
| Test assertion failure | Agent-generated tests over-specify implementation | Acceptance criteria should test behavior, not implementation |
| TypeScript compile error | Story depends on types from a skipped story | Order stories so type-providing stories run first |
| Timeout | Implementor took too long | Smaller story scope |
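A first-pass triage of a skipped story could grep the attempt output for these patterns. This is a rough sketch only: `classify_skip_cause` is a hypothetical name, and the search strings are assumptions about what attempt logs contain, so adjust them against real output.

```shell
# Hypothetical helper: guess the skip cause from an attempt output file.
# The search strings are assumptions, not a documented log format.
classify_skip_cause() {
  out_file=$1
  if grep -qi "already exists" "$out_file"; then
    echo "duplicate-work"
  elif grep -qE "error TS[0-9]+" "$out_file"; then
    echo "typescript-compile-error"
  elif grep -qiE "timed out|timeout" "$out_file"; then
    echo "timeout"
  elif grep -qiE "expect\(|assertion" "$out_file"; then
    echo "test-assertion-failure"
  else
    echo "unknown"
  fi
}
```

The checks run in order, so a log that matches several patterns is reported under the first one. Treat the result as a hint for where to look, not a verdict.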

Post-Loop Cleanup

After merge-back completes (status shows `RETRO.COMPLETED`, then resolves):

1. Verify merge landed

bash
cd <project> && git log --oneline -5
# Should show: "Merge branch 'agent-loop/<LOOP_ID>'"

2. Run full test suite

bash
cd <project> && bun test 2>&1 | tail -5

3. Delete stale acceptance tests

Agent loops generate `__tests__/<story-id>-*.test.ts` files. These test implementation details and break on every refactor. Delete them after verifying the real tests pass.

bash
# Find agent-generated test files
ls <project>/__tests__/*-*.test.ts 2>/dev/null

# Check which fail
bun test __tests__/ 2>&1 | grep "(fail)"

# Delete ALL agent-generated acceptance tests (they served their purpose)
rm <project>/__tests__/<prefix>-*.test.ts

# Verify clean
bun test 2>&1 | tail -3

Per AGENTS.md: *"over-specified tests that mock internal step names are worse than no tests — they break on every refactor and give false negatives."*
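Before running the `rm` step, it can help to preview exactly which files a prefix matches. A minimal sketch, assuming the `__tests__/<prefix>-*.test.ts` naming described in this section; `list_stale_tests` is a hypothetical helper, not a joelclaw command.

```shell
# Hypothetical helper: list agent-generated acceptance tests for one prefix
# without deleting anything. Prints one path per line, or nothing when
# there are no matches.
list_stale_tests() {
  project=$1
  prefix=$2
  for f in "$project"/__tests__/"$prefix"-*.test.ts; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    [ -e "$f" ] && echo "$f"
  done
  return 0
}
```

Review the printed list, then delete the files once you are sure no hand-written tests share the prefix.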

4. TypeScript check

bash
cd <project> && bunx tsc --noEmit

5. Restart worker deployment (if loop touched Inngest functions)

bash
kubectl -n joelclaw rollout restart deployment/system-bus-worker
kubectl -n joelclaw rollout status deployment/system-bus-worker --timeout=180s
joelclaw refresh
joelclaw status

6. Commit cleanup

bash
cd <project> && git add -A && git commit -m "chore: post-loop cleanup for <LOOP_ID>

Delete stale acceptance tests, verify N pass / 0 fail / tsc clean"

When to Intervene

Don't intervene — let the loop work

  • Story on first attempt (even if slow)
  • Story retrying with feedback (judge gave actionable feedback)
  • Story just skipped but later stories are still pending

Intervene — something is broken

  • Same error on retry: Check the attempt output. If retry 2 has an identical error to retry 1, the feedback loop isn't helping. Cancel and fix the root cause.
  • Merge conflict during complete: `joelclaw logs errors` will show the merge abort. Resolve manually: run `cd <project> && git merge --abort`, then merge the branch by hand.
  • Worker crashed: `joelclaw status` shows the worker down. Restart the deployment: `kubectl -n joelclaw rollout restart deployment/system-bus-worker`
  • All stories skipping: The PRD likely has a bad assumption. Cancel, review prd.json, re-fire.
  • Loop stuck (no progress for 15 min): Check `joelclaw runs --count 5`. If the latest run is COMPLETED but no new run was dispatched, there's a state bug. Cancel and re-fire from the stuck story.
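The "same error on retry" check can be approximated by comparing the error lines of two attempt outputs. This is a sketch under assumptions: `same_error_on_retry` is a hypothetical helper, and it treats any line containing "error" (case-insensitive) as an error line, which may be too broad or too narrow for real logs.

```shell
# Hypothetical helper: succeed when two attempt outputs report the same errors.
# Compares only lines that mention "error" (case-insensitive), so differences
# in reasoning text elsewhere in the files are ignored.
same_error_on_retry() {
  first=$(grep -i "error" "$1" | sort -u)
  second=$(grep -i "error" "$2" | sort -u)
  # Two empty results mean no errors were found, which is not a repeat.
  [ -n "$first" ] && [ "$first" = "$second" ]
}
```

Usage might look like `same_error_on_retry <STORY_ID>-1.out <STORY_ID>-2.out && echo "feedback loop is not helping"`, run from the loop's output directory.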

Cancel a stuck loop

bash
joelclaw loop cancel <LOOP_ID>

# Clean up worktree manually (cancel doesn't auto-merge)
cd <project>
git worktree remove /tmp/agent-loop/<LOOP_ID> --force 2>/dev/null
git branch -D agent-loop/<LOOP_ID> 2>/dev/null
git worktree prune

Reading Attempt Output

Each story attempt writes to `/tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out`. These files contain the implementor's or reviewer's full output (diffs, reasoning, test results).

bash
# Quick scan: the last 40 lines usually have the verdict
tail -40 /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-<ATTEMPT>.out

# Check all attempts for a story
ls /tmp/agent-loop/<LOOP_ID>/<STORY_ID>-*.out

# Grep for errors across all attempts (-E enables the | alternation)
grep -iE "error|fail|reject" /tmp/agent-loop/<LOOP_ID>/*.out
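When a loop has many attempts, a per-file count of suspicious lines shows where to look first. A minimal sketch: `summarize_attempt_errors` is a hypothetical helper that reuses the error, fail, and reject keywords from this section.

```shell
# Hypothetical helper: print "<count> <file>" for every attempt output in a
# loop directory, counting lines that match error, fail, or reject.
summarize_attempt_errors() {
  loop_dir=$1
  for f in "$loop_dir"/*.out; do
    [ -e "$f" ] || continue
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    count=$(grep -ciE "error|fail|reject" "$f" || true)
    echo "$count $f"
  done
}
```

Piping the result through `sort -rn` puts the noisiest attempts at the top.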

Monitoring Checklist

Use this sequence when babysitting a loop:
1. joelclaw loop status <LOOP_ID>          — where are we?
2. (if story running) wait 3 min, re-check
3. (if story skipped) check attempt output — did the work land anyway?
4. (if loop completed) verify merge, run tests, delete stale tests, restart worker
5. (if stuck) joelclaw runs --count 5 + joelclaw logs errors — diagnose

Improve joelclaw While You Nanny

The nanny is the primary consumer of `joelclaw` output. When you hit a gap (missing info, unclear output, an extra step you had to do manually), fix joelclaw right then. The joelclaw source lives at `~/Code/joelhooks/joelclaw/`.

What to improve

  • Missing data in output: `joelclaw loop status` doesn't show story descriptions? Add them.
  • Manual steps that should be commands: Had to run `kubectl logs` directly? Add it to `joelclaw logs`.
  • Bad next_actions: The suggested next commands don't match what you actually needed? Fix the HATEOAS.
  • Error without fix: Got an error with no `fix` field? Add one via `respondError()`.
  • New command needed: Found yourself running raw curl/GraphQL? Wrap it in a joelclaw command.

How to improve

bash
# Commands are in packages/cli/src/commands/ (one file per command)
# Response helpers in packages/cli/src/response.ts
# Inngest client methods in packages/cli/src/inngest.ts

# Test your change
joelclaw <command> 2>&1 | python3 -m json.tool

# Commit
git add -A && git commit -m "feat: joelclaw <command> - <what you added>"

Follow the [cli-design skill](../cli-design/SKILL.md): JSON always, HATEOAS next_actions, context-safe output, errors with fixes. After improving joelclaw, update this skill doc if the monitoring workflow changed.

Diagnosing Stalled Loops

When a loop stops progressing, use the diagnostic command FIRST:

bash
# Quick diagnosis of all loops
joelclaw loop diagnose all -c

# Diagnose + auto-fix (clears claims, re-fires plan events)
joelclaw loop diagnose all -c --fix

# Verify fix worked (~30s later)
joelclaw loop status <loop-id> -c

Common diagnoses:
- **CHAIN_BROKEN** — judge→plan event was lost. `--fix` re-fires the plan event.
- **ORPHANED_CLAIM** — agent died but claim remains. `--fix` clears claim + re-fires.
- **STUCK_RUN** — Inngest run active but agent dead. `--fix` clears + re-fires.
- **WORKER_UNHEALTHY** — fewer functions than expected. `--fix` restarts worker.

See [loop-diagnosis skill](../loop-diagnosis/SKILL.md) for full reference.
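The diagnose, fix, and verify steps can be chained into one function. This is a sketch under assumptions: `fix_and_verify` is a hypothetical wrapper, and the `JOELCLAW` and `WAIT_SECONDS` variables are overrides invented for this example so the sequence can be dry-run with a stub; they are not joelclaw features.

```shell
# Sketch of the diagnose, fix, verify sequence for a single loop.
# JOELCLAW and WAIT_SECONDS are hypothetical overrides for dry runs.
fix_and_verify() {
  loop_id=$1
  cli=${JOELCLAW:-joelclaw}
  "$cli" loop diagnose all -c --fix || return 1
  # Give the re-fired events time to land before checking status.
  sleep "${WAIT_SECONDS:-30}"
  "$cli" loop status "$loop_id" -c
}
```

Calling `fix_and_verify <loop-id>` runs the real CLI with the 30-second pause. Pointing `JOELCLAW` at a stub function lets you confirm the command order without touching a live loop.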

Gotchas

  • Don't edit the monorepo while a loop runs: `git add -A` in the worktree scoops up unrelated changes.
  • CLI-focused edits are safest during active loops. Avoid changing `packages/system-bus` mid-run unless you're intentionally redeploying the worker.
  • prd.json and progress.txt get dirty. This is normal; complete.ts stashes before merge.
  • The worktree branch is not auto-deleted on cancel. Clean it up manually (see "Cancel a stuck loop").
  • Cross-file test pollution: agent-generated `__tests__/` files can cause failures in real tests when run together; delete them.