# Agentic Process Monitor
Patterns for monitoring background processes from Claude Code — detecting success, failure, timeout, and hung processes, then returning results to the main context to drive the next action.
Companion skills: `devops-tools:pueue-job-orchestration` (remote/queued work) | `devops-tools:distributed-job-safety` (concurrency)

## Architecture: Sentinel + Heartbeat + Agent
```
Main Context                             Monitor Agent (subagent)
─────────────                            ──────────────────────
1. Start work (Bash run_in_background)
   └─ work wrapper writes sentinel files
2. Launch Agent (poll every 15s) ──────► poll loop:
3. Continue other work                     .status exists?   → return result
                                           .heartbeat stale? → kill, return error
                                           elapsed > max?    → kill, return timeout
4. Agent returns ◄────────────────────── detected outcome
5. Act on result (next step / retry / abort)
```

**Why this architecture:** The main context stays lean (no polling tokens burned). The subagent handles all the waiting. If the subagent itself fails, the main context can recover by checking sentinel files directly.
## Sentinel Protocol
The work process writes 4 files to a known directory (e.g., `/tmp/<project>-monitor/`):

| File | When Written | Purpose |
|---|---|---|
| `<step>.pid` | On start | PID for timeout kill |
| `<step>.heartbeat` | Every N seconds during work | mtime freshness = alive proof |
| `<step>.status` | On exit (`SUCCESS` or `FAILED`) | Completion sentinel |
| `<step>.result` | On success | Structured output (JSON) |
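The "mtime freshness = alive proof" row is the heart of the protocol, and it can be checked in a few lines of bash. A minimal sketch, assuming GNU `stat -c %Y` with a BSD/macOS `stat -f %m` fallback; the step name `demo` and temp directory are illustrative:

```shell
# Staleness check for a heartbeat file (step name "demo" is illustrative)
MONITOR_DIR=$(mktemp -d)
STEP=demo
STALE_THRESHOLD=60

touch "${MONITOR_DIR}/${STEP}.heartbeat"   # simulate a worker that just beat

hb="${MONITOR_DIR}/${STEP}.heartbeat"
# GNU stat reports mtime with -c %Y; BSD/macOS stat uses -f %m
mtime=$(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb")
age=$(( $(date +%s) - mtime ))

if [ "$age" -gt "$STALE_THRESHOLD" ]; then state=STALE; else state=ALIVE; fi
echo "$state"   # ALIVE — the file was touched moments ago
```

A hung worker's heartbeat loop dies with it, so `age` keeps growing and the check flips to `STALE` without any cooperation from the worker.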
## Work Wrapper Template
```bash
#!/usr/bin/env bash
set -uo pipefail
STEP="${1:?step name required}"
MONITOR_DIR="${2:-/tmp/monitor}"
mkdir -p "$MONITOR_DIR"
echo $$ > "${MONITOR_DIR}/${STEP}.pid"

# Heartbeat: touch file every 10s in background
(while true; do touch "${MONITOR_DIR}/${STEP}.heartbeat"; sleep 10; done) &
HB_PID=$!
trap "kill $HB_PID 2>/dev/null" EXIT

# === YOUR WORK HERE ===
if your_command --args 2>"${MONITOR_DIR}/${STEP}.log"; then
  echo "SUCCESS" > "${MONITOR_DIR}/${STEP}.status"
  echo '{"key": "value"}' > "${MONITOR_DIR}/${STEP}.result"
else
  echo "FAILED" > "${MONITOR_DIR}/${STEP}.status"
fi
```
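A quick smoke test of the wrapper's sentinel protocol. This is a sketch, not the template itself: `true` stands in for `your_command`, and the heartbeat loop is omitted because the stand-in work finishes instantly:

```shell
# Simulate one wrapper run whose work succeeds immediately
MONITOR_DIR=$(mktemp -d)
STEP=demo

echo $$ > "${MONITOR_DIR}/${STEP}.pid"
if true 2>"${MONITOR_DIR}/${STEP}.log"; then     # stand-in for your_command
  echo "SUCCESS" > "${MONITOR_DIR}/${STEP}.status"
  echo '{"ok": true}' > "${MONITOR_DIR}/${STEP}.result"
else
  echo "FAILED" > "${MONITOR_DIR}/${STEP}.status"
fi

ls "${MONITOR_DIR}"                    # the sentinel files the monitor will read
cat "${MONITOR_DIR}/${STEP}.status"    # SUCCESS
```

After the run, `demo.pid`, `demo.log`, `demo.status`, and `demo.result` all exist, which is exactly the state the monitor agent polls for.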
---

## Monitor Decision Tree
The polling agent checks every `POLL_INTERVAL` seconds (default: 15s):

```
every POLL_INTERVAL:
  .status exists?
    → read status + result, return to main context
  .heartbeat exists AND mtime stale (> STALE_THRESHOLD)?
    → process hung — kill PID, return "hung" error
  elapsed > MAX_TIMEOUT?
    → timeout — kill PID, return "timeout" error
  otherwise
    → sleep POLL_INTERVAL, continue
```

## Recommended Defaults
| Parameter | Default | Rationale |
|---|---|---|
| `POLL_INTERVAL` | 15s | Balances latency vs token cost |
| Heartbeat interval | 10s | Must be < STALE_THRESHOLD / 2 |
| `STALE_THRESHOLD` | 60s | 6x heartbeat interval = generous slack |
| `MAX_TIMEOUT` | 1800s (30 min) | Catch infrastructure failures |
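The decision tree, using these defaults, can be sketched as a monitor-side bash function. `monitor_step` is a hypothetical helper name; the demo pre-writes a status sentinel so the loop returns on its first pass:

```shell
# Sketch: the decision tree as a poll loop (monitor_step is a hypothetical name)
monitor_step() {
  local dir="$1" step="$2"
  local POLL_INTERVAL=15 STALE_THRESHOLD=60 MAX_TIMEOUT=1800
  local start now hb mtime
  start=$(date +%s)
  while true; do
    # 1. Completion sentinel present? Return its contents.
    if [ -f "${dir}/${step}.status" ]; then
      cat "${dir}/${step}.status"; return 0
    fi
    now=$(date +%s)
    # 2. Heartbeat present but stale? Kill the worker, report hung.
    hb="${dir}/${step}.heartbeat"
    if [ -f "$hb" ]; then
      mtime=$(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb")
      if [ $(( now - mtime )) -gt "$STALE_THRESHOLD" ]; then
        kill "$(cat "${dir}/${step}.pid" 2>/dev/null)" 2>/dev/null
        echo "HUNG"; return 1
      fi
    fi
    # 3. Over the hard ceiling? Kill the worker, report timeout.
    if [ $(( now - start )) -gt "$MAX_TIMEOUT" ]; then
      kill "$(cat "${dir}/${step}.pid" 2>/dev/null)" 2>/dev/null
      echo "TIMEOUT"; return 1
    fi
    sleep "$POLL_INTERVAL"
  done
}

# Demo: the worker already finished, so the first poll returns immediately
dir=$(mktemp -d)
echo "SUCCESS" > "${dir}/demo.status"
result=$(monitor_step "$dir" demo)
echo "$result"   # SUCCESS
```

Note the check order: completion beats staleness, so a process that writes `.status` and exits is reported as finished even though its heartbeat has gone stale.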
## Circuit Breaker
Prevent repeated failures from wasting compute. Three consecutive crashes or failures → stop and report. Reset counter on any success.
```
consecutive_failures = 0
MAX_CONSECUTIVE = 3

on failure:
  consecutive_failures += 1
  if consecutive_failures >= MAX_CONSECUTIVE:
    STOP — likely infrastructure, not the work itself

on success:
  consecutive_failures = 0
```
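A minimal bash sketch of this counter, with `run_step` as a stand-in that always fails so the breaker visibly trips:

```shell
# Circuit breaker: stop after MAX_CONSECUTIVE failures in a row
MAX_CONSECUTIVE=3
consecutive_failures=0

run_step() { false; }   # stand-in for the real work; always fails here

for attempt in 1 2 3 4 5; do
  if run_step; then
    consecutive_failures=0            # any success resets the counter
  else
    consecutive_failures=$(( consecutive_failures + 1 ))
    if [ "$consecutive_failures" -ge "$MAX_CONSECUTIVE" ]; then
      echo "STOP after $consecutive_failures consecutive failures"
      break
    fi
  fi
done
```

The loop stops on the third failure rather than burning attempts four and five, which is the whole point: three identical failures in a row almost always mean the environment is broken, not the work.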
## Agent Self-Healing
If the monitoring subagent fails (context overflow, timeout, crash), the main context recovers:
- Check the `.status` file directly — work may have finished while the agent was dead
- If `.status` exists → read result, continue normally
- If no `.status` but `.heartbeat` is fresh → spawn a replacement monitor agent
- If no `.status` and `.heartbeat` is stale → process hung, kill PID, log error

This guarantees the main context never gets permanently stuck.
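The recovery checklist above as a bash sketch (names are illustrative; the demo simulates a worker that is still alive while its monitor died, so the outcome is to respawn a monitor):

```shell
# Main-context recovery after a monitor agent failure
dir=$(mktemp -d); step=demo
touch "${dir}/${step}.heartbeat"       # simulate: worker alive, no status yet

hb="${dir}/${step}.heartbeat"
if [ -f "${dir}/${step}.status" ]; then
  action="DONE:$(cat "${dir}/${step}.status")"   # finished while monitor was dead
elif [ -f "$hb" ] && \
     [ $(( $(date +%s) - $(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb") )) -le 60 ]; then
  action="RESPAWN_MONITOR"                       # fresh heartbeat → new monitor agent
else
  action="KILL_HUNG"                             # stale/missing heartbeat → kill PID, log
fi
echo "$action"   # RESPAWN_MONITOR
```

Because the same sentinel files drive both the monitor agent and this fallback, the two paths can never disagree about the worker's state.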
## Anti-Patterns
| Anti-Pattern | Why It Fails | Use Instead |
|---|---|---|
| `tail -f \| grep -m1` to wait for a marker | Broken on macOS — pipe buffering prevents grep exit from killing tail, causing permanent hang | Poll the log file for markers with `grep` in a loop |
| Tracking via background task IDs only | Task IDs can expire or become unretrievable; stuck-as-running bug; no intermediate progress | Sentinel files + agent polling |
| Poll from main context | Each poll injects output into the context window, burning ~50K tokens per check | Delegate polling to a subagent |
| Hardcoded fixed `sleep` | Wastes time on fast completions, too short for slow ones | Poll interval + max timeout |
| PID-only liveness check | Cannot distinguish a hung process (PID alive, no progress) from a running one | Heartbeat file mtime — hung process stops touching the file |
| Reusing a venv after a repo rename or move | Console-script shebangs still point to the old path | Run `uv sync --python 3.13 --extra dev` after any rename or move |
## Environment Preflight
Run before entering any autonomous loop. If any check fails, fix before proceeding.

```bash
# 1. Python package importable?
uv run --python 3.13 python -c "import your_package" \
  || uv sync --python 3.13 --extra dev

# 2. Console scripts have valid shebangs? (catches post-rename breakage)
uv run --python 3.13 pytest --co -q tests/ 2>/dev/null \
  || uv sync --python 3.13 --extra dev

# 3. External service reachable?
curl -sf "http://localhost:PORT/?query=SELECT+1" \
  || echo "FAIL: start service or SSH tunnel"
```
## Common uv/venv Failures
| Symptom | Root Cause | Fix |
|---|---|---|
| `bad interpreter` errors when running console scripts | Stale venv — console script shebangs point to old repo path | `uv sync --python 3.13 --extra dev` |
| Same failure from `pytest` or other entry points | Same — venv created before the directory was renamed | `uv sync --python 3.13 --extra dev` |

**Rule:** After any repo rename, directory move, or Python version change → always `uv sync --python 3.13 --extra dev` before running anything.