agentic-process-monitor

Agentic Process Monitor


Patterns for monitoring background processes from Claude Code — detecting success, failure, timeout, and hung processes, then returning results to the main context to drive the next action.

Companion skills: devops-tools:pueue-job-orchestration (remote/queued work) | devops-tools:distributed-job-safety (concurrency)

Architecture: Sentinel + Heartbeat + Agent


```
Main Context                          Monitor Agent (subagent)
─────────────                         ──────────────────────
1. Start work (Bash run_in_background)
   └─ work wrapper writes sentinel files
2. Launch Agent (poll every 15s) ────► poll loop:
3. Continue other work                  .status exists? → return result
                                        .heartbeat stale? → kill, return error
                                        elapsed > max? → kill, return timeout
4. Agent returns ◄──────────────────── detected outcome
5. Act on result (next step / retry / abort)
```

Why this architecture: The main context stays lean (no tokens burned on polling). The subagent handles all the waiting. If the subagent itself fails, the main context can recover by checking the sentinel files directly.


Sentinel Protocol


The work process writes 4 files to a known directory (e.g., `/tmp/<project>-monitor/`):

| File | When Written | Purpose |
|---|---|---|
| `<step>.pid` | On start | PID for timeout kill |
| `<step>.heartbeat` | Every N seconds during work | mtime freshness = alive proof |
| `<step>.status` | On exit (`SUCCESS`/`FAILED`) | Completion sentinel |
| `<step>.result` | On success | Structured output (JSON) |
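The heartbeat file's mtime is the liveness proof. A minimal freshness check might look like the sketch below, assuming GNU `stat -c %Y` with a BSD `stat -f %m` fallback for macOS (`heartbeat_fresh` is a hypothetical helper, not part of the protocol):

```shell
# Return 0 if the heartbeat file was touched within the threshold (seconds).
heartbeat_fresh() {
  local hb="$1" threshold="${2:-60}" mtime
  [ -f "$hb" ] || return 1                       # missing file = no proof of life
  mtime=$(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb")
  [ $(( $(date +%s) - mtime )) -le "$threshold" ]
}
```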

Work Wrapper Template


```bash
#!/usr/bin/env bash
set -uo pipefail
STEP="${1:?step name required}"
MONITOR_DIR="${2:-/tmp/monitor}"
mkdir -p "$MONITOR_DIR"

echo $$ > "${MONITOR_DIR}/${STEP}.pid"

# Heartbeat: touch file every 10s in background
(while true; do touch "${MONITOR_DIR}/${STEP}.heartbeat"; sleep 10; done) &
HB_PID=$!
trap "kill $HB_PID 2>/dev/null" EXIT

# === YOUR WORK HERE ===
if your_command --args 2>"${MONITOR_DIR}/${STEP}.log"; then
  echo "SUCCESS" > "${MONITOR_DIR}/${STEP}.status"
  echo '{"key": "value"}' > "${MONITOR_DIR}/${STEP}.result"
else
  echo "FAILED" > "${MONITOR_DIR}/${STEP}.status"
fi
```

Monitor Decision Tree


The polling agent checks every `POLL_INTERVAL` seconds (default: 15s):

```
every POLL_INTERVAL:
  .status exists?
    → read status + result, return to main context
  .heartbeat exists AND mtime stale (> STALE_THRESHOLD)?
    → process hung — kill PID, return "hung" error
  elapsed > MAX_TIMEOUT?
    → timeout — kill PID, return "timeout" error
  otherwise
    → sleep POLL_INTERVAL, continue
```
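The decision tree can be sketched as a shell function along these lines (function and parameter names are illustrative; `stat` uses the GNU flag with a BSD fallback for macOS):

```shell
# monitor_step STEP DIR POLL STALE MAX — poll sentinels until an outcome is known.
monitor_step() {
  local step="$1" dir="$2" poll="${3:-15}" stale="${4:-60}" max="${5:-1800}"
  local start now age
  start=$(date +%s)
  while true; do
    # Completion sentinel wins: report SUCCESS/FAILED back to the main context.
    if [ -f "${dir}/${step}.status" ]; then
      cat "${dir}/${step}.status"
      return 0
    fi
    now=$(date +%s)
    # Stale heartbeat = hung process: kill it and report.
    if [ -f "${dir}/${step}.heartbeat" ]; then
      age=$(( now - $(stat -c %Y "${dir}/${step}.heartbeat" 2>/dev/null \
                      || stat -f %m "${dir}/${step}.heartbeat") ))
      if [ "$age" -gt "$stale" ]; then
        kill "$(cat "${dir}/${step}.pid" 2>/dev/null)" 2>/dev/null || true
        echo "HUNG"
        return 1
      fi
    fi
    # Hard ceiling catches infrastructure failures.
    if [ $(( now - start )) -gt "$max" ]; then
      kill "$(cat "${dir}/${step}.pid" 2>/dev/null)" 2>/dev/null || true
      echo "TIMEOUT"
      return 1
    fi
    sleep "$poll"
  done
}
```

The function prints the outcome on stdout, which is what a monitor subagent would return to the main context.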

Recommended Defaults


| Parameter | Default | Rationale |
|---|---|---|
| `POLL_INTERVAL` | 15s | Balances latency vs token cost |
| `HEARTBEAT_INTERVAL` | 10s | Must be < STALE_THRESHOLD / 2 |
| `STALE_THRESHOLD` | 60s | 6x heartbeat interval = generous slack |
| `MAX_TIMEOUT` | 1800s (30 min) | Catch infrastructure failures |

Circuit Breaker


Prevent repeated failures from wasting compute. Three consecutive crashes or failures → stop and report. Reset the counter on any success.

```
consecutive_failures = 0
MAX_CONSECUTIVE = 3

on failure:
  consecutive_failures += 1
  if consecutive_failures >= MAX_CONSECUTIVE:
    STOP — likely infrastructure, not the work itself

on success:
  consecutive_failures = 0
```
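Since the counter must survive across separate agent invocations, a file-backed sketch in shell fits this setup (the `FAILS_FILE` path and function names are illustrative):

```shell
# Circuit breaker: stop after MAX_CONSECUTIVE failures; any success resets.
FAILS_FILE="${FAILS_FILE:-/tmp/monitor-consecutive-failures}"
MAX_CONSECUTIVE=3

record_failure() {
  local n
  n=$(cat "$FAILS_FILE" 2>/dev/null || true)
  n=$(( ${n:-0} + 1 ))
  echo "$n" > "$FAILS_FILE"
  if [ "$n" -ge "$MAX_CONSECUTIVE" ]; then
    echo "circuit open after $n consecutive failures" >&2
    return 1   # caller should STOP: likely infrastructure, not the work itself
  fi
}

record_success() {
  echo 0 > "$FAILS_FILE"
}
```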

Agent Self-Healing


If the monitoring subagent fails (context overflow, timeout, crash), the main context recovers:

1. Check the `.status` file directly — work may have finished while the agent was dead
2. If `.status` exists → read result, continue normally
3. If no `.status` but `.heartbeat` is fresh → spawn a replacement monitor agent
4. If no `.status` and `.heartbeat` is stale → process hung, kill PID, log error

This guarantees the main context never gets permanently stuck.
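These recovery steps can be sketched as a shell function (names are illustrative; in this sketch a missing heartbeat is treated the same as a stale one):

```shell
# Recovery check run by the main context when the monitor agent has died.
recover() {
  local step="$1" dir="$2" stale="${3:-60}" hb age
  # Steps 1-2: work may have finished while the agent was dead.
  if [ -f "${dir}/${step}.status" ]; then
    echo "finished: $(cat "${dir}/${step}.status")"
    return 0
  fi
  # Step 3: fresh heartbeat means the work is alive; just respawn a monitor.
  hb="${dir}/${step}.heartbeat"
  if [ -f "$hb" ]; then
    age=$(( $(date +%s) - $(stat -c %Y "$hb" 2>/dev/null || stat -f %m "$hb") ))
    if [ "$age" -le "$stale" ]; then
      echo "alive: spawn replacement monitor"
      return 0
    fi
  fi
  # Step 4: stale or missing heartbeat means the process hung; kill and log.
  kill "$(cat "${dir}/${step}.pid" 2>/dev/null)" 2>/dev/null || true
  echo "hung: killed"
  return 1
}
```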

Anti-Patterns


| Anti-Pattern | Why It Fails | Use Instead |
|---|---|---|
| `tail -f \| grep -m1` | Broken on macOS — pipe buffering prevents grep exit from killing tail, causing a permanent hang | Poll the log file for markers with `grep` in a loop |
| `run_in_background` alone | Task IDs can expire or become unretrievable; stuck-as-running bug; no intermediate progress | Sentinel files + agent polling |
| Poll from main context | Each poll injects output into the context window, burning ~50K tokens per check | Delegate polling to a subagent |
| Hardcoded `sleep` timeout | Wastes time on fast completions, too short for slow ones | Poll interval + max timeout |
| PID-only liveness check | Cannot distinguish a hung process (PID alive, no progress) from a running one | Heartbeat file mtime — hung process stops touching the file |
| `uv run` masking stale venv | `uv run` has its own resolution that bypasses broken venv state; console scripts (pytest, ruff) have stale shebangs after repo rename | Run `uv sync --python 3.13 --extra dev` after any repo rename or move |
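As a replacement for `tail -f | grep -m1`, polling the log for a marker with `grep` in a loop can be sketched as follows (`wait_for_marker`, its 2s poll, and the timeout default are illustrative):

```shell
# Poll a log file for a marker instead of `tail -f | grep -m1`, which can hang on macOS.
wait_for_marker() {
  local log="$1" marker="$2" timeout="${3:-300}" start
  start=$(date +%s)
  until grep -q -- "$marker" "$log" 2>/dev/null; do
    if [ $(( $(date +%s) - start )) -gt "$timeout" ]; then
      return 1   # give up rather than hang forever
    fi
    sleep 2
  done
}
```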

Environment Preflight


Run before entering any autonomous loop. If any check fails, fix before proceeding.

```bash
# 1. Python package importable?
uv run --python 3.13 python -c "import your_package" \
  || uv sync --python 3.13 --extra dev

# 2. Console scripts have valid shebangs? (catches post-rename breakage)
uv run --python 3.13 pytest --co -q tests/ 2>/dev/null \
  || uv sync --python 3.13 --extra dev

# 3. External service reachable?
curl -sf "http://localhost:PORT/?query=SELECT+1" \
  || echo "FAIL: start service or SSH tunnel"
```

Common uv/venv Failures


| Symptom | Root Cause | Fix |
|---|---|---|
| `ModuleNotFoundError` but `uv run python -c "import ..."` works | Stale venv — console script shebangs point to the old repo path | `uv sync --python 3.13 --extra dev` |
| `bad interpreter: ...old-path/.venv/bin/python3` | Same — `.venv/bin/pytest` shebang hardcodes the pre-rename directory | `uv sync --python 3.13 --extra dev` |
| `pip show` says not found, `uv run` says installed | `uv run` resolution bypasses venv pip metadata | `uv sync` reconciles both |

Rule: After any repo rename, directory move, or Python version change → always run `uv sync --python 3.13 --extra dev` before running anything.