gateway-diagnose
Gateway Diagnosis
Structured diagnostic workflow for the joelclaw gateway daemon. Runs top-down from process health to message delivery, stopping at the first failure layer.
Default time range: 1 hour. Override by asking "check gateway logs for the last 4 hours" or similar.
CLI Commands (use these first)
```bash
# Automated health check: runs all layers, returns structured findings
joelclaw gateway diagnose [--hours 1] [--lines 100]
# Session context: what happened recently? Exchanges, tools, errors.
joelclaw gateway review [--hours 1] [--max 20]
```
Start with `diagnose` to find the failure layer. Use `review` to understand what the gateway was doing when it broke. Only drop to manual log reading (below) when the CLI output isn't enough.
Artifact Locations
| Artifact | Path | What's in it |
|---|---|---|
| Daemon stdout | `/tmp/joelclaw/gateway.log` | Startup info, event flow, responses, fallback messages |
| Daemon stderr | `/tmp/joelclaw/gateway.err` | Errors, stack traces, retries, fallback activations (check this first) |
| PID file | `/tmp/joelclaw/gateway.pid` | Current daemon process ID |
| Session ID | | Current pi session ID |
| Session transcripts | `~/.joelclaw/sessions/gateway/*.jsonl` | Full pi session history (most recent by mtime) |
| Gateway working dir | | Has the compaction config file |
| Launchd plist | | Service config, env vars, log paths |
| Start script | | Secret leasing, env setup, bun invocation |
| Tripwire | | Last heartbeat timestamp (updated every 15 min) |
| WS port | | WebSocket port for TUI attach (default 3018) |
Diagnostic Procedure
Run these steps in order. Stop and report at the first failure.
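The stop-at-first-failure rule can be sketched as a small runner. This is an illustration only: `run_layers` and the check functions passed to it are hypothetical names, and `joelclaw gateway diagnose` remains the supported way to run the layers.

```bash
#!/usr/bin/env bash
# Sketch only: runs check functions in order and stops at the first failure.
# run_layers and the check names passed to it are hypothetical.

run_layers() {
  local layer=0 check
  for check in "$@"; do
    if ! "$check"; then
      echo "FAIL at layer $layer ($check)"
      return 1
    fi
    echo "ok   layer $layer ($check)"
    layer=$((layer + 1))
  done
  echo "all layers passed"
}

# Usage, with checks you define yourself, for example:
#   check_process() { launchctl list | grep -q gateway; }
#   check_status()  { joelclaw gateway status >/dev/null; }
#   run_layers check_process check_status
```

Each check is any function or command that exits non-zero on failure, so the layer number in the output matches the position in the argument list.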
Layer 0: Process Health
```bash
# Is the daemon running?
launchctl list | grep gateway
ps aux | grep gateway | grep -v grep
# What's the PID and uptime?
cat /tmp/joelclaw/gateway.pid
# Compare PID to launchctl list output; a mismatch means a stale PID file
```
**Failure patterns:**
- PID mismatch between launchctl and PID file → daemon restarted, PID file stale
- Exit code non-zero in launchctl → crash loop, check gateway.err
- Process not running but launchctl shows it → zombie, `launchctl kickstart -k`
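The PID comparison can be scripted. The sketch below assumes `launchctl list` prints its usual three columns (PID, status, label) and that the gateway's launchd label contains the string `gateway`; `launchd_pid` and `compare_pids` are hypothetical helper names, not part of the joelclaw CLI.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of joelclaw.

# Print the PID column of the first `launchctl list` row whose label
# mentions "gateway". launchd prints "-" when the job is not running.
launchd_pid() {
  awk '$3 ~ /gateway/ { print $1; exit }'
}

# Compare the PID recorded in the PID file with the one launchd reports.
compare_pids() {
  local file_pid="$1" launchd="$2"
  if [ -z "$launchd" ] || [ "$launchd" = "-" ]; then
    echo "not-running"
  elif [ "$file_pid" = "$launchd" ]; then
    echo "ok"
  else
    echo "stale-pid-file"
  fi
}

# Usage on the host:
#   compare_pids "$(cat /tmp/joelclaw/gateway.pid)" "$(launchctl list | launchd_pid)"
```

A result of `stale-pid-file` corresponds to the first failure pattern; `not-running` means launchd has no live process for the job and the exit code column is worth reading next.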
Layer 1: CLI Status
```bash
joelclaw gateway status
```
Check:
- `redis: "connected"`: if not, Redis pod is down
- `activeSessions`: should have `gateway` with `alive: true`
- `pending: 0`: if >0, messages are backing up (session busy or stuck)
Layer 2: Error Log (the money log)
```bash
# Default: last 100 lines. Adjust for time range.
tail -100 /tmp/joelclaw/gateway.err
```
**Known error patterns:**
| Pattern | Meaning | Root Cause |
|---------|---------|-----------|
| `Agent is already processing` | Command queue tried to prompt while session streaming | Session busy — long turn, compaction, or initialization race |
| `fallback activated` | Model timeout or consecutive failures triggered model swap | Primary model API down or slow |
| `no streaming tokens after Ns` | Timeout — prompt dispatched but no response | Model API issue, auth failure, or session not ready |
| `session still streaming, retrying` | Drain loop retry (3 attempts, 2s each) | Turn taking longer than expected |
| `watchdog: session appears stuck` | No turn_end for 10+ minutes after prompt | Hung tool call or model hang |
| `watchdog: session appears dead` | 3+ consecutive prompt failures | Triggers self-restart via graceful shutdown |
| `OTEL emit request failed: TimeoutError` | Typesense unreachable | k8s port-forward or Typesense pod issue (secondary) |
| `prompt failed` with `consecutiveFailures: N` | Nth failure in a row | Check model API, session state |
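To see which of these patterns dominates a log window, the table can be turned into a small classifier. This is a sketch: the function names and category labels are hypothetical, and the match strings are the ones listed in the table.

```bash
#!/usr/bin/env bash
# Sketch only: maps a gateway.err line to a category from the table above.
# Function names and category labels are hypothetical.

classify_gateway_error() {
  case "$1" in
    *"Agent is already processing"*)       echo "session-busy" ;;
    *"fallback activated"*)                echo "model-fallback" ;;
    *"no streaming tokens after"*)         echo "model-timeout" ;;
    *"session still streaming, retrying"*) echo "drain-retry" ;;
    *"watchdog: session appears stuck"*)   echo "stuck-turn" ;;
    *"watchdog: session appears dead"*)    echo "self-restart" ;;
    *"OTEL emit request failed"*)          echo "telemetry-secondary" ;;
    *"prompt failed"*)                     echo "prompt-failure" ;;
    *)                                     echo "unknown" ;;
  esac
}

# Tally the categories for lines read from stdin, most frequent first.
summarize_errors() {
  local line
  while IFS= read -r line; do
    classify_gateway_error "$line"
  done | sort | uniq -c | sort -rn
}

# Usage: tail -100 /tmp/joelclaw/gateway.err | summarize_errors
```

The `telemetry-secondary` category is worth ignoring at first, since the table marks the OTEL timeout as a secondary issue.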
Layer 3: Stdout Log (event flow)
```bash
tail -100 /tmp/joelclaw/gateway.log
```
Look for:
- `[gateway] daemon started`: last startup time, model, session ID
- `[gateway:telegram] message received`: did the message arrive?
- `[gateway:store] persisted inbound message`: was it persisted?
- `[gateway:fallback] prompt dispatched`: was a prompt sent to the model?
- `[gateway] response ready`: did the model respond?
- `[gateway:fallback] activated`: is fallback model in use?
- `[redis] suppressed N noise event(s)`: which events are being filtered
- `[gateway:store] replayed unacked messages`: startup replay (can cause races)
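Reading these markers in order shows how far the most recent Telegram message travelled. The sketch below assumes each marker appears verbatim in a log line and that stages occur in the order listed; `last_message_stage` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: reports how far the most recent inbound Telegram message got.
# Assumes the markers appear verbatim; the function name is hypothetical.

last_message_stage() {
  awk '
    index($0, "[gateway:telegram] message received")                 { s = 1 }
    index($0, "[gateway:store] persisted inbound message") && s >= 1 && s < 2 { s = 2 }
    index($0, "[gateway:fallback] prompt dispatched")      && s >= 1 && s < 3 { s = 3 }
    index($0, "[gateway] response ready")                  && s >= 1 && s < 4 { s = 4 }
    END {
      split("none received persisted dispatched responded", name, " ")
      print name[s + 1]
    }
  '
}

# Usage: tail -100 /tmp/joelclaw/gateway.log | last_message_stage
```

A result of `dispatched` with no later `responded` points at the model or session layer, while `received` without `persisted` points at the message store.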
Layer 4: E2E Delivery Test
```bash
joelclaw gateway test
# Wait 5 seconds
sleep 5
joelclaw gateway events
```
**Expected:** Test event pushed and drained (`totalCount: 0` after drain).
**Failure:** Event stuck in queue → session not draining → check Layer 2 errors.
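Instead of a fixed wait, the drain can be polled. The sketch below is a generic helper that reruns any command printing a queue length until it prints `0`; `wait_for_drain` is a hypothetical name, and pairing it with the Redis `LLEN` command from Layer 8 assumes that key is the queue the test event lands in.

```bash
#!/usr/bin/env bash
# Sketch only: polls a "queue length" command until it prints 0.
# wait_for_drain is a hypothetical helper, not a joelclaw subcommand.

wait_for_drain() {
  local attempts="$1" interval="$2" count i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    count="$("$@")" || return 2          # the length command itself failed
    if [ "$count" = "0" ]; then
      echo "drained after $i check(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "still $count queued after $attempts check(s)"
  return 1
}

# Usage, reusing the queue-length command from Layer 8:
#   wait_for_drain 5 1 kubectl exec -n joelclaw redis-0 -- \
#     redis-cli LLEN joelclaw:notify:gateway
```

Exit status 1 matches the failure case above (event stuck in queue); exit status 2 means the length command could not run at all, which is a Redis or kubectl problem rather than a drain problem.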
Layer 5: Session Transcript
```bash
# Find most recent gateway session
ls -lt ~/.joelclaw/sessions/gateway/*.jsonl | head -1
# Read last N lines of the session JSONL
tail -50 ~/.joelclaw/sessions/gateway/<session-file>.jsonl
```
Each line is a JSON object. Look for:
- `"type": "turn_end"`: confirms turns are completing
- `"type": "error"`: model or tool errors
- Long gaps between `turn_start` and `turn_end`: slow turns
- Tool call entries: what was the session doing when it got stuck?
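A quick way to spot an unfinished turn is to count entries by type. The sketch below is a rough line-based count, assuming each entry carries a `"type"` field with the values named above (including `turn_start`); nested objects that also contain a `"type"` key may inflate the numbers. `count_type` and `turn_summary` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch only: counts turn_start / turn_end / error entries in a session JSONL.
# Assumes a "type" field per entry; helper names are hypothetical.

count_type() {
  # grep -c prints 0 but exits 1 when nothing matches, hence the "|| true".
  grep -cE "\"type\"[[:space:]]*:[[:space:]]*\"$1\"" "$2" || true
}

turn_summary() {
  local file="$1" starts ends errors
  starts="$(count_type turn_start "$file")"
  ends="$(count_type turn_end "$file")"
  errors="$(count_type error "$file")"
  echo "turn_start=$starts turn_end=$ends error=$errors"
  if [ "$starts" -gt "$ends" ]; then
    echo "open turn: a turn started but has not ended"
  fi
}

# Usage: turn_summary "$(ls -t ~/.joelclaw/sessions/gateway/*.jsonl | head -1)"
```

An open turn alongside a `watchdog: session appears stuck` entry in gateway.err suggests looking at the last tool call entry in the transcript.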
Layer 6: OTEL Telemetry
```bash
# Gateway-specific events
joelclaw otel search "gateway" --hours 1
# Fallback events
joelclaw otel search "fallback" --hours 1
# Queue events
joelclaw otel search "command-queue" --hours 1
```
Layer 7: Model API Health
```bash
# Quick API reachability test (auth error = API reachable)
curl -s -m 10 https://api.anthropic.com/v1/messages \
  -H "x-api-key: test" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{}' | jq .error.type
# Expected: "authentication_error" (means API is reachable)
```
Layer 8: Redis State
```bash
# Check gateway queue directly
kubectl exec -n joelclaw redis-0 -- redis-cli LLEN joelclaw:notify:gateway
# Check message store
kubectl exec -n joelclaw redis-0 -- redis-cli XLEN gateway:messages
# Check unacked messages (these replay on restart)
kubectl exec -n joelclaw redis-0 -- redis-cli XRANGE gateway:messages - + COUNT 5
```
Known Failure Scenarios
1. Session Initialization Race (ADR-0103 era)
Symptoms: "already processing" errors immediately after restart, unacked message replay fails.
Cause: Drain loop processes replayed messages before the pi session finishes initializing.
Fix: A restart clears it. If it persists, check whether `replayUnacked()` runs before the session is ready.
2. Model API Timeout
Symptoms: "no streaming tokens after 90s", fallback activated.
Cause: Primary model (claude-opus-4-6) API slow or down.
Fix: Fallback auto-activates. Recovery probe runs every 10 min. If persistent, check Anthropic status.
3. Stuck Tool Call
Symptoms: Watchdog fires after 10 min, session stuck.
Cause: A tool call (bash, read, etc.) hanging indefinitely.
Fix: Watchdog auto-aborts. If the session stays stuck, run `joelclaw gateway restart`.
4. Redis Disconnection
Symptoms: Status shows redis disconnected, no events flowing.
Cause: Redis pod restart or port-forward dropped.
Fix: Run `kubectl get pods -n joelclaw` to verify the pod is up; ioredis auto-reconnects.
5. Compaction During Message Delivery
Symptoms: "already processing" after a successful turn_end.
Cause: Auto-compaction triggers after turn_end, session enters streaming state again before drain loop processes next message.
Fix: The idle waiter should block until compaction finishes. If not, this is a pi SDK gap.
Fallback Controller State
The gateway has a model fallback controller (ADR-0091) that swaps models when the primary fails:
- Threshold: 90s timeout for first token, or 3 consecutive prompt failures (configurable)
- Fallback model: gpt-5.3-codex-spark (via openai-codex provider)
- Recovery: Probes primary model every 10 minutes
- OTEL events: `model_fallback.swapped`, `model_fallback.primary_restored`, `model_fallback.probe_failed`
Check fallback state in gateway.log: `[gateway:fallback] activated` / `recovered`.
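Whether the gateway is currently on the fallback model can be read from the last fallback marker in the log. The sketch below assumes the markers appear verbatim as `[gateway:fallback] activated` and `[gateway:fallback] recovered` (the exact wording of the recovery line is an assumption); `fallback_state` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: reports the most recent fallback marker in gateway.log output.
# The "recovered" marker text is assumed; fallback_state is a hypothetical name.

fallback_state() {
  awk '
    index($0, "[gateway:fallback] activated") { state = "fallback-active" }
    index($0, "[gateway:fallback] recovered") { state = "primary" }
    END { print (state == "" ? "no-markers" : state) }
  '
}

# Usage: tail -500 /tmp/joelclaw/gateway.log | fallback_state
```

A result of `no-markers` only means neither line appeared in the window read, so widen the `tail` range before concluding the primary model is in use.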
Architecture Reference
```
Telegram → channels/telegram.ts → enqueueToGateway()
Redis    → channels/redis.ts    → enqueueToGateway()
                                       ↓
                               command-queue.ts
                                (serial FIFO)
                                       ↓
                              session.prompt(text)
                                       ↓
                            pi SDK (isStreaming gate)
                                       ↓
                           Model API (claude-opus-4-6)
                                       ↓
                          turn_end → idleWaiter resolves
                                       ↓
                        Response routed to origin channel
```
The command queue processes ONE prompt at a time. `idleWaiter` blocks until `turn_end` fires. If a prompt is in flight, new messages queue behind it.
Key Code
| File | What to look for |
|---|---|
| | Session creation, event handler, idle waiter, watchdog |
| | |
| | Timeout tracking, fallback swap, recovery probes |
| | Event batching, prompt building, sleep mode |
| | Bot polling, message routing |
| | Tripwire writer only (ADR-0103: no prompt injection) |
Related Skills
- gateway — operational commands (restart, push, drain)
- joelclaw-system-check — full system health (broader scope)
- k8s — if Redis/Inngest pods are the problem