gateway-diagnose

Gateway Diagnosis

Structured diagnostic workflow for the joelclaw gateway daemon. Runs top-down from process health to message delivery, stopping at the first failure layer.
Default time range: 1 hour. Override by asking "check gateway logs for the last 4 hours" or similar.

CLI Commands (use these first)

bash
# Automated health check: runs all layers, returns structured findings
joelclaw gateway diagnose [--hours 1] [--lines 100]
# Session context: what happened recently? Exchanges, tools, errors.
joelclaw gateway review [--hours 1] [--max 20]

Start with `diagnose` to find the failure layer. Use `review` to understand what the gateway was doing when it broke. Only drop to manual log reading (below) when the CLI output isn't enough.

Artifact Locations

| Artifact | Path | What's in it |
|----------|------|--------------|
| Daemon stdout | `/tmp/joelclaw/gateway.log` | Startup info, event flow, responses, fallback messages |
| Daemon stderr | `/tmp/joelclaw/gateway.err` | Errors, stack traces, retries, fallback activations (check this first) |
| PID file | `/tmp/joelclaw/gateway.pid` | Current daemon process ID |
| Session ID | `~/.joelclaw/gateway.session` | Current pi session ID |
| Session transcripts | `~/.joelclaw/sessions/gateway/*.jsonl` | Full pi session history (most recent by mtime) |
| Gateway working dir | `~/.joelclaw/gateway/` | Has `.pi/settings.json` for compaction config |
| Launchd plist | `~/Library/LaunchAgents/com.joel.gateway.plist` | Service config, env vars, log paths |
| Start script | `~/.joelclaw/scripts/gateway-start.sh` | Secret leasing, env setup, bun invocation |
| Tripwire | `/tmp/joelclaw/last-heartbeat.ts` | Last heartbeat timestamp (updated every 15 min) |
| WS port | `/tmp/joelclaw/gateway.ws.port` | WebSocket port for TUI attach (default 3018) |
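
The tripwire entry lends itself to a quick staleness check: if the heartbeat file has not been modified for noticeably longer than 15 minutes, the daemon is probably not ticking. A minimal sketch, assuming the file's modification time tracks the last heartbeat; the function name and the 20-minute default are illustrative and not part of joelclaw:

```bash
# heartbeat_status FILE [MAX_AGE_MIN]
# Prints "fresh", "stale", or "missing" based on the file's mtime.
# Assumption: the daemon rewrites FILE on every heartbeat.
heartbeat_status() {
  file=$1
  max_age_min=${2:-20}   # 15-minute interval plus slack (illustrative default)
  if [ ! -e "$file" ]; then
    echo missing
    return 1
  fi
  # find prints the path only if it was modified less than max_age_min minutes ago
  if [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]; then
    echo fresh
  else
    echo stale
    return 1
  fi
}
```

Usage: `heartbeat_status /tmp/joelclaw/last-heartbeat.ts`. A `stale` result is only a hint to continue with the diagnostic layers, not a diagnosis on its own.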

Diagnostic Procedure

Run these steps in order. Stop and report at the first failure.

Layer 0: Process Health

bash
# Is the daemon running?
launchctl list | grep gateway
ps aux | grep gateway | grep -v grep
# What's the PID and uptime?
cat /tmp/joelclaw/gateway.pid
ps -o etime= -p "$(cat /tmp/joelclaw/gateway.pid)"
# Compare PID to launchctl list output: mismatch = stale PID file

**Failure patterns:**
- PID mismatch between launchctl and PID file → daemon restarted, PID file stale
- Exit code non-zero in launchctl → crash loop, check gateway.err
- Process not running but launchctl shows it → zombie, `launchctl kickstart -k`
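
The stale-PID-file case can be checked without reading `launchctl` output at all, by asking whether the recorded PID still belongs to a live process. A small sketch; the function name is hypothetical, and it only tests process existence, so it complements the launchctl comparison rather than replacing it:

```bash
# pid_file_status FILE
# Prints "missing" (no readable PID file), "stale" (PID not running), or "alive".
pid_file_status() {
  file=$1
  if [ ! -r "$file" ]; then
    echo missing
    return 1
  fi
  pid=$(tr -d '[:space:]' < "$file")
  case $pid in
    ''|*[!0-9]*) echo stale; return 1 ;;   # empty or non-numeric content
  esac
  # kill -0 sends no signal; it only tests whether the process can be signalled.
  # Caveat: it also fails for a live process owned by another user.
  if kill -0 "$pid" 2>/dev/null; then
    echo alive
  else
    echo stale
    return 1
  fi
}
```

Usage: `pid_file_status /tmp/joelclaw/gateway.pid`.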

Layer 1: CLI Status

bash
joelclaw gateway status

Check:
  • `redis` should be `"connected"`. If not, the Redis pod is down.
  • `activeSessions` should have `gateway` with `alive: true`.
  • `pending` should be `0`. If >0, messages are backing up (session busy or stuck).

Layer 2: Error Log (the money log)

bash
# Default: last 100 lines. Adjust for time range.
tail -100 /tmp/joelclaw/gateway.err

**Known error patterns:**

| Pattern | Meaning | Root Cause |
|---------|---------|-----------|
| `Agent is already processing` | Command queue tried to prompt while session streaming | Session busy — long turn, compaction, or initialization race |
| `fallback activated` | Model timeout or consecutive failures triggered model swap | Primary model API down or slow |
| `no streaming tokens after Ns` | Timeout — prompt dispatched but no response | Model API issue, auth failure, or session not ready |
| `session still streaming, retrying` | Drain loop retry (3 attempts, 2s each) | Turn taking longer than expected |
| `watchdog: session appears stuck` | No turn_end for 10+ minutes after prompt | Hung tool call or model hang |
| `watchdog: session appears dead` | 3+ consecutive prompt failures | Triggers self-restart via graceful shutdown |
| `OTEL emit request failed: TimeoutError` | Typesense unreachable | k8s port-forward or Typesense pod issue (secondary) |
| `prompt failed` with `consecutiveFailures: N` | Nth failure in a row | Check model API, session state |
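
When the error log is noisy, tallying the known patterns gives a faster overview than reading it line by line. A rough sketch that counts literal matches for each pattern in the table; the function name is hypothetical, and it assumes the patterns appear verbatim in the log lines:

```bash
# count_known_errors: reads log text on stdin and prints "count<TAB>pattern"
# for every known pattern that matches at least one line.
count_known_errors() {
  log=$(cat)
  for pattern in \
    'Agent is already processing' \
    'fallback activated' \
    'no streaming tokens after' \
    'session still streaming, retrying' \
    'watchdog: session appears stuck' \
    'watchdog: session appears dead' \
    'OTEL emit request failed: TimeoutError' \
    'prompt failed'
  do
    # -F treats the pattern as a literal string; -c counts matching lines
    n=$(printf '%s\n' "$log" | grep -cF -- "$pattern" || true)
    if [ "$n" -gt 0 ]; then
      printf '%s\t%s\n' "$n" "$pattern"
    fi
  done
  return 0
}
```

Usage: `tail -100 /tmp/joelclaw/gateway.err | count_known_errors`. No output means none of the known patterns occurred in that window.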

Layer 3: Stdout Log (event flow)

bash
tail -100 /tmp/joelclaw/gateway.log

Look for:
  • `[gateway] daemon started`: last startup time, model, session ID
  • `[gateway:telegram] message received`: did the message arrive?
  • `[gateway:store] persisted inbound message`: was it persisted?
  • `[gateway:fallback] prompt dispatched`: was a prompt sent to the model?
  • `[gateway] response ready`: did the model respond?
  • `[gateway:fallback] activated`: is the fallback model in use?
  • `[redis] suppressed N noise event(s)`: which events are being filtered
  • `[gateway:store] replayed unacked messages`: startup replay (can cause races)
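
The first four markers form a pipeline for an inbound Telegram message, so the first one missing from the log points at the layer that dropped it. A minimal sketch, assuming the markers appear verbatim in the log; the function name is hypothetical, and it only checks presence within the window, without correlating lines to one specific message:

```bash
# first_missing_stage: reads gateway.log text on stdin and prints the first
# pipeline marker that never appears, or "complete" if all four are present.
first_missing_stage() {
  log=$(cat)
  for marker in \
    '[gateway:telegram] message received' \
    '[gateway:store] persisted inbound message' \
    '[gateway:fallback] prompt dispatched' \
    '[gateway] response ready'
  do
    # -F matches the marker literally, so the brackets are not a regex class
    if ! printf '%s\n' "$log" | grep -qF -- "$marker"; then
      echo "$marker"
      return 1
    fi
  done
  echo complete
}
```

Usage: `tail -100 /tmp/joelclaw/gateway.log | first_missing_stage`.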

Layer 4: E2E Delivery Test

bash
joelclaw gateway test
# Wait 5 seconds
sleep 5
joelclaw gateway events

**Expected:** Test event pushed and drained (`totalCount: 0` after drain).
**Failure:** Event stuck in queue → session not draining → check Layer 2 errors.
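
Instead of a fixed wait, the drain check can be polled. A sketch under a stated assumption: the output of the events command contains text such as `totalCount: 0` or `"totalCount": 0` once the queue is empty. The exact output format is not shown in this document, and the function name is hypothetical:

```bash
# wait_for_drain ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, one second apart, until its output
# reports a zero totalCount. Prints "drained" or "stuck".
wait_for_drain() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Match totalCount (quoted or not) followed by exactly 0
    if "$@" 2>/dev/null | grep -Eq '"?totalCount"?: *0([^0-9]|$)'; then
      echo drained
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  echo stuck
  return 1
}
```

Usage: `joelclaw gateway test && wait_for_drain 5 joelclaw gateway events`. A `stuck` result corresponds to the failure case described for this layer.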

Layer 5: Session Transcript

bash
# Find most recent gateway session
ls -lt ~/.joelclaw/sessions/gateway/*.jsonl | head -1
# Read last N lines of the session JSONL
tail -50 ~/.joelclaw/sessions/gateway/<session-file>.jsonl

Each line is a JSON object. Look for:
- `"type": "turn_end"` — confirms turns are completing
- `"type": "error"` — model or tool errors
- Long gaps between `turn_start` and `turn_end` — slow turns
- Tool call entries — what was the session doing when it got stuck?
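
A quick way to spot a turn that never finished is to compare the number of `turn_start` and `turn_end` entries. A minimal sketch, assuming each event is a single line with a top-level `"type"` field; the function name is hypothetical:

```bash
# open_turns: reads session JSONL on stdin and prints
# (number of turn_start lines) minus (number of turn_end lines).
open_turns() {
  jsonl=$(cat)
  # Allow optional spaces after the colon, since JSON serializers differ
  starts=$(printf '%s\n' "$jsonl" | grep -cE '"type": *"turn_start"' || true)
  ends=$(printf '%s\n' "$jsonl" | grep -cE '"type": *"turn_end"' || true)
  echo $((starts - ends))
}
```

Usage: `tail -50 "$(ls -t ~/.joelclaw/sessions/gateway/*.jsonl | head -1)" | open_turns`. A positive result suggests a turn is still open. A `tail` window can begin mid-turn, so a result of `-1` is not necessarily a problem.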

Layer 6: OTEL Telemetry

bash
# Gateway-specific events
joelclaw otel search "gateway" --hours 1
# Fallback events
joelclaw otel search "fallback" --hours 1
# Queue events
joelclaw otel search "command-queue" --hours 1

Layer 7: Model API Health

bash
# Quick API reachability test (auth error = API reachable)
curl -s -m 10 https://api.anthropic.com/v1/messages \
  -H "x-api-key: test" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{}' | jq .error.type
# Expected: "authentication_error" (means API is reachable)

Layer 8: Redis State

bash
# Check gateway queue directly
kubectl exec -n joelclaw redis-0 -- redis-cli LLEN joelclaw:notify:gateway
# Check message store
kubectl exec -n joelclaw redis-0 -- redis-cli XLEN gateway:messages
# Check unacked messages (these replay on restart)
kubectl exec -n joelclaw redis-0 -- redis-cli XRANGE gateway:messages - + COUNT 5

Known Failure Scenarios

1. Session Initialization Race (ADR-0103 era)

**Symptoms:** "already processing" errors immediately after restart, unacked message replay fails.
**Cause:** Drain loop processes replayed messages before the pi session finishes initializing.
**Fix:** Restart clears it. If persistent, check whether `replayUnacked()` runs before the session is ready.

2. Model API Timeout

**Symptoms:** "no streaming tokens after 90s", fallback activated.
**Cause:** Primary model (claude-opus-4-6) API slow or down.
**Fix:** Fallback auto-activates. Recovery probe runs every 10 min. If persistent, check Anthropic status.

3. Stuck Tool Call

**Symptoms:** Watchdog fires after 10 min, session stuck.
**Cause:** A tool call (bash, read, etc.) hanging indefinitely.
**Fix:** Watchdog auto-aborts. If the session stays stuck, run `joelclaw gateway restart`.

4. Redis Disconnection

**Symptoms:** Status shows redis disconnected, no events flowing.
**Cause:** Redis pod restart or port-forward dropped.
**Fix:** Run `kubectl get pods -n joelclaw` to verify the pod is up. ioredis auto-reconnects.

5. Compaction During Message Delivery

**Symptoms:** "already processing" after a successful `turn_end`.
**Cause:** Auto-compaction triggers after `turn_end`, so the session enters the streaming state again before the drain loop processes the next message.
**Fix:** The idle waiter should block until compaction finishes. If it does not, this is a pi SDK gap.

Fallback Controller State

The gateway has a model fallback controller (ADR-0091) that swaps models when the primary fails:
  • Threshold: 90s timeout for first token, or 3 consecutive prompt failures (configurable)
  • Fallback model: gpt-5.3-codex-spark (via openai-codex provider)
  • Recovery: Probes primary model every 10 minutes
  • OTEL events: `model_fallback.swapped`, `model_fallback.primary_restored`, `model_fallback.probe_failed`

Check fallback state in gateway.log: `[gateway:fallback] activated` / `recovered`.
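
The swap threshold is an either/or rule: a first-token timeout or a run of consecutive failures. The sketch below restates that rule in shell for illustration only. It is not the controller's code (which lives in `model-fallback.ts`), and the function and environment variable names are hypothetical:

```bash
# should_fallback SECONDS_WITHOUT_TOKEN CONSECUTIVE_FAILURES
# Prints "swap" when either documented threshold is reached, else "stay".
# Defaults mirror the documented values (90s, 3 failures).
should_fallback() {
  waited=$1
  failures=$2
  timeout_s=${FALLBACK_TIMEOUT_S:-90}        # hypothetical variable name
  max_failures=${FALLBACK_MAX_FAILURES:-3}   # hypothetical variable name
  if [ "$waited" -ge "$timeout_s" ] || [ "$failures" -ge "$max_failures" ]; then
    echo swap
  else
    echo stay
  fi
}
```

For example, `should_fallback 95 0` prints `swap` because the 90-second timeout has been reached, while `should_fallback 30 2` prints `stay`.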

Architecture Reference

Telegram → channels/telegram.ts → enqueueToGateway()
Redis    → channels/redis.ts    → enqueueToGateway()
                                        ↓
                                 command-queue.ts
                                   (serial FIFO)
                                        ↓
                              session.prompt(text)
                                        ↓
                              pi SDK (isStreaming gate)
                                        ↓
                              Model API (claude-opus-4-6)
                                        ↓
                              turn_end → idleWaiter resolves
                                        ↓
                              Response routed to origin channel

The command queue processes ONE prompt at a time. `idleWaiter` blocks until `turn_end` fires. If a prompt is in flight, new messages queue behind it.

Key Code

| File | What to look for |
|------|------------------|
| `packages/gateway/src/daemon.ts` | Session creation, event handler, idle waiter, watchdog |
| `packages/gateway/src/command-queue.ts` | `drain()` loop, retry logic, idle gate |
| `packages/gateway/src/model-fallback.ts` | Timeout tracking, fallback swap, recovery probes |
| `packages/gateway/src/channels/redis.ts` | Event batching, prompt building, sleep mode |
| `packages/gateway/src/channels/telegram.ts` | Bot polling, message routing |
| `packages/gateway/src/heartbeat.ts` | Tripwire writer only (ADR-0103: no prompt injection) |

Related Skills

  • gateway — operational commands (restart, push, drain)
  • joelclaw-system-check — full system health (broader scope)
  • k8s — if Redis/Inngest pods are the problem