gateway-diagnose

Gateway Diagnosis

Structured diagnostic workflow for the joelclaw gateway daemon. Runs top-down from process health to message delivery, stopping at the first failure layer.

Default time range: 1 hour. Override by asking "check gateway logs for the last 4 hours" or similar.

CLI Commands (use these first)

```bash
# Automated health check: runs all layers, returns structured findings
joelclaw gateway diagnose [--hours 1] [--lines 100]

# Session context: what happened recently? Exchanges, tools, errors.
joelclaw gateway review [--hours 1] [--max 20]
```

Start with `diagnose` to find the failure layer. Use `review` to understand what the gateway was doing when it broke. Only drop to manual log reading (below) when the CLI output isn't enough.

Artifact Locations

| Artifact | Path | What's in it |
|----------|------|--------------|
| Daemon stdout | `/tmp/joelclaw/gateway.log` | Startup info, event flow, responses, fallback messages |
| Daemon stderr | `/tmp/joelclaw/gateway.err` | Errors, stack traces, retries, fallback activations (check this first) |
| PID file | `/tmp/joelclaw/gateway.pid` | Current daemon process ID |
| Session ID | `~/.joelclaw/gateway.session` | Current pi session ID |
| Session transcripts | `~/.joelclaw/sessions/gateway/*.jsonl` | Full pi session history (most recent by mtime) |
| Gateway working dir | `~/.joelclaw/gateway/` | Has `.pi/settings.json` for compaction config |
| Launchd plist | `~/Library/LaunchAgents/com.joel.gateway.plist` | Service config, env vars, log paths |
| Start script | `~/.joelclaw/scripts/gateway-start.sh` | Secret leasing, env setup, bun invocation |
| Tripwire | `/tmp/joelclaw/last-heartbeat.ts` | Last heartbeat timestamp (updated every 15 min) |
| WS port | `/tmp/joelclaw/gateway.ws.port` | WebSocket port for TUI attach (default 3018) |
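
Because the tripwire is rewritten every 15 minutes, its age gives a cheap liveness signal. The sketch below is illustrative only: `heartbeat_stale` is a hypothetical helper, the 30-minute threshold (two missed beats) is an arbitrary choice, and it assumes the file's modification time tracks the last heartbeat write.

```bash
# Hypothetical helper: succeeds (exit 0) when the tripwire is missing or older
# than max_age_min minutes. Assumes mtime tracks the last heartbeat write.
heartbeat_stale() {
  local file="${1:-/tmp/joelclaw/last-heartbeat.ts}"
  local max_age_min="${2:-30}"   # two missed 15-minute beats
  [ -e "$file" ] || return 0     # no tripwire at all counts as stale
  # find prints the path only if it was modified more than max_age_min minutes ago
  [ -n "$(find "$file" -mmin +"$max_age_min" 2>/dev/null)" ]
}
```

For example, `heartbeat_stale && echo "heartbeat stale"` prints a warning only when the file is missing or old.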

Diagnostic Procedure

Run these steps in order. Stop and report at the first failure.

Layer 0: Process Health

```bash
# Is the daemon running?
launchctl list | grep gateway
ps aux | grep gateway | grep -v grep

# What's the PID and uptime?
cat /tmp/joelclaw/gateway.pid
# Compare PID to launchctl list output: mismatch = stale PID file
```

**Failure patterns:**
- PID mismatch between launchctl and PID file → daemon restarted, PID file stale
- Exit code non-zero in launchctl → crash loop, check gateway.err
- Process not running but launchctl shows it → zombie, `launchctl kickstart -k`
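
The stale-PID check can be scripted. This is a minimal sketch, not part of the joelclaw CLI: `pid_file_status` is a hypothetical helper that uses `kill -0`, which sends no signal and only reports whether the PID exists and can be signalled by the current user. It cannot confirm that the PID belongs to the gateway, so the comparison against `launchctl list` is still needed.

```bash
# Hypothetical helper: classifies a PID file as "missing", "stale", or "alive".
pid_file_status() {
  local file="${1:-/tmp/joelclaw/gateway.pid}"
  local pid
  [ -r "$file" ] || { echo "missing"; return 0; }
  pid="$(tr -d '[:space:]' < "$file")"
  case "$pid" in
    ''|*[!0-9]*) echo "stale"; return 0 ;;   # empty or non-numeric content
  esac
  # kill -0 sends no signal; it only checks that the process exists
  if kill -0 "$pid" 2>/dev/null; then
    echo "alive"
  else
    echo "stale"
  fi
}
```

Calling `pid_file_status` with no argument checks the default PID file path from the artifact table.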


Layer 1: CLI Status

```bash
joelclaw gateway status
```

Check:

- `redis: "connected"`. If not, the Redis pod is down.
- `activeSessions` should contain `gateway` with `alive: true`.
- `pending: 0`. If greater than 0, messages are backing up (session busy or stuck).

Layer 2: Error Log (the money log)

```bash
# Default: last 100 lines. Adjust for time range.
tail -100 /tmp/joelclaw/gateway.err
```

**Known error patterns:**

| Pattern | Meaning | Root Cause |
|---------|---------|-----------|
| `Agent is already processing` | Command queue tried to prompt while session streaming | Session busy — long turn, compaction, or initialization race |
| `fallback activated` | Model timeout or consecutive failures triggered model swap | Primary model API down or slow |
| `no streaming tokens after Ns` | Timeout — prompt dispatched but no response | Model API issue, auth failure, or session not ready |
| `session still streaming, retrying` | Drain loop retry (3 attempts, 2s each) | Turn taking longer than expected |
| `watchdog: session appears stuck` | No turn_end for 10+ minutes after prompt | Hung tool call or model hang |
| `watchdog: session appears dead` | 3+ consecutive prompt failures | Triggers self-restart via graceful shutdown |
| `OTEL emit request failed: TimeoutError` | Typesense unreachable | k8s port-forward or Typesense pod issue (secondary) |
| `prompt failed` with `consecutiveFailures: N` | Nth failure in a row | Check model API, session state |
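
To see which of these patterns dominate a log, a tally can help. The sketch below assumes the messages appear verbatim in `gateway.err` as listed in the table; `count_known_errors` is a hypothetical helper, not a joelclaw command.

```bash
# Hypothetical helper: counts lines matching each known pattern in an error log.
# Takes the log path as $1 (default: the gateway stderr log).
count_known_errors() {
  local log="${1:-/tmp/joelclaw/gateway.err}"
  local pattern
  for pattern in \
    'Agent is already processing' \
    'fallback activated' \
    'no streaming tokens after' \
    'session still streaming, retrying' \
    'watchdog: session appears stuck' \
    'watchdog: session appears dead' \
    'OTEL emit request failed: TimeoutError' \
    'prompt failed'
  do
    # grep -c prints the number of matching lines; -F treats the pattern literally
    printf '%s\t%s\n' "$(grep -cF -- "$pattern" "$log")" "$pattern"
  done
}
```

Piping the output through `sort -rn` puts the most frequent pattern first.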

Layer 3: Stdout Log (event flow)

```bash
tail -100 /tmp/joelclaw/gateway.log
```

Look for:

- `[gateway] daemon started`: last startup time, model, session ID
- `[gateway:telegram] message received`: did the message arrive?
- `[gateway:store] persisted inbound message`: was it persisted?
- `[gateway:fallback] prompt dispatched`: was a prompt sent to the model?
- `[gateway] response ready`: did the model respond?
- `[gateway:fallback] activated`: is the fallback model in use?
- `[redis] suppressed N noise event(s)`: which events are being filtered
- `[gateway:store] replayed unacked messages`: startup replay (can cause races)
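
The delivery markers above form a rough pipeline for a Telegram message: received, persisted, dispatched, answered. As a sketch under that assumption, the hypothetical helper `first_missing_stage` reports the first marker that never appears in a log file. It checks the file as a whole rather than tracing one specific message, and Redis-originated events would not produce the first marker.

```bash
# Hypothetical helper: prints the first pipeline stage with no matching log line,
# or "all stages present". Checks the whole file, not a single message.
first_missing_stage() {
  local log="${1:-/tmp/joelclaw/gateway.log}"
  local stage
  for stage in \
    '[gateway:telegram] message received' \
    '[gateway:store] persisted inbound message' \
    '[gateway:fallback] prompt dispatched' \
    '[gateway] response ready'
  do
    # -F matches the bracketed prefix literally instead of as a character class
    if ! grep -qF -- "$stage" "$log"; then
      echo "missing: $stage"
      return 1
    fi
  done
  echo "all stages present"
}
```

To limit the check to recent activity, save a slice first (for example `tail -100 /tmp/joelclaw/gateway.log > /tmp/recent.log`) and pass that file as the argument.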

Layer 4: E2E Delivery Test

```bash
joelclaw gateway test
# Wait 5 seconds
joelclaw gateway events
```

**Expected:** Test event pushed and drained (totalCount: 0 after drain).
**Failure:** Event stuck in queue → session not draining → check Layer 2 errors.

Layer 5: Session Transcript

```bash
# Find most recent gateway session
ls -lt ~/.joelclaw/sessions/gateway/*.jsonl | head -1

# Read last N lines of the session JSONL
tail -50 ~/.joelclaw/sessions/gateway/<session-file>.jsonl
```

Each line is a JSON object. Look for:
- `"type": "turn_end"` — confirms turns are completing
- `"type": "error"` — model or tool errors
- Long gaps between `turn_start` and `turn_end` — slow turns
- Tool call entries — what was the session doing when it got stuck?
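
A quick way to spot a turn that never completed is to compare the counts of `turn_start` and `turn_end` events. This sketch assumes one JSON object per line with a top-level `type` field, as described above; `unfinished_turns` is a hypothetical helper and uses `grep` rather than a JSON parser, so it is only a rough check.

```bash
# Hypothetical helper: prints how many turns started but never ended.
# Assumes one event per line with a "type" field.
unfinished_turns() {
  local file="$1"
  local started ended
  # grep -c exits non-zero on zero matches but still prints 0, hence "|| true"
  started="$(grep -cE '"type": ?"turn_start"' "$file" || true)"
  ended="$(grep -cE '"type": ?"turn_end"' "$file" || true)"
  echo $(( started - ended ))
}
```

Run it on the whole session file rather than a `tail` slice, since a slice can cut off the `turn_start` of an earlier turn and skew the count. A result of 1 suggests a turn is in flight or stuck.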

Layer 6: OTEL Telemetry

```bash
# Gateway-specific events
joelclaw otel search "gateway" --hours 1

# Fallback events
joelclaw otel search "fallback" --hours 1

# Queue events
joelclaw otel search "command-queue" --hours 1
```

Layer 7: Model API Health

```bash
# Quick API reachability test (auth error = API reachable)
curl -s -m 10 https://api.anthropic.com/v1/messages \
  -H "x-api-key: test" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{}' | jq .error.type
# Expected: "authentication_error" (means API is reachable)
```

Layer 8: Redis State

```bash
# Check gateway queue directly
kubectl exec -n joelclaw redis-0 -- redis-cli LLEN joelclaw:notify:gateway

# Check message store
kubectl exec -n joelclaw redis-0 -- redis-cli XLEN gateway:messages

# Check unacked messages (these replay on restart)
kubectl exec -n joelclaw redis-0 -- redis-cli XRANGE gateway:messages - + COUNT 5
```

Known Failure Scenarios

1. Session Initialization Race (ADR-0103 era)

**Symptoms:** "already processing" errors immediately after restart, unacked message replay fails.
**Cause:** Drain loop processes replayed messages before the pi session finishes initializing.
**Fix:** Restart clears it. If persistent, check if `replayUnacked()` runs before the session is ready.

2. Model API Timeout

**Symptoms:** "no streaming tokens after 90s", fallback activated.
**Cause:** Primary model (claude-opus-4-6) API slow or down.
**Fix:** Fallback auto-activates. Recovery probe runs every 10 min. If persistent, check Anthropic status.

3. Stuck Tool Call

**Symptoms:** Watchdog fires after 10 min, session stuck.
**Cause:** A tool call (bash, read, etc.) hanging indefinitely.
**Fix:** Watchdog auto-aborts. If the session stays stuck, run `joelclaw gateway restart`.

4. Redis Disconnection

**Symptoms:** Status shows redis disconnected, no events flowing.
**Cause:** Redis pod restart or port-forward dropped.
**Fix:** Run `kubectl get pods -n joelclaw` to verify. ioredis auto-reconnects.

5. Compaction During Message Delivery

**Symptoms:** "already processing" after a successful `turn_end`.
**Cause:** Auto-compaction triggers after `turn_end`, so the session enters the streaming state again before the drain loop processes the next message.
**Fix:** The idle waiter should block until compaction finishes. If not, this is a pi SDK gap.

Fallback Controller State

The gateway has a model fallback controller (ADR-0091) that swaps models when the primary fails:

- **Threshold:** 90s timeout for first token, or 3 consecutive prompt failures (configurable)
- **Fallback model:** gpt-5.3-codex-spark (via openai-codex provider)
- **Recovery:** Probes primary model every 10 minutes

OTEL events: `model_fallback.swapped`, `model_fallback.primary_restored`, `model_fallback.probe_failed`

Check fallback state in gateway.log: `[gateway:fallback] activated` / `recovered`
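
The current fallback state can be read off the most recent transition in the stdout log. This sketch assumes recovery is logged as `[gateway:fallback] recovered`, which the text above only implies; `fallback_state` is a hypothetical helper, not a joelclaw command.

```bash
# Hypothetical helper: prints "fallback", "primary", or "unknown" based on the
# most recent fallback transition in the stdout log.
# Assumption: recovery is logged as "[gateway:fallback] recovered".
fallback_state() {
  local log="${1:-/tmp/joelclaw/gateway.log}"
  local last
  # Keep only transition lines, then take the newest one
  last="$(grep -E '\[gateway:fallback\] (activated|recovered)' "$log" | tail -n 1 || true)"
  case "$last" in
    *'[gateway:fallback] activated'*) echo "fallback" ;;
    *'[gateway:fallback] recovered'*) echo "primary" ;;
    *) echo "unknown" ;;   # no transition found in this log
  esac
}
```

An "unknown" result means no transition appears in the log, which most likely indicates the primary model has been in use since the last startup.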

Architecture Reference

```
Telegram → channels/telegram.ts → enqueueToGateway()
Redis    → channels/redis.ts    → enqueueToGateway()
                                        ↓
                                 command-queue.ts
                                   (serial FIFO)
                                        ↓
                              session.prompt(text)
                                        ↓
                              pi SDK (isStreaming gate)
                                        ↓
                              Model API (claude-opus-4-6)
                                        ↓
                              turn_end → idleWaiter resolves
                                        ↓
                              Response routed to origin channel
```

The command queue processes ONE prompt at a time. `idleWaiter` blocks until `turn_end` fires. If a prompt is in flight, new messages queue behind it.

Key Code

| File | What to look for |
|------|------------------|
| `packages/gateway/src/daemon.ts` | Session creation, event handler, idle waiter, watchdog |
| `packages/gateway/src/command-queue.ts` | `drain()` loop, retry logic, idle gate |
| `packages/gateway/src/model-fallback.ts` | Timeout tracking, fallback swap, recovery probes |
| `packages/gateway/src/channels/redis.ts` | Event batching, prompt building, sleep mode |
| `packages/gateway/src/channels/telegram.ts` | Bot polling, message routing |
| `packages/gateway/src/heartbeat.ts` | Tripwire writer only (ADR-0103: no prompt injection) |
