Debug
Goals
- Find why a run is stuck, retrying, or failing.
- Correlate Linear issue identity to a Codex session quickly.
- Read the right logs in the right order to isolate root cause.
Log Sources
- Primary runtime log: `log/symphony.log`
  - Default comes from `SymphonyElixir.LogFile` (`log/symphony.log`).
  - Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
- Rotated runtime logs: `log/symphony.log*`
  - Check these when the relevant run is older.
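
When a run spans a rotation, it can help to read the files in chronological order. The sketch below is one way to do that, assuming file modification time reflects rotation order (the rotated-file naming scheme is not specified here); `logs_oldest_first` is a hypothetical helper name, not part of Symphony.

```bash
#!/usr/bin/env bash
# List the given log files oldest first, so rotated logs come before the
# current one. Assumption: mtime order matches rotation order.
logs_oldest_first() {
  ls -1tr -- "$@"
}
```

For example, `logs_oldest_first log/symphony.log*` prints one path per line, which can then be searched file by file in time order.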
Correlation Keys
- `issue_identifier`: human ticket key (example: `MT-625`)
- `issue_id`: Linear UUID (stable internal ID)
- `session_id`: Codex thread-turn pair (`<thread_id>-<turn_id>`)
Logging conventions: `elixir/docs/logging.md`
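
The three keys can be pulled out of matching log lines mechanically. A minimal sketch, assuming each field is logged as `key=value` and terminated by a space or `;` (the same terminator the `session_id` search pattern in this guide relies on); `correlation_keys` is a hypothetical helper name, not part of Symphony.

```bash
#!/usr/bin/env bash
# Read log lines on stdin and print each distinct
# "issue_identifier issue_id session_id" triple once.
# Assumption: fields look like key=value and end at a space or ";".
correlation_keys() {
  awk '
    # Return the value of key=... on this line, or "-" when the key is absent.
    function field(line, key,    m) {
      # The leading non-identifier character stops a short key from
      # matching inside a longer key name.
      if (match(" " line, "[^A-Za-z_]" key "=[^ ;]+")) {
        m = substr(" " line, RSTART, RLENGTH)
        sub(/^[^=]*=/, "", m)
        return m
      }
      return "-"
    }
    {
      triple = field($0, "issue_identifier") " " field($0, "issue_id") " " field($0, "session_id")
      # Skip lines that carry none of the three keys; de-duplicate the rest.
      if (triple != "- - -" && !seen[triple]++) print triple
    }
  '
}
```

Piping search results into it, for example `rg --no-filename "issue_identifier=MT-625" log/symphony.log* | correlation_keys`, gives a compact join table; a `-` marks a key that was missing on that line.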
Quick Triage (Stuck Run)
- Confirm scheduler/worker symptoms for the ticket.
- Find recent lines for the ticket (`issue_identifier` first).
- Extract `session_id` from matching lines.
- Trace that `session_id` across start, stream, completion/failure, and stall handling logs.
- Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
Commands
```bash
# 1) Narrow by ticket key (fastest entry point)
rg -n "issue_identifier=MT-625" log/symphony.log*

# 2) If needed, narrow by Linear UUID
rg -n "issue_id=<linear-uuid>" log/symphony.log*

# 3) Pull session IDs seen for that ticket
#    (filters to the ticket first; assumes session_id is logged on the same line)
rg "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u

# 4) Trace one session end-to-end
rg -n "session_id=<thread>-<turn>" log/symphony.log*

# 5) Focus on stuck/retry signals
rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*
```
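
Steps 1, 3, and 4 can be chained into one pass per ticket. The following is a sketch under stated assumptions, not a Symphony tool: `search`, `sessions_for_ticket`, and `trace_ticket` are hypothetical names, `session_id` is assumed to appear on the same line as `issue_identifier`, and the `grep -E` fallback exists only so the script still runs where `rg` is not installed.

```bash
#!/usr/bin/env bash
# Use rg when available, otherwise fall back to grep -E.
# Only options accepted by both tools are passed (-e, -o, --no-filename).
search() {
  if command -v rg >/dev/null 2>&1; then rg "$@"; else grep -E "$@"; fi
}

# sessions_for_ticket <ticket-key> <log-file>...
# Prints each distinct session_id=... token logged for that ticket.
sessions_for_ticket() {
  local key="$1"; shift
  # The ([ ;]|$) suffix keeps MT-625 from also matching MT-6250.
  search --no-filename -e "issue_identifier=${key}([ ;]|\$)" "$@" \
    | search -o -e 'session_id=[^ ;]+' \
    | sort -u
}

# trace_ticket <ticket-key> <log-file>...
# Prints every line of every session seen for the ticket, grouped by session.
trace_ticket() {
  local key="$1"; shift
  local sid
  for sid in $(sessions_for_ticket "$key" "$@"); do
    echo "== ${sid} =="
    search --no-filename -e "${sid}([ ;]|\$)" "$@"
  done
}
```

A typical call would be `trace_ticket MT-625 log/symphony.log*`. Lines are printed in the order the files are given, so pass rotated logs first if chronological order matters.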
Investigation Flow
- Locate the ticket slice:
  - Search by `issue_identifier=<KEY>`.
  - If noise is high, add `issue_id=<UUID>`.
- Establish timeline:
  - Identify first `Codex session started ... session_id=...`.
  - Follow with `Codex session completed`, `ended with error`, or worker exit lines.
- Classify the problem:
  - Stall loop: `Issue stalled ... restarting with backoff`.
  - App-server startup: `Codex session failed ...`.
  - Turn execution failure: `turn_failed`, `turn_cancelled`, `turn_timeout`, or `ended with error`.
  - Worker crash: `Agent task exited ... reason=...`.
- Validate scope:
  - Check whether failures are isolated to one issue/session or repeating across multiple tickets.
- Capture evidence:
  - Save key log lines with timestamps, `issue_identifier`, `issue_id`, and `session_id`.
  - Record probable root cause and the exact failing stage.
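
The classification and scope checks above can be approximated with two small filters over log lines. This is a sketch, assuming the message fragments listed above appear verbatim in the log and that `issue_identifier` ends at a space or `;`; the function names and the class labels (`stall_loop`, `app_server_startup`, `turn_failure`, `worker_crash`) are hypothetical.

```bash
#!/usr/bin/env bash
# Count log lines on stdin per failure class. Each line is counted once,
# under the first class whose signal it contains.
classify_failures() {
  awk '
    /Issue stalled/                                            { n["stall_loop"]++; next }
    /Codex session failed/                                     { n["app_server_startup"]++; next }
    /turn_failed|turn_cancelled|turn_timeout|ended with error/ { n["turn_failure"]++; next }
    /Agent task exited/                                        { n["worker_crash"]++; next }
    END {
      split("stall_loop app_server_startup turn_failure worker_crash", order, " ")
      for (i = 1; i <= 4; i++) printf "%s=%d\n", order[i], n[order[i]]
    }
  '
}

# Count failure-signal lines per ticket, most affected ticket first.
# One ticket in the output suggests an isolated problem; many suggest a
# systemic one.
failures_by_ticket() {
  awk '
    /Issue stalled|Codex session failed|turn_failed|turn_cancelled|turn_timeout|ended with error|Agent task exited/ {
      if (match($0, /issue_identifier=[^ ;]+/)) {
        # 17 is the length of "issue_identifier=".
        n[substr($0, RSTART + 17, RLENGTH - 17)]++
      }
    }
    END { for (k in n) printf "%d %s\n", n[k], k }
  ' | sort -k1,1nr -k2,2
}
```

Both read from stdin, so they compose with the searches above, for example `rg --no-filename "issue_identifier=MT-625" log/symphony.log* | classify_failures`, or `cat log/symphony.log | failures_by_ticket` for a cross-ticket view.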
Reading Codex Session Logs
In Symphony, Codex session diagnostics are emitted into `log/symphony.log` and
keyed by `session_id`. Read them as a lifecycle:
- `Codex session started ... session_id=...`
- Session stream/lifecycle events for the same `session_id`
- Terminal event:
  - `Codex session completed ...`, or
  - `Codex session ended with error ...`, or
  - `Issue stalled ... restarting with backoff`
For one specific session investigation, keep the trace narrow:
- Capture one `session_id` for the ticket.
- Build a timestamped slice for only that session:
  `rg -n "session_id=<thread>-<turn>" log/symphony.log*`
- Mark the exact failing stage:
  - Startup failure before stream events (`Codex session failed ...`).
  - Turn/runtime failure after stream events (`turn_*` / `ended with error`).
  - Stall recovery (`Issue stalled ... restarting with backoff`).
- Pair findings with `issue_identifier` and `issue_id` from nearby lines to confirm you are not mixing concurrent retries.
Always pair session findings with `issue_identifier` / `issue_id` to avoid mixing
concurrent runs.
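
Marking the failing stage for one session can be reduced to a single pass over that session's lines. A minimal sketch, assuming the lines arrive in chronological order and contain the message fragments quoted above; `session_outcome` and its output labels are hypothetical, and `turn_*` is narrowed to the three turn events this guide names.

```bash
#!/usr/bin/env bash
# Read one session's log lines on stdin (oldest first) and print the stage
# indicated by the last lifecycle signal seen.
session_outcome() {
  awk '
    /Codex session started/   { started = 1 }
    /Codex session failed/    { outcome = "startup_failure" }
    /turn_(failed|cancelled|timeout)|Codex session ended with error/ {
      outcome = "turn_or_runtime_failure"
    }
    /Issue stalled/           { outcome = "stall_recovery" }
    /Codex session completed/ { outcome = "completed" }
    END {
      # No terminal signal: distinguish a session still in flight from a
      # slice that never recorded a start.
      if (outcome == "") outcome = started ? "no_terminal_event" : "no_session_start"
      print outcome
    }
  '
}
```

For example, `rg --no-filename "session_id=<thread>-<turn>" log/symphony.log | session_outcome` prints a single label. A result of `no_terminal_event` means the session started but the slice shows no completion, error, or stall line yet.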
Notes
- Prefer `rg` over `grep` for speed on large logs.
- Check rotated logs (`log/symphony.log*`) before concluding data is missing.
- If required context fields are missing in new log statements, align with `elixir/docs/logging.md` conventions.