Debug

Goals

  • Find why a run is stuck, retrying, or failing.
  • Correlate Linear issue identity to a Codex session quickly.
  • Read the right logs in the right order to isolate root cause.

Log Sources

  • Primary runtime log: log/symphony.log
    • Default comes from SymphonyElixir.LogFile (log/symphony.log).
    • Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
  • Rotated runtime logs: log/symphony.log*
    • Check these when the relevant run is older.
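
When a run spans a rotation, reading the files oldest first keeps the timeline in order. A minimal sketch, assuming file modification time reflects rotation order (the rotation naming scheme is not documented here, so the sketch does not rely on suffixes). The helper name logs_oldest_first is hypothetical.

```shell
# Hypothetical helper: list the runtime log and its rotated siblings, oldest first.
# Assumption: modification time reflects rotation order.
logs_oldest_first() {
  dir="${1:-log}"
  # -t sorts by modification time (newest first); -r reverses that to oldest first.
  ls -tr "$dir"/symphony.log* 2>/dev/null
}

# Usage (assumes rg is installed): concatenate in that order, then search.
#   logs_oldest_first | xargs cat | rg "issue_identifier=MT-625"
```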

Correlation Keys

  • issue_identifier: human ticket key (example: MT-625)
  • issue_id: Linear UUID (stable internal ID)
  • session_id: Codex thread-turn pair (<thread_id>-<turn_id>)
elixir/docs/logging.md requires these fields for issue/session lifecycle logs. Use them as your join keys during debugging.
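
To use the keys as join keys, it helps to pull a single field's value out of matching lines. A minimal sketch, assuming fields are written as key=value and end at a space or semicolon (the same delimiter assumption as the session_id pattern used in this document's commands). The helper name log_field is hypothetical.

```shell
# Hypothetical helper: print the value of one key=value field for each line on stdin.
# Assumption: values contain no spaces or semicolons.
log_field() {
  awk -v key="$1" '{
    n = split($0, tok, /[ ;]+/)
    for (i = 1; i <= n; i++)
      # Match only tokens that start with "key=", so issue_id never matches issue_identifier.
      if (index(tok[i], key "=") == 1) print substr(tok[i], length(key) + 2)
  }'
}

# Usage (assumes rg is installed):
#   rg --no-filename "issue_identifier=MT-625" log/symphony.log* | log_field issue_id | sort -u
```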

Quick Triage (Stuck Run)

  1. Confirm scheduler/worker symptoms for the ticket.
  2. Find recent lines for the ticket (issue_identifier first).
  3. Extract session_id from matching lines.
  4. Trace that session_id across start, stream, completion/failure, and stall handling logs.
  5. Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
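
Steps 2 and 3 can be sketched as one filter that lists the sessions seen for a ticket. This assumes issue_identifier and session_id appear on the same log line and that input arrives oldest first. The helper name sessions_for_ticket is hypothetical. It compares the whole token, because a plain substring search for issue_identifier=MT-62 would also match MT-625.

```shell
# Hypothetical helper: print each session_id seen on a ticket's lines, in first-seen order.
# Assumptions: key=value fields end at a space or semicolon; both keys share a line.
sessions_for_ticket() {
  awk -v want="issue_identifier=$1" '{
    n = split($0, tok, /[ ;]+/); hit = 0; sid = ""
    for (i = 1; i <= n; i++) {
      if (tok[i] == want) hit = 1                                  # exact ticket match
      if (index(tok[i], "session_id=") == 1) sid = substr(tok[i], 12)
    }
    if (hit && sid != "" && !(sid in seen)) { seen[sid] = 1; print sid }
  }'
}

# Usage: with chronological input, the last line is the most recently started session.
#   cat log/symphony.log | sessions_for_ticket MT-625 | tail -n 1
```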

Commands

1) Narrow by ticket key (fastest entry point)

rg -n "issue_identifier=MT-625" log/symphony.log*

2) If needed, narrow by Linear UUID

rg -n "issue_id=<linear-uuid>" log/symphony.log*

3) Pull session IDs seen for that ticket

rg --no-filename "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u

4) Trace one session end-to-end

rg -n "session_id=<thread>-<turn>" log/symphony.log*

5) Focus on stuck/retry signals

rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*
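
To see which stuck/retry signal dominates, the matches from command 5 can be tallied. A minimal sketch that counts the same six strings as that command on whatever lines it receives. The helper name signal_counts is hypothetical, and the signal strings are taken verbatim from the pattern above.

```shell
# Hypothetical helper: count how often each stuck/retry signal appears on stdin.
signal_counts() {
  awk '
    BEGIN {
      n = split("Issue stalled,scheduling retry,turn_timeout,turn_failed,Codex session failed,Codex session ended with error", sig, ",")
    }
    # A line counts once per signal it contains.
    { for (i = 1; i <= n; i++) if (index($0, sig[i]) > 0) count[sig[i]]++ }
    END { for (i = 1; i <= n; i++) printf "%d %s\n", count[sig[i]], sig[i] }
  '
}

# Usage (assumes rg is installed):
#   rg --no-filename "issue_identifier=MT-625" log/symphony.log* | signal_counts
```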

Investigation Flow

  1. Locate the ticket slice:
    • Search by issue_identifier=<KEY>.
    • If noise is high, add issue_id=<UUID>.
  2. Establish timeline:
    • Identify the first "Codex session started ... session_id=..." line.
    • Follow with "Codex session completed", "ended with error", or worker exit lines.
  3. Classify the problem:
    • Stall loop: "Issue stalled ... restarting with backoff".
    • App-server startup: "Codex session failed ...".
    • Turn execution failure: turn_failed, turn_cancelled, turn_timeout, or "ended with error".
    • Worker crash: "Agent task exited ... reason=...".
  4. Validate scope:
    • Check whether failures are isolated to one issue/session or repeating across multiple tickets.
  5. Capture evidence:
    • Save key log lines with timestamps, issue_identifier, issue_id, and session_id.
    • Record probable root cause and the exact failing stage.
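
Steps 3 and 4 can be sketched as a filter that labels each failure line with a class and its ticket. The class labels (stall_loop, app_server_startup, turn_failure, worker_crash) and the helper name classify_lines are hypothetical. The order in which patterns are checked is an assumption, and lines matching none of them are dropped.

```shell
# Hypothetical helper: print "<class> <ticket>" for each failure line on stdin.
# Assumption: the ticket is carried as issue_identifier=<KEY>, ending at a space or semicolon.
classify_lines() {
  awk '{
    class = ""
    if      ($0 ~ /Issue stalled/)        class = "stall_loop"
    else if ($0 ~ /Codex session failed/) class = "app_server_startup"
    else if ($0 ~ /turn_failed|turn_cancelled|turn_timeout|ended with error/) class = "turn_failure"
    else if ($0 ~ /Agent task exited/)    class = "worker_crash"
    if (class == "") next
    ticket = "unknown"
    n = split($0, tok, /[ ;]+/)
    for (i = 1; i <= n; i++)
      if (index(tok[i], "issue_identifier=") == 1) ticket = substr(tok[i], 18)
    print class, ticket
  }'
}

# Usage: a single dominant ticket suggests an isolated failure; many tickets suggest a systemic one.
#   cat log/symphony.log* | classify_lines | sort | uniq -c
```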

Reading Codex Session Logs

In Symphony, Codex session diagnostics are emitted into log/symphony.log and keyed by session_id. Read them as a lifecycle:
  1. "Codex session started ... session_id=..."
  2. Session stream/lifecycle events for the same session_id
  3. Terminal event:
    • "Codex session completed ...", or
    • "Codex session ended with error ...", or
    • "Issue stalled ... restarting with backoff"
For one specific session investigation, keep the trace narrow:
  1. Capture one session_id for the ticket.
  2. Build a timestamped slice for only that session:
    • rg -n "session_id=<thread>-<turn>" log/symphony.log*
  3. Mark the exact failing stage:
    • Startup failure before stream events ("Codex session failed ...").
    • Turn/runtime failure after stream events (turn_* / "ended with error").
    • Stall recovery ("Issue stalled ... restarting with backoff").
  4. Pair findings with issue_identifier and issue_id from nearby lines to confirm you are not mixing concurrent retries.
Always pair session findings with issue_identifier / issue_id to avoid mixing concurrent runs.
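
The lifecycle reading above can be sketched as a one-line summary per session. This assumes every lifecycle line for the session, including the stall line, carries session_id=<value>, which this document does not confirm. Startup failures ("Codex session failed ...") are left out because they happen before stream events and may not carry a session_id. The helper name session_stage and the output labels are hypothetical.

```shell
# Hypothetical helper: summarize one session's lifecycle from log lines on stdin.
# Input should be in chronological order so the last terminal event wins.
session_stage() {
  awk -v want="session_id=$1" '
    {
      n = split($0, tok, /[ ;]+/); hit = 0
      for (i = 1; i <= n; i++) if (tok[i] == want) hit = 1   # exact session match
      if (!hit) next
      lines++
      if ($0 ~ /Codex session started/) started = 1
      if      ($0 ~ /Codex session completed/)        terminal = "completed"
      else if ($0 ~ /Codex session ended with error/) terminal = "ended_with_error"
      else if ($0 ~ /Issue stalled/)                  terminal = "stalled"
    }
    END {
      if (terminal == "") terminal = "none_seen"
      printf "lines=%d started=%s terminal=%s\n", lines, (started ? "yes" : "no"), terminal
    }
  '
}

# Usage: terminal=none_seen on an old session is a hint to check for a stall or worker exit.
#   cat log/symphony.log | session_stage "<thread>-<turn>"
```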

Notes

  • Prefer rg over grep for speed on large logs.
  • Check rotated logs (log/symphony.log*) before concluding data is missing.
  • If required context fields are missing in new log statements, align with elixir/docs/logging.md conventions.
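
A quick way to spot log statements that lack the required context fields is to flag session lifecycle lines missing any of the three keys. A minimal sketch, assuming the lifecycle messages named in this document ("Codex session started", "completed", "ended with error") are the ones that must carry all three fields. The helper name missing_context is hypothetical.

```shell
# Hypothetical check: print session lifecycle lines that lack a correlation key.
missing_context() {
  awk '/Codex session (started|completed|ended with error)/ {
    miss = ""
    # Require each key at line start or after a space or semicolon.
    if ($0 !~ /(^|[ ;])issue_identifier=/) miss = miss " issue_identifier"
    if ($0 !~ /(^|[ ;])issue_id=/)         miss = miss " issue_id"
    if ($0 !~ /(^|[ ;])session_id=/)       miss = miss " session_id"
    if (miss != "") print "missing" miss ": " $0
  }'
}

# Usage:
#   cat log/symphony.log* | missing_context
```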