Debug

Goals

  • Find why a run is stuck, retrying, or failing.
  • Correlate Linear issue identity to a Codex session quickly.
  • Read the right logs in the right order to isolate root cause.

Log Sources

  • Primary runtime log: log/symphony.log
    • Default comes from SymphonyElixir.LogFile (log/symphony.log).
    • Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
  • Rotated runtime logs: log/symphony.log*
    • Check these when the relevant run is older.

Correlation Keys

  • issue_identifier: human ticket key (example: MT-625)
  • issue_id: Linear UUID (stable internal ID)
  • session_id: Codex thread-turn pair (<thread_id>-<turn_id>)
elixir/docs/logging.md requires these fields for issue/session lifecycle logs. Use them as your join keys during debugging.
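
As a minimal sketch of treating these fields as join keys, the snippet below pulls all three out of one log line. The sample line (its timestamp, level tag, and field order) and the field helper are hypothetical; only the key names and the [^ ;]+ value pattern come from this guide. It uses grep so it runs without ripgrep installed; the same pattern works with rg -o.

```shell
# Hypothetical log line: only the key=value field names come from this guide.
line='2025-01-01T10:00:00Z [info] Codex session started issue_id=11111111-2222-3333-4444-555555555555 issue_identifier=MT-625 session_id=thr_1-turn_1'

# field KEY LINE: print the value of the first KEY=value token in LINE.
# Values are assumed to end at a space or semicolon.
field() {
  printf '%s\n' "$2" | grep -oE "$1=[^ ;]+" | head -n 1 | cut -d= -f2-
}

ticket=$(field issue_identifier "$line")
uuid=$(field issue_id "$line")
session=$(field session_id "$line")

printf 'ticket=%s\nuuid=%s\nsession=%s\n' "$ticket" "$uuid" "$session"
```

The sketch leaves session_id whole rather than splitting it on the hyphen, since nothing here guarantees that a thread ID is itself hyphen-free.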

Quick Triage (Stuck Run)

  1. Confirm scheduler/worker symptoms for the ticket.
  2. Find recent lines for the ticket (issue_identifier first).
  3. Extract session_id from matching lines.
  4. Trace that session_id across start, stream, completion/failure, and stall handling logs.
  5. Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
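
Steps 2 to 4 above can be sketched as one small pipeline. This is an illustration under assumptions, not a supported tool: the sample log lines, the temporary directory, and the variable names are hypothetical, and grep stands in for rg so the sketch runs anywhere. One detail worth noting is the trailing ([ ;]|$) in the ticket pattern, which keeps a search for MT-625 from also matching a longer key such as MT-6251.

```shell
# Throwaway sample log so the sketch is runnable; the line layout is an assumption.
dir=$(mktemp -d)
cat > "$dir/symphony.log" <<'EOF'
10:00:01 Codex session started issue_identifier=MT-625 issue_id=aaa session_id=t1-u1
10:00:05 Codex session ended with error issue_identifier=MT-625 issue_id=aaa session_id=t1-u1
10:00:06 Codex session started issue_identifier=MT-6251 issue_id=bbb session_id=t9-u1
10:00:09 Codex session started issue_identifier=MT-625 issue_id=aaa session_id=t2-u1
EOF

ticket=MT-625

# Step 2: keep only lines for this exact ticket key.
# Step 3: extract the unique session IDs seen on those lines.
sessions=$(grep -hE "issue_identifier=$ticket([ ;]|\$)" "$dir"/symphony.log* \
  | grep -oE 'session_id=[^ ;]+' | sort -u)
printf '%s\n' "$sessions"

# Step 4: show the last line logged for each session, i.e. where it stopped.
for s in $sessions; do
  grep -hF "$s" "$dir"/symphony.log* | tail -n 1
done

rm -rf "$dir"
```

The last line per session is only a starting point for step 5; the full slice for a session is still needed to tell a stall from a turn failure.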

Commands

1) Narrow by ticket key (fastest entry point)

rg -n "issue_identifier=MT-625" log/symphony.log*

2) If needed, narrow by Linear UUID

rg -n "issue_id=<linear-uuid>" log/symphony.log*

3) Pull session IDs seen for that ticket

rg -I "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u

4) Trace one session end-to-end

rg -n "session_id=<thread>-<turn>" log/symphony.log*

5) Focus on stuck/retry signals

rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*

Investigation Flow

  1. Locate the ticket slice:
    • Search by issue_identifier=<KEY>.
    • If noise is high, add issue_id=<UUID>.
  2. Establish timeline:
    • Identify first Codex session started ... session_id=... line.
    • Follow with Codex session completed, ended with error, or worker exit lines.
  3. Classify the problem:
    • Stall loop: Issue stalled ... restarting with backoff.
    • App-server startup: Codex session failed ...
    • Turn execution failure: turn_failed, turn_cancelled, turn_timeout, or ended with error.
    • Worker crash: Agent task exited ... reason=...
  4. Validate scope:
    • Check whether failures are isolated to one issue/session or repeating across multiple tickets.
  5. Capture evidence:
    • Save key log lines with timestamps, issue_identifier, issue_id, and session_id.
    • Record probable root cause and the exact failing stage.
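
For the "Validate scope" step, one rough way to see whether a failure is isolated is to count how many distinct tickets each failure signal touches. The sketch below assumes hypothetical sample lines in a temporary directory and that each failure line carries an issue_identifier field; the signal strings are the ones named in the classification step. It uses grep so it runs without ripgrep.

```shell
# Throwaway sample log; the line layout is an assumption, not Symphony's real format.
dir=$(mktemp -d)
cat > "$dir/symphony.log" <<'EOF'
10:00:01 Issue stalled issue_identifier=MT-625 session_id=t1-u1; restarting with backoff
10:02:01 Issue stalled issue_identifier=MT-625 session_id=t2-u1; restarting with backoff
10:03:00 Codex session failed issue_identifier=MT-625 reason=example
10:04:00 Issue stalled issue_identifier=MT-700 session_id=t5-u1; restarting with backoff
EOF

# For each signal, count the distinct tickets whose lines contain it.
scope=$(for signal in 'Issue stalled' 'Codex session failed' 'turn_timeout' 'Agent task exited'; do
  n=$(grep -hF "$signal" "$dir"/symphony.log* \
    | grep -oE 'issue_identifier=[^ ;]+' | sort -u | wc -l | tr -d ' ')
  printf '%s: %s ticket(s)\n' "$signal" "$n"
done)
printf '%s\n' "$scope"

rm -rf "$dir"
```

A signal that shows up under several tickets points toward a shared cause rather than something specific to one issue, which is worth recording alongside the evidence in step 5.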

Reading Codex Session Logs

In Symphony, Codex session diagnostics are emitted into log/symphony.log and keyed by session_id. Read them as a lifecycle:
  1. Codex session started ... session_id=...
  2. Session stream/lifecycle events for the same session_id
  3. Terminal event:
    • Codex session completed ..., or
    • Codex session ended with error ..., or
    • Issue stalled ... restarting with backoff
For one specific session investigation, keep the trace narrow:
  1. Capture one session_id for the ticket.
  2. Build a timestamped slice for only that session:
    • rg -n "session_id=<thread>-<turn>" log/symphony.log*
  3. Mark the exact failing stage:
    • Startup failure before stream events (Codex session failed ...).
    • Turn/runtime failure after stream events (turn_* / ended with error).
    • Stall recovery (Issue stalled ... restarting with backoff).
  4. Pair findings with issue_identifier and issue_id from nearby lines to confirm you are not mixing concurrent retries.
Always pair session findings with issue_identifier / issue_id to avoid mixing concurrent runs.
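
The "mark the exact failing stage" step can be approximated by looking at the last terminal-looking line in a session's slice. This is a sketch under assumptions: the stage_of helper, its stage labels, and the sample lines are hypothetical, and it presumes the slice is already in chronological order (with several rotated files, order the lines by timestamp first). grep is used so the sketch runs without ripgrep.

```shell
# Throwaway sample log; the line layout is an assumption, not Symphony's real format.
dir=$(mktemp -d)
cat > "$dir/symphony.log" <<'EOF'
10:00:01 Codex session started issue_identifier=MT-625 session_id=t1-u1
10:00:04 Codex session ended with error issue_identifier=MT-625 session_id=t1-u1 reason=turn_timeout
10:01:00 Codex session started issue_identifier=MT-625 session_id=t2-u1
10:01:30 Codex session completed issue_identifier=MT-625 session_id=t2-u1
10:02:00 Codex session started issue_identifier=MT-625 session_id=t3-u1
EOF

# stage_of SESSION DIR: label the last terminal event logged for SESSION.
stage_of() {
  last=$(grep -hE "session_id=$1([ ;]|\$)" "$2"/symphony.log* \
    | grep -E 'Codex session (completed|failed|ended with error)|turn_(failed|cancelled|timeout)|Issue stalled' \
    | tail -n 1)
  case $last in
    *'Codex session completed'*) echo completed ;;
    *'Codex session failed'*) echo startup-failure ;;
    *'Issue stalled'*) echo stall-recovery ;;
    *turn_*|*'ended with error'*) echo turn-failure ;;
    *) echo no-terminal-event ;;
  esac
}

r1=$(stage_of t1-u1 "$dir")
r2=$(stage_of t2-u1 "$dir")
r3=$(stage_of t3-u1 "$dir")
printf 't1-u1: %s\nt2-u1: %s\nt3-u1: %s\n' "$r1" "$r2" "$r3"

rm -rf "$dir"
```

A result of no-terminal-event means the session has a start line but nothing terminal yet, so it may still be running or may have been lost to a worker exit; check the nearby issue_identifier lines before drawing a conclusion.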

Notes

  • Prefer rg over grep for speed on large logs.
  • Check rotated logs (log/symphony.log*) before concluding data is missing.
  • If required context fields are missing in new log statements, align with elixir/docs/logging.md conventions.