debug
Goals
- Find why a run is stuck, retrying, or failing.
- Correlate Linear issue identity to a Codex session quickly.
- Read the right logs in the right order to isolate root cause.
Log Sources
- Primary runtime log: `log/symphony.log`
  - Default comes from `SymphonyElixir.LogFile` (`log/symphony.log`).
  - Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
- Rotated runtime logs: `log/symphony.log*`
  - Check these when the relevant run is older.
Correlation Keys
- `issue_identifier`: human ticket key (example: `MT-625`)
- `issue_id`: Linear UUID (stable internal ID)
- `session_id`: Codex thread-turn pair (`<thread_id>-<turn_id>`)

See `elixir/docs/logging.md` for the logging conventions.
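To make the keys above concrete, here is a minimal sketch of pulling one key's value out of log lines. It assumes fields are written as `key=value` tokens that end at a space or semicolon, which is what the search patterns in this document imply but is not confirmed here as the exact log format. `extract_field` is a hypothetical helper name, not part of Symphony.

```bash
# extract_field KEY
# Reads log lines on stdin and prints every value of KEY, one per line.
# Assumption: fields look like key=value and are delimited by spaces or semicolons.
extract_field() {
  awk -v key="$1" '{
    # Split the line into tokens on runs of spaces and semicolons.
    n = split($0, tok, /[ ;]+/)
    for (i = 1; i <= n; i++) {
      # Keep only tokens that start with "KEY=", so issue_id does not
      # match issue_identifier and session_id does not match prev_session_id.
      if (index(tok[i], key "=") == 1) {
        print substr(tok[i], length(key) + 2)
      }
    }
  }'
}
```

For example, `rg --no-filename "issue_identifier=MT-625" log/symphony.log* | extract_field session_id | sort -u` would list the session IDs on lines for that ticket, under the same format assumption.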
Quick Triage (Stuck Run)
- Confirm scheduler/worker symptoms for the ticket.
- Find recent lines for the ticket (`issue_identifier` first).
- Extract `session_id` from matching lines.
- Trace that `session_id` across start, stream, completion/failure, and stall handling logs.
- Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
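The second and third triage steps can be sketched as one helper that shows which sessions a ticket produced and how many log lines each one has, which makes a retry loop easy to spot. This is an illustration under assumptions: `session_line_counts` is a hypothetical name, fields are assumed to be `key=value` tokens ending at a space or semicolon, and POSIX `grep` is used only so the sketch runs where `rg` is not installed.

```bash
# session_line_counts KEY FILE...
# Prints "session_id=<id> <line count>" for every session seen on lines
# that carry issue_identifier=KEY, sorted by session ID.
session_line_counts() {
  local key="$1"
  shift
  # Require a delimiter (or end of line) after the key so that MT-625
  # does not also match MT-6250.
  grep -hE -- "issue_identifier=${key}([ ;]|\$)" "$@" \
    | grep -oE 'session_id=[^ ;]+' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

A call such as `session_line_counts MT-625 log/symphony.log*` gives one row per session. Many short sessions for a single ticket would be consistent with a retry loop, though the counts alone do not prove one.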
Commands
```bash
# 1) Narrow by ticket key (fastest entry point)
rg -n "issue_identifier=MT-625" log/symphony.log*

# 2) If needed, narrow by Linear UUID
rg -n "issue_id=<linear-uuid>" log/symphony.log*

# 3) Pull session IDs seen for that ticket
#    (filter by ticket first; --no-filename lets sort -u dedupe across rotated files)
rg --no-filename "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u

# 4) Trace one session end-to-end
rg -n "session_id=<thread>-<turn>" log/symphony.log*

# 5) Focus on stuck/retry signals
rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*
```
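When the output of command 5 is too noisy to read, a per-signal tally can show which kind of failure dominates. The sketch below counts matching lines for each signal string in that command. `signal_counts` is a hypothetical helper, and it uses POSIX `grep` purely for portability; `rg -c` would be the faster choice on large logs.

```bash
# signal_counts FILE...
# Prints one "signal<TAB>count" row per stuck/retry signal, summed over all files.
signal_counts() {
  local signal
  for signal in "Issue stalled" "scheduling retry" "turn_timeout" \
                "turn_failed" "Codex session failed" \
                "Codex session ended with error"; do
    # cat merges the files so grep -c reports a single total, not one per file.
    printf '%s\t%s\n' "$signal" "$(cat -- "$@" | grep -cF -- "$signal")"
  done
}
```

Running `signal_counts log/symphony.log*` prints six rows. The count is of matching lines, so one line that mentions two signals is counted once under each.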
Investigation Flow
- Locate the ticket slice:
  - Search by `issue_identifier=<KEY>`.
  - If noise is high, add `issue_id=<UUID>`.
- Establish timeline:
  - Identify first `Codex session started ... session_id=...`.
  - Follow with `Codex session completed`, `ended with error`, or worker exit lines.
- Classify the problem:
  - Stall loop: `Issue stalled ... restarting with backoff`.
  - App-server startup: `Codex session failed ...`.
  - Turn execution failure: `turn_failed`, `turn_cancelled`, `turn_timeout`, or `ended with error`.
  - Worker crash: `Agent task exited ... reason=...`.
- Validate scope:
  - Check whether failures are isolated to one issue/session or repeating across multiple tickets.
- Capture evidence:
  - Save key log lines with timestamps, `issue_identifier`, `issue_id`, and `session_id`.
  - Record probable root cause and the exact failing stage.
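The "Classify the problem" step can be sketched as a filter that tags each relevant log line with a failure class. The message fragments come from the list above; the class labels (`stall_loop`, `app_server_startup`, `turn_failure`, `worker_crash`) and the name `classify_lines` are made up for this illustration.

```bash
# classify_lines
# Reads log lines on stdin and prints "class<TAB>line" for every line that
# matches one of the failure signals. Lines with no signal are dropped.
classify_lines() {
  awk '
    /Issue stalled/        { print "stall_loop\t" $0; next }
    /Codex session failed/ { print "app_server_startup\t" $0; next }
    /turn_failed|turn_cancelled|turn_timeout|ended with error/ {
      print "turn_failure\t" $0; next
    }
    /Agent task exited/    { print "worker_crash\t" $0; next }
  '
}
```

For instance, `rg --no-filename "issue_identifier=MT-625" log/symphony.log* | classify_lines | cut -f1 | sort | uniq -c` summarizes the classes seen for one ticket. The first matching rule wins, so a line that contains two signals is reported under the earlier class only.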
Reading Codex Session Logs
In Symphony, Codex session diagnostics are emitted into `log/symphony.log` and keyed by `session_id`. Read them as a lifecycle:

- `Codex session started ... session_id=...`
- Session stream/lifecycle events for the same `session_id`
- Terminal event:
  - `Codex session completed ...`, or
  - `Codex session ended with error ...`, or
  - `Issue stalled ... restarting with backoff`

For one specific session investigation, keep the trace narrow:

- Capture one `session_id` for the ticket.
- Build a timestamped slice for only that session:
  `rg -n "session_id=<thread>-<turn>" log/symphony.log*`
- Mark the exact failing stage:
  - Startup failure before stream events (`Codex session failed ...`).
  - Turn/runtime failure after stream events (`turn_*` / `ended with error`).
  - Stall recovery (`Issue stalled ... restarting with backoff`).
- Pair findings with `issue_identifier` and `issue_id` from nearby lines to confirm you are not mixing concurrent retries.

Always pair session findings with `issue_identifier`/`issue_id` to avoid mixing concurrent runs.
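As a rough illustration of the lifecycle reading described above, the sketch below reports the last terminal event seen for one session. It rests on several assumptions: every relevant line carries `session_id=<id>` followed by a space, a semicolon, or end of line; files are passed oldest first so that "last match wins" reflects time order; and the outcome labels and the name `session_outcome` are invented here.

```bash
# session_outcome SESSION_ID FILE...
# Prints one of: completed, startup_failure, turn_failure, stall_recovery,
# or no_terminal_event, based on the last terminal line for that session.
session_outcome() {
  local sid="$1"
  shift
  cat -- "$@" | awk -v needle="session_id=${sid}" '
    {
      # Skip lines for other sessions. The character after the ID must be a
      # delimiter, so t1-u1 does not also match t1-u10.
      pos = index($0, needle)
      if (pos == 0) next
      tail = substr($0, pos + length(needle), 1)
      if (tail != "" && tail != " " && tail != ";") next
    }
    /Codex session completed/ { outcome = "completed" }
    /Codex session failed/    { outcome = "startup_failure" }
    /turn_failed|turn_cancelled|turn_timeout|ended with error/ { outcome = "turn_failure" }
    /Issue stalled/           { outcome = "stall_recovery" }
    END { print (outcome == "" ? "no_terminal_event" : outcome) }
  '
}
```

A call looks like `session_outcome "<thread>-<turn>" log/symphony.log`. A result of `no_terminal_event` only means none of these strings appeared on that session's lines, so check the rotated logs before treating the session as still running.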
Notes
- Prefer `rg` over `grep` for speed on large logs.
- Check rotated logs (`log/symphony.log*`) before concluding data is missing.
- If required context fields are missing in new log statements, align with `elixir/docs/logging.md` conventions.