debug-traces
Debug Coval Traces
Use this skill when a customer expected traces in Coval and they are missing, wrong, sparse, duplicated, or not useful for production debugging.
Read First
Load:
- `../references/debugging-matrix.md`
- `../references/coval-tracing-reference.md`
- `../references/agent-type-routing.md`
- when trace quality is the problem: `../references/span-schema.md`
Phase 1: Identify The Failure Boundary
Separate the problem into one of these boundaries:
- agent never exported spans
- export returned an HTTP error
- export succeeded but targeted the wrong simulation/conversation/org
- spans are stored but the UI route is not where the user looked
- spans exist but are sparse or not useful
- custom trace metrics cannot find spans or attributes
Collect:
- endpoint used
- response status/body from `/v1/traces`
- whether the export used `X-Simulation-Id` or `X-Conversation-Id`
- simulation output ID or conversation ID, not the run ID unless this is a monitoring conversation
- Coval agent type and connection path
- recent trace viewer URL, Trace Search URL, or copied trace dump
- whether spans were sent as OTLP JSON or protobuf
Do not ask for raw API keys. Ask the user to run commands locally with env vars. Redact Coval agent `metadata` before sharing API responses; provider keys may be stored there.
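The redaction step can be sketched as a small helper that masks every `metadata` value before a response is pasted anywhere. This is a minimal sketch: the response shape and the `redact_agent_metadata` name are hypothetical and only illustrate the idea, so adjust the key list to whatever the real response contains.

```python
import json

REDACTED = "[REDACTED]"


def redact_agent_metadata(payload, keys=("metadata",)):
    """Return a copy of payload with the value of each listed key masked."""
    def _walk(node):
        if isinstance(node, dict):
            # Mask the whole value, because provider keys may sit anywhere inside it.
            return {k: (REDACTED if k in keys else _walk(v)) for k, v in node.items()}
        if isinstance(node, list):
            return [_walk(item) for item in node]
        return node

    return _walk(payload)


# Hypothetical response shape, for illustration only.
sample = {"agent": {"id": "agent_123", "metadata": {"provider_api_key": "sk-live"}}}
print(json.dumps(redact_agent_metadata(sample), indent=2))
```

The helper builds new containers instead of mutating its input, so the unredacted response stays available locally for debugging.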
Phase 2: Run Minimal Checks
Check Coval auth:
```bash
coval whoami
```
Run a standalone connectivity check when useful:
```bash
python skills/traces/setup-tracing/scripts/send-test-span.py \
  --api-key "$COVAL_API_KEY" \
  --simulation-id "$SIMULATION_OUTPUT_ID"
```
Interpretation:
- 200 means Coval accepted and stored the test span for that target.
- 404 for `coval-tracing-test` or another known-fake ID means auth/connectivity worked but the target ID is not real. Use `--allow-not-found` for intentional fake-ID checks.
- 404 for a real target usually means wrong ID, wrong org key, or using a run ID instead of a simulation output ID.
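The interpretation rules above can be written down as a small lookup, which helps when triaging many test results at once. This is a sketch only: `interpret_test_span_status` is a hypothetical helper, not part of the Coval CLI or of `send-test-span.py`, and any status other than 200 or 404 is deliberately left unclassified because the rules above do not cover it.

```python
# IDs treated as intentionally fake; extend with any other known-fake IDs in use.
KNOWN_FAKE_IDS = {"coval-tracing-test"}


def interpret_test_span_status(status, target_id):
    """Map an HTTP status from a test-span export to the likely meaning."""
    if status == 200:
        return "accepted: Coval stored the test span for this target"
    if status == 404 and target_id in KNOWN_FAKE_IDS:
        return "auth and connectivity OK: the target ID is intentionally fake"
    if status == 404:
        return ("target not found: check for a wrong ID, a wrong org key, "
                "or a run ID used instead of a simulation output ID")
    return f"unclassified status {status}: consult the debugging matrix"


print(interpret_test_span_status(404, "coval-tracing-test"))
```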
For conversation monitoring, use:
```bash
python skills/traces/setup-tracing/scripts/send-test-span.py \
  --api-key "$COVAL_API_KEY" \
  --conversation-id "$CONVERSATION_ID"
```
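When a test export needs to be compared against what the agent actually sends, it can help to see what a minimal OTLP/JSON body looks like. The sketch below is not the contents of `send-test-span.py`; it builds a one-span payload following the standard OTLP JSON encoding, and the span name, service name, and attribute are placeholder assumptions. It performs no network call.

```python
import json
import secrets
import time


def build_test_span_payload(span_name="connectivity-check", duration_ms=50):
    """Build a single-span OTLP/JSON export body (no network I/O)."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),  # 16 bytes, hex-encoded (32 chars)
        "spanId": secrets.token_hex(8),    # 8 bytes, hex-encoded (16 chars)
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        # 64-bit integers are encoded as strings in OTLP/JSON.
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        "attributes": [
            {"key": "test.source", "value": {"stringValue": "manual-check"}}
        ],
    }
    resource = {
        "attributes": [
            {"key": "service.name", "value": {"stringValue": "trace-debug"}}
        ]
    }
    return {
        "resourceSpans": [
            {"resource": resource,
             "scopeSpans": [{"scope": {"name": "manual"}, "spans": [span]}]}
        ]
    }


body = json.dumps(build_test_span_payload()).encode("utf-8")
print(f"{len(body)} bytes of OTLP/JSON ready to export")
```

Printing the encoded size is a quick way to compare a known-small payload against the size of a real export that is being rejected.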
Phase 3: Apply The Troubleshooting Matrix
Use `../references/debugging-matrix.md` to map symptom to cause and fix.
High-probability causes:
- no target header, or both target headers sent
- `X-Simulation-Id` contains a run ID instead of a simulation output ID
- `X-Conversation-Id` used for a non-monitoring run
- PSTN phone path expected SIP headers that cannot arrive
- WebSocket initialization payload did not include the simulation output ID
- wrong organization's API key
- payload over roughly 3-4 MB
- retry resent already accepted spans
- only auto-instrumented provider spans exist, so the trace lacks STT/TTS/tool context
- tracing helper files or OpenTelemetry dependencies were added locally but not copied into the deployed image/bundle
- WebSocket smoke tests sent less audio than the agent's response threshold, or the agent streamed a long canned response after Coval closed the socket
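Two of the causes above, the target-header mistakes and the oversized payload, can be caught locally before any export is attempted. The following is a sketch under stated assumptions: `preflight_problems` is a hypothetical helper, and the 3 MB threshold is a conservative reading of "roughly 3-4 MB", not a documented limit.

```python
# Assumed conservative ceiling, taken from the "roughly 3-4 MB" figure above.
MAX_PAYLOAD_BYTES = 3 * 1024 * 1024


def preflight_problems(headers, body):
    """Return a list of likely export problems for the given headers and body bytes."""
    # HTTP header names are case-insensitive, so normalize before checking.
    normalized = {name.lower(): value for name, value in headers.items()}
    has_simulation = bool(normalized.get("x-simulation-id"))
    has_conversation = bool(normalized.get("x-conversation-id"))

    problems = []
    if not has_simulation and not has_conversation:
        problems.append("no target header")
    elif has_simulation and has_conversation:
        problems.append("both target headers sent")
    if len(body) > MAX_PAYLOAD_BYTES:
        problems.append(f"payload is {len(body)} bytes, over the assumed limit")
    return problems


print(preflight_problems({"X-Simulation-Id": "sim-1", "X-Conversation-Id": "c-1"}, b"{}"))
```

The check cannot tell a run ID from a simulation output ID; that still has to be confirmed against the Coval UI or API.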
Phase 4: Verify In Coval UI
Use the right surface:
- simulation result: run result page, OTel Traces card, trace viewer
- cross-call investigation: Trace Search under Observability
- run-level flow failures: Transition Hotspots tab
- conversation monitoring: conversation result/trace search, not simulation-only routes
Trace Search filters that usually isolate issues:
- span name: `llm`, `stt`, `tts`, `llm_tool_call`
- status: `ERROR`
- duration greater than expected thresholds
- attribute exists: `metrics.ttfb`, `stt.providerName`, `function.name`, `llm.finish_reason`
- agent/test set scope
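When only a copied trace dump is available, the same filters can be approximated offline. This sketch assumes each span has already been flattened into a dict with `name`, `status`, `duration_ms`, and `attributes` keys; that shape and the `span_matches` helper are hypothetical and do not describe Coval's dump format.

```python
def span_matches(span, names=None, status=None, min_duration_ms=None, required_attrs=()):
    """Apply name, status, duration, and attribute-exists filters to one span dict."""
    if names is not None and span.get("name") not in names:
        return False
    if status is not None and span.get("status") != status:
        return False
    if min_duration_ms is not None and span.get("duration_ms", 0) <= min_duration_ms:
        return False
    attributes = span.get("attributes", {})
    return all(key in attributes for key in required_attrs)


# Hypothetical flattened spans, for illustration only.
spans = [
    {"name": "llm", "status": "OK", "duration_ms": 2400,
     "attributes": {"metrics.ttfb": 1.9, "llm.finish_reason": "stop"}},
    {"name": "stt", "status": "ERROR", "duration_ms": 310,
     "attributes": {"stt.providerName": "example-provider"}},
    {"name": "tts", "status": "OK", "duration_ms": 120, "attributes": {}},
]

slow_llm = [s for s in spans if span_matches(s, names={"llm"}, min_duration_ms=2000)]
errors = [s for s in spans if span_matches(s, status="ERROR")]
missing_ttfb = [s for s in spans if not span_matches(s, required_attrs=("metrics.ttfb",))]
print(len(slow_llm), len(errors), len(missing_ttfb))
```

Inverting the attribute filter, as in `missing_ttfb`, is often the faster way to spot sparse spans.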
Phase 5: Fix Or Escalate
Fix implementation issues directly when the repo is available and the change is additive. Escalate with exact evidence when:
- the target ID belongs to another org and the user must provide the correct key
- the customer needs SIP provisioning or Coval agent config changes outside the repo
- the Coval API returns repeated 500/503 after valid retries
- the trace exists in storage but the UI cannot load it
End with a short incident-style summary: observed status, root cause, fix applied or required, and the exact command or Coval UI check that proves the current state.
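One way to keep the closing summary consistent across incidents is a fixed four-field template. The sketch below assumes nothing about Coval itself; the `TraceIncidentSummary` class and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceIncidentSummary:
    observed_status: str  # what the export or UI actually showed
    root_cause: str       # the failure boundary that was identified
    fix: str              # fix applied, or fix still required
    proof: str            # exact command or Coval UI check proving the current state

    def render(self):
        """Return the summary as four labeled lines."""
        return "\n".join([
            f"Observed status: {self.observed_status}",
            f"Root cause: {self.root_cause}",
            f"Fix: {self.fix}",
            f"Proof: {self.proof}",
        ])


# Example values are illustrative only.
summary = TraceIncidentSummary(
    observed_status="404 from /v1/traces",
    root_cause="X-Simulation-Id carried a run ID",
    fix="required: send the simulation output ID instead",
    proof="rerun send-test-span.py with --simulation-id and confirm 200",
)
print(summary.render())
```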