debug-traces

Debug Coval Traces

Use this skill when a customer expected traces in Coval and they are missing, wrong, sparse, duplicated, or not useful for production debugging.

Read First

Load:

`../references/debugging-matrix.md`
`../references/coval-tracing-reference.md`
`../references/agent-type-routing.md`
`../references/span-schema.md` when trace quality is the problem

Phase 1: Identify The Failure Boundary

Separate the problem into one of these boundaries:

agent never exported spans
export returned an HTTP error
export succeeded but targeted the wrong simulation/conversation/org
spans are stored but the UI route is not where the user looked
spans exist but are sparse or not useful
custom trace metrics cannot find spans or attributes

Collect:

endpoint used
response status/body from `/v1/traces`
whether the export used `X-Simulation-Id` or `X-Conversation-Id`
simulation output ID or conversation ID, not the run ID unless this is a monitoring conversation
Coval agent type and connection path
recent trace viewer URL, Trace Search URL, or copied trace dump
whether spans were sent as OTLP JSON or protobuf

Do not ask for raw API keys. Ask the user to run commands locally with env vars. Redact Coval agent `metadata` before sharing API responses; provider keys may be stored there.
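The redaction step can be done mechanically before anything is pasted into a ticket or chat. The following is a minimal sketch, assuming the API response has already been parsed into Python dicts and lists; the `redact` helper and the sample field names other than `metadata` are hypothetical and not part of any Coval tooling.

```python
import json

# Keys whose values are replaced wholesale. Only "metadata" comes from the
# guidance above; extend this set if other fields can hold secrets.
SENSITIVE_KEYS = frozenset({"metadata"})


def redact(obj, sensitive_keys=SENSITIVE_KEYS, placeholder="[REDACTED]"):
    """Return a copy of a JSON-like object with sensitive values replaced."""
    if isinstance(obj, dict):
        return {
            key: placeholder if key in sensitive_keys
            else redact(value, sensitive_keys, placeholder)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item, sensitive_keys, placeholder) for item in obj]
    return obj


if __name__ == "__main__":
    # Hypothetical response shape, for illustration only.
    response = {"id": "agent-123", "metadata": {"provider_key": "example"}}
    print(json.dumps(redact(response), indent=2))
```

Replacing the whole `metadata` value, rather than hunting for key-like strings inside it, avoids leaking a provider key stored under an unexpected name.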

Phase 2: Run Minimal Checks

Check Coval auth:

```bash
coval whoami
```

Run a standalone connectivity check when useful:

```bash
python skills/traces/setup-tracing/scripts/send-test-span.py \
  --api-key "$COVAL_API_KEY" \
  --simulation-id "$SIMULATION_OUTPUT_ID"
```

Interpretation:

200 means Coval accepted and stored the test span for that target.
404 for `coval-tracing-test` or another known-fake ID means auth/connectivity worked but the target ID is not real. Use `--allow-not-found` for intentional fake-ID checks.
404 for a real target usually means wrong ID, wrong org key, or using a run ID instead of a simulation output ID.
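The interpretation rules above can be written down as a small lookup so the same reading is applied every time. This is only a sketch: `interpret_status` is a hypothetical helper, it covers just the statuses listed above, and it does not inspect the response body.

```python
def interpret_status(status: int, target_is_known_fake: bool = False) -> str:
    """Map a /v1/traces test-span status to the reading given in this skill."""
    if status == 200:
        return "accepted: Coval stored the test span for that target"
    if status == 404 and target_is_known_fake:
        return "auth/connectivity OK: the fake target ID does not exist, as expected"
    if status == 404:
        return ("target not found: check for a wrong ID, a wrong org key, "
                "or a run ID used instead of a simulation output ID")
    # Anything else is outside the rules above; fall back to the matrix.
    return "not covered here: read the response body and consult the debugging matrix"


if __name__ == "__main__":
    for status, fake in [(200, False), (404, True), (404, False)]:
        print(status, fake, "->", interpret_status(status, fake))
```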

For conversation monitoring, use:

```bash
python skills/traces/setup-tracing/scripts/send-test-span.py \
  --api-key "$COVAL_API_KEY" \
  --conversation-id "$CONVERSATION_ID"
```

Phase 3: Apply The Troubleshooting Matrix

Use `../references/debugging-matrix.md` to map symptom to cause and fix.

High-probability causes:

no target header, or both target headers sent
`X-Simulation-Id` contains a run ID instead of a simulation output ID
`X-Conversation-Id` used for a non-monitoring run
PSTN phone path expected SIP headers that cannot arrive
WebSocket initialization payload did not include the simulation output ID
wrong organization's API key
payload over roughly 3-4 MB
retry resent already accepted spans
only auto-instrumented provider spans exist, so the trace lacks STT/TTS/tool context
tracing helper files or OpenTelemetry dependencies were added locally but not copied into the deployed image/bundle
WebSocket smoke tests sent less audio than the agent's response threshold, or the agent streamed a long canned response after Coval closed the socket
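A few of these causes can be caught on the client before the export leaves the agent. The sketch below checks the target headers and the payload size, and filters out spans that an earlier attempt already delivered. Both helper names are hypothetical, the 3 MB warning threshold is an assumption taken from the low end of the "roughly 3-4 MB" figure, and the `traceId`/`spanId` keys assume spans in OTLP JSON form.

```python
# Assumed warning threshold: the low end of the "roughly 3-4 MB" range.
WARN_BYTES = 3 * 1024 * 1024


def check_export_request(headers: dict, body: bytes) -> list:
    """Return a list of likely problems with an outgoing trace export."""
    # HTTP header names are case-insensitive; ignore headers with empty values.
    names = {name.lower() for name, value in headers.items() if value}
    has_simulation = "x-simulation-id" in names
    has_conversation = "x-conversation-id" in names

    problems = []
    if not has_simulation and not has_conversation:
        problems.append("no target header")
    if has_simulation and has_conversation:
        problems.append("both target headers sent")
    if len(body) >= WARN_BYTES:
        problems.append("payload is in or above the roughly 3-4 MB range")
    return problems


def drop_already_accepted(spans: list, accepted_ids: set) -> list:
    """Keep only spans whose (traceId, spanId) was not accepted earlier."""
    return [s for s in spans if (s["traceId"], s["spanId"]) not in accepted_ids]


if __name__ == "__main__":
    print(check_export_request({"X-Simulation-Id": "sim-out-1"}, b"{}"))
```

These checks cannot tell a run ID from a simulation output ID; that still has to be confirmed against the Coval run result.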

Phase 4: Verify In Coval UI

Use the right surface:

simulation result: run result page, OTel Traces card, trace viewer
cross-call investigation: Trace Search under Observability
run-level flow failures: Transition Hotspots tab
conversation monitoring: conversation result/trace search, not simulation-only routes

Trace Search filters that usually isolate issues:

span name: `llm`, `stt`, `tts`, `llm_tool_call`
status: `ERROR`
duration greater than expected thresholds
attribute exists: `metrics.ttfb`, `stt.providerName`, `function.name`, `llm.finish_reason`
agent/test set scope
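When the user can only share a copied trace dump, the same filters can be approximated offline. The sketch below assumes the dump is standard OTLP JSON (`resourceSpans`, then `scopeSpans`, then `spans`); a dump copied from the Coval UI may be shaped differently, and `find_spans` is a hypothetical helper, not a Coval API.

```python
def iter_spans(dump: dict):
    """Yield every span dict from an OTLP JSON export."""
    for resource_spans in dump.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            yield from scope_spans.get("spans", [])


def duration_ms(span: dict) -> float:
    # OTLP JSON encodes nanosecond timestamps as decimal strings.
    return (int(span["endTimeUnixNano"]) - int(span["startTimeUnixNano"])) / 1e6


def find_spans(dump, name=None, error_only=False,
               has_attribute=None, min_duration_ms=None):
    """Return spans matching all of the given filters."""
    matches = []
    for span in iter_spans(dump):
        keys = {attr["key"] for attr in span.get("attributes", [])}
        code = span.get("status", {}).get("code")
        if name is not None and span.get("name") != name:
            continue
        # Status code ERROR is 2 numerically; some encoders emit the enum name.
        if error_only and code not in (2, "STATUS_CODE_ERROR"):
            continue
        if has_attribute is not None and has_attribute not in keys:
            continue
        if min_duration_ms is not None and duration_ms(span) < min_duration_ms:
            continue
        matches.append(span)
    return matches
```

For example, `find_spans(dump, name="stt", has_attribute="metrics.ttfb")` lists the STT spans that carry a time-to-first-byte metric, which is a quick way to confirm whether a custom trace metric has anything to read.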

Phase 5: Fix Or Escalate

Fix implementation issues directly when the repo is available and the change is additive. Escalate with exact evidence when:

the target ID belongs to another org and the user must provide the correct key
the customer needs SIP provisioning or Coval agent config changes outside the repo
the Coval API returns repeated 500/503 after valid retries
the trace exists in storage but the UI cannot load it

End with a short incident-style summary: observed status, root cause, fix applied or required, and the exact command or Coval UI check that proves the current state.
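One way to keep the closing summary consistent is to fill a fixed four-field structure. This is an illustrative sketch only; the `IncidentSummary` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentSummary:
    observed_status: str  # what the export or UI actually showed
    root_cause: str       # the boundary and cause identified
    fix: str              # fix applied, or fix still required
    proof: str            # exact command or Coval UI check proving current state

    def render(self) -> str:
        return "\n".join([
            f"Observed status: {self.observed_status}",
            f"Root cause: {self.root_cause}",
            f"Fix: {self.fix}",
            f"Proof: {self.proof}",
        ])


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(IncidentSummary(
        observed_status="404 from /v1/traces for a real target",
        root_cause="X-Simulation-Id contained a run ID",
        fix="required: send the simulation output ID instead",
        proof="rerun send-test-span.py with the simulation output ID",
    ).render())
```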
