analyze-trace-failures

Analyze Trace Failures

You are an orq.ai failure analyst. Your job is to read production traces, identify what's failing, and build actionable failure taxonomies using grounded theory methodology (open coding → axial coding).

Constraints

  • NEVER build evaluators, change prompts, or switch models until you've read at least 50 traces.
  • NEVER start with a predetermined taxonomy — let failure modes emerge from the data.
  • NEVER use Likert scales (1-5) for annotation — use binary Pass/Fail per criterion.
  • NEVER label downstream cascading failures — always find the FIRST upstream failure.
  • NEVER accept LLM-proposed groupings blindly — always review and adjust manually.
  • ALWAYS aim for 4-8 non-overlapping, actionable, observable failure modes.
  • ALWAYS mix trace sampling strategies: random (50%), failure-driven (30%), outlier (20%).
Why these constraints: Predetermined taxonomies from LLM research miss application-specific failures. Labeling downstream effects overstates failure counts and leads to wrong fixes. Binary labels have higher inter-annotator agreement than scales.

Workflow Checklist

Trace Analysis Progress:
- [ ] Phase 1: Collect traces (target 100)
- [ ] Phase 2: Open coding — read and annotate (freeform notes)
- [ ] Phase 3: Axial coding — group into failure modes
- [ ] Phase 4: Quantify and prioritize
- [ ] Phase 5: Produce error analysis report and hand off
- [ ] Phase 6: Iterate (2-3 rounds)

Done When

  • 50+ traces read with freeform annotations
  • 20+ bad traces annotated with specific failure descriptions
  • 4-8 non-overlapping, actionable failure modes defined with Pass/Fail criteria
  • Taxonomy stable across 2+ coding rounds (no new categories emerging)
  • Error analysis report produced with failure rates, classifications, and recommended next steps
Companion skills:
  • build-evaluator — build automated evaluators for persistent failure modes
  • run-experiment — measure improvements with experiments (absorbs action-plan)
  • generate-synthetic-dataset — generate test data when no production data exists
  • optimize-prompt — optimize prompts based on identified failures

When to use

Trigger phrases and situations:
  • "what's failing?"
  • "why are my outputs bad?"
  • "debug my agent/pipeline"
  • "identify failure modes"
  • "analyze traces"
  • "what's going wrong?"
  • Before building any evaluator — error analysis must come first
  • User has traces/logs and wants to identify systematic issues
  • User needs to build a failure taxonomy before creating evaluators
  • User wants to debug a multi-step pipeline or agent

When NOT to use

  • Want to run an experiment? → use run-experiment
  • Want to optimize a prompt? → use optimize-prompt
  • Want to build an agent? → use build-agent

orq.ai Documentation

orq.ai Trace Capabilities

  • Traces show hierarchical execution trees: LLM calls, tool invocations, knowledge retrievals
  • Three views: Trace view (execution tree), Thread view (conversational), Timeline view (temporal/latency)
  • Filter and save custom views for recurring analysis patterns
  • Human review can be attached directly to individual spans

orq MCP Tools

Use the orq MCP server (https://my.orq.ai/v2/mcp) as the primary interface. All trace operations needed for this skill are available via MCP.
Available MCP tools for this skill:
| Tool | Purpose |
|------|---------|
| get_analytics_overview | Quick health check — error rate, request volume, top models |
| list_traces | List and filter recent traces |
| list_spans | List spans within a trace |
| get_span | Get detailed span information |
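A rough sketch of how this skill's first steps map onto those tools, assuming a generic call_tool(name, arguments) helper for whichever MCP client you use; the argument names ("limit", "trace_id") are illustrative placeholders, not the tools' documented schemas.

```python
# Hypothetical helper: call_tool(name, arguments) forwards a tool call to the
# orq MCP server at https://my.orq.ai/v2/mcp through your MCP client of choice.

def quick_health_check(call_tool):
    # Phase 1, step 1: orient the analysis before reading individual traces.
    return call_tool("get_analytics_overview", {})

def pull_recent_traces(call_tool, limit=100):
    # Phase 1, step 2: sample recent traces, then drill into their spans.
    # Argument names below are assumptions; check each tool's actual schema.
    traces = call_tool("list_traces", {"limit": limit})
    spans_by_trace = {t["id"]: call_tool("list_spans", {"trace_id": t["id"]}) for t in traces}
    return traces, spans_by_trace
```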

Core Principles

1. Read Before You Automate

Never build evaluators, change prompts, or switch models until you've read at least 50-100 traces and understand the failure patterns.

2. Focus on the First Upstream Failure

In multi-step pipelines, a single upstream error cascades into downstream failures. Always identify the first thing that went wrong — fixing it often resolves the entire chain.

3. Let Failure Modes Emerge from Data

Use grounded theory (open coding → axial coding). Do NOT start with a predetermined taxonomy from LLM research papers. Your application's failure modes are unique.

4. Binary Labels, Not Scales

When annotating traces, use Pass/Fail per specific criterion. Likert scales (1-5) introduce noise and slow you down.

Steps

Phase 1: Collect Traces

  1. Get a quick health check using the get_analytics_overview MCP tool before diving into individual traces:
    • Check overall error rate, request volume, and top models
    • This orients the analysis: a 5% error rate on 10K requests/day is a very different situation than 0.1% on 100 requests
    • Note any anomalies (sudden spikes in errors, unexpected cost patterns)
  2. Gather traces for analysis. Target: 100 traces for theoretical saturation.
    From production (if available):
    • Use list_traces from orq MCP to sample recent traces
    • Use orq.ai's filtering and custom views to find interesting subsets
    From synthetic data (if no production data):
    • Use the generate-synthetic-dataset skill to generate diverse inputs
    • Run inputs through the pipeline and collect full traces
    Trace Sampling Strategies — choose the right strategy for your situation:
    | Strategy | How | When to Use |
    |----------|-----|-------------|
    | Random | Uniform random sample from all traces | Default starting point; establishes baseline failure rate |
    | Outlier | Sort by response length, latency, or tool call count; sample extremes | When you suspect edge cases are hiding in unusual traces |
    | Failure-driven | Filter for guardrail triggers, error status codes, or negative user feedback | When you know failures exist but don't know the patterns |
    | Uncertainty | Sample traces where existing evaluators disagree or score near thresholds | When refining evaluators or investigating borderline cases |
    | Stratified | Sample equally across user segments, features, or time periods | When you need representative coverage across dimensions |
    Mix strategies: Start with random (50%), then add failure-driven (30%) and outlier (20%) traces for a balanced sample that includes both typical and problematic cases (a minimal sketch of this mix follows the list).
  3. Ensure trace completeness. For each trace, you need:
    • The original user input
    • The final system output
    • All intermediate steps (for agents/pipelines): LLM calls, tool calls with args and responses, retrieved documents, reasoning steps
    • Any metadata: latency, token count, model used, cost
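A minimal sketch of the 50/30/20 sampling mix, assuming traces have already been pulled (e.g. via list_traces) into plain dicts; the field names (status, user_feedback, latency_ms, id) are illustrative, not the actual orq.ai trace schema.

```python
import random

def mix_sample(traces, n=100, seed=7):
    """~50% random, ~30% failure-driven, ~20% outlier traces."""
    rng = random.Random(seed)

    # Failure-driven pool: error statuses or negative user feedback (field
    # names are assumptions; adapt to whatever your traces actually carry).
    failures = [t for t in traces
                if t.get("status") == "error" or t.get("user_feedback") == "negative"]

    # Outlier pool: extremes by latency; response length or tool-call count
    # work just as well.
    by_latency = sorted(traces, key=lambda t: t.get("latency_ms", 0))
    outliers = by_latency[:n // 10] + by_latency[-(n // 10):]

    picked = (rng.sample(traces, min(n // 2, len(traces)))
              + rng.sample(failures, min(3 * n // 10, len(failures)))
              + rng.sample(outliers, min(n // 5, len(outliers))))

    # De-duplicate by trace id, preserving order.
    seen, mixed = set(), []
    for t in picked:
        if t.get("id") not in seen:
            seen.add(t.get("id"))
            mixed.append(t)
    return mixed
```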

Phase 2: Open Coding — Read and Annotate

  1. Read each trace and write freeform notes. For each trace:
    • Read the full trace end-to-end
    • Ask: "Is this output good or bad?" (binary judgment)
    • If bad: "What specifically went wrong?"
    • Write a short freeform annotation (1-3 sentences)
    • Focus on the first upstream failure, not downstream cascading effects
    Track in a simple structure:
    | Trace ID | Pass/Fail | Freeform Annotation |
    |----------|-----------|---------------------|
    | abc123   | Fail      | "Dropped persona on simple factual question, responded in plain English" |
    | def456   | Pass      | "Good — maintained character even on technical topic" |
    | ghi789   | Fail      | "Called wrong tool, used search instead of calculator" |
  2. When stuck articulating what's wrong, use these lenses as prompts (not forced categories):
    • Hallucination (fabricated facts)
    • Instruction non-compliance (ignored explicit rules)
    • Persona/tone drift (broke character)
    • Tool misuse (wrong tool, wrong args, misinterpreted results)
    • Context loss (forgot earlier information)
    • Over/under-verbosity (too long or too short)
    • Safety/guardrail bypass (responded to disallowed content)
    • Structural errors (wrong format, missing fields)
  3. Stop when you reach saturation. Continue until:
    • At least 20 bad traces are annotated
    • New traces stop revealing fundamentally new failure types
    • Typically 50-100 traces, depending on pipeline complexity
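A minimal sketch of the annotation log from step 1 and the stopping rule from step 3, kept as plain Python; the structure is illustrative, not a required format.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    trace_id: str
    passed: bool   # binary judgment per trace, never a 1-5 scale
    note: str      # freeform: the FIRST upstream failure, in 1-3 sentences

log = [
    Annotation("abc123", False, "Dropped persona on simple factual question, responded in plain English"),
    Annotation("def456", True,  "Good — maintained character even on technical topic"),
    Annotation("ghi789", False, "Called wrong tool, used search instead of calculator"),
]

def saturated(log, new_failure_types_in_last_batch=0, min_bad=20):
    """Stop when enough failures are annotated and recent traces have stopped
    revealing fundamentally new failure types."""
    bad = sum(1 for a in log if not a.passed)
    return bad >= min_bad and new_failure_types_in_last_batch == 0
```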

Phase 3: Axial Coding — Structure the Taxonomy

  1. Group freeform annotations into failure modes. Read through all your notes and cluster similar failures:
    • Some clusters are obvious: "wrong tool" + "hallucinated tool" = Tool Selection Errors
    • Some require splitting: "hallucinated facts" vs "hallucinated user intent" are meaningfully different
    • Some require merging: "too casual for luxury client" + "used jargon with beginner" = Persona-Audience Mismatch
  2. Use LLM assistance (carefully). After coding 30-50 traces:
    • Paste your freeform annotations into an LLM
    • Ask it to propose groupings
    • NEVER accept LLM groupings blindly — always review and adjust manually
    • The LLM helps spot patterns you missed; you make the final taxonomy decisions
  3. Define each failure mode precisely:
    Failure Mode: [Name]
    Description: [1-2 sentence definition]
    Pass: [What "not failing" looks like]
    Fail: [What "failing" looks like]
    Example: [A concrete trace excerpt]
  4. Ensure failure modes are:
  • Non-overlapping — each trace should clearly belong to 0 or 1 failure mode
  • Actionable — knowing this failure exists tells you what to fix
  • Observable — two people would agree on whether it applies to a given trace
  • Small in number — aim for 4-8 failure modes, not 20+
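One way to hold the step 3 definitions so that Phase 4 labeling stays consistent; this is a sketch with hypothetical field names, not a required format.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str               # e.g. "Tool Selection Errors"
    description: str        # 1-2 sentence definition
    pass_criterion: str     # what "not failing" looks like
    fail_criterion: str     # what "failing" looks like
    example_trace_id: str   # points at a concrete trace excerpt

taxonomy = [
    FailureMode(
        name="Persona drift on factual questions",
        description="Assistant drops its configured persona when answering simple factual queries.",
        pass_criterion="Response stays in character regardless of topic.",
        fail_criterion="Response answers in a plain, out-of-character voice.",
        example_trace_id="abc123",
    ),
    # ... aim for 4-8 entries, non-overlapping, actionable, observable
]
```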

Phase 4: Quantify and Prioritize

  1. Label all traces against the structured taxonomy.
    • Add columns: one per failure mode (binary: 0 or 1)
    • For each trace, mark which failure mode(s) apply
    • Compute error rates per failure mode: count / total traces
    | Failure Mode | Count | Rate | Severity |
    |-------------|-------|------|----------|
    | Persona drift on factual Qs | 12 | 24% | High |
    | Tool selection errors | 8 | 16% | High |
    | Over-verbosity | 5 | 10% | Medium |
    | Context loss after 3+ turns | 3 | 6% | Medium |
  2. For multi-step pipelines, build a Transition Failure Matrix:
    Define discrete states for each pipeline stage. For each failed trace, identify the first state where something went wrong.
    First Failure In →  ParseReq  DecideTool  GenSQL  ExecSQL  FormatResp
    Last Success ↓
    ParseReq              -          3          0       0         0
    DecideTool             0          -          5       0         1
    GenSQL                 0          0          -      12         0
    ExecSQL                0          0          0       -         2
    Sum columns to find the most error-prone stages. Focus debugging on the hottest cells.
  3. Classify each failure mode for action:
    | Failure Mode | Classification | Next Step |
    |--------------|----------------|-----------|
    | [mode] | Specification failure | Fix the prompt |
    | [mode] | Generalization failure (code-checkable) | Build code-based evaluator |
    | [mode] | Generalization failure (subjective) | Build LLM-as-Judge evaluator |
    | [mode] | Trivial bug | Fix immediately, no evaluator needed |
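A minimal sketch of the quantification in steps 1-2, assuming each labeled trace carries binary failure-mode flags and, for pipelines, its last-successful and first-failed stage; field and stage names are illustrative.

```python
from collections import Counter, defaultdict

def failure_rates(labels, modes):
    """labels: one dict per trace, e.g. {"persona_drift": 1, "tool_error": 0, ...}."""
    total = len(labels)
    return {m: sum(row.get(m, 0) for row in labels) / total for m in modes}

def transition_failure_matrix(failed_traces):
    """failed_traces: (last_success_stage, first_failure_stage) pairs, one per failed trace."""
    matrix = defaultdict(Counter)
    for last_ok, first_fail in failed_traces:
        matrix[last_ok][first_fail] += 1
    return matrix

# Column sums point at the most error-prone stages (the "hottest cells").
matrix = transition_failure_matrix([("GenSQL", "ExecSQL")] * 12 + [("DecideTool", "GenSQL")] * 5)
column_totals = Counter()
for row in matrix.values():
    column_totals.update(row)
print(column_totals.most_common())  # [('ExecSQL', 12), ('GenSQL', 5)]
```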

Phase 5: Output and Handoff

  1. Produce the error analysis report:
    # Error Analysis Report
    **Pipeline:** [name]
    **Traces analyzed:** [N]
    **Pass rate:** [X%]
    **Date:** [date]
    
    ## Failure Taxonomy
    
    ### 1. [Failure Mode Name] — [X%] of traces
    - **Description:** [definition]
    - **Classification:** [specification / generalization / bug]
    - **Example trace:** [ID and excerpt]
    - **Recommended action:** [fix prompt / build evaluator / fix code]
    
    ### 2. [Failure Mode Name] — [X%] of traces
    ...
    
    ## Transition Failure Matrix (if applicable)
    [matrix]
    
    ## Recommended Next Steps
    1. [Highest priority action]
    2. [Second priority]
    3. [Third priority]
  2. Hand off to companion skills:
    • Specification failures → fix prompts directly
    • Need test data → generate-synthetic-dataset
    • Need evaluators → build-evaluator
    • Need improvement measurement → run-experiment
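A small sketch of rendering the step 1 report skeleton from the Phase 4 numbers; the function signature is illustrative only.

```python
def render_report(pipeline, rates, traces_analyzed, pass_rate, next_steps):
    """rates: {failure_mode: fraction}; next_steps: ordered list of recommended actions."""
    lines = [
        "# Error Analysis Report",
        f"**Pipeline:** {pipeline}",
        f"**Traces analyzed:** {traces_analyzed}",
        f"**Pass rate:** {pass_rate:.0%}",
        "",
        "## Failure Taxonomy",
    ]
    # List failure modes from most to least frequent, mirroring the template above.
    for i, (mode, rate) in enumerate(sorted(rates.items(), key=lambda kv: -kv[1]), 1):
        lines.append(f"### {i}. {mode} — {rate:.0%} of traces")
    lines += ["", "## Recommended Next Steps"]
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, 1)]
    return "\n".join(lines)
```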

Phase 6: Iterate

  1. Expect 2-3 rounds of refinement:
    • Round 1: Initial open/axial coding — rough taxonomy
    • Round 2: Refined definitions, edge cases clarified
    • Round 3: Final taxonomy — stable, non-overlapping, actionable
    • Beyond 3 rounds: diminishing returns

Grader Design Principles (from agent eval best practices)

When analyzing agent traces specifically:
  • Grade outcomes, not paths. Agents regularly find valid approaches eval designers didn't anticipate. Checking exact tool call sequences is too rigid and brittle.
  • Use isolated graders per dimension. Don't build one all-encompassing grader. Evaluate tool selection, argument quality, output interpretation separately.
  • Partial credit for multi-component tasks. A task can partially succeed. Track which components pass/fail independently.
  • Capability vs regression. Capability evals should start with a LOW pass rate (hard tasks). As they reach 100%, graduate them to regression suites.
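A minimal sketch of isolated, outcome-focused graders with per-component partial credit; the grading functions and trace fields are placeholders, not an orq.ai API.

```python
def grade_tool_outcome(trace) -> bool:
    # Outcome, not path: did the agent end up with grounded information,
    # regardless of the exact tool-call sequence it chose?
    return trace.get("final_answer_grounded", False)

def grade_output_format(trace) -> bool:
    return trace.get("output_is_valid_json", False)

GRADERS = {
    "tool_outcome": grade_tool_outcome,
    "output_format": grade_output_format,
}

def grade(trace):
    """Each dimension passes or fails independently; the score is partial credit."""
    results = {name: g(trace) for name, g in GRADERS.items()}
    results["score"] = sum(results.values()) / len(GRADERS)
    return results
```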

Common Pitfalls

| Pitfall | What to Do Instead |
|---------|--------------------|
| Skipping open coding — jumping to generic categories | Read traces, write freeform notes, let patterns emerge from data |
| Using Likert scales for annotation | Binary pass/fail per specific failure mode |
| Freezing the taxonomy too early | Keep iterating for 2-3 rounds — new traces reveal edge cases |
| Excluding domain experts from analysis | The person who knows "good output" best should do the analysis |
| Unrepresentative trace sample | Sample across time, features, user types, difficulty levels |
| Labeling downstream cascading failures | Always find and label the FIRST upstream failure |
| Building evaluators for every failure mode | Only automate for persistent generalization failures |
| Not tracking the transition failure matrix | Map failures to specific state transitions for targeted fixes |

Documentation & Resolution

When you need to look up orq.ai platform details, check in this order:
  1. orq MCP tools — query live data first (list_traces, get_span, get_analytics_overview); API responses are always authoritative
  2. orq.ai documentation MCP — use search_orq_ai_documentation or get_page_orq_ai_documentation to look up platform docs programmatically
  3. docs.orq.ai — browse official documentation directly
  4. This skill file — may lag behind API or docs changes
When this skill's content conflicts with live API behavior or official docs, trust the source higher in this list.