error-analysis


Error Analysis


Guide the user through reading LLM pipeline traces and building a catalog of how the system fails.

Overview


  1. Collect ~100 representative traces
  2. Read each trace, judge pass/fail, and note what went wrong
  3. Group similar failures into categories
  4. Label every trace against those categories
  5. Compute failure rates to prioritize what to fix

Core Process


Step 1: Collect Traces


Capture the full trace: input, all intermediate LLM calls, tool uses, retrieved documents, reasoning steps, and final output.
Target: ~100 traces. This is roughly where new traces stop revealing new kinds of failures. The number depends on system complexity.
From real user data (preferred):
  • Small volume: random sample
  • Large volume: sample across key dimensions (query type, user segment, feature area)
  • Use embedding clustering (K-means) to ensure diversity
From synthetic data (when real data is sparse):
  • Use the generate-synthetic-data skill
  • Run synthetic queries through the full pipeline and capture complete traces
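The embedding-clustering step above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available and that `embeddings` is an `(n_traces, dim)` array produced by whatever embedding model you use (not specified by this skill):

```python
# Sketch: pick ~k diverse traces by clustering their embeddings with K-means
# and keeping the trace nearest each centroid.
import numpy as np
from sklearn.cluster import KMeans

def sample_diverse(embeddings: np.ndarray, k: int = 100, seed: int = 0) -> list[int]:
    """Return indices of up to k traces, one nearest each K-means centroid."""
    k = min(k, len(embeddings))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Distance of each cluster member to its centroid; keep the closest.
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks
```

Picking the trace nearest each centroid (rather than a random cluster member) keeps the sample representative of each cluster while still covering the embedding space.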

Step 2: Read Traces and Take Notes


Present each trace to the user. For each one, ask: did the system produce a good result? Pass or Fail.
For failures, note what went wrong. Focus on the first thing that went wrong in the trace — errors cascade, so downstream symptoms disappear when the root cause is fixed. Don't chase every issue in a single trace.
Write observations, not explanations. "SQL missed the budget constraint" not "The model probably didn't understand the budget."
Template:
| Trace ID | Trace | What went wrong | Pass/Fail |
|----------|-------|-----------------|-----------|
| 001      | [full trace] | Missing filter: pet-friendly requirement ignored in SQL | Fail |
| 002      | [full trace] | Proposed unavailable times despite calendar conflicts | Fail |
| 003      | [full trace] | Used casual tone for luxury client; wrong property type | Fail |
| 004      | [full trace] | - | Pass |
Heuristics:
  • Do NOT start with a pre-defined failure list. Let categories emerge from what the user actually sees.
  • If the user is stuck articulating what feels wrong, prompt with common failure types: made-up facts, malformed output, ignored user requirements, wrong tone, tool misuse.

Step 3: Group Failures into Categories


After reviewing 30-50 traces, start grouping similar notes into categories. Don't wait until all 100 are done — grouping early helps sharpen what to look for in the remaining traces. The categories will evolve. The goal is names that are specific and actionable, not perfect.
  1. Read through all the failure notes
  2. Group similar ones together
  3. Split notes that look alike but have different root causes
  4. Give each category a clear name and one-sentence definition
When to split vs. group:
Split these (different root causes):
  • "Made up property features (solar panels)" vs. "Made up client activity (scheduled a tour never requested)" — one fabricates external facts, the other fabricates user intent.
Group these (same root cause):
  • "Missing bedroom count filter" + "Missing pet-friendly filter" + "Missing price range filter" → Missing Query Constraints
LLM-assisted clustering (use only after the user has reviewed 30-50 traces):
Here are failure annotations from reviewing LLM pipeline traces.
Group similar failures into 5-10 distinct categories.
For each category, provide:
- A clear name
- A one-sentence definition
- Which annotations belong to it

Annotations:
[paste annotations]
Always review LLM-suggested groupings with the user. LLMs cluster by surface similarity (e.g., grouping "app crashes on login" and "login is slow" because both mention login, even though the root causes differ).
Aim for 5-10 categories that are:
  • Distinct (each failure belongs to one category)
  • Clear enough that someone else could apply them consistently
  • Actionable (each points toward a specific fix)
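A small helper for assembling the clustering prompt shown above from your failure notes; the prompt text is taken verbatim from this step, and sending it to a model is left to whatever client you use:

```python
# Build the LLM-assisted clustering prompt from a list of failure notes.
CLUSTERING_PROMPT = """\
Here are failure annotations from reviewing LLM pipeline traces.
Group similar failures into 5-10 distinct categories.
For each category, provide:
- A clear name
- A one-sentence definition
- Which annotations belong to it

Annotations:
{annotations}"""

def build_clustering_prompt(notes: list[str]) -> str:
    # Number each annotation so the model can reference them back by index.
    body = "\n".join(f"{i + 1}. {note}" for i, note in enumerate(notes))
    return CLUSTERING_PROMPT.format(annotations=body)
```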

Step 4: Label Every Trace


Go back through all traces and apply binary labels (pass/fail) for each failure category. Each trace gets a column per category. Use whatever tool the user prefers — spreadsheet, annotation app (see build-review-interface), or a simple script.
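One possible layout for the labels, sketched with pandas (category names and trace IDs are illustrative). One boolean column per failure category is the shape the failure-rate computation in Step 5 expects:

```python
# Sketch: one row per trace, one binary column per failure category.
import pandas as pd

failure_columns = ["missing_query_constraints", "calendar_conflict", "wrong_tone"]

labeled_df = pd.DataFrame([
    {"trace_id": "001", "missing_query_constraints": 1, "calendar_conflict": 0, "wrong_tone": 0},
    {"trace_id": "002", "missing_query_constraints": 0, "calendar_conflict": 1, "wrong_tone": 0},
    {"trace_id": "003", "missing_query_constraints": 0, "calendar_conflict": 0, "wrong_tone": 1},
    {"trace_id": "004", "missing_query_constraints": 0, "calendar_conflict": 0, "wrong_tone": 0},
])
```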

Step 5: Compute Failure Rates


```python
failure_rates = labeled_df[failure_columns].sum() / len(labeled_df)
failure_rates.sort_values(ascending=False)
```
The most frequent failure category is where to focus first.

Step 6: Decide What to Do About Each Failure


Work through each category with the user in this order:
Can we just fix it? Many failures have obvious fixes that don't need an evaluator at all:
  • The prompt never mentioned the requirement. Example: the LLM never includes photo links in emails because the prompt never asked for them. Add the instruction.
  • A tool is missing or misconfigured. Example: the user wants to reschedule but there's no rescheduling tool exposed to the LLM. Add the tool.
  • An engineering bug in retrieval, parsing, or integration. Fix the code.
If a clear fix resolves the failure, do that first. Only consider an evaluator for failures that persist after fixing.
Is an evaluator worth the effort? Not every remaining failure needs one. Building and maintaining evaluators has real cost. Ask the user:
  • Does this failure happen frequently enough to matter?
  • What's the business impact when it does happen? A rare failure that causes revenue loss may outrank a frequent failure that's merely annoying.
  • Will this evaluator actually get used to iterate on the system, or is it checkbox work?
Reserve evaluators for failures the user will iterate on repeatedly. Start with the highest-frequency, highest-impact category.
For failures that warrant an evaluator: prefer code-based checks (regex, parsing, schema validation) for anything objective. Use write-judge-prompt only for failures that require judgment. Critical requirements (safety, compliance) may warrant an evaluator even after fixing the prompt, as a guardrail.
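A sketch of one code-based check, using the photo-link email example from this step: a deterministic regex check needs no judge at all. The pattern and function name are illustrative, not part of this skill:

```python
# Hypothetical guardrail check: every outgoing email draft must contain
# at least one photo link. Objective failures like this need code, not a judge.
import re

PHOTO_LINK = re.compile(r"https?://\S+\.(?:jpg|jpeg|png|webp)", re.IGNORECASE)

def has_photo_link(email_body: str) -> bool:
    return bool(PHOTO_LINK.search(email_body))
```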

Step 7: Iterate


Expect 2-3 rounds of reviewing and refining categories. After each round:
  • Merge categories that overlap
  • Split categories that are too broad
  • Clarify definitions where the user would hesitate
  • Re-label traces with the refined categories

Stopping Criteria


Stop reviewing when new traces aren't revealing new kinds of failures. Roughly: ~100 traces reviewed with no new failure types appearing in the last 20. The exact number depends on system complexity.
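The stopping rule can be made mechanical: track the categories seen per trace in review order and stop once a trailing window introduces nothing new (the window of 20 follows the heuristic above):

```python
def saturated(categories_per_trace: list[set[str]], window: int = 20) -> bool:
    """True when the last `window` reviewed traces introduced no new failure category.

    Each element is the set of failure categories observed in one trace
    (empty set for a passing trace), in review order.
    """
    if len(categories_per_trace) <= window:
        return False
    seen_before = set().union(*categories_per_trace[:-window])
    recent = set().union(*categories_per_trace[-window:])
    return recent <= seen_before
```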

Trace Sampling Strategies


When production volume is high, use a mix:
| Strategy | When to Use | Method |
|----------|-------------|--------|
| Random | Default starting point | Sample uniformly from recent traces |
| Outlier | Surface unusual behavior | Sort by response length, latency, tool call count; review extremes |
| Failure-driven | After guardrail violations or user complaints | Prioritize flagged traces |
| Uncertainty | When automated judges exist | Focus on traces where judges disagree or have low confidence |
| Stratified | Ensure coverage across user segments | Sample within each dimension |
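The outlier strategy can be sketched as sorting traces on one statistic and keeping both extremes; the field names here are illustrative:

```python
# Sketch of outlier sampling: the n_each lowest and n_each highest traces
# by some numeric trace statistic (latency, response length, tool call count, ...).
def outlier_sample(traces: list[dict], key: str, n_each: int = 5) -> list[dict]:
    ordered = sorted(traces, key=lambda t: t[key])
    return ordered[:n_each] + ordered[-n_each:]
```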

Anti-Patterns


  • Brainstorming failure categories before reading traces. Read first, categorize what you find.
  • Starting with pre-defined categories. A fixed list causes confirmation bias. Let categories emerge.
  • Skipping the user for initial review. The user must review the first 30-50 traces to ground categories in domain knowledge.
  • Using generic scores as categories. "Hallucination score," "helpfulness score," "coherence score" are not grounded in the application's actual failure modes.
  • Building evaluators before fixing obvious problems. Fix prompt gaps, missing tools, and engineering bugs first.
  • Treating this as a one-time activity. Re-run after every significant change: new features, prompt rewrites, model switches, production incidents.