failure-taxonomy

Failure Taxonomy Builder

Transform raw, freeform trace annotations from open coding sessions into a structured taxonomy of binary failure modes, following the grounded theory methodology from the Analyze-Measure-Improve evaluation lifecycle.

When This Skill Applies

The user has already completed open coding — they've read through LLM pipeline traces and written short, freeform notes describing what went wrong (the "point of first failure"). Now they need to move from that chaotic pile of observations into an organized, actionable taxonomy. This is the axial coding step.
Typical inputs look like a JSON array, CSV, or spreadsheet of objects with fields like:
  • trace_id — identifier for the trace
  • annotation or note — the freeform open-coded observation
  • Optionally: pass_fail, trace_summary, query, or the full trace itself
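
For concreteness, a minimal input in the JSON-array shape described above might look like this (the field names match the bullets; the trace IDs and annotation text are invented for illustration):

```python
import json

# Hypothetical example of the expected input shape; trace_id,
# annotation, and pass_fail follow the field list above.
raw = """
[
  {"trace_id": "t-001", "annotation": "forgot the WHERE clause the user asked for", "pass_fail": "fail"},
  {"trace_id": "t-002", "annotation": "answered in English although the persona is German", "pass_fail": "fail"},
  {"trace_id": "t-003", "annotation": "forgot the WHERE clause the user asked for", "pass_fail": "fail"}
]
"""
annotations = json.loads(raw)
print(len(annotations))  # number of open-coded observations to cluster
```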

Core Workflow

Step 1: Ingest and Understand the Annotations

  1. Ask the user to provide their open-coded annotations (JSON, CSV, or pasted text).
  2. Read through ALL annotations before doing anything else. This mirrors how a human analyst would spread out all their sticky notes before grouping.
  3. Count total annotations and note how many are unique vs. near-duplicates.
  4. Identify the application domain from context clues in the annotations.
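
The counting in step 3 can be sketched in a few lines, assuming each record carries an `annotation` field (adjust the field name to the user's data; "near-duplicate" is approximated very crudely here as case- and whitespace-insensitive equality — real near-duplicate detection would need fuzzy matching):

```python
from collections import Counter

def summarize_annotations(records):
    """Count total vs. unique annotations and surface near-duplicates."""
    # Normalize so trivially different copies of the same note collapse.
    normalized = [r["annotation"].strip().lower() for r in records]
    counts = Counter(normalized)
    duplicated = {text: n for text, n in counts.items() if n > 1}
    return {"total": len(records), "unique": len(counts), "duplicated": duplicated}

# Illustrative records; annotation text is invented.
records = [
    {"trace_id": "t-001", "annotation": "Missing WHERE clause"},
    {"trace_id": "t-002", "annotation": "missing where clause "},
    {"trace_id": "t-003", "annotation": "Wrong persona tone"},
]
summary = summarize_annotations(records)
print(summary["total"], summary["unique"])  # 3 total, 2 unique
```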

Step 2: Draft Failure Mode Clusters (Axial Coding)

Group similar annotations into a small set of coherent, non-overlapping failure categories.
Key principles — these matter a lot:
  • Small set: Aim for 3–7 failure modes. Fewer is better. If you have more than 7, you're probably splitting too finely — merge related categories.
  • Binary and specific: Each failure mode is a yes/no question: "Did this failure occur?" Avoid vague categories like "bad output" or "hallucination" without qualification.
  • Non-overlapping: A single annotation should map cleanly to one failure mode. If annotations regularly fit two categories, the categories need rework.
  • Application-specific: Use the vocabulary of the user's domain, not generic LLM research terms. "Missing SQL constraint" beats "incomplete tool use". "Persona mismatch" beats "tone error".
  • Actionable: Each failure mode should suggest a clear engineering fix. If you can't imagine what a developer would change to address it, the category is too abstract.
  • No Likert scales: Everything is binary. A failure mode is present (1) or absent (0). Numeric severity scores introduce noise and reduce inter-annotator agreement.
Process:
  1. Read through all annotations and mentally group similar observations.
  2. For each emerging group, write a short title (2–5 words) and a one-line definition explaining what qualifies as this failure.
  3. List 2–3 representative example annotations under each group.
  4. Check for overlaps — if two categories share examples, merge or split until clean.
  5. Check for orphans — annotations that don't fit anywhere may signal a missing category or may be genuine one-offs to flag separately.
Present the draft taxonomy to the user as a table:
| # | Failure Mode | Definition | Example Annotations |
|---|-------------|------------|---------------------|
| 1 | [Title]     | [One-line] | [2-3 examples]      |
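
The overlap and orphan checks in steps 4–5 can be mechanized once each annotation index has been assigned to draft categories. A minimal sketch, assuming a plain dict of category title → annotation indices (the titles and indices below are illustrative):

```python
def check_taxonomy(assignments, n_annotations):
    """Flag overlaps (an index in 2+ categories) and orphans (in none)."""
    seen = {}
    for category, indices in assignments.items():
        for i in indices:
            seen.setdefault(i, []).append(category)
    overlaps = {i: cats for i, cats in seen.items() if len(cats) > 1}
    orphans = [i for i in range(n_annotations) if i not in seen]
    return overlaps, orphans

draft = {
    "Missing SQL constraint": [0, 2],
    "Persona mismatch": [1, 2],  # index 2 appears twice -> categories need rework
}
overlaps, orphans = check_taxonomy(draft, n_annotations=4)
print(overlaps)  # annotation 2 maps to both categories
print(orphans)   # annotation 3 fits nowhere -> missing category or one-off
```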

Step 3: Refine Through Discussion

After presenting the draft, prompt the user to consider:
  • Are any categories too broad? (Should they be split?)
  • Are any categories too narrow? (Should they be merged?)
  • Do the titles make sense to a domain expert who hasn't read the raw data?
  • Are there annotations that don't fit any category?
Iterate until the user confirms the taxonomy. Typical refinement takes 1–2 rounds.

Step 4: Re-label Traces Against the Taxonomy

Once the taxonomy is confirmed, systematically apply it back to every trace:
  1. For each trace/annotation, assign a 1 or 0 for each failure mode.
  2. A trace can have multiple failure modes present (they're independent binary columns).
  3. If an annotation doesn't match any failure mode, flag it as "Uncategorized" for review.
  4. Present the re-labeled data in a structured format (JSON or CSV).
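
A sketch of the re-labeling pass, assuming the confirmed taxonomy is a list of mode titles and each record has an `annotation` field. The `label` helper is a hypothetical stand-in — in practice the 0/1 judgment comes from manual review or a per-mode LLM judge, not from keyword matching:

```python
def relabel(records, failure_modes):
    """Add one independent binary column per failure mode to each record."""
    def label(annotation, mode):
        # Placeholder judgment: real labeling is human or LLM-judge review.
        return 1 if mode.lower() in annotation.lower() else 0

    out = []
    for r in records:
        row = dict(r)
        hits = 0
        for mode in failure_modes:
            row[mode] = label(r["annotation"], mode)
            hits += row[mode]
        # No mode matched -> flag for review rather than forcing a fit.
        row["uncategorized"] = 1 if hits == 0 else 0
        out.append(row)
    return out

modes = ["missing constraint", "persona mismatch"]
records = [
    {"trace_id": "t-001", "annotation": "query has a missing constraint"},
    {"trace_id": "t-002", "annotation": "something else entirely"},
]
labeled = relabel(records, modes)
print(labeled[1]["uncategorized"])  # 1: second record matched no mode
```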

Step 5: Quantify and Prioritize

Compute error rates for each failure mode:
  • Count: How many traces exhibit this failure?
  • Rate: Count / total traces (as a percentage).
  • Rank: Order failure modes by prevalence.
Present a summary table and recommend which failure modes to address first based on frequency. Note: frequency alone doesn't determine priority — the user may weight certain failures higher based on business impact. Ask them.
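
The count/rate/rank computation is straightforward over re-labeled records with one binary column per mode (the mode names "A" and "B" below are placeholders):

```python
def failure_rates(labeled, failure_modes):
    """Count, rate, and rank failure modes by prevalence."""
    total = len(labeled)
    stats = []
    for mode in failure_modes:
        count = sum(r[mode] for r in labeled)
        stats.append({"mode": mode, "count": count, "rate": count / total})
    # Most prevalent first; remind the user frequency != priority.
    return sorted(stats, key=lambda s: s["count"], reverse=True)

labeled = [
    {"A": 1, "B": 0},
    {"A": 1, "B": 1},
    {"A": 0, "B": 0},
    {"A": 1, "B": 0},
]
for s in failure_rates(labeled, ["A", "B"]):
    print(f'{s["mode"]}: {s["count"]}/{len(labeled)} ({s["rate"]:.0%})')
# A: 3/4 (75%)
# B: 1/4 (25%)
```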

Output Formats

The skill produces up to three artifacts:
  1. Taxonomy definition (always produced) — A clean document defining each failure mode with its title, definition, and representative examples.
  2. Re-labeled dataset (produced when input traces are provided) — The original annotations augmented with binary columns for each failure mode, as JSON or CSV.
  3. Summary statistics (produced when re-labeling is done) — Error rates, counts, and a prioritized ranking.
For detailed output schemas and file format guidance, read references/output-formats.md.

Anti-Patterns to Avoid

These are drawn directly from common pitfalls observed in practice:
  • Generic categories from LLM research: Don't default to "hallucination", "staying on task", "verbosity" without grounding in the actual annotations. The whole point of open coding first is to let application-specific patterns emerge.
  • Too many categories: If you have 10+ failure modes from 30 annotations, you're over-splitting. Merge until you have 3–7 crisp categories.
  • Likert scales or severity scores: Resist any urge to rate failures on a 1–5 scale. Binary decisions produce more consistent, reproducible labels.
  • Freezing too early: The taxonomy should evolve. After re-labeling, the user may discover that a category needs splitting or that a new pattern has emerged. This is normal and expected — support iteration.
  • Skipping representative examples: Every failure mode definition needs concrete examples. Without them, the category is too abstract to apply consistently.

Using an LLM to Assist Clustering

When the user has 30+ annotations, it can help to use an LLM to propose initial groupings. If doing this, use the following prompt pattern:
Below is a list of open-ended annotations describing failures in [DOMAIN DESCRIPTION].
Please group them into a small set of coherent failure categories, where each category
captures similar types of mistakes. Each group should have:
- A short descriptive title (2-5 words)
- A brief one-line definition
- The annotation indices that belong to it

Do not invent new failure types; only cluster based on what is present in the notes.
Aim for 3-7 categories. If an annotation doesn't fit any group, list it separately
as "Uncategorized."

Annotations:
[PASTE ANNOTATIONS HERE]
Critical: LLM-generated groupings are a starting point, not the final answer. Always present them to the user for review and adjustment. The user's domain expertise is what makes the taxonomy meaningful.
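
Filling the bracketed placeholders in the prompt pattern above can be scripted so annotations are numbered consistently (the model then reports indices that map cleanly back to the data). A sketch, with an illustrative domain string:

```python
PROMPT_TEMPLATE = """Below is a list of open-ended annotations describing failures in {domain}.
Please group them into a small set of coherent failure categories, where each category
captures similar types of mistakes. Each group should have:
- A short descriptive title (2-5 words)
- A brief one-line definition
- The annotation indices that belong to it

Do not invent new failure types; only cluster based on what is present in the notes.
Aim for 3-7 categories. If an annotation doesn't fit any group, list it separately
as "Uncategorized."

Annotations:
{annotations}"""

def build_clustering_prompt(domain, annotations):
    # Number annotations so the model can reference them by index.
    numbered = "\n".join(f"{i}. {a}" for i, a in enumerate(annotations))
    return PROMPT_TEMPLATE.format(domain=domain, annotations=numbered)

prompt = build_clustering_prompt(
    "a text-to-SQL assistant",  # illustrative [DOMAIN DESCRIPTION]
    ["missing WHERE clause", "joined the wrong table"],
)
print(prompt.splitlines()[-2])  # "0. missing WHERE clause"
```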

Connecting to Next Steps

After the taxonomy is built, the user typically moves to one of:
  • Building LLM-as-Judge evaluators: Each failure mode becomes a separate binary evaluation prompt. The examples from the taxonomy become few-shot examples in the judge prompt.
  • Targeted pipeline improvements: The highest-frequency failure modes guide where to invest engineering effort (prompt changes, tool improvements, guardrails).
  • Generating more failure instances: For rare failure modes, the user may want to synthetically generate queries that trigger them to build a larger labeled dataset.
Mention these next steps when delivering the final taxonomy, so the user knows where to go from here.