eval-audit

Eval Audit

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

Overview

  1. Gather eval artifacts: traces, evaluator configs, judge prompts, labeled data, metrics dashboards
  2. Run diagnostic checks across six areas
  3. Produce a findings report ordered by impact, with each finding linking to a fix

Prerequisites

Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."

Connecting to Eval Infrastructure

Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight or similar). If available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.

Diagnostic Checks

Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.
Prioritize findings by impact on the user's product. Present the most impactful findings first.

1. Error Analysis

Check: Has the user done systematic error analysis on real or synthetic traces?
Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.
Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis, or generate-synthetic-data first if no traces exist.
Check: Were failure categories brainstormed or observed?
Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.
Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.

2. Evaluator Design

Check: Are evaluators binary pass/fail?
Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.
Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.
Check: Do LLM judge prompts target specific failure modes?
Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").
Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.
Check: Are code-based checks used where possible?
Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.
Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
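For instance, a response that must be JSON with a date field can be checked entirely in code; the required keys and date format here are invented for illustration:

```python
import json
import re


def check_format(response: str) -> bool:
    """Code-based check: valid JSON, required keys present, and a
    YYYY-MM-DD date -- no LLM judge involved."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not {"summary", "date"} <= data.keys():
        return False
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", data["date"]) is not None
```

Checks like this are deterministic and cost nothing per run, which is why they should come before any judge.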
Check: Are similarity metrics used as primary evaluation?
Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.
Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
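A toy demonstration of why overlap metrics mislead, using unigram F1 as a rough stand-in for ROUGE-style scoring (the sentences are invented):

```python
def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 -- a crude stand-in for ROUGE-style scoring."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the property has two bedrooms and no parking"
wrong = "the property has three bedrooms and free parking"
# token_f1(wrong, reference) == 0.75 despite both facts being wrong:
# the metric rewards surface overlap, not correctness.
```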
Reference: LLM Evals FAQ

3. Judge Validation

Check: Are LLM judges validated against human labels?
Look for: confusion matrices, TPR/TNR measurements, alignment scores. A judge running in production with no validation data is a critical finding.
Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.
Check: Is alignment measured with TPR/TNR or with raw accuracy?
Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.
Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
Check: Is there a proper train/dev/test split?
Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.
Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.

4. Human Review Process

Check: Who is reviewing traces?
Determine whether domain experts or outsourced annotators are labeling data.
Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.
Check: Are reviewers seeing full traces or just final outputs?
Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.
Check: How is data displayed to reviewers?
Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.
Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Format data in its natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.
Reference: LLM Evals FAQ

5. Labeled Data

Check: Is there enough labeled data?
For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR. If labeled data is sparse, collect more by sampling traces more effectively:
  • Random: Always include a random sample alongside other strategies to discover unknown issues.
  • Clustering: Group traces by semantic similarity and review representatives from each cluster.
  • Data analysis: Analyze statistics on latency, turns, tool calls, and tokens for outliers.
  • Classification: Use existing evals, a predictive model, or an LLM to surface problematic traces. Use with caution.
  • Feedback: Use explicit customer feedback (complaints, thumbs-down signals) to filter traces.
Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. Use the sampling strategies above to collect more labeled data, or supplement with generate-synthetic-data.
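The data-analysis strategy, for example, might be sketched like this, assuming each trace is a dict with a hypothetical `latency_ms` field (the same idea applies to turns, tool calls, or token counts):

```python
import statistics


def outlier_traces(traces: list[dict], key: str = "latency_ms",
                   z: float = 3.0) -> list[dict]:
    """Flag traces whose metric sits more than z standard deviations
    from the mean -- cheap candidates for manual review."""
    values = [t[key] for t in traces]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [t for t in traces if abs(t[key] - mean) / stdev > z]
```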

6. Pipeline Hygiene

Check: Is error analysis re-run after significant changes?
Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.
Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.
Check: Are evaluators maintained?
Look for periodic re-validation of judges or refreshed evaluation datasets.
Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.

No Eval Infrastructure

If the user has no eval artifacts (no traces, no evaluators, no labeled data):
  1. Start with error-analysis on a sample of real traces.
  2. If no production data exists, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.
  3. Do not recommend building evaluators, judges, or dashboards before completing error analysis.

Report Format

Present findings ordered by impact. For each:

[Problem Title]

Status: [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
Fix: [Concrete action, referencing a skill or article]

Group under the six diagnostic areas. Omit areas where no problems were found.

Anti-Patterns

  • Running the audit as a checklist without inspecting actual artifacts.
  • Reporting generic advice disconnected from what was found in the user's pipeline.
  • Recommending evaluators before error analysis is complete.
  • Suggesting LLM judges for failures that code-based checks can handle.
  • Treating this audit as a one-time event. Re-audit after significant pipeline changes.