ai-evals

AI Evals

Scope

Covers
  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)
When to use
  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.”
When NOT to use
  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
  • You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
  • You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).

Inputs

Minimum required
  • System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
  • The decision the eval must support (ship/no-ship, compare options, regression gate)
  • What “good” means: 3–10 target behaviors + top failure modes
  • Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline
Missing-info strategy
  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
  • If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:
  1. Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
  2. Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
  3. Error taxonomy (from error analysis + open coding): failure modes, severity, examples
  4. Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
  5. Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
  6. Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
  7. Risks / Open questions / Next steps (always included)
Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

  • Inputs: SUT description, stakeholders, decision to support.
  • Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
  • Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
  • Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

  • Inputs: User workflows, edge cases, safety risks, data availability.
  • Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
  • Outputs: Test set spec + initial golden set.
  • Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
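The schema and coverage check in this step can be sketched in code. This is a minimal sketch under assumptions: the field names, tags, and `coverage_gaps` helper are illustrative, not a format prescribed by the skill.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one golden-set case; field names are
# illustrative examples of the spec this step produces.
@dataclass
class EvalCase:
    case_id: str
    input_text: str            # what the SUT receives
    expected_behavior: str     # target behavior this case probes
    tags: list = field(default_factory=list)  # e.g. ["happy_path", "adversarial"]
    severity: str = "normal"   # "high" for explicitly represented risks

def coverage_gaps(cases, target_behaviors, min_per_behavior=2):
    """Return target behaviors covered by fewer than min_per_behavior cases,
    enforcing the 'at least 2 test cases per behavior' check above."""
    counts = {b: 0 for b in target_behaviors}
    for c in cases:
        if c.expected_behavior in counts:
            counts[c.expected_behavior] += 1
    return [b for b, n in counts.items() if n < min_per_behavior]
```

Running `coverage_gaps` against the starter set makes the coverage check mechanical: any behavior it returns still needs cases before the set is considered complete.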

3) Run error analysis and open coding to build a taxonomy

  • Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
  • Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
  • Outputs: Error taxonomy + “top failure modes” list.
  • Checks: Taxonomy is mutually understandable by PM/eng; each category has 1–2 concrete examples.
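Consolidating open-coded labels into a taxonomy can be as simple as counting and bucketing. The sketch below assumes failures have already been hand-labeled; the label strings and the `min_count` cutoff are invented for illustration.

```python
from collections import Counter

def build_taxonomy(labeled_failures, min_count=2):
    """Consolidate raw open-coding labels into taxonomy categories.

    labeled_failures: list of (failure_id, label) pairs from open coding.
    Labels seen at least min_count times become categories; rarer labels
    fall into an 'other' bucket pending more examples.
    """
    counts = Counter(label for _, label in labeled_failures)
    taxonomy = {label: n for label, n in counts.items() if n >= min_count}
    taxonomy["other"] = sum(n for label, n in counts.items() if n < min_count)
    return taxonomy
```

The counts double as the "top failure modes" list: sorting categories by frequency (weighted by severity, if tracked) surfaces what the rubric must cover first.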

4) Convert taxonomy → rubric + scoring rules

  • Inputs: Taxonomy, target behaviors, output formats.
  • Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
  • Outputs: Rubric + judging guide.
  • Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
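A rubric with explicit weights and tie-breakers can be made concrete, which is what keeps two judges from scoring "by vibes". The dimensions, weights, and 0–2 scale below are assumptions for illustration, not prescribed values.

```python
# Hypothetical rubric: dimension weights reflect that correctness
# failures should dominate the score. Per-dimension scores are 0-2.
RUBRIC = {
    "correctness": 3,
    "formatting": 1,
}

def score(case_scores):
    """Weighted sum of per-dimension scores for absolute scoring."""
    return sum(RUBRIC[d] * s for d, s in case_scores.items())

def pairwise_winner(a_scores, b_scores):
    """Pairwise comparison with an explicit tie-breaker: on equal totals,
    the candidate with higher correctness wins; otherwise declare a tie."""
    sa, sb = score(a_scores), score(b_scores)
    if sa != sb:
        return "A" if sa > sb else "B"
    if a_scores["correctness"] != b_scores["correctness"]:
        return "A" if a_scores["correctness"] > b_scores["correctness"] else "B"
    return "tie"
```

Writing the tie-breaker down as a rule (rather than leaving it to judge discretion) is exactly what makes independent judges converge on the same verdict.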

5) Choose the judging approach + harness/runbook

  • Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
  • Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
  • Outputs: Judge + harness plan.
  • Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
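For the automated-checks judge type, a minimal harness might look like the sketch below. The two checks (valid JSON, a crude email-pattern stand-in for a PII scan) are illustrative assumptions; the point is that each check is deterministic, so two runs over the same outputs agree, satisfying the repeatability requirement.

```python
import json

def check_valid_json(output):
    """Deterministic format check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_no_email(output):
    """Crude stand-in for a real PII scan (illustrative only)."""
    return "@" not in output

# Named checks so results are stored per check, per case.
CHECKS = [("valid_json", check_valid_json), ("no_email", check_no_email)]

def run_harness(outputs):
    """Apply every automated check to every SUT output.

    outputs: {case_id: output_text}. Returns per-case check results,
    ready to be versioned and stored for the reporting step.
    """
    return {
        case_id: {name: fn(out) for name, fn in CHECKS}
        for case_id, out in outputs.items()
    }
```

Human or LLM-as-judge scoring slots into the same structure: a judge is just another named function returning a result per case, which keeps the runbook uniform across judge types.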

6) Define reporting, thresholds, and the iteration loop

  • Inputs: Stakeholder needs, release cadence.
  • Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
  • Outputs: Reporting + iteration loop.
  • Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
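Per-tag metrics and a regression rule can be sketched as follows; the thresholds are illustrative assumptions, and in practice they come from the acceptance thresholds in the Eval PRD.

```python
def per_tag_pass_rate(results):
    """Compute pass rate per tag.

    results: list of (tags, passed) pairs, one per evaluated case.
    A case with several tags counts toward each of its tags.
    """
    totals, passes = {}, {}
    for tags, passed in results:
        for t in tags:
            totals[t] = totals.get(t, 0) + 1
            passes[t] = passes.get(t, 0) + int(passed)
    return {t: passes[t] / totals[t] for t in totals}

def regression_gate(rates, thresholds):
    """Return tags whose pass rate fell below the acceptance threshold.

    A non-empty result means the run fails the regression gate, so the
    report is directly decision-ready: ship only if this list is empty.
    """
    return sorted(t for t, thr in thresholds.items() if rates.get(t, 0.0) < thr)
```

Reporting per tag (not just an overall score) is what makes regressions detectable: an overall average can stay flat while a safety-tagged slice collapses.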

7) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
  • Outputs: Final AI Evals Pack.
  • Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

  • Use references/CHECKLISTS.md and references/RUBRIC.md.
  • Always include: Risks, Open questions, Next steps.

Examples

Example 1 (answer quality + safety): “Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”
Example 2 (structured extraction): “Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack.”
Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.”
Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.