ai-evals

AI Evals

Scope

Covers
  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)
When to use
  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.”
When NOT to use
  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
  • You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
  • You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).

Inputs

Minimum required
  • System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
  • The decision the eval must support (ship/no-ship, compare options, regression gate)
  • What “good” means: 3–10 target behaviors + top failure modes
  • Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline
Missing-info strategy
  • Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
  • If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:
  1. Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
  2. Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
  3. Error taxonomy (from error analysis + open coding): failure modes, severity, examples
  4. Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
  5. Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
  6. Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
  7. Risks / Open questions / Next steps (always included)
Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

  • Inputs: SUT description, stakeholders, decision to support.
  • Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
  • Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
  • Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

  • Inputs: User workflows, edge cases, safety risks, data availability.
  • Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
  • Outputs: Test set spec + initial golden set.
  • Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
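The schema and coverage check in this step can be sketched in code. This is a minimal sketch under assumptions: the field names, tags, and `coverage_gaps` helper are illustrative, not a format prescribed by the skill.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one golden-set case; field names are
# illustrative examples of the spec this step produces.
@dataclass
class EvalCase:
    case_id: str
    input_text: str            # what the SUT receives
    expected_behavior: str     # target behavior this case probes
    tags: list = field(default_factory=list)  # e.g. ["happy_path", "adversarial"]
    severity: str = "normal"   # "high" for explicitly represented risks

def coverage_gaps(cases, target_behaviors, min_per_behavior=2):
    """Return target behaviors covered by fewer than min_per_behavior cases,
    enforcing the 'at least 2 test cases per behavior' check above."""
    counts = {b: 0 for b in target_behaviors}
    for c in cases:
        if c.expected_behavior in counts:
            counts[c.expected_behavior] += 1
    return [b for b, n in counts.items() if n < min_per_behavior]
```

Running `coverage_gaps` against the starter set makes the coverage check mechanical: any behavior it returns still needs cases before the set is considered complete.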

3) Run error analysis and open coding to build a taxonomy

  • Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
  • Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
  • Outputs: Error taxonomy + “top failure modes” list.
  • Checks: Taxonomy is mutually understandable by PM/eng; each category has 1–2 concrete examples.
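Consolidating open-coded labels into a taxonomy can be as simple as counting and bucketing. The sketch below assumes failures have already been hand-labeled; the label strings and the `min_count` cutoff are invented for illustration.

```python
from collections import Counter

def build_taxonomy(labeled_failures, min_count=2):
    """Consolidate raw open-coding labels into taxonomy categories.

    labeled_failures: list of (failure_id, label) pairs from open coding.
    Labels seen at least min_count times become categories; rarer labels
    fall into an 'other' bucket pending more examples.
    """
    counts = Counter(label for _, label in labeled_failures)
    taxonomy = {label: n for label, n in counts.items() if n >= min_count}
    taxonomy["other"] = sum(n for label, n in counts.items() if n < min_count)
    return taxonomy
```

The counts double as the "top failure modes" list: sorting categories by frequency (weighted by severity, if tracked) surfaces what the rubric must cover first.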

4) Convert taxonomy → rubric + scoring rules

  • Inputs: Taxonomy, target behaviors, output formats.
  • Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
  • Outputs: Rubric + judging guide.
  • Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
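A rubric with explicit weights and tie-breakers can be made concrete, which is what keeps two judges from scoring "by vibes". The dimensions, weights, and 0–2 scale below are assumptions for illustration, not prescribed values.

```python
# Hypothetical rubric: dimension weights reflect that correctness
# failures should dominate the score. Per-dimension scores are 0-2.
RUBRIC = {
    "correctness": 3,
    "formatting": 1,
}

def score(case_scores):
    """Weighted sum of per-dimension scores for absolute scoring."""
    return sum(RUBRIC[d] * s for d, s in case_scores.items())

def pairwise_winner(a_scores, b_scores):
    """Pairwise comparison with an explicit tie-breaker: on equal totals,
    the candidate with higher correctness wins; otherwise declare a tie."""
    sa, sb = score(a_scores), score(b_scores)
    if sa != sb:
        return "A" if sa > sb else "B"
    if a_scores["correctness"] != b_scores["correctness"]:
        return "A" if a_scores["correctness"] > b_scores["correctness"] else "B"
    return "tie"
```

Writing the tie-breaker down as a rule (rather than leaving it to judge discretion) is exactly what makes independent judges converge on the same verdict.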

5) Choose the judging approach + harness/runbook

  • Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
  • Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
  • Outputs: Judge + harness plan.
  • Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
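For the automated-checks judge type, a minimal harness might look like the sketch below. The two checks (valid JSON, a crude email-pattern stand-in for a PII scan) are illustrative assumptions; the point is that each check is deterministic, so two runs over the same outputs agree, satisfying the repeatability requirement.

```python
import json

def check_valid_json(output):
    """Deterministic format check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_no_email(output):
    """Crude stand-in for a real PII scan (illustrative only)."""
    return "@" not in output

# Named checks so results are stored per check, per case.
CHECKS = [("valid_json", check_valid_json), ("no_email", check_no_email)]

def run_harness(outputs):
    """Apply every automated check to every SUT output.

    outputs: {case_id: output_text}. Returns per-case check results,
    ready to be versioned and stored for the reporting step.
    """
    return {
        case_id: {name: fn(out) for name, fn in CHECKS}
        for case_id, out in outputs.items()
    }
```

Human or LLM-as-judge scoring slots into the same structure: a judge is just another named function returning a result per case, which keeps the runbook uniform across judge types.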

6) Define reporting, thresholds, and the iteration loop

  • Inputs: Stakeholder needs, release cadence.
  • Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
  • Outputs: Reporting + iteration loop.
  • Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
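Per-tag metrics and a regression rule can be sketched as follows; the thresholds are illustrative assumptions, and in practice they come from the acceptance thresholds in the Eval PRD.

```python
def per_tag_pass_rate(results):
    """Compute pass rate per tag.

    results: list of (tags, passed) pairs, one per evaluated case.
    A case with several tags counts toward each of its tags.
    """
    totals, passes = {}, {}
    for tags, passed in results:
        for t in tags:
            totals[t] = totals.get(t, 0) + 1
            passes[t] = passes.get(t, 0) + int(passed)
    return {t: passes[t] / totals[t] for t in totals}

def regression_gate(rates, thresholds):
    """Return tags whose pass rate fell below the acceptance threshold.

    A non-empty result means the run fails the regression gate, so the
    report is directly decision-ready: ship only if this list is empty.
    """
    return sorted(t for t, thr in thresholds.items() if rates.get(t, 0.0) < thr)
```

Reporting per tag (not just an overall score) is what makes regressions detectable: an overall average can stay flat while a safety-tagged slice collapses.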

7) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
  • Outputs: Final AI Evals Pack.
  • Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

  • Use references/CHECKLISTS.md and references/RUBRIC.md.
  • Always include: Risks, Open questions, Next steps.

Examples

Example 1 (answer quality + safety): “Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”
Example 2 (structured extraction): “Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack.”
Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.”
Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.