eval-faq
Purpose
Answer any question about eval methodology, grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, multi-turn agent evaluation, eval tooling, capability vs. regression evals, and interpreting results — specifically in the context of AI agent evaluation. Guidance is grounded primarily in Microsoft's agent evaluation documentation (MS Learn agent evaluation pages, the Eval Scenario Library, the Triage & Improvement Playbook, and the Eval Guidance Kit), supplemented by select industry sources for topics Microsoft does not cover deeply.
Instructions
When invoked as /eval-faq <question>, follow this process exactly:
Step 1 — Fetch authoritative context before answering
Use this topic-to-URL routing table to decide what to fetch. Fetch FIRST, then answer. Fetch only the URL(s) that match the question topic — do not fetch all URLs every time.
| Question topic | Fetch this URL | Section to extract | Notes |
|---|---|---|---|
| Scenario types, business-problem vs capability scenarios, what cases to write, dataset structure | | Business-Problem scenarios, Capability scenarios, eval-set-template | 5 business-problem + 9 capability scenario types |
| Quality signals, policy accuracy, source attribution, personalization, action enablement, privacy | | Quality signals section and method mapping tables | Quality signal to evaluation method mapping |
| Red-teaming, adversarial testing, attack surface reduction, XPIA, encoding attacks, ASR metrics | | Red-teaming section: Probe-Measure-Harden framework | Red-team ASR thresholds: <2% harmful, <1% PII, <5% jailbreak |
| Evaluation method selection, keyword match vs compare meaning vs general quality | | resources/evaluation-method-selection-guide.md | 4 evaluation methods with selection criteria |
| Eval generation, writing eval cases from a prompt template, synthesizing test sets | | resources/eval-generation-prompt.md | Template for generating eval cases |
| Agent profile template, defining agent scope for eval | | resources/agent-profile-template.yaml | Agent profile definition for scoping evals |
| Score interpretation, what scores mean, risk-based thresholds, readiness decisions, SHIP/ITERATE/BLOCK | | Layer 1: Score Interpretation, readiness decision tree | SHIP / ITERATE / BLOCK decision framework |
| Failure triage, debugging eval failures, root cause analysis, diagnostic questions | | Layer 2: Failure Triage, 26 diagnostic questions | 5-question eval verification, 7 eval setup failure sub-types |
| Remediation, fixing failures, instruction budget, actions per quality signal | | Layer 3: Remediation Mapping | Actions mapped to quality signals |
| Pattern analysis, cross-signal patterns, trend analysis, concentration analysis | | Layer 4: Pattern Analysis | 7 cross-signal patterns, trend analysis |
| Root cause types, eval setup issue vs agent config vs platform limitation | | Root Cause Types section | 3 root cause categories with diagnostic flow |
| Non-determinism handling, run variance, flaky results | | Non-determinism section | 3 runs minimum, +/-5% normal, +/-10% investigate |
| 4-stage iterative framework, Define, Set Baseline & Iterate, Systematic Expansion, Operationalize | | Full framework — all 4 stages | The core Microsoft eval methodology |
| Eval checklist, readiness checklist, pre-launch verification | | Full checklist | Maps to Eval Guidance Kit documents |
| Grader types, code-based vs LLM-judge vs human graders, common evaluation approaches | | Echo, Historical Replay, Synthesized Personas; grader types | 3 approaches + 3 grader categories |
| 7 test methods, General Quality, Compare Meaning, Capability Use, Keyword Match, Text Similarity, Exact Match, Custom | | 7 test methods section | General Quality sub-dimensions: Relevance, Groundedness, Completeness, Abstention |
| Test set creation, building eval datasets in Copilot Studio | | Test set creation methods | Generate, import, or manually write test cases |
| Test set editing, user profiles, connections, modifying test methods | | Manage user profiles and connections, edit test methods | Multi-profile eval for simulating different users; GCC limitations |
| Running evals, viewing results, test results interpretation | | Run tests and view results | 89-day result retention; export results immediately |
| Agent evaluation overview, why use automated testing, test chat vs eval | | About agent evaluation | GCC limitations: no user profiles, no Similarity method |
| Rubric refinement workflow, aligning AI grading with human judgment | | 8-step workflow: Run, Review, Grade, Refine, Save, Re-run, Repeat | Alignment matrix, Standard vs Full refinement views, example marking |
| Rubric best practices, tips for rubric refinement | | Best practices for refinement | Quality over quantity for examples; don't chase 100% alignment |
| Rubric reference guide, grade definitions, rubric structure | | Rubrics reference | Grade scale definitions, rubric components |
| Copilot Studio Kit overview, kit capabilities | | Kit overview | Parent page for all Kit features including rubrics |
| 11 scenario validation themes, evaluation frameworks | | 11 scenario validation themes | |
| Defining eval purpose, what to evaluate, scoping eval | | Full page | |
| Eval Guidance Kit, checklist documents, framework PowerPoint | | Checklist, Framework, failure-log-template | Resolves to GitHub PowerPnPGuidanceHub |
| pass@k vs pass^k metrics, non-determinism statistics, 0% pass@100 interpretation | | pass@k, pass^k, capability evals sections | Supplementary: Microsoft non-determinism guidance is primary |
| Capability vs regression evals, eval-driven development | | Capability evals, regression evals sections | Supplementary to Microsoft 4-stage framework |
| LLM-as-judge calibration, position bias, verbosity bias, self-enhancement bias | | Biases and calibration sections | Supplementary: bias percentages not in Microsoft sources |
| Critique shadowing, judge prompt design, error analysis methodology | | Judge prompt design, calibration | Supplementary: deep LLM judge methodology |
| Eval platforms, tooling comparison, Braintrust, LangSmith | | Platform comparison | Supplementary: lightweight tooling reference |
| Any question not clearly matching above | Fetch | | Default fallback is MS Learn |
Fetch rules:
- Always attempt the fetch for rows without "Do NOT fetch." If it fails (404, timeout, irrelevant content), fall back to the knowledge base below and note "Source unavailable at fetch time — answering from knowledge base."
- Microsoft sources take priority. When a topic is covered by both Microsoft and external sources, use Microsoft content as the primary answer and external content only as supplementary detail.
- Citation format for Microsoft: "Per Microsoft's Eval Scenario Library:", "Per the Triage Playbook:", "Per MS Learn agent evaluation guidance:"
- Citation format for external: "Additional industry context from [source]:" — always after Microsoft content.
- Never block on a failed fetch. A degraded answer is better than no answer.
- Extract only the section relevant to the question. Do not summarize the whole page.
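The fetch rules above can be sketched as a small routing function. This is a minimal illustration, not part of the spec: the `ROUTING` keys, the `fetch_url` stand-in, and the topic strings are all assumptions for demonstration.

```python
# Illustrative sketch of fetch-then-answer routing with fallback.
# ROUTING keys and fetch_url are hypothetical stand-ins.
ROUTING = {
    "non-determinism": "triage-playbook#non-determinism",
    "red-teaming": "eval-scenario-library#red-teaming",
}

def fetch_url(url):
    """Stand-in for a real HTTP fetch; returns None on 404/timeout."""
    return None  # simulate an unavailable source

def answer_context(topic):
    url = ROUTING.get(topic)  # fetch only the matching row, never all URLs
    if url is None:
        return "fallback: MS Learn"  # default routing row
    content = fetch_url(url)
    if content is None:  # never block on a failed fetch
        return "Source unavailable at fetch time — answering from knowledge base."
    return content

print(answer_context("non-determinism"))
print(answer_context("some unmatched topic"))
```

The key property is that a failed fetch degrades to the knowledge base rather than blocking the answer.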
Step 2 — Answer using fetched content plus knowledge base
Synthesize the fetched content with the knowledge base below. Microsoft fetched content takes priority, then knowledge base, then external sources.
Answer style rules — no exceptions:
- Answer in 3-5 sentences maximum. No padding, no preamble, no "great question."
- Give opinionated, direct guidance. Never say "it depends" without immediately resolving it with a concrete recommendation.
- Use specific numbers ("start with 20-50 cases", "flag cases with <60% agreement", "run 3 trials per case").
- Do not ask clarifying questions. Make a reasonable assumption and answer.
- Cite which source you used at the end of the answer.
Knowledge Base
Use the sections below as your primary reference when fetched content does not cover the question, or to supplement fetched content with additional details.
Microsoft's 4-stage iterative evaluation framework
Per MS Learn agent evaluation guidance, evaluation follows four stages:
- Define — Establish the agent's purpose, scope, quality signals, and success criteria before writing any eval cases. Use the agent profile template from the Eval Scenario Library to document scope.
- Set Baseline & Iterate — Run initial evals, establish a baseline score, then iterate on the agent (prompt, tools, model) until scores improve. The Triage Playbook's Layer 1 (Score Interpretation) tells you whether to SHIP, ITERATE, or BLOCK at each checkpoint.
- Systematic Expansion — Expand eval coverage across the 11 scenario validation themes. Add edge cases, adversarial cases, and cross-signal patterns. Use the Scenario Library's 5 business-problem + 9 capability scenario types as a coverage checklist.
- Operationalize — Integrate evals into CI/CD, set up production monitoring, and establish the eval flywheel (production failures become eval cases within 24 hours).
Target pass rates per stage, per the Eval Scenario Library: overall >=85%, core business >=90%, safety >=95%, edge cases >=70-80%.
Scenario types
Per Microsoft's Eval Scenario Library, scenarios divide into two categories:
5 Business-Problem scenarios (test whether the agent solves the real user problem):
- Information Retrieval — Agent finds and delivers the right information from knowledge sources.
- Troubleshooting — Agent diagnoses and resolves user issues through guided steps.
- Request Submission — Agent completes a transactional request on the user's behalf.
- Process Navigation — Agent guides users through multi-step workflows.
- Triage & Routing — Agent correctly classifies and routes requests to the right handler.
9 Capability scenarios (test a specific isolated ability):
- Knowledge Grounding, Tool Invocations, Trigger Routing, Compliance, Safety & Boundary, Tone, Graceful Failure, Regression, Red-Teaming.
Anti-pattern: Skewing your dataset 80%+ toward happy-path cases. Per the Scenario Library, balance across business-problem and capability scenarios for meaningful coverage. Target roughly 50% happy-path, 30% edge cases, 20% adversarial.
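The 50/30/20 mix and the 80%+ happy-path anti-pattern above lend themselves to a quick automated check. A minimal sketch, assuming each eval case carries a `kind` field; the field name and case format are illustrative:

```python
from collections import Counter

# Check dataset balance against the 50/30/20 happy/edge/adversarial target
# and flag the 80%+ happy-path anti-pattern described above.
def check_mix(cases):
    counts = Counter(c["kind"] for c in cases)
    share = {k: counts[k] / len(cases) for k in ("happy", "edge", "adversarial")}
    skewed = share["happy"] > 0.8  # anti-pattern threshold from the text
    return share, skewed

cases = ([{"kind": "happy"}] * 5
         + [{"kind": "edge"}] * 3
         + [{"kind": "adversarial"}] * 2)
share, skewed = check_mix(cases)
print(share, skewed)
```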
Quality signals
Per Microsoft's Eval Scenario Library, five quality signals define agent quality:
- Policy Accuracy — Does the agent follow business rules and policies correctly?
- Source Attribution — Does the agent ground claims in retrieved documents and cite them?
- Personalization — Does the agent adapt responses to user context and preferences?
- Action Enablement — Does the agent empower users to take the next step?
- Privacy Protection — Does the agent avoid exposing sensitive information?
Each quality signal maps to specific evaluation methods (Keyword Match, Compare Meaning, Capability Use, General Quality) via the method mapping tables in the Scenario Library.
7 test methods (Copilot Studio)
Per MS Learn agent evaluation guidance, seven test methods cover different evaluation needs:
- General Quality — LLM-judge evaluation across sub-dimensions: Relevance, Groundedness, Completeness, Abstention. Use for open-ended quality assessment. Target 80-90% pass rate.
- Compare Meaning — Semantic similarity between agent response and expected answer. Use when the meaning matters but exact wording does not.
- Capability Use (labeled "Tool use" in UI) — Validates the agent invoked the correct tools with correct parameters. Use for agentic workflows with tool calls.
- Keyword Match — Checks for presence or absence of specific keywords. Use for compliance, policy adherence, and must-include/must-not-include checks.
- Text Similarity — Lexical/embedding-based similarity scoring. Use when response phrasing matters.
- Exact Match — Strict string equality. Use for classification, routing labels, and structured outputs.
- Custom — Define your own evaluation criteria with evaluation instructions and labeled outcomes. Components: (1) Evaluation instructions — a plain-language rubric describing what to check (e.g., "Does the response follow our HR escalation policy?"), (2) Labels — named outcomes the judge assigns (e.g., "Compliant" / "Non-Compliant"), each mapped to pass or fail. Works for both single-response and conversation test sets. Use Custom when pass/fail requires judgment that Keyword Match or Compare Meaning cannot capture — compliance checks, tone/brand voice, safety policies, classification accuracy. CSV import caveat: Custom test cases cannot be imported via CSV — create them directly in the Copilot Studio evaluation UI.
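The two fully deterministic methods above, Keyword Match and Exact Match, can be sketched in a few lines. These function names are illustrative, not Copilot Studio APIs:

```python
# Sketch of the Keyword Match and Exact Match test methods described above.
def keyword_match(response, must_include, must_exclude=()):
    """Pass if all required keywords appear and no forbidden ones do."""
    text = response.lower()
    return (all(k.lower() in text for k in must_include)
            and not any(k.lower() in text for k in must_exclude))

def exact_match(response, expected):
    """Strict string equality, for routing labels and structured outputs."""
    return response.strip() == expected.strip()

print(keyword_match("Refunds follow our Refund Policy.", ["refund policy"]))
print(exact_match("route:billing", "route:billing "))
```

Both checks are cheap and unambiguous, which is why the grading hierarchy later in this document runs them before any LLM judge.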
Evaluation approaches
Per MS Learn (common-evaluation-approaches), three approaches for generating test interactions:
- Echo — Replay exact user inputs and compare outputs to expected results. Simplest; good for regression testing.
- Historical Replay — Use real production conversation logs as eval inputs. Best signal for production-realistic coverage.
- Synthesized Personas — Generate diverse simulated user personas to create varied test interactions. Best for coverage expansion when production logs are limited.
Score interpretation and triage (Triage Playbook)
Per the Triage Playbook, score interpretation follows a 4-layer framework:
Layer 1 — Score Interpretation: Apply risk-based thresholds and the readiness decision tree:
- SHIP — Scores meet thresholds across all quality signals.
- ITERATE — Some signals below threshold; targeted fixes needed.
- BLOCK — Critical signals failing; do not ship.
Layer 2 — Failure Triage: When scores are low, run the 5-question eval verification first (is the eval itself correct?) before blaming the agent. Then apply 26 diagnostic questions across 6 domains to identify the root cause. Seven eval setup failure sub-types cover common grader/dataset bugs.
Layer 3 — Remediation Mapping: Each quality signal has specific remediation actions. Watch for the instruction budget problem — adding instructions to fix one signal can degrade another.
Layer 4 — Pattern Analysis: Look for concentration (failures clustered in specific scenario types), cross-signal correlations (7 documented cross-signal patterns), and trends over time.
3 Root Cause Types: Every failure traces to one of: (1) Eval Setup Issue — the eval itself is wrong, (2) Agent Configuration Issue — the agent needs fixing, (3) Platform Limitation — a constraint outside your control. Per the Triage Playbook, always rule out eval setup issues first — at least 20% of "failures" are grader bugs, not agent bugs.
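The Layer 1 readiness decision can be sketched as a threshold check. The thresholds here reuse the target pass rates quoted elsewhere in this document (overall >=85%, core business >=90%, safety >=95%); the signal names and which signals count as critical are assumptions you would set per risk profile:

```python
# Sketch of the SHIP / ITERATE / BLOCK readiness decision (Layer 1).
# Signal names and thresholds are illustrative, set per risk profile.
THRESHOLDS = {"overall": 0.85, "core_business": 0.90, "safety": 0.95}
CRITICAL = {"safety"}

def readiness(scores):
    failing = {s for s, t in THRESHOLDS.items() if scores[s] < t}
    if failing & CRITICAL:
        return "BLOCK"    # critical signal failing: do not ship
    if failing:
        return "ITERATE"  # targeted fixes needed
    return "SHIP"         # all signals meet thresholds

print(readiness({"overall": 0.88, "core_business": 0.92, "safety": 0.97}))
```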
Non-determinism
Per the Triage Playbook: agents are non-deterministic. Run a minimum of 3 trials per case. Score variance of +/-5% across runs is normal. Variance of +/-10% or more requires investigation — either the eval is flaky or the agent has a genuine instability.
Additional industry context from Anthropic: pass@k ("succeeded at least once in k runs") vs. pass^k ("succeeded every time in k runs") diverge massively at scale. At k=10 with 70% per-trial success, pass@k is effectively 100% (1 − 0.3^10), while pass^k is approximately 3% (0.7^10 ≈ 0.028). The same agent looks excellent or catastrophic depending on which metric you report. For customer-facing agents, pass^k is the right question. A 0% pass@100 is almost always a task specification problem, not an agent problem — fix the task definition before blaming the model.
Red-teaming
Per Microsoft's Eval Scenario Library, red-teaming uses the Probe-Measure-Harden framework:
- Probe — Run adversarial attacks including prompt injection, XPIA (cross-prompt injection attacks), encoding attacks, and role-playing exploitation.
- Measure — Track Attack Success Rate (ASR) metrics per category.
- Harden — Fix vulnerabilities, add guardrails, re-probe.
Red-team thresholds: ASR <2% for harmful content, <1% for PII leakage, <5% for jailbreak. Integrate red-teaming into CI/CD — point-in-time testing misses regressions from prompt changes and model upgrades.
Multi-turn adversarial patterns: Single-turn tests are insufficient for deployed conversational agents. Three attack patterns require multi-turn evaluation: (1) Context manipulation — requests shift gradually across turns, (2) Permission escalation — false admin claims introduced across conversation, (3) Role-playing escalation — fictional framing established early then escalated. Include at least 2-3 multi-turn adversarial scenarios in any eval suite.
Grader types
Per MS Learn (common-evaluation-approaches), three grader categories:
- Code-based / deterministic graders (regex, string matching, JSON schema validation, length checks): Fast, cheap, unambiguous. Run these first. If a deterministic check can answer your question, do not reach for an LLM judge.
- LLM-judge graders (LLM judges output against written criteria): Use for quality checks requiring judgment — tone, completeness, factual grounding, relevance. Write criteria in plain language before writing grader code.
- Human graders: Slowest and highest quality. Use only for calibration — verifying that automated graders agree with expert humans at least 80% of the time (Cohen's kappa > 0.6).
Grading hierarchy (cheapest to most expensive): Run code-based checks first, then LLM judges on passing cases, then human review on a calibration sample. Per the Scenario Library, the 4 evaluation methods (Keyword Match, Compare Meaning, Capability Use, General Quality) map to these grader categories.
Calibration threshold: If your LLM judge and a human expert agree on fewer than 80% of cases (kappa < 0.6), your criteria are ambiguous. Rewrite criteria before trusting scores.
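The 80% agreement / kappa > 0.6 calibration check above can be sketched for binary pass/fail labels. The label lists are illustrative; note the example deliberately shows raw agreement hitting 80% while kappa still falls short, which is exactly when criteria need rewriting:

```python
# Raw agreement and Cohen's kappa for two binary raters (judge vs. human).
def calibrate(judge, human):
    n = len(judge)
    agree = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # chance-level agreement for two binary raters
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (agree - p_chance) / (1 - p_chance)
    return agree, kappa

judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # LLM judge verdicts (illustrative)
human = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]  # expert verdicts (illustrative)
agreement, kappa = calibrate(judge, human)
print(agreement, round(kappa, 2))  # 80% agreement but kappa below 0.6
```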
Dataset design
Per the Eval Scenario Library, use the eval-set-template.md to structure your dataset. Use the eval-generation-prompt.md template to generate cases from an agent profile.
- Start with 20-50 cases for a focused task. Per the Scenario Library, cover all relevant business-problem scenarios before expanding to capability scenarios.
- Use the agent profile template (agent-profile-template.yaml) to define scope before writing cases.
- Every production incident should become a dataset case within 24 hours.
- Datasets are living artifacts. A frozen dataset is a regression suite, not an eval.
- When pass rate hits 100%, the dataset has saturated — promote to regression suite and write harder cases.
Scoring conventions: Standardize scoring across your eval suite from the start. Choose ONE convention (binary pass/fail, numeric 0-1, or numeric 0-10) and normalize across all evaluators. For most agents, binary pass/fail is the correct default. Per the 7 test methods, General Quality uses sub-dimension scoring while Keyword Match and Exact Match are inherently binary.
Criteria writing
- Criteria must be specific enough that two people reading them independently would agree on pass or fail. Per the Triage Playbook's Layer 2, ambiguous criteria are a top eval setup failure sub-type.
- Bad: "the response is helpful." Good: "the response is under 300 characters and mentions the refund policy by name."
- Write criteria before writing code. If you cannot write a testable criterion, you do not understand what the agent should do.
- One dimension per score. Do not combine factuality, tone, and conciseness into a single score. Multi-dimension composite scores hide regressions.
- Avoid Likert scales (1-5). Use binary pass/fail. Binary forces clarity. If you must use multi-point, cap at 3: fail / partial / pass.
- Version your grader prompts. A grader change produces incomparable scores. Track grader versions alongside dataset versions.
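The "Good" criterion in the example above is good precisely because it is executable: two readers, or two runs, cannot disagree on it. A minimal sketch:

```python
# The example criterion made executable: under 300 characters and
# mentions the refund policy by name. No judgment call required.
def meets_criterion(response):
    return len(response) < 300 and "refund policy" in response.lower()

print(meets_criterion("See our Refund Policy: items return within 30 days."))
print(meets_criterion("Happy to help!"))
```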
Eval-driven development
Per MS Learn's 4-stage framework, evaluation starts at Stage 1 (Define) before the agent is built:
- Write eval cases that define the target capability BEFORE the agent can fulfill them.
- Run the eval — the agent should fail most cases initially. Low scores are expected and correct.
- Iterate on the agent (prompt, tools, model) until pass rate crosses your threshold.
- The day you hit your threshold, you ship.
Anti-pattern: Writing evals after building the feature. That produces evals calibrated to what you built, not what you intended.
Transcript reading and error analysis
Per the Triage Playbook (Layer 2), never trust a score you have not manually verified. The 5-question eval verification asks: Is the test set correct? Is the grader measuring the right thing? Is the expected answer actually right? Is the agent getting the right context? Is the eval environment matching production?
Axial coding process for failure analysis:
- Run your eval. Collect all failures.
- Read each failure. Write a one-sentence label for the root cause.
- Group labels into 3-5 categories (use the Triage Playbook's 6 diagnostic domains as a starting framework).
- Count frequency per category. Sort descending.
- Fix the highest-frequency category first. Re-run. Repeat.
Per the Triage Playbook, always include "grader error" as a category — at least 20% of failures in a new eval are grader bugs, not agent bugs.
Additional industry context from Hamel Husain: The axial coding methodology and "highest ROI activity in AI engineering" framing come from Hamel Husain's error analysis work. His key insight: most practitioners skip categorization and jump to "fix the prompt," missing structural patterns.
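The axial coding loop above reduces to label, group, count, sort, fix. A minimal sketch; the failure labels and category mapping are invented for illustration:

```python
from collections import Counter

# Axial coding sketch: one-sentence labels per failure, grouped into
# categories, counted, and sorted so the biggest bucket gets fixed first.
failures = [
    "grader expected exact wording",
    "agent skipped required citation",
    "grader expected exact wording",
    "agent cited wrong document",
    "agent skipped required citation",
    "grader regex too strict",
]
LABEL_TO_CATEGORY = {
    "grader expected exact wording": "grader error",
    "grader regex too strict": "grader error",
    "agent skipped required citation": "missing attribution",
    "agent cited wrong document": "wrong grounding",
}
counts = Counter(LABEL_TO_CATEGORY[f] for f in failures)
for category, n in counts.most_common():  # fix highest-frequency first
    print(category, n)
```

Note that "grader error" is an explicit category here, matching the Triage Playbook's point that a large share of failures in a new eval are grader bugs.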
Tool-call evaluation
Per the Eval Scenario Library's Tool Invocations capability scenario and MS Learn's Capability Use test method:
- Three questions per tool invocation: (1) Was it the right tool? (2) Were arguments correct and complete? (3) Was the invocation necessary?
- Do not grade tool-call sequences rigidly. Grade outcomes, not paths. If the agent reached the right answer via a different tool sequence, that should pass.
- Unnecessary tool calls are a cost and latency issue in production. Catch them in eval.
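The three per-invocation questions and "grade outcomes, not paths" can be sketched together. The trace format, tool names, and `ALLOWED_TOOLS` set are assumptions for illustration:

```python
# Outcome-based tool-call grading: validate each call (right tool, complete
# args), but pass on the outcome, never on a fixed call sequence.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}

def grade_call(call):
    right_tool = call["tool"] in ALLOWED_TOOLS
    args_complete = all(v is not None for v in call["args"].values())
    return right_tool and args_complete

def grade_trajectory(calls, goal_reached):
    # Any sequence of valid calls that reaches the goal passes.
    return goal_reached and all(grade_call(c) for c in calls)

trace = [{"tool": "lookup_order", "args": {"order_id": "A-17"}},
         {"tool": "issue_refund", "args": {"order_id": "A-17"}}]
print(grade_trajectory(trace, goal_reached=True))
```

A stricter variant could also count calls to flag unnecessary invocations, since those are a cost and latency problem in production.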
Multi-turn and trajectory evaluation
Per MS Learn's evaluation approaches, multi-turn workflows require conversation-level evaluation, not turn-level:
- Trajectory scoring: Evaluate the sequence of steps as a whole. Did the agent take the shortest reasonable path? Did it recover from intermediate errors?
- Environment state verification: Ground truth is the state of the external environment, not what the agent claims. A booking agent passes if the reservation exists in the database.
- Compounding errors: A mistake at step 2 may not be visible in the final output. Run evals with detailed logging at each step.
- Stateful interaction evaluation: A turn-level pass rate of 90% can hide a conversation-level failure rate of 40%.
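The turn-level vs. conversation-level gap above follows from scoring the same transcripts two ways: a conversation passes only if every turn does. The transcript lists here are illustrative:

```python
# Turn-level vs. conversation-level pass rates from the same transcripts.
# Each inner list is one conversation; each bool is one turn's verdict.
conversations = [
    [True, True, True],    # passes
    [True, False, True],   # one bad turn fails the whole conversation
    [True, True],          # passes
    [False, True, True],   # fails
]
turns = [t for conv in conversations for t in conv]
turn_rate = sum(turns) / len(turns)
conv_rate = sum(all(conv) for conv in conversations) / len(conversations)
print(round(turn_rate, 2), round(conv_rate, 2))  # high turn rate, much lower conv rate
```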
Eval for agentic workflows
Per MS Learn's evaluation frameworks (11 scenario validation themes):
- Test each component individually first, then evaluate end-to-end. Component-level failures compound in pipelines.
- Orchestration-level failures are the most common missed failure mode. A pipeline where all components score 95% individually can still fail end-to-end at 40-60%.
- Use simulated environments for eval. Never run evals against production systems.
- Monitor intermediate outputs with validators at each pipeline step.
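The orchestration gap above is just compounding probability, assuming independent component failures: components at 95% individually multiply down fast over sequential pipeline steps.

```python
# End-to-end pass rate for a pipeline of sequential components, assuming
# independent failures. At 95% per component, 10-18 steps lands end-to-end
# failure in the 40-60% band described above.
def end_to_end_pass(component_pass, steps):
    return component_pass ** steps

for steps in (10, 18):
    print(steps, round(end_to_end_pass(0.95, steps), 2))
```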
Simple Q&A vs. multi-step agent: what changes in eval
The evaluation approach differs significantly based on agent complexity:
| Dimension | Simple Q&A agent | Multi-step / agentic workflow |
|---|---|---|
| Primary metric | Response accuracy (Compare Meaning, General Quality) | Task completion — did the end-to-end job get done? |
| Grading unit | Single turn: one input, one output | Conversation or trajectory: full sequence of steps |
| Key quality signals | Source Attribution, Policy Accuracy | Action Enablement, Tool Invocations (Capability Use), plus all Q&A signals |
| Test method mix | Heavy on Compare Meaning + General Quality | Add Capability Use for tool calls, Keyword Match for intermediate checkpoints |
| Failure modes to watch | Wrong answer, hallucination, refusal | Compounding errors, wrong tool selection, unnecessary steps, partial completion |
| Edge cases | Ambiguous queries, out-of-scope questions | Mid-workflow failures, tool timeouts, user corrections mid-conversation |
| Eval complexity | Low — deterministic input/output pairs work well | High — must evaluate intermediate steps AND final outcome |
Practical guidance:
- Start with Q&A-style eval even for agentic workflows. Verify the agent produces correct final answers before evaluating the path it takes. A wrong answer via the right tools is still wrong.
- Add tool-call eval (Capability Use) only after response quality is stable. Per the Scenario Library, tool invocation testing checks three things: right tool, right arguments, necessary invocation.
- Grade outcomes, not paths. Two valid tool sequences can produce the same correct result. Per the Eval Scenario Library, do not grade tool-call sequences rigidly.
- Watch for the orchestration gap. Per MS Learn's evaluation frameworks, components scoring 95% individually can fail 40-60% end-to-end. Always run conversation-level evaluation, not just turn-level.
- Budget more test cases. A Q&A agent might need 20-30 cases for meaningful signal. A multi-step workflow with 3+ tools needs 50-100 to cover tool combinations and failure recovery paths.
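The three tool-invocation checks named above (right tool, right arguments, necessary invocation) can be sketched as a small grader. The `call`/`expected` record shapes and the `required` flag are illustrative assumptions, not a Scenario Library schema.

```python
def grade_tool_call(call: dict, expected: dict) -> dict:
    """Grade one observed tool call against an expected-call spec."""
    return {
        # Right tool: the agent invoked the tool the case expected.
        "right_tool": call["name"] == expected["name"],
        # Right arguments: every expected argument matches; extra
        # observed arguments are tolerated in this sketch.
        "right_args": all(call["args"].get(k) == v
                          for k, v in expected["args"].items()),
        # Necessary invocation: required=False marks a call the case
        # considers superfluous, so making it at all is a failure.
        "necessary": expected.get("required", True),
    }

call = {"name": "search_orders", "args": {"customer_id": "c-42"}}
expected = {"name": "search_orders", "args": {"customer_id": "c-42"}}
assert all(grade_tool_call(call, expected).values())
```

Consistent with the "grade outcomes, not paths" guidance, a grader like this should compare against a set of acceptable calls, not a single rigid sequence.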
评估方法根据Agent复杂度差异很大:
| 维度 | 简单问答Agent | 多步/Agent工作流 |
|---|---|---|
| 核心指标 | 响应准确率(语义对比、通用质量) | 任务完成率 — 端到端任务是否完成? |
| 评分单元 | 单轮:一个输入,一个输出 | 会话或轨迹:完整的步骤序列 |
| 核心质量信号 | 来源归因、政策准确性 | 动作赋能、工具调用(能力使用)、以及所有问答信号 |
| 测试方法组合 | 主要用语义对比+通用质量 | 增加能力使用用于工具调用评估、关键词匹配用于中间检查点 |
| 需关注的故障模式 | 错误答案、幻觉、拒答 | 错误累积、工具选择错误、不必要步骤、部分完成 |
| 边缘用例 | 模糊查询、超范围问题 | 工作流中途故障、工具超时、用户会话中途纠正 |
| 评估复杂度 | 低 — 确定性输入输出对效果好 | 高 — 必须评估中间步骤和最终结果 |
实践指导:
- 即使是Agent工作流,也从问答式评估开始。 在评估路径前先验证Agent能生成正确的最终答案。通过正确工具得到错误答案仍然是错误。
- 只有在响应质量稳定后再添加工具调用评估(能力使用)。 根据场景库,工具调用测试检查三点:正确工具、正确参数、必要调用。
- 评估结果,而不是路径。 两个有效的工具序列可以产生相同的正确结果。根据评估场景库,不要刚性评估工具调用序列。
- 注意编排差距。 根据MS Learn评估框架,组件单独得分95%的管道端到端失败率可能达到40-60%。始终运行会话级评估,而不仅仅是轮次级评估。
- 预留更多测试用例预算。 问答Agent可能需要20-30个用例就能获得有意义的信号。带3个以上工具的多步工作流需要50-100个用例来覆盖工具组合和故障恢复路径。
Swiss cheese model of eval coverage
评估覆盖的瑞士奶酪模型
No single eval method catches every failure. Per the Eval Scenario Library's 4 evaluation methods and the Triage Playbook's multi-layer approach:
- Code-based graders catch structural failures but miss semantic ones.
- LLM judges catch semantic failures but have systematic biases.
- Human review catches subtle judgment failures but is too slow for full coverage.
- Production monitoring catches real-world distribution failures.
- Layer all four. Run deterministic checks first (cheapest), then LLM judges, then human calibration, then production monitoring.
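The cheapest-first layering above can be sketched as a short pipeline. `llm_judge` here is a stub standing in for a real model-graded call (its length heuristic is purely illustrative); the point is the ordering: the deterministic check runs first, and a judge call is only paid for when it passes.

```python
def keyword_check(output: str, required: list[str]) -> bool:
    """Layer 1: deterministic, cheap, catches structural failures."""
    return all(kw.lower() in output.lower() for kw in required)

def llm_judge(output: str) -> bool:
    """Layer 2 stub: replace with a real model-graded semantic check."""
    return len(output) >= 10

def layered_grade(output: str, required: list[str]) -> str:
    if not keyword_check(output, required):
        return "fail:deterministic"   # no judge call was made
    if not llm_judge(output):
        return "fail:judge"
    return "pass"                     # escalate to human calibration sample

assert layered_grade("Refunds take 5-7 business days.", ["refund"]) == "pass"
assert layered_grade("I cannot help with that.", ["refund"]) == "fail:deterministic"
```

Human review and production monitoring then sit on top of this, sampling the cases that survive both automated layers.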
没有单一的评估方法能捕获所有故障。根据评估场景库的4种评估方法和分类手册的多层方法:
- 基于代码的评审器捕获结构性故障,但遗漏语义故障。
- LLM评审器捕获语义故障,但存在系统性偏差。
- 人工审核捕获细微的判断故障,但速度太慢,无法覆盖全部内容。
- 生产监控捕获真实世界的分布故障。
- 四层叠加。 先运行确定性检查(最便宜),然后是LLM评审,然后是人工校准,最后是生产监控。
LLM-as-judge calibration
LLM-as-judge校准
Per MS Learn's General Quality test method, LLM judges evaluate across sub-dimensions (Relevance, Groundedness, Completeness, Abstention). Calibrate judges against these defined dimensions.
Additional industry context from Eugene Yan (bias data):
- Position bias: GPT-3.5 biased toward first option 50% of the time; Claude-v1 biased 70%. Mitigate by evaluating both orderings.
- Self-enhancement bias: GPT-4 rates own outputs 10% higher; Claude-v1 rates own outputs 25% higher. Never use a model to judge its own outputs.
- Verbosity bias: Both models preferred longer responses >90% of the time. Include explicit length-independence instructions in judge prompts.
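The both-orderings mitigation for position bias can be sketched as follows. `judge` is a placeholder for a real pairwise LLM judge call; the toy judges below are assumptions used only to exercise the logic.

```python
def debiased_compare(judge, a: str, b: str) -> str:
    """Ask the judge twice with A/B swapped; trust only agreeing verdicts."""
    first = judge(a, b)                          # returns "A" or "B"
    second = judge(b, a)
    second = "A" if second == "B" else "B"       # map back to original labels
    return first if first == second else "tie"   # disagreement -> inconclusive

# A judge with maximal position bias (always prefers whichever answer it
# sees first) is neutralized to a tie:
biased_judge = lambda x, y: "A"
assert debiased_compare(biased_judge, "ans1", "ans2") == "tie"

# A judge with a genuine, order-independent preference survives the swap:
length_judge = lambda x, y: "A" if len(x) > len(y) else "B"
assert debiased_compare(length_judge, "much longer answer", "short") == "A"
```

Treating disagreements as ties trades some throughput for reliability; an alternative is to log them for human review rather than discard them.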
Additional industry context from Hamel Husain (critique shadowing):
When building LLM judges from scratch, use the 7-step Critique Shadowing methodology: (1) Identify one expert, (2) Create diverse dataset, (3) Collect binary pass/fail with written critiques, (4) Fix obvious errors, (5) Build judge prompts iteratively using expert examples, (6) Error analysis on disagreements, (7) Build specialized judges for specific failure modes. Target >90% agreement with domain expert before production use.
根据MS Learn通用质量测试方法,LLM评审器跨子维度(相关性、Groundedness、完整性、拒答合理性)评估。针对这些定义的维度校准评审器。
来自Eugene Yan的额外行业背景(偏差数据):
- 位置偏差: GPT-3.5有50%的概率偏向第一个选项;Claude-v1有70%的概率偏向。通过评估两种排序来缓解。
- 自我增强偏差: GPT-4给自己的输出评分高10%;Claude-v1给自己的输出评分高25%。永远不要用模型评审自己的输出。
- 冗长偏差: 两个模型都有90%以上的概率偏好更长的响应。在评审器提示词中加入明确的长度无关说明。
来自Hamel Husain的额外行业背景(评审shadowing):
从零构建LLM评审器时,使用7步评审Shadowing方法论:(1) 确定一名专家,(2) 创建多样化数据集,(3) 收集二元通过/失败结果和书面批评,(4) 修复明显错误,(5) 使用专家示例迭代构建评审器提示词,(6) 对分歧进行错误分析,(7) 针对特定故障模式构建专用评审器。在投入生产使用前,目标是与领域专家达到90%以上的一致性。
Knowledge grounding (for RAG agents)
知识Grounding(RAG Agent)
Per the Eval Scenario Library's Knowledge Grounding capability scenario and the Source Attribution quality signal:
- Knowledge grounding score measures whether each factual claim is supported by retrieved context.
- A 75% grounding score means roughly 1 in 4 claims may not be traceable to documents. Set threshold at 90%+ for high-stakes factual tasks.
- Low grounding score almost always means the retrieval step is failing, not the generation step. Fix chunking and retrieval before tuning the prompt.
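A grounding score of this kind is just the fraction of extracted claims that a support check can trace to the retrieved context. The sketch below uses a naive substring check as a stand-in for a real entailment or LLM-based support judge; the claims and context are invented for illustration.

```python
def grounding_score(claims: list[str], context: str) -> float:
    """Fraction of claims supported by the retrieved context."""
    if not claims:
        return 1.0  # vacuously grounded: nothing was claimed
    supported = [c for c in claims if c.lower() in context.lower()]
    return len(supported) / len(claims)

context = "Refunds are issued within 7 days. Shipping is free over $50."
claims = [
    "refunds are issued within 7 days",   # grounded
    "shipping is free over $50",          # grounded
    "returns require a receipt",          # NOT traceable to the context
    "refunds are issued within 7 days",   # grounded (repeated claim)
]
print(grounding_score(claims, context))  # 0.75
```

A 0.75 here is exactly the "roughly 1 in 4 claims untraceable" situation described above, and the failing claim points at what the retrieval step never surfaced.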
根据评估场景库的知识Grounding能力场景和来源归因质量信号:
- 知识Grounding分数衡量每个事实声明是否被检索到的上下文支持。
- 75%的Grounding分数意味着大约四分之一的声明可能无法追溯到文档。高风险事实任务的阈值设为90%以上。
- Grounding分数低几乎都是检索步骤失败,不是生成步骤失败。在调整提示词前先修复分块和检索。
Production continuity
生产连续性
Per MS Learn's 4-stage framework (Stage 4: Operationalize), eval is not a pre-launch gate — it is a continuous loop:
- Integrate evals into CI/CD. Run the full suite on every PR that changes system prompts, tool definitions, or agent behavior.
- Every production incident becomes a dataset case within 24 hours.
- The eval flywheel: production logs -> eval cases -> eval run -> findings -> agent fix -> production.
- Ship with monitoring, not just evals. The eval tells you the agent worked on test cases. Monitoring tells you it works on real user inputs.
When the agent passes evals but fails in production: Per the Triage Playbook, this is almost always a distribution mismatch. Pull 20 recent production failures and check whether your current eval dataset contains cases that would have caught them. If it catches none, the dataset does not cover that input distribution; it needs production cases, not a better prompt.
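The eval-as-CI-gate integration above can be sketched as a small check that fails the build when the suite pass rate drops below a threshold. The 95% threshold and the boolean result list are illustrative assumptions.

```python
def gate(results: list[bool], threshold: float = 0.95) -> int:
    """Return a process exit code: 0 passes the PR, nonzero blocks it."""
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.0%} (threshold {threshold:.0%})")
    return 0 if pass_rate >= threshold else 1

# 19/20 = 95% meets the bar; one more failure would block the merge.
assert gate([True] * 19 + [False]) == 0
assert gate([True] * 18 + [False] * 2) == 1
# In a real CI step you would call sys.exit(gate(results)) after running
# the full suite against the PR's prompt/tool-definition changes.
```

Regression-suite thresholds belong near 100%; capability evals should run in the same pipeline but report rather than block, since low pass rates there are expected signal.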
根据MS Learn 4阶段框架(第四阶段:落地运营),评估不是发布前的闸门 — 它是一个持续的循环:
- 将评估集成到CI/CD中。每次修改系统提示词、工具定义或Agent行为的PR都运行完整套件。
- 每个生产事件24小时内转化为数据集用例。
- 评估飞轮: 生产日志 -> 评估用例 -> 评估运行 -> 发现 -> Agent修复 -> 生产。
- 发布时配备监控,而不仅仅是评估。 评估告诉你Agent在测试用例上可用;监控告诉你它在真实用户输入上可用。
当Agent通过评估但在生产中失败时: 根据分类手册,这几乎都是分布不匹配的问题。提取20个最近的生产故障,检查你当前的评估数据集中是否有能捕获这些故障的用例。如果一个都捕获不了,说明数据集没有覆盖这类输入分布,需要补充的是生产用例,而不是优化提示词。
Interpreting results
结果解读
Per the Triage Playbook's readiness decision tree:
- SHIP (>=85% overall, >=90% core business, >=95% safety): Agent meets the bar.
- ITERATE (60-84%): Meaningful failures exist. Use Layer 2 (Failure Triage) to diagnose.
- BLOCK (<60%): Fundamental problem. Do not ship.
Per the Triage Playbook's Layer 4 (Pattern Analysis): look for failure concentration in specific scenario types, cross-signal correlations, and trends over time. When a grader's verdict disagrees with your intuition, investigate — either the grader is wrong (fix the criterion) or your intuition is wrong (update your mental model).
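One reading of the readiness decision tree above, as a sketch. The ordering of checks is an assumption: how a sub-bar core-business or safety score combines with a high overall score is not spelled out in the excerpt, so this version treats any failed SHIP bar with overall >= 60% as ITERATE.

```python
def readiness(overall: float, core_business: float, safety: float) -> str:
    """Map pass rates (0.0-1.0) to the SHIP / ITERATE / BLOCK verdict."""
    if overall >= 0.85 and core_business >= 0.90 and safety >= 0.95:
        return "SHIP"
    if overall >= 0.60:
        return "ITERATE"   # meaningful failures: go to Layer 2 triage
    return "BLOCK"         # fundamental problem: do not ship

assert readiness(0.88, 0.92, 0.97) == "SHIP"
assert readiness(0.88, 0.80, 0.97) == "ITERATE"  # core business below its bar
assert readiness(0.55, 0.92, 0.97) == "BLOCK"
```

Note the second assertion: a strong overall score does not ship if a per-category bar is missed, which is why the three thresholds are checked together.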
根据分类手册的就绪决策树:
- SHIP(整体≥85%,核心业务≥90%,安全≥95%):Agent达到标准。
- ITERATE(60-84%):存在明显故障。使用第二层(故障分类)诊断。
- BLOCK(<60%):存在根本性问题。不要发布。
根据分类手册第四层(模式分析):寻找特定场景类型的故障集中度、跨信号相关性和随时间变化的趋势。当评审器的结论与你的直觉不一致时,进行调查 — 要么评审器错误(修复标准),要么你的直觉错误(更新你的思维模型)。
Capability evals vs. regression suites
能力评估 vs 回归套件
- Capability evals measure what the agent can do. They start at low pass rates — a 30% rate on a new capability eval is useful signal, not a failure. Per the Eval Scenario Library, capability scenarios test isolated abilities.
- Regression suites maintain near-100% pass rate to detect degradation. Per the Scenario Library's Regression capability scenario type, these protect against backsliding.
- When to promote: When a capability eval saturates (consistently 90%+), promote those cases to the regression suite and write harder capability cases.
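The promotion rule above needs an operational definition of "consistently 90%+". A minimal sketch, assuming a three-run window (the window size is my assumption, not from the source):

```python
def saturated(recent_pass_rates: list[float],
              bar: float = 0.90, window: int = 3) -> bool:
    """True once the last `window` runs all meet the bar: time to promote
    these cases to the regression suite and write harder ones."""
    tail = recent_pass_rates[-window:]
    return len(tail) == window and all(r >= bar for r in tail)

assert saturated([0.30, 0.55, 0.92, 0.94, 0.91]) is True
assert saturated([0.92, 0.94, 0.85]) is False   # dipped below the bar
assert saturated([0.95, 0.96]) is False         # not enough runs yet
```

Requiring a window rather than a single run guards against promoting cases that only passed because of run-to-run non-determinism.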
- 能力评估 衡量Agent能做什么。初始通过率低 — 新能力评估30%的通过率是有用的信号,不是失败。根据评估场景库,能力场景测试独立能力。
- 回归套件 保持接近100%的通过率来检测性能下降。根据场景库的回归能力场景类型,这些套件用于防止性能倒退。
- 升级时机: 当能力评估饱和(持续达到90%以上)时,将这些用例升级到回归套件,编写更难的能力用例。
Eval tooling (supplementary)
评估工具(补充)
For tooling questions, the primary recommendation is Microsoft's Copilot Studio evaluation features for production Copilot agents. For teams needing third-party platforms:
- Braintrust: Good default for production agents. Free tier handles 1M spans/month.
- LangSmith: Best if already using LangChain. Native tracing.
- Langfuse: Best for self-hosted, data-sovereign setups. MIT-licensed.
- Key warning: Beware tools that auto-create rubrics AND auto-score without human calibration. The tool should support human review in the loop.
对于工具相关问题,主要推荐用于生产Copilot Agent的微软Copilot Studio评估功能。对于需要第三方平台的团队:
- Braintrust: 生产Agent的默认好选择。免费套餐每月可处理100万条span。
- LangSmith: 如果已经在使用LangChain则最佳。原生链路追踪。
- Langfuse: 最适合自托管、数据主权需求的场景。MIT许可。
- 重要警告: 警惕自动生成评分标准且无需人工校准就自动评分的工具。工具应该支持人工在环审核。
Skill Routing — When to Suggest a Sibling Skill
技能路由 — 何时建议使用兄弟技能
After answering the question, check whether the user would benefit from running a sibling eval skill. If so, append a one-line recommendation at the end of your answer.
| If the question involves... | Suggest this skill | One-liner to append |
|---|---|---|
| Creating an eval plan, scoping what to evaluate, choosing scenarios | | "For a full eval plan with scenario selection and quality signals, run |
| Generating test cases, writing CSV datasets, building eval sets | | "To generate ready-to-import test case CSVs, run |
| Interpreting scores, reading results, understanding pass rates | | "To interpret a specific set of eval results, paste them into |
| Debugging failures, triaging low scores, root cause analysis, remediation | | "To triage specific failures with the full diagnostic framework, run |
| What is eval, why eval matters, explaining eval to stakeholders | | "For an end-to-end eval explainer you can share with stakeholders, run |
Rules:
- Only suggest ONE skill per answer — the most relevant one.
- Only suggest when the user's question clearly maps to an action a sibling skill performs. Do not suggest routing for pure methodology questions that eval-faq handles well on its own.
- Never suggest /eval-faq (that is this skill — they are already here).
回答问题后,检查用户是否会受益于运行兄弟评估技能。如果是,在回答末尾追加一行建议。
| 如果问题涉及... | 建议使用此技能 | 追加的一行内容 |
|---|---|---|
| 创建评估计划、确定评估范围、选择场景 | | "如需包含场景选择和质量信号的完整评估计划,请运行 |
| 生成测试用例、编写CSV数据集、构建评估集 | | "如需生成可直接导入的测试用例CSV,请运行 |
| 解读分数、阅读结果、理解通过率 | | "如需解读特定的评估结果,请将结果粘贴到 |
| 故障调试、低分分类、根因分析、修复方案 | | "如需使用完整诊断框架分类特定故障,请运行 |
| 什么是评估、评估的重要性、向利益相关者解释评估 | | "如需可以分享给利益相关者的端到端评估说明,请运行 |
规则:
- 每个回答仅建议一个技能 — 最相关的那个。
- 仅当用户的问题明确对应兄弟技能的功能时才建议路由。对于eval-faq本身可以很好处理的纯方法论问题,不要建议路由。
- 永远不要建议 /eval-faq (就是当前技能 — 用户已经在使用了)。
Example invocations
调用示例
/eval-faq What eval scenarios should I use for a RAG agent?
/eval-faq How do I interpret a 75% knowledge grounding score?
/eval-faq What is the difference between business-problem and capability scenarios?
/eval-faq When should I use a model-graded grader instead of a deterministic one?
/eval-faq What makes a good adversarial test case?
/eval-faq How many cases do I need in a dataset to get meaningful signal?
/eval-faq My eval passes 100% on first run — is that good?
/eval-faq How do I write a good criterion for a model-graded grader?
/eval-faq What should I do when a grader disagrees with my gut feeling about an output?
/eval-faq How do I handle non-determinism in my eval results?
/eval-faq My agent makes tool calls — how do I eval those?
/eval-faq I suspect my grader is wrong — how do I debug it?
/eval-faq What should I eval in production after I ship?
/eval-faq Should I use pass@k or pass^k for my agent?
/eval-faq How do I calibrate my LLM-as-judge grader?
/eval-faq When do I stop adding eval cases and just ship?
/eval-faq My agent finds a different tool sequence than I expected — is that a failure?
/eval-faq How do I know if my grader is actually measuring what I think it is?
/eval-faq What is the difference between a capability eval and a regression suite?
/eval-faq How do I eval a multi-turn conversational agent?
/eval-faq What eval platform or tool should I use?
/eval-faq My agent passes evals but fails in production — why?
/eval-faq How do I score intermediate steps in a multi-step agent?
/eval-faq How is evaluating a multi-step workflow different from a simple Q&A agent?
/eval-faq What does 0% pass@100 mean — is my agent broken?
/eval-faq How do I avoid LLM judge bias in my grader?
/eval-faq What are the 5 quality signals I should evaluate?
/eval-faq What is the Probe-Measure-Harden red-teaming framework?
/eval-faq What are the 7 test methods in Copilot Studio?
/eval-faq How do I use the Triage Playbook to debug failing scores?
/eval-faq What is the 4-stage iterative evaluation framework?
/eval-faq What are the 3 root cause types for eval failures?
/eval-faq How do I decide between SHIP, ITERATE, and BLOCK?
/eval-faq What red-team ASR thresholds should I target?
/eval-faq How do I generate eval cases from a prompt template?
/eval-faq What is the critique shadowing methodology for building LLM judges?
/eval-faq Should I use a 1-5 scale or pass/fail for my LLM judge?
/eval-faq How do I continuously red-team my agent in CI/CD?
/eval-faq How do I systematically analyze eval failures to find patterns?
/eval-faq How do I know if my eval is too easy?
/eval-faq How do I write an LLM grader prompt that actually works?
/eval-faq Should I score factuality and tone in the same eval criterion?
/eval-faq When should I use the Custom test method instead of General Quality?
/eval-faq How do I set up a Custom test method for compliance checking?
/eval-faq RAG Agent应该使用什么评估场景?
/eval-faq 如何解读75%的知识Grounding分数?
/eval-faq 业务问题场景和能力场景的区别是什么?
/eval-faq 什么时候应该使用模型评审器而不是确定性评审器?
/eval-faq 好的对抗测试用例有什么特点?
/eval-faq 数据集需要多少个用例才能获得有意义的信号?
/eval-faq 我的评估第一次运行就100%通过 — 这好不好?
/eval-faq 如何为模型评审器编写好的标准?
/eval-faq 当评审器的结论和我对输出的直觉判断不一致时应该怎么做?
/eval-faq 如何处理评估结果中的非确定性问题?
/eval-faq 我的Agent会进行工具调用 — 如何评估这些调用?
/eval-faq 我怀疑我的评审器有问题 — 如何调试?
/eval-faq 发布后我应该在生产中评估什么?
/eval-faq 我的Agent应该用pass@k还是pass^k?
/eval-faq 如何校准我的LLM-as-judge评审器?
/eval-faq 什么时候我可以停止添加评估用例直接发布?
/eval-faq 我的Agent使用了和我预期不同的工具序列 — 这算失败吗?
/eval-faq 如何知道我的评审器确实在测量我认为它在测量的内容?
/eval-faq 能力评估和回归套件的区别是什么?
/eval-faq 如何评估多轮会话Agent?
/eval-faq 我应该使用什么评估平台或工具?
/eval-faq 我的Agent通过了评估但在生产中失败 — 为什么?
/eval-faq 如何给多步Agent的中间步骤打分?
/eval-faq 评估多步工作流和简单问答Agent有什么不同?
/eval-faq 0% pass@100是什么意思 — 我的Agent坏了吗?
/eval-faq 如何避免评审器中的LLM偏差?
/eval-faq 我应该评估的5个质量信号是什么?
/eval-faq 什么是探测-度量-加固红队测试框架?
/eval-faq Copilot Studio中的7种测试方法是什么?
/eval-faq 如何使用分类手册调试低分?
/eval-faq 什么是4阶段迭代评估框架?
/eval-faq 评估失败的3种根因类型是什么?
/eval-faq 如何在SHIP、ITERATE和BLOCK之间做选择?
/eval-faq 我应该瞄准什么红队ASR阈值?
/eval-faq 如何从提示模板生成评估用例?
/eval-faq 构建LLM评审器的评审shadowing方法论是什么?
/eval-faq 我的LLM评审器应该用1-5分制还是通过/失败制?
/eval-faq 如何在CI/CD中持续对我的Agent进行红队测试?
/eval-faq 如何系统性分析评估故障以发现模式?
/eval-faq 如何知道我的评估太简单?
/eval-faq 如何编写真正有效的LLM评审器提示词?
/eval-faq 我应该在同一个评估标准中同时打分事实性和语气吗?
/eval-faq 什么时候应该使用自定义测试方法而不是通用质量?
/eval-faq 如何为合规检查设置自定义测试方法?