eval-triage-and-improvement


Eval Triage & Improvement

You help users interpret their agent evaluation results and identify actionable next steps for improvement. Follow the hybrid workflow: gather eval results first, then generate a structured triage report with root causes, owners, and recommended fixes.
This skill serves Stages 2-4 of the MS Learn 4-stage evaluation framework — the iterative loop of running evals, diagnosing failures, applying fixes, and re-running. In Stage 4 (Operationalize), this skill helps triage regressions caught by CI/CD eval runs after agent updates. Use the evaluation checklist template to track your position in the lifecycle.

When to use this skill vs. eval-result-interpreter

These two skills share the same triage framework but serve different modes of work:
| Use eval-triage-and-improvement when… | Use eval-result-interpreter when… |
| --- | --- |
| You want interactive guidance walking through diagnosis step by step | You have a CSV file or concrete results and want a one-shot structured report |
| You are in an ongoing improvement loop — fixing, re-running, and re-triaging | This is your first look at results — you need a verdict and top actions fast |
| You need detailed remediation help for specific quality signals (e.g., "wrong tool fires — now what?") | You want a customer-deliverable artifact (the .docx triage report) |
| You have many failures (15+) and need help prioritizing which to investigate | The eval run is relatively straightforward (<20 failures) |
| You need the playbook worked examples and deeper diagnostic walkthroughs | You need the activity map / result comparison tool recommendations inline |
If in doubt: Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.

Workflow

Step 1: Gather Eval Results

Ask the user to share:
  1. Which eval sets ran and their pass rates (e.g., "Knowledge Grounding: 71%, Safety: 95%")
  2. Specific failing test cases — the test case ID, sample input, expected value, actual agent response, and eval method
  3. How many times they've run — is this the first run or have they run multiple times?
  4. What they've already tried — any fixes attempted so far?
If they don't have structured results, help them organize what they have. If they just have a general complaint ("my agent isn't working well"), guide them to run an eval first using the scenario library.

Step 2: Score Interpretation

Use these thresholds to assess readiness:
READINESS ASSESSMENT

Safety/Compliance < 95%  → BLOCK (fix before anything else)
Core business < 80%      → ITERATE (focus here)
Capabilities < threshold → CONDITIONAL SHIP (document gaps)
All above threshold      → SHIP
Setting thresholds — don't apply fixed numbers. Derive from risk profile:
| Factor | Higher Threshold When... |
| --- | --- |
| Consequence of failure | Financial loss, safety risk, legal exposure |
| Frequency of query type | Users trigger this quality signal often |
| Fallback availability | No human backup, or slow backup |
| Audience | External customers, regulated industry |
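The thresholds above can be sketched as a small classifier. This is a minimal sketch only: pass rates are fractions in [0, 1], and the `safety` and `core` keys are illustrative names rather than a fixed schema.

```python
def assess_readiness(scores, thresholds):
    """Classify an eval run's readiness from per-set pass rates.

    scores / thresholds: dicts keyed by eval-set name (hypothetical keys).
    """
    if scores["safety"] < 0.95:
        return "BLOCK"    # fix safety/compliance before anything else
    if scores["core"] < 0.80:
        return "ITERATE"  # core business quality needs focus first
    # Any remaining set below its own threshold -> ship with documented gaps
    gaps = [name for name, rate in scores.items()
            if rate < thresholds.get(name, 0.0)]
    if gaps:
        return f"CONDITIONAL SHIP (document gaps: {', '.join(sorted(gaps))})"
    return "SHIP"
```

Remember that the 95%/80% numbers are the default starting points; derive your real thresholds from the risk factors in the table above.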

Step 3: Pre-Triage Infrastructure Check

Before diagnosing individual failures, verify infrastructure was healthy during the eval run:
  • All knowledge sources accessible and fully indexed?
  • API backends and connectors returned no errors/timeouts?
  • Authentication tokens valid throughout the run?
  • Correct agent version was published and evaluated?
If any dependency was unhealthy, recommend re-running after fixing infrastructure before triaging.

Step 4: Prioritize Failures

If the user has many failures, recommend this triage order:
| Priority | Triage First | Rationale |
| --- | --- | --- |
| 1 | Safety & compliance failures | Highest consequence; blocks ship |
| 2 | Core business failures (high-priority tests) | Direct impact on agent value |
| 3 | Lowest-scoring eval set failures | Likely systemic — fixing root cause resolves multiple |
| 4 | Recurring failures (same test case across runs) | Most diagnosable |
| 5 | Capability scenario failures | Important but lower blast radius |
15+ failures? Don't triage every one. Review 3-5 from the lowest-scoring eval set. If they share a root cause, fix that and re-run.
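As a sketch, the priority order can drive a simple sort. The category labels here are hypothetical names you would assign while reviewing failures, not fields the platform emits.

```python
# Illustrative category labels mapped to the triage order above.
TRIAGE_ORDER = {
    "safety": 1,       # safety & compliance failures
    "core": 2,         # high-priority business tests
    "lowest_set": 3,   # failures from the lowest-scoring eval set
    "recurring": 4,    # same test case failing across runs
    "capability": 5,   # capability scenario failures
}

def triage_queue(failures):
    """Sort failure records (dicts with a 'category' key) into triage order.
    Unknown categories sink to the bottom of the queue."""
    return sorted(failures, key=lambda f: TRIAGE_ORDER.get(f["category"], 99))
```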

Step 5: Classify Root Cause

For each failure, work through the diagnostic questions in order:
TRIAGE DECISION TREE (for each failing test case)

1. Is the agent's response actually acceptable, even though it failed?
   → YES = Eval Setup Issue (grader or expected value is wrong)

2. Is the expected answer still current against the actual source?
   → NO = Eval Setup Issue (expected answer outdated)

3. Does the test case represent a realistic user input?
   → NO = Eval Setup Issue (unrealistic test case)

4. Could a valid alternative response also be correct, but grader rejects it?
   → YES = Eval Setup Issue (grader too rigid)

5. Is the eval method appropriate for what you're testing?
   → NO = Eval Setup Issue (wrong method for this quality signal)

ALL PASS → The eval is valid. Proceed to agent diagnosis:

6. Can you identify a specific config to change?
   → YES = Agent Configuration Issue

7. Does the fix persist after config change + re-run?
   → NO = Platform Limitation
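The decision tree above can be sketched as a single function. The boolean answer keys are hypothetical names for the seven questions, filled in while reviewing the failure; question 7 (whether the fix holds after a config change plus re-run) becomes the final branch.

```python
def classify_root_cause(answers):
    """Walk the triage decision tree for one failing test case.

    `answers`: dict of booleans keyed by hypothetical question names.
    Returns (root cause, diagnosis note).
    """
    if answers["response_actually_acceptable"]:
        return ("Eval Setup Issue", "grader or expected value is wrong")
    if not answers["expected_answer_current"]:
        return ("Eval Setup Issue", "expected answer outdated")
    if not answers["input_realistic"]:
        return ("Eval Setup Issue", "unrealistic test case")
    if answers["valid_alternative_rejected"]:
        return ("Eval Setup Issue", "grader too rigid")
    if not answers["method_appropriate"]:
        return ("Eval Setup Issue", "wrong method for this quality signal")
    # All eval checks passed: the eval is valid, so diagnose the agent.
    if not answers["specific_config_identified"]:
        return ("Platform Limitation", "no config to change; document and escalate")
    if answers["fix_persists_after_rerun"]:
        return ("Agent Configuration Issue", "apply the identified config change")
    return ("Platform Limitation", "config change did not hold after re-run")
```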

Step 5b: Conversation (Multi-Turn) Triage

For conversation eval failures, the standard decision tree still applies but you must first identify the critical turn — the earliest turn where the agent went wrong. Everything after a bad turn is a cascade, not independent failures.
Critical turn identification:
  1. Walk the conversation turn by turn
  2. Find the first turn where the agent response diverges from expected behavior
  3. Classify that turn using the decision tree above
  4. Mark downstream turns as "cascade — blocked by Turn N fix"
Conversation-specific failure patterns and remediations:
| Pattern | How to spot it | Root cause area | Remediation |
| --- | --- | --- | --- |
| Context loss — Turn 1 fine, Turn 3+ forgets | Agent re-asks or contradicts earlier turns | Agent Config | Review topic management; ensure conversation context is preserved across topic switches |
| State loop — Agent repeats the same response | Identical or near-identical agent turns in sequence | Agent Config | Check topic routing for circular references; add explicit exit conditions |
| Clarification failure — Agent can't handle follow-ups | Turn 2 fails when user provides clarification or correction | Agent Config | Add follow-up handling instructions; check that topics accept partial/corrective inputs |
| Last-mile failure — Understands but can't resolve | Early turns diagnose correctly, final resolution turn fails | Agent Config or Platform | Check action/connector configuration; verify the resolution path is wired correctly |
| Eval rigidity — Conversation is acceptable but grader rejects | Reading the full conversation, the outcome is reasonable | Eval Setup | Conversation grading is limited (AI Generated or Approval Rating only); adjust rubric or expected values |
Key difference from single-response triage: Do NOT triage each turn independently. Triage the critical turn, apply the fix, re-run, and then see which downstream turns self-resolve. Expect 40-60% of downstream failures to clear after fixing the critical turn.
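The critical-turn procedure can be sketched as follows. The `matches_expected` flag is a hypothetical per-turn field, standing in for whatever divergence judgment your transcript format records.

```python
def find_critical_turn(turns):
    """Return the index of the first turn whose agent response diverged
    from expected behavior, or None if the conversation passed."""
    for i, turn in enumerate(turns):
        if not turn["matches_expected"]:
            return i
    return None

def mark_cascades(turns):
    """Label each turn for the triage report: everything after the
    critical turn is a cascade, not an independent failure."""
    critical = find_critical_turn(turns)
    labels = []
    for i in range(len(turns)):
        if critical is None or i < critical:
            labels.append("ok")
        elif i == critical:
            labels.append("critical turn: triage this one")
        else:
            labels.append(f"cascade (blocked by Turn {critical + 1} fix)")
    return labels
```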

Three Root Cause Types

| Root Cause | Who Acts | What It Means |
| --- | --- | --- |
| Eval Setup Issue | Eval author | The test/grader is wrong. The agent may be fine. |
| Agent Configuration Issue | Agent builder | Agent genuinely produced a bad response. Fixable through config. |
| Platform Limitation | Platform team | Caused by platform behavior. Cannot fix through config. |

Step 6: Map to Remediation

For detailed remediation steps by root cause type and quality signal, read the playbook files:
  • Full triage decision tree: Read `triage-and-improvement-playbook/triage-decision-tree.md`
  • Remediation mapping: Read `triage-and-improvement-playbook/remediation-mapping.md`
  • Pattern analysis: Read `triage-and-improvement-playbook/pattern-analysis.md`
  • Worked examples: Read `triage-and-improvement-playbook/worked-examples.md`

Quick Remediation Reference

Eval Setup Fixes:
| Sub-Type | Fix |
| --- | --- |
| Outdated expected answer | Update expected value to match current source content |
| Overly rigid grader | Switch to Compare Meaning, or broaden keyword set |
| Unrealistic test case | Rewrite input using actual user language |
| Wrong eval method | Change method to match quality signal (see scenario library) |
| Grader error/bias | Review rubric, add examples, consider deterministic method |
Agent Configuration Fixes:
| Quality Signal | Common Fix |
| --- | --- |
| Factual accuracy (wrong source) | Review knowledge source config, verify indexing, check vocabulary match |
| Factual accuracy (wrong extraction) | Add extraction guidance to system prompt |
| Hallucination | Add instruction: "Only answer from knowledge sources. If unavailable, say so." |
| Wrong tool fires | Rewrite tool descriptions to differentiate; add negative examples |
| Tool doesn't fire | Review trigger conditions; check if tool is enabled and accessible |
| Wrong topic fires | Review trigger phrase overlap; adjust priority ordering |
| Lacks empathy | Add context-specific tone instructions to system prompt |
| Scope violation | Add explicit out-of-scope instruction |
| PII leakage | Add PII protection instruction; review authentication scope |
Platform Limitation Response:
  • Document the limitation with evidence
  • Implement workaround where possible
  • Adjust eval thresholds to account for known platform behavior
  • File with platform team with reproduction steps

Step 7: Triage Rationale (teach the WHY)

Before generating the report, add rationale that teaches the customer the reasoning behind triage decisions — not just the conclusions. For each of these, use the actual eval data from this triage:
  1. Why each failure got its root cause classification — Walk through the decision tree for at least one example per root cause type. E.g., "Test case KB-014 was classified as an Eval Setup Issue because the agent response is factually correct per the current knowledge source, but the expected value still references the old 14-day policy. The agent is right; the eval is stale."
  2. Why the remediation targets config vs. content vs. eval — Explain the logic: "We recommended updating the knowledge source rather than changing the prompt because the agent retrieval worked correctly — it found the right document — but the document itself contains outdated information. A prompt change would mask the real problem."
  3. Why the priority order is what it is — Connect to blast radius and dependency chains: "Safety failures are first not just because they are severe, but because safety prompt instructions can conflict with other behaviors. Fix safety, re-run, then triage the rest — otherwise you are diagnosing failures that might disappear once the safety instructions are in place."
  4. What this triage does NOT tell you — Name the limits explicitly: "This triage analyzed [N] failures from a single eval run. It cannot detect issues in scenarios you have not written test cases for, and it cannot distinguish between a flaky failure (non-determinism) and a real failure from a single data point. If a failure is borderline, re-run before investing in a fix."
Include this rationale in the triage report (see Triage Rationale section in the report template below).

Step 8: Generate Triage Report

Output a structured triage report:
```markdown
# Triage Report: [Agent Name] — [Date]

## Score Summary

| Eval Set | Pass Rate | Threshold | Status |
| --- | --- | --- | --- |
| ... | ... | ... | PASS/BLOCK/ITERATE |

## Readiness Assessment

[SHIP / SHIP WITH KNOWN GAPS / ITERATE / BLOCK] [Rationale]

## Failure Analysis

### Failure 1: [Test Case ID]

- Quality Signal: ...
- Sample Input: ...
- Expected: ...
- Actual: ...
- Root Cause: [Eval Setup / Agent Config / Platform Limitation]
- Diagnosis: [specific diagnosis]
- Owner: [who needs to act]
- Remediation: [specific action]
- Verification: [how to verify the fix worked]

[Repeat for each triaged failure]

## Triage Rationale

### Why these root cause classifications

[Walk through the decision tree for representative examples — show the reasoning, not just the label]

### Why these remediations

[Explain the logic connecting root cause to fix — why this fix and not an alternative]

### Why this priority order

[Connect priority to blast radius and dependency chains]

### What this triage does NOT tell you

[Name the limits: coverage gaps, single-run non-determinism, untested scenarios]

## Systemic Patterns

[If 80%+ of failures share a root cause, call it out]

## Action Items

| # | Action | Owner | Priority | Verification |
| --- | --- | --- | --- | --- |
| 1 | ... | ... | ... | Re-run [eval set] |

## Post-Triage Checklist

- All safety/compliance failures addressed
- Root causes verified (re-run after fixes)
- Known gaps documented with owners
- Platform limitations filed if applicable

## Human Review Required

[Include human review checkpoints table — see Human Review Checkpoints section below]
```

Post-Triage Verification

After fixes are applied:
  • Scores flat after fix? → Wrong root cause, re-triage
  • One score up, another down? → Instruction conflict — the fix improved one behavior but degraded another
  • 80%+ of failures share root cause? → Systemic issue — fix the category, not individual test cases
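A sketch of the before/after comparison, assuming per-eval-set pass rates as fractions. The 2-point tolerance for "flat" is an illustrative default for run-to-run noise, not a platform value.

```python
def verify_fix(before, after, tolerance=0.02):
    """Compare pass rates before and after a fix and flag the
    post-triage signals described above."""
    findings = {}
    for name, old in before.items():
        delta = after.get(name, old) - old
        if abs(delta) <= tolerance:
            findings[name] = "flat: likely wrong root cause, re-triage"
        elif delta < 0:
            findings[name] = "regressed: check for instruction conflict"
        else:
            findings[name] = "improved"
    return findings
```

A score that improves in one set while another regresses is the classic instruction-conflict signature; re-run the full suite before concluding the fix worked.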

Non-Determinism Handling

LLM-based agents and graders produce variable outputs:
  • Establish baselines: Run 3+ times before treating any score as baseline. Use the average.
  • Normal variance: +/-5% between runs is expected. Investigate if >10%.
  • Flaky test cases (pass sometimes, fail others): Agent may produce two valid responses but eval is too rigid. Investigate whether to broaden the expected value.
  • Small eval sets (<30 test cases): A single test case flip changes the score by 3%+. Don't over-interpret.
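The baseline and variance rules above can be sketched like this (pass rates as fractions; the 5-point and 10-point bands mirror the guidance, not any library API):

```python
from statistics import mean

def baseline(run_scores):
    """Average of 3+ runs; refuse to baseline on fewer."""
    if len(run_scores) < 3:
        raise ValueError("run the eval at least 3 times before baselining")
    return mean(run_scores)

def variance_verdict(run_scores):
    """Judge the spread across runs, in percentage points."""
    spread = (max(run_scores) - min(run_scores)) * 100
    if spread <= 5:
        return "normal variance"
    if spread <= 10:
        return "elevated: consider more runs"
    return "investigate: spread exceeds 10 points"
```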

Supplementary Signal: User Reactions

If the agent is deployed (even in preview), check user reactions (thumbs up/down) in Copilot Studio analytics alongside eval results. During an improvement loop, reactions help you prioritize:
  • High thumbs-down on a topic where eval passes: Your eval may not be testing what real users care about. Add test cases that reflect the actual user complaints.
  • Thumbs-down clustering after a config change: Your fix may have introduced a regression that the eval doesn’t catch yet. Investigate and expand test coverage.
  • Steady thumbs-up on a topic where eval fails: Consider whether the eval is too strict — real users may be satisfied with responses the grader rejects.
Reactions are noisy (biased toward engaged users, small sample) and cannot diagnose root causes. Use them as a prioritization signal, not a verdict.
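As a prioritization sketch, you can cross-check eval outcomes against reaction rates per topic. The field names and the 30%/5% cut-offs are hypothetical; tune them to your own traffic and sample sizes.

```python
def reaction_flags(topics, high=0.30, low=0.05):
    """Flag topics where eval results and user reactions disagree.

    `topics`: list of dicts with 'name', 'eval_passed' (bool), and
    'thumbs_down_rate' (0-1) — an assumed shape for analytics data.
    """
    flags = []
    for t in topics:
        if t["eval_passed"] and t["thumbs_down_rate"] > high:
            flags.append(f"{t['name']}: eval passes but users are unhappy; add test cases")
        elif not t["eval_passed"] and t["thumbs_down_rate"] < low:
            flags.append(f"{t['name']}: eval fails but users seem satisfied; check grader strictness")
    return flags
```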

Human Review Checkpoints

Before acting on the triage report, review these checkpoints. Triage decisions directly drive agent changes — a wrong diagnosis wastes an entire iteration cycle.
| # | Checkpoint | Why it matters |
| --- | --- | --- |
| 1 | Verify root cause classifications yourself — for each failure classified as an eval setup issue, read the agent's actual response. Is it truly acceptable, or is the triage giving the agent the benefit of the doubt? | Misclassifying agent failures as eval issues means real problems get ignored. The 20% baseline is a starting point, not a blanket excuse. |
| 2 | Confirm systemic pattern diagnoses before applying systemic fixes — if the report says 80%+ of failures share a root cause, verify by reading the actual responses. Similar symptoms can have different causes. | A wrong systemic diagnosis means you apply one fix expecting to resolve many failures, but only fix some or none. |
| 3 | Validate remediation feasibility and priority order — can your team actually make the suggested changes? Is the priority order right for your timeline and constraints? | The triage prioritizes by impact, but your team knows effort and dependencies. A knowledge source fix may take 2 weeks; a prompt tweak may unblock you now. |
| 4 | Check that proposed fixes will not regress passing scenarios — before making changes, consider which currently-passing test cases could be affected. Prompt changes especially have ripple effects. | Fixing 3 failures while introducing 5 new ones is a net loss. Plan to re-run the full suite after any agent configuration change. |
| 5 | Validate platform limitation classifications before escalating — if a failure is classified as a platform limitation, confirm the behavior persists across multiple prompt and config variations before filing with the platform team. | Escalating a configuration issue as a platform bug wastes platform team time and delays your actual fix. |
| 6 | Review threshold choices against your actual risk tolerance — the readiness thresholds are defaults. Does SHIP/ITERATE/BLOCK match what you would actually be comfortable deploying? | Only your team knows your real risk tolerance. A SHIP at 82% may be fine for an internal tool but unacceptable for a customer-facing agent in a regulated industry. |
Include this table in the triage report output. Add: This triage report accelerates diagnosis but does not replace human judgment. Review checkpoints 1 and 2 before acting on any remediation — the distinction between eval issues and agent issues requires reading the actual responses.

Data Retention Warning

Copilot Studio deletes test run results after 89 days. This means your baseline results from an initial eval may be gone before your next quarterly review. After every triage cycle:
  1. Export the results CSV immediately (Test set → Export results)
  2. Store alongside your triage report in SharePoint, a repo, or wherever your team keeps versioned artifacts
  3. Tag with agent version and date so future comparisons are possible
If your triage identified a fix-and-rerun cycle, export the pre-fix results before applying changes. You need the before/after comparison, and Copilot Studio won't keep the "before" forever.
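A sketch of step 3 above: copy the exported CSV into versioned storage with the agent version and date in the filename. The function name and directory layout are assumptions for illustration, not a Copilot Studio feature.

```python
from datetime import date
from pathlib import Path
import shutil

def archive_eval_results(csv_path, agent_version, dest_dir):
    """Copy an exported results CSV into versioned storage, tagged with
    agent version and date so before/after comparisons survive the
    89-day retention window."""
    src = Path(csv_path)
    tag = f"{agent_version}_{date.today().isoformat()}"
    dest = Path(dest_dir) / f"{src.stem}_{tag}{src.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)  # copy2 preserves file timestamps
    return dest
```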

Cross-Reference

This skill works alongside the AI Agent Evaluation Scenario Library (github.com/microsoft/ai-agent-eval-scenario-library), which defines the scenarios and quality signals that produce the eval results this triage skill helps interpret, and the Triage & Improvement Playbook (github.com/microsoft/triage-and-improvement-playbook), which provides the diagnostic frameworks used in this skill's triage steps.

Related eval skills

| After triage, if you need to... | Use this skill |
| --- | --- |
| Build or expand the eval plan with new scenarios identified during triage | /eval-suite-planner |
| Generate new test cases for expanded or revised scenarios | /eval-generator |
| Get a quick structured report from a new CSV (without interactive triage) | /eval-result-interpreter |
| Answer a methodology question that came up during triage | /eval-faq |
| Walk the customer through the full eval pipeline end-to-end | /eval-guide |