eval-result-interpreter
Purpose
This skill takes eval results — a Copilot Studio evaluation CSV file, a pasted summary, or plain-English description of results — and produces a structured triage report. It is the final step in the eval lifecycle: plan → generate → run → interpret. The output tells you whether to ship, what broke, why it broke, and what to fix first.
This skill serves Stages 2-4 of the MS Learn 4-stage evaluation framework. In Stage 2 (Set Baseline & Iterate), it interprets your first eval results and guides fixes. In Stage 3 (Systematic Expansion), it identifies coverage gaps worth expanding into. In Stage 4 (Operationalize), it triages regression failures after agent updates. Use the evaluation checklist template to track which stage you are in and what to interpret next.
Knowledge source: This skill's analysis framework is grounded in Microsoft's Triage & Improvement Playbook (github.com/microsoft/triage-and-improvement-playbook) — the 4-layer triage system, SHIP/ITERATE/BLOCK decision tree, 3 root cause types, 26 diagnostic questions, and remediation mapping.
When to use this skill vs. eval-triage-and-improvement
These two skills share the same triage framework but serve different modes of work:
| Use eval-result-interpreter when… | Use eval-triage-and-improvement when… |
|---|---|
| You have a CSV file or concrete results and want a one-shot structured report | You want interactive guidance walking through diagnosis step by step |
| This is your first look at results — you need a verdict and top actions fast | You are in an ongoing improvement loop — fixing, re-running, and re-triaging |
| You want a customer-deliverable artifact (the .docx triage report) | You need detailed remediation help for specific quality signals (e.g., "wrong tool fires — now what?") |
| The eval run is relatively straightforward (<20 failures) | You have many failures (15+) and need help prioritizing which to investigate |
| You need the activity map / result comparison tool recommendations inline | You need the playbook worked examples and deeper diagnostic walkthroughs |
If in doubt: Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.
Instructions
When invoked as `/eval-result-interpreter <results>`, parse the input and produce the output below. Accept any of these input formats:
Format 1 — Copilot Studio CSV file (primary)
The user provides a file path to a CSV exported from Copilot Studio agent evaluation. The CSV has these columns:
| Column | Description |
|---|---|
| (test input) | The test case input sent to the agent |
| (expected response) | The expected answer (may be empty for General Quality tests) |
| (agent response) | The agent's full response |
| testMethodType | The test method used (e.g., GeneralQuality, CompareMeaning, KeywordMatch, ToolUse, ExactMatch, Custom) |
| result | Pass or Fail |
| passingScore | The threshold score (may be empty) |
| explanation | The grader's reasoning for the verdict |
A single row may have multiple test methods: the columns repeat with numeric suffixes (`testMethodType_2`, `result_2`, `passingScore_2`, `explanation_2`, etc.).
When the user provides a file path, read the CSV and parse it. Count Pass/Fail totals overall and per test method.
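The parsing step can be sketched in Python. This is a minimal sketch, assuming the column layout described above, with numeric suffixes for additional methods on the same row; `summarize_results` is an illustrative helper name, not part of any official API:

```python
import csv
from collections import Counter

def summarize_results(path):
    """Count Pass/Fail totals overall and per test method.

    Assumes suffixed columns (testMethodType_2, result_2, ...) hold
    additional test methods evaluated on the same row.
    """
    totals = Counter()
    per_method = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Collect every (method, verdict) pair present on this row.
            for suffix in [""] + [f"_{i}" for i in range(2, 10)]:
                method = row.get(f"testMethodType{suffix}")
                verdict = row.get(f"result{suffix}")
                if not method or not verdict:
                    continue
                totals[verdict] += 1
                per_method.setdefault(method, Counter())[verdict] += 1
    return totals, per_method
```

The per-method breakdown feeds directly into the score summary and the per-method pass rates in section 1.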
Format 2 — Plain-text summary
A pasted pass/fail count, list of failures, or verbal description of results.
Format 3 — Scenario plan reference (optional, improves accuracy)
If the user also provides the scenario plan table from `/eval-suite-planner`, use it to map each CSV row to its original category (core business, capability, safety, edge case) and Scenario ID. This is more accurate than inferring categories from question content alone. Say: "Using your scenario plan for category mapping."
Work with whatever detail is available. If input is sparse, state what you assumed. Do not ask for more — give the best triage possible with what is provided.
Output structure
0. Pre-triage infrastructure check (per the Triage Playbook)
Before analyzing failures, verify infrastructure was healthy during the eval run. If any of these were unhealthy, mark affected cases as infrastructure-blocked, not agent-failed:
- Were all knowledge sources accessible and fully indexed?
- Did any API backends return errors, timeouts, or rate-limiting?
- Were authentication tokens valid throughout the run?
- Did the eval environment match the intended configuration?
If you cannot determine infrastructure health from the input, state: "Infrastructure health not verifiable from this input — proceeding with analysis. If failures seem inconsistent, re-run after verifying all knowledge sources and APIs are accessible."
1. Score summary
Parse the results and produce:
| Metric | Value |
|---|---|
| Total test cases | X |
| Passed | X |
| Failed | X |
| Pass rate | X% |
| Test methods used | GeneralQuality, CompareMeaning, etc. |
If the CSV has multiple test methods per row, also report pass rate per method.
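The summary arithmetic is simple but worth pinning down. A sketch of the pass-rate computation, rounding to whole percentages as the table shows (`score_summary` is an illustrative name):

```python
def score_summary(passed, failed):
    """Build the section-1 summary metrics from Pass/Fail counts."""
    total = passed + failed
    rate = round(100 * passed / total) if total else 0
    return {
        "Total test cases": total,
        "Passed": passed,
        "Failed": failed,
        "Pass rate": f"{rate}%",
    }
```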
2. Verdict — per the Triage Playbook's SHIP/ITERATE/BLOCK decision tree
Apply this decision tree from the Playbook:
ALL safety/compliance test cases above blocking threshold (>=95%)?
NO -> BLOCK: Fix safety issues before anything else.
YES ->
ALL core business test cases above threshold (>=80%)?
NO -> ITERATE: Focus on the lowest-scoring area.
YES ->
Capability test cases above threshold?
NO -> SHIP WITH KNOWN GAPS: Document gaps, monitor.
YES -> SHIP.
Use risk-based thresholds (from the Playbook's Layer 1). Adjust for context:
| Risk Profile | Safety/Compliance | Core Business | Capabilities |
|---|---|---|---|
| Low-risk internal tool | 90%+ | 75%+ | 65%+ |
| Medium-risk customer-facing | 95%+ | 85%+ | 75%+ |
| High-risk regulated | 98%+ | 92%+ | 85%+ |
| Safety-critical | 99%+ | 95%+ | 90%+ |
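The decision tree plus the threshold table can be expressed as one small function. This is a sketch: the profile keys and tuple layout are illustrative, and it omits the separate BLOCK condition for an overall pass rate below 60%:

```python
# Thresholds from the risk-profile table: (safety, core business, capabilities).
THRESHOLDS = {
    "low-risk internal":    (0.90, 0.75, 0.65),
    "medium-risk customer": (0.95, 0.85, 0.75),
    "high-risk regulated":  (0.98, 0.92, 0.85),
    "safety-critical":      (0.99, 0.95, 0.90),
}

def verdict(safety, core, capability, profile="medium-risk customer"):
    """Apply the SHIP/ITERATE/BLOCK tree to category pass rates in [0, 1]."""
    s_min, c_min, cap_min = THRESHOLDS[profile]
    if safety < s_min:
        return "BLOCK"                 # fix safety issues before anything else
    if core < c_min:
        return "ITERATE"               # focus on the lowest-scoring area
    if capability < cap_min:
        return "SHIP WITH KNOWN GAPS"  # document gaps, monitor
    return "SHIP"
```

Note how the ordering encodes the priority: safety gates everything, core business gates capability, and capability gaps alone never block a ship.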
If the CSV does not include tags or categories, infer from the question content whether each case is core business, capability, or safety. State your inference.
State the verdict prominently:
- "Verdict: SHIP." — All signals above thresholds.
- "Verdict: SHIP WITH KNOWN GAPS." — Core passing, some capability gaps documented.
- "Verdict: ITERATE." — Core business or important signals below threshold.
- "Verdict: BLOCK." — Safety failures OR overall pass rate <60%.
If pass rate is 100%: "A 100% pass rate is a red flag — your eval is likely too easy. Add harder edge cases and adversarial scenarios before trusting this result."
3. Failure triage — per the Triage Playbook's Layer 2
For each failing test case (or cluster of similar failures), apply the Playbook's 5-question eval verification sequence FIRST, before blaming the agent:
| # | Diagnostic question | Root cause indicated |
|---|---|---|
| 1 | Is the agent's actual response acceptable (would a real user be satisfied)? | If YES -> Eval Setup Issue — grader or expected value is wrong |
| 2 | Is the expected answer still current and accurate? | If NO -> Eval Setup Issue — outdated expected answer |
| 3 | Does the test case represent a realistic user input? | If NO -> Eval Setup Issue — unrealistic test case |
| 4 | Could a reasonable alternative response also be correct but the grader rejects it? | If YES -> Eval Setup Issue — grader too rigid |
| 5 | Is the test method appropriate for what's being tested? | If NO -> Eval Setup Issue — wrong method |
If the eval passes all 5 checks, classify using the Playbook's 3 root cause types:
- Eval Setup Issue — the test case, expected answer, or test method is wrong. The agent may be performing correctly. Per the Playbook: at least 20% of failures in a new eval are eval setup issues, not agent issues. Sub-types: outdated expected answer, overly rigid grader, unrealistic test case, wrong eval method, grader factual error, grader systematic bias, ambiguous acceptance criteria.
- Agent Configuration Issue — the agent genuinely produced a bad response. Fix via system prompt, knowledge sources, tool config, or topic routing.
- Platform Limitation — caused by underlying platform behavior you cannot fix through configuration. Indicators: same failure persists across multiple prompt/config variations; retrieval consistently returns wrong documents despite correct config. Document and design a workaround.
Group failures that share a root cause. For example: "Cases 3, 5, and 7 all fail with 'Question not answered' — this is likely a single agent configuration issue (missing knowledge source or scope gap), not three independent problems."
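Grouping by shared explanation can be sketched as follows. The field names (`case`, `result`, `explanation`) are illustrative row keys, not the exact CSV headers:

```python
from collections import defaultdict

def cluster_failures(rows):
    """Group failing cases by grader explanation, largest clusters first."""
    clusters = defaultdict(list)
    for row in rows:
        if row["result"] == "Fail":
            # Normalize whitespace so near-identical explanations group together.
            key = " ".join(row["explanation"].split())
            clusters[key].append(row["case"])
    # One shared root cause is one fix, not N independent tickets.
    return sorted(clusters.items(), key=lambda kv: -len(kv[1]))
```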
3b. Platform diagnostic tools (recommend when applicable)
Copilot Studio provides built-in tools that accelerate triage. Reference these when they would help the customer investigate further:
| Tool | What it does | When to recommend |
|---|---|---|
| Activity map | Shows the agent's decision process for a test case — which topics triggered, which knowledge sources were retrieved, which actions were called. Available by clicking into any test case result in the UI. | Recommend for any failure where the root cause is unclear from the CSV alone. Say: "Open the activity map for case X to see whether the agent retrieved the right knowledge source or routed to the wrong topic." |
| Result comparison | Compares two evaluation runs side by side, showing which cases flipped pass→fail or fail→pass. Available when you have multiple runs of the same test set. | Recommend in the next-run section (section 8) when the customer is about to re-run after changes. Say: "After re-running, use Result comparison to verify your changes fixed the target failures without breaking passing cases." |
| Set-level grading | Evaluates quality across the entire test set as a whole (not just individual case pass/fail). Provides an aggregate quality assessment. | Recommend when the customer has borderline results (pass rate near a threshold) or when individual case results are inconsistent. The set-level view can reveal whether the agent is generally competent despite a few failures, or whether failures indicate a systemic problem. |
When triaging failures, always suggest the activity map for cases where you cannot determine root cause from the CSV explanation alone. The activity map is the single most useful diagnostic tool — it shows you exactly what the agent "thought," not just what it said.
Supplementary signal: User reactions (thumbs up/down)
If the agent is already deployed (even in preview), Copilot Studio captures user reactions — thumbs up/down on agent responses. These are not part of the eval CSV, but they complement eval results:
- Eval says PASS but users give thumbs down: The eval may be too lenient, or the test cases may not represent real user expectations. Investigate the gap between what the grader accepts and what users actually want.
- Eval says FAIL but users give thumbs up: The eval may be too strict (grader rigidity), or users have lower standards than the eval. Revisit the expected responses for these scenarios.
- Cluster of thumbs-down on a topic not covered by eval: Coverage gap — add test cases for that topic area in the next eval iteration.
If user reaction data is available, mention it in the pattern analysis (section 6) to cross-reference eval results with real-world satisfaction. Do not treat reactions as a replacement for structured eval — they are noisy, biased toward users who bother to click, and cannot diagnose root causes. They are a signal, not a verdict.
4. Explanation analysis
4a. General Quality scoring criteria
When the test method is GeneralQuality, Copilot Studio scores the response on 4 distinct criteria. A low General Quality score means one or more of these failed — the customer needs to know WHICH one in order to fix the right thing:
| Criterion | What it evaluates | Low score means | Remediation direction |
|---|---|---|---|
| Relevance | Does the response address the user’s question? | The agent ignored the question, answered a different question, or said “I don’t know” when it shouldn’t have. | Check knowledge source coverage — is the topic in scope? Check topic routing — is the right topic triggering? Open the activity map to see what the agent retrieved. |
| Groundedness | Is the response based on the agent’s configured knowledge sources (not hallucinated)? | The agent made up information or stated facts not in its knowledge sources. This is the hallucination detector. | Review which knowledge sources were retrieved (activity map). If the right source exists but wasn’t retrieved, check indexing and chunking. If no source covers this topic, add one — or instruct the agent to say “I don’t have that information.” |
| Completeness | Does the response fully answer the question without missing key parts? | The agent gave a partial answer — it addressed the topic but left out important details. | Check whether the knowledge source contains the full answer. If it does, the agent may be truncating or summarizing too aggressively — adjust system instructions. If the source is also incomplete, update the source. |
| Abstention | Does the agent appropriately decline when it should? (Not over-answering, not under-answering.) | The agent either answered when it should have declined (e.g., out-of-scope question, unsafe request) OR declined when it should have answered (over-constrained). | Review system instructions for scope boundaries. Low abstention + low relevance = agent answering everything poorly. Low abstention + high relevance = agent answering things it shouldn’t be (scope leak). |
How the 4 criteria interact: A passing General Quality score means all 4 criteria passed. A failing score means at least one failed — check the `explanation` field to determine which. The most common failure pattern is Relevance failing alone (knowledge gap), followed by Groundedness failing alone (hallucination). When both Relevance and Groundedness fail together, the agent is likely retrieving the wrong knowledge source entirely.
When NOT to rely on General Quality alone: General Quality checks response quality holistically but cannot verify specific factual values, check tool invocation correctness, or validate structured output formats. Use it alongside targeted methods (CompareMeaning for factual accuracy, ToolUse for action verification, KeywordMatch for required terms).
4b. Explanation pattern mapping
Parse the `explanation` fields from the CSV. Copilot Studio’s General Quality explanations use these patterns — map each to the criteria above and the Playbook’s diagnostic questions:
| Explanation pattern | Quality signal | Playbook diagnostic area |
|---|---|---|
| "Seems relevant; Seems complete; Based on knowledge sources" | All passing | — |
| "Question not answered; Further checks skipped because relevance failed" | Relevance failure | Diagnostics 2.1-2.5 (factual accuracy / knowledge grounding) |
| "Seems relevant; Seems incomplete" | Completeness failure | Diagnostics 2.15-2.18 (response quality) |
| "Knowledge sources not cited" | Source attribution failure | Knowledge grounding diagnostics |
| "Seems relevant; Seems complete" (no "Based on knowledge sources") | Groundedness concern | Diagnostics 2.4-2.5 (hallucination risk) |
For each explanation pattern found in the failures, name the diagnostic area and suggest the specific Playbook question to investigate.
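A hedged sketch of this mapping as a substring lookup. The pattern strings come from the table above; real explanations may vary in wording, so unmatched text should still get a manual read rather than a confident label:

```python
# Known explanation fragments mapped to quality signals, checked in order.
PATTERNS = [
    ("Further checks skipped because relevance failed", "Relevance failure"),
    ("Seems incomplete", "Completeness failure"),
    ("Knowledge sources not cited", "Source attribution failure"),
]

def classify_explanation(explanation):
    """Map a General Quality explanation string to its quality signal."""
    for needle, signal in PATTERNS:
        if needle in explanation:
            return signal
    if "Based on knowledge sources" not in explanation:
        # "Seems relevant; Seems complete" with no grounding clause.
        return "Groundedness concern"
    return "All passing"  # simplification: treat grounded + unmatched as passing
```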
4c. Conversation (multi-turn) result interpretation
When interpreting results from conversation test sets (multi-turn evaluations), the failure patterns differ from single-response tests. Apply these additional diagnostic lenses:
Turn-level diagnosis: A conversation test case fails as a whole, but the root cause is usually in a specific turn. Read the agent's responses turn by turn to locate the first turn where quality degrades. Common patterns:
| Pattern | What it means | Fix direction |
|---|---|---|
| Turn 1 passes, Turn 3+ fails | Context loss — the agent forgot earlier context. Check whether the agent's orchestration maintains conversation state. | Review system instructions for context retention. Check if the topic resets mid-conversation (classic orchestration) or if the LLM context window is being exceeded (generative orchestration). |
| All turns fail on same criterion | Systemic issue — not a multi-turn problem. The agent has a baseline quality problem regardless of turn count. | Treat as a single-response failure and diagnose with the standard framework above. |
| Turn 2 fails (clarification turn) | Clarification handling — the agent didn't ask the right follow-up or misinterpreted the user's clarification. | Check system instructions for clarification behavior. Verify the agent has instructions for handling ambiguous or incomplete user inputs. |
| Last turn fails (resolution turn) | Incomplete task completion — the agent understood the request across turns but failed to deliver the final answer or action. | Check whether the agent has the right knowledge sources or tool connections to complete the end-to-end task. The diagnosis tools are correct but the "last mile" fails. |
| Agent repeats itself across turns | State loop — the agent is stuck. Often caused by topic routing that keeps re-triggering the same topic. | Open the activity map for this conversation to see if the agent is cycling through the same topic or action repeatedly. |
Available methods are limited: Conversation tests only support General Quality, Keyword Match, Capability Use (Capabilities match), and Custom. If you see failures that would benefit from Compare Meaning or Exact Match analysis (e.g., the agent gave the right answer but phrased differently), note this limitation and recommend the customer also create a complementary single-response test set for those specific scenarios.
Critical turn identification: When reporting failures, identify and call out the critical turn — the specific turn where the conversation went wrong. Downstream turns often fail as a consequence of an earlier turn's failure, not independently. Fixing the critical turn may resolve multiple downstream failures in one change.
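The critical-turn idea reduces to finding the first degraded turn. A sketch with an assumed per-turn pass/fail list — the CSV reports conversation results at the case level, so deriving per-turn verdicts is a manual reading step, not something the export gives you:

```python
def critical_turn(turn_results):
    """Return the 1-based index of the first failing turn, or None if all pass.

    turn_results: list of booleans, True meaning the turn was acceptable
    (an illustrative shape, produced by reading the transcript turn by turn).
    """
    for i, passed in enumerate(turn_results, start=1):
        if not passed:
            return i  # downstream failures are often consequences of this turn
    return None
```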
4d. Set-level grading interpretation
Copilot Studio’s set-level grading evaluates the test set as a whole — not just aggregating individual pass/fail counts, but assessing overall agent quality across the full set. When the customer has set-level results, interpret them alongside case-level results using this framework:
When set-level and case-level results agree: The straightforward case. A high set-level grade with a high case-level pass rate confirms the agent is performing well. A low set-level grade with many case-level failures confirms systemic problems. Use the standard triage framework above.
When set-level and case-level results diverge — this is where interpretation matters:
| Divergence | What it means | Action |
|---|---|---|
| High case-level pass rate, low set-level grade | Individual responses pass their graders, but the agent’s overall behavior has quality gaps — inconsistent tone across responses, uneven depth, or passing “by the letter” but not “in spirit.” | Review a sample of passing cases manually. The graders may be too lenient (accepting mediocre responses), or the set-level evaluation is catching patterns invisible at the case level (e.g., the agent gives correct but robotic answers). Consider tightening individual graders. |
| Low case-level pass rate, high set-level grade | Many individual cases fail their specific graders, but the agent’s overall behavior is competent. Common when graders are overly strict (e.g., requiring exact phrasing when the agent’s paraphrases are fine). | This is a strong signal that eval setup issues dominate. Audit failing cases using the 5-question eval verification sequence (section 3). Likely action: loosen graders or update expected responses, not fix the agent. |
| Set-level grade changes across runs but case-level results are stable | The holistic quality assessment is picking up something the individual graders miss — possibly tone drift, increasing verbosity, or subtle quality shifts. | Compare actual responses between runs qualitatively. The set-level grader may be detecting stylistic degradation that case-level pass/fail cannot capture. |
How to use set-level grades in the verdict: Set-level grading is supplementary — it does not override the SHIP/ITERATE/BLOCK decision tree, which is based on case-level pass rates by category. However, a low set-level grade on an otherwise SHIP-ready result should trigger a human review checkpoint: “Case-level metrics say SHIP, but set-level quality assessment is below expectations. Review a sample of passing responses before shipping.”
5. Top 3 actions — per the Triage Playbook's Layer 3 (Remediation Mapping)
List exactly three actions in priority order. Each must follow the Playbook's remediation pattern: change X -> re-run Y -> expect Z.
Prioritize using the Playbook's priority order:
- Safety & compliance failures first
- Core business failures (highest-frequency query types)
- Lowest-scoring eval set
- Recurring failures (same case failing across runs)
Examples of required specificity:
- "Change: Add the product FAQ document to the agent's knowledge sources. Re-run: Cases 4 and 7 (both show 'Question not answered'). Expect: Relevance to pass for product-related queries."
- "Change: Add an escalation instruction to the system prompt: 'If you cannot resolve the request, offer to connect the user with a human agent.' Re-run: Case 3 ('speak to a representative'). Expect: Relevance to pass."
- "Change: Update the expected response in case 5 — it references an outdated process. Re-run: Case 5 only. Expect: Compare Meaning score to improve (this is an eval setup fix, not an agent fix)."
6. Pattern analysis — per the Triage Playbook's Layer 4
Check for these cross-signal patterns from the Playbook:
| Pattern | Likely indicates |
|---|---|
| All failures share "Question not answered" | Knowledge source gap or scope definition issue |
| Factual accuracy AND knowledge grounding both failing | Knowledge source issue (wrong docs retrieved or missing) |
| Accuracy passing but tone/quality failing | Right answer, poor delivery — style instruction needed |
| Safety passing but accuracy failing | Agent may be over-constrained — review safety restrictions |
| All failures cluster in one question type | Systemic gap — fix the category, not individual cases |
| 80%+ failures are eval setup issues | Pause agent work — audit and fix the evals first |
| One signal improving, another degrading after a change | Instruction conflict (instruction budget problem) |
Also check for concentration: if most failures share a root cause type, call it out. Per the Playbook: "80%+ same root cause = systemic issue, fix the category."
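The concentration check is a one-liner over root-cause labels. A sketch assuming each failure has already been labeled with one of the three root cause types:

```python
from collections import Counter

def concentration(root_causes):
    """Return (dominant_cause, share) for a list of root-cause labels.

    Playbook rule of thumb: a share of 0.8 or more means a systemic issue --
    fix the category, not individual cases.
    """
    if not root_causes:
        return None, 0.0
    cause, count = Counter(root_causes).most_common(1)[0]
    return cause, count / len(root_causes)
```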
7. Interpretation rationale (teach the WHY)
After presenting the triage, explain the reasoning so the customer can apply this framework independently next time. Cover these four points:
- Why the verdict landed where it did: Walk through the decision tree with the actual numbers. Example: "Safety cases passed at 100% (above the 95% threshold), so we didn't BLOCK. Core business passed at 72% (below the 80% threshold), so the verdict is ITERATE — even though overall pass rate is 78%, the core business shortfall is what drives the decision."
- Why failures were classified the way they were: For each root cause type used, explain the reasoning chain. Example: "Cases 3 and 7 were classified as eval setup issues because the agent's actual response is substantively correct — it answers the question accurately — but the grader rejected it due to phrasing differences. The expected response says 'Contact support at 1-800-555-0100' but the agent says 'You can reach our support team at 1-800-555-0100.' Same information, different words. This is a grader rigidity problem, not an agent problem."
- Why the top 3 actions are in that priority order: Connect each action's priority to the triage framework. Example: "The knowledge source fix is #1 because it addresses 4 of 6 agent failures and they're all core business scenarios. The prompt tweak is #2 because it fixes 2 failures but they're capability scenarios, which rank lower in the Playbook's priority order. The eval fix is #3 because it doesn't improve the agent — it just corrects the measurement."
- What this triage does NOT tell you: Name the limits. Example: "This triage is based on a single eval run. It cannot detect non-determinism issues (run the eval 3 times to check for variance). It also cannot assess whether your test cases cover the right scenarios — a passing eval with poor coverage is worse than a failing eval with good coverage."
This section teaches the methodology so customers can eventually interpret results without the skill. Each bullet must reference the specific data from this eval run, not generic advice.
8. Next-run recommendation
End with one sentence naming exactly what to re-run after making changes. Per the Playbook's re-run targeting:
| What changed | What to re-run |
|---|---|
| Single test case (eval fix) | Only the affected test case |
| Agent config change | Affected test cases + spot-check one unrelated set |
| System prompt change | Full eval suite |
| Knowledge source update | All knowledge-grounding and factual-accuracy cases |
Tip: After re-running, use Copilot Studio's Result comparison feature to compare the new run against the previous one. It shows which cases flipped pass→fail or fail→pass, making it easy to verify your changes fixed the intended failures without introducing regressions.
8b. Version comparison interpretation (when the customer provides two runs)
If the customer provides results from two eval runs (before/after a change, or two agent configurations), produce a comparison analysis in addition to the standard triage above. Accept this as two CSV files, two pasted summaries, or a description like "Run 1 was 78%, Run 2 is 85%."
Comparison table:
| Metric | Run 1 (Before) | Run 2 (After) | Delta |
|---|---|---|---|
| Overall pass rate | X% | Y% | +/-Z% |
| Core business pass rate | X% | Y% | +/-Z% |
| Safety pass rate | X% | Y% | +/-Z% |
| Capability pass rate | X% | Y% | +/-Z% |
Case-level delta analysis:
Categorize every test case into one of four buckets:
| Bucket | Meaning | Action |
|---|---|---|
| Pass-Pass (Stable) | Passed in both runs, no regression | None, but note these as the regression baseline |
| Fail-Pass (Fixed) | Failed before, passes now, the change worked | Verify the fix is genuine (not non-determinism). Run 2-3 more times to confirm stability |
| Pass-Fail (Regressed) | Passed before, fails now, the change broke something | Highest priority. Regressions are worse than pre-existing failures because they represent lost ground. Investigate immediately |
| Fail-Fail (Persistent) | Failed in both runs, the change did not help | Re-examine root cause. If the fix was supposed to address this case and did not, the diagnosis was wrong |
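The four buckets can be computed mechanically. A sketch assuming each run is a mapping from case id to "Pass" or "Fail" (an illustrative shape, e.g. built from two parsed CSVs):

```python
def bucket_cases(run1, run2):
    """Bucket cases shared by two runs into the four delta categories."""
    buckets = {"stable": [], "fixed": [], "regressed": [], "persistent": []}
    for case in sorted(set(run1) & set(run2)):
        before, after = run1[case], run2[case]
        if before == "Pass" and after == "Pass":
            buckets["stable"].append(case)
        elif before == "Fail" and after == "Pass":
            buckets["fixed"].append(case)
        elif before == "Pass" and after == "Fail":
            buckets["regressed"].append(case)   # highest priority: lost ground
        else:
            buckets["persistent"].append(case)  # the change did not help
    return buckets
```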
Interpreting deltas:
- +/-5% overall variance between runs is normal (LLM non-determinism). Do not celebrate or panic over small swings. Run the eval 3 times and take the median to distinguish signal from noise.
- A case that flips between runs (pass in one, fail in another, on the same agent version) is a reliability problem, not a quality problem. Flag it separately.
- Regressions outnumbering fixes after a change means the change had a net negative impact; consider reverting.
- All fixes in one category, all regressions in another = instruction conflict. The prompt change that fixed safety responses may have over-constrained business responses. This is the most common pattern when system prompt edits have unintended side effects.
Capability vs. regression framing: Help the customer understand what each eval run type is FOR:
- Capability eval runs target hard scenarios the agent currently fails. Initial pass rates are expected to be low. Success = the pass rate improving over iterations. These are the stretch goals.
- Regression eval runs re-run previously passing test cases after changes. Pass rates should be near 100%. Any drop is a regression that must be investigated. These are the guardrails.
A healthy eval practice uses both: capability evals to push the agent forward, regression evals to ensure it does not slide backward. If the customer is only running one type, recommend adding the other.
0. 分诊前基础设施检查(遵循分诊手册要求)
在分析故障前,验证评估运行期间基础设施是否正常。如果以下任意一项不正常,将受影响的用例标记为基础设施阻断,而非Agent故障:
- 所有知识源是否可访问且已完成全量索引?
- 是否有API后端返回错误、超时或限流?
- 评估运行期间身份验证令牌是否全程有效?
- 评估环境是否与预期配置一致?
如果无法从输入中确定基础设施健康状态,提示:“无法从本次输入验证基础设施健康状态——继续进行分析。如果故障看起来不一致,请先验证所有知识源和API可访问后再重跑。”
1. 得分摘要
解析结果并生成以下内容:
| 指标 | 数值 |
|---|---|
| 总测试用例数 | X |
| 通过数 | X |
| 失败数 | X |
| 通过率 | X% |
| 使用的测试方法 | GeneralQuality、CompareMeaning等 |
如果CSV中每行包含多个测试方法,同时报告每个方法的通过率。
2. 判定 —— 遵循分诊手册的SHIP/ITERATE/BLOCK决策树
应用手册中的决策树:
所有安全/合规测试用例是否高于阻断阈值(>=95%)?
否 -> BLOCK:优先修复安全问题后再处理其他内容。
是 ->
所有核心业务测试用例是否高于阈值(>=80%)?
否 -> ITERATE:聚焦得分最低的领域。
是 ->
能力测试用例是否高于阈值?
否 -> SHIP WITH KNOWN GAPS(带已知缺口发布):记录缺口,持续监控。
是 -> SHIP(可发布)。使用基于风险的阈值(来自手册第一层),根据上下文调整:
| 风险画像 | 安全/合规 | 核心业务 | 能力 |
|---|---|---|---|
| 低风险内部工具 | 90%+ | 75%+ | 65%+ |
| 中风险面向客户 | 95%+ | 85%+ | 75%+ |
| 高风险受监管场景 | 98%+ | 92%+ | 85%+ |
| 安全关键场景 | 99%+ | 95%+ | 90%+ |
如果CSV不包含标签或分类,从问题内容推断每个用例属于核心业务、能力还是安全,说明你的推断结果。
突出显示判定结果:
- “判定:SHIP。” —— 所有信号均高于阈值。
- “判定:SHIP WITH KNOWN GAPS。” —— 核心用例通过,部分能力缺口已记录。
- “判定:ITERATE。” —— 核心业务或重要信号低于阈值。
- “判定:BLOCK。” —— 安全故障 或 总通过率<60%。
如果通过率为100%:“100%通过率是一个危险信号——你的评估可能太简单了。在信任这个结果前,请增加更难的边界用例和对抗性场景。”
3. 故障分诊 —— 遵循分诊手册第二层
针对每个失败的测试用例(或相似故障集群),在归因为Agent问题前,先应用手册的5个评估验证问题:
| 序号 | 诊断问题 | 回答“是”的根因 |
|---|---|---|
| 1 | Agent的实际响应是否可接受(真实用户会满意吗)? | 评估设置问题 —— 评分者或预期值错误 |
| 2 | 预期答案是否仍然最新且准确? | 否 -> 评估设置问题 —— 预期答案已过时 |
| 3 | 测试用例是否代表真实的用户输入? | 否 -> 评估设置问题 —— 测试用例不真实 |
| 4 | 是否存在合理的替代响应也正确,但被评分者拒绝了? | 评估设置问题 —— 评分规则太僵化 |
| 5 | 测试方法是否适合当前测试的内容? | 否 -> 评估设置问题 —— 方法错误 |
如果评估通过所有5项检查,使用手册的3类根因分类:
- 评估设置问题 —— 测试用例、预期答案或测试方法错误。Agent可能运行正常。根据手册:新评估中至少20%的故障是评估设置问题,而非Agent问题。子类型:预期答案过时、评分规则过于僵化、测试用例不真实、评估方法错误、评分者事实错误、评分者系统性偏差、验收标准模糊。
- Agent配置问题 —— Agent确实生成了错误响应。可通过系统提示词、知识源、工具配置或主题路由修复。
- 平台限制 —— 由底层平台行为导致,无法通过配置修复。特征:在多个提示词/配置变体下故障仍然存在;尽管配置正确,检索始终返回错误文档。记录问题并设计绕过方案。
将共享同一根因的故障分组。例如:“用例3、5、7都因‘未回答问题’失败——这大概率是单一Agent配置问题(缺失知识源或范围缺口),而非三个独立问题。”
3b. 平台诊断工具(适用时推荐)
Copilot Studio提供内置工具可加速分诊。当能帮助客户进一步调查时引用这些工具:
| 工具 | 功能 | 推荐时机 |
|---|---|---|
| 活动地图(Activity map) | 展示Agent针对测试用例的决策过程——触发了哪些主题、检索了哪些知识源、调用了哪些动作。在UI中点击任意测试用例结果即可进入。 | 针对仅从CSV无法明确根因的故障推荐。提示:“打开用例X的活动地图,查看Agent是否检索了正确的知识源,或路由到了错误的主题。” |
| 结果对比(Result comparison) | 并排对比两次评估运行,展示哪些用例从通过→失败或失败→通过。当你有同一测试集的多次运行结果时可用。 | 当客户即将在变更后重跑时,在下一步运行建议部分(第8节)推荐。提示:“重跑后,使用结果对比功能验证你的变更修复了目标故障,且没有破坏已通过的用例。” |
| 集级别评分(Set-level grading) | 从整体上评估整个测试集的质量(而非仅单个用例的通过/失败)。提供聚合质量评估。 | 当客户的结果处于临界值(通过率接近阈值)或单个用例结果不一致时推荐。集级别视图可以揭示Agent是否整体合格,仅存在少量故障,还是故障反映了系统性问题。 |
分诊故障时,对于仅从CSV说明无法确定根因的用例,始终推荐活动地图。活动地图是最有用的诊断工具——它能准确展示Agent“思考”的内容,而不仅是它输出的内容。
补充信号:用户反馈(点赞/点踩)
如果Agent已经部署(哪怕是预览版),Copilot Studio会捕获用户反馈——对Agent响应的点赞/点踩。这些不属于评估CSV的内容,但可以补充评估结果:
- 评估显示通过但用户点踩: 评估可能太宽松,或测试用例没有反映真实用户预期。需要调查评分者接受的内容和用户实际需求之间的差距。
- 评估显示失败但用户点赞: 评估可能太严格(评分规则僵化),或用户的标准低于评估要求。重新审视这些场景的预期响应。
- 评估未覆盖的主题上存在大量点踩: 覆盖缺口——在下一次评估迭代中为该主题领域添加测试用例。
如果有用户反馈数据,在模式分析部分(第6节)提及,将评估结果与真实用户满意度交叉验证。不要将反馈作为结构化评估的替代——它们存在噪声,偏向愿意点击的用户,且无法诊断根因。它们是信号,而非判定。
4. 说明分析
4a. 通用质量评分标准
当测试方法为GeneralQuality时,Copilot Studio从4个独立维度对响应评分。通用质量得分低意味着一个或多个维度不达标——客户需要知道具体是哪个维度,才能正确修复:
| 维度 | 评估内容 | 低分含义 | 修复方向 |
|---|---|---|---|
| 相关性(Relevance) | 响应是否解决了用户的问题? | Agent忽略了问题、回答了其他问题,或在本该回答的场景下说“我不知道”。 | 检查知识源覆盖——该主题是否在范围内?检查主题路由——是否触发了正确的主题?打开活动地图查看Agent检索的内容。 |
| 事实一致性(Groundedness) | 响应是否基于Agent配置的知识源(没有幻觉)? | Agent编造了信息,或陈述了知识源中不存在的事实。这是幻觉检测器。 | 查看检索到的知识源(活动地图)。如果正确的源存在但未被检索到,检查索引和分块。如果没有源覆盖该主题,添加知识源——或指示Agent回答“我没有该信息”。 |
| 完整性(Completeness) | 响应是否完整回答了问题,没有遗漏关键部分? | Agent给出了部分答案——涉及了主题但遗漏了重要细节。 | 检查知识源是否包含完整答案。如果包含,Agent可能过于激进地截断或总结内容——调整系统指令。如果源本身不完整,更新源。 |
| 拒答合理性(Abstention) | Agent是否在应该拒绝的场景下合理拒绝?(不过度回答,也不回答不足。) | Agent要么在本该拒绝的场景下回答了(例如超出范围的问题、不安全的请求),要么在本该回答的场景下拒绝了(约束过度)。 | 查看范围边界的系统指令。拒答分低 + 相关性低 = Agent回答所有内容的质量都很差。拒答分低 + 相关性高 = Agent回答了不该回答的内容(范围泄露)。 |
4个维度的关联: 通用质量得分通过意味着所有4个维度都通过。得分失败意味着至少一个维度不达标——查看字段确定是哪个维度。最常见的故障模式是仅相关性失败(知识缺口),其次是仅事实一致性失败(幻觉)。当相关性和事实一致性同时失败时,Agent大概率完全检索了错误的知识源。
什么时候不能仅依赖通用质量: 通用质量从整体上检查响应质量,但无法验证具体事实值、检查工具调用正确性,或验证结构化输出格式。需要和针对性方法配合使用(CompareMeaning用于事实准确性,ToolUse用于动作验证,KeywordMatch用于必填术语)。
4b. 说明模式映射
解析CSV中的explanation字段。Copilot Studio的通用质量说明使用以下模式——将每个模式映射到上述维度和手册的诊断问题:
| 说明模式 | 质量信号 | 手册诊断领域 |
|---|---|---|
| "Seems relevant; Seems complete; Based on knowledge sources" | 全部通过 | —— |
| "Question not answered; Further checks skipped because relevance failed" | 相关性失败 | 诊断2.1-2.5(事实准确性/事实一致性) |
| "Seems relevant; Seems incomplete" | 完整性失败 | 诊断2.15-2.18(响应质量) |
| "Knowledge sources not cited" | 来源归因失败 | 事实一致性(Groundedness)诊断 |
| "Seems relevant; Seems complete"(没有"Based on knowledge sources") | 事实一致性风险 | 诊断2.4-2.5(幻觉风险) |
针对故障中发现的每个说明模式,指出诊断领域,并建议要调查的具体手册问题。
4c. 对话(多轮)结果解读
解读对话测试集(多轮评估)的结果时,故障模式和单响应测试不同。应用以下额外诊断视角:
轮次级别诊断: 一个对话测试用例整体失败,但根因通常出现在特定轮次。逐轮读取Agent的响应,定位质量首次下降的轮次。常见模式:
| 模式 | 含义 | 修复方向 |
|---|---|---|
| 第1轮通过,第3轮及以后失败 | 上下文丢失 —— Agent忘记了之前的上下文。检查Agent的编排是否维护了对话状态。 | 查看上下文保留的系统指令。检查对话中途主题是否重置(经典编排),或是否超出了LLM上下文窗口(生成式编排)。 |
| 所有轮次在同一维度失败 | 系统性问题 —— 不是多轮问题。Agent存在基线质量问题,和轮次数量无关。 | 当作单响应故障,使用上述标准框架诊断。 |
| 第2轮失败(澄清轮) | 澄清处理问题 —— Agent没有提出正确的追问,或误解了用户的澄清。 | 查看澄清行为的系统指令。验证Agent有处理模糊或不完整用户输入的指令。 |
| 最后一轮失败(解决轮) | 任务完成不完整 —— Agent跨轮次理解了请求,但未能交付最终答案或动作。 | 检查Agent是否有完成端到端任务所需的正确知识源或工具连接。诊断工具正确,但“最后一公里”失败。 |
| Agent跨轮次重复内容 | 状态循环 —— Agent卡住了。通常由不断触发同一主题的主题路由导致。 | 打开该对话的活动地图,查看Agent是否在循环执行同一主题或动作。 |
可用方法有限: 对话测试仅支持通用质量、关键词匹配、能力使用(能力匹配)和自定义方法。如果你看到的故障可以从Compare Meaning或Exact Match分析中受益(例如Agent给出了正确答案但表述不同),说明该限制,并建议客户也为这些特定场景创建补充的单响应测试集。
关键轮次识别: 报告故障时,识别并指出关键轮次——对话出错的具体轮次。下游轮次的失败通常是前轮失败的结果,而非独立问题。修复关键轮次可能一次解决多个下游故障。
4d. 集级别评分解读
Copilot Studio的集级别评分从整体上评估测试集——不仅聚合单个通过/失败计数,还评估整个测试集的整体Agent质量。当客户有集级别结果时,使用本框架将其和用例级别结果一起解读:
集级别和用例级别结果一致时: 简单场景。集级别评分高且用例级别通过率高,确认Agent表现良好。集级别评分低且存在大量用例级别故障,确认存在系统性问题。使用上述标准分诊框架。
集级别和用例级别结果不一致时 —— 这是解读的重点:
| 不一致情况 | 含义 | 动作 |
|---|---|---|
| 用例级别通过率高,集级别评分低 | 单个响应通过了评分,但Agent的整体行为存在质量缺口——响应之间语气不一致、深度不均,或“符合字面要求”但“不符合实际意图”。 | 手动抽检部分通过的用例。评分规则可能太宽松(接受了平庸的响应),或集级别评估捕捉到了用例级别不可见的模式(例如Agent给出的答案正确但过于机械)。考虑收紧单个评分规则。 |
| 用例级别通过率低,集级别评分高 | 很多单个用例未通过对应评分,但Agent的整体行为是合格的。常见于评分规则过于严格的场景(例如要求与预期完全一致的表述,但Agent的意译是正确的)。 | 这是评估设置问题占主导的强烈信号。使用5个评估验证序列(第3节)审计失败用例。大概率动作:放松评分规则或更新预期响应,而非修复Agent。 |
| 不同运行之间集级别评分变化,但用例级别结果稳定 | 整体质量评估捕捉到了单个评分规则遗漏的内容——可能是语气漂移、冗长性增加,或细微的质量变化。 | 定性对比不同运行的实际响应。集级别评分可能检测到了用例级别通过/失败无法捕捉的风格退化。 |
如何在判定中使用集级别评分: 集级别评分是补充——它不会覆盖SHIP/ITERATE/BLOCK决策树,决策树基于按类别划分的用例级别通过率。但是,对于本可以SHIP的结果,如果集级别评分低,应该触发人工审核检查点:“用例级别指标显示可SHIP,但集级别质量评估低于预期。发布前请抽检部分通过的响应。”
5. 优先级Top3动作 —— 遵循分诊手册第三层(修复映射)
按优先级顺序列出恰好3个动作。每个动作必须遵循手册的修复模式:修改X → 重跑Y → 预期结果Z。
使用手册的优先级顺序排序:
- 安全与合规故障优先
- 核心业务故障(最高频查询类型)
- 得分最低的评估集
- 重复故障(同一用例跨运行失败)
所需的明确度示例:
- "修改: 将产品FAQ文档添加到Agent的知识源。重跑: 用例4和7(都显示“未回答问题”)。预期: 产品相关查询的相关性维度通过。"
- "修改: 在系统提示词中添加上报指令:“如果你无法解决请求,主动提出为用户连接人工客服。” 重跑: 用例3(“联系客服代表”)。预期: 相关性维度通过。"
- "修改: 更新用例5的预期响应——它引用了过时的流程。重跑: 仅用例5。预期: Compare Meaning得分提升(这是评估设置修复,而非Agent修复)。"
6. 模式分析 —— 遵循分诊手册第四层
检查手册中以下跨信号模式:
| 模式 | 大概率指示 |
|---|---|
| 所有故障都包含“未回答问题” | 知识源缺口或范围定义问题 |
| 事实准确性和事实一致性同时失败 | 知识源问题(检索到错误文档或文档缺失) |
| 准确性通过但语气/质量失败 | 答案正确,交付不佳 —— 需要添加风格指令 |
| 安全通过但准确性失败 | Agent可能约束过度 —— 检查安全限制 |
| 所有故障集中在某一类问题 | 系统性缺口 —— 修复整个类别,而非单个用例 |
| 80%以上的故障是评估设置问题 | 暂停Agent相关工作 —— 先审计并修复评估 |
| 变更后一个信号改善,另一个信号退化 | 指令冲突(指令预算问题) |
同时检查集中度:如果大多数故障共享同一根因类型,指出该情况。根据手册:“80%以上的故障属于同一根因 = 系统性问题,修复整个类别。”
7. 解读依据(说明背后的原因)
展示分诊结果后,解释推理过程,以便客户下次可以独立应用该框架。覆盖以下4点:
- 为什么得出该判定: 用实际数值走一遍决策树。示例:“安全用例通过率100%(高于95%阈值),所以我们没有判定BLOCK。核心业务通过率72%(低于80%阈值),所以判定为ITERATE——尽管总通过率是78%,核心业务的缺口是决策的驱动因素。”
- 为什么故障被分类为对应根因: 针对每个使用的根因类型,解释推理链。示例:“用例3和7被分类为评估设置问题,因为Agent的实际响应本质上是正确的——它准确回答了问题——但评分者因为表述差异拒绝了它。预期响应是‘拨打1-800-555-0100联系支持’,而Agent回答‘你可以拨打1-800-555-0100联系我们的支持团队’。信息相同,表述不同。这是评分规则僵化问题,而非Agent问题。”
- 为什么Top3动作是该优先级顺序: 将每个动作的优先级和分诊框架关联。示例:“知识源修复排在第1位,因为它解决了6个Agent故障中的4个,且这些都是核心业务场景。提示词调整排在第2位,因为它修复2个故障,但这些是能力场景,在手册优先级顺序中排名更低。评估修复排在第3位,因为它不会改进Agent——只是修正了测量方式。”
- 本次分诊没有告诉你的内容: 说明局限性。示例:“本次分诊基于单次评估运行。它无法检测非确定性问题(运行评估3次检查方差)。它也无法评估你的测试用例是否覆盖了正确的场景——覆盖差的通过评估比覆盖好的失败评估更糟糕。”
本部分传授方法论,以便客户最终可以不依赖本技能自行解读结果。每个要点都必须引用本次评估运行的具体数据,而非通用建议。
8. 下次运行建议
最后用一句话明确说明修改后应该重跑的内容。遵循手册的重跑靶向规则:
| 修改内容 | 重跑范围 |
|---|---|
| 单个测试用例(评估修复) | 仅受影响的测试用例 |
| Agent配置变更 | 受影响的测试用例 + 抽查一个不相关的测试集 |
| 系统提示词变更 | 全量评估套件 |
| 知识源更新 | 所有事实一致性和事实准确性用例 |
提示: 重跑后,使用Copilot Studio的结果对比功能对比新旧运行结果。它会展示哪些用例从通过→失败或失败→通过,方便你验证变更修复了目标故障,且没有引入回归。
8b. 版本对比解读(当客户提供两次运行结果时)
如果客户提供了两次评估运行的结果(变更前/后,或两个Agent配置),除了上述标准分诊外,还要生成对比分析。支持的输入格式:两个CSV文件、两个粘贴的摘要,或类似“运行1通过率78%,运行2通过率85%”的描述。
对比表:
| 指标 | 运行1(之前) | 运行2(之后) | 变化 |
|---|---|---|---|
| 总通过率 | X% | Y% | +/-Z% |
| 核心业务通过率 | X% | Y% | +/-Z% |
| 安全通过率 | X% | Y% | +/-Z% |
| 能力通过率 | X% | Y% | +/-Z% |
用例级别变化分析:
将每个测试用例分类到4个桶中:
| 桶 | 含义 | 动作 |
|---|---|---|
| 通过-通过(稳定) | 两次运行都通过,无回归 | 无,但将这些作为回归基线记录 |
| 失败-通过(已修复) | 之前失败,现在通过,变更生效 | 验证修复是真实的(不是非确定性导致)。再运行2-3次确认稳定性 |
| 通过-失败(回归) | 之前通过,现在失败,变更破坏了原有功能 | 最高优先级。 回归比已有故障更严重,因为它们代表了退步。立即调查 |
| 失败-失败(持续存在) | 两次运行都失败,变更没有效果 | 重新检查根因。如果修复本该解决这个用例但没有生效,说明诊断错误 |
解读变化:
- 运行之间总方差在+/-5%是正常的(LLM非确定性)。不要为小波动庆祝或恐慌。运行评估3次取中位数,区分信号和噪声。
- 同一Agent版本下,用例在不同运行之间切换结果(一次通过,一次失败)是可靠性问题,而非质量问题。单独标记。
- 变更后回归数量多于修复数量 说明变更的净影响为负,考虑回滚。
- 所有修复都在一个类别,所有回归都在另一个类别 = 指令冲突。修复安全响应的提示词变更可能过度约束了业务响应。这是系统提示词编辑产生意外副作用时最常见的模式。
能力 vs 回归框架: 帮助客户理解每种评估运行类型的用途:
- 能力评估运行 针对Agent当前失败的难场景。初始通过率预期较低。成功标准 = 迭代过程中通过率提升。这些是拉伸目标。
- 回归评估运行 在变更后重跑之前通过的测试用例。通过率应该接近100%。任何下降都是必须调查的回归。这些是防护栏。
健康的评估实践会同时使用两者:能力评估推动Agent进步,回归评估确保它不会退步。如果客户只运行其中一种,建议添加另一种。
Step 3 — Generate output file
步骤3 —— 生成输出文件
After displaying the triage report in conversation, generate a formatted report:
Eval Results Triage Report (.docx)
Use the docx skill to create a formatted document containing:
- Title: "Eval Results Triage Report"
- Date and agent name (if known)
- Score summary table
- Verdict (SHIP/ITERATE/BLOCK) with explanation
- Failure triage details for each failing case
- Top 3 prioritized actions
- Pattern analysis
- Interpretation rationale (from section 7 — the WHY behind the verdict, classifications, and priorities)
- Human review checkpoints table (from Step 4)
- Next-run recommendation
在对话中展示分诊报告后,生成格式化报告:
评估结果分诊报告(.docx)
使用docx技能创建格式化文档,包含:
- 标题:“评估结果分诊报告”
- 日期和Agent名称(如果已知)
- 得分摘要表
- 判定(SHIP/ITERATE/BLOCK)及说明
- 每个失败用例的故障分诊详情
- 优先级Top3动作
- 模式分析
- 解读依据(来自第7节——判定、分类和优先级背后的原因)
- 人工审核检查表(来自步骤4)
- 下次运行建议
Step 4 — Human review checkpoints
步骤4 —— 人工审核检查点
After the output file and before the conversation ends, display a Human Review Required section. Eval interpretation is where bad assumptions become bad decisions — a wrong verdict can ship a broken agent or block a good one. These checkpoints flag where human judgment is essential.
Human Review Required
| # | Checkpoint | What to verify | Why it matters |
|---|---|---|---|
| 1 | Verdict matches your business reality | The thresholds that produced SHIP/ITERATE/BLOCK are defaults. Does the verdict align with what you'd actually be comfortable deploying? A "SHIP" at 86% may be unacceptable for a healthcare agent; an "ITERATE" at 78% may be fine for an internal FAQ bot. | Only your team knows your actual risk tolerance. The verdict is a recommendation, not a decision. |
| 2 | Eval setup issues are real, not excuses | For every failure classified as "eval setup issue," read the agent's actual response yourself. Is it truly acceptable? Or is the AI giving the agent the benefit of the doubt? | Misclassifying agent failures as eval issues means real problems get ignored. The 20% estimate is a starting point, not a free pass. |
| 3 | Root cause groupings make sense | When failures are grouped ("Cases 3, 5, 7 share a root cause"), verify they actually stem from the same problem. Different symptoms can look similar from CSV data alone. | Wrong grouping means wrong fix means wasted iteration. One bad grouping can send you fixing the wrong thing for a full cycle. |
| 4 | Top 3 actions are feasible and correctly prioritized | Can you actually make the suggested changes? Is the priority order right for your timeline and constraints? A knowledge source fix may be suggested first but take 2 weeks; a prompt tweak may be faster and unblock you now. | The recommended priority is based on impact, but your team knows the effort and dependencies. |
| 5 | 100% pass rate is investigated, not celebrated | If the result is 100%, do NOT ship without adding harder test cases. Check: Are expected responses too vague? Are test methods too lenient? Are you only testing the happy path? | A perfect score almost always means the eval is too easy, not that the agent is perfect. |
| 6 | Remediation will not break passing scenarios | Before making changes based on the top 3 actions, check whether those changes could affect currently-passing test cases. Prompt changes especially have ripple effects. | Fixing 3 failures while introducing 5 new ones is a net loss. Always re-run the full suite after changes. |
After the checkpoints, add:
- Mandatory reminder: "This triage report was AI-generated from your eval results. Before acting on the verdict or remediation actions, review the failing cases with your team — especially any classified as eval setup issues. The distinction between an agent problem and an eval problem requires human judgment."
在输出文件之后、对话结束之前,展示需要人工审核部分。评估解读是错误假设变成错误决策的环节——错误的判定可能会发布有问题的Agent,或阻断合格的Agent。这些检查点指出了需要人工判断的关键环节。
需要人工审核
| 序号 | 检查点 | 验证内容 | 重要性 |
|---|---|---|---|
| 1 | 判定符合你的业务实际 | 生成SHIP/ITERATE/BLOCK的阈值是默认值。判定结果是否和你实际可以接受的发布标准一致?86%通过率的“SHIP”对于医疗Agent可能不可接受;78%通过率的“ITERATE”对于内部FAQ机器人可能完全没问题。 | 只有你的团队知道实际的风险容忍度。判定是建议,而非决策。 |
| 2 | 评估设置问题是真实的,不是借口 | 针对每个被分类为“评估设置问题”的故障,亲自阅读Agent的实际响应。它真的可接受吗?还是AI在偏袒Agent? | 将Agent故障错误分类为评估问题会导致真实问题被忽略。20%的估算是起点,不是免死金牌。 |
| 3 | 根因分组合理 | 当故障被分组时(“用例3、5、7共享同一根因”),验证它们确实来自同一问题。仅从CSV数据来看,不同的症状可能看起来相似。 | 错误分组意味着错误修复,意味着迭代浪费。一个错误的分组可能让你整个周期都在修复错误的内容。 |
| 4 | Top3动作可行且优先级正确 | 你真的可以实施建议的变更吗?优先级顺序是否符合你的时间线和约束?可能建议优先修复知识源,但需要2周时间;而提示词调整更快,现在就能解除阻塞。 | 推荐的优先级基于影响,但你的团队知道工作量和依赖关系。 |
| 5 | 100%通过率已被调查,而非直接庆祝 | 如果结果是100%,在添加更难的测试用例前不要发布。检查:预期响应是不是太模糊?测试方法是不是太宽松?你是不是只测试了正常路径(happy path)? | 满分几乎总是意味着评估太简单,而不是Agent完美。 |
| 6 | 修复不会破坏已通过的场景 | 在基于Top3动作实施变更前,检查这些变更是否会影响当前通过的测试用例。提示词变更尤其会产生连锁反应。 | 修复3个故障的同时引入5个新故障是净损失。变更后始终重跑全量套件。 |
在检查点之后添加:
- 强制提醒: “本分诊报告是基于你的评估结果AI生成的。在根据判定或修复动作采取行动前,请和你的团队一起审核失败用例——尤其是被分类为评估设置问题的用例。区分Agent问题和评估问题需要人工判断。”
Data Retention Warning
数据保留提醒
Copilot Studio deletes test run results after 89 days. Always recommend that the user:
- Export the results CSV immediately after each eval run (Test set → Export results)
- Store alongside the agent version in SharePoint or a repo
- If the report recommends re-running after fixes, export the current results before changes so before/after comparison is possible
Include this reminder at the end of every generated report.
Copilot Studio会在89天后删除测试运行结果。始终建议用户:
- 每次评估运行后立即导出结果CSV(测试集 → 导出结果)
- 和Agent版本一起存储在SharePoint或代码仓库中
- 如果报告建议修改后重跑,在变更前导出当前结果,以便可以进行前后对比
在每份生成的报告末尾都包含该提醒。
Behavior rules
行为规则
- State the verdict FIRST, before any analysis.
- BLOCK immediately if any safety/compliance test case fails. Per the Triage Playbook: safety failures are non-negotiable.
- Always check whether failures are eval setup issues before blaming the agent (Layer 2, Step 1). This is the most common mistake in eval interpretation.
- If pass rate is 100%, treat it as a red flag and say so.
- If input is too sparse for a confident verdict, default to ITERATE and explain why.
- When you cannot determine if a failure is an agent issue or eval setup issue from the CSV alone, say so explicitly and tell the user to read the actualResponse for that row.
- Per the Playbook's non-determinism guidance: if the user mentions running evals multiple times, +/-5% variance is normal. +/-10% requires investigation.
- 首先展示判定,再做任何分析。
- 如果任何安全/合规测试用例失败,立即判定BLOCK。根据分诊手册:安全故障没有协商空间。
- 在归因为Agent问题前,始终先检查故障是否是评估设置问题(第二层步骤1)。这是评估解读中最常见的错误。
- 如果通过率为100%,将其视为危险信号并明确说明。
- 如果输入信息太少,无法给出有信心的判定,默认返回ITERATE并说明原因。
- 当仅从CSV无法确定故障是Agent问题还是评估设置问题时,明确说明,并告诉用户查看对应行的actualResponse。
- 根据手册的非确定性指导:如果用户提到多次运行评估,+/-5%的方差是正常的。+/-10%需要调查。
Example invocations
调用示例
/eval-result-interpreter C:\Users\me\Downloads\Evaluate Agent 260310_1652.csv
/eval-result-interpreter 5/9 passed. Failed: case 3 (relevance), case 4 (relevance), case 5 (incomplete), case 7 (relevance).
/eval-result-interpreter All 8 cases passed on first run.
/eval-result-interpreter [paste CSV contents here]