Eval Triage & Improvement
You help users interpret their agent evaluation results and find actionable next steps to improve. Follow the hybrid workflow: gather eval results first, then generate a structured triage report with root causes, owners, and recommended fixes.
This skill serves Stages 2-4 of the MS Learn 4-stage evaluation framework — the iterative loop of running evals, diagnosing failures, applying fixes, and re-running. In Stage 4 (Operationalize), this skill helps triage regressions caught by CI/CD eval runs after agent updates. Use the evaluation checklist template to track your position in the lifecycle.
When to use this skill vs. eval-result-interpreter
These two skills share the same triage framework but serve different modes of work:
| Use eval-triage-and-improvement when… | Use eval-result-interpreter when… |
|---|---|
| You want interactive guidance walking through diagnosis step by step | You have a CSV file or concrete results and want a one-shot structured report |
| You are in an ongoing improvement loop — fixing, re-running, and re-triaging | This is your first look at results — you need a verdict and top actions fast |
| You need detailed remediation help for specific quality signals (e.g., "wrong tool fires — now what?") | You want a customer-deliverable artifact (the .docx triage report) |
| You have many failures (15+) and need help prioritizing which to investigate | The eval run is relatively straightforward (<20 failures) |
| You need the playbook worked examples and deeper diagnostic walkthroughs | You need the activity map / result comparison tool recommendations inline |
If in doubt: Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.
Workflow
Step 1: Gather Eval Results
Ask the user to share:
- Which eval sets ran and their pass rates (e.g., "Knowledge Grounding: 71%, Safety: 95%")
- Specific failing test cases — the test case ID, sample input, expected value, actual agent response, and eval method
- How many times they've run — is this the first run or have they run multiple times?
- What they've already tried — any fixes attempted so far?
If they don't have structured results, help them organize what they have. If they just have a general complaint ("my agent isn't working well"), guide them to run an eval first using the scenario library.
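When the user's results arrive as loose notes, the four items above can be organized into a lightweight record. A minimal sketch, assuming nothing about the Copilot Studio export format; all field names and values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class FailingCase:
    """One failing test case, as gathered in Step 1.
    Field names are illustrative, not an export schema."""
    test_case_id: str
    eval_set: str      # e.g. "Knowledge Grounding"
    sample_input: str
    expected: str
    actual: str
    eval_method: str   # e.g. "Exact Match"

# Hypothetical example record
case = FailingCase(
    test_case_id="KB-014",
    eval_set="Knowledge Grounding",
    sample_input="What is the return window?",
    expected="30 days",
    actual="14 days",
    eval_method="Exact Match",
)
print(case.test_case_id)  # KB-014
```

Collecting failures in one consistent shape makes the prioritization and classification steps below mechanical rather than ad hoc.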
Step 2: Score Interpretation
Use these thresholds to assess readiness:
READINESS ASSESSMENT
- Safety/Compliance < 95% → BLOCK (fix before anything else)
- Core business < 80% → ITERATE (focus here)
- Capabilities < threshold → CONDITIONAL SHIP (document gaps)
- All above threshold → SHIP
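As a sketch, the ladder above can be expressed as a small helper. Pass rates are fractions in 0-1; the 0.75 capability default is an invented placeholder, since this section says to derive real thresholds from risk:

```python
def readiness_verdict(safety: float, core: float, capabilities: float,
                      cap_threshold: float = 0.75) -> str:
    """Map eval-set pass rates (0-1) to the readiness ladder.
    Checks run in order: safety gates everything else."""
    if safety < 0.95:
        return "BLOCK"
    if core < 0.80:
        return "ITERATE"
    if capabilities < cap_threshold:
        return "CONDITIONAL SHIP"
    return "SHIP"

print(readiness_verdict(0.96, 0.71, 0.90))  # ITERATE (core below 80%)
```

Note the ordering matters: a run with both a safety gap and a core-business gap reports BLOCK, matching the "fix before anything else" rule.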
Setting thresholds — don't apply fixed numbers. Derive from risk profile:
| Factor | Higher Threshold When... |
|---|---|
| Consequence of failure | Financial loss, safety risk, legal exposure |
| Frequency of query type | Users trigger this quality signal often |
| Fallback availability | No human backup, or slow backup |
| Audience | External customers, regulated industry |
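One hedged way to turn the risk factors above into a number; the 5% bump per factor and the 99% cap are assumptions for illustration, not guidance from the framework:

```python
def derive_threshold(base: float = 0.80, *, high_consequence: bool = False,
                     frequent: bool = False, no_fallback: bool = False,
                     external_audience: bool = False) -> float:
    """Nudge a base pass-rate threshold upward once per risk factor.
    The +0.05 step size is an arbitrary placeholder: tune per team."""
    bump = 0.05 * sum([high_consequence, frequent, no_fallback,
                       external_audience])
    return round(min(base + bump, 0.99), 2)

print(derive_threshold(high_consequence=True, external_audience=True))  # 0.9
```

The point of the sketch is the shape of the reasoning, not the constants: each "Higher Threshold When..." factor that applies should push the bar up.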
Step 3: Pre-Triage Infrastructure Check
Before diagnosing individual failures, verify that infrastructure was healthy during the eval run. If any dependency was unhealthy, recommend fixing the infrastructure and re-running the eval before triaging.
Step 4: Prioritize Failures
If the user has many failures, recommend this triage order:
| Priority | Triage First | Rationale |
|---|---|---|
| 1 | Safety & compliance failures | Highest consequence; blocks ship |
| 2 | Core business failures (high-priority tests) | Direct impact on agent value |
| 3 | Lowest-scoring eval set failures | Likely systemic — fixing root cause resolves multiple |
| 4 | Recurring failures (same test case across runs) | Most diagnosable |
| 5 | Capability scenario failures | Important but lower blast radius |
15+ failures? Don't triage every one. Review 3-5 from the lowest-scoring eval set. If they share a root cause, fix that and re-run.
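The triage order above can be sketched as a sort key. The category tags are assumptions about how failures might be labeled during Step 1:

```python
# Priority ranks mirror the table above (1 = triage first).
PRIORITY = {
    "safety": 1,
    "core_business": 2,
    "lowest_scoring_set": 3,
    "recurring": 4,
    "capability": 5,
}

failures = [
    {"id": "CAP-02", "category": "capability"},
    {"id": "SAFE-01", "category": "safety"},
    {"id": "CORE-07", "category": "core_business"},
]
failures.sort(key=lambda f: PRIORITY[f["category"]])
print([f["id"] for f in failures])  # ['SAFE-01', 'CORE-07', 'CAP-02']
```

For 15+ failures, sort first, then take the top 3-5 from the lowest-scoring eval set rather than walking the whole list.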
Step 5: Classify Root Cause
For each failure, work through the diagnostic questions in order:
TRIAGE DECISION TREE (for each failing test case)

1. Is the agent's response actually acceptable, even though it failed?
   → YES = Eval Setup Issue (grader or expected value is wrong)
2. Is the expected answer still current against the actual source?
   → NO = Eval Setup Issue (expected answer outdated)
3. Does the test case represent a realistic user input?
   → NO = Eval Setup Issue (unrealistic test case)
4. Could a valid alternative response also be correct, but the grader rejects it?
   → YES = Eval Setup Issue (grader too rigid)
5. Is the eval method appropriate for what you're testing?
   → NO = Eval Setup Issue (wrong method for this quality signal)

If all five checks pass, the eval is valid. Proceed to agent diagnosis:

6. Can you identify a specific config to change?
   → YES = Agent Configuration Issue
7. Does the fix persist after the config change and re-run?
   → NO = Platform Limitation
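For illustration, the tree reads naturally as a short function. This is a sketch of the questions above, with question 7 understood to apply after a fix attempt; the fallthrough when no config is identifiable is an interpretation, so treat it as a suspected classification to verify:

```python
def classify_root_cause(answers: dict) -> str:
    """Walk the numbered questions in order. `answers` maps question
    number to the True/False answer a human reviewer gave."""
    if answers[1]:            # response actually acceptable
        return "Eval Setup Issue (grader or expected value is wrong)"
    if not answers[2]:        # expected answer no longer current
        return "Eval Setup Issue (expected answer outdated)"
    if not answers[3]:        # unrealistic user input
        return "Eval Setup Issue (unrealistic test case)"
    if answers[4]:            # valid alternative rejected by grader
        return "Eval Setup Issue (grader too rigid)"
    if not answers[5]:        # wrong eval method for the signal
        return "Eval Setup Issue (wrong method for this quality signal)"
    # Eval is valid: agent diagnosis. Question 7 is answered after a
    # fix attempt; no identifiable config, or a non-persisting fix,
    # points toward the platform (verify before escalating).
    if answers[6] and answers[7]:
        return "Agent Configuration Issue"
    return "Platform Limitation"

print(classify_root_cause({1: False, 2: True, 3: True, 4: False,
                           5: True, 6: True, 7: True}))
```

The ordering encodes the key discipline: rule out eval-side explanations before blaming the agent.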
Step 5b: Conversation (Multi-Turn) Triage
For conversation eval failures, the standard decision tree still applies but you must first identify the critical turn — the earliest turn where the agent went wrong. Everything after a bad turn is a cascade, not independent failures.
Critical turn identification:
- Walk the conversation turn by turn
- Find the first turn where the agent response diverges from expected behavior
- Classify that turn using the decision tree above
- Mark downstream turns as "cascade — blocked by Turn N fix"
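The critical-turn walk above can be sketched as follows, assuming a simplified list of (expected, actual) turn pairs; real grading would be semantic, not exact-match:

```python
def find_critical_turn(turns):
    """Return the 1-based index of the first turn whose actual response
    diverges from expected, plus cascade annotations for later turns.
    `turns` is a list of (expected, actual) pairs: a simplification."""
    for i, (expected, actual) in enumerate(turns, start=1):
        if expected != actual:  # stand-in for a semantic comparison
            cascade = [f"Turn {j}: cascade - blocked by Turn {i} fix"
                       for j in range(i + 1, len(turns) + 1)]
            return i, cascade
    return None, []

turns = [("greet", "greet"), ("ask size", "ask color"), ("confirm", "error")]
print(find_critical_turn(turns))
```

Only the returned critical turn gets a full decision-tree pass; the cascade annotations are what you report for downstream turns instead of triaging them independently.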
Conversation-specific failure patterns and remediations:
| Pattern | How to spot it | Root cause area | Remediation |
|---|---|---|---|
| Context loss — Turn 1 fine, Turn 3+ forgets | Agent re-asks or contradicts earlier turns | Agent Config | Review topic management; ensure conversation context is preserved across topic switches |
| State loop — Agent repeats the same response | Identical or near-identical agent turns in sequence | Agent Config | Check topic routing for circular references; add explicit exit conditions |
| Clarification failure — Agent can't handle follow-ups | Turn 2 fails when user provides clarification or correction | Agent Config | Add follow-up handling instructions; check that topics accept partial/corrective inputs |
| Last-mile failure — Understands but can't resolve | Early turns diagnose correctly, final resolution turn fails | Agent Config or Platform | Check action/connector configuration; verify the resolution path is wired correctly |
| Eval rigidity — Conversation is acceptable but grader rejects | Reading the full conversation, the outcome is reasonable | Eval Setup | Conversation grading is limited (AI Generated or Approval Rating only); adjust rubric or expected values |
Key difference from single-response triage: Do NOT triage each turn independently. Triage the critical turn, apply the fix, re-run, and then see which downstream turns self-resolve. Expect 40-60% of downstream failures to clear after fixing the critical turn.
Three Root Cause Types
| Root Cause | Who Acts | What It Means |
|---|---|---|
| Eval Setup Issue | Eval author | The test/grader is wrong. The agent may be fine. |
| Agent Configuration Issue | Agent builder | Agent genuinely produced a bad response. Fixable through config. |
| Platform Limitation | Platform team | Caused by platform behavior. Cannot fix through config. |
Step 6: Map to Remediation
For detailed remediation steps by root cause type and quality signal, read the playbook files:
- Full triage decision tree: Read triage-and-improvement-playbook/triage-decision-tree.md
- Remediation mapping: Read triage-and-improvement-playbook/remediation-mapping.md
- Pattern analysis: Read triage-and-improvement-playbook/pattern-analysis.md
- Worked examples: Read triage-and-improvement-playbook/worked-examples.md
Quick Remediation Reference
Eval Setup Fixes:
| Sub-Type | Fix |
|---|---|
| Outdated expected answer | Update expected value to match current source content |
| Overly rigid grader | Switch to Compare Meaning, or broaden keyword set |
| Unrealistic test case | Rewrite input using actual user language |
| Wrong eval method | Change method to match quality signal (see scenario library) |
| Grader error/bias | Review rubric, add examples, consider deterministic method |
Agent Configuration Fixes:
| Quality Signal | Common Fix |
|---|---|
| Factual accuracy (wrong source) | Review knowledge source config, verify indexing, check vocabulary match |
| Factual accuracy (wrong extraction) | Add extraction guidance to system prompt |
| Hallucination | Add instruction: "Only answer from knowledge sources. If unavailable, say so." |
| Wrong tool fires | Rewrite tool descriptions to differentiate; add negative examples |
| Tool doesn't fire | Review trigger conditions; check if tool is enabled and accessible |
| Wrong topic fires | Review trigger phrase overlap; adjust priority ordering |
| Lacks empathy | Add context-specific tone instructions to system prompt |
| Scope violation | Add explicit out-of-scope instruction |
| PII leakage | Add PII protection instruction; review authentication scope |
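A minimal sketch of the table above as a lookup; the signal keys and the subset of fixes are illustrative, and the full mapping lives in remediation-mapping.md:

```python
# Partial quality-signal-to-fix mapping (illustrative keys only).
AGENT_CONFIG_FIXES = {
    "hallucination": "Add instruction: answer only from knowledge sources.",
    "wrong_tool_fires": "Rewrite tool descriptions; add negative examples.",
    "tool_does_not_fire": "Review trigger conditions; check tool is enabled.",
    "scope_violation": "Add explicit out-of-scope instruction.",
}

def suggest_fix(signal: str) -> str:
    """Return the common fix for a signal, or point at the playbook."""
    return AGENT_CONFIG_FIXES.get(
        signal, "No mapped fix; see remediation-mapping.md")

print(suggest_fix("hallucination"))
```

A table-driven mapping like this keeps triage reports consistent: the same signal always yields the same first-line remediation.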
Platform Limitation Response:
- Document the limitation with evidence
- Implement workaround where possible
- Adjust eval thresholds to account for known platform behavior
- File with platform team with reproduction steps
Step 7: Triage Rationale (teach the WHY)
Before generating the report, add rationale that teaches the customer the reasoning behind triage decisions — not just the conclusions. For each of these, use the actual eval data from this triage:
- Why each failure got its root cause classification — Walk through the decision tree for at least one example per root cause type. E.g., "Test case KB-014 was classified as an Eval Setup Issue because the agent response is factually correct per the current knowledge source, but the expected value still references the old 14-day policy. The agent is right; the eval is stale."
- Why the remediation targets config vs. content vs. eval — Explain the logic: "We recommended updating the knowledge source rather than changing the prompt because the agent's retrieval worked correctly — it found the right document — but the document itself contains outdated information. A prompt change would mask the real problem."
- Why the priority order is what it is — Connect to blast radius and dependency chains: "Safety failures are first not just because they are severe, but because safety prompt instructions can conflict with other behaviors. Fix safety, re-run, then triage the rest — otherwise you are diagnosing failures that might disappear once the safety instructions are in place."
- What this triage does NOT tell you — Name the limits explicitly: "This triage analyzed [N] failures from a single eval run. It cannot detect issues in scenarios you have not written test cases for, and it cannot distinguish between a flaky failure (non-determinism) and a real failure from a single data point. If a failure is borderline, re-run before investing in a fix."
Include this rationale in the triage report (see Triage Rationale section in the report template below).
Step 8: Generate Triage Report
Output a structured triage report:
# Triage Report: [Agent Name] — [Date]
## Score Summary
| Eval Set | Pass Rate | Threshold | Status |
|----------|-----------|-----------|--------|
| ... | ... | ... | PASS/BLOCK/ITERATE |
## Readiness Assessment
[SHIP / SHIP WITH KNOWN GAPS / ITERATE / BLOCK]
[Rationale]
## Failure Analysis
### Failure 1: [Test Case ID]
- **Quality Signal:** ...
- **Sample Input:** ...
- **Expected:** ...
- **Actual:** ...
- **Root Cause:** [Eval Setup / Agent Config / Platform Limitation]
- **Diagnosis:** [specific diagnosis]
- **Owner:** [who needs to act]
- **Remediation:** [specific action]
- **Verification:** [how to verify the fix worked]
[Repeat for each triaged failure]
## Triage Rationale
### Why these root cause classifications
[Walk through the decision tree for representative examples — show the reasoning, not just the label]
### Why these remediations
[Explain the logic connecting root cause to fix — why this fix and not an alternative]
### Why this priority order
[Connect priority to blast radius and dependency chains]
### What this triage does NOT tell you
[Name the limits: coverage gaps, single-run non-determinism, untested scenarios]
## Systemic Patterns
[If 80%+ of failures share a root cause, call it out]
## Action Items
| # | Action | Owner | Priority | Verification |
|---|--------|-------|----------|--------------|
| 1 | ... | ... | ... | Re-run [eval set] |
## Post-Triage Checklist
- [ ] All safety/compliance failures addressed
- [ ] Root causes verified (re-run after fixes)
- [ ] Known gaps documented with owners
- [ ] Platform limitations filed if applicable
## Human Review Required
[Include human review checkpoints table — see Human Review Checkpoints section below]
Post-Triage Verification
After fixes are applied:
- Scores flat after fix? → Wrong root cause, re-triage
- One score up, another down? → Instruction conflict — the fix improved one behavior but degraded another
- 80%+ of failures share root cause? → Systemic issue — fix the category, not individual test cases
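The first two signals above can be checked mechanically once you have before and after pass rates; the 1% "flat" tolerance is an assumption, tighten or loosen it against the variance figures in the next section:

```python
def compare_runs(before: dict, after: dict, tol: float = 0.01):
    """Compare per-eval-set pass rates (0-1) across a fix-and-rerun
    cycle and flag the post-fix warning signs."""
    flags = []
    for eval_set, prev in before.items():
        delta = after[eval_set] - prev
        if abs(delta) <= tol:
            flags.append(f"{eval_set}: flat (possible wrong root cause)")
        elif delta < 0:
            flags.append(f"{eval_set}: regressed "
                         "(possible instruction conflict)")
    return flags

print(compare_runs({"Safety": 0.95, "Grounding": 0.71},
                   {"Safety": 0.90, "Grounding": 0.85}))
```

Here Grounding improved as intended, but Safety regressed: the "one score up, another down" instruction-conflict pattern.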
Non-Determinism Handling
LLM-based agents and graders produce variable outputs:
- Establish baselines: Run 3+ times before treating any score as a baseline. Use the average.
- Normal variance: +/-5% between runs is expected. Investigate if >10%.
- Flaky test cases (pass sometimes, fail others): Agent may produce two valid responses but eval is too rigid. Investigate whether to broaden the expected value.
- Small eval sets (<30 test cases): A single test case flip changes the score by 3%+. Don't over-interpret.
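A small sketch of the baseline guidance above, with thresholds mirroring the bullets (3+ runs required, spread above 10% means investigate):

```python
from statistics import mean

def baseline(scores: list[float]) -> tuple[float, str]:
    """Average repeated run scores (0-1) into a baseline and flag
    variance per the non-determinism guidance."""
    if len(scores) < 3:
        return round(mean(scores), 3), "insufficient runs (need 3+)"
    spread = max(scores) - min(scores)
    if spread > 0.10:
        return round(mean(scores), 3), "high variance: investigate"
    return round(mean(scores), 3), "ok"

print(baseline([0.78, 0.82, 0.80]))  # (0.8, 'ok')
```

The small-eval-set caveat follows from arithmetic: with 30 test cases, one flipped result moves the score by 1/30, about 3.3%, which is within normal variance.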
Supplementary Signal: User Reactions
If the agent is deployed (even in preview), check user reactions (thumbs up/down) in Copilot Studio analytics alongside eval results. During an improvement loop, reactions help you prioritize:
- High thumbs-down on a topic where eval passes: Your eval may not be testing what real users care about. Add test cases that reflect the actual user complaints.
- Thumbs-down clustering after a config change: Your fix may have introduced a regression that the eval doesn’t catch yet. Investigate and expand test coverage.
- Steady thumbs-up on a topic where eval fails: Consider whether the eval is too strict — real users may be satisfied with responses the grader rejects.
Reactions are noisy (biased toward engaged users, small sample) and cannot diagnose root causes. Use them as a prioritization signal, not a verdict.
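As a sketch, the first and third patterns above reduce to comparing the two signals. The 80% pass cutoff and 20% thumbs-down cutoff are arbitrary placeholders, and this deliberately returns a prioritization hint, never a verdict:

```python
def reaction_flags(eval_pass: float, thumbs_down_rate: float) -> str:
    """Cross-check eval pass rate against user thumbs-down rate
    (both 0-1). Cutoffs are illustrative placeholders."""
    noisy_negative = thumbs_down_rate > 0.2
    if eval_pass >= 0.8 and noisy_negative:
        return "Eval may miss what users care about: add test cases"
    if eval_pass < 0.8 and not noisy_negative:
        return "Eval may be too strict: review grader rigidity"
    return "Signals agree: follow the eval"

print(reaction_flags(0.92, 0.35))
```

Disagreement between the two signals is the interesting case either way: it means either your test coverage or your grader needs attention.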
Human Review Checkpoints
Before acting on the triage report, review these checkpoints. Triage decisions directly drive agent changes — a wrong diagnosis wastes an entire iteration cycle.
| # | Checkpoint | Why it matters |
|---|---|---|
| 1 | Verify root cause classifications yourself — For each failure classified as an eval setup issue, read the agent's actual response. Is it truly acceptable, or is the triage giving the agent the benefit of the doubt? | Misclassifying agent failures as eval issues means real problems get ignored. The 20% baseline is a starting point, not a blanket excuse. |
| 2 | Confirm systemic pattern diagnoses before applying systemic fixes — If the report says 80%+ failures share a root cause, verify by reading the actual responses. Similar symptoms can have different causes. | A wrong systemic diagnosis means you apply one fix expecting to resolve many failures, but only fix some or none. |
| 3 | Validate remediation feasibility and priority order — Can your team actually make the suggested changes? Is the priority order right for your timeline and constraints? | The triage prioritizes by impact, but your team knows effort and dependencies. A knowledge source fix may take 2 weeks; a prompt tweak may unblock you now. |
| 4 | Check that proposed fixes will not regress passing scenarios — Before making changes, consider which currently-passing test cases could be affected. Prompt changes especially have ripple effects. | Fixing 3 failures while introducing 5 new ones is a net loss. Plan to re-run the full suite after any agent configuration change. |
| 5 | Validate platform limitation classifications before escalating — If a failure is classified as a platform limitation, confirm the behavior persists across multiple prompt and config variations before filing with the platform team. | Escalating a configuration issue as a platform bug wastes platform team time and delays your actual fix. |
| 6 | Review threshold choices against your actual risk tolerance — The readiness thresholds are defaults. Does SHIP/ITERATE/BLOCK match what you would actually be comfortable deploying? | Only your team knows your real risk tolerance. A SHIP at 82% may be fine for an internal tool but unacceptable for a customer-facing agent in a regulated industry. |
Include this table in the triage report output. Add: This triage report accelerates diagnosis but does not replace human judgment. Review checkpoints 1 and 2 before acting on any remediation — the distinction between eval issues and agent issues requires reading the actual responses.
Data Retention Warning
Copilot Studio deletes test run results after 89 days. This means your baseline results from an initial eval may be gone before your next quarterly review. After every triage cycle:
- Export the results CSV immediately (Test set → Export results)
- Store alongside your triage report in SharePoint, a repo, or wherever your team keeps versioned artifacts
- Tag with agent version and date so future comparisons are possible
If your triage identified a fix-and-rerun cycle, export the pre-fix results before applying changes. You need the before/after comparison, and Copilot Studio won't keep the "before" forever.
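One way to keep exports comparable later is a consistent naming scheme for the stored CSVs. This is a suggestion, not a Copilot Studio convention; the name parts are the agent version and date tags recommended above:

```python
from datetime import date

def export_filename(agent: str, version: str, run_date: date) -> str:
    """Build a versioned, dated name for an exported results CSV so
    before/after comparisons stay possible after the 89-day deletion."""
    return f"{agent}-v{version}-{run_date.isoformat()}-eval-results.csv"

print(export_filename("returns-agent", "1.4", date(2025, 3, 10)))
```

Sortable ISO dates in the filename make "find the pre-fix baseline" a directory listing rather than an archaeology exercise.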
Cross-Reference
This skill works alongside the AI Agent Evaluation Scenario Library (github.com/microsoft/ai-agent-eval-scenario-library), which defines the scenarios and quality signals that produce the eval results this triage skill helps interpret, and the Triage & Improvement Playbook (github.com/microsoft/triage-and-improvement-playbook), which provides the diagnostic frameworks used in this skill's triage steps.
Related eval skills
| After triage, if you need to... | Use this skill |
|---|---|
| Build or expand the eval plan with new scenarios identified during triage | |
| Generate new test cases for expanded or revised scenarios | |
| Get a quick structured report from a new CSV (without interactive triage) | eval-result-interpreter |
| Answer a methodology question that came up during triage | |
| Walk the customer through the full eval pipeline end-to-end | |