eval-generator
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePurpose
用途
This skill generates concrete eval test cases — with realistic inputs, expected outputs, and evaluation method configurations. It is the second step in the eval lifecycle: plan → generate → run → interpret.
This skill covers Stage 2 (Set Baseline & Iterate) of the MS Learn 4-stage evaluation framework. Use first for Stage 1 (Define), then generate test cases here, run them, and interpret results with . Stage 3 (Systematic Expansion) means repeating this cycle with broader coverage — the checklist defines four expansion categories: Foundational core, Agent robustness, Architecture test, and Edge cases. Stage 4 (Operationalize) means embedding these evals into your agent's CI/CD pipeline. Point customers to the editable checklist template to track their progress across all four stages.
/eval-suite-planner/eval-result-interpreterPrimary mode: If the conversation already contains output from , use that plan’s scenario table, evaluation methods, quality signals, and tags as the blueprint. Generate one test case per row in the plan.
/eval-suite-plannerFallback mode: If no plan exists in the conversation, accept a plain-English agent description and generate test cases from scratch (6-8 cases minimum).
本技能可生成具体的评估测试用例,包含真实输入、预期输出和评估方法配置,是评估生命周期的第二个步骤:计划 → 生成 → 运行 → 解读。
本技能对应微软官方学习资料中 四阶段评估框架 的第2阶段(设置基线与迭代)。请先使用完成第1阶段(定义),再在此处生成测试用例,运行测试后通过解读结果。第3阶段(系统扩展)指扩大覆盖范围重复上述周期, checklist定义了四大扩展类别:基础核心能力、Agent鲁棒性、架构测试、边缘场景。第4阶段(落地运营)指将这些评估嵌入Agent的CI/CD流水线。可引导用户使用可编辑checklist模板跟踪四个阶段的进度。
/eval-suite-planner/eval-result-interpreter主模式:如果对话中已经存在的输出,就使用该计划的场景表、评估方法、质量信号和标签作为蓝图,为计划中的每一行生成一个测试用例。
/eval-suite-planner回退模式:如果对话中不存在评估计划,可接受纯英文Agent描述从零开始生成测试用例(最少6-8个用例)。
Instructions
使用说明
When invoked as (with or without additional input):
/eval-generator当以命令调用时(无论是否附带额外输入):
/eval-generatorStep 1 — Detect input mode
步骤1 — 检测输入模式
Check the conversation history for output from . Look for the scenario plan table (a markdown table with columns: #, Scenario Name, Category, Tag, Evaluation Methods).
/eval-suite-planner- Plan found: Use it as the blueprint. Say: "Generating test cases from your eval suite plan (X scenarios)." Generate one test case per row.
- No plan, but user provides an agent description: Generate from scratch. Say: "Generating eval scenarios for: [agent task in your own words]." If the description is fewer than two sentences or doesn’t mention success criteria, ask exactly one clarifying question, then wait.
- No plan and no description: Say: "I need either an agent description or a plan from . Run
/eval-suite-plannerfirst for the best results, or give me a description and I’ll generate directly."/eval-suite-planner <your agent description>
检查对话历史中是否存在的输出,寻找场景计划表(包含列:#、Scenario Name、Category、Tag、Evaluation Methods的markdown表格)。
/eval-suite-planner- 找到计划:将其作为蓝图使用,回复:"正在根据你的评估套件计划生成测试用例(共X个场景)",每一行对应生成一个测试用例。
- 未找到计划,但用户提供了Agent描述:从零开始生成,回复:"正在为以下场景生成评估方案:[用你自己的话复述Agent任务]"。如果描述少于两句话或未提及成功标准,仅提出一个明确的澄清问题后等待用户回复。
- 未找到计划也没有Agent描述:回复:"我需要Agent描述或生成的计划才能工作。推荐先运行
/eval-suite-planner获得最优结果,也可以直接给我Agent描述我会直接生成测试用例。"/eval-suite-planner <你的Agent描述>
Step 1b — Determine evaluation mode (Single Response vs. Conversation)
步骤1b — 确定评估模式(单响应 vs 对话模式)
Before generating test cases, determine which evaluation mode fits the agent. This affects the output format, available test methods, and import options.
Choose Conversation mode when the agent:
- Handles multi-step tasks that require context across turns (e.g., booking a trip with departure, return, and seat selection)
- Needs to ask clarifying questions before completing a request
- Must maintain state (e.g., remembering a customer’s account after initial identification)
- Has handoff or escalation flows that depend on prior turns
Choose Single Response mode when the agent:
- Answers standalone questions (FAQ, policy lookup, factual retrieval)
- Routes to a single tool per request
- Produces a self-contained output per input (e.g., a summary, a classification)
Default: If the plan or agent description does not indicate multi-turn behavior, default to Single Response.
If Conversation mode is selected, say: "This agent benefits from conversational (multi-turn) evaluation. I will generate conversation test cases — each is a multi-turn dialogue, not a single question."
Then skip to Step 2b below for conversation-specific generation.
生成测试用例前,先确定适合该Agent的评估模式,这会影响输出格式、可用测试方法和导入选项。
满足以下条件时选择对话模式:
- 处理需要跨轮次上下文的多步骤任务(例如预订行程需要确认出发时间、返程时间、座位选择)
- 需要先提出澄清问题再完成请求
- 必须维护状态(例如初始身份验证后记住客户账户信息)
- 存在依赖历史对话的转接或升级流程
满足以下条件时选择单响应模式:
- 回答独立问题(FAQ、政策查询、事实检索)
- 每个请求仅调用单个工具
- 每个输入对应独立输出(例如摘要、分类结果)
默认规则:如果计划或Agent描述未说明多轮交互行为,默认使用单响应模式。
如果选择了对话模式,回复:"该Agent适合对话(多轮)评估,我将生成对话测试用例,每个用例是多轮对话而非单个问题。"
然后直接跳转到下文的步骤2b进行对话场景专属的用例生成。
Step 2a — Generate single-response test cases
步骤2a — 生成单响应测试用例
When generating from a plan: Match each scenario row exactly — use its Scenario Name, Tag, and Evaluation Methods. Translate the evaluation methods into the right expected fields and test configurations.
When generating from scratch (no plan):
- 6-8 total scenarios
- At least 2 happy-path / core business cases
- At least 2 edge cases (empty input, long input, ambiguous input, malformed input)
- At least 1 adversarial case (prompt injection, out-of-scope request, policy violation attempt)
- Fill remaining with whatever gives the most signal for this agent
基于计划生成时:严格匹配每个场景行,使用原有的场景名称、标签和评估方法,将评估方法转换为对应的预期字段和测试配置。
从零开始生成(无计划)时:
- 共6-8个场景
- 至少2个 happy path/核心业务场景
- 至少2个边缘场景(空输入、长输入、模糊输入、格式错误输入)
- 至少1个对抗场景(提示注入、超出范围请求、尝试违反政策)
- 剩余用例选择能为该Agent提供最多质量信号的场景
Step 2b — Generate conversation (multi-turn) test cases
步骤2b — 生成对话(多轮)测试用例
Use this step instead of Step 2a when the agent requires multi-turn evaluation.
Conversation test set constraints (Copilot Studio limits):
- Up to 20 test cases per test set
- Each test case supports up to 12 total messages (6 user-agent Q&A pairs)
- Supported test methods: General Quality, Keyword Match, Capability Use (called "Capabilities match" in conversation UI), Custom (Classification)
- NOT supported in conversation mode: Compare Meaning, Text Similarity, Exact Match
When generating from a plan: Group related scenarios into multi-turn conversations. A single conversation test case can cover 2-3 related scenarios that would naturally occur in sequence (e.g., "check order status" then "request refund on that order" then "confirm refund policy").
When generating from scratch (no plan):
- 4-6 conversation test cases
- At least 2 happy-path multi-step task completions
- At least 1 conversation with a clarification loop (user gives vague input, agent asks, user clarifies)
- At least 1 conversation with a mid-conversation topic switch or escalation
- At least 1 adversarial turn mid-conversation (e.g., user attempts prompt injection after establishing context)
Conversation test case format (displayed in conversation):
For each test case, produce a structured conversation block:
Conversation Test Case #N: [Scenario Name]Turn 1 - User: [realistic user message] Turn 1 - Agent (expected): [expected agent response or behavior description]Turn 2 - User: [follow-up that depends on Turn 1 context] Turn 2 - Agent (expected): [expected response maintaining context]Turn 3 - User: [further follow-up or new request within same session] Turn 3 - Agent (expected): [expected response]Test method: General Quality / Keyword Match / Capability Use / Custom (Classification) Keywords (if Keyword Match): [comma-separated keywords expected in any agent response] What this tests: [1 sentence: what multi-turn capability is being evaluated] Critical turn: [which turn is most likely to fail and why]
Rules for conversation test cases:
- Each turn must build naturally on the previous turn — do not write turns that could stand alone
- Agent expected responses should describe the behavior, not the exact wording (these are reference responses for the LLM judge)
- Include at least one conversation where the user’s intent shifts or expands across turns
- Flag which turn is the "critical" turn — the one most likely to fail (e.g., Turn 3 where context from Turn 1 must be retained)
Important — no CSV import for conversation test sets: Unlike single-response test cases, conversation test sets cannot be imported via CSV. The customer must create them in Copilot Studio using one of these methods:
- Quick conversation set — auto-generates 10 short conversations from agent description and instructions
- Full conversation set — generates conversations from agent knowledge sources or topics (short or long)
- Use test chat — converts the latest test chat session into a conversation test case
- Manual entry — add user questions and agent reference responses turn by turn in the UI
The conversation test cases generated by this skill serve as a planning blueprint — the customer uses them to:
- Guide what they enter manually in the Copilot Studio conversation editor
- Compare against AI-generated conversation sets to check coverage
- Document the multi-turn scenarios that matter before creating test sets in the UI
当Agent需要多轮评估时使用本步骤替代步骤2a。
对话测试集限制(Copilot Studio限制):
- 每个测试集最多包含20个测试用例
- 每个测试用例最多支持12条消息(即6组用户-Agent问答对)
- 支持的测试方法:General Quality、Keyword Match、Capability Use(对话UI中显示为"Capabilities match")、Custom (Classification)
- 对话模式不支持的方法:Compare Meaning、Text Similarity、Exact Match
基于计划生成时:将相关场景组合为多轮对话,单个对话测试用例可覆盖2-3个自然连续出现的相关场景(例如"查询订单状态"→"申请该订单退款"→"确认退款政策")。
从零开始生成(无计划)时:
- 4-6个对话测试用例
- 至少2个happy path多步骤任务完成场景
- 至少1个包含澄清循环的对话(用户给出模糊输入,Agent提问,用户澄清)
- 至少1个包含对话中途切换主题或升级流程的场景
- 至少1个对话中途出现对抗输入的场景(例如用户建立上下文后尝试提示注入)
对话测试用例格式(对话中展示):
每个测试用例生成结构化的对话块:
对话测试用例 #N: [场景名称]第1轮 - 用户: [真实用户消息] 第1轮 - Agent(预期): [预期的Agent响应或行为描述]第2轮 - 用户: [依赖第1轮上下文的 follow-up 问题] 第2轮 - Agent(预期): [保留上下文的预期响应]第3轮 - 用户: [同一会话内的进一步跟进或新请求] 第3轮 - Agent(预期): [预期响应]测试方法: General Quality / Keyword Match / Capability Use / Custom (Classification) 关键词(如果使用Keyword Match): [所有Agent响应中需要出现的逗号分隔关键词] 测试目标: [1句话,说明评估的多轮能力是什么] 关键轮次: [最可能失败的轮次及原因]
对话测试用例规则:
- 每个轮次必须自然承接上一轮内容,不要编写可以独立存在的轮次
- Agent预期响应应描述行为,而非 exact wording(这些是给LLM评审的参考响应)
- 至少包含一个用户意图在对话过程中发生转移或扩展的场景
- 标记出"关键"轮次,即最可能失败的轮次(例如需要保留第1轮上下文的第3轮)
重要提示:对话测试集不支持CSV导入:和单响应测试用例不同,对话测试集无法通过CSV导入,用户必须通过以下方法在Copilot Studio中创建:
- 快速对话集 — 基于Agent描述和指令自动生成10条短对话
- 完整对话集 — 基于Agent知识源或主题生成对话(可短可长)
- 使用测试聊天 — 将最近的测试聊天会话转换为对话测试用例
- 手动录入 — 在UI中逐轮添加用户问题和Agent参考响应
本技能生成的对话测试用例作为规划蓝图,用户可以用来:
- 指导手动在Copilot Studio对话编辑器中录入内容
- 和AI生成的对话集对比检查覆盖度
- 在UI中创建测试集前提前记录重要的多轮场景
Output — Single Response
输出 — 单响应模式
Copilot Studio Test Set Table (displayed in conversation)
This is the primary output for single-response mode. Produce a markdown table matching the Copilot Studio test set format:
| # | Question | Expected Response | Test Method | Pass Score |
|---|---|---|---|---|
| 1 | [realistic user input] | [expected answer, or leave blank for General Quality] | General Quality | — |
| 2 | [realistic user input] | [expected answer for comparison] | Compare Meaning | 50 |
| 3 | [realistic user input] | [keywords to check] | Keyword Match (All) | — |
Map the evaluation methods from the plan (or from your analysis) to Copilot Studio’s 7 test methods using this selection guide:
Test Method Selection Guide
Use this table to select the correct test method based on what you need to verify about the response. When in doubt, match to the "What you’re testing" column first, not the method name.
| What you’re testing | Test Method | Expected Response format | Pass Score |
|---|---|---|---|
| Overall response quality, tone, helpfulness (LLM judge checks: relevance, groundedness, completeness, abstention) | General Quality | Leave blank — no expected response needed | — |
| Factual accuracy with flexible phrasing (meaning matters, wording doesn’t) | Compare Meaning | Write the ideal answer in natural language | 50 (default) |
| Factual accuracy with specific required facts, numbers, or policy language | Keyword Match (All) | Comma-separated list of keywords/phrases that MUST all appear | — |
| Partial keyword coverage (at least one key term should appear) | Keyword Match (Any) | Comma-separated list of keywords/phrases where ANY match is a pass | — |
| Correct tool/topic/connector invocation | Capability Use (shown as "Tool use" in UI) | Comma-separated list of expected tool or topic names | — |
| Classification, labeling, or structured output with exact expected values | Exact Match | The exact expected string (case-sensitive) | — |
| Response phrasing matters (not just meaning — specific wording preferred) | Text Similarity | Write the expected phrasing | 0.70 (default; scale 0–1) |
| Domain-specific criteria that don’t fit above methods | Custom | Write evaluation instructions and pass/fail label definitions | — |
How General Quality scoring works
General Quality is the default test method and the most commonly used. Understanding its 4 scoring criteria helps customers write better test cases and interpret results correctly. General Quality uses an LLM judge that scores each response on:
| Criterion | What it checks | A low score means | Tip for the customer |
|---|---|---|---|
| Relevance | Does the response directly address the user's question? Does it stay on topic? | The agent went off-topic, added irrelevant information, or answered a different question than what was asked | If relevance scores are low, check the agent's topic routing |
| Groundedness | Is the response based on the agent's knowledge sources? Does it avoid hallucinating unsupported claims? | The agent invented facts, cited sources it doesn't have, or generated plausible-sounding but unverified information | Low groundedness is the most dangerous failure. Check knowledge source coverage |
| Completeness | Does the response cover all aspects of the question with sufficient detail? | The agent gave a partial answer, missed key details, or was too brief to be useful | If completeness is low but relevance is high, the agent understands the question but its knowledge sources may have gaps |
| Abstention | Did the agent attempt to answer at all? (Abstaining on an answerable question is bad; abstaining on an unanswerable question is good) | The agent refused to answer a question it should have answered, OR answered a question it should have declined | Abstention failures often indicate overly restrictive or overly permissive system instructions |
All 4 criteria must be met for a response to score high. A response that is relevant, grounded, and complete but fails abstention (e.g., answering a question outside its scope) still gets a low score. This prevents agents from "helpfully" answering questions they shouldn't.
When NOT to rely on General Quality alone: If your scenario has a specific correct answer (a fact, a number, a policy), pair General Quality with Compare Meaning or Keyword Match. General Quality evaluates how well the agent responds, but without an expected answer it cannot catch factual errors where the agent sounds great but says the wrong thing.
How Custom test methods work
Custom is the newest test method in Copilot Studio. It lets you define your own evaluation criteria when none of the built-in methods fit — for example, compliance checks, tone audits, or domain-specific quality bars. Custom works for both single-response and conversation test sets.
A Custom test has two components you configure:
| Component | What it is | How to write it |
|---|---|---|
| Evaluation instructions | A prompt that tells the LLM judge what to look for in the agent's response | Be goal-oriented. Use bullet points and headings. Describe the specific qualities you want to assess. |
| Labels | Named categories the judge assigns to each response, each mapped to Pass or Fail | Use 2-3 labels. Each label has a name and a description of what qualifies. One label = Pass, the other = Fail. |
Example — HR Policy Compliance custom test:
- Evaluation instructions: "Evaluate the agent's response for HR policy compliance. Check: protects privacy, avoids discrimination or bias, provides safe HR-aligned guidance, does not give legal advice."
- Labels: (Pass) — "Response follows all HR policies and guidelines." /
Compliant(Fail) — "Response violates one or more HR policy requirements."Non-Compliant
When to use Custom:
- Domain-specific compliance or regulatory requirements the built-in methods cannot express
- Tone or style evaluation (e.g., "Is the response empathetic?" or "Does it match our brand voice?")
- Safety or guardrail checks beyond what General Quality's abstention criterion covers
- Classification accuracy where labels are complex or context-dependent (not a simple exact match)
Important — Custom is NOT available via CSV import. If your test plan includes Custom test cases, import the other test cases via CSV first, then add Custom test cases manually in the Copilot Studio UI. Note this in the docx report.
Advanced — Rubric-based grading (Copilot Studio Kit)
For customers using the Copilot Studio Kit, rubrics provide a structured 1–5 grading scale for Generative Answers test cases. Rubrics are more granular than the built-in test methods — they replace the standard validation logic with a custom AI grader that scores responses against domain-specific criteria.
Rubrics operate in two modes:
| Mode | Assignment level | AI output | Cost | Use when |
|---|---|---|---|---|
| Testing mode | Individual test case | Grade only (1–5) | Lower | Rubric is refined and trusted; routine QA and regression testing |
| Refinement mode | Entire test run | Grade + detailed rationale | Higher | Creating or improving a rubric; comparing AI vs. human grading to minimize misalignment |
Key workflow — the rubric refinement cycle: Rubric refinement is iterative, not a one-time configuration. The cycle is: Run → Review → Grade → Refine → Save → Re-run → Repeat. Expect several iterations before alignment is acceptable. The steps:
-
Start a refinement run — configure a test run with the rubric attached; the AI grades each Generative Answer test case on 1–5 and writes a detailed rationale.
-
Review in Standard view first — Standard refinement view hides AI grades so humans grade without bias. This is critical: research shows seeing AI grades anchors human judgment.
-
Human grading — assign a grade (1–5) and write reasoning for every test case. Reasoning must reference specific rubric criteria ("Grade 4: accurate technical info, professional tone, but lacks timeline estimates for Grade 5"). Vague reasoning ("Pretty good") undermines refinement.
-
Mark examples — flag select test cases as Good or Bad examples. Focus on misaligned cases and edge cases. Quality over quantity — a few well-chosen examples beat many mediocre ones.
-
Check alignment — switch to Full refinement view to compare AI vs. human grades. Alignment per test case:. Average across all graded cases for the run score.
alignment = 100% × (1 − |AI − Human| / 4)Average alignment Assessment Action 90–100% Excellent Rubric is reliable; move to testing mode 75–89% Good Mostly aligned; refine edge cases 60–74% Fair Needs improvement; focus on misalignment patterns < 60% Poor Significant refinement or rubric redesign needed -
Refine the rubric — click "Refine Rubric" to let AI update the rubric based on all human grades, reasoning, examples, and misalignment patterns. Use "Save As" for early iterations (preserve history), "Save" once stable.
-
Re-run — duplicate the test run with the refined rubric. Compare alignment to the prior iteration.
-
Iterate — repeat until alignment reaches 75–90%+. Don’t chase 100% — some subjectivity is inherent, and diminishing returns kick in around 85–90%.
Passing grade guidance:
- Grade 5 (default): Only exemplary responses pass — use for critical communications (IR reports, executive summaries)
- Grade 4: Strong or better responses pass — appropriate for most business communications
- Grade 3: Acceptable or better — minimum functional quality for internal tools
When to mention rubrics to customers: If the eval plan includes Custom test methods, domain-specific compliance checks, or tone/style evaluation, note in the docx report that the customer can create a rubric in Copilot Studio Kit for more granular, repeatable grading with human-AI alignment tracking. Rubrics are the next step beyond Custom for customers who need ongoing quality assurance with calibrated scoring. Include the alignment targets table above so they know what "good enough" looks like.
Tip — negative testing with Keyword Match: To verify an agent does NOT reveal forbidden content (e.g., internal policy details, competitor names, PII), use Keyword Match and list the keywords that should be absent. In the test case notes, flag this as a negative check so the reviewer knows a keyword "match" here means a failure.
Common method selection mistakes to avoid:
- Do NOT use General Quality when there IS a specific correct answer — use Compare Meaning or Keyword Match instead. General Quality has no expected response to compare against.
- Do NOT use Compare Meaning for factual claims with precise numbers/dates/names — use Keyword Match (All) to ensure the exact facts appear.
- Do NOT use Compare Meaning for adversarial/negative tests — there is often no "expected" answer. Use General Quality for adversarial handling and edge cases.
- Do NOT use General Quality for tool routing tests — use Capability Use to verify the correct connector/topic fired.
- Do NOT use Text Similarity when meaning is what matters — use Compare Meaning. Text Similarity penalizes paraphrasing.
- Do NOT use Exact Match for natural language responses — Exact Match fails on any variation in wording. Only use for labels, codes, or structured data.
Rules for inputs:
- Every must be a realistic input the agent would receive in production — specific, not a placeholder
Question - Every must be concrete and testable — never vague
Expected Response - For General Quality, leave Expected Response blank (the LLM judge evaluates without one)
- For Keyword Match, put the required keywords in Expected Response
- For adversarial cases, the Expected Response should describe what the agent should NOT do
Copilot Studio 测试集表格(对话中展示)
这是单响应模式的主要输出,生成符合Copilot Studio测试集格式的markdown表格:
| # | 问题 | 预期响应 | 测试方法 | 通过分数 |
|---|---|---|---|---|
| 1 | [真实用户输入] | [预期答案,使用General Quality时留空] | General Quality | — |
| 2 | [真实用户输入] | [用于对比的预期答案] | Compare Meaning | 50 |
| 3 | [真实用户输入] | [需要检查的关键词] | Keyword Match (All) | — |
使用以下选择指南将计划(或你的分析)中的评估方法映射到Copilot Studio的7种测试方法:
测试方法选择指南
根据你需要验证的响应特性选择正确的测试方法,不确定时优先匹配"测试目标"列,而非方法名称。
| 测试目标 | 测试方法 | 预期响应格式 | 通过分数 |
|---|---|---|---|
| 整体响应质量、语气、有用性(LLM评审检查:相关性、事实 groundedness、完整性、拒绝策略合理性) | General Quality | 留空,不需要预期响应 | — |
| 事实准确性,允许灵活表述(语义重要,措辞不重要) | Compare Meaning | 用自然语言编写理想答案 | 50(默认) |
| 事实准确性,要求包含特定事实、数字或政策原文 | Keyword Match (All) | 必须全部出现的关键词/短语的逗号分隔列表 | — |
| 部分关键词覆盖(至少出现一个关键词即可) | Keyword Match (Any) | 任意匹配即可通过的关键词/短语的逗号分隔列表 | — |
| 工具/主题/连接器调用是否正确 | Capability Use(UI中显示为"Tool use") | 预期的工具或主题名称的逗号分隔列表 | — |
| 分类、标签或结构化输出,有明确的预期值 | Exact Match | 精确的预期字符串(大小写敏感) | — |
| 响应措辞重要(不仅是语义,需要特定表述) | Text Similarity | 编写预期措辞 | 0.70(默认,范围0-1) |
| 不符合以上方法的领域专属标准 | Custom | 编写评估说明和通过/失败标签定义 | — |
General Quality 评分规则
General Quality是默认也是最常用的测试方法,理解其4个评分标准可以帮助用户编写更好的测试用例、正确解读结果。General Quality使用LLM评审从以下四个维度对每个响应打分:
| 维度 | 检查内容 | 低分意味着 | 给用户的建议 |
|---|---|---|---|
| 相关性 | 响应是否直接回应用户问题?是否跑题? | Agent跑题、添加无关信息、回答了其他问题 | 如果相关性得分低,检查Agent的主题路由逻辑 |
| Groundedness | 响应是否基于Agent的知识源?是否避免生成无依据的幻觉内容? | Agent虚构事实、引用不存在的来源、生成看似合理但未经验证的信息 | Groundedness低是最危险的故障,检查知识源覆盖范围 |
| 完整性 | 响应是否覆盖了问题的所有方面,细节足够? | Agent给出部分答案、遗漏关键细节、内容过短无法使用 | 如果完整性低但相关性高,说明Agent理解问题,但知识源可能存在缺口 |
| 拒绝策略合理性 | Agent是否尝试回答问题?(可回答的问题拒绝是坏的,不可回答的问题拒绝是好的) | Agent拒绝回答应该回答的问题,或者回答了应该拒绝的问题 | 拒绝策略故障通常意味着系统提示过于严格或过于宽松 |
四个维度全部符合才能获得高分,一个响应即使相关、基于事实、完整,但拒绝策略不合格(例如回答了超出范围的问题)仍然会得低分,避免Agent"过度热心"回答不该回答的问题。
不要仅依赖General Quality的场景:如果你的场景有明确的正确答案(事实、数字、政策),请将General Quality和Compare Meaning或Keyword Match结合使用。General Quality评估Agent响应的"好坏",但没有预期响应的话无法捕捉到Agent表述流畅但内容错误的事实性错误。
Custom测试方法规则
Custom是Copilot Studio中最新的测试方法,当内置方法都不适用时可以自定义评估标准,例如合规检查、语气审计、领域专属质量要求,同时支持单响应和对话测试集。
Custom测试包含两个可配置组件:
| 组件 | 说明 | 编写规则 |
|---|---|---|
| 评估说明 | 告诉LLM评审需要检查Agent响应的哪些维度的提示 | 以目标为导向,使用项目符号和标题,描述你要评估的具体特性 |
| 标签 | 评审给每个响应分配的命名类别,每个类别对应通过或失败 | 使用2-3个标签,每个标签包含名称和符合条件的描述,一个标签对应通过,其他对应失败 |
示例 — HR政策合规Custom测试:
- 评估说明:"评估Agent响应的HR政策合规性,检查:保护隐私、避免歧视或偏见、提供符合HR要求的安全指引、不提供法律建议。"
- 标签:(通过) — "响应符合所有HR政策和指引" /
合规(失败) — "响应违反一项或多项HR政策要求"。不合规
使用Custom的场景:
- 内置方法无法表达的领域专属合规或监管要求
- 语气或风格评估(例如"响应是否有同理心?"或"是否符合我们的品牌语气?")
- 超出General Quality拒绝策略覆盖范围的安全或护栏检查
- 标签复杂或依赖上下文的分类准确性评估(不是简单的精确匹配)
重要提示:Custom测试方法不支持CSV导入。如果你的测试计划包含Custom测试用例,先通过CSV导入其他测试用例,再在Copilot Studio UI中手动添加Custom测试用例,并在docx报告中注明该点。
高级功能 — 基于评分 rubric 的 grading(Copilot Studio Kit)
使用Copilot Studio Kit的用户可以使用评分rubric为生成式回答测试用例提供结构化的1-5分评分体系。Rubric比内置测试方法更细粒度,它用自定义AI评分器替代标准验证逻辑,根据领域专属标准对响应打分。
Rubric有两种运行模式:
| 模式 | 分配级别 | AI输出 | 成本 | 使用场景 |
|---|---|---|---|---|
| 测试模式 | 单个测试用例 | 仅分数(1-5) | 更低 | Rubric已优化且可信,用于日常QA和回归测试 |
| 优化模式 | 整个测试运行 | 分数+详细理由 | 更高 | 创建或优化rubric,对比AI和人工评分减少偏差 |
核心工作流 — rubric优化周期:Rubric优化是迭代过程,不是一次性配置,周期为:运行 → 审核 → 打分 → 优化 → 保存 → 重新运行 → 重复。通常需要多次迭代才能达到可接受的对齐度,步骤如下:
-
启动优化运行 — 配置绑定了rubric的测试运行,AI为每个生成式回答测试用例打1-5分并给出详细理由。
-
先在标准视图审核 — 标准优化视图隐藏AI分数,避免人工打分时产生偏见,这点非常重要:研究表明看到AI分数会锚定人类判断。
-
人工打分 — 为每个测试用例分配1-5分并填写理由,理由必须引用具体的rubric标准("4分:技术信息准确,语气专业,但缺少5分要求的时间线预估"),模糊的理由("还不错")会影响优化效果。
-
标记示例 — 将部分测试用例标记为好或坏的示例,重点关注对齐偏差的用例和边缘场景,重质不重量:几个精心挑选的示例比大量普通示例效果更好。
-
检查对齐度 — 切换到完整优化视图对比AI和人工分数,单测试用例对齐度:,所有已打分用例的平均值为本次运行的对齐度得分。
对齐度 = 100% × (1 − |AI分数 − 人工分数| / 4)平均对齐度 评估 行动 90–100% 优秀 Rubric可靠,切换到测试模式 75–89% 良好 基本对齐,优化边缘场景即可 60–74% 一般 需要改进,重点解决对齐偏差的模式 < 60% 差 需要大幅优化或重新设计rubric -
优化rubric — 点击"优化Rubric"让AI基于所有人工打分、理由、示例和对齐偏差模式更新rubric,早期迭代使用"另存为"保留历史,稳定后使用"保存"。
-
重新运行 — 用优化后的rubric复制测试运行,对比和上一次迭代的对齐度。
-
迭代 — 重复直到对齐度达到75-90%以上,不要追求100%:评分本身存在一定主观性,85-90%之后提升的边际收益会大幅下降。
通过分数指引:
- 5分(默认):仅 exemplary 响应通过,用于关键沟通场景(IR报告、高管摘要)
- 4分:良好及以上响应通过,适合大多数业务沟通场景
- 3分:可接受及以上,内部工具的最低功能质量要求
向用户提及rubric的场景: 如果评估计划包含Custom测试方法、领域专属合规检查或语气/风格评估,在docx报告中说明用户可以在Copilot Studio Kit中创建rubric,实现更细粒度、可重复的打分以及人AI对齐跟踪。对于需要持续质量保障和校准评分的用户,rubric是Custom之外的进阶方案,包含上述对齐度目标表让用户知道"足够好"的标准。
提示 — 使用Keyword Match做负面测试: 要验证Agent不会泄露禁止内容(例如内部政策细节、竞争对手名称、PII),可以使用Keyword Match列出不应出现的关键词,在测试用例备注中标记为负面检查,让审核者知道此处关键词"匹配"意味着测试失败。
需要避免的常见方法选择错误:
- 有明确正确答案时不要使用General Quality,请使用Compare Meaning或Keyword Match,General Quality没有预期响应可供对比。
- 对于包含精确数字/日期/名称的事实声明不要使用Compare Meaning,请使用**Keyword Match (All)**确保精确事实全部出现。
- 对抗/负面测试不要使用Compare Meaning,这类场景通常没有"预期"答案,请使用General Quality评估对抗处理和边缘场景表现。
- 工具路由测试不要使用General Quality,请使用Capability Use验证正确的连接器/主题被触发。
- 语义更重要的场景不要使用Text Similarity,请使用Compare Meaning,Text Similarity会惩罚同义改写。
- 自然语言响应不要使用Exact Match,措辞稍有不同Exact Match就会判定失败,仅用于标签、代码或结构化数据。
输入规则:
- 每个必须是Agent在生产环境会收到的真实输入,具体明确,不要用占位符
问题 - 每个必须具体可测试,不能模糊
预期响应 - 使用General Quality时预期响应留空(LLM评审不需要预期响应即可评估)
- 使用Keyword Match时将要求的关键词放在预期响应中
- 对抗场景的预期响应应描述Agent不应该做的行为
Output files
输出文件
After displaying the test cases in conversation, generate the output files:
A. Copilot Studio Import CSV (.csv) — Single Response only
Generate one CSV file for import into Copilot Studio. The CSV uses Copilot Studio’s 3-column import format, which supports specifying the test method per row — so all test cases go in a single file.
Note: CSV import is only available for single-response test sets. For conversation test sets, there is no CSV import — see Step 2b for how the customer creates conversation test cases in the Copilot Studio UI.
CSV format: Use the skill to write the file, or write CSV directly with proper quoting:
/xlsxcsv
"Question","Expected response","Testing method"
"How do I return an item?","The agent should explain the return policy...","Compare meaning"
"What are your hours?","","General quality"
"Can I get a refund after 90 days?","90 days, refund policy, exception","Keyword match"
"What is the order status for #12345?","Order #12345 is in transit","Exact match"CSV column rules:
- Three columns, in this exact order: ,
Question,Expected responseTesting method - Every value must be enclosed in double quotes
- Any double quotes inside a value must be escaped as
"" - maps to the Question column from the test case table
Question - maps to the Expected Response column (leave empty string
Expected responsefor General Quality rows)"" - must use one of these exact values (case-sensitive as shown):
Testing methodGeneral qualityCompare meaning- (this is Text Similarity — note: the CSV value is "Similarity", not "Text Similarity")
Similarity Exact matchKeyword match
Methods NOT available via CSV import: Capability Use (labeled "Tool use" in UI) and Custom test methods cannot be specified in the import CSV. If your test plan includes these methods:
- Import all other test cases via the CSV
- In the Copilot Studio UI, manually add test cases that use Capability Use or Custom
- Note this in the docx report so the customer knows which cases to add manually
After generating the CSV, display a summary showing the method distribution:
| Testing Method | # Test Cases | Notes |
|---|---|---|
| General quality | 3 | No expected response needed |
| Compare meaning | 4 | Set pass score in UI after import (default: 50) |
| Keyword match | 2 | — |
| Capability Use | 1 | Add manually in UI (not CSV-importable) |
This summary helps the customer verify the CSV and know what to configure manually in Copilot Studio.
Important: Pass scores for Compare meaning and Similarity are NOT set in the CSV — they are configured in the Copilot Studio UI after import. Note the recommended pass scores in the summary table and docx report.
Operational tips for the customer:
- Download the CSV template: In Copilot Studio, after selecting New evaluation, you can download a ready-to-use CSV template under Data source. Use this template to verify your column format before importing.
- Copilot Studio's built-in test generation — alternatives to CSV import: Besides importing the CSV this skill generates, customers can also create test sets directly in Copilot Studio using these methods (all under New evaluation → Single response):
- Quick question set — auto-generates 10 questions from the agent's description, instructions, and capabilities. Good for a fast baseline or to seed a larger test set.
- Full question set — generates questions from a specific knowledge source (text, Word, Excel — up to 5 MB) or from topics. Best for agents using generative orchestration (knowledge) or classic orchestration (topics). You choose how many questions to generate.
- Use your test chat — converts the latest test chat session into test cases. Useful after exploratory testing — just replay the questions you already asked.
- Create from analytics themes — uses themes (preview) from production conversations to generate test cases focused on one area (e.g., "billing and payments" questions). Requires the themes analytics feature to be enabled.
- Manual entry — write questions yourself directly in the UI. Recommend combining approaches: use this skill's CSV for structured, scenario-based test cases, then supplement with Copilot Studio's auto-generation to catch gaps the plan didn't anticipate. Note these options in the docx report.
- Test set limits: A single test set supports up to 100 test cases. Each question can be up to 1,000 characters (including spaces). If your eval plan exceeds 100 scenarios, split into multiple test sets by category or quality signal.
- 89-day result retention: Test results are only available in Copilot Studio for 89 days. After that, they are deleted. Always export results to CSV immediately after running evals — especially for baseline runs you will compare against later. Use the export option under test results to save a permanent copy.
- Version your test sets like code: Test sets are living artifacts — they change as the agent evolves, new failure modes appear, and business requirements shift. Treat them with the same rigor as source code:
- Keep a changelog: Every time you add, remove, or modify a test case, record what changed and why. A shared spreadsheet or version-controlled CSV works. Without this, you will not know whether a score change came from the agent improving or the test set shifting underneath it.
- Tag baselines: Before a major agent change (new knowledge source, updated system prompt, new tool), snapshot the current test set as a named baseline (e.g., "v2.1 — pre-knowledge-update"). Run the new agent version against the old baseline first, then against an updated test set. This separates "agent changed" from "test changed."
- Retire, don’t delete: When a test case becomes irrelevant (deprecated feature, changed policy), move it to a "retired" section instead of deleting it. You may need it for regression testing if the feature returns or the policy reverts.
- Track provenance: For each test case, note where it came from — was it generated from the eval plan, added from a production incident, suggested by a subject matter expert, or auto-generated by Copilot Studio? Provenance helps you prioritize: real-failure-sourced cases are higher signal than synthetic ones.
- Review cadence: Revisit the full test set quarterly (or after any major agent change). Stale expected responses are the #1 cause of false failures over time — the agent got better but the expected response still reflects the old behavior.
- User profiles — simulate different user experiences: Copilot Studio lets you assign a user profile to a test set so evals run under that user's authentication. This is critical when knowledge sources or SharePoint sites are role-gated — a director and an intern may get different answers from the same agent. To test this:
- Create separate test sets per profile (e.g., "Core Scenarios — Director" and "Core Scenarios — Intern")
- In each test set, select Manage profile and choose the appropriate account
- Compare results across profiles to find role-based gaps
- Limitation: Multi-profile evaluation is only supported for agents without connector dependencies. If the agent uses tools/connectors that require authentication, the eval must run under the logged-in account that owns those connections — selecting a different profile will fail with "This account cannot connect to tools". In that case, test knowledge-source differences with user profiles, but test tool-dependent scenarios under the tool owner's account.
- Note which profile was used in the docx report — test results in Copilot Studio show the profile, but exported CSVs may not.
- GCC (Government Community Cloud) limitations: If the customer is in a GCC environment, two features are unavailable:
- No user profiles — the test-set profile feature described above is not available in GCC. All evals run under the maker’s own account.
- No Text Similarity method — the “Similarity” test method cannot be used. Replace any Text Similarity test cases with Compare Meaning (for semantic matching) or Keyword Match (for specific phrasing). All other test methods work normally.
- Ask the customer early: “Are you in a GCC environment?” If yes, omit Text Similarity rows from the CSV and note the restriction in the docx report. Source: About agent evaluation.
B. Eval Test Set Report (.docx)
Generate a formatted .docx document for human review using the skill. Contents:
/docx- Title: "Eval Test Set: [Agent Name]" (derive agent name from the plan or description)
- Evaluation mode: State whether this is a Single Response or Conversation test set, and why that mode was chosen
- Plan summary: Brief summary of the eval suite plan this was generated from (or the agent description if generated from scratch)
- Test cases: Each test case formatted clearly with:
- Scenario name
- Input / question (for single-response) or full conversation turns (for conversation mode)
- Expected response or reference responses
- Test method
- Pass score (if applicable)
- Method rationale: Include the Why These Methods? section from Step 3 so the customer understands the reasoning behind each test method choice
- Import guide:
- For single-response: Document the CSV column format (Question, Expected response, Testing method), list which test methods are CSV-importable vs. UI-only, and note pass scores to configure in the UI after import
- For conversation: Explain that CSV import is not available, list the four creation methods (Quick set, Full set, Test chat, Manual entry), and recommend using the conversation test cases in this report as a blueprint for manual entry
- Human review checkpoints: Include the full checkpoint table from Step 4 so the customer has a printed checklist to work through before running evals
- Reviewer notes: The scenario suggestion and mandatory reminder from Step 4
- Operational reminders: Include the 89-day result retention warning (export results immediately after running), the 100-test-case limit per test set, and the CSV template download tip
Important — save your plan before running evals: Copilot Studio CSV exports do not include scenario categories or tags. Before running evals in Copilot Studio, save the scenario plan table from as a reference document. You will need it when interpreting results with .
/eval-suite-planner/eval-result-interpreter在对话中展示测试用例后,生成以下输出文件:
A. Copilot Studio导入CSV (.csv) — 仅单响应模式支持
生成一个可导入Copilot Studio的CSV文件,CSV使用Copilot Studio的3列导入格式,支持每行指定测试方法,所有测试用例放在同一个文件中。
注意:CSV导入仅支持单响应测试集,对话测试集不支持CSV导入,参考步骤2b了解用户如何在Copilot Studio UI中创建对话测试用例。
CSV格式:使用技能编写文件,或者直接编写符合引号规则的CSV:
/xlsxcsv
"Question","Expected response","Testing method"
"How do I return an item?","The agent should explain the return policy...","Compare meaning"
"What are your hours?","","General quality"
"Can I get a refund after 90 days?","90 days, refund policy, exception","Keyword match"
"What is the order status for #12345?","Order #12345 is in transit","Exact match"CSV列规则:
- 三列,顺序严格为:、
Question、Expected responseTesting method - 每个值必须用双引号包裹
- 值内部的双引号需要转义为
"" - 对应测试用例表中的问题列
Question - 对应预期响应列(General Quality行留空字符串
Expected response)"" - 必须使用以下精确值(大小写敏感):
Testing methodGeneral qualityCompare meaning- (即Text Similarity,注意CSV值是"Similarity"而非"Text Similarity")
Similarity Exact matchKeyword match
不支持CSV导入的方法:Capability Use(UI中标记为"Tool use")和Custom测试方法无法在导入CSV中指定,如果测试计划包含这些方法:
- 先通过CSV导入所有其他测试用例
- 在Copilot Studio UI中手动添加使用Capability Use或Custom的测试用例
- 在docx报告中注明该点,让用户知道哪些用例需要手动添加
生成CSV后,展示测试方法分布的汇总表:
| 测试方法 | 测试用例数量 | 备注 |
|---|---|---|
| General quality | 3 | 不需要预期响应 |
| Compare meaning | 4 | 导入后在UI中设置通过分数(默认:50) |
| Keyword match | 2 | — |
| Capability Use | 1 | 在UI中手动添加(不支持CSV导入) |
该汇总表帮助用户验证CSV,了解需要在Copilot Studio中手动配置的内容。
重要提示:Compare meaning和Similarity的通过分数不在CSV中设置,导入后在Copilot Studio UI中配置,在汇总表和docx报告中注明推荐的通过分数。
给用户的运营建议:
- 下载CSV模板:在Copilot Studio中选择新建评估后,可以在数据源下下载现成的CSV模板,导入前使用该模板验证列格式是否正确。
- Copilot Studio内置测试生成 — CSV导入的替代方案:除了导入本技能生成的CSV外,用户还可以通过以下方法直接在Copilot Studio中创建测试集(都在新建评估 → 单响应下):
- 快速问题集 — 基于Agent描述、指令和能力自动生成10个问题,适合快速搭建基线或扩充测试集
- 完整问题集 — 基于特定知识源(文本、Word、Excel,最大5MB)或主题生成问题,适合使用生成式编排(知识)或经典编排(主题)的Agent,可自定义生成问题数量
- 使用测试聊天 — 将最近的测试聊天会话转换为测试用例,探索性测试后非常有用,直接复用已经问过的问题即可
- 从分析主题创建 — 使用生产会话的主题(预览版)生成聚焦特定领域的测试用例(例如"账单和支付"问题),需要启用主题分析功能
- 手动录入 — 直接在UI中编写问题 推荐结合多种方法:使用本技能生成的CSV获取结构化的场景化测试用例,再补充Copilot Studio自动生成的用例覆盖计划未考虑到的缺口,在docx报告中注明这些选项。
- 测试集限制:单个测试集最多支持100个测试用例,每个问题最多1000字符(含空格)。如果评估计划超过100个场景,按类别或质量信号拆分为多个测试集。
- 89天结果保留期:Copilot Studio中仅保留最近89天的测试结果,到期自动删除。运行评估后请立即将结果导出为CSV,尤其是后续需要对比的基线运行结果,使用测试结果下的导出选项保存永久副本。
- 像代码一样版本管理测试集:测试集是活的 artifacts,会随着Agent迭代、新故障模式出现、业务需求变化而变更,用和源代码一样的严谨度管理它们:
- 保留变更日志:每次添加、删除或修改测试用例时,记录变更内容和原因,共享电子表格或版本控制的CSV都可以,没有变更日志的话你无法知道分数变化是因为Agent优化了还是测试集本身变了。
- 标记基线:重大Agent变更前(新知识源、更新系统提示、新工具),将当前测试集快照命名为基线(例如"v2.1 — 知识更新前"),先将新版本Agent在旧基线上运行,再在更新后的测试集上运行,区分"Agent变更"和"测试变更"的影响。
- 归档而非删除:当测试用例不再相关时(功能下线、政策变更),移动到"归档"分区而非删除,如果功能恢复或政策回滚你可能还需要它做回归测试。
- 跟踪来源:每个测试用例注明来源:是从评估计划生成、来自生产故障、由领域专家建议还是Copilot Studio自动生成?来源可以帮助你区分优先级:来自真实故障的用例比合成用例的信号价值更高。
- 定期回顾:每季度(或任何重大Agent变更后)回顾完整测试集,过期的预期响应是长期以来虚假失败的首要原因:Agent变好了但预期响应还是旧的。
- 用户配置文件 — 模拟不同用户体验:Copilot Studio允许为测试集分配用户配置文件,评估将以该用户的身份运行,当知识源或SharePoint站点有角色权限限制时这点非常重要:主管和实习生从同一个Agent可能得到不同的答案。测试方法:
- 为每个角色创建单独的测试集(例如"核心场景 — 主管"和"核心场景 — 实习生")
- 在每个测试集中选择管理配置文件,选择对应的账号
- 对比不同配置文件的结果发现角色相关的能力缺口
- 限制:多配置文件评估仅支持没有连接器依赖的Agent,如果Agent使用需要认证的工具/连接器,评估必须以拥有这些连接的登录账号运行,选择其他配置文件会报错*"该账号无法连接到工具"*。这种情况下,使用用户配置文件测试知识源差异,工具相关场景在工具所有者账号下测试。
- 在docx报告中注明使用的配置文件:Copilot Studio中的测试结果会显示配置文件,但导出的CSV可能不包含该信息。
- GCC(政府社区云)限制:如果用户使用GCC环境,两个功能不可用:
- 无用户配置文件:上述测试集配置文件功能在GCC中不可用,所有评估以制作者自己的账号运行。
- 无Text Similarity方法:无法使用"Similarity"测试方法,将所有Text Similarity测试用例替换为Compare Meaning(语义匹配)或Keyword Match(特定措辞),其他测试方法正常使用。 提前询问用户:"你是否在使用GCC环境?"如果是,从CSV中删除Text Similarity行,并在docx报告中注明该限制。来源:Agent评估简介。
B. 评估测试集报告 (.docx)
使用技能生成格式化的docx文档供人工审核,内容包括:
/docx- 标题:"评估测试集:[Agent名称]"(从计划或描述中提取Agent名称)
- 评估模式:说明是单响应还是对话测试集,以及选择该模式的原因
- 计划摘要:生成测试用例所基于的评估套件计划的简要摘要(如果从零生成则为Agent描述)
- 测试用例:每个测试用例清晰格式化,包含:
- 场景名称
- 输入/问题(单响应模式)或完整对话轮次(对话模式)
- 预期响应或参考响应
- 测试方法
- 通过分数(如果有)
- 方法理由:包含步骤3的"为什么选择这些方法?"部分,让用户理解每个测试方法选择背后的逻辑
- 导入指南:
- 单响应模式:说明CSV列格式(Question、Expected response、Testing method),列出哪些测试方法支持CSV导入、哪些仅支持UI添加,注明导入后需要在UI中配置的通过分数
- 对话模式:说明不支持CSV导入,列出四种创建方法(快速集、完整集、测试聊天、手动录入),建议使用本报告中的对话测试用例作为手动录入的蓝图
- 人工审核检查点:包含步骤4的完整检查点表格,让用户运行评估前有打印版checklist可用
- 审核者备注:步骤4的场景建议和强制提醒
- 运营提醒:包含89天结果保留警告(运行后立即导出结果)、单个测试集100用例限制、CSV模板下载提示
重要提示 — 运行评估前保存计划:Copilot Studio CSV导出不包含场景类别或标签,在Copilot Studio中运行评估前,保存生成的场景计划表作为参考文档,后续使用解读结果时需要用到。
/eval-suite-planner/eval-result-interpreterStep 3 — Method rationale (teach the WHY)
步骤3 — 方法理由(解释为什么)
After the test case table, include a Why These Methods? section that explains the reasoning behind the test method selections. This is critical — the table alone shows WHAT was chosen, but the customer needs to understand WHY so they can adjust methods for future test cases on their own.
For each unique test method used in the table, write 1-2 sentences explaining:
- What property of the response this method actually measures
- Why that property matters for the specific scenarios it was assigned to
- What would go wrong if a different method were used instead
Example format:
Compare Meaning (used for scenarios #1, #3, #5): These scenarios have a specific correct answer, but the exact wording does not matter — only the meaning. Compare Meaning uses an LLM to judge semantic equivalence, so the agent can phrase the answer differently and still pass. If you used Keyword Match instead, valid paraphrases would fail. If you used General Quality, there would be no expected answer to compare against, so factual errors would go undetected.General Quality (used for scenarios #2, #7): These are adversarial and edge-case scenarios where there is no single correct answer — the agent just needs to respond helpfully and safely. General Quality uses an LLM judge to evaluate relevance, groundedness, completeness, and appropriate abstention without needing an expected response.Keyword Match (All) (used for scenario #4): This scenario requires specific facts (policy numbers, dates, dollar amounts) that must appear verbatim. Compare Meaning might accept a response that captures the gist but drops a critical number. Keyword Match ensures every required fact is present.
Adapt the rationale to the actual scenarios and methods in the test set. If only one or two methods are used, explain why those methods are sufficient for this agent eval needs.
For conversation test sets, also explain:
- Why conversation mode was chosen over single-response (what multi-turn behavior is being tested)
- Why the available conversation methods (General Quality, Keyword Match, Capability Use, Custom) are sufficient — or flag if the customer should also create a complementary single-response test set for scenarios that need Compare Meaning, Exact Match, or Text Similarity
测试用例表后添加**为什么选择这些方法?**部分,解释测试方法选择背后的逻辑,这点非常重要:表格仅展示选择了什么,但用户需要理解为什么这么选,才能后续自行调整测试方法。
对于表格中用到的每个唯一测试方法,写1-2句话解释:
- 该方法实际测量响应的什么属性
- 为什么该属性对所分配的场景很重要
- 如果使用其他方法会出现什么问题
示例格式:
Compare Meaning(用于场景#1、#3、#5):这些场景有明确的正确答案,但精确措辞不重要,只有语义重要。Compare Meaning使用LLM判断语义等价性,Agent可以用不同措辞表述答案仍然可以通过。如果使用Keyword Match,有效的同义改写会被判定失败;如果使用General Quality,没有预期响应可供对比,事实错误会被遗漏。General Quality(用于场景#2、#7):这些是对抗和边缘场景,没有单一正确答案,Agent只需要做出有用且安全的响应即可。General Quality使用LLM评审评估相关性、groundedness、完整性和合理的拒绝策略,不需要预期响应。Keyword Match (All)(用于场景#4):该场景需要特定事实(政策编号、日期、金额)必须逐字出现,Compare Meaning可能会接受抓住要点但遗漏关键数字的响应,Keyword Match可以确保所有要求的事实都存在。
根据测试集中实际的场景和方法调整理由,如果只用到1-2种方法,解释为什么这些方法足以满足该Agent的评估需求。
对话测试集还需要额外解释:
- 为什么选择对话模式而非单响应模式(测试的是什么多轮行为)
- 为什么可用的对话方法(General Quality、Keyword Match、Capability Use、Custom)足够,或者说明用户是否需要额外创建互补的单响应测试集覆盖需要Compare Meaning、Exact Match或Text Similarity的场景
Step 4 — Human review checkpoints
步骤4 — 人工审核检查点
After the table and before generating the output files, display a Human Review Required section with these checkpoints. These flag specific decisions that require human judgment — the skill accelerates the work, but a human must validate the output before use.
Human Review Required
| # | Checkpoint | Why it matters |
|---|---|---|
| 1 | Are the test inputs realistic? Review every Question. Delete or rewrite any that would not occur in production. AI-generated inputs tend to be too clean — add typos, abbreviations, or ambiguity that real users would include. | Unrealistic inputs produce passing evals that do not predict production performance. |
| 2 | Are the expected responses correct? Verify every Expected Response against actual agent knowledge sources. A wrong expected response will cause a correct agent answer to fail. | This is the #1 source of false failures in eval results. |
| 3 | Is each test method appropriate? Check the method assigned to each row. Common mistakes: using Compare Meaning when exact facts matter (use Keyword Match), using General Quality when there IS a correct answer. | Wrong method means misleading pass/fail signals. See the selection guide above. |
| 4 | Are pass scores reasonable? Compare Meaning default is 50, Text Similarity default is 0.70. Adjust based on how much variation is acceptable for this agent. | Too strict = false failures. Too lenient = missed quality issues. |
| 5 | Missing scenarios? Are there realistic user inputs this test set does not cover? Think about your most common support tickets, edge cases unique to your business, and any compliance requirements. | Eval coverage gaps mean untested production paths. |
| 6 | Negative test coverage: For adversarial and out-of-scope test cases, verify the expected behavior matches your organization’s policy (e.g., should the agent refuse, redirect, or escalate?). | Policy alignment cannot be inferred — it must be specified by a human. |
| 7 | (Conversation mode) Are the turn sequences realistic? Review multi-turn conversations for natural flow. Do users actually ask these follow-ups in this order? Are there turns that should be added or removed? | Artificial conversation sequences test capabilities users never exercise. |
| 8 | (Conversation mode) Is the right evaluation mode selected? Confirm that conversation mode is the right choice. If the agent mostly handles standalone questions, single-response may give better signal. Consider creating both types. | Wrong mode means either missing multi-turn failures or over-constraining simple Q&A tests. |
After the checkpoints, add:
- One more scenario to consider: Describe an additional scenario worth adding manually — something that did not fit but is realistic.
- Mandatory reminder: "This test set was AI-generated and must be reviewed by someone who knows the agent’s domain before use. No eval should run without human validation of inputs, expected responses, and method choices."
表格之后、生成输出文件前,展示需要人工审核部分,列出以下检查点,标记需要人工判断的特定决策:本技能可以加速工作,但使用前必须人工验证输出。
需要人工审核
| # | 检查点 | 重要性 |
|---|---|---|
| 1 | 测试输入是否真实? 审核每个问题,删除或重写任何生产环境不会出现的输入,AI生成的输入通常过于规范,添加真实用户会出现的拼写错误、缩写、歧义。 | 不真实的输入会得到通过的评估结果,但无法反映生产性能。 |
| 2 | 预期响应是否正确? 对照Agent实际知识源验证每个预期响应,错误的预期响应会导致正确的Agent答案被判定失败。 | 这是评估结果虚假失败的首要来源。 |
| 3 | 每个测试方法是否合适? 检查每行分配的方法,常见错误:精确事实重要的场景使用Compare Meaning(应该用Keyword Match)、有正确答案的场景使用General Quality。 | 错误的方法会产生误导性的通过/失败信号,参考上方的选择指南。 |
| 4 | 通过分数是否合理? Compare Meaning默认50分,Text Similarity默认0.70,根据该Agent可接受的变化程度调整。 | 过于严格=虚假失败,过于宽松=遗漏质量问题。 |
| 5 | 是否遗漏场景? 是否有该测试集未覆盖的真实用户输入?考虑你最常见的支持工单、业务专属的边缘场景、任何合规要求。 | 评估覆盖缺口意味着生产路径未被测试。 |
| 6 | 负面测试覆盖: 对于对抗和超出范围的测试用例,验证预期行为符合组织政策(例如Agent应该拒绝、重定向还是升级?)。 | 政策对齐无法自动推断,必须由人工指定。 |
| 7 | (对话模式)轮次序列是否真实? 审核多轮对话的自然流畅度,用户真的会按这个顺序问这些follow-up问题吗?是否需要添加或删除轮次? | 人工构造的对话序列测试的是用户永远不会用到的能力。 |
| 8 | (对话模式)评估模式选择是否正确? 确认对话模式是正确选择,如果Agent主要处理独立问题,单响应模式可能提供更好的信号,考虑同时创建两种测试集。 | 错误的模式要么会遗漏多轮故障,要么会过度限制简单问答测试。 |
检查点之后添加:
- 值得考虑的额外场景:描述一个值得手动添加的额外场景,是本次生成未覆盖但很真实的场景。
- 强制提醒:"本测试集由AI生成,使用前必须由了解Agent领域的人员审核,任何评估运行前都需要人工验证输入、预期响应和方法选择。"
Behavior rules
行为规则
- Each case must be independently understandable — no references to "the previous case"
- When using a plan, generate exactly the scenarios listed — do not add or remove scenarios without saying why
- The Copilot Studio table (single-response) or conversation blocks (conversation mode) is the primary output displayed in conversation; the CSV (single-response only) and docx report are generated afterward
- Make inputs realistic and specific: use names, dates, product references, and context that a real user would provide
- The CSV must be valid and importable into Copilot Studio without manual editing
- For conversation mode, explicitly recommend whether the customer should also create a complementary single-response test set
- 每个用例必须独立可理解,不要引用"上一个用例"
- 使用计划生成时,严格生成列出的场景,未说明原因不要新增或删除场景
- 对话中优先展示Copilot Studio表格(单响应)或对话块(对话模式),之后再生成CSV(仅单响应)和docx报告
- 输入要真实具体:使用真实用户会提供的名称、日期、产品参考和上下文
- CSV必须有效,无需手动编辑即可导入Copilot Studio
- 对话模式下明确建议用户是否需要额外创建互补的单响应测试集
Example invocations
调用示例
/eval-suite-planner I am building a customer support agent that handles refund requests...
[planner outputs scenario plan table]
/eval-generator
<- generates from the plan above, one case per scenario row (single-response mode)
<- outputs single CSV with Question/Expected response/Testing method columns
/eval-generator I am building a meeting notes agent that takes a raw transcript and produces a structured summary with action items.
<- generates from scratch, 6-8 cases, single CSV with per-row test methods
/eval-generator I am building a travel booking agent that helps users search flights, select seats, and complete purchases across multiple conversation turns.
<- detects multi-turn behavior, generates 4-6 conversation test cases
<- outputs conversation planning blueprint (no CSV — conversation test sets are created in UI)
<- recommends complementary single-response test set for standalone queries
/eval-generator
<- no plan in conversation, no description provided — asks user to provide input/eval-suite-planner I am building a customer support agent that handles refund requests...
[planner outputs scenario plan table]
/eval-generator
<- 基于上述计划生成,每个场景行对应一个用例(单响应模式)
<- 输出包含Question/Expected response/Testing method列的单个CSV
/eval-generator I am building a meeting notes agent that takes a raw transcript and produces a structured summary with action items.
<- 从零开始生成6-8个用例,输出每行指定测试方法的单个CSV
/eval-generator I am building a travel booking agent that helps users search flights, select seats, and complete purchases across multiple conversation turns.
<- 检测到多轮行为,生成4-6个对话测试用例
<- 输出对话规划蓝图(无CSV — 对话测试集在UI中创建)
<- 建议为独立查询创建互补的单响应测试集
/eval-generator
<- 对话中无计划,也未提供描述 — 要求用户提供输入