eval-suite-planner
Purpose
This skill takes a plain-English description of an agent and produces a structured eval suite plan. It is the first step in the eval lifecycle — use it before generating test cases or running any evals. The output tells you exactly what scenarios to build, which evaluation methods to use, and how to know when you're done.
This skill covers Stage 1 (Define) of the MS Learn 4-stage evaluation framework. After planning, use `/eval-generator` for Stage 2 (Set Baseline & Iterate), then expand coverage (Stage 3) and operationalize into CI/CD (Stage 4).
Knowledge sources: This skill's guidance is grounded in three Microsoft sources:
- Eval Scenario Library (github.com/microsoft/ai-agent-eval-scenario-library) — 5 business-problem scenario types with 29 sub-scenarios, 9 capability scenario types with 49 sub-scenarios, quality signals, and evaluation method selection
- MS Learn agent evaluation documentation — the 4-stage iterative evaluation framework (Define, Set Baseline & Iterate, Systematic Expansion, Operationalize), 7 test methods, acceptance criteria design, and evaluation categories
- MS Learn evaluation checklist (guidance/evaluation-checklist) — a 4-stage checklist template with a downloadable editable version. The checklist defines Stage 3 expansion categories (Foundational core, Agent robustness, Architecture test, Edge cases) and introduces acceptance criteria design
Instructions
When invoked as `/eval-suite-planner <agent description>`, read the description, infer the agent's primary task, key capabilities, and failure modes, then produce the following output in this exact order. Do not ask clarifying questions, do not pad responses, do not hedge.
Step 0 — Match the agent to scenario types
Use this routing table (from the Eval Scenario Library's Entry Path A) to identify which business-problem and capability scenario types apply to the described agent:
| If the agent... | Business-problem scenarios | Capability scenarios |
|---|---|---|
| Answers questions from knowledge sources | Information Retrieval (6 sub-scenarios) | Knowledge Grounding + Compliance |
| Executes tasks via APIs/connectors | Request Submission (6 sub-scenarios) | Tool Invocations + Safety |
| Walks users through troubleshooting | Troubleshooting (6 sub-scenarios) | Knowledge Grounding + Graceful Failure |
| Guides through multi-step processes | Process Navigation (6 sub-scenarios) | Trigger Routing + Tone & Quality |
| Routes conversations to teams/departments | Triage & Routing (5 sub-scenarios) | Trigger Routing + Graceful Failure |
| Handles sensitive data (PII, financial, health) | (add to whichever applies) | Safety + Compliance |
| Serves external customers | (add to whichever applies) | Tone & Quality + Safety |
| Is about to be updated or republished | (add to whichever applies) | Regression — re-run existing tests after changes |
| All agents (always include) | — | Red-Teaming — adversarial robustness testing |
Most agents match 1-2 business-problem types and 3-4 capability types. Select the ones that fit and name them explicitly.
About the Regression row: A regression set is not a separate scenario type — it is your existing suite of passing tests, re-run after any agent change to verify nothing broke. Include the regression row when the customer mentions upcoming changes (prompt edits, knowledge source updates, connector/plugin changes, republishing). When it applies:
- Flag that the customer's current passing test cases become their regression baseline
- Recommend re-running the full eval suite (or at minimum the core business + safety subsets) after every change
- This maps to Stage 4 (Operationalize) of the MS Learn framework — embedding evals into the agent's update workflow so regressions are caught before they reach users
- The regression set grows over time: every bug found and fixed should add a test case that catches that specific failure, preventing it from recurring
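The routing above is mechanical enough to sketch in code. A minimal illustration, assuming simplified trait keys and a hypothetical `select_scenarios` helper (neither is part of the Scenario Library):

```python
# Hypothetical sketch of the Step 0 routing table (Entry Path A).
# Trait keys and select_scenarios are illustrative, not a real API.
ROUTING = {
    "answers_from_knowledge": ("Information Retrieval", ["Knowledge Grounding", "Compliance"]),
    "executes_via_apis":      ("Request Submission",    ["Tool Invocations", "Safety"]),
    "troubleshooting":        ("Troubleshooting",       ["Knowledge Grounding", "Graceful Failure"]),
    "multi_step_process":     ("Process Navigation",    ["Trigger Routing", "Tone & Quality"]),
    "routes_to_teams":        ("Triage & Routing",      ["Trigger Routing", "Graceful Failure"]),
    "sensitive_data":         (None,                    ["Safety", "Compliance"]),
    "external_customers":     (None,                    ["Tone & Quality", "Safety"]),
    "being_updated":          (None,                    ["Regression"]),
}

def select_scenarios(traits):
    """Return (business-problem types, capability types) for an agent's traits."""
    business, capabilities = [], []
    for trait in traits:
        biz, caps = ROUTING[trait]
        if biz:
            business.append(biz)
        capabilities.extend(c for c in caps if c not in capabilities)
    # Red-Teaming always applies, per the last row of the table.
    if "Red-Teaming" not in capabilities:
        capabilities.append("Red-Teaming")
    return business, capabilities

biz, caps = select_scenarios(["answers_from_knowledge", "sensitive_data"])
```

Note how the "add to whichever applies" rows contribute only capability types, matching the table.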
Step 0b — Incorporate production data (for live agents)
If the customer already has an agent in production, recommend supplementing the synthetic eval plan with real user data. Copilot Studio offers two production-data features that create higher-fidelity test sets:
Themes-based test sets (recommended for production agents):
Copilot Studio's Analytics page groups real user questions into themes — clusters of related questions that triggered generative answers (e.g., "billing and payments", "password reset", "shipping status"). Customers can create a test set directly from any theme:
- Go to the agent's Analytics page > Themes list
- Hover over a theme > select Evaluate
- Select Create and open to generate a test set from real user questions in that theme
When to recommend themes-based test sets:
- The agent is already in production with real user traffic
- The customer wants to track quality per topic area (e.g., "How well does the agent handle billing questions specifically?")
- The synthetic plan may miss question phrasings real users actually use
- The customer is investigating a quality drop in a specific area flagged by analytics
How themes complement the eval plan: The scenario plan (Step 0) defines WHAT to test based on the agent's design. Themes-based test sets validate HOW WELL the agent handles what users actually ask. Use both:
- Synthetic test cases (from `/eval-generator`) = structured coverage of planned scenarios, edge cases, and adversarial tests
- Themes-based test sets (from production) = real-world phrasing, actual user patterns, production-frequency weighting
Tell the customer: "Your eval plan covers what the agent SHOULD handle. Themes-based test sets cover what users ACTUALLY ask. The gap between these two is where surprises live."
Prerequisite: Themes require the themes (preview) feature to be available in the customer's environment. Check prerequisites before recommending.
Production data import: Customers can also import real user conversations as test cases directly. This is useful for reproducing specific reported issues or building regression tests from support tickets.
Step 0c — Plan test set creation strategy
Copilot Studio offers multiple ways to create single-response test sets beyond CSV import. During planning, make the customer aware of all their options so they can combine approaches for best coverage:
| Creation method | What it does | When to recommend |
|---|---|---|
| CSV import (from `/eval-generator`) | Import structured test cases with Question, Expected response, and Testing method columns | Always — this is the primary method for structured, scenario-driven coverage |
| Quick question set | Auto-generates a small set of questions from the agent's knowledge sources | Early exploration — quickly see what questions the agent's content can answer before writing detailed cases |
| Full question set | Auto-generates a comprehensive set of questions from knowledge sources | Broader coverage check — use after the initial eval to find gaps the structured plan missed |
| Test chat → test set | Converts a manual test chat session into a reusable test set | When someone has already tested the agent manually and wants to make those tests repeatable |
| Themes-based (see Step 0b) | Creates test sets from real user question clusters in production analytics | Production agents — captures actual user phrasing and topic distribution |
| Manual entry | Create individual test cases directly in the Copilot Studio UI | One-off additions, Custom method cases (which cannot be CSV-imported), and edge cases discovered during testing |
Recommended strategy: Start with CSV import from `/eval-generator` for structured scenario coverage, then supplement with Quick or Full question sets to catch blind spots the plan didn't anticipate. For production agents, add themes-based test sets for real-world validation. Use manual entry for Custom-method cases that require rubric definitions.
Tell the customer: "CSV import gives you precision — every case tests a specific scenario you designed. Auto-generation gives you breadth — it finds questions you didn't think to ask. Use both."
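For the CSV-import path, the three columns named in the table can be produced with a few lines of Python. A sketch, assuming the column headers match your Copilot Studio import template (verify against the template before importing; the sample questions are hypothetical):

```python
import csv

# Illustrative test-case CSV for import. Header names are assumed to match
# the Copilot Studio import template; check the template in your tenant.
cases = [
    ("What is the refund window?",
     "Refunds are accepted within 30 days.", "Keyword Match (All)"),
    ("Can I get my money back after a month?",
     "Refunds are accepted within 30 days.", "Compare Meaning"),
]

with open("eval_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Question", "Expected response", "Testing method"])
    writer.writerows(cases)
```

Generating the file programmatically keeps the test set versionable alongside the eval plan document.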
Step 0d — Choose test data generation approach
Per the MS Learn Common evaluation approaches, there are three strategies for generating request-response pairs. The choice affects multi-turn fidelity, cost, and what kinds of failures you can detect:
| Approach | How it works | Strengths | Weaknesses | Best for |
|---|---|---|---|---|
| Echo | Replay a static list of prompts word-for-word | Low cost; fair A/B comparisons when changing one variable (model upgrade, single tool change) | Can’t adapt to different responses — later turns may not match conversation context | Single-turn scenarios, deterministic checks (citation display, tool trigger, simple Q&A) |
| Historical replay | Replay each turn in context of prior prompts and responses | Detects where and how much each turn diverges from the ideal path | Still can’t handle truly dynamic conversations (learning, real-time web search) | Model change comparisons, understanding per-turn divergence from baseline behavior |
| Synthesized personas | A human or agentic actor generates conversation in real time based on a scenario and persona | Dynamically assesses complex scenarios (tutoring, negotiation, multi-step troubleshooting) | Grading requires nuance; higher cost (LLM or human tester per conversation) | Multi-turn agents, complex workflows, persona-dependent behavior |
How to recommend:
- Simple FAQ / knowledge agents → Echo is sufficient. Most test cases are single-turn with deterministic expected answers.
- Task agents being upgraded (model change, tool swap) → Historical replay to compare before/after at each turn.
- Complex multi-step agents (process navigation, troubleshooting, triage) → Synthesized personas for realistic coverage. Echo won’t catch context-dependent failures.
- Hybrid is common: use Echo for the core regression set (fast, cheap, repeatable) and Synthesized personas for the exploratory/edge-case set (realistic, expensive, high-signal).
Tell the customer: “Echo tells you if the same questions still get the same answers. Synthesized personas tell you if the agent can actually handle a real conversation. You need both — Echo for speed, personas for truth.”
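The recommendation logic above reduces to a small decision rule. A sketch, where the trait names are illustrative and Echo is always included per the hybrid advice:

```python
# Sketch of the generation-approach recommendation above.
# Trait keys are illustrative, not a real schema.
def recommend_generation(agent):
    approaches = set()
    if agent.get("multi_step"):
        approaches.add("Synthesized personas")   # catches context-dependent failures
    if agent.get("being_upgraded"):
        approaches.add("Historical replay")      # per-turn before/after comparison
    # Echo always earns a place: fast, cheap, repeatable regression core.
    approaches.add("Echo")
    return approaches

rec = recommend_generation({"multi_step": True, "being_upgraded": False})
```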
Output structure
1. One-line summary
Restate the agent's task in one sentence, starting with "Agent task:". Name the matched business-problem and capability scenario types by their plain names (e.g., "Information Retrieval", "Knowledge Grounding").
2. Scenario plan table
This table is the primary handoff artifact to `/eval-generator` — the generator will produce one test case per row. Make it complete enough that the generator needs no additional context.
Produce a table with these columns:
| # | Scenario Name | Category | Tag | Evaluation Methods |
|---|---|---|---|---|
Be specific: name the actual scenario based on the agent description, not just the category.
Use this category distribution (from the Eval Scenario Library's eval-set-template):
- Core business scenarios: 30-40% of test cases
- Capability scenarios: 20-30%
- Edge cases & safety: 10-20%
- Variations (different phrasings of core): 10-20%
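The distribution bands can be turned into concrete per-category counts when sizing a suite. A minimal sketch, using band midpoints as an assumption (rounded counts may not sum exactly to the target):

```python
# Turn the category distribution bands above into concrete case counts.
# Using each band's midpoint is an assumption, not part of the template.
DISTRIBUTION = {
    "Core business scenarios": (0.30, 0.40),
    "Capability scenarios":    (0.20, 0.30),
    "Edge cases & safety":     (0.10, 0.20),
    "Variations":              (0.10, 0.20),
}

def plan_counts(total_cases):
    """Allocate cases per category using each band's midpoint."""
    return {name: round(total_cases * (lo + hi) / 2)
            for name, (lo, hi) in DISTRIBUTION.items()}

counts = plan_counts(20)  # sized for a 20-case starter suite
```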
For evaluation methods, use the Scenario Library's quality-signal-to-method mapping:
| What you're testing | Primary method | Secondary method |
|---|---|---|
| Factual accuracy (specific facts, numbers) | Keyword Match (All) | Compare Meaning |
| Factual accuracy (flexible phrasing) | Compare Meaning | Keyword Match (Any) |
| Policy compliance (mandatory language) | Custom | Keyword Match (All) |
| Policy compliance (nuanced judgment) | Custom | General Quality |
| Tool invocation correctness | Capability Use | Keyword Match (Any) |
| Knowledge source selection | Capability Use | Compare Meaning |
| Topic routing accuracy | Capability Use | — |
| Response quality, tone, empathy | General Quality | Compare Meaning |
| Tone/brand voice adherence | Custom | General Quality |
| Hallucination prevention | Compare Meaning | General Quality |
| Regulatory/HR/legal compliance | Custom | Keyword Match (All) |
| Edge case handling | Keyword Match (Any) | General Quality |
| Negative tests (must NOT do X) | Keyword Match — negative | Capability Use — negative |
When to use Custom: The Custom test method lets you define evaluation instructions (a rubric) with labeled outcomes (e.g., "Compliant" / "Non-compliant") and assign pass/fail to each label. Use it when:
- The pass/fail criteria require judgment, not just keyword presence — e.g., "Is this response empathetic?" or "Does this follow our escalation policy?"
- You need domain-specific rubrics — e.g., HR compliance, medical disclaimers, financial suitability
- Standard methods (Keyword Match, Compare Meaning) cannot capture the quality signal — the answer is not about specific words or semantic similarity, it is about whether the response meets a policy or standard
- You want to test tone, style, or brand voice beyond what General Quality covers
Custom is not available for CSV import — test cases using Custom must be created directly in Copilot Studio's evaluation UI.
Beyond Custom — rubric-based grading: For customers who need more granular quality scoring than pass/fail, the Copilot Studio Kit supports rubric-based grading on a 1–5 scale. Rubrics replace the standard validation logic with a custom AI grader aligned to domain-specific criteria. Two modes: Refinement (grade + rationale — use first to calibrate the rubric against human judgment) and Testing (grade only — use for routine QA after the rubric is trusted). If the plan includes Custom methods for compliance, tone, or brand voice, note in the plan rationale that rubric-based grading is an advanced option for ongoing calibrated quality assurance.
Planning for rubric calibration effort: Rubric refinement is iterative — expect 3–5 calibration rounds before AI-human alignment is acceptable. Each round involves: running the test set, human-grading every case with written reasoning, comparing alignment scores, and refining the rubric. Plan for this when the eval plan includes rubric-graded scenarios:
- Alignment target: 75–90% average alignment (formula: `100% × (1 − |AI grade − Human grade| / 4)`). Don’t plan for 100% — some subjectivity is inherent and diminishing returns start around 85%.
- Time investment: Each calibration round requires a domain expert to grade and write reasoning for every test case. Budget 1–2 hours per round for a 10–15 case set.
- When rubrics are worth the investment: Compliance-heavy domains, regulated industries, brand-sensitive customer-facing agents, or any scenario where "pass/fail" is too coarse and you need calibrated quality scores over time.
- When to skip rubrics: Low-risk internal tools, simple FAQ agents, or early-stage evals where Custom pass/fail is sufficient. Start with Custom; graduate to rubrics when you need ongoing calibrated scoring.
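The alignment formula lends itself to a worked example. A sketch on the 1–5 rubric scale, with hypothetical grades:

```python
# Worked example of the alignment formula above, on a 1-5 rubric scale.
def alignment(ai_grade, human_grade):
    """Per-case alignment: 100% on exact match, 0% at maximum (4-point) disagreement."""
    return 100.0 * (1 - abs(ai_grade - human_grade) / 4)

def average_alignment(pairs):
    return sum(alignment(a, h) for a, h in pairs) / len(pairs)

# Hypothetical calibration round: three cases graded by the AI and a human.
round_1 = [(4, 5), (3, 3), (2, 4)]
avg = average_alignment(round_1)   # (75 + 100 + 50) / 3 = 75.0
```

An average of 75% sits at the bottom of the target band, so this hypothetical round would warrant another rubric refinement pass.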
Always recommend two methods per scenario where possible.
Total count: 10-15 scenarios for a complete suite.
3. Quality signals
List the quality signals relevant to this agent (from the Eval Scenario Library's five quality signals). Only include signals that apply:
- Policy Accuracy — Does the agent follow business rules correctly?
- Source Attribution — Does the agent ground claims in retrieved documents and cite them?
- Personalization — Does the agent adapt responses to user context (role, department, history)?
- Action Enablement — Does the agent empower users to take the next step?
- Privacy Protection — Does the agent avoid exposing sensitive information?
Map each signal to the scenarios that test it.
4. Pass/fail thresholds
Use risk-based thresholds (from the Eval Scenario Library's eval-set-template):
| Category | Target pass rate | Blocking threshold |
|---|---|---|
| Overall | ≥85% | <60% → BLOCK |
| Core business scenarios | ≥90% | <80% → BLOCK |
| Capability scenarios | ≥90% | <80% → BLOCK |
| Safety & compliance | ≥95% | <95% → BLOCK |
| Regression (if applicable) | ≥95% | <90% → BLOCK — any regression means the change broke something |
| Edge cases | ≥70% | (hard by design — iterate, don't block) |
Adjust based on risk profile: low-risk internal tool (lower by 10%), customer-facing (standard), regulated or safety-critical (raise by 5-10%).
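The threshold table above amounts to a simple gating rule. A sketch covering a subset of the categories, with an illustrative results structure (the risk-profile adjustment is left out for brevity):

```python
# Sketch of the risk-based gating logic in the table above.
# Covers a subset of categories; (target, blocking) as fractions.
THRESHOLDS = {
    "Overall":                 (0.85, 0.60),
    "Core business scenarios": (0.90, 0.80),
    "Capability scenarios":    (0.90, 0.80),
    "Safety & compliance":     (0.95, 0.95),
    "Edge cases":              (0.70, None),  # hard by design: iterate, don't block
}

def gate(pass_rates):
    """Return per-category verdicts: PASS, ITERATE (below target), or BLOCK."""
    verdicts = {}
    for category, rate in pass_rates.items():
        target, blocking = THRESHOLDS[category]
        if blocking is not None and rate < blocking:
            verdicts[category] = "BLOCK"
        elif rate < target:
            verdicts[category] = "ITERATE"
        else:
            verdicts[category] = "PASS"
    return verdicts

verdicts = gate({"Safety & compliance": 0.93, "Edge cases": 0.65, "Overall": 0.87})
```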
4b. Planning for version comparison
Copilot Studio supports comparative evaluation — running the same test set against different agent versions and comparing results side by side. Plan for this from the start:
Establish a baseline run: The first eval run against this plan becomes the baseline. Before making any agent changes, export the results CSV and save it. All future runs will be compared against this baseline to measure improvement or catch regressions.
When to use result comparison:
- After any agent change (prompt edit, knowledge source update, connector change) — compare the new run against the previous run to verify fixes didn't break passing cases
- When comparing two agent configurations (e.g., different system prompts, different knowledge source sets) — run the same test set against both and compare
- During Stage 3 (Systematic Expansion) — compare expanded test sets against the baseline to confirm new scenarios don't destabilize existing ones
Set-level grading: In addition to individual case pass/fail, Copilot Studio can evaluate quality across the entire test set as a whole. Use set-level grading when:
- Individual results are mixed (some pass, some fail) and you need to determine if the agent is generally competent or systematically broken
- Pass rate is near a threshold boundary (e.g., 84% vs. the 85% target) — set-level grading adds context beyond the raw number
- You want to track overall quality trends across multiple runs without getting lost in case-by-case noise
Plan implication: When designing the scenario plan, ensure test cases within each category are independently valuable — each should test a distinct behavior. This makes version comparison meaningful: if Case 5 flips from Pass to Fail after a change, you know exactly which behavior regressed because the case tests one specific thing.
⚠️ Data retention: Copilot Studio retains test run results for 89 days only — after that, results are permanently deleted. Plan an export habit from run #1:
- Export the baseline results CSV immediately after the first eval run
- Export after every subsequent run before comparing to the baseline
- Store exported CSVs alongside the eval plan document, tagged with agent version and run date
- This is especially critical for regression workflows: if you re-run after a change but the "before" results have expired, you cannot prove improvement
Tell the customer: "Treat every eval run as perishable. Export the CSV the same day you run it. In 89 days you'll thank yourself — or regret not listening."
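The export habit can be automated with a small helper. A sketch, where the storage folder and naming convention are assumptions, not a Copilot Studio feature:

```python
from datetime import date
from pathlib import Path
import shutil

# Illustrative export-archiving habit for the 89-day retention window.
# Folder name and filename convention are assumptions.
def archive_run(downloaded_csv, agent_version, store="eval-results"):
    """Copy an exported results CSV next to the eval plan, tagged with version and date."""
    dest = Path(store)
    dest.mkdir(exist_ok=True)
    target = dest / f"{agent_version}_{date.today().isoformat()}_results.csv"
    shutil.copy(downloaded_csv, target)
    return target
```

Run it against every downloaded results CSV the same day as the eval run, so baselines survive past the retention window.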
5. Priority order
State which categories to write first. Default priority (from MS Learn Stage 2):
- Core business scenarios — proves the agent does its job
- Safety & compliance — catches deal-breaker failures early
- Capability scenarios — isolates component-level problems
- Edge cases & variations — stress-tests robustness
- Regression (when updating an existing agent) — run BEFORE deploying any change to verify nothing broke
Deviate only when the agent description implies safety-critical use (move safety to first). If the customer is updating an existing agent, regression moves to position 1 — verify existing behavior first, then evaluate the new changes.
When the agent is being updated (regression applies): The priority order changes. Run the existing regression set first — if core scenarios that previously passed now fail, stop and fix before writing new tests. Then follow the order above for any new scenarios added to cover the change.
6. Planning rationale (teach the WHY)
This section explains the reasoning behind the plan so the customer can modify it intelligently and build future eval plans without help. For each of the following, write 2-3 sentences of plain-language explanation:
- Why these scenario types were selected: Explain the connection between the agent's task and the chosen business-problem and capability scenario types. Example: "This agent retrieves answers from HR documents, so Information Retrieval is the primary business-problem type — it covers the core loop of question → search → answer. Knowledge Grounding is the primary capability type because the agent must stay within the documents and not hallucinate policy that doesn't exist."
- Why this category distribution: Explain why the percentages are weighted the way they are for this specific agent. A safety-critical agent should have more safety test cases than a low-risk internal tool. Example: "Because this agent handles refund decisions with real financial impact, core business scenarios are weighted at 40% (not the minimum 30%) — getting the refund policy wrong has direct cost. Safety is at 20% because the agent can make promises the company has to honor."
- Why these quality signals and not others: Explain which quality signals are most critical for this agent and why others were excluded or deprioritized. Example: "Policy Accuracy is the top signal because this agent must follow specific refund rules. Source Attribution matters because customers may dispute decisions and need to see the policy reference. Personalization is excluded — the refund policy applies uniformly regardless of who asks."
- What the plan does NOT cover: Explicitly state what's out of scope and why. This prevents the customer from assuming the eval plan is exhaustive. Example: "This plan does not cover multi-turn conversation flows (the agent handles single-turn Q&A). If the agent is later extended to handle follow-up questions, add Process Navigation scenarios."
This section is critical for customer enablement. The customer should walk away understanding the evaluation framework well enough to add new scenarios on their own when their agent changes.
- How many test cases to start with: Customers often delay eval because they think they need hundreds of perfect test cases. They don't. Start with 20-50 cases built from the highest-impact scenarios. Prioritize cases drawn from real failures — support tickets, user complaints, known edge cases, and bugs found during manual testing. These are higher signal than synthetic "what if" cases because they represent problems that already happened. Example: "You don't need 200 test cases to start. Start with 20-30 that cover your core business scenarios and known failure modes. A small set of high-signal cases run weekly beats a comprehensive set that never gets built. Expand later — the eval checklist's Stage 3 is designed for exactly that."
- Why real failures beat synthetic cases: A test case built from a real support ticket ("user asked X, got wrong answer Y") tests a failure that actually happened to a real user. A synthetic test case ("what if someone asks about refunds in French?") tests a hypothetical. Both matter, but the real failure is higher priority — it already cost you something. Build your first eval set from real failures, then backfill with synthetic cases to cover gaps.
- Does this agent need multi-profile testing? If the agent's knowledge sources are role-gated (e.g., SharePoint sites with different permissions for directors vs. interns), recommend creating separate test sets per user profile. Copilot Studio lets you assign a user profile to each test set — the eval runs under that user’s authentication, so results reflect what that role actually sees. This surfaces role-based gaps (e.g., an intern getting “access denied” while a director gets the answer). Limitation: Multi-profile testing only works for agents without connector dependencies. If the agent uses authenticated tools/connectors, evals must run under the tool owner’s account.
Understanding the two kinds of eval: Not all test sets serve the same purpose, and confusing them leads to misinterpreted results. Teach the customer this distinction:
- Capability evals test hard behaviors the agent doesn’t reliably do yet — new features, complex edge cases, scenarios where you’re pushing quality UP. Initial pass rates may be 30-60%, and that’s expected. Success = steady improvement over iterations. A 50% pass rate on a capability eval is progress, not failure.
- Regression evals test behaviors the agent already handles well — your existing passing test cases, re-run after every change. Pass rates should stay at ~100%. Success = nothing broke. A 95% pass rate on a regression eval is an alarm, not a good score.
When presenting the scenario plan, label each category as primarily capability or regression:
- Core business scenarios → Capability (initially) → Regression (once passing and the agent is updated)
- Capability scenarios → Capability
- Safety & compliance → Capability (initially) → Regression (once passing — these must never regress)
- Edge cases & variations → Capability (always — these are aspirational by design)
- Regression set → Regression (by definition)
Tell the customer: "When you read eval results, the first question isn’t 'what’s my pass rate?' It’s 'is this a capability eval or a regression eval?' A 40% pass rate is great news on one and terrible news on the other."
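The reading rule can be made concrete. A sketch, with illustrative thresholds:

```python
# Sketch of the reading rule above: the same pass rate means opposite
# things depending on eval kind. Thresholds are illustrative.
def interpret(pass_rate, kind):
    if kind == "regression":
        # Anything below 100% means a previously passing behavior broke.
        return "ALARM: something regressed" if pass_rate < 1.0 else "OK: nothing broke"
    elif kind == "capability":
        # Low rates are expected; track the trend across iterations instead.
        return "progress signal: compare against the previous iteration"
    raise ValueError("kind must be 'capability' or 'regression'")

interpret(0.40, "capability")   # progress, not failure
interpret(0.95, "regression")   # an alarm, not a good score
```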
- Recommended eval cadence: Customers often ask "how often should I run evals?" The answer depends on whether the agent is actively changing or stable. Include a cadence recommendation in the plan:
Trigger-based (run immediately):
- After any agent change: system prompt edits, knowledge source updates, connector/plugin changes, model switches
- After any platform update that affects the agent's behavior
- Before every deployment or republish — this is the regression gate
Scheduled (run on a calendar):
- Active development: Run the core business + safety subsets after every change. Run the full suite weekly.
- Post-launch, stable agent: Run the full suite monthly, or whenever analytics show a quality dip (e.g., spike in thumbs-down reactions or increased escalation rate).
- Regulated/high-risk agents: Run the full suite weekly regardless of changes — the eval run itself is evidence of ongoing compliance.
The production → eval feedback loop: Every production incident should become a test case within 24 hours. When a user reports a bad answer, a support ticket cites wrong information, or analytics show a new failure pattern — add a test case that reproduces it. This is the eval flywheel: production failure → new test case → eval catches it → fix → verify fix → the case joins the regression set permanently.
Tell the customer: "The eval plan isn't a one-time document — it's a living test suite. Every bug your users find that your evals didn't is a gap in coverage. Close it the same day."
Stage 3 expansion guidance: When the customer is ready to expand beyond this initial plan (Stage 3 of the MS Learn framework), point them to the evaluation checklist and its downloadable template. Stage 3 introduces four expansion categories that broaden coverage beyond the initial plan:
- Foundational core — the "must pass" set for deployment and regression detection (maps to our Core business + Safety categories)
- Agent robustness — how the agent handles phrasing variations, rich context, multi-intent prompts, and user-specific requests (maps to our Variations category)
- Architecture test — functional performance of tools, knowledge retrieval, routing, and handoffs (maps to our Capability scenarios)
- Edge cases — boundary conditions, forbidden behaviors, out-of-scope handling (maps to our Edge cases & safety category)
Recommend the customer downloads the editable checklist template from GitHub to track their eval maturity across all four stages. Target a realistic pass rate of 80-90% per the checklist guidance — agents are probabilistic and perfect scores are suspicious, not aspirational.
1. 单行摘要
用一句话重述Agent的任务,以「Agent任务:」开头。用通用名称列出匹配的业务问题和能力场景类型(例如「Information Retrieval」「Knowledge Grounding」)。
2. 场景规划表
该表是交给/eval-generator的核心交付物——生成器会为每一行生成一个测试用例。请保证内容足够完整,生成器不需要额外上下文。
生成包含以下列的表格:
| # | 场景名称 | 分类 | 标签 | 评估方法 |
|---|---|---|---|---|
请具体描述:基于Agent描述给出实际场景名称,不要只写分类。
使用以下分类占比(来自Eval Scenario Library的eval-set-template):
- 核心业务场景:测试用例的30-40%
- 能力场景:20-30%
- 边缘case与安全:10-20%
- 变体(核心场景的不同提问方式):10-20%
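上面的占比可以这样换算为具体用例数(示意代码,allocate为假设的辅助函数,非官方工具):

```python
# 示意:把上面的分类占比换算为具体用例数(allocate为假设的辅助函数,非官方工具)
def allocate(total, weights):
    """按权重分配用例数,四舍五入产生的剩余名额归入核心业务场景。"""
    counts = {name: round(total * w) for name, w in weights.items()}
    counts["核心业务场景"] += total - sum(counts.values())
    return counts

plan = allocate(12, {
    "核心业务场景": 0.35,    # 取30-40%的中值
    "能力场景": 0.25,        # 取20-30%的中值
    "边缘case与安全": 0.15,  # 取10-20%的中值
    "变体": 0.15,
})
```

12个场景的套件会得到大约5/3/2/2的分配,落在下文建议的10-15个总量区间内。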
评估方法请使用场景库的质量信号-方法映射表:
| 测试内容 | 主要方法 | 次要方法 |
|---|---|---|
| 事实准确性(特定事实、数字) | Keyword Match (All) | Compare Meaning |
| 事实准确性(灵活表述) | Compare Meaning | Keyword Match (Any) |
| 政策合规(强制话术) | Custom | Keyword Match (All) |
| 政策合规(需要细致判断) | Custom | General Quality |
| 工具调用正确性 | Capability Use | Keyword Match (Any) |
| 知识库选择正确性 | Capability Use | Compare Meaning |
| 主题路由准确性 | Capability Use | — |
| 响应质量、语气、共情能力 | General Quality | Compare Meaning |
| 语气/品牌声音一致性 | Custom | General Quality |
| 幻觉预防 | Compare Meaning | General Quality |
| 监管/HR/法律合规 | Custom | Keyword Match (All) |
| 边缘case处理 | Keyword Match (Any) | General Quality |
| 负向测试(禁止做X) | Keyword Match — negative | Capability Use — negative |
何时使用Custom: Custom测试方法允许你定义带有标注结果(例如「合规」/「不合规」)的评估指令(评分规则),并为每个标签分配通过/失败状态。以下场景适用:
- 通过/失败标准需要判断逻辑,不只是关键词存在——例如「该回答是否有共情能力?」或「是否符合我们的升级政策?」
- 需要领域特定的评分规则——例如HR合规、医疗免责声明、财务适用性
- 标准方法(Keyword Match、Compare Meaning)无法捕获质量信号——答案不涉及特定词语或语义相似性,核心是响应是否符合政策或标准
- 你需要测试语气、风格或品牌声音,超出General Quality的覆盖范围
Custom方法不支持CSV导入——使用Custom的测试用例必须直接在Copilot Studio的评估UI中创建。
Custom之外——基于评分规则的分级: 如果客户需要比通过/失败更细粒度的质量评分,Copilot Studio Kit支持1-5分的基于评分规则的分级。评分规则用对齐领域特定标准的自定义AI评分器替代标准验证逻辑,有两种模式:优化(评分+理由——首先用于基于人工判断校准评分规则)和测试(仅评分——评分规则可信后用于日常QA)。如果计划包含合规、语气或品牌声音的Custom方法,请在规划理由中注明,基于评分规则的分级是持续校准质量保证的高级选项。
评分规则校准工作量规划: 评分规则优化是迭代过程——预期需要3-5轮校准才能达到可接受的AI-人工对齐水平。每一轮包括:运行测试集、人工为每个用例评分并给出书面理由、对比对齐分数、优化评分规则。如果评估计划包含评分规则分级的场景,请做如下规划:
- 对齐目标: 75-90%的平均对齐率(公式:100% × (1 − |AI评分 − 人工评分| / 4))。不要规划100%对齐——一定的主观性是固有属性,85%左右就会进入收益递减阶段。
- 时间投入: 每轮校准需要领域专家为每个测试用例评分并撰写理由,10-15个用例的集合每轮需要预留1-2小时。
- 评分规则值得投入的场景: 重合规领域、受监管行业、品牌敏感的面向客户的Agent,或任何「通过/失败」粒度过粗,需要长期校准质量分数的场景。
- 不需要使用评分规则的场景: 低风险内部工具、简单FAQ Agent、早期评估阶段Custom的通过/失败已足够的场景。先从Custom开始,当你需要持续校准评分时再升级到评分规则。
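上文的对齐公式可以直接落成一小段计算脚本(示意,评分数据为虚构示例):

```python
# 示意:按上文公式计算AI评分与人工评分的平均对齐率(1-5分制;评分数据为虚构示例)
def alignment(ai_scores, human_scores):
    """逐用例计算 1 − |AI评分 − 人工评分| / 4,再取平均并换算为百分比。"""
    per_case = [1 - abs(a - h) / 4 for a, h in zip(ai_scores, human_scores)]
    return 100 * sum(per_case) / len(per_case)

ai    = [4, 3, 5, 2, 4, 5, 3, 4, 2, 5]
human = [4, 4, 5, 3, 4, 4, 3, 5, 2, 5]
print(f"{alignment(ai, human):.1f}%")  # 90.0% —— 落入75-90%目标区间即可停止优化
```

每轮校准后运行一次,把结果记入校准日志,对齐率稳定在目标区间即可切换到测试模式。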
每个场景尽可能推荐两种方法。
总数量:完整套件包含10-15个场景。
3. 质量信号
列出和该Agent相关的质量信号(来自Eval Scenario Library的五类质量信号),仅包含适用的信号:
- Policy Accuracy — Agent是否正确遵守业务规则?
- Source Attribution — Agent是否将主张基于检索到的文档并引用来源?
- Personalization — Agent是否根据用户上下文(角色、部门、历史)适配响应?
- Action Enablement — Agent是否能引导用户采取下一步行动?
- Privacy Protection — Agent是否避免暴露敏感信息?
将每个信号映射到测试它的场景。
4. 通过/失败阈值
使用基于风险的阈值(来自Eval Scenario Library的eval-set-template):
| 分类 | 目标通过率 | 阻塞阈值 |
|---|---|---|
| 整体 | ≥85% | <60% → 阻塞发布 |
| 核心业务场景 | ≥90% | <80% → 阻塞发布 |
| 能力场景 | ≥90% | <80% → 阻塞发布 |
| 安全与合规 | ≥95% | <95% → 阻塞发布 |
| 回归测试(如适用) | ≥95% | <90% → 阻塞发布——任何回归都意味着变更破坏了现有功能 |
| 边缘case | ≥70% | (设计上难度较高——迭代优化,不要阻塞发布) |
根据风险画像调整:低风险内部工具(降低10%)、面向客户(标准)、受监管或安全关键(提升5-10%)。
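上表的关卡逻辑可以这样表达(示意代码,阈值直接取自上表,release_gate为假设的函数名):

```python
# 示意:把上表落成发布关卡检查(阈值取自上表;release_gate为假设的函数名)
THRESHOLDS = {  # 分类: (目标通过率, 阻塞阈值)
    "整体": (0.85, 0.60),
    "核心业务场景": (0.90, 0.80),
    "能力场景": (0.90, 0.80),
    "安全与合规": (0.95, 0.95),
    "回归测试": (0.95, 0.90),
}

def release_gate(pass_rates):
    """返回(是否阻塞发布, 触发阻塞的分类, 未达目标的分类)。"""
    blocked = [c for c, r in pass_rates.items()
               if c in THRESHOLDS and r < THRESHOLDS[c][1]]
    below_target = [c for c, r in pass_rates.items()
                    if c in THRESHOLDS and r < THRESHOLDS[c][0]]
    return len(blocked) > 0, blocked, below_target
```

注意边缘case分类故意不在阻塞表中——按上表设计,它只迭代优化,不阻塞发布。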
4b. 版本对比规划
Copilot Studio支持对比评估——对不同版本的Agent运行相同测试集,并排对比结果。请从一开始就做相关规划:
建立基线运行: 基于该计划的第一次评估运行即为基线。在对Agent做任何变更前,导出结果CSV并保存,所有后续运行都会和该基线对比,衡量改进或发现回归。
使用结果对比的场景:
- Agent发生任何变更后(prompt编辑、知识库更新、连接器变更)——将新运行结果和之前的运行结果对比,验证修复没有破坏已通过的用例
- 对比两种Agent配置时(例如不同的系统prompt、不同的知识库集合)——对两者运行相同测试集并对比
- 第3阶段(系统化扩展)期间——将扩展后的测试集和基线对比,确认新场景不会破坏现有功能的稳定性
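逐用例的结果对比可以用一个小函数示意(结果字段为假设,非平台导出格式):

```python
# 示意:逐用例对比基线运行与新运行,找出回归与修复(结果字段为假设,非平台导出格式)
def diff_runs(baseline, current):
    """baseline/current: {用例名: "pass" 或 "fail"}。返回(回归的用例, 修复的用例)。"""
    regressions = [k for k in baseline
                   if baseline[k] == "pass" and current.get(k) == "fail"]
    fixes = [k for k in baseline
             if baseline[k] == "fail" and current.get(k) == "pass"]
    return regressions, fixes
```

只要每个用例测试一个独立行为(见下文「规划影响」),regressions列表就能直接指出是哪个行为被破坏了。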
集级别评分: 除了单个用例的通过/失败外,Copilot Studio还可以对整个测试集的整体质量做评估。以下场景适用集级别评分:
- 单个结果有好有坏(部分通过、部分失败),你需要判断Agent整体是合格的还是存在系统性问题
- 通过率接近阈值边界(例如84% vs 85%的目标)——集级别评分可以提供原始数字之外的上下文
- 你希望追踪多轮运行的整体质量趋势,不用陷入单个用例的细节
规划影响: 设计场景计划时,请确保每个分类下的测试用例都是独立有价值的——每个用例都应该测试一个独立的行为。这可以让版本对比更有意义:如果变更后第5个用例从通过变为失败,你可以明确知道哪个行为发生了回归,因为该用例只测试一个特定功能。
⚠️ 数据留存: Copilot Studio仅保留测试运行结果89天——到期后结果会永久删除。请从第一次运行开始就养成导出习惯:
- 第一次评估运行完成后立即导出基线结果CSV
- 每次后续运行和基线对比前都要导出结果
- 将导出的CSV和评估计划文档一起存储,标记Agent版本和运行日期
- 这对回归工作流尤其重要:如果你在变更后重新运行测试,但「之前」的结果已经过期,你就无法证明功能得到了改进
告知客户:「请把每次评估运行都视为易失数据,运行当天就导出CSV。89天后你会感谢自己——或者后悔没听这个建议。」
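导出文件的命名可以固定一个约定,把Agent版本和运行日期编进文件名(示意代码,命名格式为建议而非平台要求):

```python
# 示意:导出CSV的命名约定,把Agent版本和运行日期编进文件名(格式为建议,非平台要求)
from datetime import date

def export_name(agent, version, run_type="baseline", run_date=None):
    """生成形如 eval_<agent>_<版本>_<类型>_<日期>.csv 的文件名。"""
    run_date = run_date or date.today()
    return f"eval_{agent}_{version}_{run_type}_{run_date:%Y%m%d}.csv"
```

固定的命名让89天后留存下来的CSV仍然可以按版本和日期追溯对应的基线。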
5. 优先级顺序
说明应该优先编写哪些分类的用例。默认优先级(来自MS Learn第2阶段):
- 核心业务场景——验证Agent可以完成本职工作
- 安全与合规——尽早发现致命性失败
- 能力场景——隔离组件级问题
- 边缘case与变体——压力测试健壮性
- 回归测试(更新现有Agent时)——部署任何变更前先运行,验证没有破坏现有功能
仅当Agent描述涉及安全关键用途时调整顺序(将安全移到第一位)。如果客户是更新现有Agent,回归测试移到第一位——先验证现有行为,再评估新变更。
Agent正在更新时(适用回归测试): 优先级顺序会变化。首先运行现有回归测试集——如果之前通过的核心场景现在失败了,先修复问题再编写新测试。随后按照上述顺序编写覆盖变更的新场景。
6. 规划理由(讲解背后的原因)
该部分解释计划背后的逻辑,让客户可以智能修改计划,未来不需要帮助也能构建评估计划。针对以下每个点,写2-3句通俗易懂的解释:
- 为什么选择这些场景类型: 解释Agent的任务和选择的业务问题、能力场景类型之间的关联。示例:「该Agent从HR文档中检索答案,所以Information Retrieval是核心业务问题类型——它覆盖了提问→搜索→回答的核心循环。Knowledge Grounding是核心能力类型,因为Agent必须基于文档内容回答,不能凭空捏造不存在的政策。」
- 为什么采用该分类占比: 解释该特定Agent的百分比权重逻辑。安全关键的Agent应该比低风险内部工具包含更多安全测试用例。示例:「因为该Agent处理有实际财务影响的退款决策,核心业务场景的权重设置为40%(不是最低的30%)——弄错退款政策会产生直接成本。安全占比20%,因为Agent做出的承诺公司必须兑现。」
- 为什么选择这些质量信号,而不是其他: 解释哪些质量信号对该Agent最关键,为什么其他信号被排除或降低优先级。示例:「Policy Accuracy是最重要的信号,因为该Agent必须遵守特定的退款规则。Source Attribution很重要,因为客户可能对决策有异议,需要查看政策参考。Personalization被排除——退款政策统一适用,和提问人身份无关。」
- 计划不覆盖的内容: 明确说明范围外的内容和原因,避免客户认为评估计划是穷尽的。示例:「该计划不覆盖多轮对话流程(该Agent处理单轮Q&A)。如果后续扩展Agent支持追问,请添加Process Navigation场景。」
该部分对客户赋能至关重要,客户应该能理解评估框架,当Agent变更时可以自行添加新场景。
- 初始需要多少测试用例: 客户经常因为觉得需要数百个完美的测试用例而推迟评估,其实不需要。从影响最高的场景构建20-50个用例开始即可。优先选择来自真实失败的用例——支持工单、用户投诉、已知边缘case、手动测试发现的bug。这些比合成的「假设」用例信号价值更高,因为它们代表已经发生过的问题。示例:「你不需要一开始就有200个测试用例,从覆盖核心业务场景和已知失败模式的20-30个用例开始即可。每周运行的少量高信号用例,比永远做不完的全面用例集合更有价值。后续再扩展——评估检查清单的第3阶段就是为这个设计的。」
- 为什么真实失败比合成用例好: 基于真实支持工单构建的测试用例(「用户问了X,得到错误答案Y」)测试的是真实用户实际遇到的失败,合成测试用例(「如果有人用法语问退款相关问题怎么办?」)测试的是假设场景。两者都重要,但真实失败的优先级更高——它已经造成了实际损失。首先基于真实失败构建第一版评估集,再用合成用例补充覆盖缺口。
- 该Agent需要多配置文件测试吗? 如果Agent的知识库是按角色权限管控的(例如SharePoint站点对总监和实习生有不同权限),建议为每个用户配置文件创建独立的测试集。Copilot Studio允许为每个测试集分配用户配置文件——评估会在该用户的认证下运行,所以结果反映该角色实际看到的内容。这可以发现基于角色的缺口(例如实习生得到「访问拒绝」,而总监可以得到答案)。限制: 多配置文件测试仅适用于没有连接器依赖的Agent。如果Agent使用认证工具/连接器,评估必须在工具所有者的账户下运行。
理解两种评估类型: 不同测试集的用途不同,混淆它们会导致结果解读错误。请告知客户这一区别:
- 能力评估 测试Agent还不能稳定完成的硬行为——新功能、复杂边缘case、你正在提升质量的场景。初始通过率可能是30-60%,这是预期的。成功=迭代过程中持续提升。能力评估50%的通过率是进步,不是失败。
- 回归评估 测试Agent已经处理得很好的行为——你现有的已通过测试用例,每次变更后重新运行。通过率应该保持在~100%。成功=没有功能被破坏。回归评估95%的通过率是警报,不是好成绩。
展示场景计划时,将每个分类标注为主要是能力还是回归:
- 核心业务场景 → 能力(初始阶段) → 回归(通过后且Agent更新时)
- 能力场景 → 能力
- 安全与合规 → 能力(初始阶段) → 回归(通过后——这些绝对不能回归)
- 边缘case与变体 → 能力(始终是——设计上就是预期外的场景)
- 回归测试集 → 回归(按定义)
告知客户:「查看评估结果时,第一个问题不是『我的通过率是多少?』,而是『这是能力评估还是回归评估?』。40%的通过率在一种情况下是好消息,在另一种情况下是坏消息。」
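上面的解读规则可以写成一个小函数提醒自己(示意代码,阈值取自本计划的约定,函数名为假设):

```python
# 示意:同一通过率在两种评估类型下的不同解读(阈值取自本计划的约定,函数名为假设)
def interpret(pass_rate, eval_type):
    if eval_type == "regression":
        # 回归评估应保持在~100%——低于95%就是警报
        return "正常" if pass_rate >= 0.95 else "警报:功能回归"
    if eval_type == "capability":
        # 能力评估初始30-60%属于预期——关注趋势而非绝对值
        return "符合预期,继续迭代" if pass_rate >= 0.30 else "低于预期,需要排查"
    raise ValueError(f"未知评估类型: {eval_type}")
```

同样是40%的通过率,capability类型返回「符合预期」,regression类型返回「警报」——这正是上文强调的第一个问题。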
- 推荐的评估频率: 客户经常问「我应该多久运行一次评估?」,答案取决于Agent是在活跃变更中还是稳定状态。请在计划中包含频率建议:
触发式(立即运行):
- Agent发生任何变更后:系统prompt编辑、知识库更新、连接器/插件变更、模型切换
- 任何影响Agent行为的平台更新后
- 每次部署或重新发布前——这是回归关卡
定时(按日历运行):
- 活跃开发阶段: 每次变更后运行核心业务+安全子集,每周运行完整套件
- 上线后稳定的Agent: 每月运行完整套件,或者分析显示质量下降时运行(例如点踩反应激增或升级率上升)
- 受监管/高风险Agent: 无论是否有变更,每周运行完整套件——评估运行本身就是持续合规的证据
生产→评估反馈循环: 每个生产事件都应该在24小时内转化为测试用例。当用户上报错误回答、支持工单引用错误信息、或分析显示新的失败模式时——添加复现该问题的测试用例。这就是评估飞轮:生产失败→新测试用例→评估捕获问题→修复→验证修复→该用例永久加入回归测试集。
告知客户:「评估计划不是一次性文档,它是活的测试套件。用户发现的所有评估没覆盖到的bug都是覆盖缺口,请当天就补上。」
第3阶段扩展指导: 当客户准备好扩展初始计划(MS Learn框架的第3阶段)时,请引导他们查看评估检查清单和可下载模板。第3阶段引入四个扩展分类,扩展初始计划之外的覆盖范围:
- 基础核心 ——部署和回归检测的「必须通过」集合(对应我们的核心业务+安全分类)
- Agent健壮性 ——Agent如何处理表述变体、丰富上下文、多意图prompt和用户特定请求(对应我们的变体分类)
- 架构测试 ——工具、知识检索、路由和转接的功能表现(对应我们的能力场景)
- 边缘case ——边界条件、禁止行为、超出范围的处理(对应我们的边缘case与安全分类)
建议客户从GitHub下载可编辑的检查清单模板,追踪四个阶段的评估成熟度。根据检查清单指导,目标通过率设为**80-90%**即可——Agent是概率性的,满分很可疑,不是值得追求的目标。
Step 3 — Generate output files
步骤3 — 生成输出文件
After displaying the plan in the conversation, generate two files:
A. Eval Suite Plan Report (.docx)
Use the docx skill to create a formatted report containing:
- Title: "Eval Suite Plan: [Agent Name]"
- Agent description summary
- Scenario plan table
- Quality signals and their mapping
- Pass/fail thresholds
- Priority order
- Planning rationale (the WHY section — so the customer has the reasoning in the document, not just in chat)
- Human review checkpoints (the full table from Step 4 so the customer has a printed checklist)
- Next steps recommendation
B. Eval Suite Plan Spreadsheet (.xlsx)
Use the xlsx skill to create a spreadsheet with:
- Sheet 1: Scenario Plan (columns: #, Scenario Name, Category, Tag, Evaluation Methods)
- Sheet 2: Quality Signals (signal name, description, mapped scenarios)
- Sheet 3: Thresholds (category, target pass rate, adjustment notes)
在对话中展示计划后,生成两个文件:
A. 评估套件计划报告(.docx)
使用docx技能创建格式化报告,包含:
- 标题:「评估套件计划:[Agent名称]」
- Agent描述摘要
- 场景规划表
- 质量信号及其映射
- 通过/失败阈值
- 优先级顺序
- 规划理由(为什么部分——让客户在文档中也能看到逻辑,而不只是在聊天中)
- 人工审核检查点(步骤4的完整表格,让客户有打印版的检查清单)
- 下一步建议
B. 评估套件计划电子表格(.xlsx)
使用xlsx技能创建电子表格,包含:
- Sheet 1:场景计划(列:#、场景名称、分类、标签、评估方法)
- Sheet 2:质量信号(信号名称、描述、映射的场景)
- Sheet 3:阈值(分类、目标通过率、调整说明)
Step 4 — Human review checkpoints
步骤4 — 人工审核检查点
After the output files and before the conversation ends, display a 🔍 Human Review Required section. The eval plan is the foundation — mistakes here cascade into wrong test cases, misleading results, and wasted effort. These checkpoints flag where the customer’s domain expertise is essential.
🔍 Human Review Required
| # | Checkpoint | What to verify | Why it matters |
|---|---|---|---|
| 1 | Scenario coverage matches real usage | Compare the scenario plan against your agent’s actual usage patterns — analytics, support tickets, user feedback. Are the top 3 things users do represented? | AI-generated plans skew toward textbook scenarios. Your most important real-world flows may be missing. |
| 2 | Category distribution fits your risk profile | The default is 30-40% core / 20-30% capability / 10-20% edge & safety / 10-20% variations. Adjust if your agent is safety-critical (increase safety %) or handles high-stakes tasks (increase core %). | One-size-fits-all distribution may under-test your highest-risk area. |
| 3 | Quality signals are complete | Review the listed quality signals. Are there business rules, compliance requirements, or brand guidelines that map to a signal not listed? | Missing a quality signal means an entire category of failures goes unmeasured. |
| 4 | Thresholds match your deployment gate | The suggested pass rates are starting points. Decide: what pass rate would make you confident shipping this agent? What failure rate would block a release? | Thresholds are business decisions, not technical ones — only you know your risk tolerance. |
| 5 | Priority order matches your timeline | If you’re launching in 2 weeks, you may not get to edge cases. The priority order should reflect what MUST be tested vs. what’s nice to test. | Better to thoroughly test 5 critical scenarios than superficially test 15. |
| 6 | Nothing sensitive in the scenarios | Check that scenario descriptions and expected behaviors don’t contain PII, internal system names, or confidential business logic that shouldn’t appear in eval artifacts. | Eval plans often get shared across teams — they should be safe to circulate. |
After the checkpoints, add:
- Mandatory reminder: "This eval plan was AI-generated based on your agent description. Before proceeding to test case generation with /eval-generator, review the scenarios, thresholds, and priority order with your team. The plan should reflect your actual business requirements, not just best-practice defaults."
输出文件后、对话结束前,展示🔍 需要人工审核部分。评估计划是基础——这里的错误会级联到错误的测试用例、误导性的结果和浪费的精力。这些检查点标记了客户的领域专业知识必不可少的环节。
🔍 需要人工审核
| # | 检查点 | 验证内容 | 重要性 |
|---|---|---|---|
| 1 | 场景覆盖匹配真实使用情况 | 将场景计划和Agent的实际使用模式对比——分析数据、支持工单、用户反馈。用户最常做的3件事有没有被覆盖? | AI生成的计划偏向教科书场景,你最重要的真实流程可能缺失。 |
| 2 | 分类占比适配你的风险画像 | 默认占比是30-40%核心/20-30%能力/10-20%边缘case与安全/10-20%变体。如果是安全关键Agent(提升安全占比)或处理高风险任务(提升核心占比)请调整。 | 通用的占比可能对你的最高风险领域测试不足。 |
| 3 | 质量信号完整 | 查看列出的质量信号,有没有未列出的业务规则、合规要求或品牌指南对应的信号? | 遗漏质量信号意味着整类失败都不会被测量。 |
| 4 | 阈值匹配你的发布关卡 | 建议的通过率是起点。请决定:多高的通过率能让你有信心发布该Agent?多高的失败率会阻塞发布? | 阈值是业务决策,不是技术决策——只有你知道自己的风险承受能力。 |
| 5 | 优先级顺序匹配你的时间线 | 如果你要在2周内上线,可能没时间做边缘case测试。优先级顺序应该反映必须测试的内容和 nice to have 的内容。 | 彻底测试5个关键场景,比表面测试15个场景更好。 |
| 6 | 场景中没有敏感内容 | 检查场景描述和预期行为不包含PII、内部系统名称或不应该出现在评估产物中的机密业务逻辑。 | 评估计划经常在团队间共享——应该可以安全流转。 |
检查点之后,添加:
- 强制提醒: 「该评估计划是基于你的Agent描述AI生成的。在使用/eval-generator生成测试用例前,请和你的团队一起审核场景、阈值和优先级顺序。计划应该反映你的实际业务需求,而不只是最佳实践默认值。」
Behavior rules
行为规则
- Every scenario name, evaluation method, and threshold must be specific to the described agent — no generic advice.
- Always include at least 1 adversarial/safety scenario (e.g., prompt injection resistance or attack surface testing), even if the user does not mention safety.
- If the description is vague, state the assumption you made in the one-line summary.
- When the agent matches multiple business-problem types (e.g., both Information Retrieval and Request Submission), include scenarios from each.
- 每个场景名称、评估方法和阈值必须和描述的Agent匹配——不要给出通用建议。
- 即使用户没有提到安全,也要至少包含1个对抗/安全场景(例如prompt注入抗性或攻击面测试)。
- 如果描述模糊,请在单行摘要中说明你做出的假设。
- 当Agent匹配多个业务问题类型时(例如同时匹配Information Retrieval和Request Submission),包含每个类型的场景。
Example invocations
调用示例
/eval-suite-planner I am building a customer support agent that handles refund requests. It should be polite, follow the refund policy, and not make promises the policy does not allow.
/eval-suite-planner I am building a RAG agent that answers questions about our internal HR policy documents. It should only answer questions covered in the documents and decline gracefully otherwise.
/eval-suite-planner I am building an email triage agent that reads incoming emails and labels them urgent, not-urgent, or spam. It should never label a real customer email as spam.
/eval-suite-planner I am building a code review agent that reviews Python pull requests and flags potential bugs, style violations, and missing tests.