build-test-suite

Build Test Suite

Guide the user through building a complete test suite (a test set plus test cases with expected behaviors) for evaluating an AI agent using the coval CLI. Follow the phases below in order, asking questions at each step.
If $ARGUMENTS contains an agent name or use case, use it to skip or pre-fill questions in Phases 1-2.

Phase 0: Setup + Preflight

Step 1: Check authentication

bash
coval whoami
If not authenticated, guide the user:
bash
coval login
This prompts for an API key. Get one at https://app.coval.dev/settings (Organization > Manage > API Keys).
If the user doesn't have a Coval account, direct them to https://coval.dev to sign up.
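The authentication check above could be wrapped in a small helper. This is only a sketch: it assumes `coval whoami` exits with a non-zero status when no valid API key is configured, which this guide does not confirm, and the function name `ensure_coval_auth` is hypothetical.

```shell
# Sketch only. Assumption: `coval whoami` returns a non-zero exit status
# when the CLI is not authenticated. `ensure_coval_auth` is a made-up name.
ensure_coval_auth() {
  if coval whoami >/dev/null 2>&1; then
    echo "authenticated"
  else
    # Point the user at the login flow described in this step
    echo "not authenticated: run 'coval login' (API key from https://app.coval.dev/settings)"
    return 1
  fi
}
```

Calling `ensure_coval_auth || exit 1` at the top of a script would stop early instead of failing later in the flow.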

Step 2: Inventory existing resources

Run these in parallel:
bash
coval agents list --format json
coval test-sets list --format json
Note existing agents and test sets for reference throughout the flow.
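One way to run the two list commands in parallel from a script is to background each one and wait on both. The sketch below uses only the commands shown above; the output directory and file names are hypothetical.

```shell
# Sketch: run both inventory commands concurrently and wait for both.
# The file names agents.json and test-sets.json are illustrative only.
inventory_resources() {
  out_dir="${1:-.}"
  coval agents list --format json > "$out_dir/agents.json" &
  agents_pid=$!
  coval test-sets list --format json > "$out_dir/test-sets.json" &
  sets_pid=$!
  # Wait on each PID so a failure in either command is reported
  status=0
  wait "$agents_pid" || status=1
  wait "$sets_pid" || status=1
  return "$status"
}
```

Waiting on each PID separately keeps the exit status of both commands, which a bare `wait` would discard.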

Phase 1: Agent Context

Ask: "Which agent are these tests for?"
  • If agents exist, present a numbered list and let the user pick or say "new"
  • If $ARGUMENTS matches an agent name, select it automatically
Fetch the selected agent's details:
bash
coval agents get <agent_id> --format json
Capture from the response:
  • agent_id
  • model_type (voice, chat, etc.)
  • prompt (system prompt, if available)
  • display_name
If the agent has a system prompt, use it later to generate more specific, domain-relevant test scenarios instead of generic templates.
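The automatic selection rule in this phase (pick the agent when $ARGUMENTS matches its name) might be sketched as a case-insensitive substring match. The input format, one `agent_id<TAB>display_name` pair per line, is an assumption made for illustration; how those pairs are extracted from the CLI's JSON output is left out, and `match_agent` is a hypothetical name.

```shell
# Sketch: print the id of the first agent whose display name appears in the
# arguments string. Reads "agent_id<TAB>display_name" lines from stdin
# (a hypothetical format, not something the coval CLI is known to emit).
match_agent() {
  args_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  while IFS="$(printf '\t')" read -r id name; do
    name_lc=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    # Skip rows with no name, which would otherwise match everything
    [ -n "$name_lc" ] || continue
    case "$args_lc" in
      *"$name_lc"*) printf '%s\n' "$id"; return 0 ;;
    esac
  done
  return 1
}
```

If the function returns non-zero, fall back to presenting the numbered list.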

Phase 2: Test Set Type Selection

Load references/test-set-types.md and present the available types.
Ask: "What type of test set do you want to create?"
  • SCENARIO is the default and best for most use cases
  • Explain when each type is appropriate based on the reference
  • If the user is unsure, recommend SCENARIO
Note: Test set type is not configurable via the CLI; all test sets default to SCENARIO type. To create other types, use the API: POST /v1/test-sets with a test_set_type field.
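For a non-default type, the API request might look like the sketch below. Only the POST /v1/test-sets path and the test_set_type field come from this guide. The host and X-API-Key header are borrowed from the test-case request in Phase 5, and the name and description body fields are assumptions that mirror the CLI flags. The helper names are hypothetical.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the request body. "name" and "description" are assumed field names.
test_set_payload() {
  printf '{"name": "%s", "description": "%s", "test_set_type": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# Send the request. Host and header are assumed to match the /v1/test-cases call.
create_test_set_via_api() {
  curl -s -X POST https://api.coval.dev/v1/test-sets \
    -H "X-API-Key: $COVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(test_set_payload "$1" "$2" "$3")"
}
```

Valid values for test_set_type other than SCENARIO should be taken from references/test-set-types.md.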
Then ask:
  1. "What would you like to name this test set?" Suggest: "<Agent Name> Evaluation"
  2. "Brief description?" Suggest one based on agent type and use case
Create the test set:
bash
coval test-sets create --name "<name>" --description "<desc>" --format json
Capture test_set_id from the JSON response.

Phase 3: Scenario Design

Load references/test-case-templates.md and select the templates matching the agent's vertical/use case.
Present the 3-category pattern:
  • happy_path — The standard, successful interaction
  • edge_case — Unusual or challenging situations
  • compliance — Regulatory, policy, or safety requirements
If the agent has a system prompt, customize the scenarios to be specific to the agent's domain rather than using generic templates. For example, if the agent handles dental appointments, tailor scenarios to dental-specific situations.
Present a summary table before creating:
Test Set: "<name>"

  [happy_path]   <test case name>
                 <scenario description>
  [edge_case]    <test case name>
                 <scenario description>
  [compliance]   <test case name>
                 <scenario description>
Ask: "Create these test cases? (yes / customize / add more)"
  • yes → proceed to Phase 4
  • customize → let the user edit scenarios, then re-present
  • add more → generate additional scenarios, then re-present
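The summary layout shown in this phase can be produced with a short formatter. The sketch below assumes the planned test cases are available as tab-separated `category<TAB>name<TAB>scenario` lines, a format chosen here for illustration; `print_suite_summary` is a hypothetical name.

```shell
# Sketch: print the pre-creation summary in the layout used by this guide.
# Reads "category<TAB>name<TAB>scenario" lines from stdin (hypothetical format).
print_suite_summary() {
  printf 'Test Set: "%s"\n\n' "$1"
  while IFS="$(printf '\t')" read -r category name scenario; do
    # The label column is 15 characters wide so names and scenarios line up
    printf '  %-15s%s\n' "[$category]" "$name"
    printf '  %-15s%s\n' "" "$scenario"
  done
}
```

For example, `printf 'happy_path\tBook visit\tCaller books a cleaning\n' | print_suite_summary "Demo"` prints one entry under the "Demo" header.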

Phase 4: Expected Behaviors

For each test case, help craft an expected_behaviors array. These are what the Composite Evaluation metric scores against.
Good expected behaviors are:
  • Specific — describes a concrete action or output
  • Observable — can be verified from the conversation transcript
  • Binary — it either happened or it didn't
Examples of GOOD expected behaviors:
  • "Agent verifies caller identity before sharing account details"
  • "Agent provides a confirmation number"
  • "Agent offers at least two alternative time slots"
  • "Agent does NOT share information from a different policy"
Examples of BAD expected behaviors (avoid these):
  • "Agent is helpful" — too vague
  • "Agent sounds nice" — subjective
  • "Agent handles the situation well" — not observable
Present each test case with its expected behaviors for confirmation. Let the user add, remove, or edit behaviors.
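As a rough aid during confirmation, a script could flag behaviors that contain the vague terms from the BAD examples above. This is a heuristic sketch, not part of the coval CLI: the word list (helpful, nice, well) is taken only from those examples, and a match is a prompt to rephrase, not proof that the behavior is unusable.

```shell
# Heuristic sketch: flag an expected behavior that contains one of the vague
# words from the BAD examples. The word list is illustrative, not exhaustive,
# and words followed by punctuation are not detected.
flag_vague_behavior() {
  behavior_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  for word in helpful nice well; do
    case " $behavior_lc " in
      *" $word "*) printf 'vague term "%s": %s\n' "$word" "$1"; return 0 ;;
    esac
  done
  return 1
}
```

A behavior such as "Agent provides a confirmation number" passes unflagged, while "Agent is helpful" is reported.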

Phase 5: Bulk Creation

Create each test case:
bash
coval test-cases create \
  --test-set-id <test_set_id> \
  --input '<scenario text>' \
  --expected "Agent greets the customer professionally" \
  --expected "Agent verifies caller identity" \
  --expected "Agent resolves the issue or escalates" \
  --description "<test case name>" \
  --format json
Pass each expected behavior as a separate --expected flag. This ensures they are stored as individual items in the expected_behaviors array, which the Composite Evaluation metric scores individually.
Shell tip: Use single quotes for --input values to avoid shell interpolation issues (e.g., $45.99 becoming .99).
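The quoting difference is easy to see in isolation. In the sketch below, the single-quoted string keeps its dollar sign. The double-quoted one is expanded by the shell: in bash and POSIX sh, `$4` is read as a positional parameter, and the exact mangled result can vary by shell. The function names are hypothetical.

```shell
# Single quotes: the text reaches the command exactly as written.
single_quoted_price() { printf '%s\n' 'Refund of $45.99'; }

# Double quotes: the shell expands the dollar sequence first, so the price
# is altered before the command ever sees it.
double_quoted_price() { printf '%s\n' "Refund of $45.99"; }
```

If a scenario must contain a single quote itself, close the quoted string, add `\'`, and reopen it (for example `'caller'\''s refund'`).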
If the CLI does not support multiple --expected flags, use the Coval API directly for structured expected behaviors:
bash
curl -s -X POST https://api.coval.dev/v1/test-cases \
  -H "X-API-Key: $COVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_set_id": "<test_set_id>",
    "input_str": "<scenario text>",
    "expected_behaviors": [
      "Agent greets the customer professionally",
      "Agent verifies caller identity",
      "Agent resolves the issue or escalates"
    ],
    "description": "<test case name>"
  }'
Present progress as each test case is created. Capture test_case_id from each response.
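When creating several test cases from a script, a wrapper can turn a list of behaviors into repeated --expected flags. The sketch below uses only the flags shown in this phase; the wrapper name `create_test_case` and its argument order are hypothetical.

```shell
# Sketch: create one test case, passing each behavior as its own --expected flag.
# Usage: create_test_case <test_set_id> <input> <description> <behavior>...
create_test_case() {
  test_set_id=$1
  input=$2
  description=$3
  shift 3
  # Rewrite the remaining arguments in place: each behavior is appended as
  # "--expected <behavior>" and the original entry is dropped from the front.
  for behavior in "$@"; do
    set -- "$@" --expected "$behavior"
    shift
  done
  coval test-cases create \
    --test-set-id "$test_set_id" \
    --input "$input" \
    "$@" \
    --description "$description" \
    --format json
}
```

Example call: `create_test_case "$test_set_id" 'Caller disputes a $45.99 charge' 'Billing dispute' 'Agent verifies caller identity' 'Agent provides a confirmation number'`.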

Phase 6: Coverage Summary + Next Steps

Present what was created:
Test Suite Complete!

  Test Set:     <name> (<test_set_id>)
  Test Cases:   <N> total
    [happy_path]   <count>
    [edge_case]    <count>
    [compliance]   <count>
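The per-category counts in this summary can be tallied from the categories recorded during creation. The sketch below assumes one category name per line on stdin, a bookkeeping format chosen here for illustration; `print_coverage_counts` is a hypothetical name.

```shell
# Sketch: tally the three categories and print them in the summary layout.
# Reads one category name per line from stdin; other values are ignored.
print_coverage_counts() {
  happy=0
  edge=0
  compliance=0
  while read -r category; do
    case "$category" in
      happy_path) happy=$((happy + 1)) ;;
      edge_case) edge=$((edge + 1)) ;;
      compliance) compliance=$((compliance + 1)) ;;
    esac
  done
  printf '  Test Cases:   %d total\n' "$((happy + edge + compliance))"
  printf '    %-15s%d\n' '[happy_path]' "$happy"
  printf '    %-15s%d\n' '[edge_case]' "$edge"
  printf '    %-15s%d\n' '[compliance]' "$compliance"
}
```

A category with a count of zero is a direct cue for the coverage analysis that follows.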

Coverage Analysis

Review the test cases and suggest areas that might need more coverage:
  • Are there common failure modes not covered?
  • Are there regulatory requirements specific to the vertical?
  • Would the agent benefit from multi-turn conversation tests?
  • Are there language/accent scenarios worth testing (for voice agents)?

Suggested Next Steps

  • Design a test persona: /design-persona
  • Configure evaluation metrics: /configure-metrics
  • Launch a quick evaluation: /quick-eval
  • Add more test cases later:
    bash
    coval test-cases create --test-set-id <test_set_id> --input "..." --expected "..." --description "..."