build-test-suite

Build Test Suite

Guide the user through building a complete test suite (a test set plus test cases with expected behaviors) for evaluating an AI agent using the `coval` CLI. Follow the phases below in order, asking questions at each step.

If `$ARGUMENTS` contains an agent name or use case, use it to skip or pre-fill questions in Phases 1-2.

Phase 0: Setup + Preflight

Step 1: Check authentication

```bash
coval whoami
```

If not authenticated, guide the user:

```bash
coval login
```

This prompts for an API key. Get one at https://app.coval.dev/settings (Organization > Manage > API Keys).

If the user doesn't have a Coval account, direct them to https://coval.dev to sign up.
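
The check-then-login step above can be sketched as a small shell function. This is a minimal sketch that assumes `coval whoami` exits with a non-zero status when no valid API key is configured. That exit-code behavior is an assumption, not something stated above, and `ensure_authenticated` is a hypothetical helper name.

```bash
# Hypothetical helper: check authentication and fall back to login.
# ASSUMPTION: `coval whoami` returns non-zero when not authenticated.
ensure_authenticated() {
  if coval whoami >/dev/null 2>&1; then
    echo "authenticated"
  else
    echo "not authenticated, starting login"
    coval login   # prompts for an API key
  fi
}
```

Calling `ensure_authenticated` before Step 2 keeps the later commands from failing on a missing key.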

Step 2: Inventory existing resources

Run these in parallel:

```bash
coval agents list --format json
coval test-sets list --format json
```

Note existing agents and test sets for reference throughout the flow.
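
Running the two listings in parallel could look like the following. This is a minimal sketch: `inventory_resources` and the output file names are hypothetical, and it relies only on the two commands shown above.

```bash
# Hypothetical helper: run both listings concurrently and save the JSON.
inventory_resources() {
  local out_dir
  out_dir="$(mktemp -d)" || return 1
  # Start both commands in the background so they run in parallel.
  coval agents list --format json > "$out_dir/agents.json" &
  coval test-sets list --format json > "$out_dir/test-sets.json" &
  wait                        # block until both background jobs finish
  printf '%s\n' "$out_dir"    # print the directory holding the results
}
```

A caller might capture the directory with `dir="$(inventory_resources)"` and read `agents.json` and `test-sets.json` from it in later phases.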

Phase 1: Agent Context

Ask: "Which agent are these tests for?"

If agents exist, present a numbered list and let the user pick or say "new"
If `$ARGUMENTS` matches an agent name, select it automatically

Fetch the selected agent's details:

```bash
coval agents get <agent_id> --format json
```

Capture from the response:

`agent_id`
`model_type` (voice, chat, etc.)
`prompt` (system prompt, if available)
`display_name`

If the agent has a system prompt, use it later to generate more specific, domain-relevant test scenarios instead of generic templates.
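
The auto-selection rule earlier in this phase can be sketched as a name match. This assumes a case-insensitive exact match on the agent's display name, which the text does not specify. `match_agent` is a hypothetical helper, and the candidate names would come from the `coval agents list` output.

```bash
# Hypothetical helper: print the agent name that matches the query, if any.
# ASSUMPTION: matching is exact but case-insensitive.
match_agent() {
  local query name lowered
  query="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  shift
  for name in "$@"; do
    lowered="$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
    if [ "$lowered" = "$query" ]; then
      printf '%s\n' "$name"
      return 0
    fi
  done
  return 1   # no match: fall back to asking the user
}
```

For example, `match_agent "$ARGUMENTS" "Billing Support" "Dental Booking Bot"` prints the matching name or returns a non-zero status, in which case the numbered list is presented instead.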

Phase 2: Test Set Type Selection

Load `references/test-set-types.md` and present the available types.

Ask: "What type of test set do you want to create?"

SCENARIO is the default and best for most use cases
Explain when each type is appropriate based on the reference
If the user is unsure, recommend SCENARIO

Note: Test set type is not configurable via the CLI; all test sets default to SCENARIO type. To create other types, use the API: `POST /v1/test-sets` with a `test_set_type` field.

Then ask:

"What would you like to name this test set?" Suggest `"<Agent Name> Evaluation"`.
"Brief description?" Suggest one based on the agent type and use case.

Create the test set:

```bash
coval test-sets create --name "<name>" --description "<desc>" --format json
```

Capture `test_set_id` from the JSON response.

Phase 3: Scenario Design

Load `references/test-case-templates.md` and select the templates matching the agent's vertical/use case.

Present the 3-category pattern:

happy_path — The standard, successful interaction
edge_case — Unusual or challenging situations
compliance — Regulatory, policy, or safety requirements

If the agent has a system prompt, customize the scenarios to be specific to the agent's domain rather than using generic templates. For example, if the agent handles dental appointments, tailor scenarios to dental-specific situations.

Present a summary table before creating:

Test Set: "<name>"

  [happy_path]   <test case name>
                 <scenario description>
  [edge_case]    <test case name>
                 <scenario description>
  [compliance]   <test case name>
                 <scenario description>

Ask: "Create these test cases? (yes / customize / add more)"

yes → proceed to Phase 4
customize → let the user edit scenarios, then re-present
add more → generate additional scenarios, then re-present
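
The summary layout above can be produced with `printf` column padding. This is a minimal sketch: `print_summary` is a hypothetical helper that takes the test set name followed by (category, test case name, scenario description) triples, and the dental examples in the usage note are illustrative only.

```bash
# Hypothetical helper: render the pre-creation summary table.
# Arguments: <test set name> then repeated <category> <name> <description>.
print_summary() {
  local test_set_name="$1"
  shift
  printf 'Test Set: "%s"\n\n' "$test_set_name"
  while [ "$#" -ge 3 ]; do
    # Pad the bracketed category to a fixed width so the names line up.
    printf '  %-14s %s\n' "[$1]" "$2"
    # An empty first column keeps the description aligned under the name.
    printf '  %-14s %s\n' "" "$3"
    shift 3
  done
}
```

A call such as `print_summary "Dental Evaluation" happy_path "Book a cleaning" "Caller books a routine cleaning" edge_case "No slots" "No openings this week"` prints one two-line entry per test case.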

Phase 4: Expected Behaviors

For each test case, help craft an `expected_behaviors` array. These are what the Composite Evaluation metric scores against.

Good expected behaviors are:

Specific — describes a concrete action or output
Observable — can be verified from the conversation transcript
Binary — it either happened or it didn't

Examples of GOOD expected behaviors:

"Agent verifies caller identity before sharing account details"
"Agent provides a confirmation number"
"Agent offers at least two alternative time slots"
"Agent does NOT share information from a different policy"

Examples of BAD expected behaviors (avoid these):

"Agent is helpful" — too vague
"Agent sounds nice" — subjective
"Agent handles the situation well" — not observable

Present each test case with its expected behaviors for confirmation. Let the user add, remove, or edit behaviors.
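
A rough lint for the BAD patterns listed above might look like this. It is a heuristic sketch, not part of the Coval CLI: the word list (`helpful`, `nice`, `well`) is taken only from the three bad examples, and `is_vague_behavior` is a hypothetical helper. It cannot judge whether a behavior is truly observable or binary, so the user's confirmation is still needed.

```bash
# Hypothetical helper: succeed (exit 0) if the behavior contains a
# subjective word from the BAD examples above.
# ASSUMPTION: this short word list is illustrative, not exhaustive.
is_vague_behavior() {
  local lowered
  lowered="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  # Surround with spaces so only whole words match.
  case " $lowered " in
    *" helpful "*|*" nice "*|*" well "*) return 0 ;;
  esac
  return 1
}
```

For instance, `is_vague_behavior "Agent is helpful"` succeeds, while `is_vague_behavior "Agent provides a confirmation number"` does not, so only the first would be flagged for rewording.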

Phase 5: Bulk Creation

Create each test case:

```bash
coval test-cases create \
  --test-set-id <test_set_id> \
  --input '<scenario text>' \
  --expected "Agent greets the customer professionally" \
  --expected "Agent verifies caller identity" \
  --expected "Agent resolves the issue or escalates" \
  --description "<test case name>" \
  --format json
```

Pass each expected behavior as a separate `--expected` flag. This ensures they are stored as individual items in the `expected_behaviors` array, which the Composite Evaluation metric scores individually.
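
When the behaviors for a test case are held in a list, the repeated flags can be assembled with a bash array. This is a minimal sketch: `build_create_args` and the `CREATE_ARGS` variable are hypothetical names, and only the flags shown in the command above are used.

```bash
# Hypothetical helper: fill the global CREATE_ARGS array with one
# --expected flag per behavior.
# Arguments: <test_set_id> <input> <description> <behavior>...
build_create_args() {
  local test_set_id="$1" input="$2" description="$3"
  shift 3
  CREATE_ARGS=(test-cases create --test-set-id "$test_set_id" --input "$input")
  local behavior
  for behavior in "$@"; do
    CREATE_ARGS+=(--expected "$behavior")   # one flag per behavior
  done
  CREATE_ARGS+=(--description "$description" --format json)
}
```

After calling `build_create_args`, the command would be run as `coval "${CREATE_ARGS[@]}"`. Quoting the array expansion keeps each behavior a single argument even when it contains spaces.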

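Shell tip: Use single quotes for `--input` values to avoid shell interpolation issues (e.g., inside double quotes `$45.99` can turn into `5.99` or `.99`, because the shell expands the leading `$4` or `$45` as a positional parameter).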
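
The quoting tip above can be checked directly in a shell. This is a small sketch with two hypothetical functions. The exact mangled text depends on the shell, so only the single-quoted result is stated.

```bash
# Single quotes keep the scenario text literal.
quoted_single() {
  printf '%s\n' 'Caller disputes a $45.99 charge'
}

# Double quotes let the shell expand the positional parameter after "$",
# so the dollar amount is altered before the CLI ever sees it.
quoted_double() {
  printf '%s\n' "Caller disputes a $45.99 charge"
}
```

`quoted_single` prints the text unchanged, including `$45.99`. `quoted_double` prints a different amount, which is why `--input` values are best single-quoted.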

If the CLI does not support multiple `--expected` flags, use the Coval API directly for structured expected behaviors:

```bash
curl -s -X POST https://api.coval.dev/v1/test-cases \
  -H "X-API-Key: $COVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "test_set_id": "<test_set_id>",
    "input_str": "<scenario text>",
    "expected_behaviors": [
      "Agent greets the customer professionally",
      "Agent verifies caller identity",
      "Agent resolves the issue or escalates"
    ],
    "description": "<test case name>"
  }'
```

Present progress as each test case is created. Capture `test_case_id` from each response.

Phase 6: Coverage Summary + Next Steps

Present what was created:

Test Suite Complete!

  Test Set:     <name> (<test_set_id>)
  Test Cases:   <N> total
    [happy_path]   <count>
    [edge_case]    <count>
    [compliance]   <count>

Coverage Analysis

Review the test cases and suggest areas that might need more coverage:

Are there common failure modes not covered?
Are there regulatory requirements specific to the vertical?
Would the agent benefit from multi-turn conversation tests?
Are there language/accent scenarios worth testing (for voice agents)?

Suggested Next Steps

Design a test persona: `/design-persona`
Configure evaluation metrics: `/configure-metrics`
Launch a quick evaluation: `/quick-eval`

Add more test cases later:

```bash
coval test-cases create --test-set-id <test_set_id> --input '...' --expected "..." --description "..."
```
