cekura-eval-design

Cekura Eval Design


Purpose


Guide the creation of effective Cekura evaluators (test scenarios) that thoroughly exercise AI voice agent capabilities. Evaluators simulate callers to test the main agent — they are NOT metrics (which evaluate transcripts after the fact).


Performing Platform Actions


When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.


Core Terminology


Main agent: The client's AI voice agent being tested
Testing agent: Cekura's simulated caller that exercises the main agent
Evaluator/Scenario: A test case defining what the simulated caller does and what success looks like
Metric: A post-call evaluation that scores a transcript (separate concept — see cekura-metrics plugin)
Personality: Voice, language, accent, and behavioral traits for the simulated caller
Test Profile: Identity and context data passed to testing agent AND main agent (for chat/websocket runs)
Conditional Action: Structured, deterministic testing agent behavior with adaptive fallback


The Eval Design Workflow


1. Understand the agent: Read the agent description (GET the agent record) to identify all workflows, decision points, and edge cases.
2. Choose a tool strategy: Ask the user which approach they want for handling the agent's external tool calls. This is a fundamental decision that shapes everything else. See the "Tool Strategy" section below.
3. Always create a folder first: Before generating or creating scenarios, create a folder to organize them. Never dump scenarios into the root. POST to the scenarios folder endpoint with `name`, `project_id`, and optionally `parent_path`. Then pass the `folder_path` to the generate endpoint or set it on individual scenarios.
4. Run the pre-creation checkpoint: Confirm all key decisions with the user before building anything. See "Pre-Creation Checkpoint" below.
5. Author evaluators: pick the path based on the mode (per "Choosing Authoring Mode" below):
   - Behavioral mode (default): start with auto-generate via `POST /test_framework/v1/scenarios/generate-bg/`. Provide category-level guidance in `extra_instructions`. If using Cekura mock tools, the generator creates tool-aware scenarios automatically. See the "Auto-Generation" section below.
   - Conditional-actions mode: auto-gen can produce either behavioral or conditional-action scenarios, so check the `scenario_type` of generated output and proceed accordingly. When you need full structural control (verbatim phrasing, exact-sequence regression, IVR/voicemail/DTMF flows), author each scenario directly via `POST /test_framework/v1/scenarios/` with `scenario_type: "conditional_actions"` and the `conditional_actions` payload. See "Designing Conditional Actions" below.
6. Review and fix generation artifacts (only if you ran auto-gen in step 5): Check the `scenario_type` of each generated scenario and inspect the corresponding payload (`instructions` for behavioral, `conditional_actions` for conditional-action). PATCH `scenario_language` for non-English scenarios (defaults to "en" regardless of content). PATCH `first_message` if auto-gen added greetings instead of exact questions. Check for partial completion (generation may produce fewer than requested).
7. Supplement manually: Add edge cases, red-team scenarios, and deterministic tests that the generator didn't cover, or author additional scenarios directly when you need full structural control.
8. Set up test infrastructure: Check existing test profiles first, then create new ones. Configure tool data according to the chosen tool strategy.
9. Attach metrics: ALWAYS include baseline metrics (Expected Outcome, Infrastructure Issues, Tool Call Success, Latency) on every evaluator. Without metrics, runs only report call completion, not correctness.
10. Run and validate: Execute via `run_scenarios`, review transcripts, and iterate.


Tool Strategy — Three Approaches


Ask the user early: "Does your agent call external tools during calls? If so, how do you want to handle tool data for testing?"

Approach	When to use	Your job
A. Client-side mock data	Client has staging API/test DB	Align test profiles with their mock data
B. Cekura mock tools	No staging, want predictable isolated tests	Set up mock mappings + match test profiles to outputs
C. No mock data	Conversational-only agents, testing tone/soft skills	Use test profiles for identity only

Critical rule for Approach B: derive test profile values FROM mock outputs (same format, same values). Creating them independently guarantees mismatches.

See `references/tool-strategies.md` for full workflow, key questions to ask, and validation guidance for each approach.
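The critical rule for Approach B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`patient_name`, `date_of_birth`, `appointment_id`) and the shape of the mock output are hypothetical, not part of the Cekura API. The idea is to copy profile values verbatim from the mock output and then check for drift.

```python
from typing import Any, Dict, List


def profile_from_mock_output(mock_output: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Copy the selected fields verbatim (same format, same values) from a mock tool output."""
    missing = [f for f in fields if f not in mock_output]
    if missing:
        raise KeyError(f"mock output lacks fields: {missing}")
    return {f: mock_output[f] for f in fields}


def find_mismatches(profile: Dict[str, Any], mock_output: Dict[str, Any]) -> List[str]:
    """Return the profile keys whose values differ from the mock output."""
    return sorted(k for k, v in profile.items() if k in mock_output and mock_output[k] != v)


# Hypothetical mock tool output; the field names are assumptions for illustration.
mock_output = {
    "patient_name": "Sarah Lee",
    "date_of_birth": "1990-04-12",
    "appointment_id": "A-1001",
}
profile = profile_from_mock_output(mock_output, ["patient_name", "date_of_birth"])
```

A profile written by hand with `"04/12/1990"` as the date of birth would be flagged by `find_mismatches`, which is the kind of format drift the rule is meant to prevent.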


Choosing Authoring Mode


The default authoring mode is behavioral instructions (free-form, first-person scenario instructions). Switch to conditional actions in two situations:


Switch immediately, no confirmation, when the user says any of:


"conditional actions", "structured scenario", "scripted scenario", "scripted test", "deterministic test", "unit test", "regression test", "exact flow", "fixed sequence", "compliance test". The user has stated their authoring intent — proceed straight to designing conditional actions (see "Designing Conditional Actions" below).


Ask first when the user mentions a tag-supported feature without specifying a mode:


"voicemail", "voicemail test", "IVR menu", "IVR navigation", "DTMF entry", "DTMF input", "hold music", "interruption test", "network simulation", "packet loss", "background noise". Conditional actions support these via dedicated XML tags (

`<voicemail>`, `<dtmf>`, etc.) and produce higher-fidelity tests, but a behavioral instruction may also be acceptable. Ask one short question:

"This involves [voicemail / IVR / DTMF / etc.]. Conditional actions support
`<voicemail>` / `<dtmf>` / `<...>`
tags directly for high-fidelity testing — should I author this as a conditional-actions evaluator (structured turn-by-turn with the right tags), or behavioral instructions (free-form, looser)?"


Stay in behavioral mode for:


Open-ended persona dialogue, exploratory red-team without specific attack scripts, soft-skill / tone / empathy testing, general edge-case quality probing where the conversation path isn't predictable. The "Writing Instructions" section below is the primary guide for this mode.


Concrete examples (which mode for which scenario)


Scenario the user describes	Default mode	Why
Appointment scheduling happy path	Behavioral	Path is predictable but doesn't need exact phrasing; behavioral lets the testing agent improvise naturally.
Appointment scheduling — exact-sequence regression test	Conditional actions	"Regression test" is a direct trigger phrase.
Compliance disclosure / account-number readback	Conditional actions	Verbatim phrasing required ( `fixed_message: true` + `<spell>` ); "compliance" is a direct trigger phrase.
Identity verification with name + DOB + last 4 SSN	Conditional actions	Each turn's action is data-bound (read from test profile); structure prevents drift.
Inbound IVR menu navigation	Ask first	Mentions IVR — could be conditional (high-fidelity, `<dtmf>` ) or behavioral (looser); confirm with user.
Voicemail handling test	Ask first	Mentions voicemail — `<voicemail>` tag is purpose-built but behavioral can work.
Angry caller / de-escalation	Behavioral	Tone-driven, exploratory; no fixed sequence.
Red-team prompt injection (a single attack pattern)	Conditional actions	Specific scripted attack; one evaluator per expected outcome.
Red-team free-form probing	Behavioral	Path not predictable; the agent improvises attacks.
Multi-language tone testing	Behavioral	Soft-skill evaluation; `scenario_language` set on either mode.
Multi-language compliance verification	Conditional actions	Verbatim disclosures + language-specific phrasing.
Network degradation under packet loss	Ask first	Mentions network simulation — `<network_simulation>` tag is purpose-built.
Tool failure recovery flow (specific failure + recovery path)	Conditional actions	Specific failure trigger + specific recovery step.
General "test my agent's quality"	Behavioral	No structural commitment specified.
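The mode-selection rules above can be approximated as a small classifier. This is a rough sketch, not how the skill itself decides: it assumes simple case-insensitive substring matching against the quoted trigger phrases, and the function name and return labels are made up for illustration. Direct triggers win over ask-first triggers.

```python
# Phrases copied from the two trigger lists above.
DIRECT_TRIGGERS = (
    "conditional actions", "structured scenario", "scripted scenario",
    "scripted test", "deterministic test", "unit test", "regression test",
    "exact flow", "fixed sequence", "compliance test",
)
ASK_FIRST_TRIGGERS = (
    "voicemail", "ivr menu", "ivr navigation", "dtmf entry", "dtmf input",
    "hold music", "interruption test", "network simulation", "packet loss",
    "background noise",
)


def choose_authoring_mode(request: str) -> str:
    """Return 'conditional_actions', 'ask_first', or 'behavioral' for a user request."""
    text = request.lower()
    if any(phrase in text for phrase in DIRECT_TRIGGERS):
        return "conditional_actions"  # switch immediately, no confirmation
    if any(phrase in text for phrase in ASK_FIRST_TRIGGERS):
        return "ask_first"  # tag-supported feature, confirm the mode with the user
    return "behavioral"  # default mode
```

Substring matching is deliberately naive here. Rows in the table that depend on judgment (for example, data-bound identity verification) would still need a human or model decision.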


Test Profiles — Always Use Them


Test profiles are the backbone of reliable evals. They serve three critical purposes:

Memory persistence — The testing agent reliably uses profile data during calls. Data in instructions often leads to hallucinations.
Dynamic variables — For outbound and websocket runs, test profile fields are sent to the main agent as caller context, mimicking what production systems provide. This lets you test the full end-to-end flow.
Single source of truth — No risk of name in test profile saying "Sarah" while instructions say "John", which causes the testing agent to hallucinate.

Always use test profiles. Never hardcode identity data (names, DOBs, account IDs, addresses, phone numbers, service addresses, discrepancy amounts — anything persona-related) in scenario instructions. Instead, create a test profile with the data and let the instructions reference it generically (e.g., "State your name when asked").

Building test profiles from real data: The best approach is to pull call history from observability and/or past eval runs and use data that is known to work:

Fetch recent call `transcript_json` records from the API
Analyze tool call inputs and outputs from real calls
Build a memory document mapping existing data (names, account IDs, appointment IDs, etc.)
Create test profiles using this verified data

This ensures test profiles work against production tools.

Always check for existing test profiles first. Clients often pre-build profiles that are tested against their mock backend — reuse these rather than creating from scratch.

Template variables in instructions: Use `{{test_profile.field_name}}` or `{{test_profile['key']}}` for dynamic injection. For nested data: `{{test_profile.address.city}}`. Note: in voice scenarios, the simulated caller reads from the instruction text directly; the profile data is there for the caller to reference, not injected as hidden context.

See `references/test-profiles.md` for full details and the data-extraction workflow.
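To preview how the template variables above might expand, a small local resolver can help. This is a sketch only: how Cekura actually renders templates is not described here, so the regex-based `render` function and the sample profile are assumptions. It handles the dotted and bracketed forms shown above, nothing more.

```python
import re
from typing import Any, Dict

# Matches {{test_profile.a.b}} and {{test_profile['key']}} (and mixes of the two).
_PLACEHOLDER = re.compile(r"\{\{\s*test_profile((?:\.\w+|\['[^']+'\])+)\s*\}\}")
_STEP = re.compile(r"\.(\w+)|\['([^']+)'\]")


def render(instructions: str, test_profile: Dict[str, Any]) -> str:
    """Replace test_profile placeholders with values from a local profile dict."""

    def resolve(match: "re.Match[str]") -> str:
        value: Any = test_profile
        # Walk the path one step at a time; a missing key raises KeyError.
        for attr, key in _STEP.findall(match.group(1)):
            value = value[attr or key]
        return str(value)

    return _PLACEHOLDER.sub(resolve, instructions)


# Hypothetical profile used only for this preview.
sample_profile = {
    "first_name": "Sarah",
    "member id": "M-42",
    "address": {"city": "Austin"},
}
```

For instance, `render("I live in {{test_profile.address.city}}.", sample_profile)` walks `address` then `city`. A missing field raises `KeyError`, which surfaces profile gaps before a run rather than during one.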


Writing Instructions


Instructions tell the testing agent what to do. Write in first person from the testing agent's perspective.


Instruction Style


First person: "State your name when asked" NOT "The caller should state their name"
Behavioral, not scripted: "Report fever and cough, request same provider" NOT "Say exactly: I have a fever"
Reference test profile data: "Provide your date of birth when asked for verification" (the actual DOB comes from the test profile)


Good Instructions Pattern


Wrap instructions in `<scenario>` tags with a step-by-step format:

<scenario>
SCENARIO: [Brief scenario name]

YOUR BEHAVIOR:
1. State your intent to [action]
2. Confirm you are the patient when asked
3. Say and spell your first name when asked for verification
4. Provide your date of birth when asked
5. If the agent says no slots are available, say you are flexible with timing

KEY INTERACTION POINTS:
[Specific workflow nodes or edge cases to exercise]
</scenario>

Be explicit about exact phrases when mock/backend behavior depends on them (e.g., `say "follow-up appointment" exactly` if the mock's reason-for-visit matching requires it).


Common Instruction Mistakes


Filler steps that add nothing — NEVER write steps like "Listen to the agent's response", "Wait for the agent to speak", "End the call politely", or "Respond accordingly". The testing agent already does these things automatically. Every step must describe a specific action the caller takes — information they provide, a decision they make, or a behavior they exhibit. If a step doesn't tell the caller to DO something specific, delete it.
Hardcoding profile data in instructions — Names, DOBs, addresses, account numbers belong in test profiles, not instructions. When data is in both places and they differ, the testing agent hallucinates. This is the single most common mistake across clients.
Using instructions for voice characteristics — Instructions like "speak in a mumbling voice" or "be interruptive" don't change the testing agent's vocal style. Use personalities for that — they control actual voice model parameters (accent, interruption level, background noise, speed).
Including examples of what the main agent "may say": Don't write `When the agent says "How can I help you", respond with...`. Instead, reference action points by topic: `When asked about what you need help with, explain that you need help with your billing address.` The former is brittle; the latter works regardless of exact agent phrasing.
Not providing enough context for multi-step flows — If a scenario involves a complex process (scheduling, onboarding), the testing agent needs step-by-step context to avoid hallucinating after the first few steps. For structured flows, use conditional actions instead.
Vague or generic instructions — "Call to schedule an appointment" is useless. Be specific: what type of appointment, what constraints, what complications should arise. The more specific the scenario, the more useful the test.
Third-person perspective instead of first person
Too scripted (exact dialogue) instead of behavioral goals
Missing edge case triggers
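The filler-step mistake is easy to check mechanically. The following is a minimal sketch of a local lint, assuming the four filler phrasings quoted above are the ones to catch; the function name and pattern list are illustrative and not part of any Cekura tooling.

```python
import re
from typing import List

# Phrasings named above as filler; extend this list for your own suites.
FILLER_PATTERNS = (
    r"listen to the agent",
    r"wait for the agent",
    r"end the call politely",
    r"respond accordingly",
)


def find_filler_steps(instructions: str) -> List[str]:
    """Return the instruction lines that match a known filler phrasing."""
    flagged = []
    for line in instructions.splitlines():
        step = line.strip()
        if any(re.search(pattern, step, re.IGNORECASE) for pattern in FILLER_PATTERNS):
            flagged.append(step)
    return flagged


SAMPLE = """<scenario>
1. When asked to confirm your identity, say this is the wrong number.
2. Listen to the agent's response.
3. End the call politely.
</scenario>"""
```

Running `find_filler_steps(SAMPLE)` flags steps 2 and 3 and leaves step 1 alone. A pattern list like this only catches known phrasings, so it complements a manual review rather than replacing it.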


Bad vs Good Instructions


BAD (filler, vague, passive):

<scenario>
1. When the agent asks to confirm your identity and whether you are the intended person, clearly state: "No, you have the wrong number."
2. Listen to the agent's response.
3. End the call politely.
</scenario>

GOOD (every step is a specific caller action):

<scenario>
SCENARIO: Wrong number — caller is not the intended recipient

YOUR BEHAVIOR:
1. When the agent asks for your name or tries to verify your identity, say this is the wrong number and you don't know the person they're looking for
2. If the agent asks for any additional information, decline — you have no connection to the intended person
3. If the agent apologizes and offers to remove your number, confirm that's fine
</scenario>

BAD (generic, no specifics):

<scenario>
1. Call to schedule an appointment.
2. Provide your information when asked.
3. Confirm the appointment.
</scenario>

GOOD (specific scenario with constraints):

<scenario>
SCENARIO: New adult patient scheduling with insurance

YOUR BEHAVIOR:
1. State you're a new patient and need to schedule a first visit with a primary care provider
2. When asked about insurance, say you have Blue Cross PPO
3. Provide your date of birth and spell your full name when asked for verification
4. Request a morning appointment if given timing options
5. If no morning slots are available, accept the earliest available afternoon slot
6. Confirm the appointment details when the agent reads them back

KEY INTERACTION POINTS:
- New patient registration flow
- Insurance verification
- Appointment slot selection with preference constraints
</scenario>


Auto-Generation


The `POST /test_framework/v1/scenarios/generate-bg/` endpoint is the preferred workflow for bulk scenario creation. Generated scenarios may come back as either behavioral (`scenario_type: "instruction"`) or conditional-action (`scenario_type: "conditional_actions"`), so check what was created and proceed accordingly. When you need full structural control (verbatim phrasing, exact-sequence regression, IVR/voicemail/DTMF flows), author conditional-action evaluators directly via the create endpoint; see "Designing Conditional Actions" below.

Full schema:

Field	Type	Required	Description
`agent_id`	integer	Yes	Agent to generate scenarios for
`num_scenarios`	integer	Yes	How many to generate
`extra_instructions`	string	No	Category-level guidance (e.g., "focus on cancellation edge cases")
`personalities`	array[integer]	No	Personality IDs to use
`generate_expected_outcomes`	boolean	No	Auto-generate expected outcomes
`folder_path`	string	No	Folder to place generated scenarios in (always set this — create the folder first)
`tags`	array[string]	No	Tags to apply to all generated scenarios
`tool_ids`	array[string]	No	Tools to enable (e.g., `TOOL_END_CALL` )

Returns: `{"progress_id": "<uuid>"}`. Poll with `GET /test_framework/v1/scenarios/generate-progress/?progress_id=<id>`.

Response has: `total_scenarios`, `completed_scenarios`, `failed_scenarios`, `scenarios_list`.
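The request body and polling loop can be sketched as follows. Several things here are assumptions rather than documented behavior: that `completed_scenarios` and `failed_scenarios` are integer counts, the helper names, and the example folder path. The HTTP call itself is left to an injected `fetch_progress` callable, since the base URL and authentication scheme are not covered in this section.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def build_generate_payload(
    agent_id: int,
    num_scenarios: int,
    folder_path: str,
    extra_instructions: Optional[str] = None,
    personalities: Optional[List[int]] = None,
    tool_ids: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /test_framework/v1/scenarios/generate-bg/."""
    payload: Dict[str, Any] = {
        "agent_id": agent_id,
        "num_scenarios": num_scenarios,
        "folder_path": folder_path,  # optional in the schema, but always set it
    }
    if extra_instructions:
        payload["extra_instructions"] = extra_instructions
    if personalities:
        payload["personalities"] = personalities
    if tool_ids:
        payload["tool_ids"] = tool_ids
    return payload


def is_finished(progress: Dict[str, Any]) -> bool:
    # Assumption: completed_scenarios and failed_scenarios are integer counts.
    done = progress["completed_scenarios"] + progress["failed_scenarios"]
    return done >= progress["total_scenarios"]


def poll_until_finished(
    fetch_progress: Callable[[], Dict[str, Any]],
    max_polls: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_progress until generation finishes or max_polls is reached."""
    progress = fetch_progress()
    for _ in range(max_polls - 1):
        if is_finished(progress):
            break
        sleep(delay_seconds)
        progress = fetch_progress()
    return progress
```

The `max_polls` cap matters because generation can stall partway through; the caller gets the last progress snapshot back and can decide whether to request the remainder in a smaller batch.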


Generation Gotchas


Generation can partially complete: It may produce fewer scenarios than requested (e.g., 15/18) with the remainder stuck. After a reasonable timeout, generate the remainder in a smaller batch with more specific `extra_instructions`.
`scenario_language` defaults to "en": Auto-gen sets all scenarios to English even when `extra_instructions` specify non-English languages. PATCH each scenario with the correct language code (`ru`, `hi`, `es`, `zh`, `ko`, `pt`, `de`, etc.) after generation. This is required for correct TTS voice/pronunciation.
Auto-gen may add greetings to `first_message`: When `extra_instructions` specify exact verbatim questions, some scenarios get a greeting (e.g., "Здравствуйте") as the `first_message` while the actual question is in instructions as a follow-up. PATCH `first_message` after generation.
Language-specific personalities may not be enabled per-project: Non-English personalities may return "Personality is not enabled" errors. Workaround: use personality 693 (Normal Male English) and rely on `scenario_language` to drive TTS and pronunciation. See "Checking Available Personalities" under the Personality section.
Mock tool awareness: When mock tools are enabled on an agent, the generate endpoint creates tool-aware scenarios automatically.
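The first two gotchas lend themselves to a post-generation check. The sketch below assumes each entry in the generated list is a dict with `id` and `scenario_language` keys; the `id` key in particular is an assumption, as the response shape is not spelled out above. It only computes what to do and does not issue the PATCH requests.

```python
from typing import Any, Dict, List, Tuple


def remaining_to_generate(requested: int, scenarios: List[Dict[str, Any]]) -> int:
    """How many scenarios are still missing after a partial generation."""
    return max(requested - len(scenarios), 0)


def language_patches(
    scenarios: List[Dict[str, Any]], expected_language: str
) -> List[Tuple[Any, Dict[str, str]]]:
    """Return (scenario_id, patch_body) pairs for scenarios with the wrong language code."""
    return [
        (scenario["id"], {"scenario_language": expected_language})
        for scenario in scenarios
        if scenario.get("scenario_language") != expected_language
    ]


# Hypothetical generation result: Russian content, but two scenarios default to "en".
generated = [
    {"id": 101, "scenario_language": "en"},
    {"id": 102, "scenario_language": "ru"},
    {"id": 103, "scenario_language": "en"},
]
```

With 18 scenarios requested and 3 returned, `remaining_to_generate(18, generated)` gives the size of the follow-up batch, and `language_patches(generated, "ru")` lists the PATCH bodies to send.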


Personality — Required, Controls Voice Characteristics


`personality` is required on every scenario; the API returns 400 if it is missing. Use personalities (not instructions) to control the testing agent's vocal style. Personalities manage:

Language and accent
Voice model and provider (ElevenLabs, Cartesia)
Interruption level (how often the caller interrupts)
Background noise (office, street, etc.)
Speech speed and patterns

Wrong: putting `"speak in a mumbling voice and interrupt frequently"` in the instructions. Right: select or create a personality with the desired interruption level and voice characteristics.

Instructions cannot alter actual speaking style — they only affect what the testing agent says, not how it sounds.


Picking the Right Personality


For conditional-actions scenarios: Use the normal personality for the target language (e.g., 693 for English, 362 for Spanish). Conditional actions encode all behavioral logic (interruptions, pacing, silence, hold) directly in the `conditions` array via XML tags. A separate interrupter or edge-case personality adds no value and can interfere with the scripted turn sequence.

For behavioral scenarios: Match personality to scenario intent. Recommended suite distribution for full coverage:

Scenario intent	Personality to use	Example
Happy path / baseline	Normal Male/Female (same language)	ID 693 for English
Urgent / fast-paced caller	Interrupter personality	Scheduling with time pressure
Real-world ambient noise	Background noise personality (street/café)	Mobile caller in public
Non-native / accented speaker	Slow Speaker or language-specific accent	Accessibility testing
Aggressive / frustrated caller	Interrupter + high emotional tone	De-escalation red team

Rough distribution for a balanced suite:

~60% standard (normal male/female in the scenario's language)
~20% challenging (interrupter, fast-paced, background noise)
~10% non-native speakers or accented
~10% edge cases (frustrated, extreme speech rate)
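As a worked example of the rough distribution, the split can be turned into per-bucket counts for a suite of a given size. The bucket names and the choice to hand rounding leftovers to the standard bucket are assumptions made for this sketch; the percentages are the ones listed above.

```python
from typing import Dict

# Percentages from the rough distribution above; bucket names are illustrative.
SUITE_MIX = (
    ("standard", 60),
    ("challenging", 20),
    ("non_native", 10),
    ("edge_case", 10),
)


def allocate(total: int) -> Dict[str, int]:
    """Split a suite of `total` scenarios across the personality buckets."""
    counts = {name: total * percent // 100 for name, percent in SUITE_MIX}
    # Integer division can leave a few scenarios unassigned; give them to standard.
    counts["standard"] += total - sum(counts.values())
    return counts
```

For a 20-scenario suite this yields 12 standard, 4 challenging, 2 non-native, and 2 edge-case scenarios. For small suites the minority buckets can round down to zero, so adjust by hand if a specific personality type must be covered.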

Recommended defaults:

English: 693 (Normal Male, en/American)
Spanish: 362 (Normal Spanish Male)
Other languages: Use 693 and set `scenario_language` to the correct code, OR list personalities via `GET /test_framework/v1/personalities/` and pick the matching language. The platform uses `scenario_language` for TTS, not just personality.


Checking Available Personalities


Always list available personalities before assigning — what's enabled varies per project:

GET /test_framework/v1/personalities/

Non-English personalities (e.g., Russian, Hindi) may not be enabled for a given project. If a personality returns "Personality is not enabled", use ID 693 and rely on `scenario_language` to drive TTS and pronunciation.


Tool Enablement — Critical for Credit Efficiency


Every evaluator should have the right tools enabled for the testing agent. Missing tools cause elongated calls, wasted credits, and false results.

Tool	When to Enable	Why
`TOOL_END_CALL`	Recommended by default — so the testing agent can hang up after completing its objective	Without this, the testing agent can't hang up — calls run until timeout, wasting credits
`TOOL_END_CALL_ONLY_ON_TRANSFER`	When the main agent transfers to a human/IVR	Without this, the testing agent stays on the line through hold music, voicemail, etc.
`TOOL_DTMF`	When the flow involves IVR/phone menus	Allows the testing agent to send touch-tone inputs

Always instruct the testing agent to end the call after completing its objective if `TOOL_END_CALL` is enabled. Otherwise the call continues unnecessarily.

Transfer scenarios: If the expected outcome involves a transfer to a human, enable `TOOL_END_CALL_ONLY_ON_TRANSFER` to prevent dead call time after the transfer completes.
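The tool table can be read as a small decision function. This sketch makes one assumption worth flagging: it keeps `TOOL_END_CALL` as the default and adds the other tools alongside it. Whether the platform expects `TOOL_END_CALL_ONLY_ON_TRANSFER` to replace `TOOL_END_CALL` rather than accompany it is not stated above, so verify before relying on this. The function name and flags are illustrative.

```python
from typing import List


def select_tool_ids(expects_transfer: bool = False, uses_ivr_menu: bool = False) -> List[str]:
    """Pick testing-agent tools for an evaluator based on its flow."""
    tool_ids = ["TOOL_END_CALL"]  # recommended by default so calls do not run to timeout
    if expects_transfer:
        # Assumption: enabled in addition to TOOL_END_CALL, not instead of it.
        tool_ids.append("TOOL_END_CALL_ONLY_ON_TRANSFER")
    if uses_ivr_menu:
        tool_ids.append("TOOL_DTMF")  # lets the testing agent send touch-tone inputs
    return tool_ids
```

The returned list is in the shape the `tool_ids` field of the generate endpoint expects (an array of strings).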


Metrics — Always Attach Baseline Metrics


Every evaluator should have at minimum these metrics enabled:

Expected Outcome — Evaluates whether the agent achieved what the scenario expected
Infrastructure Issues — Flags silent periods, connection drops, agent non-response
Tool Call Success — Monitors whether tool calls succeed or fail
Latency — Measures response time

Two-step process: Metrics must be both (1) toggled on for simulations at the project level AND (2) added to the individual evaluators. Missing either step means the metric won't fire. Use `actions → modify scenarios` to bulk-add metrics to existing evaluators.

Without metrics, runs return success/failure based only on whether the call completed — not whether the agent actually did the right thing. This leads to false passes that require manual review.
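The two-step requirement can be expressed as a simple audit. In this sketch the metric names come from the baseline list above, while the inputs (one set of metrics toggled on at the project level, one set attached to an evaluator) are assumed to have been collected by whatever means is available; the function name is hypothetical.

```python
from typing import Dict, List, Set

BASELINE_METRICS = (
    "Expected Outcome",
    "Infrastructure Issues",
    "Tool Call Success",
    "Latency",
)


def metrics_that_will_not_fire(
    project_enabled: Set[str], evaluator_attached: Set[str]
) -> Dict[str, List[str]]:
    """Report each baseline metric that is missing one or both required steps."""
    report: Dict[str, List[str]] = {}
    for metric in BASELINE_METRICS:
        problems = []
        if metric not in project_enabled:
            problems.append("not toggled on at the project level")
        if metric not in evaluator_attached:
            problems.append("not added to the evaluator")
        if problems:
            report[metric] = problems
    return report
```

An empty report means all four baseline metrics satisfy both steps for that evaluator.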


Designing Conditional Actions

When in conditional-actions mode (per "Choosing Authoring Mode" above), set `scenario_type: "conditional_actions"` on the scenario payload and pass `{ "role": "...", "conditions": [...] }` through the `conditional_actions` field, not through `instructions`. The testing agent walks the `conditions` array turn by turn.

Authoring sequence

Follow these steps in order. Skipping any of them is the most common cause of avoidable rework:

Confirm the path: inbound vs outbound, who speaks first, what the structural test goal is. This matters especially for IVR, voicemail, and DTMF scenarios; see the inbound vs outbound split in `references/conditional-actions.md`.
Define the role: one sentence describing only what the testing agent is pretending to be ("You are a patient calling to cancel an appointment"). Never describe what the main agent is or does; the role is purely the testing agent's persona.
Choose the first turn (`id: 0`): does the testing agent speak first (`action: "Hi, I need to..."`, `fixed_message: true`) or does the main agent speak first (`action: ""`, e.g., IVR/voicemail)?
Write standard conditions: one per agent prompt the testing agent must respond to. Each `condition` is a description of what the agent says; each `action` is the testing agent's response (verbatim with `fixed_message: true`, or behavioral with `false`).
Add `action_followup` and tags as needed: multi-part responses, interruptions, DTMF, voicemail, silence/hold, network simulation, background noise. Each tag has placement constraints; see the reference's XML Tags table. Timing: an `action_followup` fires on the testing agent's next turn after its referenced condition, so one main-agent reply elapses in between, regardless of the reply's content. It never fires in the same turn as its parent. See `references/conditional-actions.md` for the full rule and worked examples.
Attach the supporting fields on the scenario: test profile (for any identity data), tools (`TOOL_END_CALL`, `TOOL_DTMF` for IVR, etc.), metrics (Expected Outcome + Infrastructure Issues + Tool Call Success + Latency), personality (`scenario_language` is inherited from it), folder.
Run the validation checklist from `references/conditional-actions.md` § Validation Checklist. It catches missing FIRST_MESSAGE, missing `type` / `fixed_message`, XML tag misuse, etc., before you hit the API.

API payload skeleton (this is what to POST/PATCH to `/test_framework/v1/scenarios/`):

```json
{
  "agent": 123,
  "personality": 456,
  "name": "CA-01: <descriptive name>",
  "scenario_type": "conditional_actions",
  "scenario_language": "en",
  "conditional_actions": {
    "role": "You are a [persona] calling to [goal]",
    "conditions": [
      { "id": 0, "condition": "FIRST_MESSAGE", "action": "Hi, I need to ...", "type": "standard", "fixed_message": true },
      { "id": 1, "condition": "The agent asks for X", "action": "Provide X", "type": "standard", "fixed_message": false },
      { "id": 2, "condition": "The agent confirms", "action": "Thanks, that's all I needed <endcall />", "type": "standard", "fixed_message": true }
    ]
  }
}
```

Three load-bearing top-level fields:

`scenario_type: "conditional_actions"` is explicit and required. Without it the scenario is created as behavioral and your `conditional_actions` payload is ignored.
`conditional_actions` is a JSON object carrying `{role, conditions[]}`. Do not put this object in `instructions`.
`scenario_language` is required for `conditional_actions`. Set it explicitly, or rely on the assigned personality's language.

Do not set `first_message` or `instructions` when using `conditional_actions`; they are managed for you.

All five condition fields (`id`, `condition`, `action`, `type`, `fixed_message`) are required on every condition.

`id: 0` must use `condition: "FIRST_MESSAGE"` (literal) and `fixed_message: true`; set `action: ""` if the main agent speaks first.
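
The payload rules above are mechanical enough to check before calling the API. The following is a minimal sketch, not the validation checklist from the reference: `validate_scenario_payload` is a hypothetical helper that covers only the rules stated in this section (scenario type, managed fields, the five required condition fields, and the `id: 0` constraints).

```python
REQUIRED_CONDITION_FIELDS = ("id", "condition", "action", "type", "fixed_message")

def validate_scenario_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if payload.get("scenario_type") != "conditional_actions":
        problems.append('scenario_type must be "conditional_actions"')
    # These two fields are managed by the platform in this mode.
    for managed in ("first_message", "instructions"):
        if managed in payload:
            problems.append(f"do not set {managed} with conditional_actions")
    conditions = (payload.get("conditional_actions") or {}).get("conditions", [])
    for cond in conditions:
        missing = [f for f in REQUIRED_CONDITION_FIELDS if f not in cond]
        if missing:
            problems.append(f"condition {cond.get('id')}: missing {missing}")
    first = next((c for c in conditions if c.get("id") == 0), None)
    if first is None:
        problems.append("no condition with id 0")
    else:
        if first.get("condition") != "FIRST_MESSAGE":
            problems.append('id 0 must use condition "FIRST_MESSAGE"')
        if first.get("fixed_message") is not True:
            problems.append("id 0 must set fixed_message to true")
    return problems
```

Running it on the skeleton above should return an empty list; dropping `type` from a condition or adding a top-level `instructions` key should each produce one entry.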

XML tag constraints (the ones you'll hit most)

All XML tags require `fixed_message: true`. With `false`, the testing agent reads angle brackets as literal text.
`<ivr text="..." />` and `<voicemail text="..." />` (or `<voicemail />` for silent) must be the entire action, with no surrounding text or other tags. Use a separate `action_followup` for post-IVR / post-beep content.
`<interruption time="Xs" />` requires `type: "action_followup"` AND must be at the very start of the action string. It fires `Xs` after the main agent's next turn begins.
`<silence time="Xs" />` is interruptible by the main agent; condition matching restarts after an interrupt. `<hold time="Xs" />` is not interruptible; multiple `<hold>` tags are allowed in one action.
`<dtmf digits="..." />` supports `0–9`, `#`, `*`; combinable with surrounding text.
`<endcall />` is combinable with text, so natural sign-offs like `Thanks, that's all I needed <endcall />` work.
`<spell>TEXT</spell>` wraps text to spell letter by letter (good for IDs, account numbers).
`<speed ratio="N" />` has range 0.8–1.2; `<volume ratio="N" />` has range 0–2 (Cartesia voices only). Both must be at the start of the action.
`<network_simulation packet_loss="N" />`: only `packet_loss` is supported.
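
Several of these constraints can be linted locally. The sketch below is a hypothetical helper (`lint_action_tags` is not a Cekura function) and checks only a subset: the `fixed_message` requirement, the whole-action rule for `<ivr>` / `<voicemail>`, the `<interruption>` placement rule, and the `<speed>` / `<volume>` ranges. It uses simple regular expressions rather than a real XML parser, so treat it as a first-pass filter.

```python
import re

TAG_RE = re.compile(r"<(\w+)\b[^>]*>")  # opening or self-closing tags only
WHOLE_RE = re.compile(r"<(?:ivr|voicemail)\b[^>]*/>")
RATIO_RE = re.compile(r'<(speed|volume)\s+ratio="(\d+(?:\.\d+)?)"')
RATIO_RANGES = {"speed": (0.8, 1.2), "volume": (0.0, 2.0)}

def lint_action_tags(cond: dict) -> list[str]:
    """Check one condition against a subset of the tag rules listed above."""
    action = cond.get("action", "")
    tags = TAG_RE.findall(action)
    problems = []
    if tags and cond.get("fixed_message") is not True:
        problems.append("XML tags require fixed_message: true")
    # <ivr> and <voicemail> must be the only content of the action.
    if {"ivr", "voicemail"} & set(tags) and not WHOLE_RE.fullmatch(action.strip()):
        problems.append("<ivr>/<voicemail> must be the entire action")
    if "interruption" in tags:
        if cond.get("type") != "action_followup":
            problems.append('<interruption> requires type "action_followup"')
        if not action.lstrip().startswith("<interruption"):
            problems.append("<interruption> must start the action")
    for name, value in RATIO_RE.findall(action):
        low, high = RATIO_RANGES[name]
        if not low <= float(value) <= high:
            problems.append(f"<{name}> ratio {value} outside {low}-{high}")
    return problems
```

Placement of `<speed>` / `<volume>` at the start of the action, and the Cartesia-only restriction on `<volume>`, are not checked here.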

Worked example — Linear verification flow

```json
{
  "role": "You are an established patient calling to check your appointment status",
  "conditions": [
    { "id": 0, "condition": "FIRST_MESSAGE", "action": "Hi, I'd like to check on my upcoming appointment", "type": "standard", "fixed_message": true },
    { "id": 1, "condition": "The agent asks for your name", "action": "My name is {{test_profile.first_name}} {{test_profile.last_name}}", "type": "standard", "fixed_message": true },
    { "id": 2, "condition": "The agent asks for your date of birth", "action": "Provide your date of birth", "type": "standard", "fixed_message": false },
    { "id": 3, "condition": "The agent asks for your account number", "action": "My account number is <spell>{{test_profile.account_number}}</spell>", "type": "standard", "fixed_message": true },
    { "id": 4, "condition": "The agent confirms your identity and provides appointment details", "action": "Thank you, that's all I needed <endcall />", "type": "standard", "fixed_message": true }
  ]
}
```
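
Because the example above pulls identity data from the test profile, it can help to confirm that every referenced variable exists before running. This sketch assumes that the name after `test_profile.` corresponds to a key in the profile's `information` dict; that mapping is an assumption here, and the authoritative template-variable syntax lives in the reference. `unresolved_profile_variables` is a hypothetical helper.

```python
import re

# Matches {{test_profile.some_name}} and captures "some_name".
VAR_RE = re.compile(r"\{\{\s*test_profile\.(\w+)\s*\}\}")

def unresolved_profile_variables(conditions: list[dict],
                                 profile_information: dict) -> list[str]:
    """List template variables used in actions that the profile lacks."""
    used = set()
    for cond in conditions:
        used.update(VAR_RE.findall(cond.get("action", "")))
    return sorted(used - set(profile_information))
```

With the worked example and a profile that defines only `first_name` and `last_name`, the helper would report `account_number` as unresolved.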

Pattern → reference map. For any of these scenario types, see `references/conditional-actions.md` § "Pattern Library by Use Case" for the full worked JSON:

IVR menu navigation (inbound vs outbound; patterns differ on whether `id:0 action` is empty or contains `<ivr>`), voicemail with post-beep, verification/compliance verbatim, multi-part response, mid-flow pivot, interruption mid-sentence, degraded connection, noisy environment, hostile caller, red-team prompt injection, scripted sequence, multi-language.

Always load the reference before writing conditions for: full XML tag rubric (placement, ranges, voice constraints), test profile template-variable syntax, the `<silence>` vs `<hold>` distinction, the 30 `<background_noise>` sound names, the full anti-patterns list, the post-authoring quality checklist, and the troubleshooting matrix.

The reference is `references/conditional-actions.md`. Read it once at the start of any conditional-actions authoring session, and the inline content above will be enough to draft. Re-read sections of the reference if validation errors come back.

Pre-Creation Checkpoint — Confirm Before Building

Before creating scenarios or generating them, always pause and confirm key decisions with the user. Do not assume defaults — present your plan and get explicit approval. AI agents that skip this step make costly assumptions that waste credits and require rework.

What to Confirm

Present a checkpoint like this before proceeding:

Tool strategy: "How do you want to handle your agent's tool calls during testing?"
- A) Client-side mock data: You manage your own staging backend; I'll align test profiles with your test data
- B) Cekura mock tools: Cekura intercepts tool calls and returns mock responses; I'll set up the mappings
- C) No mock data: Tools aren't relevant to these tests; we'll focus on conversational behavior
Test profile: "Want me to create `<profile-name>` with these fields?" Show the full `information` dict. For Approach A: fields must match client's staging data formats. For Approach B: fields must match Cekura mock tool outputs exactly (derive FROM mock data). For Approach C: only caller identity fields needed.
Run mode: "Default to text/chat for the first pass? It's cheapest, and since tools are mocked the results are the same as voice for logic validation." Recommend text unless the user specifically needs voice testing (latency, interruption handling, TTS quality).
Personality: For conditional-actions scenarios, default to the normal personality for the target language (e.g., 693 for English), because behavioral logic is in the conditions, not the personality. For behavioral scenarios, propose a mix: ~60% normal, ~20% challenging (interrupter/background noise), ~10% non-native, ~10% edge cases. Confirm with the user before using anything other than the normal default. See "Picking the Right Personality" above.
Authoring mode: Default is behavioral instructions. Switch automatically when the user's request used a direct trigger phrase ("conditional actions", "structured", "scripted", "deterministic test", "regression test", "compliance test", "exact flow", "fixed sequence"). Ask the user when the scenario mentions a tag-supported feature (voicemail, IVR, DTMF, hold, interruption, network simulation, background noise) without specifying a mode. See "Choosing Authoring Mode" above.
Folder: "I'll create a folder called `<name>` to organize these scenarios."
Metrics: "I'll attach the baseline metrics (Expected Outcome, Infrastructure Issues, Tool Call Success, Latency) to all scenarios."
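
The authoring-mode rule in the checkpoint can be sketched as a three-way decision. This is an illustrative helper only: `choose_authoring_mode` is hypothetical, the phrase lists are copied from the text above, and plain substring matching is a simplifying assumption (it would, for instance, treat "household" as mentioning "hold").

```python
TRIGGER_PHRASES = (
    "conditional actions", "structured", "scripted", "deterministic test",
    "regression test", "compliance test", "exact flow", "fixed sequence",
)
TAG_FEATURES = (
    "voicemail", "ivr", "dtmf", "hold", "interruption",
    "network simulation", "background noise",
)

def choose_authoring_mode(request: str) -> str:
    """Return "conditional_actions", "ask_user", or "behavioral"."""
    text = request.lower()
    # A direct trigger phrase switches modes without asking.
    if any(phrase in text for phrase in TRIGGER_PHRASES):
        return "conditional_actions"
    # A tag-supported feature without an explicit mode means: ask the user.
    if any(feature in text for feature in TAG_FEATURES):
        return "ask_user"
    return "behavioral"
```

In practice the decision would still be presented to the user as part of the checkpoint rather than applied silently.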

Why This Matters

Without checkpoints, the AI agent will:

Pick the wrong tool strategy (setting up Cekura mocks when the client has a staging backend, or ignoring tools when they're critical)
Create test profiles with fields that don't match mock/staging data (authentication failures)
Default to voice mode when text would be 10x cheaper for the same coverage
Use conditional actions when adaptive instructions are more appropriate
Scatter scenarios without folder organization
Skip metric attachment (producing useless runs)

One checkpoint before creating saves multiple rounds of rework after.

Eval Types

A complete suite covers: Workflow (happy path), Deterministic/Unit Test (conditional actions for exact flows), Edge Case (tool failures, ambiguous inputs), Red Team (prompt injection, social engineering), Error Handling (hostile caller, clinical questions), Multi-Language.

See `references/coverage-patterns.md` for one-paragraph descriptions of each type, the tag-based naming convention, and category breakdowns from real deployments.

Execution Modes

Practical guidance: use text/chat for development iteration (fast, cheap, tests logic), voice for final validation before deployment. Use WebSocket for agents built on WebSocket providers and Pipecat for Pipecat framework agents. Test profile data is passed to the main agent in chat and websocket runs, enabling tool verification without voice calls. The full speed/cost comparison table is in `references/coverage-patterns.md`.

Mock Tool Data Design

When using Approach B (Cekura mock tools), the mock-tool data design is critical and load-bearing. Key principles:

Per-input branching: one mapping per distinct input the agent might send; not one mapping per tool
Phone format variants: always add 10-digit, 11-digit-with-1, and E.164 forms (mismatches cause 404s)
Append-not-replace: PATCHing `information` REPLACES the array; always GET → merge → PATCH
Test profile alignment: derive profile values FROM mock outputs, not independently

See `references/mock-tool-design.md` for full guidance, examples, the backup-phone pattern, and the phone pool workflow.
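
Two of these principles can be illustrated with small pure functions. Both helpers are hypothetical sketches: `phone_variants` assumes US numbers (implied by the "11-digit-with-1" form), and `merge_information` treats entries of the `information` array as opaque values because their shape is not described here. Neither performs the GET or PATCH call itself.

```python
import re

def phone_variants(raw: str) -> list[str]:
    """Return 10-digit, 11-digit-with-1, and E.164 forms of a US number."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, "+"
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit US number, got {raw!r}")
    return [digits, "1" + digits, "+1" + digits]

def merge_information(existing: list, additions: list) -> list:
    """Append new entries to the fetched array instead of replacing it."""
    merged = list(existing)  # start from what GET returned
    for item in additions:
        if item not in merged:  # skip exact duplicates
            merged.append(item)
    return merged
```

The merged list is what would be sent in the PATCH body, so entries fetched by the earlier GET survive the update.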

Tagging Strategy

Format: `tags: ["Category", "priority-level", "scenario-ID"]`. Category codes: S=Scheduling, RS=Rescheduling, CN=Cancellation, V=Verification, SA=Safety, RT=RedTeam, etc.

Expected Outcomes

Focus on the main agent's behavior, not the caller's experience:

Agent-centric: "Agent books appointment and provides arrival instructions" — not "the caller has a great experience"
Specific and measurable: Include concrete actions (book, transfer, cancel, inform)
Include follow-up actions: What happens after the primary action
Keep them concise — expected outcomes are evaluated by an LLM judge that checks whether each part was satisfied. Overly specific prompts (e.g., specifying exact dates/times) cause false failures. Focus on the behavioral outcome, not exact details.

Create Evaluator from Transcript

`POST /test_framework/v1/scenarios/create_scenario_from_transcript/` turns a real call (by observability call-log ID) into a replayable evaluator, which is useful for regression tests from real edge cases. Always review post-creation and attach metrics, profile, folder, tools. See `references/coverage-patterns.md` § Create Evaluator from Transcript for the workflow.

Documentation

Public docs: https://docs.cekura.ai
LLM-friendly docs: https://docs.cekura.ai/llms.txt
Concepts: https://docs.cekura.ai/documentation/key-concepts/
Full API endpoints: `references/api-reference.md`

Session Memory Document

For multi-session eval projects, offer to create a session memory document that captures key decisions (tool strategy, profiles, scenarios, open items) so future sessions don't re-derive context.

See `references/session-memory.md` for the template and update workflow.

Next Steps

After completing eval design, the user typically needs:

Run the suite → execute via the run-scenarios endpoints (see `references/api-reference.md`)
Review results → check transcripts and metric scores
Add or improve metrics → invoke cekura-metric-design for new metrics, cekura-metric-improvement to refine existing ones
Connect a new agent first → invoke cekura-create-agent

Additional Resources

Reference Files (loaded on demand)

`references/tool-strategies.md`: Full workflow for Approaches A/B/C
`references/mock-tool-design.md`: Per-input branching, append-not-replace, phone-pool gotchas
`references/test-profiles.md`: Profile creation from real data, template variables
`references/conditional-actions.md`: Conditional actions: field semantics, XML-tag constraints, worked examples, anti-patterns, validation checklist, quick-reference card
`references/coverage-patterns.md`: Test coverage category breakdowns
`references/session-memory.md`: Multi-session project memory document template
`references/api-reference.md`: Complete API endpoints: scenarios, profiles, results

Example Files

`examples/csv-eval-creation.md`: CSV-to-evaluator workflow
`examples/workflow-eval.md`: Single workflow evaluator example
`examples/red-team-eval.md`: Red-team evaluator example
