cekura-eval-design

Cekura Eval Design

Purpose

Guide the creation of effective Cekura evaluators (test scenarios) that thoroughly exercise AI voice agent capabilities. Evaluators simulate callers to test the main agent — they are NOT metrics (which evaluate transcripts after the fact).

Performing Platform Actions

When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.

Core Terminology

  • Main agent: The client's AI voice agent being tested
  • Testing agent: Cekura's simulated caller that exercises the main agent
  • Evaluator/Scenario: A test case defining what the simulated caller does and what success looks like
  • Metric: A post-call evaluation that scores a transcript (separate concept — see cekura-metrics plugin)
  • Personality: Voice, language, accent, and behavioral traits for the simulated caller
  • Test Profile: Identity and context data passed to testing agent AND main agent (for chat/websocket runs)
  • Conditional Action: Structured, deterministic testing agent behavior with adaptive fallback

The Eval Design Workflow

  1. Understand the agent: Read the agent description (GET the agent record) to identify all workflows, decision points, and edge cases.
  2. Choose a tool strategy: Ask the user which approach they want for handling the agent's external tool calls. This is a fundamental decision that shapes everything else. See "Tool Strategy — Three Approaches" below.
  3. Always create a folder first: Before generating or creating scenarios, create a folder to organize them. Never dump scenarios into the root. POST to the scenarios folder endpoint with `name`, `project_id`, and optionally `parent_path`. Then pass the `folder_path` to the generate endpoint or set it on individual scenarios.
  4. Run the pre-creation checkpoint: Confirm all key decisions with the user before building anything. See "Pre-Creation Checkpoint" below.
  5. Author evaluators: Pick the path based on the mode (per "Choosing Authoring Mode" below):
    • Behavioral mode (default): start with auto-generate via `POST /test_framework/v1/scenarios/generate-bg/`. Provide category-level guidance in `extra_instructions`. If using Cekura mock tools, the generator creates tool-aware scenarios automatically. See "Auto-Generation" section below.
    • Conditional-actions mode: auto-gen can produce either behavioral or conditional-action scenarios, so check the `scenario_type` of generated output and proceed accordingly. When you need full structural control (verbatim phrasing, exact-sequence regression, IVR/voicemail/DTMF flows), author each scenario directly via `POST /test_framework/v1/scenarios/` with `scenario_type: "conditional_actions"` and the `conditional_actions` payload. See "Designing Conditional Actions" below.
  6. Review and fix generation artifacts (only if you ran auto-gen in step 5): Check the `scenario_type` of each generated scenario and inspect the corresponding payload (`instructions` for behavioral, `conditional_actions` for conditional-action). PATCH `scenario_language` for non-English scenarios (defaults to "en" regardless of content). PATCH `first_message` if auto-gen added greetings instead of exact questions. Check for partial completion (generation may produce fewer than requested).
  7. Supplement manually: Add edge cases, red-team scenarios, and deterministic tests that the generator didn't cover, or author additional scenarios directly when you need full structural control.
  8. Set up test infrastructure: Check existing test profiles first, then create new ones. Configure tool data according to the chosen tool strategy.
  9. Attach metrics: ALWAYS include baseline metrics (Expected Outcome, Infrastructure Issues, Tool Call Success, Latency) on every evaluator. Without metrics, runs only report call completion, not correctness.
  10. Run and validate: Execute via `run_scenarios`, review transcripts, iterate.

Tool Strategy — Three Approaches

Ask the user early: "Does your agent call external tools during calls? If so, how do you want to handle tool data for testing?"
| Approach | When to use | Your job |
|---|---|---|
| A. Client-side mock data | Client has staging API/test DB | Align test profiles with their mock data |
| B. Cekura mock tools | No staging, want predictable isolated tests | Set up mock mappings + match test profiles to outputs |
| C. No mock data | Conversational-only agents, testing tone/soft skills | Use test profiles for identity only |
Critical rule for Approach B: derive test profile values FROM mock outputs (same format, same values). Creating them independently guarantees mismatches.
See `references/tool-strategies.md` for full workflow, key questions to ask, and validation guidance for each approach.
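The Approach B rule can be sketched in a few lines of Python. This is only an illustration: the mock output, its field names, and both helper functions are hypothetical and not part of any Cekura API. It copies profile values straight from a mock output and flags a hand-written profile whose format has drifted.

```python
from typing import Any


def profile_from_mock_output(mock_output: dict[str, Any], fields: list[str]) -> dict[str, Any]:
    """Copy the selected fields from a mock tool output, unchanged, into a test profile."""
    missing = [name for name in fields if name not in mock_output]
    if missing:
        raise KeyError(f"mock output lacks fields: {missing}")
    return {name: mock_output[name] for name in fields}


def find_mismatches(profile: dict[str, Any], mock_output: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {field: (profile_value, mock_value)} for shared fields whose values differ."""
    return {
        key: (profile[key], mock_output[key])
        for key in profile
        if key in mock_output and profile[key] != mock_output[key]
    }


# Hypothetical mock output for a patient-lookup tool (illustrative data only).
mock_lookup = {"patient_name": "Sarah Lee", "date_of_birth": "1990-04-12", "appointment_id": "A-1001"}

# Derived profile: same values, same format as the mock.
derived = profile_from_mock_output(mock_lookup, ["patient_name", "date_of_birth"])

# Independently written profile: same date, different format, so it no longer matches.
hand_written = {"patient_name": "Sarah Lee", "date_of_birth": "04/12/1990"}
```

Running `find_mismatches(hand_written, mock_lookup)` reports the `date_of_birth` field, which is the kind of silent format drift the rule is meant to prevent.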

Choosing Authoring Mode

The default authoring mode is behavioral instructions (free-form, first-person scenario instructions). Switch to conditional actions in two situations:

Switch immediately, no confirmation, when the user says any of:

"conditional actions", "structured scenario", "scripted scenario", "scripted test", "deterministic test", "unit test", "regression test", "exact flow", "fixed sequence", "compliance test". The user has stated their authoring intent — proceed straight to designing conditional actions (see "Designing Conditional Actions" below).

Ask first when the user mentions a tag-supported feature without specifying a mode:

"voicemail", "voicemail test", "IVR menu", "IVR navigation", "DTMF entry", "DTMF input", "hold music", "interruption test", "network simulation", "packet loss", "background noise". Conditional actions support these via dedicated XML tags (`<voicemail>`, `<dtmf>`, etc.) and produce higher-fidelity tests, but a behavioral instruction may also be acceptable. Ask one short question:
"This involves [voicemail / IVR / DTMF / etc.]. Conditional actions support `<voicemail>` / `<dtmf>` / `<...>` tags directly for high-fidelity testing. Should I author this as a conditional-actions evaluator (structured turn-by-turn with the right tags), or behavioral instructions (free-form, looser)?"

Stay in behavioral mode for:

Open-ended persona dialogue, exploratory red-team without specific attack scripts, soft-skill / tone / empathy testing, general edge-case quality probing where the conversation path isn't predictable. The "Writing Instructions" section below is the primary guide for this mode.

Concrete examples (which mode for which scenario)

| Scenario the user describes | Default mode | Why |
|---|---|---|
| Appointment scheduling happy path | Behavioral | Path is predictable but doesn't need exact phrasing; behavioral lets the testing agent improvise naturally. |
| Appointment scheduling: exact-sequence regression test | Conditional actions | "Regression test" is a direct trigger phrase. |
| Compliance disclosure / account-number readback | Conditional actions | Verbatim phrasing required (`fixed_message: true` + `<spell>`); "compliance" is a direct trigger phrase. |
| Identity verification with name + DOB + last 4 SSN | Conditional actions | Each turn's action is data-bound (read from test profile); structure prevents drift. |
| Inbound IVR menu navigation | Ask first | Mentions IVR. Could be conditional (high-fidelity, `<dtmf>`) or behavioral (looser); confirm with user. |
| Voicemail handling test | Ask first | Mentions voicemail. The `<voicemail>` tag is purpose-built but behavioral can work. |
| Angry caller / de-escalation | Behavioral | Tone-driven, exploratory; no fixed sequence. |
| Red-team prompt injection (a single attack pattern) | Conditional actions | Specific scripted attack; one evaluator per expected outcome. |
| Red-team free-form probing | Behavioral | Path not predictable; the testing agent improvises attacks. |
| Multi-language tone testing | Behavioral | Soft-skill evaluation; `scenario_language` set on either mode. |
| Multi-language compliance verification | Conditional actions | Verbatim disclosures + language-specific phrasing. |
| Network degradation under packet loss | Ask first | Mentions network simulation. The `<network_simulation>` tag is purpose-built. |
| Tool failure recovery flow (specific failure + recovery path) | Conditional actions | Specific failure trigger + specific recovery step. |
| General "test my agent's quality" | Behavioral | No structural commitment specified. |
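The phrase-based part of this decision can be sketched as a small lookup. The function name and return labels are hypothetical, and naive substring matching is assumed to be good enough for illustration. Rows in the table that depend on judgment (for example, data-bound identity verification) are not captured by keyword matching.

```python
# Direct trigger phrases: switch to conditional actions without asking.
DIRECT_TRIGGERS = (
    "conditional actions", "structured scenario", "scripted scenario", "scripted test",
    "deterministic test", "unit test", "regression test", "exact flow",
    "fixed sequence", "compliance test",
)

# Tag-supported features: ask the user which mode they want.
ASK_FIRST_TERMS = (
    "voicemail", "ivr menu", "ivr navigation", "dtmf entry", "dtmf input", "hold music",
    "interruption test", "network simulation", "packet loss", "background noise",
)


def choose_authoring_mode(request: str) -> str:
    """Return 'conditional_actions', 'ask_first', or 'behavioral' for a user request."""
    text = request.lower()
    # Direct triggers are checked first, so they win over tag-supported terms.
    if any(phrase in text for phrase in DIRECT_TRIGGERS):
        return "conditional_actions"
    if any(term in text for term in ASK_FIRST_TERMS):
        return "ask_first"
    return "behavioral"
```

Checking direct triggers first means a request such as "scripted test of the voicemail flow" goes straight to conditional actions, since the user has already stated the authoring intent.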

Test Profiles — Always Use Them

Test profiles are the backbone of reliable evals. They serve three critical purposes:
  1. Memory persistence — The testing agent reliably uses profile data during calls. Data in instructions often leads to hallucinations.
  2. Dynamic variables — For outbound and websocket runs, test profile fields are sent to the main agent as caller context, mimicking what production systems provide. This lets you test the full end-to-end flow.
  3. Single source of truth — No risk of name in test profile saying "Sarah" while instructions say "John", which causes the testing agent to hallucinate.
Always use test profiles. Never hardcode identity data (names, DOBs, account IDs, addresses, phone numbers, service addresses, discrepancy amounts — anything persona-related) in scenario instructions. Instead, create a test profile with the data and let the instructions reference it generically (e.g., "State your name when asked").
Building test profiles from real data: The best approach is to pull call history from observability and/or past eval runs and use data that is known to work:
  1. Fetch recent call transcript_json records from the API
  2. Analyze toolcall inputs and outputs from real calls
  3. Build a memory document mapping existing data (names, account IDs, appointment IDs, etc.)
  4. Create test profiles using this verified data
This ensures test profiles work against production tools.
Always check for existing test profiles first. Clients often pre-build profiles that are tested against their mock backend; reuse these rather than creating from scratch.
Template variables in instructions: Use `{{test_profile.field_name}}` or `{{test_profile['key']}}` for dynamic injection. For nested data: `{{test_profile.address.city}}`. Note: in voice scenarios, the simulated caller reads from the instruction text directly; the profile data is there for the caller to reference, not injected as hidden context.
See `references/test-profiles.md` for full details and the data-extraction workflow.
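To make the template syntax concrete, here is a minimal sketch of how dotted and bracketed paths map onto nested profile data. It is not Cekura's renderer: the function, the regular expressions, and the sample profile are assumptions made purely for illustration.

```python
import re
from typing import Any

# Matches {{test_profile.a.b}} and {{test_profile['key']}} (single or double quotes).
_PLACEHOLDER = re.compile(r"\{\{\s*test_profile((?:\.\w+|\[\s*['\"][^'\"]+['\"]\s*\])+)\s*\}\}")
# Splits the captured path into individual segments: .name or ['name'].
_SEGMENT = re.compile(r"\.(\w+)|\[\s*['\"]([^'\"]+)['\"]\s*\]")


def render_instructions(template: str, test_profile: dict[str, Any]) -> str:
    """Replace each test_profile placeholder with the value found at that path."""

    def resolve(match: re.Match) -> str:
        value: Any = test_profile
        for dotted, bracketed in _SEGMENT.findall(match.group(1)):
            value = value[dotted or bracketed]  # a KeyError surfaces a missing profile field
        return str(value)

    return _PLACEHOLDER.sub(resolve, template)


# Hypothetical profile and instruction text.
profile = {"first_name": "Sarah", "address": {"city": "Austin"}, "member id": "M-42"}
template = (
    "Name: {{test_profile.first_name}}, "
    "city: {{test_profile.address.city}}, "
    "id: {{test_profile['member id']}}"
)
rendered = render_instructions(template, profile)
```

Failing loudly on a missing field is a deliberate choice in this sketch: a placeholder that silently renders as empty text would reintroduce the mismatch problems that test profiles are meant to remove.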

Writing Instructions

Instructions tell the testing agent what to do. Write in first person from the testing agent's perspective.

Instruction Style

  • First person: "State your name when asked" NOT "The caller should state their name"
  • Behavioral, not scripted: "Report fever and cough, request same provider" NOT "Say exactly: I have a fever"
  • Reference test profile data: "Provide your date of birth when asked for verification" (the actual DOB comes from the test profile)

Good Instructions Pattern

Wrap instructions in `<scenario>` tags with a step-by-step format:
<scenario>
SCENARIO: [Brief scenario name]

YOUR BEHAVIOR:
1. State your intent to [action]
2. Confirm you are the patient when asked
3. Say and spell your first name when asked for verification
4. Provide your date of birth when asked
5. If the agent says no slots are available, say you are flexible with timing

KEY INTERACTION POINTS:
[Specific workflow nodes or edge cases to exercise]
</scenario>
Be explicit about exact phrases when mock/backend behavior depends on them (e.g., `say "follow-up appointment" exactly` if the mock's reason-for-visit matching requires it).

Common Instruction Mistakes

  • Filler steps that add nothing: NEVER write steps like "Listen to the agent's response", "Wait for the agent to speak", "End the call politely", or "Respond accordingly". The testing agent already does these things automatically. Every step must describe a specific action the caller takes: information they provide, a decision they make, or a behavior they exhibit. If a step doesn't tell the caller to DO something specific, delete it.
  • Hardcoding profile data in instructions: Names, DOBs, addresses, account numbers belong in test profiles, not instructions. When data is in both places and they differ, the testing agent hallucinates. This is the single most common mistake across clients.
  • Using instructions for voice characteristics: Instructions like "speak in a mumbling voice" or "be interruptive" don't change the testing agent's vocal style. Use personalities for that; they control actual voice model parameters (accent, interruption level, background noise, speed).
  • Including examples of what the main agent "may say": Don't write `When the agent says "How can I help you", respond with...`. Instead, reference action points by topic: `When asked about what you need help with, explain that you need help with your billing address.` The former is brittle; the latter works regardless of exact agent phrasing.
  • Not providing enough context for multi-step flows: If a scenario involves a complex process (scheduling, onboarding), the testing agent needs step-by-step context to avoid hallucinating after the first few steps. For structured flows, use conditional actions instead.
  • Vague or generic instructions: "Call to schedule an appointment" is useless. Be specific: what type of appointment, what constraints, what complications should arise. The more specific the scenario, the more useful the test.
  • Third-person perspective instead of first person
  • Too scripted (exact dialogue) instead of behavioral goals
  • Missing edge case triggers
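Two of these mistakes, filler steps and third-person phrasing, are mechanical enough to catch with a simple check before a scenario is created. The sketch below is a hypothetical helper, not a Cekura feature, and its pattern lists are deliberately small.

```python
import re

# Filler phrases quoted in the list above; extend as needed.
FILLER_PATTERNS = (
    r"listen to the agent",
    r"wait for the agent",
    r"end the call politely",
    r"respond accordingly",
)

# Third-person phrasing such as "The caller should ...".
THIRD_PERSON = re.compile(r"\bthe (caller|user|customer) (should|will|must)\b", re.IGNORECASE)


def lint_instructions(instructions: str) -> list[str]:
    """Return one finding per filler step or third-person line, with 1-based line numbers."""
    findings = []
    for number, line in enumerate(instructions.splitlines(), start=1):
        lowered = line.lower()
        for pattern in FILLER_PATTERNS:
            if re.search(pattern, lowered):
                findings.append(f"line {number}: filler step ({pattern!r})")
        if THIRD_PERSON.search(line):
            findings.append(f"line {number}: third-person phrasing")
    return findings
```

A check like this only catches surface patterns. Vague instructions and missing edge-case triggers still need a human or model review.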

Bad vs Good Instructions

BAD (filler, vague, passive):
<scenario>
1. When the agent asks to confirm your identity and whether you are the intended person, clearly state: "No, you have the wrong number."
2. Listen to the agent's response.
3. End the call politely.
</scenario>
GOOD (every step is a specific caller action):
<scenario>
SCENARIO: Wrong number — caller is not the intended recipient

YOUR BEHAVIOR:
1. When the agent asks for your name or tries to verify your identity, say this is the wrong number and you don't know the person they're looking for
2. If the agent asks for any additional information, decline — you have no connection to the intended person
3. If the agent apologizes and offers to remove your number, confirm that's fine
</scenario>
BAD (generic, no specifics):
<scenario>
1. Call to schedule an appointment.
2. Provide your information when asked.
3. Confirm the appointment.
</scenario>
GOOD (specific scenario with constraints):
<scenario>
SCENARIO: New adult patient scheduling with insurance

YOUR BEHAVIOR:
1. State you're a new patient and need to schedule a first visit with a primary care provider
2. When asked about insurance, say you have Blue Cross PPO
3. Provide your date of birth and spell your full name when asked for verification
4. Request a morning appointment if given timing options
5. If no morning slots are available, accept the earliest available afternoon slot
6. Confirm the appointment details when the agent reads them back

KEY INTERACTION POINTS:
- New patient registration flow
- Insurance verification
- Appointment slot selection with preference constraints
</scenario>

Auto-Generation

The `POST /test_framework/v1/scenarios/generate-bg/` endpoint is the preferred workflow for bulk scenario creation. Generated scenarios may come back as either behavioral (`scenario_type: "instruction"`) or conditional-action (`scenario_type: "conditional_actions"`), so check what was created and proceed accordingly. When you need full structural control (verbatim phrasing, exact-sequence regression, IVR/voicemail/DTMF flows), author conditional-action evaluators directly via the create endpoint; see "Designing Conditional Actions" below.
Full schema:
| Field | Type | Required | Description |
|---|---|---|---|
| `agent_id` | integer | Yes | Agent to generate scenarios for |
| `num_scenarios` | integer | Yes | How many to generate |
| `extra_instructions` | string | No | Category-level guidance (e.g., "focus on cancellation edge cases") |
| `personalities` | array[integer] | No | Personality IDs to use |
| `generate_expected_outcomes` | boolean | No | Auto-generate expected outcomes |
| `folder_path` | string | No | Folder to place generated scenarios in (always set this; create the folder first) |
| `tags` | array[string] | No | Tags to apply to all generated scenarios |
| `tool_ids` | array[string] | No | Tools to enable (e.g., `TOOL_END_CALL`) |
Returns: `{"progress_id": "<uuid>"}`. Poll with `GET /test_framework/v1/scenarios/generate-progress/?progress_id=<id>`.
Response has: `total_scenarios`, `completed_scenarios`, `failed_scenarios`, `scenarios_list`.
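A minimal sketch of the request body and the polling loop follows. Several things here are assumptions rather than documented behavior: `completed_scenarios` and `failed_scenarios` are treated as integer counts, and the HTTP call is left to a caller-supplied `fetch_progress` function (hypothetical) so that authentication and transport stay outside the sketch.

```python
import time
from typing import Any, Callable, Optional


def build_generate_body(
    agent_id: int,
    num_scenarios: int,
    folder_path: str,
    extra_instructions: Optional[str] = None,
    personalities: Optional[list[int]] = None,
    tool_ids: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Assemble the JSON body for POST /test_framework/v1/scenarios/generate-bg/."""
    body: dict[str, Any] = {
        "agent_id": agent_id,
        "num_scenarios": num_scenarios,
        "folder_path": folder_path,  # optional in the schema, but always set it
    }
    optional = {
        "extra_instructions": extra_instructions,
        "personalities": personalities,
        "tool_ids": tool_ids,
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


def poll_until_done(
    fetch_progress: Callable[[str], dict[str, Any]],
    progress_id: str,
    interval_s: float = 5.0,
    max_polls: int = 60,
) -> dict[str, Any]:
    """Poll until completed + failed covers every requested scenario, or max_polls is reached.

    Assumes completed_scenarios and failed_scenarios are integer counts.
    """
    progress: dict[str, Any] = {}
    for _ in range(max_polls):
        progress = fetch_progress(progress_id)
        finished = progress["completed_scenarios"] + progress["failed_scenarios"]
        if finished >= progress["total_scenarios"]:
            break
        time.sleep(interval_s)
    return progress
```

The `max_polls` cap matters because generation can stall partway through; the caller should inspect the returned counts rather than assume everything completed.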

Generation Gotchas

  1. Generation can partially complete: May produce fewer scenarios than requested (e.g., 15/18) with the remainder stuck. After a reasonable timeout, generate the remainder in a smaller batch with more specific `extra_instructions`.
  2. `scenario_language` defaults to "en": Auto-gen sets all scenarios to English even when `extra_instructions` specify non-English languages. PATCH each scenario with the correct language code (`ru`, `hi`, `es`, `zh`, `ko`, `pt`, `de`, etc.) after generation. This is required for correct TTS voice/pronunciation.
  3. Auto-gen may add greetings to `first_message`: When `extra_instructions` specify exact verbatim questions, some scenarios get a greeting (e.g., "Здравствуйте") as the `first_message` while the actual question is in instructions as a follow-up. PATCH `first_message` after generation.
  4. Language-specific personalities may not be enabled per-project: Non-English personalities may return "Personality is not enabled" errors. Workaround: use personality 693 (Normal Male English) and rely on `scenario_language` to drive TTS and pronunciation. See "Checking Available Personalities" under the Personality section.
  5. Mock tool awareness: When mock tools are enabled on an agent, the generate endpoint creates tool-aware scenarios automatically.
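The first two gotchas reduce to simple bookkeeping after generation. The sketch below assumes that the progress counts are integers and that each generated scenario record carries an `id` field; both are assumptions, and the helper names are hypothetical.

```python
from typing import Any


def remaining_to_generate(progress: dict[str, Any]) -> int:
    """How many scenarios still need to be requested in a follow-up batch."""
    return max(progress["total_scenarios"] - progress["completed_scenarios"], 0)


def language_patches(scenarios: list[dict[str, Any]], expected_language: str) -> dict[int, dict[str, str]]:
    """Map scenario id -> PATCH body for scenarios whose scenario_language is wrong."""
    return {
        scenario["id"]: {"scenario_language": expected_language}
        for scenario in scenarios
        if scenario.get("scenario_language") != expected_language
    }
```

For the 15/18 case above, `remaining_to_generate` yields 3, which becomes `num_scenarios` for the smaller follow-up batch.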

Personality — Required, Controls Voice Characteristics

`personality` is required on every scenario: the API returns 400 if missing. Use personalities (not instructions) to control the testing agent's vocal style. Personalities manage:
  • Language and accent
  • Voice model and provider (ElevenLabs, Cartesia)
  • Interruption level (how often the caller interrupts)
  • Background noise (office, street, etc.)
  • Speech speed and patterns
Wrong: putting `"speak in a mumbling voice and interrupt frequently"` in the instructions. Right: select or create a personality with the desired interruption level and voice characteristics.
Instructions cannot alter actual speaking style; they only affect what the testing agent says, not how it sounds.

Picking the Right Personality

For conditional-actions scenarios: Use the normal personality for the target language (e.g., 693 for English, 362 for Spanish). Conditional actions encode all behavioral logic (interruptions, pacing, silence, hold) directly in the `conditions` array via XML tags. A separate interrupter or edge-case personality adds no value and can interfere with the scripted turn sequence.
For behavioral scenarios: Match personality to scenario intent. Recommended suite distribution for full coverage:
| Scenario intent | Personality to use | Example |
|---|---|---|
| Happy path / baseline | Normal Male/Female (same language) | ID 693 for English |
| Urgent / fast-paced caller | Interrupter personality | Scheduling with time pressure |
| Real-world ambient noise | Background noise personality (street/café) | Mobile caller in public |
| Non-native / accented speaker | Slow Speaker or language-specific accent | Accessibility testing |
| Aggressive / frustrated caller | Interrupter + high emotional tone | De-escalation red team |
Rough distribution for a balanced suite:
  • ~60% standard (normal male/female in the scenario's language)
  • ~20% challenging (interrupter, fast-paced, background noise)
  • ~10% non-native speakers or accented
  • ~10% edge cases (frustrated, extreme speech rate)
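As a worked example of the rough split, the sketch below turns the percentages into whole scenario counts for a suite of any size. The category keys and the function are hypothetical, and largest-remainder rounding is one reasonable choice among several.

```python
# Rough suite mix from the list above, as integer percentages.
SUITE_MIX_PERCENT = {"standard": 60, "challenging": 20, "non_native": 10, "edge_case": 10}


def allocate_personalities(total: int) -> dict[str, int]:
    """Split `total` scenarios across the mix using largest-remainder rounding."""
    counts: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for name, percent in SUITE_MIX_PERCENT.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[name], remainders[name] = divmod(total * percent, 100)
    leftover = total - sum(counts.values())
    # sorted() is stable, so ties go to the category listed first in the mix.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts
```

For a 20-scenario suite this gives 12 standard, 4 challenging, 2 non-native, and 2 edge-case scenarios. For sizes that do not divide evenly, the leftover scenarios go to the categories with the largest fractional share, so the counts always sum to the requested total.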
Recommended defaults:
  • English: 693 (Normal Male, en/American)
  • Spanish: 362 (Normal Spanish Male)
  • Other languages: Use 693 + set `scenario_language` to the correct code, OR list personalities via `GET /test_framework/v1/personalities/` and pick the matching language. The platform uses `scenario_language` for TTS, not just personality.

Checking Available Personalities

Always list available personalities before assigning — what's enabled varies per project:
GET /test_framework/v1/personalities/
Non-English personalities (e.g., Russian, Hindi) may not be enabled for a given project. If a personality returns "Personality is not enabled", use ID 693 and rely on `scenario_language` to drive TTS and pronunciation.

Tool Enablement — Critical for Credit Efficiency

Every evaluator should have the right tools enabled for the testing agent. Missing tools cause elongated calls, wasted credits, and false results.
| Tool | When to Enable | Why |
|---|---|---|
| `TOOL_END_CALL` | Recommended by default, so the testing agent can hang up after completing its objective | Without this, the testing agent can't hang up; calls run until timeout, wasting credits |
| `TOOL_END_CALL_ONLY_ON_TRANSFER` | When the main agent transfers to a human/IVR | Without this, the testing agent stays on the line through hold music, voicemail, etc. |
| `TOOL_DTMF` | When the flow involves IVR/phone menus | Allows the testing agent to send touch-tone inputs |
Always instruct the testing agent to end the call after completing its objective if `TOOL_END_CALL` is enabled. Otherwise the call continues unnecessarily.
Transfer scenarios: If the expected outcome involves a transfer to a human, enable `TOOL_END_CALL_ONLY_ON_TRANSFER` to prevent dead call time after the transfer completes.
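A pre-run check along these lines can catch missing tools before credits are spent. This is a hypothetical helper that mirrors the table row by row. It does not assume anything about how the two end-call tools interact when both are enabled, which the text above does not specify.

```python
def missing_tools(tool_ids: list[str], expects_transfer: bool = False, uses_ivr_menu: bool = False) -> list[str]:
    """Return the tools the table recommends for this scenario that are not enabled."""
    enabled = set(tool_ids)
    missing = []
    if "TOOL_END_CALL" not in enabled:
        missing.append("TOOL_END_CALL")  # recommended by default
    if expects_transfer and "TOOL_END_CALL_ONLY_ON_TRANSFER" not in enabled:
        missing.append("TOOL_END_CALL_ONLY_ON_TRANSFER")
    if uses_ivr_menu and "TOOL_DTMF" not in enabled:
        missing.append("TOOL_DTMF")
    return missing
```

An empty result means the scenario has every tool the table calls for given its traits; anything else is worth fixing before the run.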

Metrics — Always Attach Baseline Metrics

Every evaluator should have at minimum these metrics enabled:
  1. Expected Outcome — Evaluates whether the agent achieved what the scenario expected
  2. Infrastructure Issues — Flags silent periods, connection drops, agent non-response
  3. Tool Call Success — Monitors whether tool calls succeed or fail
  4. Latency — Measures response time
Two-step process: Metrics must be both (1) toggled on for simulations at the project level AND (2) added to the individual evaluators. Missing either step means the metric won't fire. Use `actions → modify scenarios` to bulk-add metrics to existing evaluators.
Without metrics, runs return success/failure based only on whether the call completed — not whether the agent actually did the right thing. This leads to false passes that require manual review.
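Both rules, the baseline set and the two-step enablement, can be expressed as set operations. The sketch below identifies metrics by display name, which is an assumption (the platform may use IDs), and both helpers are hypothetical.

```python
BASELINE_METRICS = ("Expected Outcome", "Infrastructure Issues", "Tool Call Success", "Latency")


def missing_baseline_metrics(attached: list[str]) -> list[str]:
    """Baseline metrics not attached to an evaluator (case-insensitive name match)."""
    attached_names = {name.strip().lower() for name in attached}
    return [metric for metric in BASELINE_METRICS if metric.lower() not in attached_names]


def metrics_that_will_fire(project_enabled: set[str], attached_to_evaluator: set[str]) -> set[str]:
    """A metric fires only if it is enabled at the project level AND attached to the evaluator."""
    return project_enabled & attached_to_evaluator
```

The intersection in `metrics_that_will_fire` is the whole point of the two-step rule: a metric present in only one of the two sets is silently skipped.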

Designing Conditional Actions

When in conditional-actions mode (per "Choosing Authoring Mode" above), set scenario_type: "conditional_actions" on the scenario payload and pass { "role": "...", "conditions": [...] } through the conditional_actions field, not through instructions. The testing agent walks the conditions array turn by turn.

Authoring sequence

Follow these steps in order. Skipping any of them is the most common cause of avoidable rework:
  1. Confirm the path: inbound vs outbound, who speaks first, and what the structural test goal is. This matters most for IVR, voicemail, and DTMF scenarios; see the inbound vs outbound split in references/conditional-actions.md.
  2. Define the role: one sentence describing only what the testing agent is pretending to be ("You are a patient calling to cancel an appointment"). Never describe what the main agent is or does; the role is purely the testing agent's persona.
  3. Choose the first turn (id: 0): does the testing agent speak first (action: "Hi, I need to...", fixed_message: true) or does the main agent speak first (action: "", e.g., IVR/voicemail)?
  4. Write standard conditions: one per agent prompt the testing agent must respond to. Each condition is a description of what the agent says; each action is the testing agent's response (verbatim with fixed_message: true, or behavioral with fixed_message: false).
  5. Add action_followup and tags as needed: multi-part responses, interruptions, DTMF, voicemail, silence/hold, network simulation, background noise. Each tag has placement constraints; see the reference's XML Tags table. Timing: an action_followup fires on the testing agent's next turn after its referenced condition, so one main-agent reply elapses in between, regardless of the reply's content. It never fires in the same turn as its parent. See references/conditional-actions.md for the full rule and worked examples.
  6. Attach the supporting fields on the scenario: test profile (for any identity data), tools (TOOL_END_CALL, TOOL_DTMF for IVR, etc.), metrics (Expected Outcome + Infrastructure Issues + Tool Call Success + Latency), personality (scenario_language is inherited from it), folder.
  7. Run the validation checklist from references/conditional-actions.md § Validation Checklist. It catches missing FIRST_MESSAGE, missing type / fixed_message, XML tag misuse, etc., before you hit the API.
API payload skeleton (this is what to POST/PATCH to /test_framework/v1/scenarios/):
json
{
  "agent": 123,
  "personality": 456,
  "name": "CA-01: <descriptive name>",
  "scenario_type": "conditional_actions",
  "scenario_language": "en",
  "conditional_actions": {
    "role": "You are a [persona] calling to [goal]",
    "conditions": [
      { "id": 0, "condition": "FIRST_MESSAGE", "action": "Hi, I need to ...", "type": "standard", "fixed_message": true },
      { "id": 1, "condition": "The agent asks for X", "action": "Provide X", "type": "standard", "fixed_message": false },
      { "id": 2, "condition": "The agent confirms", "action": "Thanks, that's all I needed <endcall />", "type": "standard", "fixed_message": true }
    ]
  }
}
Three load-bearing top-level fields:
  • scenario_type: "conditional_actions" is explicit and required. Without it the scenario is created as behavioral and your conditional_actions payload is ignored.
  • conditional_actions is a JSON object carrying {role, conditions[]}. Do not put this object in instructions.
  • scenario_language is required for conditional_actions. Set it explicitly, or rely on the assigned personality's language.
Do not set first_message or instructions when using conditional_actions; they are managed for you.
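The top-level rules above can be checked locally before any request is sent. This is a minimal sketch under stated assumptions: the function name is hypothetical, the payload is a plain dict shaped like the skeleton above, and the check mirrors only the rules listed in this section, not the server's own validation.

```python
def check_scenario_payload(payload: dict) -> list[str]:
    """Return human-readable problems with a conditional-actions scenario payload."""
    problems = []
    if payload.get("scenario_type") != "conditional_actions":
        problems.append('scenario_type must be "conditional_actions"')
    ca = payload.get("conditional_actions")
    if not isinstance(ca, dict) or "role" not in ca or "conditions" not in ca:
        problems.append("conditional_actions must be an object with role and conditions")
    for managed in ("first_message", "instructions"):
        if managed in payload:
            problems.append(f"do not set {managed}; it is managed for you")
    # The language may come from the payload or from the assigned personality.
    if "scenario_language" not in payload and "personality" not in payload:
        problems.append("set scenario_language or assign a personality that supplies it")
    return problems


payload = {
    "agent": 123,
    "personality": 456,
    "name": "CA-01: example",
    "scenario_type": "conditional_actions",
    "scenario_language": "en",
    "conditional_actions": {"role": "You are a patient calling to cancel", "conditions": []},
}
print(check_scenario_payload(payload))
```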
All five condition fields (id, condition, action, type, fixed_message) are required on every condition. id: 0 must use condition: "FIRST_MESSAGE" (literal) and fixed_message: true; set action: "" if the main agent speaks first.
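The per-condition rules can be sketched as a local check. The helper name is hypothetical; the required field names and the id 0 constraints are taken from the paragraph above, and nothing here replaces the validation checklist in the reference.

```python
REQUIRED_FIELDS = ("id", "condition", "action", "type", "fixed_message")


def check_conditions(conditions: list[dict]) -> list[str]:
    """Return problems found in a conditions array, based on the rules stated above."""
    problems = []
    for cond in conditions:
        missing = [field for field in REQUIRED_FIELDS if field not in cond]
        if missing:
            problems.append(f"condition {cond.get('id', '?')}: missing {', '.join(missing)}")
    first = next((c for c in conditions if c.get("id") == 0), None)
    if first is None:
        problems.append("no condition with id 0")
    else:
        if first.get("condition") != "FIRST_MESSAGE":
            problems.append('id 0 must use condition "FIRST_MESSAGE"')
        if first.get("fixed_message") is not True:
            problems.append("id 0 must set fixed_message to true")
    return problems


conditions = [
    {"id": 0, "condition": "FIRST_MESSAGE", "action": "", "type": "standard", "fixed_message": True},
    {"id": 1, "condition": "The agent asks for X", "action": "Provide X", "type": "standard", "fixed_message": False},
]
print(check_conditions(conditions))
```

An empty action on id 0 passes, since that is how the main agent is allowed to speak first.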

XML tag constraints (the ones you'll hit most)

  • All XML tags require fixed_message: true. With false, the testing agent reads angle brackets as literal text.
  • <ivr text="..." /> and <voicemail text="..." /> (or <voicemail /> for silent) must be the entire action, with no surrounding text or other tags. Use a separate action_followup for post-IVR / post-beep content.
  • <interruption time="Xs" /> requires type: "action_followup" AND must be at the very start of the action string. It fires Xs after the main agent's next turn begins.
  • <silence time="Xs" /> is interruptible by the main agent; condition matching restarts after an interrupt. <hold time="Xs" /> is not interruptible; multiple <hold> tags are allowed in one action.
  • <dtmf digits="..." /> supports 0–9, #, and *; it can be combined with surrounding text.
  • <endcall /> can be combined with text, so natural sign-offs like Thanks, that's all I needed <endcall /> work.
  • <spell>TEXT</spell> wraps text to spell letter by letter (good for IDs, account numbers).
  • <speed ratio="N" /> has range 0.8–1.2; <volume ratio="N" /> has range 0–2 (Cartesia voices only). Both must be at the start of the action.
  • <network_simulation packet_loss="N" />: only packet_loss is supported.
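Several of these constraints are mechanical enough to lint before submitting. The following is a rough sketch, not Cekura's validator: the function name and regular expressions are assumptions, and it covers only the fixed_message, whole-action, interruption, and ratio-range rules listed above.

```python
import re

TAG_RE = re.compile(
    r"<\s*(ivr|voicemail|interruption|silence|hold|dtmf|endcall|spell|speed|volume"
    r"|network_simulation|background_noise)\b"
)
WHOLE_ACTION_RE = re.compile(r"<(ivr|voicemail)\b[^>]*/>")
RATIO_RE = re.compile(r'<(speed|volume) ratio="([0-9]+(?:\.[0-9]+)?)"\s*/>')
RATIO_RANGES = {"speed": (0.8, 1.2), "volume": (0.0, 2.0)}


def check_tags(cond: dict) -> list[str]:
    """Lint one condition's action string against the tag constraints above."""
    action = cond.get("action", "")
    problems = []
    if not TAG_RE.search(action):
        return problems  # no tags, nothing to check
    if cond.get("fixed_message") is not True:
        problems.append("XML tags require fixed_message: true")
    if re.search(r"<(ivr|voicemail)\b", action) and not WHOLE_ACTION_RE.fullmatch(action.strip()):
        problems.append("<ivr> / <voicemail> must be the entire action")
    if "<interruption" in action:
        if cond.get("type") != "action_followup":
            problems.append('<interruption> requires type "action_followup"')
        if not action.lstrip().startswith("<interruption"):
            problems.append("<interruption> must start the action")
    for tag, value in RATIO_RE.findall(action):
        low, high = RATIO_RANGES[tag]
        if not low <= float(value) <= high:
            problems.append(f"<{tag}> ratio {value} outside {low}-{high}")
    return problems


print(check_tags({"action": 'Hi <ivr text="Press 1" />', "type": "standard", "fixed_message": True}))
```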

Worked example — Linear verification flow

json
{
  "role": "You are an established patient calling to check your appointment status",
  "conditions": [
    { "id": 0, "condition": "FIRST_MESSAGE", "action": "Hi, I'd like to check on my upcoming appointment", "type": "standard", "fixed_message": true },
    { "id": 1, "condition": "The agent asks for your name", "action": "My name is {{test_profile.first_name}} {{test_profile.last_name}}", "type": "standard", "fixed_message": true },
    { "id": 2, "condition": "The agent asks for your date of birth", "action": "Provide your date of birth", "type": "standard", "fixed_message": false },
    { "id": 3, "condition": "The agent asks for your account number", "action": "My account number is <spell>{{test_profile.account_number}}</spell>", "type": "standard", "fixed_message": true },
    { "id": 4, "condition": "The agent confirms your identity and provides appointment details", "action": "Thank you, that's all I needed <endcall />", "type": "standard", "fixed_message": true }
  ]
}
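To preview what a templated action will say, the placeholders can be filled from a test profile's information dict. This is only an illustration of how the {{test_profile.<field>}} placeholders in the example line up with profile fields; the function is hypothetical, and the authoritative template-variable syntax is in the reference.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{\s*test_profile\.(\w+)\s*\}\}")


def render_action(action: str, information: dict) -> str:
    """Substitute {{test_profile.<field>}} placeholders; fail loudly on unknown fields."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in information:
            raise KeyError(f"test profile has no field {key!r}")
        return str(information[key])

    return PLACEHOLDER_RE.sub(substitute, action)


# Made-up profile values, for illustration only.
profile = {"first_name": "Ana", "last_name": "Lee", "account_number": "A1B2"}
print(render_action("My name is {{test_profile.first_name}} {{test_profile.last_name}}", profile))
```

Raising on a missing field surfaces profile/condition mismatches before a run spends credits.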
Pattern → reference map. For any of these scenario types, see references/conditional-actions.md § "Pattern Library by Use Case" for the full worked JSON:
  • IVR menu navigation (inbound vs outbound; patterns differ on whether the id: 0 action is empty or contains <ivr>), voicemail with post-beep, verification/compliance verbatim, multi-part response, mid-flow pivot, interruption mid-sentence, degraded connection, noisy environment, hostile caller, red-team prompt injection, scripted sequence, multi-language.
Always load the reference before writing conditions for: full XML tag rubric (placement, ranges, voice constraints), test profile template-variable syntax, the <silence> vs <hold> distinction, the 30 <background_noise> sound names, the full anti-patterns list, the post-authoring quality checklist, and the troubleshooting matrix.
The reference is references/conditional-actions.md. Read it once at the start of any conditional-actions authoring session, and the inline content above will be enough to draft. Re-read sections of the reference if validation errors come back.

Pre-Creation Checkpoint — Confirm Before Building

Before creating scenarios or generating them, always pause and confirm key decisions with the user. Do not assume defaults — present your plan and get explicit approval. AI agents that skip this step make costly assumptions that waste credits and require rework.

What to Confirm

Present a checkpoint like this before proceeding:
  1. Tool strategy: "How do you want to handle your agent's tool calls during testing?"
    • A) Client-side mock data: You manage your own staging backend; I'll align test profiles with your test data
    • B) Cekura mock tools: Cekura intercepts tool calls and returns mock responses; I'll set up the mappings
    • C) No mock data: Tools aren't relevant to these tests; we'll focus on conversational behavior
  2. Test profile: "Want me to create <profile-name> with these fields?" Show the full information dict. For Approach A: fields must match client's staging data formats. For Approach B: fields must match Cekura mock tool outputs exactly (derive FROM mock data). For Approach C: only caller identity fields needed.
  3. Run mode: "Default to text/chat for the first pass? It's cheapest, and since tools are mocked the results are the same as voice for logic validation." Recommend text unless the user specifically needs voice testing (latency, interruption handling, TTS quality).
  4. Personality: For conditional-actions scenarios, default to the normal personality for the target language (e.g., 693 for English), because behavioral logic is in the conditions, not the personality. For behavioral scenarios, propose a mix: ~60% normal, ~20% challenging (interrupter/background noise), ~10% non-native, ~10% edge cases. Confirm with the user before using anything other than the normal default. See "Picking the Right Personality" above.
  5. Authoring mode: Default is behavioral instructions. Switch automatically when the user's request used a direct trigger phrase ("conditional actions", "structured", "scripted", "deterministic test", "regression test", "compliance test", "exact flow", "fixed sequence"). Ask the user when the scenario mentions a tag-supported feature (voicemail, IVR, DTMF, hold, interruption, network simulation, background noise) without specifying a mode. See "Choosing Authoring Mode" above.
  6. Folder: "I'll create a folder called <name> to organize these scenarios."
  7. Metrics: "I'll attach the baseline metrics (Expected Outcome, Infrastructure Issues, Tool Call Success, Latency) to all scenarios."
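The authoring-mode rule in item 5 can be sketched as a three-way decision. The function name and return labels are hypothetical; the trigger phrases and tag-supported features are the ones listed above. Whole-word matching is a simplifying assumption, so treat the result as a prompt for confirmation rather than a final answer.

```python
import re

TRIGGER_PHRASES = (
    "conditional actions", "structured", "scripted", "deterministic test",
    "regression test", "compliance test", "exact flow", "fixed sequence",
)
TAG_FEATURES = (
    "voicemail", "ivr", "dtmf", "hold", "interruption",
    "network simulation", "background noise",
)


def _mentions(text: str, phrases: tuple[str, ...]) -> bool:
    """True if any phrase appears in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)


def choose_authoring_mode(request: str) -> str:
    """Return 'conditional_actions', 'ask_user', or 'behavioral' for a user request."""
    text = request.lower()
    if _mentions(text, TRIGGER_PHRASES):
        return "conditional_actions"  # direct trigger phrase: switch automatically
    if _mentions(text, TAG_FEATURES):
        return "ask_user"  # tag-supported feature without a stated mode
    return "behavioral"  # default


print(choose_authoring_mode("Test how the agent handles voicemail"))
```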

Why This Matters

Without checkpoints, the AI agent will:
  • Pick the wrong tool strategy (setting up Cekura mocks when the client has a staging backend, or ignoring tools when they're critical)
  • Create test profiles with fields that don't match mock/staging data (authentication failures)
  • Default to voice mode when text would be 10x cheaper for the same coverage
  • Use conditional actions when adaptive instructions are more appropriate
  • Scatter scenarios without folder organization
  • Skip metric attachment (producing useless runs)
One checkpoint before creating saves multiple rounds of rework after.

Eval Types

A complete suite covers: Workflow (happy path), Deterministic/Unit Test (conditional actions for exact flows), Edge Case (tool failures, ambiguous inputs), Red Team (prompt injection, social engineering), Error Handling (hostile caller, clinical questions), Multi-Language.
See references/coverage-patterns.md for one-paragraph descriptions of each type, the tag-based naming convention, and category breakdowns from real deployments.

Execution Modes

Practical guidance: use text/chat for development iteration (fast, cheap, tests logic) and voice for final validation before deployment. Use WebSocket for agents built on WebSocket providers and Pipecat for Pipecat framework agents. Test profile data is passed to the main agent in chat and websocket runs, enabling tool verification without voice calls. The full speed/cost comparison table is in references/coverage-patterns.md.

Mock Tool Data Design

When using Approach B (Cekura mock tools), the design of the mock-tool data is critical. Key principles:
  • Per-input branching: one mapping per distinct input the agent might send; not one mapping per tool
  • Phone format variants: always add 10-digit, 11-digit-with-1, and E.164 forms (mismatches cause 404s)
  • Append-not-replace: PATCHing information REPLACES the array; always GET → merge → PATCH
  • Test profile alignment: derive profile values FROM mock outputs, not independently
See references/mock-tool-design.md for full guidance, examples, the backup-phone pattern, and the phone pool workflow.
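Two of these principles are easy to get wrong by hand, so here is a minimal sketch of each. Both helpers are hypothetical: phone_variants assumes North American numbers with country code 1, and merge_information treats each entry in the information array as an opaque dict, since the entry shape is not specified here. The actual GET and PATCH calls are left out.

```python
def phone_variants(number: str, country_code: str = "1") -> list[str]:
    """Return the 10-digit, 11-digit-with-1, and E.164 forms of a phone number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 10 + len(country_code) and digits.startswith(country_code):
        digits = digits[len(country_code):]  # strip the leading country code
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit national number, got {number!r}")
    return [digits, country_code + digits, "+" + country_code + digits]


def merge_information(existing: list[dict], additions: list[dict]) -> list[dict]:
    """Merge step of GET -> merge -> PATCH: keep existing entries, append new ones."""
    merged = list(existing)
    for entry in additions:
        if entry not in merged:
            merged.append(entry)
    return merged


print(phone_variants("(415) 555-0100"))
```

The merged list is what would be sent in the PATCH body, so entries fetched by the GET are never dropped.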

Tagging Strategy

Format: tags: ["Category", "priority-level", "scenario-ID"]. Category codes: S=Scheduling, RS=Rescheduling, CN=Cancellation, V=Verification, SA=Safety, RT=RedTeam, etc.
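A tag list in this format could be assembled as follows. This is a sketch with several assumptions: the helper is hypothetical, the first tag is taken to be the category name rather than its code, and the "P1" priority label and "RS-03" ID shape are invented for illustration; see references/coverage-patterns.md for the actual naming convention.

```python
CATEGORY_CODES = {
    "S": "Scheduling",
    "RS": "Rescheduling",
    "CN": "Cancellation",
    "V": "Verification",
    "SA": "Safety",
    "RT": "RedTeam",
}


def build_tags(code: str, priority: str, number: int) -> list[str]:
    """Build [category, priority-level, scenario-ID] from a category code."""
    category = CATEGORY_CODES[code]  # KeyError on an unknown code is intentional
    return [category, priority, f"{code}-{number:02d}"]


print(build_tags("RS", "P1", 3))
```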

Expected Outcomes

Focus on the main agent's behavior, not the caller's experience:
  • Agent-centric: "Agent books appointment and provides arrival instructions" — not "the caller has a great experience"
  • Specific and measurable: Include concrete actions (book, transfer, cancel, inform)
  • Include follow-up actions: What happens after the primary action
  • Keep them concise — expected outcomes are evaluated by an LLM judge that checks whether each part was satisfied. Overly specific prompts (e.g., specifying exact dates/times) cause false failures. Focus on the behavioral outcome, not exact details.

Create Evaluator from Transcript

POST /test_framework/v1/scenarios/create_scenario_from_transcript/ turns a real call (by observability call-log ID) into a replayable evaluator, which is useful for building regression tests from real edge cases. Always review the evaluator after creation and attach metrics, profile, folder, and tools. See references/coverage-patterns.md § Create Evaluator from Transcript for the workflow.

Documentation

Session Memory Document

For multi-session eval projects, offer to create a session memory document that captures key decisions (tool strategy, profiles, scenarios, open items) so future sessions don't re-derive context.
See references/session-memory.md for the template and update workflow.

Next Steps

After completing eval design, the user typically needs:
  • Run the suite → execute via the run-scenarios endpoints (see references/api-reference.md)
  • Review results → check transcripts and metric scores
  • Add or improve metrics → invoke cekura-metric-design for new metrics, cekura-metric-improvement to refine existing ones
  • Connect a new agent first → invoke cekura-create-agent

Additional Resources

Reference Files (loaded on demand)

  • references/tool-strategies.md: Full workflow for Approaches A/B/C
  • references/mock-tool-design.md: Per-input branching, append-not-replace, phone-pool gotchas
  • references/test-profiles.md: Profile creation from real data, template variables
  • references/conditional-actions.md: Conditional actions (field semantics, XML-tag constraints, worked examples, anti-patterns, validation checklist, quick-reference card)
  • references/coverage-patterns.md: Test coverage category breakdowns
  • references/session-memory.md: Multi-session project memory document template
  • references/api-reference.md: Complete API endpoints for scenarios, profiles, and results

Example Files

  • examples/csv-eval-creation.md: CSV-to-evaluator workflow
  • examples/workflow-eval.md: Single workflow evaluator example
  • examples/red-team-eval.md: Red-team evaluator example