cekura-metric-design
Cekura Metric Design
Purpose
Guide the creation of effective Cekura metrics that accurately evaluate AI voice agent call quality. Metrics measure call quality after the fact by evaluating transcripts against defined criteria. Each metric targets a specific workflow or KPI that needs tracking per call.
Performing Platform Actions
When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.
Core Terminology
- Main agent: The client's AI voice agent being tested
- Testing agent: Cekura's simulated caller that exercises the main agent
- Metric: A post-call evaluation that scores a transcript
- Evaluator/Scenario: A test case that simulates a caller (separate concept — see cekura-eval-design skill)
The Metric Creation Workflow
Follow this workflow every time. Skipping steps (especially step 2) leads to metrics that miss edge cases.
1. Gather context — Understand the client's use case, what they care about, get sample conversation IDs with expected outcomes
2. Fetch real transcripts — Pull 3-5 actual `transcript_json` records from the Cekura API. Study: what roles appear (`Main Agent`, `User`, `Function Call`, `Function Call Result`), what timestamps are available, how tool calls are structured, what the conversation flow looks like in practice. Metrics written without reading real data will miss edge cases.
3. Identify the signal — What specific thing in the transcript indicates pass vs fail? A tool call, a timestamp gap, a phrase, a behavioral pattern?
4. Write the prompt — Use proven structures (see below), grounded in what the real transcripts look like
5. Deploy and test — Create the metric via API, run on sample conversations, compare to expected outcomes
6. Iterate — Adjust the prompt based on results, re-run, repeat until the metric matches expectations on all samples. Plan for at least one iteration — the first run reveals measurement issues.
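The transcript inspection in step 2 can be sketched as a small helper. This is a minimal sketch, assuming each `transcript_json` entry is a dict with a `role` key and an optional `start_time` key; both field names and the sample data are illustrative assumptions rather than the confirmed Cekura schema, so adjust them to the records you actually fetch.

```python
from collections import Counter


def summarize_roles(transcript_json):
    """Count turns per role and note which roles carry timestamps.

    Assumes each entry is a dict with a "role" key and an optional
    "start_time" key (hypothetical field names).
    """
    role_counts = Counter()
    roles_with_timestamps = set()
    for entry in transcript_json:
        role = entry.get("role", "<missing role>")
        role_counts[role] += 1
        if entry.get("start_time") is not None:
            roles_with_timestamps.add(role)
    return {
        "role_counts": dict(role_counts),
        "roles_with_timestamps": sorted(roles_with_timestamps),
    }


# Hand-written sample, not real call data.
sample = [
    {"role": "Main Agent", "content": "Thanks for calling.", "start_time": 0.0},
    {"role": "User", "content": "I need to book a visit.", "start_time": 2.4},
    {"role": "Function Call", "content": "create_booking"},
    {"role": "Function Call Result", "content": "ok"},
]
summary = summarize_roles(sample)
```

Running this over 3-5 real records shows quickly whether tool calls appear as their own roles and whether they carry timestamps, which decides what signals a metric prompt can rely on.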
Metric Types
llm_judge (preferred default)
An LLM evaluates the prompt in the `description` field against the call transcript. Prefer llm_judge over custom_code. Custom_code seems appealing for "objective" checks (timestamps, tool call presence) but is brittle in practice. Voice AI transcripts have messy timing — agents transfer mid-tool-chain, background tasks complete after speech resumes, timestamps overlap. An LLM reading the transcript handles these nuances naturally. Express measurements in natural language, not code.
Critical: The evaluation prompt goes in the `description` field, NOT the `prompt` field.
custom_code (Python on Lambda)
Python code in the `custom_code` field. Has access to the `data` dict, `evaluate_basic_metric()`, and upstream metric results. Reserve for cases that genuinely need programmatic logic.
Use only when:
- Gating on upstream metric results (`data.get("Exact Metric Name")`)
- Section extraction from agent description (Pythonic pattern)
- Multiple LLM calls with different prompts based on conditions
- N/A short-circuiting before calling the LLM
Metric Evolution Path
Start as `llm_judge` for rapid iteration. Once the prompt is validated (through labs/feedback), convert to `custom_code` with section extraction for production. Cekura allows a metric to have both `description` (llm_judge prompt) AND `custom_code` — the active type is toggled. This means the LLM prompt can be refined through labs, then the custom_code version uses that same prompt with targeted context extraction.
Eval Types
| Eval Type | Output | Use For |
|---|---|---|
| | TRUE/FALSE | Soft skills, quality assessments |
| | TRUE/FALSE | Flow compliance checks |
| | String from defined values | Classification tasks |
| | Float score | Scoring tasks |
| | Continuous score | Continuous quality assessment |
LLM Judge Prompt Structure
Two proven structures exist. See `references/prompt-patterns.md` for full templates.
Structure A: Sectioned (best for multi-criteria metrics)
- SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
- DO NOT FLAG — Common false positives: behavioral patterns that look like fails but aren't for THIS metric
- INPUTS — Only relevant template variables
- SECTIONS — Numbered evaluation criteria with pass/fail examples
- FAILURE CONDITIONS (Only These Count) — Narrow, closed list of what constitutes a failure
- SAFEGUARDING NOTES — Spirit vs letter overrides
- OUTPUT INSTRUCTIONS — Return format, timestamps for failures
Structure B: Narrative (best for behavioral/timing metrics)
- SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
- DO NOT FLAG — Common false positives specific to this metric
- CONTEXT — What this call type looks like, why the metric matters
- WHAT TO LOOK FOR — Specific items in the transcript (tool names, phrases, patterns)
- FAILURE CONDITIONS (Only These Count) — Narrow, closed list of specific failure patterns
- NUANCES — Edge cases, overrides, things that look like fails but aren't
- OUTPUT — TRUE/FALSE/N/A with timestamps and evidence
Being explicit about PASS vs FAIL with real examples from actual conversations is the single most impactful thing for prompt quality. Vague criteria like "agent should be responsive" produce inconsistent results.
Anti-Cross-Pollination Scoping (when using `{{agent.description}}`)
The most common source of false failures: a metric using `{{agent.description}}` fails based on rules from an unrelated flow (e.g., Emergency metric fails because of a Booking Flow rule).
Three-layer scoping pattern: SCOPE & FOCUS ("evaluates X ONLY"), DO NOT FLAG (enumerate false positives by behavioral pattern), FAILURE CONDITIONS (narrow closed list).
See `references/advanced-patterns.md` for full structure and the rule that all scoping language must be concept-based, never hardcoded to a specific agent's section names.
Available Template Variables
| Variable | Description |
|---|---|
| | Full conversation text |
| | Structured transcript with timestamps |
| | Full blob of custom variables from calls |
| | Specific dynamic variable by key (dot notation preferred) |
| `{{agent.description}}` | Main agent's system prompt |
| `{{metadata}}` | Call metadata |
| | How the call ended |
Include only variables relevant to the specific metric. Listing all variables creates noise and dilutes evaluation focus.
When using `{{metadata}}`, point to specific metadata fields the LLM judge should reference (e.g., "Check `metadata.appointment_id` to verify booking was created").
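To keep a prompt limited to the variables it needs, it can help to list the placeholders it references before deploying. The following is a minimal sketch, assuming placeholders use the `{{name}}` or `{{name.key}}` form shown in this document; `referenced_variables` is a hypothetical helper, not part of Cekura.

```python
import re

# Matches {{name}} and dotted forms such as {{dynamic_variables.promptName}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def referenced_variables(prompt):
    """Return the distinct {{variable}} placeholders in first-seen order."""
    seen = []
    for name in PLACEHOLDER.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


prompt = (
    "Check {{metadata}} for metadata.appointment_id to verify the booking. "
    "Judge the agent against {{dynamic_variables.promptName}} only. "
    "Re-check {{metadata}} if the booking is disputed."
)
variables = referenced_variables(prompt)
```

If the returned list contains a variable the metric never reasons about, that variable is a candidate for removal.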
The Spirit vs Letter Principle
This is the most critical concept in metric design.
Agent descriptions describe the intended functionality of the main agent, but must not be taken literally by the evaluator. Understand the intent behind each instruction and write the metric to capture the spirit, not the literal text.
Example: Agent description says "ask only 1 question at a time"
- Spirit: Prevent cognitive overload on the caller
- Literal (wrong): Fail any turn with more than one question mark
- Correct metric behavior:
  - PASS: "Are you the owner of 123 Easy St? Can I get your name?" (related data cluster)
  - PASS: "Is this a new issue, or an existing one?" (A/B rephrasing = single question)
  - FAIL: "Does Thursday work? Also, did you get our text message?" (unrelated questions)
When uncertain about the intent behind an instruction, ask the user to clarify before encoding it into the metric. Include explicit safeguarding examples in the prompt showing what should and should not be penalized.
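The gap between letter and spirit shows up as soon as the literal rule is written as code. The sketch below applies the literal (wrong) rule to the three examples above; `literal_check_fails` is a hypothetical helper used only for illustration, not a recommended metric.

```python
def literal_check_fails(turn):
    """The literal (wrong) rule: fail any turn with more than one question mark."""
    return turn.count("?") > 1


# (agent turn, expected verdict under the spirit of the instruction)
examples = [
    ("Are you the owner of 123 Easy St? Can I get your name?", "PASS"),
    ("Is this a new issue, or an existing one?", "PASS"),
    ("Does Thursday work? Also, did you get our text message?", "FAIL"),
]

# Turns where the literal rule disagrees with the intended verdict.
disagreements = [
    text
    for text, expected in examples
    if literal_check_fails(text) != (expected == "FAIL")
]
```

The literal rule misjudges the related-data-cluster turn, which is why that case belongs in the prompt as an explicit safeguarding example.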
Trigger Design
| Trigger Type | When to Use |
|---|---|
| | Metrics that apply to every call (soft skills, business context) |
| | Conditional metrics (booking flow only fires when booking intent detected) |
| | Complex trigger logic requiring code |
| | Let Cekura auto-determine (less control) |
Use the `generate_evaluation_trigger` endpoint (see `references/api-reference.md`) to auto-generate trigger prompts from metric descriptions. Triggers can be layered in specificity — e.g., one trigger fires on any onboarding call, another fires only when the user gets stuck.
Two-Layer N/A Strategy
Triggers and metric descriptions handle N/A at different levels:
- Trigger-level N/A (first defense): The trigger gates out obviously irrelevant calls BEFORE the metric runs. This saves LLM cost. Example: a Booking Flow metric's trigger checks if booking intent exists — if not, the metric doesn't fire and outputs N/A.
- Description-level N/A (nuanced cases): The metric prompt itself handles edge cases that need transcript context to determine. Example: a call has booking intent (trigger fires) but the caller hangs up before the flow starts — the metric description returns N/A with "VALID_SKIP: caller disconnected before booking could begin."
Design triggers to catch the obvious non-applicable calls; design the metric prompt to handle the nuanced edge cases that require reading the transcript.
Trigger Prompt Template
Write triggers with the positive-then-negative pattern:
Evaluate whether this call involves [specific scenario].
Return TRUE if ANY of these indicators are present:
- [Positive indicator 1]
- [Positive indicator 2]
Do NOT trigger if ANY of these apply:
- Call is under 30 seconds or contains no substantive interaction beyond a greeting
- Line disconnection / voicemail / outbound non-engagement
- [Specific exclusion for this metric — e.g., "Emergency-flow transfers (covered by Emergency metric)"]
- [Another exclusion]
Be inclusive — if there's reasonable evidence the scenario occurred, return TRUE.
Always include the short-call exclusion. Calls under ~30 seconds (hang-ups, wrong numbers, voicemails) produce false positives/negatives on every metric. Gate them out at the trigger level.
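The template above can also be filled in programmatically so the short-call exclusion is never left out. This is a minimal sketch; `build_trigger_prompt` and its parameters are hypothetical names for illustration, and the output is plain text to paste into the trigger prompt.

```python
SHORT_CALL_EXCLUSION = (
    "Call is under 30 seconds or contains no substantive interaction beyond a greeting"
)


def build_trigger_prompt(scenario, positives, exclusions):
    """Render a positive-then-negative trigger prompt.

    The short-call exclusion is always listed first among the exclusions.
    """
    lines = [
        f"Evaluate whether this call involves {scenario}.",
        "Return TRUE if ANY of these indicators are present:",
    ]
    lines += [f"- {item}" for item in positives]
    lines.append("Do NOT trigger if ANY of these apply:")
    lines += [f"- {item}" for item in [SHORT_CALL_EXCLUSION, *exclusions]]
    lines.append(
        "Be inclusive: if there's reasonable evidence the scenario occurred, return TRUE."
    )
    return "\n".join(lines)


trigger_prompt = build_trigger_prompt(
    "a booking request",
    ["Caller asks to schedule or reschedule an appointment"],
    ["Emergency-flow transfers (covered by Emergency metric)"],
)
```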
Trigger Produces N/A Output
When `evaluation_trigger: "custom"` is set and the trigger returns false, the metric outputs N/A — it is not evaluated. This means even binary metrics (True/False) can have three outcomes: True, False, or N/A. This is correct behavior for conditional metrics.
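Because a binary metric can produce three outcomes, any reporting on top of it should keep N/A separate from failures. The sketch below assumes results are exported as `True`, `False`, or `None` (with `None` standing for N/A); that representation and the `summarize_outcomes` helper are assumptions for illustration.

```python
def summarize_outcomes(results):
    """Summarize True/False/None results, where None stands for N/A.

    N/A calls are excluded from the pass rate instead of counted as failures.
    """
    passed = sum(1 for r in results if r is True)
    failed = sum(1 for r in results if r is False)
    not_applicable = sum(1 for r in results if r is None)
    evaluated = passed + failed
    pass_rate = passed / evaluated if evaluated else None
    return {
        "passed": passed,
        "failed": failed,
        "na": not_applicable,
        "pass_rate": pass_rate,
    }


stats = summarize_outcomes([True, True, False, None, None])
```

Here two of five calls were gated out by the trigger, so the pass rate is computed over the three evaluated calls only.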
Key Patterns
VALID_SKIP Pattern
Use this for legitimate deviations where the metric should not apply (tool failures, user hangup before flow starts, caller requesting transfer immediately). The LLM returns an explanation starting with "VALID_SKIP:" and the custom_code wrapper converts it to `_result = None`.
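A minimal sketch of the wrapper side of this pattern follows. `apply_valid_skip` and its `(result, explanation)` return shape are hypothetical names chosen for illustration; in a real custom_code metric the first value would be what gets assigned to `_result`.

```python
VALID_SKIP_PREFIX = "VALID_SKIP:"


def apply_valid_skip(llm_result, llm_explanation):
    """Return (result, explanation), mapping VALID_SKIP explanations to N/A.

    None stands for N/A, mirroring `_result = None` in the wrapper.
    """
    explanation = llm_explanation.strip()
    if explanation.startswith(VALID_SKIP_PREFIX):
        return None, explanation
    return llm_result, llm_explanation


result, explanation = apply_valid_skip(
    False, "VALID_SKIP: caller disconnected before booking could begin."
)
```

Checking the prefix in code, instead of trusting the LLM's TRUE/FALSE value, keeps a legitimate skip from being recorded as a failure.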
Gated Metrics
Access upstream metric results via `data.get("Exact Metric Name")`. The key must match the upstream metric's `name` field exactly. Use this to branch evaluation logic based on prior classification.
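The gating pattern can be sketched as follows. This assumes `data` behaves like a dict keyed by upstream metric name; the metric name "Call Intent", its values, and the `run_llm_judge` callable are invented for illustration, and the callable stands in for the real LLM evaluation because the signature of `evaluate_basic_metric()` is not shown here.

```python
def gated_evaluate(data, run_llm_judge):
    """Branch on an upstream classification before spending an LLM call.

    `data` is assumed to map upstream metric names to their results.
    `run_llm_judge` is a stand-in callable for the real LLM evaluation.
    """
    # The key must match the upstream metric's name exactly (hypothetical name).
    intent = data.get("Call Intent")
    if intent is None:
        return None, "Upstream metric 'Call Intent' missing; treated as N/A."
    if intent != "booking":
        return None, f"Not a booking call (intent={intent}); treated as N/A."
    return run_llm_judge("Evaluate the booking flow only.")


def fake_judge(prompt):
    # Stand-in so the sketch runs without network access.
    return True, "Booking flow followed."


booking = gated_evaluate({"Call Intent": "booking"}, fake_judge)
billing = gated_evaluate({"Call Intent": "billing"}, fake_judge)
typo = gated_evaluate({"call intent": "booking"}, fake_judge)
```

The `typo` case shows why the exact-name rule matters: a key that differs only in capitalization silently looks like a missing upstream result.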
Pythonic Section Extraction
Extract only relevant sections from the agent description before passing it to the LLM. This prevents context drift from irrelevant description sections and reduces token usage. See `references/pythonic-patterns.md` for the extraction utility.
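The referenced utility is not reproduced here, but the idea can be sketched under one assumption: that the agent description marks sections with markdown-style `## Heading` lines. `extract_sections` is a hypothetical helper that keeps sections whose heading mentions a concept keyword, so nothing is tied to one agent's exact section names.

```python
import re


def extract_sections(description, wanted_keywords):
    """Keep only sections whose heading mentions one of the keywords.

    Assumes the agent description uses markdown-style "## Heading" lines.
    """
    # Split before each "## " heading that starts a line, keeping the heading.
    parts = re.split(r"(?m)^(?=## )", description)
    kept = []
    for part in parts:
        if not part.startswith("## "):
            continue  # preamble before the first heading
        heading = part.splitlines()[0].lower()
        if any(keyword.lower() in heading for keyword in wanted_keywords):
            kept.append(part.strip())
    return "\n\n".join(kept)


description = (
    "You are a clinic assistant.\n"
    "## Booking Flow\nCollect name, then offer slots.\n"
    "## Emergency Flow\nTransfer immediately.\n"
)
booking_only = extract_sections(description, ["booking"])
```

An empty return value is a useful signal in its own right: the description lacks the relevant section, which is one of the N/A conditions for optional workflows.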
N/A Conditions
Check first for conditions where the metric should not apply:
- Immediate transfer/human request within first 1-2 exchanges
- Caller hangup before flow begins
- Out-of-scope caller (wrong number, sales call)
- Infrastructure failure preventing flow execution
- Agent description lacks the relevant section (for optional workflows)
Dynamic Variable-Driven Generalized Metrics
For clients that inject per-call `dynamic_variables` (e.g., per-node system prompts, feature flags, employment types), create metrics that adapt to each call instead of hardcoding behavior. Pattern: one metric per injected prompt variable. Each metric references ONLY its specific `{{dynamic_variables.promptName}}`, not the full blob or `{{agent.description}}`.
See `references/advanced-patterns.md` for the example prompt structure and the discovery workflow for finding dynamic variables in real calls.
Tool Call Hallucination Metrics
For agents with detailed tool definitions, build a metric that evaluates whether the agent called the correct tool for each situation — "action hallucination" (wrong action) vs "fact hallucination" (wrong information). Pattern: extract tool→scenario mapping from agent description, encode as explicit FAILURE CONDITIONS (closed list), DO NOT FLAG API errors / known quirks.
See `references/advanced-patterns.md` for the full prompt structure and TOOL-TO-SCENARIO MAPPING template.
Baseline Metrics — Always Recommend
Every agent should have at minimum these predefined metrics enabled for both observability and simulations:
| Metric | Purpose | Why It Matters |
|---|---|---|
| Expected Outcome | Checks if the agent achieved the scenario's expected result | Without this, runs pass/fail based only on call completion — not correctness |
| Infrastructure Issues | Flags silent periods, connection drops, agent non-response | Catches issues like agent going silent for 10+ seconds that aren't visible in pass/fail |
| Tool Call Success | Monitors whether tool calls succeed or fail | Requires provider integration (assistant IDs + API keys) to get toolcall data in transcripts |
| Latency | Measures response time | Identifies performance degradation |
Two-step activation required: Metrics must be (1) toggled on for simulations at the project level AND (2) added to individual evaluators. Missing either step means metrics won't fire. Without metrics enabled, users get false passes and must manually review every run.
Expected Outcome is transcript-only — it cannot evaluate audio-layer behavior. Expected Outcome reads the conversation text to determine whether the agent achieved its goal. It has no visibility into silences, interruptions, barge-ins, audio dropouts, or other voice-channel signals. Do not rely on Expected Outcome to catch these. For anything that depends on the audio stream rather than conversation content, use the other predefined metrics instead (for example, Infrastructure Issues for silent periods and connection drops).
Toolcall data prerequisite: Tool Call Success and advanced monitoring require the agent to have its provider assistant ID configured on Cekura and complete call data being sent. If transcripts are missing toolcall data, recommend the user configure their provider integration.
Output Requirements
Every metric prompt must require the following in its output:
- Brief explanation of the result (what happened and why)
- For failures: specific timestamps in MM:SS format where violations occurred
- For metadata-based checks: reference the specific metadata fields examined
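These requirements are easy to spot-check when reviewing metric output. The sketch below assumes timestamps are offsets in seconds from the start of the call; `to_mmss` and `failure_has_timestamp` are hypothetical helpers for illustration.

```python
import re

# Two or more minute digits, a colon, then seconds from 00 to 59.
MMSS = re.compile(r"\b\d{2,}:[0-5]\d\b")


def to_mmss(seconds):
    """Format a non-negative offset in seconds as MM:SS (minutes may exceed 59)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def failure_has_timestamp(explanation):
    """True if the explanation cites at least one MM:SS timestamp."""
    return MMSS.search(explanation) is not None


stamp = to_mmss(161.7)
ok = failure_has_timestamp(f"Agent skipped verification at {stamp}.")
missing = failure_has_timestamp("Agent skipped verification.")
```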
Common Custom Metrics Worth Suggesting
Beyond the baseline predefined metrics, these are commonly valuable custom metrics based on patterns seen across clients:
- Question stacking / information dumping — Agent asking 3+ unrelated questions in one turn or dumping large blocks of information. Poor UX that overwhelms callers.
- Workflow adherence — Agent follows the defined flow steps in order (booking, verification, cancellation, etc.)
- Soft skills — Tone, empathy, appropriate greetings, not exposing system internals
- Business context accuracy — Agent provides correct business information (hours, addresses, pricing)
- Transfer/callback handling — Agent follows proper protocol when transferring or scheduling callbacks
Operational Rules
Cost Guard — Never Evaluate >100 Calls Without Confirmation
Each evaluation costs the client real money. Before evaluating metrics on a batch of calls, ALWAYS query the call count first (use `page_size=1` and read the response) and report the number to the user. If count > 100, stop and ask for explicit approval before proceeding. Use the `page_size` parameter (up to 200) instead of paginating, and use server-side filters (`agent_id`, `project`, `timestamp__gte`/`timestamp__lte`) to scope calls.
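The guard itself is simple enough to write down. In this sketch, `fetch_call_count` is a hypothetical callable that wraps the `page_size=1` listing request and returns the total number of matching calls; how that total is read from the response is not specified here and is left to the caller.

```python
MAX_CALLS_WITHOUT_APPROVAL = 100


def needs_approval(call_count, limit=MAX_CALLS_WITHOUT_APPROVAL):
    """True when the batch is larger than the limit and must be confirmed first."""
    return call_count > limit


def plan_evaluation(fetch_call_count, user_approved=False):
    """Decide whether a batch evaluation may proceed.

    `fetch_call_count` is a hypothetical callable that wraps the
    page_size=1 listing request and returns the total number of calls.
    """
    count = fetch_call_count()
    if needs_approval(count) and not user_approved:
        return {
            "proceed": False,
            "count": count,
            "message": f"{count} calls match; explicit approval needed.",
        }
    return {"proceed": True, "count": count, "message": f"Evaluating {count} calls."}


small_batch = plan_evaluation(lambda: 100)
large_batch = plan_evaluation(lambda: 101)
approved_batch = plan_evaluation(lambda: 250, user_approved=True)
```

The count is always reported in the returned message, so the user sees the number whether or not approval is required.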
Manual Fix First, Then Labs
When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:
- Read failure explanations and categorize root causes (e.g., extra_questions, end_protocol, should_be_na)
- Write manual prompt fixes targeting the dominant failure categories
- PATCH the updated descriptions via API
- Re-evaluate a sample of 20-30 calls per metric to validate the fixes
- THEN use labs feedback for remaining edge cases that manual fixes didn't catch
This avoids wasting labs iterations on issues that are clearly fixable by prompt editing.
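The categorization step can be kept lightweight. The sketch below assumes a reviewer has already read each failure explanation and assigned one root-cause label per call; the call IDs and the `dominant_categories` helper are illustrative, and the labels reuse the examples above.

```python
from collections import Counter


def dominant_categories(labeled_failures, top_n=2):
    """Rank root-cause labels by frequency.

    `labeled_failures` is a list of (call_id, category) pairs produced by
    reading each failure explanation; labels are assigned by a reviewer.
    """
    counts = Counter(category for _, category in labeled_failures)
    return counts.most_common(top_n)


# Hand-written sample labels, not real results.
labeled = [
    ("call-1", "extra_questions"),
    ("call-2", "should_be_na"),
    ("call-3", "extra_questions"),
    ("call-4", "end_protocol"),
    ("call-5", "extra_questions"),
    ("call-6", "should_be_na"),
]
ranking = dominant_categories(labeled)
```

The top categories are the ones worth a manual prompt fix; the long tail is what the labs feedback cycle is for.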
Common Pitfalls
- Writing metrics without reading real transcripts first — always fetch and study actual transcript_json before writing
- Putting the prompt in the `prompt` field instead of `description` for llm_judge
- Using deprecated types (`basic`, `custom_prompt`) — API returns 400
- Using `custom_code` for checks the LLM can handle naturally (timestamps, tool call detection)
- Not matching upstream metric name exactly for gated metrics
- Passing full agent description when only a section is relevant (context drift)
- Missing VALID_SKIP handling in custom_code metrics
- No N/A conditions for conditional metrics
- Taking agent description instructions literally instead of capturing their spirit
- Not including safeguarding examples for nuanced evaluation criteria
- Omitting timestamps in failure explanations
Next Steps
After creating a metric, the user typically needs:
- Validate it on real calls → use the evaluate-calls flow (see `references/api-reference.md`)
- Iterate on accuracy → invoke cekura-metric-improvement to run the labs feedback cycle
- Design test scenarios that exercise this metric → invoke cekura-eval-design
Documentation
- Public docs: https://docs.cekura.ai
- LLM-friendly docs: https://docs.cekura.ai/llms.txt
- Concepts: https://docs.cekura.ai/documentation/key-concepts/
See `references/api-reference.md` for complete endpoint documentation and field schemas.
Additional Resources
Reference Files (loaded on demand)
- `references/prompt-patterns.md` — Full LLM judge prompt templates with real examples
- `references/pythonic-patterns.md` — Section extraction utility and custom_code patterns
- `references/advanced-patterns.md` — Anti-cross-pollination scoping, dynamic-variable-driven metrics, tool-call hallucination metrics
- `references/api-reference.md` — Complete Cekura metrics API endpoints and schemas
Example Files
- `examples/llm-judge-metric.md` — Complete llm_judge metric example (sectioned structure)
- `examples/narrative-metric.md` — Complete llm_judge metric example (narrative structure)
- `examples/custom-code-metric.py` — Complete custom_code metric with gating and VALID_SKIP
- `examples/section-extraction-metric.py` — Pythonic metric with agent description scoping