cekura-metric-design

Cekura Metric Design

Purpose

Guide the creation of effective Cekura metrics that accurately evaluate AI voice agent call quality. Metrics measure call quality after the fact by evaluating transcripts against defined criteria. Each metric targets a specific workflow or KPI that needs tracking per call.

Performing Platform Actions

When this skill suggests creating, listing, updating, or evaluating something on Cekura, prefer using available platform tools over describing API calls or dashboard steps. In Claude Code with the Cekura plugin installed, these tools are auto-configured and handle authentication, parameter validation, and error handling for you. Fall back to direct API endpoints or dashboard guidance only when no tools are available in the current session.

Core Terminology

Main agent: The client's AI voice agent being tested
Testing agent: Cekura's simulated caller that exercises the main agent
Metric: A post-call evaluation that scores a transcript
Evaluator/Scenario: A test case that simulates a caller (separate concept — see cekura-eval-design skill)

The Metric Creation Workflow

Follow this workflow every time. Skipping steps (especially step 2) leads to metrics that miss edge cases.

1. Gather context: Understand the client's use case and what they care about, and get sample conversation IDs with expected outcomes.
2. Fetch real transcripts: Pull 3-5 actual `transcript_json` records from the Cekura API. Study which roles appear (`Main Agent`, `User`, `Function Call`, `Function Call Result`), what timestamps are available, how tool calls are structured, and what the conversation flow looks like in practice. Metrics written without reading real data will miss edge cases.
3. Identify the signal: What specific thing in the transcript indicates pass vs fail? A tool call, a timestamp gap, a phrase, a behavioral pattern?
4. Write the prompt: Use proven structures (see below), grounded in what the real transcripts look like.
5. Deploy and test: Create the metric via API, run it on sample conversations, and compare to expected outcomes.
6. Iterate: Adjust the prompt based on results, re-run, and repeat until the metric matches expectations on all samples. Plan for at least one iteration, because the first run reveals measurement issues.
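
The transcript study in step 2 can be sketched as a quick survey run before any prompt is written. The record shape below (a list of turns with `role`, `content`, and `start_time` keys) is an assumption for illustration; only the four role labels come from this guide, so check the key names against real `transcript_json` data.

```python
from collections import Counter

# Hypothetical transcript_json record: the key names are assumptions,
# only the four role labels come from the workflow above.
sample_transcript = [
    {"role": "Main Agent", "content": "Thanks for calling. How can I help?", "start_time": 0.0},
    {"role": "User", "content": "I need to book an appointment.", "start_time": 3.2},
    {"role": "Function Call", "content": "check_availability", "start_time": 5.0},
    {"role": "Function Call Result", "content": "{\"slots\": [\"Thu 10:00\"]}", "start_time": 6.1},
    {"role": "Main Agent", "content": "Thursday at 10 is open.", "start_time": 7.4},
]


def survey_transcript(turns):
    """Summarize which roles appear and where tool calls sit in the flow."""
    role_counts = Counter(turn["role"] for turn in turns)
    tool_call_positions = [
        index for index, turn in enumerate(turns) if turn["role"] == "Function Call"
    ]
    has_timestamps = all("start_time" in turn for turn in turns)
    return {
        "role_counts": dict(role_counts),
        "tool_call_positions": tool_call_positions,
        "has_timestamps": has_timestamps,
    }


summary = survey_transcript(sample_transcript)
print(summary)
```

Running a survey like this over each sampled record makes it easier to spot which signals (tool calls, timing gaps) are actually present before choosing what the metric should look for.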

Metric Types

llm_judge (preferred default)

An LLM evaluates the prompt in the `description` field against the call transcript. Prefer `llm_judge` over `custom_code`. `custom_code` seems appealing for "objective" checks (timestamps, tool call presence) but is brittle in practice. Voice AI transcripts have messy timing: agents transfer mid-tool-chain, background tasks complete after speech resumes, and timestamps overlap. An LLM reading the transcript handles these nuances naturally. Express measurements in natural language, not code.

Critical: The evaluation prompt goes in the `description` field, NOT the `prompt` field.
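
The field placement can be sketched as a small payload builder. Only `description`, `prompt`, the `llm_judge` type, and the eval type names come from this guide; the other key names (`name`, `type`, `eval_type`) and the function itself are assumptions for illustration, so confirm the real schema in `references/api-reference.md`.

```python
def build_llm_judge_payload(name, evaluation_prompt, eval_type="binary_qualitative"):
    """Build a metric payload with the evaluation prompt in `description`.

    The key names `name`, `type`, and `eval_type` are assumed, not confirmed.
    """
    if not evaluation_prompt.strip():
        raise ValueError("evaluation prompt must not be empty")
    return {
        "name": name,
        "type": "llm_judge",
        "eval_type": eval_type,
        # The evaluation prompt belongs here, not in a `prompt` key.
        "description": evaluation_prompt,
    }


payload = build_llm_judge_payload(
    "Booking Flow Adherence",
    "Evaluate whether the agent followed the booking flow. Return TRUE or FALSE.",
    eval_type="binary_workflow_adherence",
)
print(sorted(payload))
```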

custom_code (Python on Lambda)

Python code in the `custom_code` field. It has access to the `data` dict, `evaluate_basic_metric()`, and upstream metric results. Reserve it for cases that genuinely need programmatic logic.

Use only when:

Gating on upstream metric results (`data.get("Exact Metric Name")`)
Section extraction from agent description (Pythonic pattern)
Multiple LLM calls with different prompts based on conditions
N/A short-circuiting before calling the LLM

Metric Evolution Path

Start as `llm_judge` for rapid iteration. Once the prompt is validated (through labs/feedback), convert to `custom_code` with section extraction for production. Cekura allows a metric to have both `description` (the llm_judge prompt) AND `custom_code`; the active type is toggled. This means the LLM prompt can be refined through labs, and the custom_code version then uses that same prompt with targeted context extraction.

Eval Types

Eval Type	Output	Use For
`binary_qualitative`	TRUE/FALSE	Soft skills, quality assessments
`binary_workflow_adherence`	TRUE/FALSE	Flow compliance checks
`enum`	String from defined values	Classification tasks
`numeric`	Float score	Scoring tasks
`continuous_qualitative`	Continuous score	Continuous quality assessment

LLM Judge Prompt Structure

Two proven structures exist. See `references/prompt-patterns.md` for full templates.

Structure A: Sectioned (best for multi-criteria metrics)

SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
DO NOT FLAG — Common false positives: behavioral patterns that look like fails but aren't for THIS metric
INPUTS — Only relevant template variables
SECTIONS — Numbered evaluation criteria with pass/fail examples
FAILURE CONDITIONS (Only These Count) — Narrow, closed list of what constitutes a failure
SAFEGUARDING NOTES — Spirit vs letter overrides
OUTPUT INSTRUCTIONS — Return format, timestamps for failures

Structure B: Narrative (best for behavioral/timing metrics)

SCOPE & FOCUS — What this metric evaluates ONLY + what to IGNORE
DO NOT FLAG — Common false positives specific to this metric
CONTEXT — What this call type looks like, why the metric matters
WHAT TO LOOK FOR — Specific items in the transcript (tool names, phrases, patterns)
FAILURE CONDITIONS (Only These Count) — Narrow, closed list of specific failure patterns
NUANCES — Edge cases, overrides, things that look like fails but aren't
OUTPUT — TRUE/FALSE/N/A with timestamps and evidence

Being explicit about PASS vs FAIL with real examples from actual conversations is the single most impactful thing for prompt quality. Vague criteria like "agent should be responsive" produce inconsistent results.
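
One way to keep prompts consistent is to assemble them from named sections. The sketch below follows the Structure A section order listed earlier; the function name and the sample section text are illustrative only, and the full templates live in `references/prompt-patterns.md`.

```python
SECTION_ORDER = [
    "SCOPE & FOCUS",
    "DO NOT FLAG",
    "INPUTS",
    "SECTIONS",
    "FAILURE CONDITIONS (Only These Count)",
    "SAFEGUARDING NOTES",
    "OUTPUT INSTRUCTIONS",
]


def build_sectioned_prompt(sections):
    """Join sections in Structure A order, refusing to skip any of them."""
    missing = [title for title in SECTION_ORDER if not sections.get(title, "").strip()]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    blocks = [f"{title}\n{sections[title].strip()}" for title in SECTION_ORDER]
    return "\n\n".join(blocks)


example_sections = {
    "SCOPE & FOCUS": "Evaluate question pacing ONLY. Ignore booking accuracy.",
    "DO NOT FLAG": "Two closely related questions asked together.",
    "INPUTS": "{{transcript_json}}",
    "SECTIONS": "1. Each agent turn asks about one topic.",
    "FAILURE CONDITIONS (Only These Count)": "Unrelated questions in a single turn.",
    "SAFEGUARDING NOTES": "A/B phrasing counts as a single question.",
    "OUTPUT INSTRUCTIONS": "Return TRUE or FALSE with MM:SS timestamps for failures.",
}

prompt_text = build_sectioned_prompt(example_sections)
print(prompt_text)
```

Raising on a missing section is a deliberate choice here: a prompt without DO NOT FLAG or FAILURE CONDITIONS tends to be the vague kind that produces inconsistent results.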


Anti-Cross-Pollination Scoping (when using `{{agent.description}}`)

The most common source of false failures: a metric using `{{agent.description}}` fails based on rules from an unrelated flow (e.g., an Emergency metric fails because of a Booking Flow rule).

Three-layer scoping pattern: SCOPE & FOCUS ("evaluates X ONLY"), DO NOT FLAG (enumerate false positives by behavioral pattern), FAILURE CONDITIONS (narrow closed list).

See `references/advanced-patterns.md` for the full structure and the rule that all scoping language must be concept-based, never hardcoded to a specific agent's section names.

Available Template Variables

Variable	Description
`{{transcript}}`	Full conversation text
`{{transcript_json}}`	Structured transcript with timestamps
`{{dynamic_variables}}`	Full blob of custom variables from calls
`{{dynamic_variables.keyName}}`	Specific dynamic variable by key (dot notation preferred)
`{{agent.description}}`	Main agent's system prompt
`{{metadata}}`	Call metadata
`{{call_end_reason}}`	How the call ended

Include only variables relevant to the specific metric. Listing all variables creates noise and dilutes evaluation focus.

When using `{{metadata}}`, point to the specific metadata fields the LLM judge should reference (e.g., "Check `metadata.appointment_id` to verify booking was created").
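
A prompt can be checked for misspelled or unnecessary variables before it is deployed. The following is a minimal sketch, assuming variables always use the double-brace form shown in the table; the helper name and the sample prompt are hypothetical.

```python
import re

KNOWN_VARIABLES = {
    "transcript",
    "transcript_json",
    "dynamic_variables",
    "agent.description",
    "metadata",
    "call_end_reason",
}

VARIABLE_PATTERN = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def check_template_variables(prompt):
    """Return (used, unknown) variable names found in a metric prompt."""
    used = sorted(set(VARIABLE_PATTERN.findall(prompt)))
    unknown = [
        name
        for name in used
        if name not in KNOWN_VARIABLES and not name.startswith("dynamic_variables.")
    ]
    return used, unknown


used, unknown = check_template_variables(
    "Read {{transcript_json}} and {{dynamic_variables.promptName}}; ignore {{agent_description}}."
)
print(used, unknown)
```

In this example `agent_description` is reported as unknown because the documented name uses a dot, which is the kind of typo that would otherwise leave the variable unfilled.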

The Spirit vs Letter Principle

This is the most critical concept in metric design.

Agent descriptions describe the intended functionality of the main agent, but must not be taken literally by the evaluator. Understand the intent behind each instruction and write the metric to capture the spirit, not the literal text.

Example: Agent description says "ask only 1 question at a time"

Spirit: Prevent cognitive overload on the caller
Literal (wrong): Fail any turn with more than one question mark
Correct metric behavior:
- PASS: "Are you the owner of 123 Easy St? Can I get your name?" (related data cluster)
- PASS: "Is this a new issue, or an existing one?" (A/B rephrasing = single question)
- FAIL: "Does Thursday work? Also, did you get our text message?" (unrelated questions)

When uncertain about the intent behind an instruction, ask the user to clarify before encoding it into the metric. Include explicit safeguarding examples in the prompt showing what should and should not be penalized.

Trigger Design

Trigger Type	When to Use
`"always"`	Metrics that apply to every call (soft skills, business context)
`"custom"` with `llm_judge` trigger	Conditional metrics (booking flow only fires when booking intent detected)
`"custom"` with `custom_code` trigger	Complex trigger logic requiring code
`"automatic"`	Let Cekura auto-determine (less control)

Use the `generate_evaluation_trigger` endpoint (see `references/api-reference.md`) to auto-generate trigger prompts from metric descriptions. Triggers can be layered in specificity: for example, one trigger fires on any onboarding call, while another fires only when the user gets stuck.

Two-Layer N/A Strategy

Triggers and metric descriptions handle N/A at different levels:

Trigger-level N/A (first defense): The trigger gates out obviously irrelevant calls BEFORE the metric runs. This saves LLM cost. Example: a Booking Flow metric's trigger checks if booking intent exists — if not, the metric doesn't fire and outputs N/A.
Description-level N/A (nuanced cases): The metric prompt itself handles edge cases that need transcript context to determine. Example: a call has booking intent (trigger fires) but the caller hangs up before the flow starts — the metric description returns N/A with "VALID_SKIP: caller disconnected before booking could begin."

Design triggers to catch the obvious non-applicable calls; design the metric prompt to handle the nuanced edge cases that require reading the transcript.

Trigger Prompt Template

Write triggers with the positive-then-negative pattern:

Evaluate whether this call involves [specific scenario].

Return TRUE if ANY of these indicators are present:
- [Positive indicator 1]
- [Positive indicator 2]

Do NOT trigger if ANY of these apply:
- Call is under 30 seconds or contains no substantive interaction beyond a greeting
- Line disconnection / voicemail / outbound non-engagement
- [Specific exclusion for this metric — e.g., "Emergency-flow transfers (covered by Emergency metric)"]
- [Another exclusion]

Be inclusive — if there's reasonable evidence the scenario occurred, return TRUE.

Always include the short-call exclusion. Calls under ~30 seconds (hang-ups, wrong numbers, voicemails) produce false positives/negatives on every metric. Gate them out at the trigger level.
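
The template above can be filled in programmatically so the short-call exclusion is never forgotten. This is a sketch under that assumption; the function name and the sample scenario are hypothetical, and the output is plain prompt text rather than anything the API requires.

```python
SHORT_CALL_EXCLUSIONS = [
    "Call is under 30 seconds or contains no substantive interaction beyond a greeting",
    "Line disconnection / voicemail / outbound non-engagement",
]


def build_trigger_prompt(scenario, positive_indicators, extra_exclusions=()):
    """Render the positive-then-negative trigger template as plain text."""
    if not positive_indicators:
        raise ValueError("at least one positive indicator is required")
    exclusions = SHORT_CALL_EXCLUSIONS + list(extra_exclusions)
    lines = [
        f"Evaluate whether this call involves {scenario}.",
        "",
        "Return TRUE if ANY of these indicators are present:",
        *[f"- {item}" for item in positive_indicators],
        "",
        "Do NOT trigger if ANY of these apply:",
        *[f"- {item}" for item in exclusions],
        "",
        "Be inclusive: if there's reasonable evidence the scenario occurred, return TRUE.",
    ]
    return "\n".join(lines)


trigger_prompt = build_trigger_prompt(
    "a booking request",
    ["Caller asks to schedule or reschedule an appointment"],
    ["Emergency-flow transfers (covered by Emergency metric)"],
)
print(trigger_prompt)
```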

Trigger Produces N/A Output

When `evaluation_trigger: "custom"` is set and the trigger returns false, the metric outputs N/A; it is not evaluated. This means even binary metrics (True/False) can have three outcomes: True, False, or N/A. This is correct behavior for conditional metrics.
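
Because a binary metric has three outcomes, any pass-rate calculation needs to leave N/A calls out of the denominator. A minimal sketch, assuming N/A is represented as `None` in exported results (an assumption about the export shape, not a documented format):

```python
def summarize_binary_results(results):
    """Count True/False/N/A and compute the pass rate over applicable calls.

    `results` uses True, False, or None (None standing for N/A).
    """
    passed = sum(1 for value in results if value is True)
    failed = sum(1 for value in results if value is False)
    not_applicable = sum(1 for value in results if value is None)
    applicable = passed + failed
    pass_rate = passed / applicable if applicable else None
    return {"passed": passed, "failed": failed, "na": not_applicable, "pass_rate": pass_rate}


stats = summarize_binary_results([True, True, False, None, None])
print(stats)
```

Counting the two N/A calls as failures here would report a 40% pass rate instead of two out of three, which misstates how the agent performed on the calls the metric actually applies to.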

Key Patterns

VALID_SKIP Pattern

For legitimate deviations where the metric should not apply (tool failures, user hangup before the flow starts, caller requesting a transfer immediately). The LLM returns an explanation starting with "VALID_SKIP:" and the custom_code wrapper converts it to `_result = None`.

Gated Metrics

Access upstream metric results via `data.get("Exact Metric Name")`. The key must match the upstream metric's `name` field exactly. Use this to branch evaluation logic based on a prior classification.

Pythonic Section Extraction

Extract only the relevant sections from the agent description before passing it to the LLM. This prevents context drift from irrelevant description sections and reduces token usage. See `references/pythonic-patterns.md` for the extraction utility.
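
The idea can be illustrated with a small stand-in for that utility. This is not the utility from `references/pythonic-patterns.md`; it is a sketch that assumes the agent description uses Markdown-style headings and selects sections by keyword, so it stays concept-based rather than tied to one agent's exact section names.

```python
import re


def extract_sections(agent_description, keywords):
    """Return the sections whose heading mentions any of the keywords.

    Assumes sections start with Markdown-style headings such as "## Booking".
    """
    # Split before each heading line while keeping the heading with its body.
    chunks = re.split(r"(?m)^(?=#{1,6} )", agent_description)
    selected = []
    for chunk in chunks:
        lines = chunk.strip().splitlines()
        if not lines or not lines[0].startswith("#"):
            continue
        heading = lines[0].lower()
        if any(keyword.lower() in heading for keyword in keywords):
            selected.append(chunk.strip())
    return "\n\n".join(selected)


description = """# Agent
You are a scheduling assistant.

## Booking Flow
Collect name, then date, then confirm.

## Emergency Handling
Transfer immediately to a human.
"""

booking_only = extract_sections(description, ["booking"])
print(booking_only)
```

An empty return value is a useful signal in its own right: if the description has no matching section, the metric can return N/A instead of judging against unrelated rules.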

N/A Conditions

Check first for conditions where the metric should not apply:

Immediate transfer/human request within first 1-2 exchanges
Caller hangup before flow begins
Out-of-scope caller (wrong number, sales call)
Infrastructure failure preventing flow execution
Agent description lacks the relevant section (for optional workflows)

Dynamic Variable-Driven Generalized Metrics

For clients that inject per-call `dynamic_variables` (e.g., per-node system prompts, feature flags, employment types), create metrics that adapt to each call instead of hardcoding behavior. Pattern: one metric per injected prompt variable. Each metric references ONLY its specific `{{dynamic_variables.promptName}}`, not the full blob or `{{agent.description}}`.

See `references/advanced-patterns.md` for the example prompt structure and the discovery workflow for finding dynamic variables in real calls.

Tool Call Hallucination Metrics

For agents with detailed tool definitions, build a metric that evaluates whether the agent called the correct tool for each situation — "action hallucination" (wrong action) vs "fact hallucination" (wrong information). Pattern: extract tool→scenario mapping from agent description, encode as explicit FAILURE CONDITIONS (closed list), DO NOT FLAG API errors / known quirks.

See `references/advanced-patterns.md` for the full prompt structure and TOOL-TO-SCENARIO MAPPING template.


Baseline Metrics — Always Recommend


Every agent should have at minimum these predefined metrics enabled for both observability and simulations:

Metric	Purpose	Why It Matters
Expected Outcome	Checks if the agent achieved the scenario's expected result	Without this, runs pass/fail based only on call completion — not correctness
Infrastructure Issues	Flags silent periods, connection drops, agent non-response	Catches issues like agent going silent for 10+ seconds that aren't visible in pass/fail
Tool Call Success	Monitors whether tool calls succeed or fail	Requires provider integration (assistant IDs + API keys) to get toolcall data in transcripts
Latency	Measures response time	Identifies performance degradation

Two-step activation required: Metrics must be (1) toggled on for simulations at the project level AND (2) added to individual evaluators. Missing either step means metrics won't fire. Without metrics enabled, users get false passes and must manually review every run.

Expected Outcome is transcript-only: it cannot evaluate audio-layer behavior. Expected Outcome reads the conversation text to determine whether the agent achieved its goal. It has no visibility into silences, interruptions, barge-ins, audio dropouts, or other voice-channel signals, so do not rely on it to catch these. For anything that depends on the audio stream rather than conversation content, use the other predefined metrics instead (such as Infrastructure Issues, which flags silent periods and connection drops).

Toolcall data prerequisite: Tool Call Success and advanced monitoring require the agent to have its provider assistant ID configured on Cekura and complete call data being sent. If transcripts are missing toolcall data, recommend the user configure their provider integration.

Output Requirements

All metrics must require:

Brief explanation of the result (what happened and why)
For failures: specific timestamps in MM:SS format where violations occurred
For metadata-based checks: reference the specific metadata fields examined
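
When a custom_code metric builds its own explanation, the MM:SS requirement can be met with a small formatter. This sketch assumes transcript timestamps are offsets in seconds from the start of the call, which should be verified against real `transcript_json` data.

```python
def to_mm_ss(seconds):
    """Format an offset in seconds as MM:SS, truncating fractions."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


print(to_mm_ss(83.7))
```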

Common Custom Metrics Worth Suggesting

Beyond the baseline predefined metrics, these are commonly valuable custom metrics based on patterns seen across clients:

Question stacking / information dumping — Agent asking 3+ unrelated questions in one turn or dumping large blocks of information. Poor UX that overwhelms callers.
Workflow adherence — Agent follows the defined flow steps in order (booking, verification, cancellation, etc.)
Soft skills — Tone, empathy, appropriate greetings, not exposing system internals
Business context accuracy — Agent provides correct business information (hours, addresses, pricing)
Transfer/callback handling — Agent follows proper protocol when transferring or scheduling callbacks

Operational Rules

Cost Guard — Never Evaluate >100 Calls Without Confirmation


Each evaluation costs the client real money. Before evaluating metrics on a batch of calls, ALWAYS query the call count first (use `page_size=1` and read the response) and report the number to the user. If the count is greater than 100, stop and ask for explicit approval before proceeding. Use the `page_size` parameter (up to 200) instead of paginating, and use server-side filters (`agent_id`, `project`, `timestamp__gte`, `timestamp__lte`) to scope calls.
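
The guard itself is simple enough to express as a pure function. In this sketch the parameter names `page_size`, `agent_id`, `timestamp__gte`, and `timestamp__lte` come from this guide; the helper names, the timestamp format, and how the count is read from the response are assumptions to confirm in `references/api-reference.md`.

```python
CONFIRMATION_THRESHOLD = 100


def cost_guard(call_count, user_approved=False):
    """Decide whether a batch evaluation may proceed."""
    if call_count < 0:
        raise ValueError("call_count must be non-negative")
    if call_count > CONFIRMATION_THRESHOLD and not user_approved:
        return False, f"{call_count} calls matched; ask the user for explicit approval first."
    return True, f"{call_count} calls matched; proceeding."


def build_count_query(agent_id, start, end):
    """Query parameters for counting calls; page_size=1 keeps the request cheap."""
    return {
        "agent_id": agent_id,
        "timestamp__gte": start,
        "timestamp__lte": end,
        "page_size": 1,
    }


allowed, message = cost_guard(250)
print(allowed, message)
```

Keeping the threshold check separate from the API call means the rule can be tested without touching the platform, and the count is always reported to the user in the returned message.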

Manual Fix First, Then Labs

When metrics have systemic issues (high false-fail rates), do NOT jump straight to labs feedback. Instead:

Read failure explanations and categorize root causes (e.g., extra_questions, end_protocol, should_be_na)
Write manual prompt fixes targeting the dominant failure categories
PATCH the updated descriptions via API
Re-evaluate a sample of 20-30 calls per metric to validate the fixes
THEN use labs feedback for remaining edge cases that manual fixes didn't catch

This avoids wasting labs iterations on issues that are clearly fixable by prompt editing.
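
The first step, categorizing root causes, can start as a rough keyword pass over the failure explanations. The category names below come from the example in the list above; the keyword rules and sample explanations are hypothetical and would need to be derived from reading real failures.

```python
from collections import Counter

# Hypothetical keyword rules; derive real ones from reading the explanations.
CATEGORY_KEYWORDS = {
    "extra_questions": ["multiple questions", "asked two"],
    "end_protocol": ["closing", "goodbye"],
    "should_be_na": ["hung up", "disconnected"],
}


def categorize(explanation):
    """Assign the first matching root-cause category, else 'uncategorized'."""
    text = explanation.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def dominant_categories(explanations):
    """Rank root causes by frequency so fixes target the biggest ones first."""
    return Counter(categorize(text) for text in explanations).most_common()


ranking = dominant_categories([
    "Agent asked two unrelated things in one turn.",
    "Caller hung up before the flow began.",
    "User disconnected during greeting.",
    "Agent skipped the closing statement.",
])
print(ranking)
```

A large "uncategorized" bucket is a sign the keyword rules are too narrow and the explanations need another manual read before any prompt fix is written.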

Common Pitfalls

Writing metrics without reading real transcripts first — always fetch and study actual transcript_json before writing
Putting the prompt in the `prompt` field instead of `description` for llm_judge
Using deprecated types (`basic`, `custom_prompt`); the API returns 400
Using `custom_code` for checks the LLM can handle naturally (timestamps, tool call detection)
Not matching upstream metric name exactly for gated metrics
Passing full agent description when only a section is relevant (context drift)
Missing VALID_SKIP handling in custom_code metrics
No N/A conditions for conditional metrics
Taking agent description instructions literally instead of capturing their spirit
Not including safeguarding examples for nuanced evaluation criteria
Omitting timestamps in failure explanations

Next Steps

After creating a metric, the user typically needs:

Validate it on real calls → use the evaluate-calls flow (see `references/api-reference.md`)
Iterate on accuracy → invoke cekura-metric-improvement to run the labs feedback cycle
Design test scenarios that exercise this metric → invoke cekura-eval-design

Documentation

Public docs: https://docs.cekura.ai
LLM-friendly docs: https://docs.cekura.ai/llms.txt
Concepts: https://docs.cekura.ai/documentation/key-concepts/

See `references/api-reference.md` for complete endpoint documentation and field schemas.

Additional Resources

Reference Files (loaded on demand)

`references/prompt-patterns.md`: Full LLM judge prompt templates with real examples
`references/pythonic-patterns.md`: Section extraction utility and custom_code patterns
`references/advanced-patterns.md`: Anti-cross-pollination scoping, dynamic-variable-driven metrics, tool-call hallucination metrics
`references/api-reference.md`: Complete Cekura metrics API endpoints and schemas

Example Files

`examples/llm-judge-metric.md`: Complete llm_judge metric example (sectioned structure)
`examples/narrative-metric.md`: Complete llm_judge metric example (narrative structure)
`examples/custom-code-metric.py`: Complete custom_code metric with gating and VALID_SKIP
`examples/section-extraction-metric.py`: Pythonic metric with agent description scoping

