adk-evals
ADK Evals Skill
What are Evals?
Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.
Evals run against a live dev bot (`adk dev`), so they test the full stack — not mocks.
When to Use This Skill
Use this skill when the developer asks about:
- Writing evals — file format, assertions, turn types, setup
- Running evals — CLI commands, filtering, output interpretation
- Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
- The testing loop — write → run → inspect traces → iterate
- CI integration — exit codes, the `--format json` flag, tagging strategies
- Eval configuration — idleTimeout, judgePassThreshold, judgeModel
Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.
Trigger questions:
- "How do I write an eval?"
- "How do I test my workflow?"
- "How do I assert that a tool was called with specific params?"
- "My eval is failing, how do I debug it?"
- "How do I test that the bot stays silent?"
- "How do I run evals in CI?"
- "How do I seed state before an eval?"
- "How do I trigger a workflow in an eval?"
Available Documentation
| File | Contents |
|---|---|
| `eval-format.md` | Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |
How to Answer
- Writing an eval → Read `eval-format.md` for structure and assertions
- Running evals → Read `testing-workflow.md` for CLI commands and output
- Testing a specific primitive → Read `test-patterns.md` for the relevant section
- Debugging a failure → Combine `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)
Quick Reference
Eval file structure
```typescript
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],
  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },
  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],
  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },
  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
```

Turn types
| Turn | When to use |
|---|---|
| `user` | Standard user message |
| `event` | Non-message trigger (webhook, integration event) |
| `expectSilence` | Assert bot does NOT respond |
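All three turn types can appear in one conversation; an illustrative sketch, where the `Turn` type is hypothetical (inferred from this document's examples, not the ADK's own type):

```typescript
// Hypothetical Turn type, inferred from this document's examples.
type Turn =
  | { user: string; expectSilence?: boolean }
  | { event: { type: string }; expectSilence?: boolean }

const conversation: Turn[] = [
  { user: 'My payment bounced' },                       // standard user message
  { event: { type: 'payment.failed' } },                // non-message trigger
  { user: 'thanks, that is all', expectSilence: true }, // bot should NOT respond
]

// Each turn carries exactly one of user/event; expectSilence rides along.
console.log(conversation.length)
```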
Assertion categories
| Category | What it checks |
|---|---|
| `response` | Bot reply text (contains, matches, llm_judge, similar_to) |
| `tools` | Tool calls (called, not_called, call_order, params) |
| `state` | Bot/user/conversation state (equals, changed) |
| `tables` | Table rows (row_exists, row_count) |
| | Workflow execution (entered, completed) |
| | Response time in ms (lte, gte) |
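Tool operators can be combined in one `tools` array. A sketch: `called` and `params` follow the Critical Patterns examples below, while the `call_order` array shape is a guess from the operator name and is not shown anywhere in this document:

```typescript
// Sketch of combined tool assertions. `called` + `params` follow the
// Critical Patterns examples; the call_order array shape is hypothetical.
tools: [
  { called: 'lookupAccount' },
  { called: 'createTicket', params: { priority: { equals: 'high' } } },
  { call_order: ['lookupAccount', 'createTicket'] }, // hypothetical shape
]
```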
CLI commands
```bash
adk evals                     # run all evals
adk evals <name>              # run one eval
adk evals --tag <tag>         # filter by tag
adk evals --type regression   # filter by type
adk evals --verbose           # show all assertions
adk evals --format json       # JSON output for CI
adk evals runs                # list recent runs
adk evals runs --latest       # most recent run
adk evals runs --latest -v    # with full details
```
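For CI, `--format json` pairs with the nonzero exit code. If you post-process the JSON report, a gate can be sketched like this — note the `{ results: [{ name, passed }] }` shape is a hypothetical stand-in, not the real `adk` output schema:

```typescript
// Hypothetical CI gate over an eval report. The report shape below is an
// assumption for illustration; check the actual `adk evals --format json`
// output before relying on any field names.
interface EvalResult { name: string; passed: boolean }
interface Report { results: EvalResult[] }

// Return the names of all failed evals so CI can report and fail on them.
function failedEvals(report: Report): string[] {
  return report.results.filter((r) => !r.passed).map((r) => r.name)
}

const sample: Report = {
  results: [
    { name: 'greeting', passed: true },
    { name: 'billing-refund', passed: false },
  ],
}

console.log(failedEvals(sample)) // logs the failing eval names
```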
Critical Patterns
✅ Every turn needs `user` or `event`

```typescript
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

❌ `expectSilence` alone is not a valid turn

```typescript
// WRONG — missing user or event
{ expectSilence: true }
```

✅ Assert tool params to verify correct extraction

```typescript
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

❌ Only asserting the tool was called

```typescript
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
```

✅ Use `outcome` for post-conversation state and table assertions

```typescript
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
```

❌ Checking tables in per-turn assertions when the write happens at the end

```typescript
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
```

✅ Seed state to test conditional behavior without running setup turns

```typescript
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

❌ Using conversation turns to set up state (slow and fragile)

```typescript
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },     // hoping bot sets user.plan
  { user: 'I need help with billing' }, // actual test turn
]
```

Example Questions
Writing evals:
- "Write an eval that tests my createTicket tool is called with the right priority"
- "How do I assert that the bot stays silent after an internal event?"
- "How do I test a multi-turn conversation where context is retained?"
Running evals:
- "How do I run only regression evals?"
- "How do I see which assertions failed and why?"
- "How do I integrate evals into GitHub Actions?"
Debugging:
- "My eval says the tool wasn't called but I think it was — how do I check?"
- "How do I inspect what the bot actually did during an eval?"
Per-primitive:
- "How do I test a workflow that uses step.sleep()?"
- "How do I verify a row was written to a table after a conversation?"
- "How do I test that state changed from the seeded value?"
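For the last question, the `changed` operator from the assertion-categories table pairs naturally with seeded state. A sketch, where `path`/`equals` follow this document's examples but the exact `changed: true` shape is an assumption:

```typescript
// Sketch: seed a known value, then assert it changed by the end.
// `changed: true` is a guessed shape for the documented `changed` operator.
setup: { state: { user: { plan: 'free' } } },
conversation: [{ user: 'Upgrade me to pro' }],
outcome: { state: [{ path: 'user.plan', changed: true }] },
```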
Response Format
When helping a developer write an eval:
- Show the complete `new Eval({})` call with realistic field values
- Include imports (`import { Eval } from '@botpress/adk'`)
- Explain each assertion and why it's the right choice for that scenario
- Point out any mutual exclusivity rules if relevant (`expectSilence` vs `assert.response`, `user` vs `event`)
- Suggest the CLI command to run it: `adk evals <name>`

When helping debug a failing eval:
- Ask for or show the failing assertion (`expected`/`actual` diff)
- Suggest opening traces in the Control Panel to see what the bot did
- Identify whether the issue is in the eval assertion or the bot's behavior