surrogate-verifier
Surrogate Verifier
Generate structured test assertions and failure diagnostics for skill packages through
information-isolated verification. The verifier operates without access to the skill
generator's reasoning — it sees only the skill definition, a task prompt, and the
output artifacts. This isolation prevents confirmation bias and is the single largest
contributor to skill quality in co-evolutionary generation (+30pp per EvoSkills).
Reference Files
| File | Contents | Load When |
|---|---|---|
| `references/assertion-patterns.md` | Assertion catalog by skill category with weight guidance | Always |
| `references/diagnostic-templates.md` | Failure diagnostic templates with root-cause categories | When producing failure reports |
Information Isolation Protocol
This is the most critical constraint. Violating isolation degrades verification quality.
The verifier MUST NOT access:
- The generator's conversation history or reasoning chain
- Prior evolution iterations or refinement context
- The generator's internal notes or decision rationale
- Any context beyond what is explicitly listed below
The verifier receives ONLY:
- The skill's `SKILL.md` content (the definition file)
- One or more task prompts representing intended use
- The skill's output (when diagnosing failures)
- The assertion results from `scripts/eval_assertions.py` (when diagnosing)
Implementation: When invoked by the `test-engineer` agent, this skill MUST be loaded
into a separate Agent spawn using `isolation: "worktree"` or at minimum a fresh session
with no shared context. The invoking agent passes artifacts as explicit text, not as
conversation references.
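As a rough illustration of the allow-list above, the invoking agent could assemble the verifier's input as explicit text fields and refuse anything else. This is a minimal sketch under assumptions: `VerifierInput`, `build_verifier_input`, and the field names are hypothetical and are not part of the skill package.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow-list mirroring the "receives ONLY" bullets above.
ALLOWED_KEYS = {"skill_md", "task_prompts", "skill_output", "assertion_results"}


@dataclass(frozen=True)
class VerifierInput:
    skill_md: str                            # full text of SKILL.md
    task_prompts: tuple[str, ...]            # one or more task prompts
    skill_output: Optional[str] = None       # only when diagnosing failures
    assertion_results: Optional[str] = None  # only when diagnosing failures


def build_verifier_input(artifacts: dict) -> VerifierInput:
    """Build the verifier payload, refusing any key outside the allow-list."""
    extra = set(artifacts) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"isolation violation, unexpected keys: {sorted(extra)}")
    prompts = tuple(artifacts.get("task_prompts", ()))
    if not artifacts.get("skill_md") or not prompts:
        raise ValueError("skill_md and at least one task prompt are required")
    return VerifierInput(
        skill_md=artifacts["skill_md"],
        task_prompts=prompts,
        skill_output=artifacts.get("skill_output"),
        assertion_results=artifacts.get("assertion_results"),
    )
```

Rejecting unknown keys, rather than silently dropping them, makes an accidental leak of generator context visible to the caller.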
Workflow
Mode 1: Assertion Generation
Generate assertions for a skill given its definition and task prompts.
Phase 1: Skill Analysis
Read the `SKILL.md` definition and extract:
- Stated capabilities: what the skill claims to do (from description + workflow sections)
- Output format: expected structure of the skill's output (markdown, JSON, tables, etc.)
- Error handling: documented failure modes and recovery paths
- Prerequisites: required tools, dependencies, or context
- Trigger boundaries: what the skill does NOT handle (negative scope)
Phase 2: Assertion Design
For each task prompt, generate 5-10 assertions covering these dimensions:
| Dimension | Assertion Types to Use | Purpose |
|---|---|---|
| Output completeness | | All claimed sections/components present |
| Format compliance | | Output matches declared structure |
| Factual signals | | Key domain terms present, hallmarks absent |
| Tool usage | | Expected tools were invoked |
| Negative constraints | | Forbidden patterns absent |
Weight assignment:
- Output completeness assertions: weight 1.0 (must have)
- Format compliance: weight 0.8 (structural correctness)
- Factual signals: weight 0.6 (content quality)
- Tool usage: weight 0.5 (method verification)
- Negative constraints: weight 0.3 (absence checks are weaker signals)
See `references/assertion-patterns.md` for category-specific assertion catalogs.
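The text gives per-dimension weights but does not say how they are combined into a score. The sketch below assumes one plausible rule, a weighted pass ratio. The dictionary keys and the function name are hypothetical. Only the numeric weights come from the list above.

```python
# Default weights per dimension, copied from the weight assignment list.
# The dictionary keys are hypothetical labels chosen for this sketch.
DEFAULT_WEIGHTS = {
    "output_completeness": 1.0,
    "format_compliance": 0.8,
    "factual_signals": 0.6,
    "tool_usage": 0.5,
    "negative_constraints": 0.3,
}


def weighted_pass_ratio(results: list[tuple[float, bool]]) -> float:
    """Return sum(weights of passed) / sum(all weights); 0.0 for an empty list.

    This aggregation rule is an assumption, not something the skill specifies.
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    return sum(weight for weight, passed in results if passed) / total
```

Under this rule a failed negative constraint (weight 0.3) lowers the score far less than a failed completeness check (weight 1.0), which matches the stated intent that absence checks are weaker signals.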
Phase 3: Output
Produce assertions in the `evals/cases.yaml` schema format:
```yaml
assertions:
  - type: contains
    target: "## Scalability"
    weight: 1.0
  - type: output_format
    target: markdown_table
    weight: 0.8
  - type: not_contains
    target: "TODO"
    weight: 0.3
  - type: calls_tool
    target: Read
    weight: 0.5
```
Context cap: Do not consume more than 70% of the available context window. If the
skill definition is very long, focus assertion generation on the workflow phases and
output format sections. Summarize rather than quote verbatim.
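To make the assertion types concrete, here is a minimal sketch of how three of them might be checked against a text output. The semantics are assumptions inferred from the type names: substring match for `contains` and `not_contains`, and set membership for `calls_tool`. The actual logic lives in `scripts/eval_assertions.py`, whose interface is not shown here. `check_assertion` and `tools_called` are hypothetical names, and `output_format` is deliberately left out.

```python
def check_assertion(assertion: dict, output: str,
                    tools_called: frozenset = frozenset()) -> bool:
    """Evaluate one assertion dict shaped like the YAML entries above."""
    kind, target = assertion["type"], assertion["target"]
    if kind == "contains":
        return target in output          # assumed: plain substring match
    if kind == "not_contains":
        return target not in output
    if kind == "calls_tool":
        return target in tools_called    # assumed: names of tools invoked
    raise NotImplementedError(f"assertion type not covered by this sketch: {kind}")
```

For example, `check_assertion({"type": "not_contains", "target": "TODO"}, report_text)` would return `False` for any `report_text` that still contains the string `TODO`.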
Mode 2: Failure Diagnostics
When an oracle returns `fail`, produce a structured diagnostic explaining why.
Input
- The skill's `SKILL.md` (same as Mode 1)
- The task prompt that was executed
- The output that failed
- The assertion results: which passed, which failed, with details
Phase 1: Failure Classification
Categorize each failed assertion into a root-cause category:
| Category | Signal | Severity |
|---|---|---|
| Missing capability | | HIGH |
| Format mismatch | | HIGH |
| Incomplete output | Multiple | MEDIUM |
| Hallucinated content | | HIGH |
| Wrong tool usage | | MEDIUM |
| Partial success | Some assertions in a group pass, others fail | LOW |
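The category-to-severity mapping in the table can be held as a small lookup. The sketch below also sorts categories so that HIGH items come first. That ordering is an assumption about how a report might be prioritized, and `order_by_severity` is a hypothetical helper.

```python
# Severity per root-cause category, copied from the table above.
SEVERITY = {
    "Missing capability": "HIGH",
    "Format mismatch": "HIGH",
    "Incomplete output": "MEDIUM",
    "Hallucinated content": "HIGH",
    "Wrong tool usage": "MEDIUM",
    "Partial success": "LOW",
}
_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def order_by_severity(categories: list[str]) -> list[str]:
    """Sort categories HIGH first; ties keep input order (sorted() is stable)."""
    return sorted(categories, key=lambda c: _RANK[SEVERITY[c]])
```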
Phase 2: Root-Cause Analysis
For each failed assertion:
- Identify the specific section of `SKILL.md` that promises the missing capability
- Compare what the skill definition instructs vs. what the output actually contains
- Hypothesize why the gap exists (missing workflow step, ambiguous instruction, wrong tool choice)
Phase 3: Remediation Suggestions
For each root cause, produce a concrete, actionable fix:
- Missing capability: "Add a workflow step between Phase 2 and Phase 3 that explicitly generates [X]"
- Format mismatch: "Change the output format instruction from 'produce a summary' to 'produce a markdown table with columns: [A, B, C]'"
- Hallucinated content: "Add a negative constraint in the workflow: 'Do NOT include [X] unless [condition]'"
- Wrong tool usage: "Replace 'use Bash to read the file' with 'use the Read tool for file contents'"
Phase 4: Diagnostic Output
Produce a structured diagnostic string:
```
DIAGNOSTIC: [skill-name] failed on [task-prompt-summary]
FAILED ASSERTIONS (N/M):
1. [SEVERITY] type=contains target="..." — Missing capability: [explanation]
2. [SEVERITY] type=output_format target="..." — Format mismatch: [explanation]
ROOT CAUSES:
- [category]: [specific explanation with SKILL.md section reference]
REMEDIATION:
1. [Concrete change to SKILL.md with exact section and wording]
2. [Concrete change to workflow with step numbers]
```
See `references/diagnostic-templates.md` for worked examples per root-cause category.
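The diagnostic layout above can be produced from structured data. The following is a minimal sketch; `FailedAssertion` and `render_diagnostic` are hypothetical names, and the exact whitespace of the real template may differ.

```python
from dataclasses import dataclass

EM_DASH = "\u2014"  # separator character used in the diagnostic template


@dataclass
class FailedAssertion:
    severity: str      # HIGH, MEDIUM, or LOW
    type: str          # e.g. contains, output_format
    target: str
    category: str      # root-cause category
    explanation: str


def render_diagnostic(skill_name, task_summary, failed, total, root_causes, remediation):
    """Render the diagnostic string; `total` is M in the (N/M) counter."""
    lines = [
        f"DIAGNOSTIC: {skill_name} failed on {task_summary}",
        f"FAILED ASSERTIONS ({len(failed)}/{total}):",
    ]
    for i, fa in enumerate(failed, start=1):
        lines.append(
            f'{i}. [{fa.severity}] type={fa.type} target="{fa.target}" '
            f"{EM_DASH} {fa.category}: {fa.explanation}"
        )
    lines.append("ROOT CAUSES:")
    lines.extend(f"- {category}: {why}" for category, why in root_causes)
    lines.append("REMEDIATION:")
    lines.extend(f"{i}. {fix}" for i, fix in enumerate(remediation, start=1))
    return "\n".join(lines)
```

Here `root_causes` is a list of `(category, explanation)` pairs and `remediation` is a list of strings, one per suggested fix.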
Budget Parameters
Per EvoSkills Algorithm 1:
- Context cap: 0.7 (70% of available context window)
- Max surrogate retries: 15 per oracle round
- Max oracle rounds: 5 (enforced by the orchestrating agent, not the verifier)
The verifier does not track its own budget; the `test-engineer` agent manages iteration limits.
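For orientation, the limits could be enforced by the orchestrating side roughly as follows. Only the three numeric values come from the list above. The nested loop structure, the `Budget` class, and the `refine` and `oracle` callables are assumptions made for this sketch and are not a transcription of EvoSkills Algorithm 1.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Budget:
    context_cap: float = 0.7          # fraction of the context window
    max_surrogate_retries: int = 15   # per oracle round
    max_oracle_rounds: int = 5        # enforced by the orchestrating agent


def run_rounds(refine: Callable[[], bool], oracle: Callable[[], bool],
               budget: Budget = Budget()) -> bool:
    """Return True once the oracle passes, False when the budget is exhausted.

    `refine` stands in for one surrogate-verified refinement attempt and
    returns True when the surrogate assertions pass (hypothetical callable).
    """
    for _ in range(budget.max_oracle_rounds):
        for _ in range(budget.max_surrogate_retries):
            if refine():
                break
        if oracle():
            return True
    return False
```

In the worst case this allows 5 x 15 = 75 surrogate attempts and 5 oracle calls before giving up.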
Limitations
- No execution capability: The verifier generates assertions but does not execute them. Execution is handled by `scripts/run_evals.py` (the oracle).
- Text-only verification: Cannot verify visual outputs, interactive behaviors, or side effects. Assertions operate on the textual output only.
- Single-turn scope: Each verification is independent. The verifier does not remember prior rounds (the orchestrating agent feeds context as needed).
- Assertion granularity: The 5 assertion types cover common patterns but not all possible verification needs. Custom assertion types require extending `scripts/eval_assertions.py`.
Error Handling
| Error | Resolution |
|---|---|
| Skill definition too large | Summarize to workflow phases + output format sections only |
| No assertions generatable | Return empty assertions list with warning; skill may be too vague |
| Ambiguous output format | Default to |
| Context cap exceeded | Truncate diagnostic detail; preserve failed assertion list |