surrogate-verifier

Surrogate Verifier

Generate structured test assertions and failure diagnostics for skill packages through information-isolated verification. The verifier operates without access to the skill generator's reasoning: it sees only the skill definition, a task prompt, and the output artifacts. This isolation prevents confirmation bias and is the single largest contributor to skill quality in co-evolutionary generation (+30 percentage points, per EvoSkills).

Reference Files

| File | Contents | Load When |
| --- | --- | --- |
| `references/assertion-patterns.md` | Assertion catalog by skill category with weight guidance | Always |
| `references/diagnostic-templates.md` | Failure diagnostic templates with root-cause categories | When producing failure reports |

Information Isolation Protocol

This is the most critical constraint. Violating isolation degrades verification quality.
The verifier MUST NOT access:
  • The generator's conversation history or reasoning chain
  • Prior evolution iterations or refinement context
  • The generator's internal notes or decision rationale
  • Any context beyond what is explicitly listed below
The verifier receives ONLY:
  1. The skill's `SKILL.md` content (the definition file)
  2. One or more task prompts representing intended use
  3. The skill's output (when diagnosing failures)
  4. The assertion results from `scripts/eval_assertions.py` (when diagnosing)
Implementation: When invoked by the `test-engineer` agent, this skill MUST be loaded into a separate Agent spawn using `isolation: "worktree"` or at minimum a fresh session with no shared context. The invoking agent passes artifacts as explicit text, not as conversation references.
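One way the invoking agent could enforce the "receives ONLY" list is to build the verifier's input through an allow-list, so that anything else is rejected rather than silently passed along. The sketch below is illustrative only: `VerifierInput`, `build_verifier_input`, and the field names are hypothetical and are not part of the skill package.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names; the four slots mirror the "receives ONLY" list.
ALLOWED_FIELDS = {"skill_md", "task_prompts", "skill_output", "assertion_results"}


@dataclass(frozen=True)
class VerifierInput:
    skill_md: str                             # full text of SKILL.md
    task_prompts: tuple                       # one or more task prompts
    skill_output: Optional[str] = None        # only when diagnosing failures
    assertion_results: Optional[str] = None   # only when diagnosing failures


def build_verifier_input(artifacts: dict) -> VerifierInput:
    """Reject any key outside the allow-list instead of silently dropping it."""
    extra = set(artifacts) - ALLOWED_FIELDS
    if extra:
        raise ValueError(f"isolation violation, unexpected fields: {sorted(extra)}")
    return VerifierInput(
        skill_md=artifacts["skill_md"],
        task_prompts=tuple(artifacts["task_prompts"]),
        skill_output=artifacts.get("skill_output"),
        assertion_results=artifacts.get("assertion_results"),
    )
```

Raising on unexpected keys, rather than filtering them out, makes an accidental leak of generator context visible at the call site.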

Workflow

Mode 1: Assertion Generation

Generate assertions for a skill given its definition and task prompts.

Phase 1: Skill Analysis

Read the `SKILL.md` definition and extract:
  1. Stated capabilities — what the skill claims to do (from description + workflow sections)
  2. Output format — expected structure of the skill's output (markdown, JSON, tables, etc.)
  3. Error handling — documented failure modes and recovery paths
  4. Prerequisites — required tools, dependencies, or context
  5. Trigger boundaries — what the skill does NOT handle (negative scope)
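As a starting point for this extraction, the definition can be split by heading so that the description, workflow, error-handling, and prerequisite sections can be inspected separately. This is a minimal sketch that assumes `SKILL.md` uses `#`-style markdown headings; `split_sections` is a hypothetical helper, and it does not special-case `#` lines inside fenced code blocks or repeated heading names.

```python
import re


def split_sections(skill_md: str) -> dict:
    """Map each markdown heading to the text that follows it."""
    sections = {}
    current = "preamble"   # text before the first heading
    lines = []
    for line in skill_md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if m:
            # Close the previous section and start a new one.
            sections[current] = "\n".join(lines).strip()
            current, lines = m.group(1), []
        else:
            lines.append(line)
    sections[current] = "\n".join(lines).strip()
    return sections
```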

Phase 2: Assertion Design

For each task prompt, generate 5-10 assertions covering these dimensions:

| Dimension | Assertion Types to Use | Purpose |
| --- | --- | --- |
| Output completeness | `contains`, `matches_regex` | All claimed sections/components present |
| Format compliance | `output_format`, `contains` | Output matches declared structure |
| Factual signals | `contains`, `not_contains` | Key domain terms present, hallmarks absent |
| Tool usage | `calls_tool` | Expected tools were invoked |
| Negative constraints | `not_contains` | Forbidden patterns absent |

Weight assignment:
  • Output completeness assertions: weight 1.0 (must have)
  • Format compliance: weight 0.8 (structural correctness)
  • Factual signals: weight 0.6 (content quality)
  • Tool usage: weight 0.5 (method verification)
  • Negative constraints: weight 0.3 (absence checks are weaker signals)
See `references/assertion-patterns.md` for category-specific assertion catalogs.
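The weights above can be kept in one lookup so every generated assertion picks up the default for its dimension. The sketch below also shows a weighted pass rate as one possible way to aggregate results; that aggregation is an assumption, since this document does not say how the oracle combines weights. The dimension keys and function names are hypothetical.

```python
# Default weight per dimension, taken from the list above.
DEFAULT_WEIGHTS = {
    "output_completeness": 1.0,
    "format_compliance": 0.8,
    "factual_signals": 0.6,
    "tool_usage": 0.5,
    "negative_constraints": 0.3,
}


def make_assertion(dimension: str, type_: str, target: str) -> dict:
    """Build one assertion entry with the default weight for its dimension."""
    return {"type": type_, "target": target, "weight": DEFAULT_WEIGHTS[dimension]}


def weighted_pass_rate(results: list) -> float:
    """Assumed aggregation: passed weight divided by total weight.

    `results` is a list of (assertion, passed) pairs.
    """
    total = sum(a["weight"] for a, _ in results)
    passed = sum(a["weight"] for a, ok in results if ok)
    return passed / total if total else 0.0
```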

Phase 3: Output

Produce assertions in the `evals/cases.yaml` schema format:
```yaml
assertions:
  - type: contains
    target: "## Scalability"
    weight: 1.0
  - type: output_format
    target: markdown_table
    weight: 0.8
  - type: not_contains
    target: "TODO"
    weight: 0.3
  - type: calls_tool
    target: Read
    weight: 0.5
```
Context cap: Do not consume more than 70% of the available context window. If the skill definition is very long, focus assertion generation on the workflow phases and output format sections. Summarize rather than quote verbatim.
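To make the intended meaning of the text-based assertion types concrete, here is a rough sketch of how an entry from the schema above might be checked against a skill's output. It is not the implementation in `scripts/eval_assertions.py`, which is not shown here; `check_assertion` is a hypothetical name. `output_format` and `calls_tool` are left out because they need a format detector and a tool-call trace, respectively.

```python
import re


def check_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one text-based assertion against the skill's output."""
    kind, target = assertion["type"], assertion["target"]
    if kind == "contains":
        return target in output
    if kind == "not_contains":
        return target not in output
    if kind == "matches_regex":
        return re.search(target, output) is not None
    # output_format and calls_tool need more than the raw output text.
    raise NotImplementedError(f"{kind} needs data beyond the output text")
```

This runs on the oracle side; the verifier itself only writes the assertions.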

Mode 2: Failure Diagnostics

When the oracle returns `fail`, produce a structured diagnostic explaining why.

Input

  • The skill's `SKILL.md` (same as Mode 1)
  • The task prompt that was executed
  • The output that failed
  • The assertion results: which passed, which failed, with details

Phase 1: Failure Classification

Categorize each failed assertion into a root-cause category:

| Category | Signal | Severity |
| --- | --- | --- |
| Missing capability | `contains` assertion failed for a claimed feature | HIGH |
| Format mismatch | `output_format` assertion failed | HIGH |
| Incomplete output | Multiple `contains` assertions failed in the same section | MEDIUM |
| Hallucinated content | `not_contains` assertion failed (forbidden pattern present) | HIGH |
| Wrong tool usage | `calls_tool` assertion failed | MEDIUM |
| Partial success | Some assertions in a group pass, others fail | LOW |
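The classification table can be read as a lookup on assertion type, with one override for several `contains` failures in the same section. The sketch below is an approximation under stated assumptions: the `section` key on each assertion is hypothetical, the `Unclassified` fallback is invented for types the table does not cover (such as `matches_regex`), and the "Partial success" row is omitted because it also needs the passing assertions.

```python
from collections import Counter

# Direct type-to-category rows from the table above.
BY_TYPE = {
    "contains": ("Missing capability", "HIGH"),
    "output_format": ("Format mismatch", "HIGH"),
    "not_contains": ("Hallucinated content", "HIGH"),
    "calls_tool": ("Wrong tool usage", "MEDIUM"),
}


def classify(failed: list) -> list:
    """Return a (category, severity) pair per failed assertion, in input order."""
    # Count failed `contains` checks per section ("section" is a hypothetical key).
    per_section = Counter(
        a.get("section") for a in failed
        if a["type"] == "contains" and a.get("section")
    )
    labels = []
    for a in failed:
        if a["type"] == "contains" and per_section[a.get("section")] > 1:
            labels.append(("Incomplete output", "MEDIUM"))
        else:
            labels.append(BY_TYPE.get(a["type"], ("Unclassified", "LOW")))
    return labels
```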

Phase 2: Root-Cause Analysis

For each failed assertion:
  1. Identify the specific section of `SKILL.md` that promises the missing capability
  2. Compare what the skill definition instructs vs. what the output actually contains
  3. Hypothesize why the gap exists (missing workflow step, ambiguous instruction, wrong tool choice)

Phase 3: Remediation Suggestions

For each root cause, produce a concrete, actionable fix:
  • Missing capability: "Add a workflow step between Phase 2 and Phase 3 that explicitly generates [X]"
  • Format mismatch: "Change the output format instruction from 'produce a summary' to 'produce a markdown table with columns: [A, B, C]'"
  • Hallucinated content: "Add a negative constraint in the workflow: 'Do NOT include [X] unless [condition]'"
  • Wrong tool usage: "Replace 'use Bash to read the file' with 'use the Read tool for file contents'"

Phase 4: Diagnostic Output

Produce a structured diagnostic string:
```
DIAGNOSTIC: [skill-name] failed on [task-prompt-summary]

FAILED ASSERTIONS (N/M):
  1. [SEVERITY] type=contains target="..." — Missing capability: [explanation]
  2. [SEVERITY] type=output_format target="..." — Format mismatch: [explanation]

ROOT CAUSES:
  - [category]: [specific explanation with SKILL.md section reference]

REMEDIATION:
  1. [Concrete change to SKILL.md with exact section and wording]
  2. [Concrete change to workflow with step numbers]
```
See `references/diagnostic-templates.md` for worked examples per root-cause category.
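For illustration, the template above could be filled in by a small formatter like the following. The function name and the dictionary keys (`severity`, `type`, `target`, `category`, `explanation`) are hypothetical; only the layout follows the template.

```python
SEP = "\u2014"  # the dash separator used in the template above


def format_diagnostic(skill, task, total, failures, root_causes, remediation):
    """Render the diagnostic string; `total` is M, len(failures) is N."""
    lines = [
        f"DIAGNOSTIC: {skill} failed on {task}",
        "",
        f"FAILED ASSERTIONS ({len(failures)}/{total}):",
    ]
    for i, f in enumerate(failures, 1):
        lines.append(
            f'  {i}. [{f["severity"]}] type={f["type"]} '
            f'target="{f["target"]}" {SEP} {f["category"]}: {f["explanation"]}'
        )
    lines += ["", "ROOT CAUSES:"]
    lines += [f"  - {cause}" for cause in root_causes]
    lines += ["", "REMEDIATION:"]
    lines += [f"  {i}. {fix}" for i, fix in enumerate(remediation, 1)]
    return "\n".join(lines)
```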

Budget Parameters

Per EvoSkills Algorithm 1:
  • Context cap: 0.7 (70% of available context window)
  • Max surrogate retries: 15 per oracle round
  • Max oracle rounds: 5 (enforced by the orchestrating agent, not the verifier)
The verifier does not track its own budget; the `test-engineer` agent manages iteration limits.
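Since the limits are enforced by the orchestrating agent, a sketch of that side may help. The nested loop below is only one plausible reading of the three parameters; the actual control flow of EvoSkills Algorithm 1 is not reproduced in this document, and `run_rounds`, `refine`, `oracle`, and `within_context_cap` are hypothetical names.

```python
from typing import Callable, Optional

CONTEXT_CAP = 0.7            # fraction of the available context window
MAX_SURROGATE_RETRIES = 15   # per oracle round
MAX_ORACLE_ROUNDS = 5        # enforced by the orchestrating agent


def within_context_cap(used_tokens: int, window_tokens: int) -> bool:
    """True while usage stays at or below the context cap."""
    return used_tokens <= CONTEXT_CAP * window_tokens


def run_rounds(refine: Callable[[int, int], bool],
               oracle: Callable[[int], bool]) -> Optional[int]:
    """Return the oracle round that passed, or None if the budget ran out.

    Assumed flow: refine against surrogate assertions until they pass or
    retries are exhausted, then ask the oracle; repeat up to the round limit.
    """
    for rnd in range(1, MAX_ORACLE_ROUNDS + 1):
        for attempt in range(1, MAX_SURROGATE_RETRIES + 1):
            if refine(rnd, attempt):
                break
        if oracle(rnd):
            return rnd
    return None
```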

Limitations

  • No execution capability: The verifier generates assertions but does not execute them. Execution is handled by `scripts/run_evals.py` (the oracle).
  • Text-only verification: Cannot verify visual outputs, interactive behaviors, or side effects. Assertions operate on the textual output only.
  • Single-turn scope: Each verification is independent. The verifier does not remember prior rounds (the orchestrating agent feeds context as needed).
  • Assertion granularity: The 5 assertion types cover common patterns but not all possible verification needs. Custom assertion types require extending `scripts/eval_assertions.py`.

Error Handling

| Error | Resolution |
| --- | --- |
| Skill definition too large | Summarize to workflow phases + output format sections only |
| No assertions can be generated | Return an empty assertions list with a warning; the skill may be too vague |
| Ambiguous output format | Default to `contains` assertions; avoid `output_format` checks |
| Context cap exceeded | Truncate diagnostic detail; preserve the failed assertion list |