surrogate-verifier

Surrogate Verifier

Generate structured test assertions and failure diagnostics for skill packages through information-isolated verification. The verifier operates without access to the skill generator's reasoning: it sees only the skill definition, the task prompts, and the output artifacts. This isolation prevents confirmation bias and is the single largest contributor to skill quality in co-evolutionary generation (+30 percentage points, per EvoSkills).

Reference Files

File	Contents	Load When
`references/assertion-patterns.md`	Assertion catalog by skill category with weight guidance	Always
`references/diagnostic-templates.md`	Failure diagnostic templates with root-cause categories	When producing failure reports

Information Isolation Protocol

This is the most critical constraint. Violating isolation degrades verification quality.

The verifier MUST NOT access:

The generator's conversation history or reasoning chain
Prior evolution iterations or refinement context
The generator's internal notes or decision rationale
Any context beyond what is explicitly listed below

The verifier receives ONLY:

The skill's `SKILL.md` content (the definition file)
One or more task prompts representing intended use
The skill's output (when diagnosing failures)
The assertion results from `scripts/eval_assertions.py` (when diagnosing)

Implementation: When invoked by the `test-engineer` agent, this skill MUST be loaded into a separate Agent spawn using `isolation: "worktree"` or at minimum a fresh session with no shared context. The invoking agent passes artifacts as explicit text, not as conversation references.
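
One way the invoking agent might enforce the allow-list before handing artifacts to the verifier is sketched below. The field names (`skill_md`, `task_prompts`, `skill_output`, `assertion_results`) and the `build_verifier_input` helper are hypothetical; the source only specifies which kinds of artifact may cross the boundary, not how they are packaged.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical field names for the four artifact kinds the protocol permits.
ALLOWED_FIELDS = {"skill_md", "task_prompts", "skill_output", "assertion_results"}


@dataclass(frozen=True)
class VerifierInput:
    skill_md: str                            # full text of SKILL.md
    task_prompts: Tuple[str, ...]            # one or more task prompts
    skill_output: Optional[str] = None       # only when diagnosing failures
    assertion_results: Optional[str] = None  # only when diagnosing failures


def build_verifier_input(artifacts: dict) -> VerifierInput:
    """Build the verifier payload, rejecting anything outside the allow-list."""
    extra = set(artifacts) - ALLOWED_FIELDS
    if extra:
        # Fail loudly instead of silently dropping leaked context.
        raise ValueError(f"isolation violation, unexpected fields: {sorted(extra)}")
    if not artifacts.get("skill_md") or not artifacts.get("task_prompts"):
        raise ValueError("skill_md and at least one task prompt are required")
    return VerifierInput(
        skill_md=artifacts["skill_md"],
        task_prompts=tuple(artifacts["task_prompts"]),
        skill_output=artifacts.get("skill_output"),
        assertion_results=artifacts.get("assertion_results"),
    )
```

Raising on unexpected keys, rather than filtering them out, makes an accidental leak of generator context visible to the caller.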

Workflow

Mode 1: Assertion Generation

Generate assertions for a skill given its definition and task prompts.

Phase 1: Skill Analysis

Read the `SKILL.md` definition and extract:

Stated capabilities — what the skill claims to do (from description + workflow sections)
Output format — expected structure of the skill's output (markdown, JSON, tables, etc.)
Error handling — documented failure modes and recovery paths
Prerequisites — required tools, dependencies, or context
Trigger boundaries — what the skill does NOT handle (negative scope)
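
As a rough illustration of this extraction step, the sketch below splits a definition into sections and matches section names against the five targets. It assumes `SKILL.md` uses markdown `#` headings, and the keyword map in `TARGETS` is a hypothetical heuristic, not something the source prescribes.

```python
import re


def split_sections(skill_md: str) -> dict[str, str]:
    """Map each markdown heading (lowercased) to the text beneath it.

    Simplifications: a '#' line inside a code fence is treated as a heading,
    and a repeated heading name overwrites the earlier section.
    """
    sections: dict[str, str] = {}
    current = "preamble"
    lines: list[str] = []
    for line in skill_md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if m:
            sections[current] = "\n".join(lines).strip()
            current, lines = m.group(1).lower(), []
        else:
            lines.append(line)
    sections[current] = "\n".join(lines).strip()
    return sections


# Hypothetical keywords linking the five extraction targets to heading names.
TARGETS = {
    "capabilities": ("description", "workflow"),
    "output_format": ("output",),
    "error_handling": ("error",),
    "prerequisites": ("prerequisite", "requirement"),
    "negative_scope": ("limitation", "out of scope"),
}


def extract_targets(skill_md: str) -> dict[str, list[str]]:
    """Return, per target, the section names that look relevant."""
    sections = split_sections(skill_md)
    return {
        target: [name for name in sections if any(k in name for k in keywords)]
        for target, keywords in TARGETS.items()
    }
```

A target that maps to no section (for example, no documented prerequisites) is itself a useful signal when deciding which assertions can be written.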

Phase 2: Assertion Design

For each task prompt, generate 5-10 assertions covering these dimensions:

Dimension	Assertion Types to Use	Purpose
Output completeness	`contains`, `matches_regex`	All claimed sections/components present
Format compliance	`output_format`, `contains`	Output matches declared structure
Factual signals	`contains`, `not_contains`	Key domain terms present, hallmarks of incorrect content absent
Tool usage	`calls_tool`	Expected tools were invoked
Negative constraints	`not_contains`	Forbidden patterns absent

Weight assignment:

Output completeness assertions: weight 1.0 (must have)
Format compliance: weight 0.8 (structural correctness)
Factual signals: weight 0.6 (content quality)
Tool usage: weight 0.5 (method verification)
Negative constraints: weight 0.3 (absence checks are weaker signals)

See `references/assertion-patterns.md` for category-specific assertion catalogs.
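
The dimension table and weight list above can be encoded as a small lookup, sketched here. The dimension keys and the `make_assertion` helper are hypothetical names chosen for illustration; only the assertion types and weights come from the text.

```python
# Weights per dimension, as listed under "Weight assignment".
DEFAULT_WEIGHTS = {
    "output_completeness": 1.0,
    "format_compliance": 0.8,
    "factual_signals": 0.6,
    "tool_usage": 0.5,
    "negative_constraints": 0.3,
}

# Assertion types each dimension may use, per the dimension table.
ALLOWED_TYPES = {
    "output_completeness": {"contains", "matches_regex"},
    "format_compliance": {"output_format", "contains"},
    "factual_signals": {"contains", "not_contains"},
    "tool_usage": {"calls_tool"},
    "negative_constraints": {"not_contains"},
}


def make_assertion(dimension: str, type_: str, target: str) -> dict:
    """Build one assertion entry with the default weight for its dimension."""
    if dimension not in DEFAULT_WEIGHTS:
        raise ValueError(f"unknown dimension: {dimension}")
    if type_ not in ALLOWED_TYPES[dimension]:
        raise ValueError(f"{type_!r} is not used for dimension {dimension!r}")
    return {"type": type_, "target": target, "weight": DEFAULT_WEIGHTS[dimension]}


# Example: a completeness check and a tool-usage check for one task prompt.
assertions = [
    make_assertion("output_completeness", "contains", "## Scalability"),
    make_assertion("tool_usage", "calls_tool", "Read"),
]
```

Note that the same assertion type can carry different weights: a `contains` check weighs 1.0 as a completeness check but 0.6 as a factual signal.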

Phase 3: Output

Produce assertions in the `evals/cases.yaml` schema format:

```yaml
assertions:
  - type: contains
    target: "## Scalability"
    weight: 1.0
  - type: output_format
    target: markdown_table
    weight: 0.8
  - type: not_contains
    target: "TODO"
    weight: 0.3
  - type: calls_tool
    target: Read
    weight: 0.5
```

Context cap: Do not consume more than 70% of the available context window. If the skill definition is very long, focus assertion generation on the workflow phases and output format sections. Summarize rather than quote verbatim.

Mode 2: Failure Diagnostics

When an oracle returns `fail`, produce a structured diagnostic explaining why.

Input

The skill's `SKILL.md` (same as Mode 1)
The task prompt that was executed
The output that failed
The assertion results: which passed, which failed, with details

Phase 1: Failure Classification

Categorize each failed assertion into a root-cause category:

Category	Signal	Severity
Missing capability	`contains` assertion failed for a claimed feature	HIGH
Format mismatch	`output_format` assertion failed	HIGH
Incomplete output	Multiple `contains` assertions failed in the same section	MEDIUM
Hallucinated content	`not_contains` assertion failed (forbidden pattern present)	HIGH
Wrong tool usage	`calls_tool` assertion failed	MEDIUM
Partial success	Some assertions in a group pass, others fail	LOW
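
The classification table can be read as a simple mapping from failed assertion type to category, as in this sketch. Two points are assumptions rather than anything the source states: each failed assertion carries an optional `section` key (needed to detect "Incomplete output"), and `matches_regex` failures are treated like `contains` failures. "Partial success" needs pass/fail information per assertion group, so it is left out here.

```python
from collections import Counter

# Severity per root-cause category, from the table above.
SEVERITY = {
    "Missing capability": "HIGH",
    "Format mismatch": "HIGH",
    "Incomplete output": "MEDIUM",
    "Hallucinated content": "HIGH",
    "Wrong tool usage": "MEDIUM",
}


def classify(failed: list) -> list:
    """Return one (category, severity) pair per failed assertion, in order."""
    # Count presence checks that failed within the same (hypothetical) section.
    per_section = Counter(
        f.get("section")
        for f in failed
        if f["type"] in ("contains", "matches_regex") and f.get("section")
    )
    results = []
    for f in failed:
        kind = f["type"]
        if kind in ("contains", "matches_regex"):
            many = per_section[f.get("section")] > 1
            category = "Incomplete output" if many else "Missing capability"
        elif kind == "output_format":
            category = "Format mismatch"
        elif kind == "not_contains":
            category = "Hallucinated content"
        elif kind == "calls_tool":
            category = "Wrong tool usage"
        else:
            raise ValueError(f"unknown assertion type: {kind}")
        results.append((category, SEVERITY[category]))
    return results
```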

Phase 2: Root-Cause Analysis

For each failed assertion:

Identify the specific section of `SKILL.md` that promises the missing capability
Compare what the skill definition instructs vs. what the output actually contains
Hypothesize why the gap exists (missing workflow step, ambiguous instruction, wrong tool choice)

Phase 3: Remediation Suggestions

For each root cause, produce a concrete, actionable fix:

Missing capability: "Add a workflow step between Phase 2 and Phase 3 that explicitly generates [X]"
Format mismatch: "Change the output format instruction from 'produce a summary' to 'produce a markdown table with columns: [A, B, C]'"
Hallucinated content: "Add a negative constraint in the workflow: 'Do NOT include [X] unless [condition]'"
Wrong tool usage: "Replace 'use Bash to read the file' with 'use the Read tool for file contents'"

Phase 4: Diagnostic Output

Produce a structured diagnostic string:

DIAGNOSTIC: [skill-name] failed on [task-prompt-summary]

FAILED ASSERTIONS (N/M):
  1. [SEVERITY] type=contains target="..." — Missing capability: [explanation]
  2. [SEVERITY] type=output_format target="..." — Format mismatch: [explanation]

ROOT CAUSES:
  - [category]: [specific explanation with SKILL.md section reference]

REMEDIATION:
  1. [Concrete change to SKILL.md with exact section and wording]
  2. [Concrete change to workflow with step numbers]

See `references/diagnostic-templates.md` for worked examples per root-cause category.
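
A minimal sketch of rendering the diagnostic template shown above follows. The `render_diagnostic` function and its argument shapes are hypothetical, and `N/M` is assumed to mean failed assertions over total assertions.

```python
DASH = "\u2014"  # the dash character the template uses between target and category


def render_diagnostic(skill, prompt_summary, failed, total, root_causes, remediation):
    """Render the diagnostic string.

    failed:       list of dicts with severity, type, target, category, explanation
    total:        total number of assertions evaluated (assumed meaning of M)
    root_causes:  list of (category, explanation) pairs
    remediation:  list of concrete change descriptions
    """
    lines = [f"DIAGNOSTIC: {skill} failed on {prompt_summary}", ""]
    lines.append(f"FAILED ASSERTIONS ({len(failed)}/{total}):")
    for i, f in enumerate(failed, start=1):
        lines.append(
            f'  {i}. [{f["severity"]}] type={f["type"]} target="{f["target"]}" '
            f'{DASH} {f["category"]}: {f["explanation"]}'
        )
    lines += ["", "ROOT CAUSES:"]
    lines += [f"  - {category}: {text}" for category, text in root_causes]
    lines += ["", "REMEDIATION:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(remediation, start=1)]
    return "\n".join(lines)
```

Building the string from a list of lines keeps the failed-assertion block easy to preserve if the detail sections ever have to be truncated to stay under the context cap.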

Budget Parameters

Per EvoSkills Algorithm 1:

Context cap: 0.7 (70% of available context window)
Max surrogate retries: 15 per oracle round
Max oracle rounds: 5 (enforced by the orchestrating agent, not the verifier)

The verifier does not track its own budget; the `test-engineer` agent manages iteration limits.
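
The sketch below shows one plausible way an orchestrating agent could nest these limits. The source does not spell out the loop structure of EvoSkills Algorithm 1, so the control flow, the `Budget` class, and the `surrogate_passes` / `oracle_passes` callbacks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Budget:
    context_cap: float = 0.7         # fraction of the context window (not enforced here)
    max_surrogate_retries: int = 15  # per oracle round
    max_oracle_rounds: int = 5       # enforced by the orchestrating agent


def run_rounds(
    budget: Budget,
    surrogate_passes: Callable[[int, int], bool],  # (round, attempt) -> passed?
    oracle_passes: Callable[[int], bool],          # (round) -> passed?
) -> str:
    """Assumed nesting: retry against the surrogate, then consult the oracle."""
    for round_no in range(1, budget.max_oracle_rounds + 1):
        for attempt in range(1, budget.max_surrogate_retries + 1):
            if surrogate_passes(round_no, attempt):
                break
        else:
            # The inner loop ran out of retries without a surrogate pass.
            return "surrogate budget exhausted"
        if oracle_passes(round_no):
            return "passed"
    return "oracle budget exhausted"
```

Keeping the counters in the orchestrator, as the text requires, means the verifier itself stays stateless between invocations.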

Limitations

No execution capability: The verifier generates assertions but does not execute them. Execution is handled by `scripts/run_evals.py` (the oracle).
Text-only verification: Cannot verify visual outputs, interactive behaviors, or side effects. Assertions operate on the textual output only.
Single-turn scope: Each verification is independent. The verifier does not remember prior rounds (the orchestrating agent feeds context as needed).
Assertion granularity: The 5 assertion types cover common patterns but not all possible verification needs. Custom assertion types require extending `scripts/eval_assertions.py`.

Error Handling

Error	Resolution
Skill definition too large	Summarize to workflow phases + output format sections only
No assertions generatable	Return empty assertions list with warning; skill may be too vague
Ambiguous output format	Default to `contains` assertions; avoid `output_format` checks
Context cap exceeded	Truncate diagnostic detail; preserve failed assertion list
