
Evaluate Presets

Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.

When to Use

  • Testing preset configurations after changes
  • Auditing the preset library for quality
  • Validating new presets work correctly
  • After modifying hat routing logic

Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:
  • First arg: preset name (without `.yml` extension)
  • Second arg: backend (`claude` or `kiro`, defaults to `claude`)
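
The argument handling described above can be sketched in a few lines of shell. This is an illustrative sketch, not the actual script's code; only the "second argument defaults to `claude`" behavior is taken from this doc, and the demo default for the first argument exists only so the sketch runs standalone.

```shell
# Hypothetical argument handling mirroring evaluate-preset.sh's interface;
# variable names are illustrative, not from the real script.
preset="${1:-tdd-red-green}"   # first arg: preset name, no .yml extension
backend="${2:-claude}"         # second arg: backend, defaults to claude

echo "evaluating preset '$preset' with backend '$backend'"
```

Run with no arguments, this falls back to both defaults.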

Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
  • Single preset evaluation: `timeout: 600000` (10 minutes max) and `run_in_background: true`
  • All presets evaluation: `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the `TaskOutput` tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.

What the Scripts Do

evaluate-preset.sh

  1. Loads test task from `tools/preset-test-tasks.yml` (if `yq` is available)
  2. Creates merged config with evaluation settings
  3. Runs Ralph with `--record-session` for metrics capture
  4. Captures output logs, exit codes, and timing
  5. Extracts metrics: iterations, hats activated, events published

Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```

evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md              # Markdown report
├── <preset>.json           # Per-preset metrics
└── latest -> <suite-id>
```
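
As a sketch of consuming the per-preset `<preset>.json` files, the snippet below counts presets whose `completed` flag is false (the `completed` field is documented under Interpreting Results; the directory and file contents here are fabricated stand-ins for a real `.eval/results/<suite-id>/`).

```shell
# Fabricated stand-in for a results directory.
dir=$(mktemp -d)
printf '{"completed": true}\n'  > "$dir/tdd-red-green.json"
printf '{"completed": false}\n' > "$dir/spec-driven.json"

# Count presets that never reached their completion promise.
failed=$(grep -l '"completed": false' "$dir"/*.json | wc -l | tr -d ' ')
echo "failed presets: $failed"
```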

Presets Under Evaluation

| Preset | Test Task |
| --- | --- |
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a `Stack` data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |

Interpreting Results

Exit codes from `evaluate-preset.sh`:
  • `0` — Success (LOOP_COMPLETE reached)
  • `124` — Timeout (preset hung or took too long)
  • Other — Failure (check `output.log`)

Metrics in `metrics.json`:
  • `iterations` — How many event loop cycles ran
  • `hats_activated` — Which hats were triggered
  • `events_published` — Total events emitted
  • `completed` — Whether the completion promise was reached
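
The exit-code contract above can be wrapped in a small helper when scripting around the evaluator. The function below is a hypothetical convenience, not part of the tools:

```shell
# Hypothetical helper: map evaluate-preset.sh exit codes to the labels
# documented above.
describe_exit() {
  case "$1" in
    0)   echo "success (LOOP_COMPLETE reached)" ;;
    124) echo "timeout (preset hung or took too long)" ;;
    *)   echo "failure (check output.log)" ;;
  esac
}

describe_exit 0
describe_exit 124
```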

Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

How to Check

1. Count iterations vs events in
session.jsonl
:
bash
undefined
1. 在
session.jsonl
中统计迭代次数与事件数:
bash
undefined

Count iterations

统计迭代次数

grep -c "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log
grep -c "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

Count events published

统计发布的事件数

grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl

**Expected:** iterations ≈ events published (one event per iteration)
**Bad sign:** 2-3 iterations but 5+ events (all work in single iteration)

**2. Check for same-iteration hat switching in `output.log`:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in
session.jsonl
:
bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
Red flag: Multiple events with identical timestamps (published in same iteration).
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl

**预期结果:** 迭代次数 ≈ 发布的事件数(每次迭代一个事件)
**不良信号:** 2-3次迭代但有5+个事件(所有操作都在单个迭代中)

**2. 在`output.log`中检查同迭代内的Hat切换:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
危险信号: Hat切换语句之间没有ITERATION分隔符。
3. 在
session.jsonl
中检查事件时间戳:
bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
危险信号: 多个事件具有相同的时间戳(在同一迭代中发布)。
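
The timestamp check can be exercised without a real run. The sketch below substitutes a fabricated three-event fixture for `session.jsonl` and uses `sed` instead of `jq` so it needs no extra tools (the `.ts` field name is taken from the check above):

```shell
# Fabricated stand-in for .eval/logs/<preset>/latest/session.jsonl.
f=$(mktemp)
cat > "$f" <<'EOF'
{"ts":"2024-01-01T00:00:01Z","topic":"a"}
{"ts":"2024-01-01T00:00:01Z","topic":"b"}
{"ts":"2024-01-01T00:00:05Z","topic":"c"}
EOF

# Duplicate timestamps (a same-iteration red flag) surface via uniq -d.
dupes=$(sed -n 's/.*"ts":"\([^"]*\)".*/\1/p' "$f" | sort | uniq -d)
echo "duplicate timestamps: ${dupes:-none}"
```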

Routing Performance Triage

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
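
The triage table can be expressed as a tiny shell function. The exact cutoffs for "<<" and ">>" are an assumption (2x here); the table itself does not pin them down:

```shell
# Hypothetical triage helper following the table above; the 2x thresholds
# are an assumption, not from the doc.
diagnose_routing() {
  iters="$1"; events="$2"
  if [ "$events" -eq 0 ]; then
    echo "broken: events not being read from JSONL"
  elif [ "$events" -ge $((iters * 2)) ]; then
    echo "same-iteration switching: check prompt has STOP instruction"
  elif [ "$iters" -ge $((events * 2)) ]; then
    echo "recovery loops: agent not publishing required events"
  else
    echo "good: hat routing working"
  fi
}

diagnose_routing 4 4   # iterations match events
diagnose_routing 2 6   # far more events than iterations
```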

Root Cause Checklist

If hat routing is broken:
  1. Check workflow prompt in `hatless_ralph.rs`:
    • Does it say "CRITICAL: STOP after publishing"?
    • Is the DELEGATE section clear about yielding control?
  2. Check hat instructions propagation:
    • Does `HatInfo` include an `instructions` field?
    • Are instructions rendered in the `## HATS` section?
  3. Check events context:
    • Is `build_prompt(context)` using the context parameter?
    • Does the prompt include a `## PENDING EVENTS` section?
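
Check 1 in the list reduces to a grep. The sketch below runs it against a fabricated stand-in rather than the real `hatless_ralph.rs`:

```shell
# Fabricated stand-in for the workflow prompt source file.
src=$(mktemp)
cat > "$src" <<'EOF'
// workflow prompt fragment (fabricated)
// CRITICAL: STOP after publishing
EOF

if grep -q "CRITICAL: STOP after publishing" "$src"; then
  echo "stop instruction present"
else
  echo "stop instruction missing"
fi
```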

Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:
  • ❌ FAIL → Create code tasks for fixes
  • ⏱️ TIMEOUT → Investigate infinite loops
  • ⚠️ PARTIAL → Check for edge cases

Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
```

Step 3: Dispatch Implementation

For each created task:

```
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
```

Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```

Prerequisites

  • yq (optional): For loading test tasks from YAML. Install: `brew install yq`
  • Cargo: Must be able to build Ralph
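
A quick preflight for the prerequisites above (the helper name is illustrative):

```shell
# Report whether a prerequisite tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_tool yq      # optional
check_tool cargo   # required to build Ralph
```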

Related Files

  • `tools/evaluate-preset.sh` — Single preset evaluation
  • `tools/evaluate-all-presets.sh` — Full suite evaluation
  • `tools/preset-test-tasks.yml` — Test task definitions
  • `tools/preset-evaluation-findings.md` — Manual findings doc
  • `presets/` — The preset collection being evaluated