
Evaluate Presets

Overview

Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.

When to Use

  • Testing preset configurations after changes
  • Auditing the preset library for quality
  • Validating new presets work correctly
  • After modifying hat routing logic

Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:
  • First arg: preset name (without `.yml` extension)
  • Second arg: backend (`claude` or `kiro`, defaults to `claude`)
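
The argument handling described above can be sketched in a few lines of shell. This is an illustrative sketch, not the actual script's code; only the "second argument defaults to `claude`" behavior is taken from this doc, and the demo default for the first argument exists only so the sketch runs standalone.

```shell
# Hypothetical argument handling mirroring evaluate-preset.sh's interface;
# variable names are illustrative, not from the real script.
preset="${1:-tdd-red-green}"   # first arg: preset name, no .yml extension
backend="${2:-claude}"         # second arg: backend, defaults to claude

echo "evaluating preset '$preset' with backend '$backend'"
```

Run with no arguments, this falls back to both defaults.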

Bash Tool Configuration

IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
  • Single preset evaluation: `timeout: 600000` (10 minutes max) and `run_in_background: true`
  • All presets evaluation: `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the `TaskOutput` tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.

What the Scripts Do

evaluate-preset.sh

  1. Loads test task from `tools/preset-test-tasks.yml` (if `yq` is available)
  2. Creates merged config with evaluation settings
  3. Runs Ralph with `--record-session` for metrics capture
  4. Captures output logs, exit codes, and timing
  5. Extracts metrics: iterations, hats activated, events published

Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```

evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md              # Markdown report
├── <preset>.json           # Per-preset metrics
└── latest -> <suite-id>
```
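
As a sketch of consuming the per-preset `<preset>.json` files, the snippet below counts presets whose `completed` flag is false (the `completed` field is documented under Interpreting Results; the directory and file contents here are fabricated stand-ins for a real `.eval/results/<suite-id>/`).

```shell
# Fabricated stand-in for a results directory.
dir=$(mktemp -d)
printf '{"completed": true}\n'  > "$dir/tdd-red-green.json"
printf '{"completed": false}\n' > "$dir/spec-driven.json"

# Count presets that never reached their completion promise.
failed=$(grep -l '"completed": false' "$dir"/*.json | wc -l | tr -d ' ')
echo "failed presets: $failed"
```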

Presets Under Evaluation

| Preset | Test Task |
| --- | --- |
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a `Stack` data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |

Interpreting Results

Exit codes from `evaluate-preset.sh`:
  • `0` — Success (LOOP_COMPLETE reached)
  • `124` — Timeout (preset hung or took too long)
  • Other — Failure (check `output.log`)

Metrics in `metrics.json`:
  • `iterations` — How many event loop cycles ran
  • `hats_activated` — Which hats were triggered
  • `events_published` — Total events emitted
  • `completed` — Whether the completion promise was reached
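
The exit-code contract above can be wrapped in a small helper when scripting around the evaluator. The function below is a hypothetical convenience, not part of the tools:

```shell
# Hypothetical helper: map evaluate-preset.sh exit codes to the labels
# documented above.
describe_exit() {
  case "$1" in
    0)   echo "success (LOOP_COMPLETE reached)" ;;
    124) echo "timeout (preset hung or took too long)" ;;
    *)   echo "failure (check output.log)" ;;
  esac
}

describe_exit 0
describe_exit 124
```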

Hat Routing Performance

Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

How to Check

1. Count iterations vs events in
session.jsonl
:
bash
undefined
1. 在
session.jsonl
中统计迭代次数与事件数:
bash
undefined

Count iterations

统计迭代次数

grep -c "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log
grep -c "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

Count events published

统计发布的事件数

grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl

**Expected:** iterations ≈ events published (one event per iteration)
**Bad sign:** 2-3 iterations but 5+ events (all work in single iteration)

**2. Check for same-iteration hat switching in `output.log`:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in
session.jsonl
:
bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
Red flag: Multiple events with identical timestamps (published in same iteration).
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl

**预期结果:** 迭代次数 ≈ 发布的事件数(每次迭代一个事件)
**不良信号:** 2-3次迭代但有5+个事件(所有操作都在单个迭代中)

**2. 在`output.log`中检查同迭代内的Hat切换:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
危险信号: Hat切换语句之间没有ITERATION分隔符。
3. 在
session.jsonl
中检查事件时间戳:
bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
危险信号: 多个事件具有相同的时间戳(在同一迭代中发布)。
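
The timestamp check can be exercised without a real run. The sketch below substitutes a fabricated three-event fixture for `session.jsonl` and uses `sed` instead of `jq` so it needs no extra tools (the `.ts` field name is taken from the check above):

```shell
# Fabricated stand-in for .eval/logs/<preset>/latest/session.jsonl.
f=$(mktemp)
cat > "$f" <<'EOF'
{"ts":"2024-01-01T00:00:01Z","topic":"a"}
{"ts":"2024-01-01T00:00:01Z","topic":"b"}
{"ts":"2024-01-01T00:00:05Z","topic":"c"}
EOF

# Duplicate timestamps (a same-iteration red flag) surface via uniq -d.
dupes=$(sed -n 's/.*"ts":"\([^"]*\)".*/\1/p' "$f" | sort | uniq -d)
echo "duplicate timestamps: ${dupes:-none}"
```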

Routing Performance Triage

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
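
The triage table can be expressed as a tiny shell function. The exact cutoffs for "<<" and ">>" are an assumption (2x here); the table itself does not pin them down:

```shell
# Hypothetical triage helper following the table above; the 2x thresholds
# are an assumption, not from the doc.
diagnose_routing() {
  iters="$1"; events="$2"
  if [ "$events" -eq 0 ]; then
    echo "broken: events not being read from JSONL"
  elif [ "$events" -ge $((iters * 2)) ]; then
    echo "same-iteration switching: check prompt has STOP instruction"
  elif [ "$iters" -ge $((events * 2)) ]; then
    echo "recovery loops: agent not publishing required events"
  else
    echo "good: hat routing working"
  fi
}

diagnose_routing 4 4   # iterations match events
diagnose_routing 2 6   # far more events than iterations
```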

Root Cause Checklist

If hat routing is broken:
  1. Check workflow prompt in `hatless_ralph.rs`:
    • Does it say "CRITICAL: STOP after publishing"?
    • Is the DELEGATE section clear about yielding control?
  2. Check hat instructions propagation:
    • Does `HatInfo` include an `instructions` field?
    • Are instructions rendered in the `## HATS` section?
  3. Check events context:
    • Is `build_prompt(context)` using the context parameter?
    • Does the prompt include a `## PENDING EVENTS` section?
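
Check 1 in the list reduces to a grep. The sketch below runs it against a fabricated stand-in rather than the real `hatless_ralph.rs`:

```shell
# Fabricated stand-in for the workflow prompt source file.
src=$(mktemp)
cat > "$src" <<'EOF'
// workflow prompt fragment (fabricated)
// CRITICAL: STOP after publishing
EOF

if grep -q "CRITICAL: STOP after publishing" "$src"; then
  echo "stop instruction present"
else
  echo "stop instruction missing"
fi
```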

Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:
  • ❌ FAIL → Create code tasks for fixes
  • ⏱️ TIMEOUT → Investigate infinite loops
  • ⚠️ PARTIAL → Check for edge cases

Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
```

Step 3: Dispatch Implementation

For each created task:

```
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
```

Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```

Prerequisites

  • yq (optional): For loading test tasks from YAML. Install: `brew install yq`
  • Cargo: Must be able to build Ralph
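
A quick preflight for the prerequisites above (the helper name is illustrative):

```shell
# Report whether a prerequisite tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_tool yq      # optional
check_tool cargo   # required to build Ralph
```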

Related Files

  • `tools/evaluate-preset.sh` — Single preset evaluation
  • `tools/evaluate-all-presets.sh` — Full suite evaluation
  • `tools/preset-test-tasks.yml` — Test task definitions
  • `tools/preset-evaluation-findings.md` — Manual findings doc
  • `presets/` — The preset collection being evaluated