# Evaluate Presets
## Overview
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start
Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```

Arguments:
- First arg: preset name (without `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
## Bash Tool Configuration
IMPORTANT: When invoking these scripts via the Bash tool, use these settings:

- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the `TaskOutput` tool to check progress periodically.

Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.
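Outside the Bash tool, a plain shell session can approximate the same pattern with `nohup` and a small polling helper. This is a sketch, not part of the scripts themselves; the `LOOP_COMPLETE` marker comes from the exit-code notes later in this doc, and `eval.log` is an example path:

```shell
# Launch the full suite in the background from a plain shell (example):
#   nohup ./tools/evaluate-all-presets.sh claude > eval.log 2>&1 &

# Tiny helper: report whether a log already contains the completion marker.
eval_done() {
  grep -q "LOOP_COMPLETE" "$1" && echo "done" || echo "running"
}

# Usage: eval_done eval.log
```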
## What the Scripts Do
### evaluate-preset.sh

- Loads test task from `tools/preset-test-tasks.yml` (if `yq` available)
- Creates merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published

Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log         # Full stdout/stderr
│   ├── session.jsonl      # Recorded session
│   ├── metrics.json       # Extracted metrics
│   ├── environment.json   # Runtime environment
│   └── merged-config.yml  # Config used
└── logs/<preset>/latest -> <timestamp>
```
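A couple of hypothetical one-liners for poking at a run, based on the directory layout above (the preset name is an example, and the `run_dir` helper is ours, not part of the scripts):

```shell
# Follow a run in progress (example preset):
#   tail -f .eval/logs/tdd-red-green/latest/output.log

# Resolve the `latest` symlink to the concrete timestamped run directory.
run_dir() {
  readlink -f ".eval/logs/$1/latest"
}

# Usage: ls "$(run_dir tdd-red-green)"
```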
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md       # Markdown report
├── <preset>.json    # Per-preset metrics
└── latest -> <suite-id>
```

## Presets Under Evaluation
| Preset | Test Task |
|---|---|
| Add |
| Review user input handler for security |
| Understand |
| Specify and implement |
| Implement a |
| Debug failing mock test assertion |
| Understand history of |
| Profile hat matching |
| Design a |
| Document |
| Respond to "tests failing in CI" |
| Plan v1 to v2 config migration |
## Interpreting Results
Exit codes from `evaluate-preset.sh`:

- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)

Metrics in `metrics.json`:

- `iterations` — How many event loop cycles
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether completion promise was reached
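The fields above can be pulled out with `jq` for a quick glance. A sketch (the helper name is ours; the field names are the ones listed here):

```shell
# Print a one-line summary of a run's metrics.json (requires jq).
show_metrics() {
  jq -r '"iterations=\(.iterations) events=\(.events_published) completed=\(.completed)"' "$1"
}

# Usage: show_metrics .eval/logs/<preset>/latest/metrics.json
```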
## Hat Routing Performance
Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
### What Good Looks Like
Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

### How to Check
1. Count iterations vs events in `session.jsonl`:

```bash
# Count iterations
grep -cE "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

**Expected:** iterations ≈ events published (one event per iteration)
**Bad sign:** 2-3 iterations but 5+ events (all work in single iteration)

2. Check for same-iteration hat switching in `output.log`:

```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```

Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.

3. Check event timestamps in `session.jsonl`:

```bash
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```

Red flag: Multiple events with identical timestamps (published in same iteration).
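Check 1 can be wrapped in a small heuristic. This sketch encodes the comparison; the thresholds are illustrative, and the grep patterns are the ones shown above:

```shell
# Classify a run from its iteration and event counts (rough thresholds).
routing_verdict() {
  local iters=$1 events=$2
  if [ "$events" -eq 0 ]; then
    echo "broken: no events"
  elif [ "$iters" -lt "$events" ]; then
    echo "suspect: same-iteration switching"
  elif [ "$iters" -gt $((events + 2)) ]; then
    echo "suspect: recovery loops"
  else
    echo "ok"
  fi
}

# iters=$(grep -cE "_meta.loop_start|ITERATION" .eval/logs/<preset>/latest/output.log)
# events=$(grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl)
# routing_verdict "$iters" "$events"
```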
## Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
## Root Cause Checklist
If hat routing is broken:

- Check workflow prompt in `hatless_ralph.rs`:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- Check hat instructions propagation:
  - Does `HatInfo` include `instructions` field?
  - Are instructions rendered in the `## HATS` section?
- Check events context:
  - Is `build_prompt(context)` using the context parameter?
  - Does prompt include the `## PENDING EVENTS` section?
## Autonomous Fix Workflow
After evaluation, delegate fixes to subagents:
### Step 1: Triage Results
Read `.eval/results/latest/SUMMARY.md` and identify:

- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
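Assuming SUMMARY.md carries those status markers verbatim, the triage step can be scripted as a rough filter (the helper name and output are ours):

```shell
# List summary lines that need follow-up (marker strings assumed from above).
triage_summary() {
  grep -E "FAIL|TIMEOUT|PARTIAL" "$1" || echo "all presets passed"
}

# Usage: triage_summary .eval/results/latest/SUMMARY.md
```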
### Step 2: Dispatch Task Creation
For each issue, spawn a Task agent:

```
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
```

### Step 3: Dispatch Implementation
For each created task:

```
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
```

### Step 4: Re-evaluate
```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```

## Prerequisites
- yq (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
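A quick preflight sketch for the two prerequisites (the helper and its output format are ours):

```shell
# Report whether the prerequisites are on PATH.
check_prereqs() {
  command -v yq >/dev/null 2>&1 && echo "yq: ok" || echo "yq: missing (optional)"
  command -v cargo >/dev/null 2>&1 && echo "cargo: ok" || echo "cargo: MISSING"
}
```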
## Related Files
- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated