Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
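Install the skill:

npx skill4agent add mikeyobrien/ralph-orchestrator evaluate-presets

Evaluate a single preset:

./tools/evaluate-preset.sh tdd-red-green claude

Evaluate all presets:

./tools/evaluate-all-presets.sh claude

For `evaluate-preset.sh`, the first argument is the preset name without the `.yml` extension; the backend argument is `claude` or `kiro`.

Launch evaluations through the Bash tool with `timeout: 600000` (10 minutes) and `run_in_background: true`, then check on them with `TaskOutput`:

Bash tool with:
command: "./tools/evaluate-preset.sh tdd-red-green claude"
timeout: 600000
run_in_background: true

Poll with `TaskOutput` using `block: false`.

`evaluate-preset.sh` reads `tools/preset-test-tasks.yml` with `yq`, runs with `--record-session`, and writes to:

.eval/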
├── logs/<preset>/<timestamp>/
│ ├── output.log # Full stdout/stderr
│ ├── session.jsonl # Recorded session
│ ├── metrics.json # Extracted metrics
│ ├── environment.json # Runtime environment
│ └── merged-config.yml # Config used
└── logs/<preset>/latest -> <timestamp>

`evaluate-all-presets.sh` writes suite results to:

.eval/results/<suite-id>/
├── SUMMARY.md # Markdown report
├── <preset>.json # Per-preset metrics
└── latest -> <suite-id>

| Preset | Test Task |
|---|---|
| | Add |
| | Review user input handler for security |
| | Understand |
| | Specify and implement |
| | Implement a |
| | Debug failing mock test assertion |
| | Understand history of |
| | Profile hat matching |
| | Design a |
| | Document |
| | Respond to "tests failing in CI" |
| | Plan v1 to v2 config migration |
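To interpret a run, check the `evaluate-preset.sh` exit code (`0`, `124`), read `output.log`, and inspect `metrics.json` for `iterations`, `hats_activated`, `events_published`, and `completed`.

Expected hat routing, one hat per iteration:

Iter 1: Ralph → publishes starting event → STOPS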
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
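Iter 4: Hat C → does work → LOOP_COMPLETE

Anti-pattern, where a single iteration does the work of several hats:

Iter 2: Ralph does Blue Team + Red Team + Fixer work
^^^ All in one bloated context!

Compare the iteration count in `output.log` with the events recorded in `session.jsonl`:

# Count iterations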
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log
# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl

Look for same-iteration hat switching in `output.log`:

grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log

Check event timestamps in `session.jsonl`:

jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl

| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
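
The diagnosis table above can be applied mechanically once the two counts are known. The following is a minimal sketch, not part of the evaluation tooling: `classify_routing` is a hypothetical helper, and the factor-of-two cutoff standing in for "<<" and ">>" is an assumption chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: map (iterations, events) to a row of the diagnosis table.
# ASSUMPTION: "<<" and ">>" are approximated by a 2x difference.
classify_routing() {
  local iterations=$1 events=$2
  if [ "$events" -eq 0 ]; then
    # No events at all: events are not being read from the JSONL.
    echo "broken: 0 events"
  elif [ $((iterations * 2)) -le "$events" ]; then
    # Far more events than iterations: hats switched inside one iteration.
    echo "warning: same-iteration switching"
  elif [ "$iterations" -ge $((events * 2)) ]; then
    # Far more iterations than events: agent is not publishing required events.
    echo "warning: recovery loops"
  else
    echo "good: hat routing working"
  fi
}
```

The two arguments could be taken from the `grep -c` counts shown earlier. For example, `classify_routing 4 4` prints `good: hat routing working`, and `classify_routing 5 0` prints `broken: 0 events`.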
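When hat routing looks wrong, inspect `hatless_ralph.rs`: `HatInfo`, `instructions`, the `## HATS` section, `build_prompt(context)`, and the `## PENDING EVENTS` section.

To fix failing presets, review `.eval/results/latest/SUMMARY.md` for entries marked ❌ FAIL, ⏱️ TIMEOUT, or ⚠️ PARTIAL, then create a task for each issue:

"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"

Implement the fix:

"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"

Re-evaluate the fixed preset:

./tools/evaluate-preset.sh <fixed-preset> claude

Prerequisite: `yq`, which can be installed with:

brew install yq

Related files: `tools/evaluate-preset.sh`, `tools/evaluate-all-presets.sh`, `tools/preset-test-tasks.yml`, `tools/preset-evaluation-findings.md`, `presets/`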
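
The triage step, scanning `SUMMARY.md` for FAIL, TIMEOUT, and PARTIAL entries, can be sketched as a small helper. This is an illustration under assumptions: `list_problem_presets` is a hypothetical name, and the sketch assumes each status word appears on the same line as the preset it describes, which the report layout may or may not guarantee.

```shell
#!/usr/bin/env bash
# Sketch: print SUMMARY.md lines that carry a non-passing status.
# ASSUMPTION: the status word sits on the same line as the preset name.
list_problem_presets() {
  # Default to the latest suite report; a different path may be passed in.
  local summary=${1:-.eval/results/latest/SUMMARY.md}
  grep -E 'FAIL|TIMEOUT|PARTIAL' "$summary"
}
```

Calling `list_problem_presets` with no argument reads `.eval/results/latest/SUMMARY.md`. Each printed line is then a candidate for a task under `tasks/preset-fixes/`.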