# Empirical Prompt Tuning
The quality of a prompt isn't apparent to the person who wrote it. What the writer thinks is "clear" often causes other agents to get stuck. The core of this skill is having unbiased executors actually run the prompt, evaluate from both sides, and iterate. Don't stop until improvement plateaus.
## When to Use It
- Immediately after creating or significantly revising a skill / slash command / task prompt
- When an agent isn't behaving as expected, and you suspect the cause is ambiguity in the instructions
- When you want to harden high-priority instructions (frequently used skills, core prompts for automation)
Scenarios to Avoid:
- One-off disposable prompts (evaluation costs aren't justified)
- When your goal isn't to improve success rates, but only to reflect the writer's subjective preferences
## Workflow
0. Iteration 0 — Consistency Check Between Description and Body (static, no dispatch required)
- Read the trigger / use case stated in the frontmatter
- Read the scope covered in the body
- If there's a discrepancy, align the description or body before proceeding to Iteration 1
- Example: Detect discrepancies such as the description stating "navigation / form filling / data extraction" while the body only contains CLI references
- Skipping this step will cause subagents to "reinterpret" the body to match the description, leading to false positives where the skill seems accurate even though it doesn't meet requirements
1. Baseline Preparation: Finalize the target prompt and prepare the following two items:
- 2–3 evaluation scenarios (1 median case + 1–2 edge cases). Use realistic tasks that simulate actual application of the target prompt.
- Requirements checklist (for accuracy calculation). List 3–7 "requirements the deliverable must meet" per scenario. Accuracy % = number of met items / total items. Fix this in advance (don't adjust later).
2. Bias-Free Reading: Have a "blank slate" executor read the instructions. Dispatch a new subagent using the Task tool. Do not rely on self-review (it's structurally impossible to objectively view text you just wrote). If running multiple scenarios in parallel, call multiple agents within a single message. Refer to the "Environmental Constraints" section for handling environments where dispatch isn't possible.
3. Execution: Pass a prompt that follows the Subagent Activation Contract (described later) to the subagent, and have it execute the scenario. The executor will generate the implementation or output, and return a self-report at the end.
4. Dual-Perspective Evaluation: Record the following from the returned results:
- Executor Self-Report (extracted from the subagent's report body): Ambiguities / discretionary completions / points where the executor got stuck applying templates
- Instruction-Side Metrics (judgment rules are centrally defined in this section; refer to this section elsewhere):
- Success/Failure: Success (○) only if all requirements tagged [critical] are fully met. Failure (×) if even one is unmet or partially met. Use only the binary labels ○ / ×.
- Accuracy (achievement rate % of the requirements checklist: each ○ = 1 point, partial = 0.5, × = 0; divide the sum by the total number of items)
- Step Count (take directly from the usage metadata returned by the Task tool; include Read / Grep calls, do not exclude them)
- Duration (use from the Task tool's usage metadata)
- Retry Count (extracted from the subagent's self-report; cannot be measured by the instruction side)
- For failures, add 1 line to the "Ambiguities" section of the presentation format stating "Which [critical] item failed" (for root cause tracking)
- The requirements checklist must include at least one item tagged [critical] (the success judgment becomes vacuous if there are none). Do not add or remove [critical] tags after the fact.
5. Apply Changes: Make minimal fixes to the prompt to resolve ambiguities. Focus on one theme per iteration (multiple related fixes are allowed; unrelated fixes should be deferred to the next iteration).
- Before making a fix, explicitly state "Which item in the requirements checklist / judgment wording this fix addresses" (fixes guessed from axis names often don't work. See the "Fix Ripple Patterns" section below).
6. Re-Evaluate: Repeat steps 2 → 5 with a new subagent (do not reuse the same agent: it may have learned from previous improvements). Increase parallelism if improvement doesn't plateau as iterations progress.
7. Convergence Judgment: Stop when all of the following are met for 2 consecutive iterations (use 3 consecutive iterations for high-priority prompts):
- 0 new ambiguities
- Accuracy improvement vs. previous iteration: ≤ +3 percentage points (e.g., a move from 5% to 8% counts as saturation)
- Step count change vs. previous iteration: ±10% or less
- Duration change vs. previous iteration: ±15% or less
- Overfitting Check: When judging convergence, add one unused hold-out scenario for evaluation. If accuracy drops by 15 percentage points or more from the recent average, overfitting has occurred. Return to baseline scenario design and add edge cases.
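The judgment rules and convergence thresholds above can be sketched in code. This is an illustrative sketch only, not part of the skill's contract: the `Requirement` shape and the `history` dict keys are names assumed here.

```python
from dataclasses import dataclass

@dataclass
class Requirement:
    text: str
    critical: bool = False   # [critical] items gate the binary success label
    result: str = "x"        # "o" = met, "x" = unmet, "partial" = partially met

def success(checklist: list[Requirement]) -> bool:
    # Success (o) only if every [critical] item is fully met; partial counts as failure.
    return all(r.result == "o" for r in checklist if r.critical)

def accuracy(checklist: list[Requirement]) -> float:
    # o = 1 point, partial = 0.5, x = 0; divide the sum by total items.
    points = {"o": 1.0, "partial": 0.5, "x": 0.0}
    return 100 * sum(points[r.result] for r in checklist) / len(checklist)

def converged(history: list[dict], runs: int = 2) -> bool:
    """Check the stop condition over the last `runs` iterations (use runs=3
    for high-priority prompts). Each history entry is a dict with keys
    "ambiguities", "accuracy", "steps", "duration"."""
    if len(history) < runs + 1:
        return False
    for prev, cur in zip(history[-runs - 1:-1], history[-runs:]):
        if cur["ambiguities"] > 0:                                  # 0 new ambiguities
            return False
        if cur["accuracy"] - prev["accuracy"] > 3:                  # <= +3 pts
            return False
        if abs(cur["steps"] - prev["steps"]) > 0.10 * prev["steps"]:        # ±10%
            return False
        if abs(cur["duration"] - prev["duration"]) > 0.15 * prev["duration"]:  # ±15%
            return False
    return True
```

The overfitting hold-out check is intentionally left out of the sketch, since it requires running a fresh scenario rather than comparing recorded numbers.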
## Evaluation Axes
| Axis | Measurement Method | Meaning |
|---|---|---|
| Success/Failure | Did the executor produce the intended deliverable? (binary) | Minimum threshold |
| Accuracy | What percentage of requirements did the deliverable meet? | Degree of partial success |
| Step Count | Number of tool calls / decision steps used by the executor | Indicator of waste in instructions |
| Duration | Executor's duration_ms | Proxy indicator of cognitive load |
| Retry Count | Number of times the same judgment was repeated | Signal of instruction ambiguity |
| Ambiguities (Self-Report) | Bulleted list from the executor | Qualitative improvement material |
| Discretionary Completions (Self-Report) | Judgments not specified in the instructions | Reveals implicit specifications |
Weighting: Prioritize qualitative factors (ambiguities, discretionary completions), use quantitative factors (time, step count) as supplements. Focusing only on time reduction can lead to overly sparse prompts.
## Qualitative Interpretation of Step Count
Looking only at accuracy can hide skill issues. Using Step Count as a relative value between scenarios reveals structural defects:
- If Step Count is 3–5x higher than in other scenarios, it's a sign the skill is biased toward decision-tree indexing and lacks self-containment. The executor is being forced to descend into references.
- Typical example: all scenarios have a Step Count of 1–3, but one scenario has 15+ → the skill lacks a recipe for that scenario, so the executor is cross-searching the references/ directory.
- Fix: in Iteration 2, add a "minimal complete example inline" or "guidelines for when to read references" at the start of SKILL.md to significantly reduce Step Count.
Even with 100% accuracy, a bias in Step Count is grounds for triggering Iteration 2. Judging only by accuracy and stopping tends to miss structural defects.
## Fix Ripple Patterns (Conservative / Overperformance / No Impact)
Fixes → effects are not linear. The following three patterns can occur in pre-estimates:
- Conservative Ripple (Estimate > Actual): Targeted multiple axes with one fix, but only one axis improved. "Multi-axis targeting often misses the mark"
- Overperformance Ripple (Estimate < Actual): One piece of structural information (e.g., combination of command + settings + expected output) satisfied multiple axis judgment criteria simultaneously. "Combined information structurally impacts multiple axes"
- No Impact (Estimate > 0, Actual = 0): A fix guessed from an axis name didn't address any judgment wording. "Axis names and judgment wording are separate"
To stabilize this, have the subagent verbalize "Which judgment wording this fix addresses" before applying changes. Without linking to threshold-level wording, estimate accuracy will be low. When adding new evaluation axes, also specify judgment criteria down to the threshold wording level (granularity where subagents can judge, e.g., "all explicitly stated" or "full minimal working configuration").
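The three patterns above can be made concrete with a small sketch. The function name and the per-axis dict shape are hypothetical choices made here, assuming estimated and actual improvements are recorded per evaluation axis:

```python
def classify_ripple(estimated: dict[str, float], actual: dict[str, float]) -> str:
    """Classify a fix's ripple by comparing per-axis estimated vs. actual improvement."""
    est = sum(estimated.values())
    act = sum(actual.values())
    if est > 0 and act == 0:
        return "no impact"        # fix guessed from an axis name hit no judgment wording
    if est > act:
        return "conservative"     # multi-axis targeting missed the mark
    if est < act:
        return "overperformance"  # combined information satisfied several axes at once
    return "as estimated"
```

For example, predicting gains on both accuracy and step count but improving only accuracy would classify as a conservative ripple.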
## Subagent Activation Contract
The prompt passed to the executor must follow the structure below. This is the input contract for "dual-perspective evaluation".
```
You are an executor reading <Target Prompt Name> with a blank slate.

## Target Prompt
<Paste the full text of the target prompt or specify the path to read via Read>

## Scenario
<1 paragraph of scenario context>

## Requirements Checklist (Items the deliverable must meet)
1. [critical] <Item included in the minimum threshold>
2. <Regular item>
3. <Regular item>
...
(Judgment rules are centrally defined in the "Workflow 4. Dual-Perspective Evaluation / Instruction-Side Metrics" section. At least one [critical] item is required.)

## Tasks
1. Execute the scenario according to the target prompt and generate a deliverable.
2. Respond using the report structure below upon completion.

## Report Structure
- Deliverable: <Generated output or summary of execution results>
- Requirements Met: For each item, mark ○ / × / Partial (with reason)
- Ambiguities: Points where you got stuck on the target prompt, or wording you struggled to interpret (bulleted list)
- Discretionary Completions: Points where you filled in gaps with your own judgment because they weren't specified in the instructions (bulleted list)
- Retries: Number of times you repeated the same judgment and the reason
```
The caller extracts the self-report portion from the report, and retrieves Step Count / Duration from the Task tool's usage metadata to populate the evaluation axis table.
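As a sketch, the caller might assemble the contract from a scenario and its checklist like this (the helper name and argument shapes are assumptions made here, not part of the contract itself):

```python
def build_contract(prompt_name: str, target_prompt: str, scenario: str,
                   checklist: list[tuple[str, bool]]) -> str:
    """Assemble the subagent activation contract.

    checklist items are (text, is_critical) pairs; [critical] tags are
    rendered inline so they cannot be added or removed after the fact."""
    items = "\n".join(
        f"{i}. {'[critical] ' if critical else ''}{text}"
        for i, (text, critical) in enumerate(checklist, 1)
    )
    return (
        f"You are an executor reading {prompt_name} with a blank slate.\n\n"
        f"## Target Prompt\n{target_prompt}\n\n"
        f"## Scenario\n{scenario}\n\n"
        f"## Requirements Checklist (Items the deliverable must meet)\n{items}\n\n"
        "## Tasks\n"
        "1. Execute the scenario according to the target prompt and generate a deliverable.\n"
        "2. Respond using the report structure below upon completion.\n\n"
        "## Report Structure\n"
        "- Deliverable\n- Requirements Met\n- Ambiguities\n"
        "- Discretionary Completions\n- Retries"
    )
```

Freezing the checklist into the prompt string this way also makes each dispatched subagent reproducible from the iteration log.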
## Environmental Constraints
Do not apply this skill in environments where new subagents cannot be dispatched (e.g., already operating as a subagent, Task tool disabled).
- Alternative 1: Ask the user in the parent session to start a separate Claude Code session
- Alternative 2: Abandon evaluation and explicitly report to the user: "empirical evaluation skipped: dispatch unavailable"
- Not allowed: Substituting with self-review (evaluation results cannot be trusted due to bias)
Structural Review Mode: If you only want to check consistency and clarity of skill / prompt descriptions (not empirical evaluation), explicitly switch to structural review mode. Add the note "This is structural review mode: text consistency check only, no execution" to the prompt sent to the subagent. This allows the subagent to return a static review without triggering the skip behavior in the environmental constraints section. Structural review is a supplement to empirical evaluation (not a replacement, and cannot be used for consecutive pass judgments).
## Iteration Termination Criteria
- Convergence (Stop): All of the following are met for 2 consecutive iterations:
- 0 new ambiguities
- Accuracy improvement vs. previous iteration: ≤ +3 percentage points
- Step count change vs. previous iteration: ±10% or less
- Duration change vs. previous iteration: ±15% or less
- Overfitting Check: When judging convergence, add one unused hold-out scenario for evaluation. If accuracy drops by 15 percentage points or more from the recent average, overfitting has occurred. Return to baseline scenario design and add edge cases.
- Divergence (Question Design): If new ambiguities don't decrease after 3+ iterations → the prompt's design approach may be fundamentally wrong. Stop applying patch fixes and rewrite the structure.
- Resource Termination: Stop when priority and improvement costs are no longer balanced (decide to ship at "80 points" rather than polish to 100)
## Presentation Format
Record and present to the user in the following format for each iteration:
```
## Iteration N

### Changes (Difference from Previous Iteration)
- <1 line describing the fix>

### Execution Results (By Scenario)
| Scenario | Success/Failure | Accuracy | Steps | Duration | Retries |
|---|---|---|---|---|---|
| A | ○ | 90% | 4 | 20s | 0 |
| B | × | 60% | 9 | 41s | 2 |

### New Ambiguities (This Iteration)
- <Scenario B>: [critical] Item N failed — <1 line reason for failure> # Mandatory for failures
- <Scenario B>: <Other feedback, 1 line>
- <Scenario A>: (No new ambiguities)

### New Discretionary Completions (This Iteration)
- <Scenario B>: <Completion content>

### Next Proposed Fix
- <1 line minimal fix>

(Convergence Judgment: X consecutive passes / Y iterations remaining until stop condition)
```
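A hypothetical helper for rendering the per-scenario results table (the field names are assumed here; ASCII o / x stands in for ○ / × in the sketch):

```python
def render_results(results: dict[str, dict]) -> str:
    """Render the per-scenario results table of the presentation format.

    results maps scenario name -> {"success": bool, "accuracy": float,
    "steps": int, "duration": int (seconds), "retries": int}."""
    lines = ["| Scenario | Success/Failure | Accuracy | Steps | Duration | Retries |",
             "|---|---|---|---|---|---|"]
    for name, r in results.items():
        mark = "o" if r["success"] else "x"  # stands in for the circle/cross marks
        lines.append(f"| {name} | {mark} | {r['accuracy']:.0f}% | {r['steps']} "
                     f"| {r['duration']}s | {r['retries']} |")
    return "\n".join(lines)
```

Keeping the table generation mechanical like this avoids the temptation to round numbers favorably between iterations.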
## Red Flags (Caution on Rationalization)
| Rationalization | Reality |
|---|---|
| "Reading it myself will have the same effect" | You cannot "objectively view" text you just wrote. Always dispatch a new subagent. |
| "One scenario is enough" | One scenario leads to overfitting. Use at least 2, preferably 3. |
| "I can stop since there were 0 ambiguities once" | This could be a coincidence. Confirm with 2 consecutive iterations. |
| "Let's fix multiple ambiguities at once" | You won't know which fix worked. Focus on one theme per iteration. |
| "Let's split related minor fixes into separate iterations one by one" | This is the opposite trap. "One theme" means a meaningful unit. You can group 2–3 related minor fixes into one iteration. Splitting too much causes iteration count to explode. |
| "Metrics are good, so ignore qualitative feedback" | Time reduction can also be a sign of overly sparse prompts. Prioritize qualitative factors. |
| "Rewriting is faster" | Correct if ambiguities don't decrease after 3+ iterations. Otherwise, don't escape early. |
| "Let's reuse the same subagent" | It may have learned from previous improvements. Dispatch a new one every time. |
## Common Failures
- Scenarios are too easy / too hard: Neither produces meaningful signals. Use one median case and one edge case from real-world usage.
- Only looking at metrics: Focusing only on time reduction leads to removal of important explanations, making the prompt fragile.
- Too many changes per iteration: You won't be able to track which fix worked. One fix per iteration.
- Tuning scenarios to match fixes: Simplifying scenarios to make it seem like ambiguities are resolved → putting the cart before the horse.
## Related
- superpowers:writing-skills — TDD approach for skill creation. Essentially the same loop as this skill's "baseline → fix → re-execute with subagent"
- retrospective-codify — Formalizing learnings after tasks. Use this skill during prompt development, and retrospective-codify after task completion.
- superpowers:dispatching-parallel-agents — Practices for running multiple scenarios in parallel