# Empirical Prompt Tuning
The quality of a prompt isn't apparent to the person who wrote it. What the writer thinks is "clear" often causes other agents to get stuck. The core of this skill is having unbiased executors actually run the prompt, evaluate from both sides, and iterate. Don't stop until improvement plateaus.
## When to Use It
- Immediately after creating or significantly revising a skill / slash command / task prompt
- When an agent isn't behaving as expected, and you suspect the cause is ambiguity in the instructions
- When you want to harden high-priority instructions (frequently used skills, core prompts for automation)
Scenarios to Avoid:
- One-off disposable prompts (evaluation costs aren't justified)
- When your goal isn't to improve success rates, but only to reflect the writer's subjective preferences
## Workflow
0. Iteration 0 — Consistency Check Between Description and Body (static, no dispatch required)
- Read the trigger / use case stated in the frontmatter
- Read the scope covered in the body
- If there's a discrepancy, align the description or body before proceeding to Iteration 1
- Example: Detect discrepancies such as the description stating "navigation / form filling / data extraction" while the body only contains CLI references
- Skipping this step will cause subagents to "reinterpret" the body to match the description, leading to false positives where the skill seems accurate even though it doesn't meet requirements
1. Baseline Preparation: Finalize the target prompt and prepare the following two items:
- 2–3 evaluation scenarios (1 median case + 1–2 edge cases). Use realistic tasks that simulate actual application of the target prompt.
- Requirements checklist (for accuracy calculation). List 3–7 "requirements the deliverable must meet" per scenario. Accuracy % = number of met items / total items. Fix this in advance (don't adjust later).
2. Bias-Free Reading: Have a "blank slate" executor read the instructions. Dispatch a new subagent using the Task tool. Do not rely on self-review (it's structurally impossible to objectively view text you just wrote). If running multiple scenarios in parallel, call multiple agents within a single message. Refer to the "Environmental Constraints" section for handling environments where dispatch isn't possible.
3. Execution: Pass a prompt that follows the Subagent Activation Contract (described later) to the subagent, and have it execute the scenario. The executor will generate the implementation or output, and return a self-report at the end.
4. Dual-Perspective Evaluation: Record the following from the returned results:
- Executor Self-Report (extracted from the subagent's report body): Ambiguities / discretionary completions / points where the executor got stuck applying templates
- Instruction-Side Metrics (judgment rules are centrally defined in this section; refer to this section elsewhere):
- Success/Failure: Success (○) only if all requirements tagged [critical] are fully met. Failure (×) if even one is unmet or partially met. Use only the binary labels ○ / ×.
- Accuracy (achievement rate % of the requirements checklist: each ○ = 1 point, partial = 0.5, × = 0; divide the sum by the total number of items)
- Step Count (take directly from the usage metadata returned by the Task tool; include Read / Grep calls, do not exclude them)
- Duration (use from the Task tool's usage metadata)
- Retry Count (extracted from the subagent's self-report; cannot be measured by the instruction side)
- For failures, add 1 line to the "Ambiguities" section of the presentation format stating "Which [critical] item failed" (for root cause tracking)
- The requirements checklist must include at least one item tagged [critical] (the success judgment becomes vacuous if there are none). Do not add or remove [critical] tags after the fact.
5. Apply Changes: Make minimal fixes to the prompt to resolve ambiguities. Focus on one theme per iteration (multiple related fixes are allowed; unrelated fixes should be deferred to the next iteration).
- Before making a fix, explicitly state "Which item in the requirements checklist / judgment wording this fix addresses" (fixes guessed from axis names often don't work. See the "Fix Ripple Patterns" section below).
6. Re-Evaluate: Repeat steps 2 → 5 with a new subagent (do not reuse the same agent: it may have learned from previous improvements). Increase parallelism if improvement doesn't plateau as iterations progress.
7. Convergence Judgment: Stop when all of the following are met for 2 consecutive iterations (use 3 consecutive iterations for high-priority prompts):
- 0 new ambiguities
- Accuracy improvement vs. previous iteration: ≤ +3 percentage points (e.g., a move from 5% to 8% counts as saturation)
- Step count change vs. previous iteration: ±10% or less
- Duration change vs. previous iteration: ±15% or less
- Overfitting Check: When judging convergence, add one unused hold-out scenario for evaluation. If accuracy drops by 15 percentage points or more from the recent average, overfitting has occurred. Return to baseline scenario design and add edge cases.
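The judgment rules and convergence thresholds above can be sketched in code. This is an illustrative sketch only, not part of the skill's contract: the `Requirement` shape and the `history` dict keys are names assumed here.

```python
from dataclasses import dataclass

@dataclass
class Requirement:
    text: str
    critical: bool = False   # [critical] items gate the binary success label
    result: str = "x"        # "o" = met, "x" = unmet, "partial" = partially met

def success(checklist: list[Requirement]) -> bool:
    # Success (o) only if every [critical] item is fully met; partial counts as failure.
    return all(r.result == "o" for r in checklist if r.critical)

def accuracy(checklist: list[Requirement]) -> float:
    # o = 1 point, partial = 0.5, x = 0; divide the sum by total items.
    points = {"o": 1.0, "partial": 0.5, "x": 0.0}
    return 100 * sum(points[r.result] for r in checklist) / len(checklist)

def converged(history: list[dict], runs: int = 2) -> bool:
    """Check the stop condition over the last `runs` iterations (use runs=3
    for high-priority prompts). Each history entry is a dict with keys
    "ambiguities", "accuracy", "steps", "duration"."""
    if len(history) < runs + 1:
        return False
    for prev, cur in zip(history[-runs - 1:-1], history[-runs:]):
        if cur["ambiguities"] > 0:                                  # 0 new ambiguities
            return False
        if cur["accuracy"] - prev["accuracy"] > 3:                  # <= +3 pts
            return False
        if abs(cur["steps"] - prev["steps"]) > 0.10 * prev["steps"]:        # ±10%
            return False
        if abs(cur["duration"] - prev["duration"]) > 0.15 * prev["duration"]:  # ±15%
            return False
    return True
```

The overfitting hold-out check is intentionally left out of the sketch, since it requires running a fresh scenario rather than comparing recorded numbers.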
## Evaluation Axes
| Axis | Measurement Method | Meaning |
|---|---|---|
| Success/Failure | Did the executor produce the intended deliverable? (binary) | Minimum threshold |
| Accuracy | What percentage of requirements did the deliverable meet? | Degree of partial success |
| Step Count | Number of tool calls / decision steps used by the executor | Indicator of waste in instructions |
| Duration | Executor's duration_ms | Proxy indicator of cognitive load |
| Retry Count | Number of times the same judgment was repeated | Signal of instruction ambiguity |
| Ambiguities (Self-Report) | Bulleted list from the executor | Qualitative improvement material |
| Discretionary Completions (Self-Report) | Judgments not specified in the instructions | Reveals implicit specifications |
Weighting: Prioritize qualitative factors (ambiguities, discretionary completions), use quantitative factors (time, step count) as supplements. Focusing only on time reduction can lead to overly sparse prompts.
## Qualitative Interpretation of Step Count
Looking only at accuracy can hide skill issues. Using Step Count as a relative value between scenarios reveals structural defects:
- If Step Count is 3–5x higher than in other scenarios, it's a sign the skill is biased toward decision-tree indexing and lacks self-containment. The executor is being forced to descend into references.
- Typical example: all scenarios have a Step Count of 1–3, but one scenario has 15+ → the skill lacks a recipe for that scenario, so the executor is cross-searching the references/ directory.
- Fix: in Iteration 2, add a "minimal complete example inline" or "guidelines for when to read references" at the start of SKILL.md to significantly reduce Step Count.
Even with 100% accuracy, a bias in Step Count is grounds for triggering Iteration 2. Judging only by accuracy and stopping tends to miss structural defects.
## Fix Ripple Patterns (Conservative / Overperformance / No Impact)
Fixes → effects are not linear. The following three patterns can occur in pre-estimates:
- Conservative Ripple (Estimate > Actual): Targeted multiple axes with one fix, but only one axis improved. "Multi-axis targeting often misses the mark"
- Overperformance Ripple (Estimate < Actual): One piece of structural information (e.g., combination of command + settings + expected output) satisfied multiple axis judgment criteria simultaneously. "Combined information structurally impacts multiple axes"
- No Impact (Estimate > 0, Actual = 0): A fix guessed from an axis name didn't address any judgment wording. "Axis names and judgment wording are separate"
To stabilize this, have the subagent verbalize "Which judgment wording this fix addresses" before applying changes. Without linking to threshold-level wording, estimate accuracy will be low. When adding new evaluation axes, also specify judgment criteria down to the threshold wording level (granularity where subagents can judge, e.g., "all explicitly stated" or "full minimal working configuration").
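The three patterns above can be made concrete with a small sketch. The function name and the per-axis dict shape are hypothetical choices made here, assuming estimated and actual improvements are recorded per evaluation axis:

```python
def classify_ripple(estimated: dict[str, float], actual: dict[str, float]) -> str:
    """Classify a fix's ripple by comparing per-axis estimated vs. actual improvement."""
    est = sum(estimated.values())
    act = sum(actual.values())
    if est > 0 and act == 0:
        return "no impact"        # fix guessed from an axis name hit no judgment wording
    if est > act:
        return "conservative"     # multi-axis targeting missed the mark
    if est < act:
        return "overperformance"  # combined information satisfied several axes at once
    return "as estimated"
```

For example, predicting gains on both accuracy and step count but improving only accuracy would classify as a conservative ripple.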
## Subagent Activation Contract
The prompt passed to the executor must follow the structure below. This is the input contract for "dual-perspective evaluation".
```
You are an executor reading <Target Prompt Name> with a blank slate.

## Target Prompt
<Paste the full text of the target prompt or specify the path to read via Read>

## Scenario
<1 paragraph of scenario context>

## Requirements Checklist (Items the deliverable must meet)
1. [critical] <Item included in the minimum threshold>
2. <Regular item>
3. <Regular item>
...
(Judgment rules are centrally defined in the "Workflow 4. Dual-Perspective Evaluation / Instruction-Side Metrics" section. At least one [critical] item is required.)

## Tasks
1. Execute the scenario according to the target prompt and generate a deliverable.
2. Respond using the report structure below upon completion.

## Report Structure
- Deliverable: <Generated output or summary of execution results>
- Requirements Met: For each item, mark ○ / × / Partial (with reason)
- Ambiguities: Points where you got stuck on the target prompt, or wording you struggled to interpret (bulleted list)
- Discretionary Completions: Points where you filled in gaps with your own judgment because they weren't specified in the instructions (bulleted list)
- Retries: Number of times you repeated the same judgment and the reason
```
The caller extracts the self-report portion from the report, and retrieves Step Count / Duration from the Task tool's usage metadata to populate the evaluation axis table.
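As a sketch, the caller might assemble the contract from a scenario and its checklist like this (the helper name and argument shapes are assumptions made here, not part of the contract itself):

```python
def build_contract(prompt_name: str, target_prompt: str, scenario: str,
                   checklist: list[tuple[str, bool]]) -> str:
    """Assemble the subagent activation contract.

    checklist items are (text, is_critical) pairs; [critical] tags are
    rendered inline so they cannot be added or removed after the fact."""
    items = "\n".join(
        f"{i}. {'[critical] ' if critical else ''}{text}"
        for i, (text, critical) in enumerate(checklist, 1)
    )
    return (
        f"You are an executor reading {prompt_name} with a blank slate.\n\n"
        f"## Target Prompt\n{target_prompt}\n\n"
        f"## Scenario\n{scenario}\n\n"
        f"## Requirements Checklist (Items the deliverable must meet)\n{items}\n\n"
        "## Tasks\n"
        "1. Execute the scenario according to the target prompt and generate a deliverable.\n"
        "2. Respond using the report structure below upon completion.\n\n"
        "## Report Structure\n"
        "- Deliverable\n- Requirements Met\n- Ambiguities\n"
        "- Discretionary Completions\n- Retries"
    )
```

Freezing the checklist into the prompt string this way also makes each dispatched subagent reproducible from the iteration log.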
## Environmental Constraints
Do not apply this skill in environments where new subagents cannot be dispatched (e.g., already operating as a subagent, Task tool disabled).
- Alternative 1: Ask the user in the parent session to start a separate Claude Code session
- Alternative 2: Abandon evaluation and explicitly report to the user: "empirical evaluation skipped: dispatch unavailable"
- Not allowed: Substituting with self-review (evaluation results cannot be trusted due to bias)
Structural Review Mode: If you only want to check consistency and clarity of skill / prompt descriptions (not empirical evaluation), explicitly switch to structural review mode. Add the note "This is structural review mode: text consistency check only, no execution" to the prompt sent to the subagent. This allows the subagent to return a static review without triggering the skip behavior in the environmental constraints section. Structural review is a supplement to empirical evaluation (not a replacement, and cannot be used for consecutive pass judgments).
## Iteration Termination Criteria
- Convergence (Stop): All of the following are met for 2 consecutive iterations:
- 0 new ambiguities
- Accuracy improvement vs. previous iteration: ≤ +3 percentage points
- Step count change vs. previous iteration: ±10% or less
- Duration change vs. previous iteration: ±15% or less
- Overfitting Check: When judging convergence, add one unused hold-out scenario for evaluation. If accuracy drops by 15 percentage points or more from the recent average, overfitting has occurred. Return to baseline scenario design and add edge cases.
- Divergence (Question Design): If new ambiguities don't decrease after 3+ iterations → the prompt's design approach may be fundamentally wrong. Stop applying patch fixes and rewrite the structure.
- Resource Termination: Stop when priority and improvement costs are no longer balanced (decide to ship at "80 points" rather than polish to 100)
## Presentation Format
Record and present to the user in the following format for each iteration:
```
## Iteration N

### Changes (Difference from Previous Iteration)
- <1 line describing the fix>

### Execution Results (By Scenario)
| Scenario | Success/Failure | Accuracy | Steps | Duration | Retries |
|---|---|---|---|---|---|
| A | ○ | 90% | 4 | 20s | 0 |
| B | × | 60% | 9 | 41s | 2 |

### New Ambiguities (This Iteration)
- <Scenario B>: [critical] Item N failed — <1 line reason for failure> # Mandatory for failures
- <Scenario B>: <Other feedback, 1 line>
- <Scenario A>: (No new ambiguities)

### New Discretionary Completions (This Iteration)
- <Scenario B>: <Completion content>

### Next Proposed Fix
- <1 line minimal fix>

(Convergence Judgment: X consecutive passes / Y iterations remaining until stop condition)
```
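A hypothetical helper for rendering the per-scenario results table (the field names are assumed here; ASCII o / x stands in for ○ / × in the sketch):

```python
def render_results(results: dict[str, dict]) -> str:
    """Render the per-scenario results table of the presentation format.

    results maps scenario name -> {"success": bool, "accuracy": float,
    "steps": int, "duration": int (seconds), "retries": int}."""
    lines = ["| Scenario | Success/Failure | Accuracy | Steps | Duration | Retries |",
             "|---|---|---|---|---|---|"]
    for name, r in results.items():
        mark = "o" if r["success"] else "x"  # stands in for the circle/cross marks
        lines.append(f"| {name} | {mark} | {r['accuracy']:.0f}% | {r['steps']} "
                     f"| {r['duration']}s | {r['retries']} |")
    return "\n".join(lines)
```

Keeping the table generation mechanical like this avoids the temptation to round numbers favorably between iterations.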
## Red Flags (Caution on Rationalization)
| Rationalization | Reality |
|---|---|
| "Reading it myself will have the same effect" | You cannot "objectively view" text you just wrote. Always dispatch a new subagent. |
| "One scenario is enough" | One scenario leads to overfitting. Use at least 2, preferably 3. |
| "I can stop since there were 0 ambiguities once" | This could be a coincidence. Confirm with 2 consecutive iterations. |
| "Let's fix multiple ambiguities at once" | You won't know which fix worked. Focus on one theme per iteration. |
| "Let's split related minor fixes into separate iterations one by one" | This is the opposite trap. "One theme" means a meaningful unit. You can group 2–3 related minor fixes into one iteration. Splitting too much causes iteration count to explode. |
| "Metrics are good, so ignore qualitative feedback" | Time reduction can also be a sign of overly sparse prompts. Prioritize qualitative factors. |
| "Rewriting is faster" | Correct if ambiguities don't decrease after 3+ iterations. Otherwise, don't escape early. |
| "Let's reuse the same subagent" | It may have learned from previous improvements. Dispatch a new one every time. |
## Common Failures
- Scenarios are too easy / too hard: Neither produces meaningful signals. Use one median case and one edge case from real-world usage.
- Only looking at metrics: Focusing only on time reduction leads to removal of important explanations, making the prompt fragile.
- Too many changes per iteration: You won't be able to track which fix worked. One fix per iteration.
- Tuning scenarios to match fixes: Simplifying scenarios to make it seem like ambiguities are resolved → putting the cart before the horse.
## Related
- superpowers:writing-skills — TDD approach for skill creation. Essentially the same loop as this skill's "baseline → fix → re-execute with subagent"
- retrospective-codify — Formalizing learnings after tasks. Use this skill during prompt development, and retrospective-codify after task completion.
- superpowers:dispatching-parallel-agents — Practices for running multiple scenarios in parallel