# Skill Autoresearch
Use this skill to improve another skill through measured iteration instead of gut feel.
The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.
## When to use this skill
- A skill works inconsistently and needs a repeatable improvement loop
- You want to benchmark a SKILL.md before editing it
- You need binary evals for prompt or skill quality
- You want a mutation log instead of ad-hoc rewriting
- You want to compare baseline vs improved prompt behavior
## Required inputs
Do not start experiments until all inputs below are known:
- Target skill path
- Three to five representative test inputs
- Three to six binary yes/no evals
- Runs per experiment, or use the default
- Experiment interval, or use the default
- Optional budget cap
For writing reliable evals, read references/eval-guide.md.
## Instructions
### Step 1: Read the target skill
- Read the target `SKILL.md`
- Read any directly linked files under that skill's folder
- Identify the core job, required steps, output format, and likely failure modes
- Note buried instructions or conflicting rules before changing anything
### Step 2: Build the eval suite
Convert the user's quality criteria into binary checks only.
Use this format:

```text
EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no
```
Rules:
- Use binary yes/no checks only
- Prefer observable checks over taste-based judgments
- Keep evals distinct; do not double-count the same failure
- Use three to six evals total
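One way to make such evals executable is as named boolean predicates over the output text. A minimal sketch, where the eval names, checks, and sample output are illustrative assumptions, not part of the skill:

```python
# Each eval is a yes/no predicate over the raw output text.
# These three checks are illustrative examples only.
EVALS = {
    "Has summary line": lambda out: out.strip().splitlines()[0].startswith("Summary:"),
    "Under 200 words": lambda out: len(out.split()) <= 200,
    "No TODO markers": lambda out: "TODO" not in out,
}

def score_output(output: str) -> dict[str, bool]:
    """Run every eval against one output; True means pass."""
    return {name: bool(check(output)) for name, check in EVALS.items()}

result = score_output("Summary: compact answer with no open items")
```

Because every check returns a plain yes or no, scores from different experiments stay directly comparable.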
### Step 3: Create the experiment workspace
Inside the target skill folder, create:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```
Requirements:
- `results.tsv` stores experiment summaries
- `results.json` powers the dashboard
- `dashboard.html` is a self-contained status page
- `SKILL.md.baseline` is the untouched original
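A sketch of this scaffolding step, assuming the target folder contains a SKILL.md; the file names match the layout above, while the function name and starter contents are illustrative:

```python
import json
import shutil
from pathlib import Path

def create_workspace(skill_dir: str) -> Path:
    """Create the experiment workspace inside the target skill folder."""
    root = Path(skill_dir)
    ws = root / f"skill-autoresearch-{root.name}"
    ws.mkdir(exist_ok=True)
    # Empty-but-valid starter files for the loop to append to.
    (ws / "results.json").write_text(json.dumps({"experiments": []}, indent=2))
    (ws / "results.tsv").write_text(
        "experiment\tscore\tmax_score\tpass_rate\tstatus\tdescription\n"
    )
    (ws / "changelog.md").write_text(f"# Changelog: {root.name}\n")
    (ws / "dashboard.html").touch()
    # Snapshot the untouched original before any mutation happens.
    shutil.copy(root / "SKILL.md", ws / "SKILL.md.baseline")
    return ws
```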
### Step 4: Establish the baseline
Run the target skill as-is before editing it.
- Back up the original skill as `SKILL.md.baseline`
- Run the skill the configured number of times on the same test inputs
- Score every run against every eval
- Record the result as the baseline experiment
- If baseline is already above 90 percent, confirm whether more optimization is worth it
Record each experiment in `results.tsv` using this header:

```text
experiment score max_score pass_rate status description
```
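Reducing a set of scored runs to one such row is mechanical. A hedged sketch, assuming each run's results arrive as a dict mapping eval name to pass/fail (the shape produced by whatever scoring harness is in use):

```python
def tsv_row(experiment: int, runs: list[dict[str, bool]], status: str, description: str) -> str:
    """Format one results row from per-run eval results."""
    score = sum(passed for run in runs for passed in run.values())
    max_score = sum(len(run) for run in runs)
    pass_rate = score / max_score if max_score else 0.0
    return "\t".join(
        [str(experiment), str(score), str(max_score), f"{pass_rate:.0%}", status, description]
    )

row = tsv_row(0, [{"a": True, "b": False}, {"a": True, "b": True}], "baseline", "as-is run")
```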
### Step 5: Run the mutation loop
This is the core loop:
- Inspect the failing outputs
- Form one hypothesis about the failure
- Make one targeted change to `SKILL.md`
- Re-run the same test set
- Score all outputs again
- Keep the change only if the score improves
- Revert ties or regressions
- Append the result to `results.json`, `results.tsv`, and `changelog.md`
Good mutations:
- Clarify an ambiguous instruction
- Move a critical rule higher
- Add one anti-pattern for a recurring failure
- Add one focused example
- Remove a noisy instruction that causes overfitting
Bad mutations:
- Rewrite the whole skill at once
- Add many rules in one experiment
- Optimize for length instead of behavior
- Use intuition instead of measured score
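The keep-or-revert decision at the heart of the loop can be sketched as a pure function. Here `run_skill` and `score_runs` are assumed hooks into whatever harness executes the target skill, not real APIs:

```python
def run_experiment(run_skill, score_runs, baseline_score: int) -> tuple[int, str]:
    """One iteration of the keep/revert loop.

    run_skill() executes the current SKILL.md over the fixed test set;
    score_runs() applies the eval suite and returns a total score.
    """
    outputs = run_skill()
    score = score_runs(outputs)
    # Strictly better only: ties and regressions are reverted.
    status = "keep" if score > baseline_score else "discard"
    return score, status

score, status = run_experiment(lambda: ["output"], lambda outs: 5, baseline_score=4)
```

Note that a tie returns "discard": keeping a change that did not measurably help only adds noise to later experiments.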
### Step 6: Keep the dashboard live
The dashboard should refresh from `results.json` and show:
- Experiment number
- Score and pass rate progression
- Baseline vs keep vs discard status
- Per-eval failure hotspots
- Current run state: running, idle, or complete
Use a single self-contained HTML file. Inline CSS/JS is preferred.
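One hedged sketch of regenerating that page from the results data, assuming `results.json` holds an `experiments` list with `experiment`, `score`, `max_score`, and `status` fields (the exact schema is up to the implementation):

```python
import json
from pathlib import Path

def render_dashboard(ws: Path) -> str:
    """Rebuild dashboard.html from results.json as one self-contained page.

    A sketch only; a real page would add score charts and per-eval hotspots.
    """
    data = json.loads((ws / "results.json").read_text())
    rows = "".join(
        f"<tr><td>{e['experiment']}</td><td>{e['score']}/{e['max_score']}</td>"
        f"<td>{e['status']}</td></tr>"
        for e in data["experiments"]
    )
    html = (
        "<!doctype html><html><head><style>td{padding:4px}</style></head>"
        f"<body><h1>Skill Autoresearch</h1><table>{rows}</table></body></html>"
    )
    (ws / "dashboard.html").write_text(html)
    return html
```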
### Step 7: Log every experiment
Append to `changelog.md` after every run:

```markdown
## Experiment N — keep|discard
Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks
```
Discarded experiments matter. They stop future agents from repeating dead ends.
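Formatting an entry from that template is mechanical. A sketch; the field names mirror the template, while the function name is an assumption:

```python
def changelog_entry(n: int, status: str, score: int, max_score: int,
                    change: str, reasoning: str, result: str, remaining: str) -> str:
    """Render one changelog.md entry in the template's exact shape."""
    return (
        f"## Experiment {n} — {status}\n"
        f"Score: {score}/{max_score}\n"
        f"Change: {change}\n"
        f"Reasoning: {reasoning}\n"
        f"Result: {result}\n"
        f"Remaining failures: {remaining}\n"
    )
```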
### Step 8: Deliver results
When the loop stops, report:
- Baseline score to final score
- Number of experiments run
- Keep vs discard count
- Top changes that helped most
- Remaining failure patterns
- Artifact locations
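Most of these report figures fall out of the experiment records directly. A hedged sketch, assuming each record carries at least a `score` and a `status` of "baseline", "keep", or "discard":

```python
def summarize(experiments: list[dict]) -> dict:
    """Compute the end-of-loop report figures from experiment records."""
    baseline = next(e for e in experiments if e["status"] == "baseline")
    kept = [e for e in experiments if e["status"] == "keep"]
    discarded = [e for e in experiments if e["status"] == "discard"]
    # The last kept experiment holds the final score, since changes are
    # only kept when they improve on everything before them.
    final = kept[-1] if kept else baseline
    return {
        "baseline_to_final": (baseline["score"], final["score"]),
        "experiments_run": len(kept) + len(discarded),
        "kept": len(kept),
        "discarded": len(discarded),
    }
```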
## Rules
- Do not run experiments before inputs and evals are defined
- Use the same test set for baseline and mutations
- Change one thing at a time
- Keep or discard by score, not by preference
- Record every attempt
- Stop only on manual stop, budget cap, or clear score plateau
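"Clear score plateau" can be made concrete in several ways. One hedged reading: stop when the best score has not improved over a trailing window of experiments (the window size of 5 is an arbitrary choice):

```python
def plateaued(scores: list[int], window: int = 5) -> bool:
    """True when no score in the last `window` experiments beats the
    best score seen before that window."""
    if len(scores) <= window:
        return False  # too little history to call a plateau
    return max(scores[-window:]) <= max(scores[:-window])
```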
## Output format
Expected artifacts:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```
The improved skill stays in place at its original path.