# Skill Autoresearch
Use this skill to improve another skill through measured iteration instead of gut feel.
The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.
## When to use this skill
- A skill works inconsistently and needs a repeatable improvement loop
- You want to benchmark a SKILL.md before editing it
- You need binary evals for prompt or skill quality
- You want a mutation log instead of ad-hoc rewriting
- You want to compare baseline vs improved prompt behavior
## Required inputs
Do not start experiments until all inputs below are known:
- Target skill path
- Three to five representative test inputs
- Three to six binary yes/no evals
- Runs per experiment, or use the default
- Experiment interval, or use the default
- Optional budget cap
For writing reliable evals, read references/eval-guide.md.
## Instructions
### Step 1: Read the target skill
- Read the target `SKILL.md`
- Read any directly linked files under that skill's folder
- Identify the core job, required steps, output format, and likely failure modes
- Note buried instructions or conflicting rules before changing anything
### Step 2: Build the eval suite
Convert the user's quality criteria into binary checks only.
Use this format:

```text
EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no
```
Rules:
- Use binary yes/no checks only
- Prefer observable checks over taste-based judgments
- Keep evals distinct; do not double-count the same failure
- Use three to six evals total
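One way to make such evals executable is as named boolean predicates over the output text. A minimal sketch, where the eval names, checks, and sample output are illustrative assumptions, not part of the skill:

```python
# Each eval is a yes/no predicate over the raw output text.
# These three checks are illustrative examples only.
EVALS = {
    "Has summary line": lambda out: out.strip().splitlines()[0].startswith("Summary:"),
    "Under 200 words": lambda out: len(out.split()) <= 200,
    "No TODO markers": lambda out: "TODO" not in out,
}

def score_output(output: str) -> dict[str, bool]:
    """Run every eval against one output; True means pass."""
    return {name: bool(check(output)) for name, check in EVALS.items()}

result = score_output("Summary: compact answer with no open items")
```

Because every check returns a plain yes or no, scores from different experiments stay directly comparable.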
### Step 3: Create the experiment workspace
Inside the target skill folder, create:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```
Requirements:
- `results.tsv` stores experiment summaries
- `results.json` powers the dashboard
- `dashboard.html` is a self-contained status page
- `SKILL.md.baseline` is the untouched original
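A sketch of this scaffolding step, assuming the target folder contains a SKILL.md; the file names match the layout above, while the function name and starter contents are illustrative:

```python
import json
import shutil
from pathlib import Path

def create_workspace(skill_dir: str) -> Path:
    """Create the experiment workspace inside the target skill folder."""
    root = Path(skill_dir)
    ws = root / f"skill-autoresearch-{root.name}"
    ws.mkdir(exist_ok=True)
    # Empty-but-valid starter files for the loop to append to.
    (ws / "results.json").write_text(json.dumps({"experiments": []}, indent=2))
    (ws / "results.tsv").write_text(
        "experiment\tscore\tmax_score\tpass_rate\tstatus\tdescription\n"
    )
    (ws / "changelog.md").write_text(f"# Changelog: {root.name}\n")
    (ws / "dashboard.html").touch()
    # Snapshot the untouched original before any mutation happens.
    shutil.copy(root / "SKILL.md", ws / "SKILL.md.baseline")
    return ws
```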
### Step 4: Establish the baseline
Run the target skill as-is before editing it.
- Back up the original skill as `SKILL.md.baseline`
- Run the skill the configured number of times on the same test inputs
- Score every run against every eval
- Record the result as the baseline experiment
- If baseline is already above 90 percent, confirm whether more optimization is worth it
Record each experiment in `results.tsv` using this header:

```text
experiment score max_score pass_rate status description
```
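Reducing a set of scored runs to one such row is mechanical. A hedged sketch, assuming each run's results arrive as a dict mapping eval name to pass/fail (the shape produced by whatever scoring harness is in use):

```python
def tsv_row(experiment: int, runs: list[dict[str, bool]], status: str, description: str) -> str:
    """Format one results row from per-run eval results."""
    score = sum(passed for run in runs for passed in run.values())
    max_score = sum(len(run) for run in runs)
    pass_rate = score / max_score if max_score else 0.0
    return "\t".join(
        [str(experiment), str(score), str(max_score), f"{pass_rate:.0%}", status, description]
    )

row = tsv_row(0, [{"a": True, "b": False}, {"a": True, "b": True}], "baseline", "as-is run")
```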
### Step 5: Run the mutation loop
This is the core loop:
- Inspect the failing outputs
- Form one hypothesis about the failure
- Make one targeted change to `SKILL.md`
- Re-run the same test set
- Score all outputs again
- Keep the change only if the score improves
- Revert ties or regressions
- Append the result to `results.json`, `results.tsv`, and `changelog.md`
Good mutations:
- Clarify an ambiguous instruction
- Move a critical rule higher
- Add one anti-pattern for a recurring failure
- Add one focused example
- Remove a noisy instruction that causes overfitting
Bad mutations:
- Rewrite the whole skill at once
- Add many rules in one experiment
- Optimize for length instead of behavior
- Use intuition instead of measured score
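The keep-or-revert decision at the heart of the loop can be sketched as a pure function. Here `run_skill` and `score_runs` are assumed hooks into whatever harness executes the target skill, not real APIs:

```python
def run_experiment(run_skill, score_runs, baseline_score: int) -> tuple[int, str]:
    """One iteration of the keep/revert loop.

    run_skill() executes the current SKILL.md over the fixed test set;
    score_runs() applies the eval suite and returns a total score.
    """
    outputs = run_skill()
    score = score_runs(outputs)
    # Strictly better only: ties and regressions are reverted.
    status = "keep" if score > baseline_score else "discard"
    return score, status

score, status = run_experiment(lambda: ["output"], lambda outs: 5, baseline_score=4)
```

Note that a tie returns "discard": keeping a change that did not measurably help only adds noise to later experiments.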
### Step 6: Keep the dashboard live
The dashboard should refresh from `results.json` and show:
- Experiment number
- Score and pass rate progression
- Baseline vs keep vs discard status
- Per-eval failure hotspots
- Current run state: running, idle, or complete
Use a single self-contained HTML file. Inline CSS/JS is preferred.
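One hedged sketch of regenerating that page from the results data, assuming `results.json` holds an `experiments` list with `experiment`, `score`, `max_score`, and `status` fields (the exact schema is up to the implementation):

```python
import json
from pathlib import Path

def render_dashboard(ws: Path) -> str:
    """Rebuild dashboard.html from results.json as one self-contained page.

    A sketch only; a real page would add score charts and per-eval hotspots.
    """
    data = json.loads((ws / "results.json").read_text())
    rows = "".join(
        f"<tr><td>{e['experiment']}</td><td>{e['score']}/{e['max_score']}</td>"
        f"<td>{e['status']}</td></tr>"
        for e in data["experiments"]
    )
    html = (
        "<!doctype html><html><head><style>td{padding:4px}</style></head>"
        f"<body><h1>Skill Autoresearch</h1><table>{rows}</table></body></html>"
    )
    (ws / "dashboard.html").write_text(html)
    return html
```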
### Step 7: Log every experiment
Append to `changelog.md` after every run:

```markdown
## Experiment N — keep|discard
Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks
```
Discarded experiments matter. They stop future agents from repeating dead ends.
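Formatting an entry from that template is mechanical. A sketch; the field names mirror the template, while the function name is an assumption:

```python
def changelog_entry(n: int, status: str, score: int, max_score: int,
                    change: str, reasoning: str, result: str, remaining: str) -> str:
    """Render one changelog.md entry in the template's exact shape."""
    return (
        f"## Experiment {n} — {status}\n"
        f"Score: {score}/{max_score}\n"
        f"Change: {change}\n"
        f"Reasoning: {reasoning}\n"
        f"Result: {result}\n"
        f"Remaining failures: {remaining}\n"
    )
```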
### Step 8: Deliver results
When the loop stops, report:
- Baseline score to final score
- Number of experiments run
- Keep vs discard count
- Top changes that helped most
- Remaining failure patterns
- Artifact locations
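Most of these report figures fall out of the experiment records directly. A hedged sketch, assuming each record carries at least a `score` and a `status` of "baseline", "keep", or "discard":

```python
def summarize(experiments: list[dict]) -> dict:
    """Compute the end-of-loop report figures from experiment records."""
    baseline = next(e for e in experiments if e["status"] == "baseline")
    kept = [e for e in experiments if e["status"] == "keep"]
    discarded = [e for e in experiments if e["status"] == "discard"]
    # The last kept experiment holds the final score, since changes are
    # only kept when they improve on everything before them.
    final = kept[-1] if kept else baseline
    return {
        "baseline_to_final": (baseline["score"], final["score"]),
        "experiments_run": len(kept) + len(discarded),
        "kept": len(kept),
        "discarded": len(discarded),
    }
```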
## Rules
- Do not run experiments before inputs and evals are defined
- Use the same test set for baseline and mutations
- Change one thing at a time
- Keep or discard by score, not by preference
- Record every attempt
- Stop only on manual stop, budget cap, or clear score plateau
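"Clear score plateau" can be made concrete in several ways. One hedged reading: stop when the best score has not improved over a trailing window of experiments (the window size of 5 is an arbitrary choice):

```python
def plateaued(scores: list[int], window: int = 5) -> bool:
    """True when no score in the last `window` experiments beats the
    best score seen before that window."""
    if len(scores) <= window:
        return False  # too little history to call a plateau
    return max(scores[-window:]) <= max(scores[:-window])
```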
## Output format
Expected artifacts:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```
The improved skill stays in place at its original path.