skill-autoresearch


# Skill Autoresearch


Use this skill to improve another skill through measured iteration instead of gut feel.
The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.

## When to use this skill


- A skill works inconsistently and needs a repeatable improvement loop
- You want to benchmark a `SKILL.md` before editing it
- You need binary evals for prompt or skill quality
- You want a mutation log instead of ad-hoc rewriting
- You want to compare baseline vs improved prompt behavior

## Required inputs


Do not start experiments until all inputs below are known:

1. Target skill path
2. Three to five representative test inputs
3. Three to six binary yes/no evals
4. Runs per experiment (default: 5)
5. Experiment interval (default: 2m)
6. Optional budget cap

For writing reliable evals, read `references/eval-guide.md`.
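The inputs above can be collected into one config object before any experiment runs, so the defaults and range checks live in one place. A minimal sketch in Python; the class and field names here are illustrative, not part of the skill:

```python
# Collect all required inputs up front; refuse to start without them.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResearchConfig:
    skill_path: str                     # 1. target skill path
    test_inputs: list                   # 2. three to five representative inputs
    evals: list                         # 3. three to six binary yes/no evals
    runs_per_experiment: int = 5        # 4. default 5
    interval: str = "2m"                # 5. default 2m
    budget_cap: Optional[float] = None  # 6. optional cap

    def validate(self):
        """Enforce the input ranges before the loop starts."""
        assert 3 <= len(self.test_inputs) <= 5, "need 3-5 test inputs"
        assert 3 <= len(self.evals) <= 6, "need 3-6 evals"
```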

## Instructions


### Step 1: Read the target skill


1. Read the target `SKILL.md`
2. Read any directly linked files under that skill's `references/` directory
3. Identify the core job, required steps, output format, and likely failure modes
4. Note buried instructions or conflicting rules before changing anything

### Step 2: Build the eval suite


Convert the user's quality criteria into binary checks only.
Use this format:

```text
EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no
```

Rules:

- Use binary yes/no checks only
- Prefer observable checks over taste-based judgments
- Keep evals distinct; do not double-count the same failure
- Use three to six evals total
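The eval format above maps naturally onto data plus a predicate, which makes scoring mechanical. A Python sketch; the two example evals and their pass conditions are hypothetical, not prescribed by this skill:

```python
# Represent each binary eval as a name, a yes/no question, and a pass predicate.

def make_eval(name, question, check):
    """Bundle one eval's short name, question, and pass condition."""
    return {"name": name, "question": question, "check": check}


EVALS = [
    make_eval(
        "Has heading",
        "Does the output start with a markdown heading?",
        lambda out: out.lstrip().startswith("#"),
    ),
    make_eval(
        "Under limit",
        "Is the output 500 words or fewer?",
        lambda out: len(out.split()) <= 500,
    ),
]


def score(output):
    """Return (passes, total) for one output against every eval."""
    passes = sum(1 for e in EVALS if e["check"](output))
    return passes, len(EVALS)
```

Because each check is a yes/no predicate, every run scores the same way and no eval can double-count another's failure.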

### Step 3: Create the experiment workspace


Inside the target skill folder, create:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

Requirements:

- `results.tsv` stores experiment summaries
- `results.json` powers the dashboard
- `dashboard.html` is a self-contained status page
- `SKILL.md.baseline` is the untouched original

### Step 4: Establish the baseline


Run the target skill as-is before editing it.

1. Back up the original skill as `SKILL.md.baseline`
2. Run the skill N times on the same test inputs
3. Score every run against every eval
4. Record experiment 0 as the baseline
5. If the baseline is already above 90 percent, confirm whether more optimization is worth it

Use this `results.tsv` header:

```text
experiment	score	max_score	pass_rate	status	description
```
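A row under that header can be assembled mechanically after each experiment; only the pass-rate arithmetic is fixed, the field values below are illustrative:

```python
# Build one tab-separated results row matching the results.tsv header.

def tsv_row(experiment, score, max_score, status, description):
    """Format an experiment summary; pass_rate is score/max_score as a percent."""
    pass_rate = f"{100 * score / max_score:.0f}%"
    return "\t".join(
        [str(experiment), str(score), str(max_score), pass_rate, status, description]
    )
```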

### Step 5: Run the mutation loop


This is the core loop:

1. Inspect the failing outputs
2. Form one hypothesis about the failure
3. Make one targeted change to `SKILL.md`
4. Re-run the same test set
5. Score all outputs again
6. Keep the change only if the score improves
7. Revert ties or regressions
8. Append the result to `results.tsv`, `results.json`, and `changelog.md`

Good mutations:

- Clarify an ambiguous instruction
- Move a critical rule higher
- Add one anti-pattern for a recurring failure
- Add one focused example
- Remove a noisy instruction that causes overfitting

Bad mutations:

- Rewrite the whole skill at once
- Add many rules in one experiment
- Optimize for length instead of behavior
- Use intuition instead of measured score
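The keep-or-discard decision at the heart of the loop fits in a few lines. A sketch, assuming a `run_and_score` helper (hypothetical here) that runs a skill text on the test set and returns total passes across all runs and evals:

```python
# Decide whether to keep a single mutation, by measured score only.

def try_mutation(current_text, mutated_text, run_and_score):
    """Return (text to keep, status). Ties and regressions revert."""
    baseline = run_and_score(current_text)
    candidate = run_and_score(mutated_text)
    # Keep only strict improvements; a tie is a revert, per the loop rules.
    if candidate > baseline:
        return mutated_text, "keep"
    return current_text, "discard"
```

Note the strict `>`: a tie counts as a regression of effort, so the change is reverted rather than kept on preference.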

### Step 6: Keep the dashboard live


The dashboard should refresh from `results.json` and show:

- Experiment number
- Score and pass rate progression
- Baseline vs keep vs discard status
- Per-eval failure hotspots
- Current run state: running, idle, or complete

Use a single self-contained HTML file. Inline CSS/JS is preferred.

### Step 7: Log every experiment


Append after every run:

```markdown
## Experiment N — keep|discard

Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks
```

Discarded experiments matter. They stop future agents from repeating dead ends.


### Step 8: Deliver results


When the loop stops, report:

1. Baseline score to final score
2. Number of experiments run
3. Keep vs discard count
4. Top changes that helped most
5. Remaining failure patterns
6. Artifact locations

## Rules


- Do not run experiments before inputs and evals are defined
- Use the same test set for baseline and mutations
- Change one thing at a time
- Keep or discard by score, not by preference
- Record every attempt
- Stop only on manual stop, budget cap, or clear score plateau

## Output format


Expected artifacts:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

The improved skill stays in place at its original path.