skill-autoresearch


# Skill Autoresearch


Use this skill to improve another skill through measured iteration instead of gut feel.
The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.

## When to use this skill


- A skill works inconsistently and needs a repeatable improvement loop
- You want to benchmark a `SKILL.md` before editing it
- You need binary evals for prompt or skill quality
- You want a mutation log instead of ad-hoc rewriting
- You want to compare baseline vs improved prompt behavior

## Required inputs


Do not start experiments until all inputs below are known:

1. Target skill path
2. Three to five representative test inputs
3. Three to six binary yes/no evals
4. Runs per experiment (default: 5)
5. Experiment interval (default: 2m)
6. Optional budget cap

For writing reliable evals, read `references/eval-guide.md`.
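The inputs above can be collected into one config object before any experiment runs, so the defaults and range checks live in one place. A minimal sketch in Python; the class and field names here are illustrative, not part of the skill:

```python
# Collect all required inputs up front; refuse to start without them.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResearchConfig:
    skill_path: str                     # 1. target skill path
    test_inputs: list                   # 2. three to five representative inputs
    evals: list                         # 3. three to six binary yes/no evals
    runs_per_experiment: int = 5        # 4. default 5
    interval: str = "2m"                # 5. default 2m
    budget_cap: Optional[float] = None  # 6. optional cap

    def validate(self):
        """Enforce the input ranges before the loop starts."""
        assert 3 <= len(self.test_inputs) <= 5, "need 3-5 test inputs"
        assert 3 <= len(self.evals) <= 6, "need 3-6 evals"
```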

## Instructions


### Step 1: Read the target skill


1. Read the target `SKILL.md`
2. Read any directly linked files under that skill's `references/` directory
3. Identify the core job, required steps, output format, and likely failure modes
4. Note buried instructions or conflicting rules before changing anything

### Step 2: Build the eval suite


Convert the user's quality criteria into binary checks only.
Use this format:

```text
EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no
```

Rules:

- Use binary yes/no checks only
- Prefer observable checks over taste-based judgments
- Keep evals distinct; do not double-count the same failure
- Use three to six evals total
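The eval format above maps naturally onto data plus a predicate, which makes scoring mechanical. A Python sketch; the two example evals and their pass conditions are hypothetical, not prescribed by this skill:

```python
# Represent each binary eval as a name, a yes/no question, and a pass predicate.

def make_eval(name, question, check):
    """Bundle one eval's short name, question, and pass condition."""
    return {"name": name, "question": question, "check": check}


EVALS = [
    make_eval(
        "Has heading",
        "Does the output start with a markdown heading?",
        lambda out: out.lstrip().startswith("#"),
    ),
    make_eval(
        "Under limit",
        "Is the output 500 words or fewer?",
        lambda out: len(out.split()) <= 500,
    ),
]


def score(output):
    """Return (passes, total) for one output against every eval."""
    passes = sum(1 for e in EVALS if e["check"](output))
    return passes, len(EVALS)
```

Because each check is a yes/no predicate, every run scores the same way and no eval can double-count another's failure.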

### Step 3: Create the experiment workspace


Inside the target skill folder, create:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

Requirements:

- `results.tsv` stores experiment summaries
- `results.json` powers the dashboard
- `dashboard.html` is a self-contained status page
- `SKILL.md.baseline` is the untouched original

### Step 4: Establish the baseline


Run the target skill as-is before editing it.

1. Back up the original skill as `SKILL.md.baseline`
2. Run the skill N times on the same test inputs
3. Score every run against every eval
4. Record experiment 0 as the baseline
5. If the baseline is already above 90 percent, confirm whether more optimization is worth it

Use this `results.tsv` header:

```text
experiment	score	max_score	pass_rate	status	description
```
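A row under that header can be assembled mechanically after each experiment; only the pass-rate arithmetic is fixed, the field values below are illustrative:

```python
# Build one tab-separated results row matching the results.tsv header.

def tsv_row(experiment, score, max_score, status, description):
    """Format an experiment summary; pass_rate is score/max_score as a percent."""
    pass_rate = f"{100 * score / max_score:.0f}%"
    return "\t".join(
        [str(experiment), str(score), str(max_score), pass_rate, status, description]
    )
```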

### Step 5: Run the mutation loop


This is the core loop:

1. Inspect the failing outputs
2. Form one hypothesis about the failure
3. Make one targeted change to `SKILL.md`
4. Re-run the same test set
5. Score all outputs again
6. Keep the change only if the score improves
7. Revert ties or regressions
8. Append the result to `results.tsv`, `results.json`, and `changelog.md`

Good mutations:

- Clarify an ambiguous instruction
- Move a critical rule higher
- Add one anti-pattern for a recurring failure
- Add one focused example
- Remove a noisy instruction that causes overfitting

Bad mutations:

- Rewrite the whole skill at once
- Add many rules in one experiment
- Optimize for length instead of behavior
- Use intuition instead of measured score
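The keep-or-discard decision at the heart of the loop fits in a few lines. A sketch, assuming a `run_and_score` helper (hypothetical here) that runs a skill text on the test set and returns total passes across all runs and evals:

```python
# Decide whether to keep a single mutation, by measured score only.

def try_mutation(current_text, mutated_text, run_and_score):
    """Return (text to keep, status). Ties and regressions revert."""
    baseline = run_and_score(current_text)
    candidate = run_and_score(mutated_text)
    # Keep only strict improvements; a tie is a revert, per the loop rules.
    if candidate > baseline:
        return mutated_text, "keep"
    return current_text, "discard"
```

Note the strict `>`: a tie counts as a regression of effort, so the change is reverted rather than kept on preference.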

### Step 6: Keep the dashboard live


The dashboard should refresh from `results.json` and show:

- Experiment number
- Score and pass rate progression
- Baseline vs keep vs discard status
- Per-eval failure hotspots
- Current run state: running, idle, or complete

Use a single self-contained HTML file. Inline CSS/JS is preferred.

### Step 7: Log every experiment


Append after every run:

```markdown
## Experiment N — keep|discard

Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks
```

Discarded experiments matter. They stop future agents from repeating dead ends.


### Step 8: Deliver results


When the loop stops, report:

1. Baseline score to final score
2. Number of experiments run
3. Keep vs discard count
4. Top changes that helped most
5. Remaining failure patterns
6. Artifact locations

## Rules


- Do not run experiments before inputs and evals are defined
- Use the same test set for baseline and mutations
- Change one thing at a time
- Keep or discard by score, not by preference
- Record every attempt
- Stop only on manual stop, budget cap, or clear score plateau

## Output format


Expected artifacts:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

The improved skill stays in place at its original path.