# Skill Autoresearch
Use this skill to improve another skill through measured iteration instead of gut feel.
The job is simple: run the target skill on a small test set, score outputs with binary evals, change one thing in the prompt, and keep only mutations that improve the score. Repeat until the score plateaus, the budget cap is hit, or the user stops the loop.
## When to use this skill
- A skill works inconsistently and needs a repeatable improvement loop
- You want to benchmark a SKILL.md before editing it
- You need binary evals for prompt or skill quality
- You want a mutation log instead of ad-hoc rewriting
- You want to compare baseline vs improved prompt behavior
## Required inputs
Do not start experiments until all inputs below are known:

- Target skill path
- Three to five representative test inputs
- Three to six binary yes/no evals
- Runs per experiment (default: 5)
- Experiment interval (default: 2m)
- Optional budget cap

For writing reliable evals, read `references/eval-guide.md`.
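The checklist above can be enforced with a small guard before any experiment runs. A minimal sketch, assuming inputs arrive as a plain dict; the field names here are illustrative, not part of the skill's spec:

```python
# Refuse to start until every required input is present and within range.
def validate_inputs(config: dict) -> list[str]:
    problems = []
    if not config.get("skill_path"):
        problems.append("missing target skill path")
    inputs = config.get("test_inputs", [])
    if not 3 <= len(inputs) <= 5:
        problems.append("need three to five representative test inputs")
    evals = config.get("evals", [])
    if not 3 <= len(evals) <= 6:
        problems.append("need three to six binary evals")
    # Defaults stated by the skill: 5 runs per experiment, 2m interval.
    config.setdefault("runs_per_experiment", 5)
    config.setdefault("interval", "2m")
    return problems
```

An empty return value means the loop may start; anything else is reported back to the user before any run.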
## Instructions

### Step 1: Read the target skill
- Read the target `SKILL.md`
- Read any directly linked files under that skill's `references/`
- Identify the core job, required steps, output format, and likely failure modes
- Note buried instructions or conflicting rules before changing anything
### Step 2: Build the eval suite
Convert the user's quality criteria into binary checks only.

Use this format:

```text
EVAL 1: Short name
Question: Yes/no question about the output
Pass: Specific condition that counts as yes
Fail: Specific condition that counts as no
```

Rules:

- Use binary yes/no checks only
- Prefer observable checks over taste-based judgments
- Keep evals distinct; do not double-count the same failure
- Use three to six evals total
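One way to make each eval mechanically checkable is to pair the yes/no question with a predicate over the output text. A sketch under the assumption that outputs are plain strings; the eval names and conditions are invented examples:

```python
# Each eval is a named yes/no predicate over one output string.
EVALS = {
    "EVAL 1: Has heading": lambda out: out.lstrip().startswith("#"),
    "EVAL 2: Under limit": lambda out: len(out.split()) <= 200,
    "EVAL 3: No TODOs":    lambda out: "TODO" not in out,
}

def score_output(output: str) -> dict[str, bool]:
    # Returns a pass/fail verdict per eval -- binary only, no partial credit.
    return {name: bool(check(output)) for name, check in EVALS.items()}
```

Observable predicates like these keep scoring repeatable across runs, which taste-based judgments cannot guarantee.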
### Step 3: Create the experiment workspace
Inside the target skill folder, create:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

Requirements:

- `results.tsv` stores experiment summaries
- `results.json` powers the dashboard
- `dashboard.html` is a self-contained status page
- `SKILL.md.baseline` is the untouched original
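Creating the workspace can be scripted with the standard library alone. A sketch; it assumes the target `SKILL.md` already exists at the given path:

```python
import shutil
from pathlib import Path

def create_workspace(skill_dir: Path, skill_name: str) -> Path:
    # The experiment workspace lives inside the target skill folder.
    ws = skill_dir / f"skill-autoresearch-{skill_name}"
    ws.mkdir(exist_ok=True)
    for name in ("dashboard.html", "results.json", "results.tsv", "changelog.md"):
        (ws / name).touch()
    # Back up the untouched original before any mutation happens.
    src = skill_dir / "SKILL.md"
    if src.exists():
        shutil.copy(src, ws / "SKILL.md.baseline")
    return ws
```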
### Step 4: Establish the baseline
Run the target skill as-is before editing it.

- Back up the original skill as `SKILL.md.baseline`
- Run the skill N times on the same test inputs
- Score every run against every eval
- Record experiment 0 as the baseline
- If the baseline is already above 90 percent, confirm whether more optimization is worth it

Use this `results.tsv` header:

```text
experiment	score	max_score	pass_rate	status	description
```
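Scoring an experiment and producing its `results.tsv` row can be sketched as follows; the pass/fail verdicts are assumed to come from the eval suite, flattened across all runs:

```python
def summarize_experiment(exp_id: int, verdicts: list[bool],
                         status: str, desc: str) -> str:
    # score = total passes across all runs and all evals; max = total checks.
    score, max_score = sum(verdicts), len(verdicts)
    pass_rate = score / max_score if max_score else 0.0
    # Column order follows the results.tsv header above.
    return "\t".join([str(exp_id), str(score), str(max_score),
                      f"{pass_rate:.2f}", status, desc])
```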
### Step 5: Run the mutation loop
This is the core loop:
- Inspect the failing outputs
- Form one hypothesis about the failure
- Make one targeted change to `SKILL.md`
- Re-run the same test set
- Score all outputs again
- Keep the change only if the score improves
- Revert ties or regressions
- Append the result to `results.tsv`, `results.json`, and `changelog.md`
Good mutations:
- Clarify an ambiguous instruction
- Move a critical rule higher
- Add one anti-pattern for a recurring failure
- Add one focused example
- Remove a noisy instruction that causes overfitting
Bad mutations:
- Rewrite the whole skill at once
- Add many rules in one experiment
- Optimize for length instead of behavior
- Use intuition instead of measured score
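The keep/revert rule is strict: only a strictly higher score survives, and ties roll back just like regressions. A minimal sketch of that decision:

```python
def decide(baseline_score: int, mutated_score: int) -> str:
    # Strict improvement keeps the mutation; ties and regressions revert it,
    # so the skill never accumulates changes that did not measurably help.
    return "keep" if mutated_score > baseline_score else "discard"
```

Treating ties as discards is deliberate: a change that does not move the score only adds churn and risk.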
### Step 6: Keep the dashboard live
The dashboard should refresh from `results.json` and show:

- Experiment number
- Score and pass rate progression
- Baseline vs keep vs discard status
- Per-eval failure hotspots
- Current run state: running, idle, or complete
Use a single self-contained HTML file. Inline CSS/JS is preferred.
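On the writer side, `results.json` should be rewritten atomically after every experiment so the dashboard never reads a half-written file. A sketch; the payload field names are assumptions about what the dashboard consumes:

```python
import json
from pathlib import Path

def write_status(path: Path, experiments: list[dict], state: str) -> None:
    # Write to a sibling temp file, then rename over the real file.
    # The rename is atomic, so readers always see a complete JSON document.
    payload = {"state": state, "experiments": experiments}
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2))
    tmp.replace(path)
```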
### Step 7: Log every experiment
Append to `changelog.md` after every run:

```markdown
Experiment N — keep|discard
Score: X/Y
Change: one-sentence mutation summary
Reasoning: why this mutation was tried
Result: what improved or regressed
Remaining failures: what still breaks
```

Discarded experiments matter. They stop future agents from repeating dead ends.
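Appending an entry in the template above can be automated; a sketch with illustrative argument names:

```python
def changelog_entry(n: int, verdict: str, score: str, change: str,
                    reasoning: str, result: str, remaining: str) -> str:
    # One block per experiment, kept or discarded alike.
    return (f"Experiment {n} — {verdict}\n"
            f"Score: {score}\n"
            f"Change: {change}\n"
            f"Reasoning: {reasoning}\n"
            f"Result: {result}\n"
            f"Remaining failures: {remaining}\n\n")
```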
### Step 8: Deliver results
When the loop stops, report:
- Baseline score to final score
- Number of experiments run
- Keep vs discard count
- Top changes that helped most
- Remaining failure patterns
- Artifact locations
## Rules
- Do not run experiments before inputs and evals are defined
- Use the same test set for baseline and mutations
- Change one thing at a time
- Keep or discard by score, not by preference
- Record every attempt
- Stop only on manual stop, budget cap, or clear score plateau
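The "clear score plateau" stop condition can be made concrete by checking whether any recent experiment beat the best score achieved before them. A sketch; the five-experiment window is an assumption, not something the skill specifies:

```python
def plateaued(scores: list[int], window: int = 5) -> bool:
    # Stop when the last `window` experiments never beat the best score
    # achieved before that window began.
    if len(scores) <= window:
        return False
    best_before = max(scores[:-window])
    return max(scores[-window:]) <= best_before
```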
## Output format
Expected artifacts:

```text
skill-autoresearch-[skill-name]/
  dashboard.html
  results.json
  results.tsv
  changelog.md
  SKILL.md.baseline
```

The improved skill stays in place at its original path.