skillify

Skillify — The Meta Skill

Relationship to `/cross-modal-review`: That skill is the manual mid-flow "second opinion" gate (one model reviews work product before commit). This skill's Phase 3 below uses `gbrain eval cross-modal` instead: three different-provider frontier models score-and-iterate on a documented dimension list before tests cement behavior. Use `/cross-modal-review` for ad-hoc second opinions; use Phase 3 here when skillifying a feature.

Contract

A feature is "properly skilled" when all 11 checklist items pass. Item 3 (cross-modal eval) is informational in v1.1.0 — it does not gate the skillpack-check audit, but a missing or stale receipt is surfaced so the user knows where the gate stands.

The Checklist

□ 1.  SKILL.md           — skill file with frontmatter + contract + phases
□ 2.  Code               — deterministic script if applicable
□ 3.  Cross-modal eval   — 3 frontier models from 3 providers; informational
□ 4.  Unit tests         — cover every branch of deterministic logic
□ 5.  Integration tests  — exercise live endpoints
□ 6.  LLM evals          — quality/correctness cases for LLM-involving steps
□ 7.  Resolver trigger   — entry in skills/RESOLVER.md with real user trigger phrases
□ 8.  Resolver eval      — test that triggers route to this skill
□ 9.  Check-resolvable   — DRY + MECE audit, no orphans
□ 10. E2E test           — smoke test: trigger → side effect
□ 11. Brain filing       — if it writes pages, entry in brain/RESOLVER.md

Phase 0: Should This Be a Skill?

Before skillifying, check:
  • Will this be invoked 2+ times? (One-off work ≠ skill)
  • Is there >20 lines of logic? (Trivial helpers don't need full infrastructure)
  • Does it have a clear trigger phrase a user would actually say?
If no to all three, it's a script, not a skill. Move on.
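
The three questions above can be read as a simple predicate. Here is a minimal TypeScript sketch, assuming a hypothetical `Candidate` shape and `phase0` helper (neither name comes from gbrain). It follows the wording literally: the answer is "script" only when all three answers are no.

```typescript
// Hypothetical shape for illustration; not part of gbrain.
interface Candidate {
  expectedInvocations: number; // how many times the feature will be invoked
  linesOfLogic: number;        // rough size of the logic
  hasClearTrigger: boolean;    // is there a phrase a user would actually say?
}

// Returns "script" only when all three Phase 0 answers are "no".
function phase0(c: Candidate): "skill" | "script" {
  const invokedRepeatedly = c.expectedInvocations >= 2;
  const nonTrivial = c.linesOfLogic > 20;
  const anyYes = invokedRepeatedly || nonTrivial || c.hasClearTrigger;
  return anyYes ? "skill" : "script";
}
```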

Phase 1: Audit

Feature: [name]
Code: [path]
Missing items: [check each of the 11]

Phase 2: Write SKILL.md + Code (items 1-2)

SKILL.md frontmatter template (copy-paste):

```yaml
---
name: my-skill
version: 1.0.0
description: |
  One paragraph. What it does, when to use it.
triggers:
  - "trigger phrase users actually say"
  - "another real trigger"
tools:
  - exec
  - read
  - write
mutating: false  # true if it writes to brain/disk
---
```

Body must include: Contract (what it guarantees), Phases (step-by-step), Output Format (what it produces).
Extract deterministic code into `scripts/*.ts`.
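
The body requirements lend themselves to a quick structural check before moving on to Phase 3. Below is a rough sketch, not the check gbrain actually performs. It assumes the required sections appear as markdown headings containing "Contract", "Phase", and "Output Format", and `missingParts` is a hypothetical helper name.

```typescript
// Assumption: required sections are markdown headings containing these words.
const REQUIRED_SECTIONS = ["Contract", "Phase", "Output Format"];

// Returns the names of the pieces a SKILL.md draft still lacks.
function missingParts(skillMd: string): string[] {
  const lines = skillMd.split("\n");
  const missing: string[] = [];
  // Frontmatter is a leading '---' line closed by a later '---' line.
  const close =
    lines[0]?.trim() === "---"
      ? lines.findIndex((l, i) => i > 0 && l.trim() === "---")
      : -1;
  if (close === -1) missing.push("frontmatter");
  // Only look for headings in the body, after the frontmatter (if any).
  const headings = lines.slice(close + 1).filter((l) => /^#{1,6}\s/.test(l));
  for (const section of REQUIRED_SECTIONS) {
    if (!headings.some((h) => h.includes(section))) missing.push(section);
  }
  return missing;
}
```

An empty result means the draft has frontmatter and all three sections; anything else lists what to add.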

Phase 3: Cross-Modal Eval (item 3) — THE QUALITY GATE

Why this comes before tests

Tests lock in behavior. If the behavior is mediocre, tests lock in mediocrity. Cross-modal eval proves the quality bar FIRST, then tests cement it.

Step 1: Pick a representative input

Choose the input that exercises the skill's hardest documented use case. If unsure: use the primary trigger example from SKILL.md, or the most complex real-world input from the last 7 days of memory files.

Step 2: Run the skill, capture output

Run the skill on the representative input. The OUTPUT FILE is what gets evaluated.

Step 3: Run the eval gate

```bash
gbrain eval cross-modal \
  --task "What this skill is supposed to accomplish" \
  --output skills/<slug>/SKILL.md
```

The command runs 3 frontier models from 3 different providers in parallel, scores the OUTPUT against the TASK on 5 documented dimensions, and writes a receipt under `~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json` (the sha-8 binds the receipt to the current SKILL.md content; re-running after edits writes a new receipt).

Default models (override per slot via `--slot-a-model`, `--slot-b-model`, `--slot-c-model`):

| Slot | Default | Provider |
|------|---------|----------|
| A | `openai:gpt-4o` | OpenAI |
| B | `anthropic:claude-opus-4-7` | Anthropic |
| C | `google:gemini-1.5-pro` | Google |

These MUST be frontier models from DIFFERENT providers. Using a single provider's family or budget models defeats the purpose, because different families have less correlated blind spots. Refresh the list when a new model generation ships.
Pass criteria (BOTH must be true):
  1. Every dimension's mean across successful models ≥ 7.
  2. No single model scored any dimension < 5 (the floor).
Inconclusive: fewer than 2 of 3 models returned parseable scores. Receipt is still written (forensics) but the gate is not authoritative. Exit code 2; CI wrappers should treat this as "did not run cleanly", not "failed quality gate".
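
The pass, fail, and inconclusive rules above can be written as one small function. This is a sketch of the stated rules only, not gbrain's implementation. The `Scores` shape and `gateVerdict` name are assumptions, and it further assumes every successful model reports the same dimensions.

```typescript
type Scores = Record<string, number>; // dimension name -> score
type Verdict = "pass" | "fail" | "inconclusive";

const MEAN_THRESHOLD = 7; // every dimension's mean must reach this
const FLOOR = 5;          // no single score may fall below this

// `results` has one entry per model; null marks a model whose reply
// could not be parsed into scores.
function gateVerdict(results: (Scores | null)[]): Verdict {
  const ok = results.filter((r): r is Scores => r !== null);
  if (ok.length < 2) return "inconclusive";
  const dimensions = Object.keys(ok[0] ?? {});
  for (const dim of dimensions) {
    // Assumption: a dimension missing from one model counts as 0.
    const values = ok.map((r) => r[dim] ?? 0);
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    if (mean < MEAN_THRESHOLD) return "fail";
    if (values.some((v) => v < FLOOR)) return "fail";
  }
  return "pass";
}
```

Keeping "inconclusive" as a separate verdict, rather than folding it into "fail", lets a CI wrapper tell "did not run cleanly" apart from "failed the quality gate".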

Step 4: Cycle until you pass (≤3 cycles)

CYCLE 1:
  Eval → scores + top 10 improvements
  IF pass: → done, write tests
  ELSE:
    Apply top 10 improvements to the actual file
    Log: which improvements applied, what changed

CYCLE 2:
  Re-eval the FIXED output (same 3 models, same dimensions)
  Compare: before/after scores per dimension (track delta)
  IF pass: → done, write tests
  ELSE: apply remaining improvements + new ones

CYCLE 3 (final):
  Re-eval
  IF pass: → ship
  ELSE: → ship with KNOWN_GAPS section listing:
    - Which dimensions are still below 7
    - Which improvements couldn't be resolved
    - Why (e.g., "would require architectural change")
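
The cycle logic above amounts to a bounded loop. The sketch below shows only the control flow. `runEval` and `applyImprovements` are hypothetical callbacks standing in for the eval command and the actual file edits, and they are synchronous here purely to keep the example short.

```typescript
interface EvalResult {
  pass: boolean;
  improvements: string[]; // suggested fixes, best first
}

type Outcome =
  | { status: "pass"; cycles: number }
  | { status: "ship-with-known-gaps"; cycles: number; unresolved: string[] };

// Both callbacks are placeholders: `runEval` would wrap the eval command,
// `applyImprovements` would edit the actual file.
function iterate(
  runEval: () => EvalResult,
  applyImprovements: (items: string[]) => void,
  maxCycles = 3,
): Outcome {
  let last: EvalResult = { pass: false, improvements: [] };
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    last = runEval();
    if (last.pass) return { status: "pass", cycles: cycle };
    // No fixes after the final cycle; what remains becomes KNOWN_GAPS.
    if (cycle < maxCycles) applyImprovements(last.improvements.slice(0, 10));
  }
  return { status: "ship-with-known-gaps", cycles: maxCycles, unresolved: last.improvements };
}
```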

Cycles + cost guardrails

  • Default `--cycles 3` in TTY, `--cycles 1` in non-TTY (limits scripted bulk spend in CI loops).
  • The command prints an estimated max-cost-per-cycle from a small pricing constant before each run. Real cost varies with prompt size; treat the estimate as a ceiling for default `--max-tokens 4000`.
  • A `--budget-usd N` hard cap is a v0.27.x follow-up TODO.

Provider configuration

Models resolve through the gbrain AI gateway. Configure once with:

```bash
gbrain providers test    # see what's configured
gbrain config            # set keys
```

Or set env vars: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_GENERATIVE_AI_API_KEY`, `TOGETHER_API_KEY`, etc. The gateway reads from `~/.gbrain/config.json` plus `process.env`.

Cost expectations

3 cycles × 3 models = 9 frontier calls max per run. With Opus-class + GPT-4o-class + Gemini-1.5-Pro, expect $1–3 per full run on default `--max-tokens 4000`. Receipts include the per-call model identifiers so you can audit retroactively.

Skip cross-modal eval when:

  • Output is < 200 tokens (trivial — not worth 9 API calls).
  • The skill is a thin wrapper around a single API call (one cycle is enough).

Phase 4: Tests (items 4-6)

NOW that eval has proven quality, write tests that lock it in:
  • Unit tests: every branch of deterministic logic. Mock external calls.
  • Integration tests: hit real endpoints. Catch bugs mocks hide.
  • LLM evals: quality/correctness for LLM steps. Lighter than cross-modal eval; test specific behaviors.

Phase 5: Resolver + Check-Resolvable (items 7-9)

  1. Add to skills/RESOLVER.md with trigger phrases users ACTUALLY type
  2. Resolver eval: feed triggers, assert correct routing
  3. Check-resolvable:
    • Skill reachable from skills/RESOLVER.md (not orphaned)
    • No MECE overlap with other skills
    • No DRY violations (shared logic in lib/, not copy-pasted)
    • No ambiguous trigger routing
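
Two of the check-resolvable conditions, orphaned skills and ambiguous trigger routing, can be illustrated on an in-memory model of the resolver. This is a sketch under assumptions. The `Resolver` shape is hypothetical and says nothing about how `skills/RESOLVER.md` is parsed, and treating triggers as case-insensitive is a choice made here for illustration.

```typescript
// Hypothetical in-memory view of a resolver: skill name -> trigger phrases.
type Resolver = Record<string, string[]>;

interface ResolverReport {
  orphans: string[];   // known skills with no trigger in the resolver
  ambiguous: string[]; // trigger phrases claimed by more than one skill
}

function checkResolver(resolver: Resolver, knownSkills: string[]): ResolverReport {
  const owners = new Map<string, Set<string>>();
  for (const skill of Object.keys(resolver)) {
    for (const raw of resolver[skill] ?? []) {
      // Assumption: triggers match case-insensitively, ignoring outer spaces.
      const phrase = raw.trim().toLowerCase();
      const set = owners.get(phrase) ?? new Set<string>();
      set.add(skill);
      owners.set(phrase, set);
    }
  }
  const ambiguous = Array.from(owners.entries())
    .filter(([, skills]) => skills.size > 1)
    .map(([phrase]) => phrase)
    .sort();
  const orphans = knownSkills.filter((s) => (resolver[s] ?? []).length === 0);
  return { orphans, ambiguous };
}
```

A resolver eval can then assert that both lists are empty before the skill ships.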

Phase 6: E2E + Brain Filing (items 10-11)

  • E2E smoke: full pipeline from trigger to side effect
  • Brain filing: add to brain/RESOLVER.md if the skill writes brain pages

Phase 7: Verify

```bash
bun test test/<skill>.test.ts                    # unit tests
gbrain skillify check skills/<slug>/scripts/<slug>.mjs --json | \
  jq '.[] | .items[] | select(.name | contains("Cross-modal"))'
ls ~/.gbrain/.gbrain/eval-receipts/              # receipt landed
gbrain check-resolvable --json | jq .ok          # resolver clean
```

Worked Example: Skillifying a "summarize-pr" Feature

Phase 0: Yes — invoked weekly, 50+ lines, clear trigger "summarize this PR"
Phase 1: Audit → SKILL.md missing, no tests, no resolver entry. Score: 1/11
Phase 2: Write SKILL.md + extract script to scripts/summarize-pr.ts
Phase 3: Cross-modal eval cycle 1 →
  GPT-4o: goal=6, depth=5, specificity=4 → "misses file-level diffs"
  Opus 4.7: goal=7, depth=6, specificity=5 → "no test plan in summary"
  Gemini 1.5 Pro: goal=6, depth=5, specificity=5 → "template feels generic"
  Aggregate: goal=6.3 FAIL, depth=5.3 FAIL, specificity=4.7 FAIL (GPT-4o's specificity=4 is also below the floor of 5)
  Top improvements: add file-level changes, include test plan, use PR context
  → Apply fixes → Cycle 2: goal=8, depth=7.5, specificity=7 → PASS
Phase 4: Write 12 unit tests locking in the improved behavior
Phase 5: Add "summarize this PR" trigger to skills/RESOLVER.md
Phase 6: E2E test: feed a real PR URL → verify brain page created
Phase 7: All green. Score: 11/11

Quality Gates

NOT properly skilled until:
  • All required items pass (1-2, 4-10; 11 only when applicable).
  • Cross-modal eval (item 3) has a current receipt OR is explicitly waived with rationale (item 3 is informational; not blocking, but a missing receipt is visible in the audit).
  • All tests pass (unit + integration + LLM evals).
  • Resolver entry exists with real trigger phrases.
  • Check-resolvable shows no orphans, overlaps, or DRY violations.
  • Brain filing if applicable.
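
The first two bullets above describe which checklist items actually gate the verdict. Here is a small sketch of that rule, with a hypothetical `isProperlySkilled` helper. It is not the logic of `gbrain skillify check`, only the list above restated in code.

```typescript
const REQUIRED = [1, 2, 4, 5, 6, 7, 8, 9, 10]; // always gate the verdict
const BRAIN_FILING = 11;                        // gates only if the skill writes brain pages
// Item 3 (cross-modal eval) is informational and deliberately absent here.

// `passed` lists the checklist item numbers that currently pass.
function isProperlySkilled(passed: number[], writesBrainPages: boolean): boolean {
  const gating = writesBrainPages ? [...REQUIRED, BRAIN_FILING] : REQUIRED;
  return gating.every((item) => passed.indexOf(item) !== -1);
}
```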

Output Format

Skillify produces three durable artifacts per skill:
  1. The skill tree on disk: `skills/<slug>/SKILL.md`, `scripts/<slug>.mjs`, `routing-eval.jsonl`, plus a `test/<slug>.test.ts` skeleton. Generated by `gbrain skillify scaffold <name>` and refined by the human/agent into a real implementation.
  2. A cross-modal eval receipt at `~/.gbrain/.gbrain/eval-receipts/<slug>-<sha8>.json`. The sha-8 binds the receipt to the current `SKILL.md` content. `gbrain skillify check` surfaces the status (`found` / `stale` / `missing`) as informational.
  3. An audit verdict from `gbrain skillify check`: `properly skilled` | `close — create: <missing items>` | `needs skillify — run /skillify on <target>`. Score is `<passed>/<total>`. Required items gate the verdict; item 3 (cross-modal eval) is informational and never blocks PASS.
JSON output (`gbrain skillify check --json`) includes the same fields plus the per-item detail string, so agents can route on the structured envelope without parsing prose.
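
The `found` / `stale` / `missing` receipt status can be pictured as a lookup over receipt file names. The sketch below is one plausible reading and not gbrain's code. It assumes the sha-8 is eight lowercase hex characters, and that `stale` means a receipt exists for the slug but was written for different SKILL.md content. How the sha-8 is computed is left out.

```typescript
type ReceiptStatus = "found" | "stale" | "missing";

// Receipt files are named <slug>-<sha8>.json. `currentSha8` is the sha-8 of
// the current SKILL.md; computing it is outside this sketch.
function receiptStatus(slug: string, currentSha8: string, files: string[]): ReceiptStatus {
  const prefix = `${slug}-`;
  const suffix = ".json";
  // Keep only files that are exactly <slug>-<8 hex chars>.json, so that a
  // slug like "summarize" does not match receipts for "summarize-pr".
  const mine = files.filter(
    (f) =>
      f.startsWith(prefix) &&
      f.endsWith(suffix) &&
      /^[0-9a-f]{8}$/.test(f.slice(prefix.length, f.length - suffix.length)),
  );
  if (mine.indexOf(`${prefix}${currentSha8}${suffix}`) !== -1) return "found";
  return mine.length > 0 ? "stale" : "missing";
}
```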

Anti-Patterns

  • ❌ Writing tests before cross-modal eval (locks in mediocrity)
  • ❌ Using budget models for eval (C student grading A student)
  • ❌ Using a single provider's family for all 3 slots (correlated blind spots)
  • ❌ Skipping eval "because the output looks fine" (your judgment isn't 3 models)
  • ❌ Eval without fix cycle (vanity metrics)
  • ❌ Code with no SKILL.md (invisible to resolver)
  • ❌ Tests that reimplement production code (masks real bugs)
  • ❌ Resolver entry with internal jargon (must mirror real user language)
  • ❌ Two skills doing the same thing (merge or kill one)
  • ❌ Running cross-modal eval on trivial outputs (< 200 tokens, not worth 9 API calls)