beo-writing-skills

Writing Skills

Overview

Skills are code. They have bugs. Test them before deploying.
This is the TDD-for-skills methodology adapted from Superpowers (N=28,000 scale testing confirms persuasion-optimized skills produce 3-4x better agent compliance than plain instructions).
THE IRON LAW: NO SKILL WITHOUT A FAILING TEST FIRST. Write skill before testing? Delete it. Start over. No exceptions. Not for "simple additions," not for "just a section," not for "reference only."

The Core Cycle: RED -> GREEN -> REFACTOR

| TDD Concept | Skill Equivalent |
|---|---|
| Test case | Pressure scenario with subagent |
| Production code | SKILL.md |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes, maintain compliance |

PHASE 1: RED: Write the Failing Test

HARD-GATE: Do not write any skill content until you complete this phase.

Minimum Pressure-Test Set

Before calling a skill "tested", run at least 3 pressure scenarios that are realistic enough to tempt failure:
  • a time-pressure scenario
  • an ambiguity scenario
  • a convenience/rationalization scenario
A real RED failure means the agent had a fair chance to choose correctly and still violated the intended rule. A real GREEN pass means the agent follows the skill under pressure without needing hidden help from the evaluator.
Load references/writing-skills-operations.md for the exact RED/GREEN/REFACTOR execution steps and validation/documentation flow.
Teams that skip baseline testing consistently deploy skills with predictable, preventable failures.
What to record:
Scenario: [name]
Combined pressures: [list]
Exact violation: [what agent chose]
Exact rationalization (verbatim): "[quote]"
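
As a rough illustration of the record above, the fields could be held in a small structure that also checks whether the three required scenario kinds are covered. This is a sketch only: `BaselineRecord`, `missing_kinds`, `format_record`, and the kind labels are hypothetical names, not part of the skill or its references.

```python
from dataclasses import dataclass

# Labels for the three scenario kinds in the minimum set (assumed spelling).
REQUIRED_KINDS = ("time-pressure", "ambiguity", "convenience/rationalization")


@dataclass
class BaselineRecord:
    """One RED-phase observation, mirroring the 'What to record' fields."""
    scenario: str
    kind: str
    combined_pressures: list
    exact_violation: str
    exact_rationalization: str  # verbatim quote from the agent


def missing_kinds(records):
    """Return the required scenario kinds that no record covers yet."""
    covered = {r.kind for r in records}
    return [k for k in REQUIRED_KINDS if k not in covered]


def format_record(r):
    """Render a record in the same four-line layout as the template."""
    return "\n".join([
        f"Scenario: {r.scenario}",
        f"Combined pressures: {', '.join(r.combined_pressures)}",
        f"Exact violation: {r.exact_violation}",
        f'Exact rationalization (verbatim): "{r.exact_rationalization}"',
    ])
```

A skill would only count as baseline-tested once `missing_kinds(records)` returns an empty list.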

PHASE 2: GREEN: Write the Minimal Skill

Write SKILL.md addressing the specific rationalizations documented in RED only. Do not add content for hypothetical cases you didn't observe. Hypothetical content bloats the skill and gets skipped.
Use references/writing-skills-operations.md for the exact rerun and validation flow.
SKILL.md checklist:
  • YAML frontmatter starts on line 1 (---)
  • name: letters/numbers/hyphens/slashes only, matches directory path
  • description: starts with "Use when...", triggering conditions ONLY, no workflow summary
  • Description is third-person, <=1024 chars
  • Body < 400 lines (move details to references/)
  • Uses persuasion principles (see table below)
  • HARD-GATE markers on critical stops
  • references/ files never nested more than one level deep
Description trap (most common mistake): Workflow summary in description -> Claude follows description instead of reading skill body. Every time.
```yaml
# BAD: workflow summary
description: Use when creating skills -- run baseline test, write minimal skill, run tests

# GOOD: triggering conditions only
description: Use when creating a new beo skill or editing an existing one
```
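
The mechanically checkable items in the SKILL.md checklist could be linted with something like the sketch below. It is not the validation command from references/writing-skills-operations.md. `check_skill_md` is a hypothetical helper, the frontmatter parsing is deliberately naive (single-line `key: value` pairs only), and "matches directory path" is assumed to mean string equality with a path the caller passes in.

```python
import re

# Letters, numbers, hyphens, and slashes only, per the checklist.
NAME_RE = re.compile(r"^[A-Za-z0-9/-]+$")


def check_skill_md(text, dir_path):
    """Return a list of checklist violations for the contents of a SKILL.md."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["YAML frontmatter must start on line 1 with ---"]
    try:
        end = lines.index("---", 1)  # closing frontmatter delimiter
    except ValueError:
        return ["frontmatter is never closed with ---"]

    # Naive parsing: a real linter should use a YAML parser instead.
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    problems = []
    name = meta.get("name", "")
    if not NAME_RE.match(name):
        problems.append("name: letters/numbers/hyphens/slashes only")
    elif name != dir_path:
        problems.append("name must match directory path")

    desc = meta.get("description", "")
    if not desc.startswith("Use when"):
        problems.append('description must start with "Use when"')
    if len(desc) > 1024:
        problems.append("description exceeds 1024 chars")

    if len(lines[end + 1:]) >= 400:
        problems.append("body must be under 400 lines")
    return problems
```

A check like this cannot tell whether a description contains a workflow summary or is written in the third person. Those items still need a human read.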

**Apply persuasion principles with restraint:**

| Principle | Implementation | Use For |
|---|---|---|
| **Authority** | crisp prohibitions only where failure is genuinely costly or irreversible | hard safety rails and invariant-preserving rules |
| **Commitment** | ordered checklists, announce skill usage | multi-step processes |
| **Scarcity** | "before proceeding", "immediately after X" | verification requirements |
| **Social Proof** | "teams report...", evidence-backed failure patterns | repeated, observed mistakes |
| **Unity** | "our skills", collaborative framing | techniques, guidance |

Prefer explanation-first guidance when the model needs judgment. Reserve hard-edged wording for places where ambiguity reliably causes failure. Overusing absolutist language makes skills louder, not better.

After writing: re-run the same pressure scenarios WITH the skill. Agent must now comply.
If agent still fails -> skill is unclear or incomplete. Revise and re-test. Do not proceed.
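
The GREEN re-run rule can be expressed as a small gate over the recorded outcomes. The sketch below assumes each scenario's result has already been reduced to a boolean by whoever evaluated the run. `green_gate` and its argument names are hypothetical.

```python
def green_gate(baseline_violated, with_skill_complied):
    """Compare RED and GREEN runs of the same scenarios.

    baseline_violated:   scenario name -> True if the agent broke the rule
                         WITHOUT the skill (a real RED failure).
    with_skill_complied: scenario name -> True if the agent followed the
                         rule WITH the skill present.
    Returns a list of blocking problems; an empty list means GREEN.
    """
    problems = []
    for scenario, violated in baseline_violated.items():
        if not violated:
            problems.append(f"{scenario}: no RED failure in baseline, not a real test")
        if scenario not in with_skill_complied:
            problems.append(f"{scenario}: not re-run with the skill")
        elif not with_skill_complied[scenario]:
            problems.append(f"{scenario}: still fails with the skill, revise and re-test")
    return problems
```

Any non-empty result means the skill stays in this phase.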

---

PHASE 3: REFACTOR: Close Loopholes

When an agent violates a rule despite having the skill, that is a test regression. The skill has a bug. Fix it using the loop in references/writing-skills-operations.md.
Continue until no new rationalizations emerge from pressure testing.

Watch for Overfitting

The skill is overfit if it only passes the exact scenarios you wrote but fails small wording changes, adjacent contexts, or competing pressures. If a revision makes one scenario pass by becoming unnaturally specific, broaden the instruction back to the principle the scenario was testing.
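
One way to make the overfitting check concrete is to re-run each scenario in a few variants and flag scenarios whose original wording passes while a variant fails. The sketch below assumes those pass/fail outcomes are already recorded. `overfit_scenarios` and the variant names are hypothetical.

```python
def overfit_scenarios(results):
    """Flag scenarios that pass as written but fail under variation.

    results maps scenario name -> {"original": bool, "variants": {name: bool}},
    where each bool is True when the agent complied. Variants stand for small
    wording changes, adjacent contexts, or competing pressures.
    Returns scenario name -> list of failing variant names.
    """
    flagged = {}
    for scenario, outcome in results.items():
        failed = [name for name, ok in outcome["variants"].items() if not ok]
        if outcome["original"] and failed:
            flagged[scenario] = failed
    return flagged
```

A flagged scenario suggests the last revision was too specific and should be broadened back to the principle being tested.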

PHASE 4: VALIDATE & DOCUMENT

Use references/writing-skills-operations.md for the validation command and documentation flow. Create CREATION-LOG.md from the template in references/creation-log-template.md.
Signs the skill IS bulletproof:
  • Agent chooses correct option under maximum pressure
  • Agent cites specific skill sections as justification
  • Agent acknowledges temptation but follows rule
  • Meta-test reveals: "skill was clear, I should follow it"
Signs the skill is NOT bulletproof:
  • Agent finds rationalizations not addressed in the skill
  • Agent argues the skill itself is wrong
  • Agent creates "hybrid approaches" that satisfy letter but not spirit

Rationalization Table (Common Violations)

| Excuse | Reality |
|---|---|
| "I know this technique, testing is unnecessary" | You're testing the SKILL, not your knowledge. Agents differ from you. |
| "It's so simple it can't have bugs" | Every untested skill has issues. Test takes 30 minutes. |
| "Academic questions passed, that's sufficient" | Reading a skill != using a skill under pressure. Test application scenarios. |
| "My description summarizes the workflow so agents know what to do" | Workflow-summary descriptions cause agents to skip the skill body. Remove it. |
| "This edit is minor, testing isn't needed" | The Iron Law applies to edits. No exceptions. |
| "I'll test it after a few real uses" | Problems = agents misuse in production. Test BEFORE deploying. |
| "The baseline is obvious, I know what failures to expect" | You know YOUR failures. Agent failures differ. Run the baseline. |

Red Flags: STOP and Run Baseline Tests

  • Writing skill content before creating any pressure scenarios
  • "I already know what agents will do"
  • "It's just a small addition"
  • "Academic questions passed, that's sufficient testing"
  • Description contains workflow steps or process summary
  • Skill addresses hypothetical scenarios not observed in baseline
  • Deploying without running scenarios WITH skill (no green verification)
  • "The skill was good last month, edits don't need testing"
All of these mean: Stop. Run baseline tests first.

Context Budget

If context usage exceeds 65%, use references/writing-skills-operations.md together with ../reference/references/state-and-handoff-protocol.md for the canonical checkpoint behavior.

References

Load when needed:
  • references/creation-log-template.md: CREATION-LOG.md template for documenting the TDD process
  • references/pressure-test-template.md: Pressure scenario templates and the 7 pressure types