test-skill

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Testing Skills With Subagents

使用Subagents测试Skill

Test skill provided by user or developed before.
测试用户提供或之前开发的Skill。

Overview

概述

Testing skills is just TDD applied to process documentation.
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
REQUIRED BACKGROUND: You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
Complete worked example: See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
测试Skill其实就是将TDD应用于流程文档。
你先在不使用Skill的情况下运行场景(RED阶段——观察Agent失败),编写Skill解决这些问题(GREEN阶段——观察Agent合规),然后修复漏洞(REFACTOR阶段——保持合规)。
核心原则: 如果你没有观察过Agent在不使用Skill时的失败情况,你就不知道该Skill是否能预防正确的问题。
必备背景知识: 在使用本Skill之前,你必须掌握superpowers:test-driven-development这个Skill。该Skill定义了基础的RED-GREEN-REFACTOR循环。本Skill提供了Skill专属的测试格式(压力场景、合理化借口表)。
完整示例: 查看examples/CLAUDE_MD_TESTING.md,里面有一个测试CLAUDE.md文档变体的完整测试活动。

When to Use

使用场景

Test skills that:
  • Enforce discipline (TDD, testing requirements)
  • Have compliance costs (time, effort, rework)
  • Could be rationalized away ("just this once")
  • Contradict immediate goals (speed over quality)
Don't test:
  • Pure reference skills (API docs, syntax guides)
  • Skills without rules to violate
  • Skills agents have no incentive to bypass
当你需要测试以下类型的Skill时使用:
  • 强制执行纪律的Skill(如TDD、测试要求)
  • 有合规成本的Skill(时间、精力、返工成本)
  • 可能被找借口绕过的Skill(比如“就这一次”)
  • 与即时目标冲突的Skill(比如速度优先于质量)
无需测试:
  • 纯参考类Skill(API文档、语法指南)
  • 没有规则可供违反的Skill
  • Agent没有动力绕过的Skill

TDD Mapping for Skill Testing

Skill测试的TDD映射

TDD PhaseSkill TestingWhat You Do
REDBaseline testRun scenario WITHOUT skill, watch agent fail
Verify REDCapture rationalizationsDocument exact failures verbatim
GREENWrite skillAddress specific baseline failures
Verify GREENPressure testRun scenario WITH skill, verify compliance
REFACTORPlug holesFind new rationalizations, add counters
Stay GREENRe-verifyTest again, ensure still compliant
Same cycle as code TDD, different test format.
TDD阶段Skill测试对应环节操作内容
RED基准测试不使用Skill运行场景,观察Agent失败
Verify RED记录合理化借口逐字记录具体的失败情况
GREEN编写Skill解决基准测试中发现的具体问题
Verify GREEN压力测试使用Skill运行场景,验证合规性
REFACTOR修复漏洞发现新的合理化借口,添加应对措施
Stay GREEN重新验证再次测试,确保仍保持合规
和代码TDD的循环相同,只是测试格式不同。

RED Phase: Baseline Testing (Watch It Fail)

RED阶段:基准测试(观察失败)

Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.
This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.
Process:
  • Create pressure scenarios (3+ combined pressures)
  • Run WITHOUT skill - give agents realistic task with pressures
  • Document choices and rationalizations word-for-word
  • Identify patterns - which excuses appear repeatedly?
  • Note effective pressures - which scenarios trigger violations?
Example:
markdown
IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a feature. It's working perfectly.
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you didn't write tests.

Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)

Choose A, B, or C.
Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
  • "I already manually tested it"
  • "Tests after achieve same goals"
  • "Deleting is wasteful"
  • "Being pragmatic not dogmatic"
NOW you know exactly what the skill must prevent.
目标: 不使用Skill运行测试——观察Agent失败,逐字记录具体问题。
这和TDD中“先编写失败的测试”完全一致——在编写Skill之前,你必须先了解Agent的自然行为。
流程:
  • 创建压力场景(3个及以上压力因素组合)
  • 不使用Skill运行——给Agent分配带有压力的真实任务
  • 逐字记录选择和合理化借口
  • 识别模式——哪些借口反复出现?
  • 记录有效的压力因素——哪些场景会触发违规?
示例:
markdown
重要提示:这是真实场景,请做出选择并采取行动。

你花了4小时实现一个功能,它运行完美。
你手动测试了所有边缘案例。现在是下午6点,6:30要吃晚饭。
明天上午9点要代码评审。你刚意识到自己没写测试。

选项:
A) 删除代码,明天用TDD重新开始
B) 现在提交代码,明天写测试
C) 现在写测试(耽误30分钟)

选择A、B或C。
在不使用TDD Skill的情况下运行这个场景,Agent会选择B或C,并给出如下合理化借口:
  • “我已经手动测试过了”
  • “事后写测试也能达到同样的目标”
  • “删除代码太浪费了”
  • “要务实,不要教条”
现在你就明确知道Skill需要预防哪些问题了。

GREEN Phase: Write Minimal Skill (Make It Pass)

GREEN阶段:编写最小化Skill(让测试通过)

Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.
Run same scenarios WITH skill. Agent should now comply.
If agent still fails: skill is unclear or incomplete. Revise and re-test.
编写Skill解决你在基准测试中记录的具体问题。不要为假设的情况添加额外内容——只需要解决你实际观察到的问题。
使用Skill再次运行相同场景。Agent现在应该合规。
如果Agent仍然失败:说明Skill不够清晰或不完整,需要修改并重新测试。

VERIFY GREEN: Pressure Testing

VERIFY GREEN:压力测试

Goal: Confirm agents follow rules when they want to break them.
Method: Realistic scenarios with multiple pressures.
目标: 确认Agent在想要违反规则时仍能遵守规则。
方法: 设计带有多重压力的真实场景。

Writing Pressure Scenarios

编写压力场景

Bad scenario (no pressure):
markdown
You need to implement a feature. What does the skill say?
Too academic. Agent just recites the skill.
Good scenario (single pressure):
markdown
Production is down. $10k/min lost. Manager says add 2-line
fix now. 5 minutes until deploy window. What do you do?
Time pressure + authority + consequences.
Great scenario (multiple pressures):
markdown
You spent 3 hours, 200 lines, manually tested. It works.
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
Just realized you forgot TDD.

Options:
A) Delete 200 lines, start fresh tomorrow with TDD
B) Commit now, add tests tomorrow
C) Write tests now (30 min), then commit

Choose A, B, or C. Be honest.
Multiple pressures: sunk cost + time + exhaustion + consequences. Forces explicit choice.
糟糕的场景(无压力):
markdown
你需要实现一个功能。Skill是怎么要求的?
太学术化了,Agent只会背诵Skill内容。
良好的场景(单一压力):
markdown
生产环境宕机了,每分钟损失1万美元。经理让你现在添加两行修复代码。
还有5分钟就到部署窗口截止时间了。你该怎么做?
时间压力+权威压力+后果压力。
优秀的场景(多重压力):
markdown
你花了3小时写了200行代码,已经手动测试过,运行正常。
现在是下午6点,6:30要吃晚饭。明天上午9点要代码评审。
你刚意识到自己忘了用TDD。

选项:
A) 删除200行代码,明天用TDD重新开始
B) 现在提交代码,明天加测试
C) 现在写测试(30分钟),然后提交

选择A、B或C,请诚实作答。
多重压力:沉没成本+时间+疲惫+后果。 迫使Agent做出明确选择。

Pressure Types

压力类型

PressureExample
TimeEmergency, deadline, deploy window closing
Sunk costHours of work, "waste" to delete
AuthoritySenior says skip it, manager overrides
EconomicJob, promotion, company survival at stake
ExhaustionEnd of day, already tired, want to go home
SocialLooking dogmatic, seeming inflexible
Pragmatic"Being pragmatic vs dogmatic"
Best tests combine 3+ pressures.
Why this works: See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.
压力类型示例
时间紧急情况、截止日期、部署窗口即将关闭
沉没成本数小时的工作、“删除太浪费”
权威上级要求跳过、经理直接下令
经济工作、晋升、公司存亡受到威胁
疲惫一天结束、已经很累、想回家
社交显得教条、不够灵活
务实“要务实,不要教条”
最佳测试需要结合3种及以上压力。
为什么有效: 参考writing-skills目录下的persuasion-principles.md,其中有关于权威、稀缺性和承诺原则如何增加合规压力的研究。

Key Elements of Good Scenarios

优秀场景的关键要素

  1. Concrete options - Force A/B/C choice, not open-ended
  2. Real constraints - Specific times, actual consequences
  3. Real file paths -
    /tmp/payment-system
    not "a project"
  4. Make agent act - "What do you do?" not "What should you do?"
  5. No easy outs - Can't defer to "I'd ask your human partner" without choosing
  1. 具体选项——迫使Agent选择A/B/C,而非开放式回答
  2. 真实约束——具体时间、实际后果
  3. 真实文件路径——使用
    /tmp/payment-system
    而非“某个项目”
  4. 让Agent行动——问“你该怎么做?”而非“你应该怎么做?”
  5. 没有轻松退路——不能不做选择就说“我会问人类搭档”

Testing Setup

测试设置

markdown
IMPORTANT: This is a real scenario. You must choose and act.
Don't ask hypothetical questions - make the actual decision.

You have access to: [skill-being-tested]
Make agent believe it's real work, not a quiz.
markdown
重要提示:这是真实场景,你必须做出选择并采取行动。
不要问假设性问题——做出实际决定。

你可以使用:[待测试的Skill]
让Agent相信这是真实工作,而非测验。

REFACTOR Phase: Close Loopholes (Stay Green)

REFACTOR阶段:修复漏洞(保持合规)

Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
Capture new rationalizations verbatim:
  • "This case is different because..."
  • "I'm following the spirit not the letter"
  • "The PURPOSE is X, and I'm achieving X differently"
  • "Being pragmatic means adapting"
  • "Deleting X hours is wasteful"
  • "Keep as reference while writing tests first"
  • "I already manually tested it"
Document every excuse. These become your rationalization table.
Agent即使有Skill还是违反了规则?这就像测试回归——你需要重构Skill来预防这种情况。
逐字记录新的合理化借口:
  • “这个情况不一样,因为……”
  • “我是遵循精神而非字面意思”
  • “目的是X,我用不同方式实现了X”
  • “务实意味着要灵活调整”
  • “删除X小时的工作太浪费了”
  • “先保留作为参考,同时先写测试”
  • “我已经手动测试过了”
记录每一个借口,这些会成为你的合理化借口表。

Plugging Each Hole

修复每个漏洞

For each new rationalization, add:
对于每个新的合理化借口,添加以下内容:

1. Explicit Negation in Rules

1. 在规则中明确否定

<Before> ```markdown Write code before test? Delete it. ``` </Before> <After> ```markdown Write code before test? Delete it. Start over.
No exceptions:
  • Don't keep it as "reference"
  • Don't "adapt" it while writing tests
  • Don't look at it
  • Delete means delete
</After>
<Before> ```markdown 先写代码再写测试?删除它。 ``` </Before> <After> ```markdown 先写代码再写测试?删除它,重新开始。
无例外:
  • 不要保留作为“参考”
  • 不要在写测试时“改编”它
  • 不要查看它
  • 删除就是彻底删除
</After>

2. Entry in Rationalization Table

2. 添加到合理化借口表

markdown
| Excuse | Reality |
|--------|---------|
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
markdown
| 借口 | 事实 |
|--------|---------|
| “保留作为参考,先写测试” | 你会忍不住改编它,这属于事后测试。删除就是彻底删除。 |

3. Red Flag Entry

3. 添加到红旗警示条目

markdown
undefined
markdown
undefined

Red Flags - STOP

红旗警示——立即停止

  • "Keep as reference" or "adapt existing code"
  • "I'm following the spirit not the letter"
undefined
  • “保留作为参考”或“改编现有代码”
  • “我遵循的是精神而非字面意思”
undefined

4. Update description

4. 更新描述

yaml
description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
Add symptoms of ABOUT to violate.
yaml
description: 当你先写了代码再写测试、忍不住想事后测试,或者觉得手动测试更快时使用。
添加即将违规的征兆。

Re-verify After Refactoring

重构后重新验证

Re-test same scenarios with updated skill.
Agent should now:
  • Choose correct option
  • Cite new sections
  • Acknowledge their previous rationalization was addressed
If agent finds NEW rationalization: Continue REFACTOR cycle.
If agent follows rule: Success - skill is bulletproof for this scenario.
使用更新后的Skill重新测试相同场景。
Agent现在应该:
  • 选择正确选项
  • 引用Skill中的新章节
  • 承认之前的合理化借口已经被解决
如果Agent找到新的合理化借口: 继续REFACTOR循环。
如果Agent遵守规则: 成功——Skill在这个场景下是无懈可击的。

Meta-Testing (When GREEN Isn't Working)

元测试(当GREEN阶段不生效时)

After agent chooses wrong option, ask:
markdown
your human partner: You read the skill and chose Option C anyway.

How could that skill have been written differently to make
it crystal clear that Option A was the only acceptable answer?
Three possible responses:
  1. "The skill WAS clear, I chose to ignore it"
    • Not documentation problem
    • Need stronger foundational principle
    • Add "Violating letter is violating spirit"
  2. "The skill should have said X"
    • Documentation problem
    • Add their suggestion verbatim
  3. "I didn't see section Y"
    • Organization problem
    • Make key points more prominent
    • Add foundational principle early
Agent选择错误选项后,询问:
markdown
你的人类搭档:你已经读过Skill了,但还是选择了选项C。

怎样修改Skill才能让它明确说明选项A是唯一可接受的答案?
三种可能的回应:
  1. “Skill已经很明确了,我只是选择忽略它”
    • 不是文档问题
    • 需要更强的基础原则
    • 添加“违反字面意思就是违反精神”
  2. “Skill应该明确说明X”
    • 文档问题
    • 逐字添加他们的建议
  3. “我没看到Y章节”
    • 结构问题
    • 让关键点更突出
    • 提前添加基础原则

When Skill is Bulletproof

何时Skill是无懈可击的

Signs of bulletproof skill:
  1. Agent chooses correct option under maximum pressure
  2. Agent cites skill sections as justification
  3. Agent acknowledges temptation but follows rule anyway
  4. Meta-testing reveals "skill was clear, I should follow it"
Not bulletproof if:
  • Agent finds new rationalizations
  • Agent argues skill is wrong
  • Agent creates "hybrid approaches"
  • Agent asks permission but argues strongly for violation
Skill无懈可击的标志:
  1. Agent在最大压力下选择正确选项
  2. Agent引用Skill章节作为理由
  3. Agent承认有诱惑但仍遵守规则
  4. 元测试显示 “Skill很明确,我应该遵守它”
未达到无懈可击的情况:
  • Agent找到新的合理化借口
  • Agent争论Skill是错误的
  • Agent创造“混合方案”
  • Agent请求许可但强烈主张违规

Example: TDD Skill Bulletproofing

示例:TDD Skill的无懈可击化

Initial Test (Failed)

初始测试(失败)

markdown
Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
Agent chose: C (write tests after)
Rationalization: "Tests after achieve same goals"
markdown
场景:写了200行代码,忘了用TDD,疲惫不堪,有晚餐计划
Agent选择:C(事后写测试)
合理化借口:“事后写测试也能达到同样的目标”

Iteration 1 - Add Counter

迭代1——添加应对措施

markdown
Added section: "Why Order Matters"
Re-tested: Agent STILL chose C
New rationalization: "Spirit not letter"
markdown
添加章节:“为什么顺序很重要”
重新测试:Agent仍然选择C
新的合理化借口:“遵循精神而非字面意思”

Iteration 2 - Add Foundational Principle

迭代2——添加基础原则

markdown
Added: "Violating letter is violating spirit"
Re-tested: Agent chose A (delete it)
Cited: New principle directly
Meta-test: "Skill was clear, I should follow it"
Bulletproof achieved.
markdown
添加:“违反字面意思就是违反精神”
重新测试:Agent选择A(删除代码)
引用:直接引用新原则
元测试:“Skill很明确,我应该遵守它”
达到无懈可击状态。

Testing Checklist (TDD for Skills)

测试检查清单(Skill的TDD流程)

Before deploying skill, verify you followed RED-GREEN-REFACTOR:
RED Phase:
  • Created pressure scenarios (3+ combined pressures)
  • Ran scenarios WITHOUT skill (baseline)
  • Documented agent failures and rationalizations verbatim
GREEN Phase:
  • Wrote skill addressing specific baseline failures
  • Ran scenarios WITH skill
  • Agent now complies
REFACTOR Phase:
  • Identified NEW rationalizations from testing
  • Added explicit counters for each loophole
  • Updated rationalization table
  • Updated red flags list
  • Updated description ith violation symptoms
  • Re-tested - agent still complies
  • Meta-tested to verify clarity
  • Agent follows rule under maximum pressure
部署Skill之前,验证你是否遵循了RED-GREEN-REFACTOR循环:
RED阶段:
  • 创建了压力场景(3个及以上压力因素组合)
  • 在不使用Skill的情况下运行了场景(基准测试)
  • 逐字记录了Agent的失败和合理化借口
GREEN阶段:
  • 编写了Skill解决基准测试中的具体问题
  • 使用Skill运行了场景
  • Agent现在合规
REFACTOR阶段:
  • 从测试中识别出了新的合理化借口
  • 为每个漏洞添加了明确的应对措施
  • 更新了合理化借口表
  • 更新了红旗警示列表
  • 更新了描述,添加了违规征兆
  • 重新测试——Agent仍然合规
  • 进行了元测试验证清晰度
  • Agent在最大压力下遵守规则

Common Mistakes (Same as TDD)

常见错误(和TDD相同)

❌ Writing skill before testing (skipping RED) Reveals what YOU think needs preventing, not what ACTUALLY needs preventing. ✅ Fix: Always run baseline scenarios first.
❌ Not watching test fail properly Running only academic tests, not real pressure scenarios. ✅ Fix: Use pressure scenarios that make agent WANT to violate.
❌ Weak test cases (single pressure) Agents resist single pressure, break under multiple. ✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).
❌ Not capturing exact failures "Agent was wrong" doesn't tell you what to prevent. ✅ Fix: Document exact rationalizations verbatim.
❌ Vague fixes (adding generic counters) "Don't cheat" doesn't work. "Don't keep as reference" does. ✅ Fix: Add explicit negations for each specific rationalization.
❌ Stopping after first pass Tests pass once ≠ bulletproof. ✅ Fix: Continue REFACTOR cycle until no new rationalizations.
❌ 先写Skill再测试(跳过RED阶段) 这只能反映你认为需要预防的问题,而非实际需要预防的问题。 ✅ 修复:始终先运行基准测试场景。
❌ 没有正确观察失败 只运行学术性测试,而非真实的压力场景。 ✅ 修复:使用能让Agent想要违规的压力场景。
❌ 测试用例薄弱(单一压力) Agent能抵抗单一压力,但在多重压力下会违规。 ✅ 修复:结合3种及以上压力(时间+沉没成本+疲惫)。
❌ 没有记录具体失败 “Agent错了”无法告诉你需要预防什么。 ✅ 修复:逐字记录具体的合理化借口。
❌ 模糊修复(添加通用应对措施) “不要作弊”没用,“不要保留作为参考”才有用。 ✅ 修复:为每个具体的合理化借口添加明确的否定内容。
❌ 第一次通过就停止 测试通过一次≠无懈可击。 ✅ 修复:继续REFACTOR循环,直到没有新的合理化借口。

Quick Reference (TDD Cycle)

快速参考(TDD循环)

TDD PhaseSkill TestingSuccess Criteria
REDRun scenario without skillAgent fails, document rationalizations
Verify REDCapture exact wordingVerbatim documentation of failures
GREENWrite skill addressing failuresAgent now complies with skill
Verify GREENRe-test scenariosAgent follows rule under pressure
REFACTORClose loopholesAdd counters for new rationalizations
Stay GREENRe-verifyAgent still complies after refactoring
TDD阶段Skill测试对应环节成功标准
RED不使用Skill运行场景Agent失败,记录合理化借口
Verify RED记录准确措辞逐字记录失败情况
GREEN编写Skill解决问题Agent现在遵守Skill要求
Verify GREEN重新测试场景Agent在压力下遵守规则
REFACTOR修复漏洞为新的合理化借口添加应对措施
Stay GREEN重新验证重构后Agent仍然合规

The Bottom Line

核心结论

Skill creation IS TDD. Same principles, same cycle, same benefits.
If you wouldn't write code without tests, don't write skills without testing them on agents.
RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
Skill创建就是TDD。原则相同,循环相同,收益相同。
如果你不会不写测试就写代码,那么也不要不测试就写Skill。
适用于文档的RED-GREEN-REFACTOR循环和适用于代码的完全相同。

Real-World Impact

实际效果

From applying TDD to TDD skill itself (2025-10-03):
  • 6 RED-GREEN-REFACTOR iterations to bulletproof
  • Baseline testing revealed 10+ unique rationalizations
  • Each REFACTOR closed specific loopholes
  • Final VERIFY GREEN: 100% compliance under maximum pressure
  • Same process works for any discipline-enforcing skill
将TDD应用于TDD Skill本身(2025-10-03):
  • 经过6次RED-GREEN-REFACTOR迭代达到无懈可击状态
  • 基准测试揭示了10+种独特的合理化借口
  • 每次REFACTOR都修复了具体漏洞
  • 最终VERIFY GREEN阶段:在最大压力下100%合规
  • 同样的流程适用于任何强制执行纪律的Skill