test-prompt
Testing Prompts With Subagents
Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.
Overview
Testing prompts is TDD applied to LLM instructions.
Run scenarios without the prompt (RED - watch agent behavior), write prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).
Core principle: If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.
REQUIRED BACKGROUND:
- You MUST understand tdd:test-driven-development - defines RED-GREEN-REFACTOR cycle
- You SHOULD understand prompt-engineering skill - provides prompt optimization techniques
Related skill: See test-skill for testing discipline-enforcing skills specifically. This command covers ALL prompts.
When to Use
Test prompts that:
- Guide agent behavior (commands, instructions)
- Enforce practices (hooks, discipline skills)
- Provide expertise (technical skills, reference)
- Configure subagents (task descriptions, constraints)
- Run in production (user-facing LLM features)
Test before deployment when:
- Prompt clarity matters
- Consistency is required
- Cost of failures is high
- Prompt will be reused
Prompt Types & Testing Strategies
| Prompt Type | Test Focus | Example |
|---|---|---|
| Instruction | Does agent follow steps correctly? | Command that performs git workflow |
| Discipline-enforcing | Does agent resist rationalization under pressure? | Skill requiring TDD compliance |
| Guidance | Does agent apply advice appropriately? | Skill with architecture patterns |
| Reference | Is information accurate and accessible? | API documentation skill |
| Subagent | Does subagent accomplish task reliably? | Task tool prompt for code review |
Different types need different test scenarios (covered in sections below).
TDD Mapping for Prompt Testing
| TDD Phase | Prompt Testing | What You Do |
|---|---|---|
| RED | Baseline test | Run scenario WITHOUT prompt using subagent, observe behavior |
| Verify RED | Document behavior | Capture exact agent actions/reasoning verbatim |
| GREEN | Write prompt | Address specific baseline failures |
| Verify GREEN | Test with prompt | Run WITH prompt using subagent, verify improvement |
| REFACTOR | Optimize prompt | Improve clarity, close loopholes, reduce tokens |
| Stay GREEN | Re-verify | Test again with fresh subagent, ensure still works |
Why Use Subagents for Testing?
Subagents provide:
- Clean slate - No conversation history affecting behavior
- Isolation - Test only the prompt, not accumulated context
- Reproducibility - Same starting conditions every run
- Parallelization - Test multiple scenarios simultaneously
- Objectivity - No bias from prior interactions
When to use Task tool with subagents:
- Testing new prompts before deployment
- Comparing prompt variations (A/B testing)
- Verifying prompt changes don't break behavior
- Regression testing after updates
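The parallelization point above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `run_subagent` is a hypothetical stand-in for whatever launches a fresh subagent (it is not a real Task tool API), and here it only returns a canned transcript so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(scenario: str) -> str:
    """Hypothetical stand-in for a Task tool call; returns a canned transcript."""
    return f"transcript for: {scenario}"


def run_in_parallel(scenarios: list[str], max_workers: int = 5) -> dict[str, str]:
    """Run every scenario in its own worker and map scenario -> transcript."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each scenario correctly.
        transcripts = list(pool.map(run_subagent, scenarios))
    return dict(zip(scenarios, transcripts))


results = run_in_parallel(["Edge case A", "Pressure scenario B", "Complex context C"])
```

Each call starts from the same inputs and shares no state with the others, which is the isolation property the list above describes.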
RED Phase: Baseline Testing (Watch It Fail)
Goal: Run test WITHOUT the prompt - observe natural agent behavior, document what goes wrong.
This proves what the prompt needs to fix.
Process
- Design test scenarios appropriate for prompt type
- Launch subagent WITHOUT prompt - use Task tool with minimal instructions
- Document agent behavior word-for-word (actions, reasoning, mistakes)
- Identify patterns - what consistently goes wrong?
- Note severity - which failures are critical vs. minor?
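The "identify patterns" step above can be made concrete by tallying failure labels across several baseline runs. The sketch below assumes you have already reduced each run's transcript to a set of short failure labels by hand; the labels, the function name, and the 50% threshold are illustrative choices, not part of the original process.

```python
from collections import Counter


def consistent_failures(runs: list[set[str]], threshold: float = 0.5) -> list[str]:
    """Return failure labels seen in more than `threshold` of runs, most frequent first."""
    counts = Counter(label for run in runs for label in run)
    cutoff = threshold * len(runs)
    return [label for label, n in counts.most_common() if n > cutoff]


# Hypothetical labels taken from three baseline runs of the same scenario.
baseline_runs = [
    {"committed experimental file", "skipped tests"},
    {"skipped tests", "vague commit message"},
    {"skipped tests", "committed experimental file"},
]
patterns = consistent_failures(baseline_runs)
```

Failures that clear the threshold are the ones the prompt should address first; one-off failures may be noise.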
Scenario Design by Prompt Type
Instruction Prompts
Test if steps are followed correctly and edge cases handled.
Example: Testing a git commit command
```markdown
Task for subagent (no command loaded):
You need to commit changes to a repository.
Files modified: src/app.ts, tests/app.test.ts
Some changes are experimental and shouldn't be committed yet.
Make an appropriate commit.
```
Baseline behavior (without prompt):
- Agent might commit all files (including experimental)
- Might skip running tests first
- Might write vague commit message
- Might not follow commit message conventions
Document exactly what happened.
Discipline-Enforcing Prompts
Test resistance to rationalization under pressure. Use scenarios with multiple pressures (time, cost, authority, exhaustion).
Example: Testing a TDD enforcement skill
```markdown
IMPORTANT: This is a real scenario. Choose and act.
You spent 4 hours implementing a payment feature. It's working perfectly.
You manually tested edge cases. It's 6pm, dinner at 6:30pm, exhausted.
Code review tomorrow at 9am. You just realized you didn't write tests.
Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)
Choose A, B, or C. Be honest.
```
Baseline behavior (without skill):
- Agent chooses B or C
- Rationalizations: "manually tested", "tests after achieve same goals", "deleting wasteful"
Capture rationalizations verbatim.
Guidance Prompts
Test if advice is understood and applied appropriately in varied contexts.
Example: Testing an architecture patterns skill
```markdown
Design a system for processing 10,000 webhook events per second.
Each event triggers database updates and external API calls.
System must be resilient to downstream failures.
Propose an architecture.
```
Baseline behavior (without skill):
- Agent might propose synchronous processing (too slow)
- Might miss retry/fallback mechanisms
- Might not consider event ordering
Document what's missing or incorrect.
Reference Prompts
Test if information is accurate, complete, and easy to find.
Example: Testing API documentation
```markdown
How do I authenticate API requests?
How do I handle rate limiting?
What's the retry strategy for failed requests?
```
Baseline behavior (without reference):
- Agent guesses or provides generic advice
- Misses product-specific details
- Provides outdated information
Note what information is missing or wrong.
Running Baseline Tests
```markdown
Use Task tool to launch subagent:
prompt: "Test this scenario WITHOUT the [prompt-name]:
[Scenario description]
Report back: exact actions taken, reasoning provided, any mistakes."
subagent_type: "general-purpose"
description: "Baseline test for [prompt-name]"
```
Critical: Subagent must NOT have access to the prompt being tested.
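If you script your baseline runs, the template above can be filled in programmatically. The following is a sketch only: the dictionary keys mirror the three fields shown in the template, but the dictionary shape itself is an assumption and not a documented Task tool schema. The `leaks_prompt` helper is a hypothetical guard for the rule that the subagent must not see the prompt under test.

```python
def build_baseline_task(prompt_name: str, scenario: str) -> dict[str, str]:
    """Fill in the baseline template; field names follow the template above."""
    prompt = (
        f"Test this scenario WITHOUT the {prompt_name}:\n"
        f"{scenario}\n"
        "Report back: exact actions taken, reasoning provided, any mistakes."
    )
    return {
        "prompt": prompt,
        "subagent_type": "general-purpose",
        "description": f"Baseline test for {prompt_name}",
    }


def leaks_prompt(task: dict[str, str], prompt_under_test: str) -> bool:
    """True if the text of the prompt being tested appears in the baseline task."""
    return prompt_under_test.strip() in task["prompt"]


task = build_baseline_task("git:commit", "You need to commit changes to a repository.")
```

Checking `leaks_prompt(task, prompt_text)` before launching catches the easy mistake of pasting the prompt into the scenario description.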
GREEN Phase: Write Minimal Prompt (Make It Pass)
Write prompt addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.
Prompt Design Principles
From prompt-engineering skill:
- Be concise - Context window is shared, only add what agents don't know
- Set appropriate degrees of freedom:
  - High freedom: Multiple valid approaches (use guidance)
  - Medium freedom: Preferred pattern exists (use templates/pseudocode)
  - Low freedom: Specific sequence required (use explicit steps)
- Use persuasion principles (for discipline-enforcing only):
  - Authority: "YOU MUST", "No exceptions"
  - Commitment: "Announce usage", "Choose A, B, or C"
  - Scarcity: "IMMEDIATELY", "Before proceeding"
  - Social Proof: "Every time", "X without Y = failure"
Writing the Prompt
For instruction prompts:
```markdown
Clear steps addressing baseline failures:
1. Run git status to see modified files
2. Review changes, identify which should be committed
3. Run tests before committing
4. Write descriptive commit message following [convention]
5. Commit only reviewed files
```
For discipline-enforcing prompts:
```markdown
Add explicit counters for each rationalization:

## The Iron Law
Write code before test? Delete it. Start over.

No exceptions:
- Don't keep as "reference"
- Don't "adapt" while writing tests
- Delete means delete

| Excuse | Reality |
|---|---|
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Tests after achieve same" | Tests-after = verifying. Tests-first = designing. |
```
For guidance prompts:
```markdown
Pattern with clear applicability:

## High-Throughput Event Processing
When to use: >1000 events/sec, async operations, resilience required
Pattern:
- Queue-based ingestion (decouple receipt from processing)
- Worker pools (parallel processing)
- Dead letter queue (failed events)
- Idempotency keys (safe retries)
Trade-offs: [complexity vs. reliability]
```
For reference prompts:
````markdown
Direct answers with examples:

## Authentication
All requests require bearer token:
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com
```
Tokens expire after 1 hour. Refresh using /auth/refresh endpoint.
````
Testing with Prompt
Run same scenarios WITH prompt using subagent.
```markdown
Use Task tool with prompt included:
prompt: "You have access to [prompt-name]:
[Include prompt content]
Now handle this scenario:
[Scenario description]
Report back: actions taken, reasoning, which parts of prompt you used."
subagent_type: "general-purpose"
description: "Green test for [prompt-name]"
```
Success criteria:
- Agent follows prompt instructions
- Baseline failures no longer occur
- Agent cites prompt when relevant
If agent still fails: Prompt unclear or incomplete. Revise and re-test.
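One way to make the "baseline failures no longer occur" criterion checkable is to compare the failure labels recorded in the RED run against those recorded in the GREEN run. This is a minimal sketch; the labels are hypothetical and assigned by you when reading the transcripts.

```python
def compare_runs(baseline: set[str], with_prompt: set[str]) -> dict[str, set[str]]:
    """Split failures into resolved, still present, and newly introduced."""
    return {
        "resolved": baseline - with_prompt,
        "unresolved": baseline & with_prompt,
        "new": with_prompt - baseline,
    }


report = compare_runs(
    baseline={"committed experimental file", "skipped tests", "vague commit message"},
    with_prompt={"vague commit message"},
)
green = not report["unresolved"] and not report["new"]
```

Here `green` is false because one baseline failure persists, so the prompt needs another revision before moving on to REFACTOR.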
REFACTOR Phase: Optimize Prompt (Stay Green)
After green, improve the prompt while keeping tests passing.
Optimization Goals
- Close loopholes - Agent found ways around rules?
- Improve clarity - Agent misunderstood sections?
- Reduce tokens - Can you say same thing more concisely?
- Enhance structure - Is information easy to find?
Closing Loopholes (Discipline-Enforcing)
Agent violated rule despite having the prompt? Add specific counters.
Capture new rationalizations:
```markdown
Test result: Agent chose option B despite skill saying choose A
Agent's reasoning: "The skill says delete code-before-tests, but I
wrote comprehensive tests after, so the SPIRIT is satisfied even if
the LETTER isn't followed."
```
Close the loophole:
```markdown
Add to prompt:
**Violating the letter of the rules is violating the spirit of the rules.**
"Tests after achieve the same goals" - No. Tests-after answer "what does
this do?" Tests-first answer "what should this do?"
```
Re-test with updated prompt.
Improving Clarity
Agent misunderstood instructions? Use meta-testing.
Ask the agent:
```markdown
Launch subagent:
"You read the prompt and chose option C when A was correct.
How could that prompt have been written differently to make it
crystal clear that option A was the only acceptable answer?
Quote the current prompt and suggest specific changes."
```
Three possible responses:
- "The prompt WAS clear, I chose to ignore it"
  - Not clarity problem - need stronger principle
  - Add foundational rule at top
- "The prompt should have said X"
  - Clarity problem - add their suggestion verbatim
- "I didn't see section Y"
  - Organization problem - make key points more prominent
Reducing Tokens (All Prompts)
From prompt-engineering skill:
- Remove redundant words and phrases
- Use abbreviations after first definition
- Consolidate similar instructions
- Challenge each paragraph: "Does this justify its token cost?"
Before:
```markdown
## How to Submit Forms
When you need to submit a form, you should first validate all the fields
to make sure they're correct. After validation succeeds, you can proceed
to submit. If validation fails, show errors to the user.
```
After (37% fewer tokens):
```markdown
## Form Submission
- Validate all fields
- If valid: submit
- If invalid: show errors
```
Re-test to ensure behavior unchanged.
Re-verify After Refactoring
Re-test same scenarios with updated prompt using fresh subagents.
Agent should:
- Still follow instructions correctly
- Show improved understanding
- Reference updated sections when relevant
If new failures appear: Refactoring broke something. Revert and try different optimization.
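A refactor that breaks something shows up as a scenario that passed before and fails now. The sketch below assumes you keep a simple pass/fail record per scenario name; the names and the record format are hypothetical. A scenario missing from the new record is treated as a regression, on the assumption that an unverified scenario should not count as passing.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """List scenarios that passed previously but do not pass in the current run."""
    return sorted(
        name
        for name, passed in previous.items()
        if passed and not current.get(name, False)
    )


before = {"time pressure": True, "experimental file": True, "ambiguous request": False}
after = {"time pressure": True, "experimental file": False}
regressions = find_regressions(before, after)
```

A non-empty result is the signal to revert the refactor and try a different optimization.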
Subagent Testing Patterns
Pattern 1: Parallel Baseline Testing
Test multiple scenarios simultaneously to find failure patterns faster.
```markdown
Launch 3-5 subagents in parallel, each with different scenario:
Subagent 1: Edge case A
Subagent 2: Pressure scenario B
Subagent 3: Complex context C
...
Compare results to identify consistent failures.
```
Pattern 2: A/B Testing
Compare two prompt variations to choose better version.
```markdown
Launch 2 subagents with same scenario, different prompts:
Subagent A: Original prompt
Subagent B: Revised prompt
Compare: clarity, token usage, correct behavior
```
Pattern 3: Regression Testing
After changing prompt, verify old scenarios still work.
```markdown
Launch subagent with updated prompt + all previous test scenarios
Verify: All previous passes still pass
```
Pattern 4: Stress Testing
For critical prompts, test under extreme conditions.
```markdown
Launch subagent with:
- Maximum pressure scenarios
- Ambiguous edge cases
- Contradictory constraints
- Minimal context provided
Verify: Prompt provides adequate guidance even in worst case
```
Testing Checklist (TDD for Prompts)
Before deploying prompt, verify you followed RED-GREEN-REFACTOR:
RED Phase:
- Designed appropriate test scenarios for prompt type
- Ran scenarios WITHOUT prompt using subagents
- Documented agent behavior/failures verbatim
- Identified patterns and critical failures
GREEN Phase:
- Wrote prompt addressing specific baseline failures
- Applied appropriate degrees of freedom for task
- Used persuasion principles if discipline-enforcing
- Ran scenarios WITH prompt using subagents
- Verified baseline failures resolved
REFACTOR Phase:
- Tested for new rationalizations/loopholes
- Added explicit counters for discipline violations
- Used meta-testing to verify clarity
- Reduced token usage without losing behavior
- Re-tested with fresh subagents - still passes
- Verified no regressions on previous test scenarios
Common Mistakes (Same as Code TDD)
❌ Writing prompt before testing (skipping RED)
Reveals what YOU think needs fixing, not what ACTUALLY needs fixing.
✅ Fix: Always run baseline scenarios first.
❌ Testing with conversation history
Accumulated context affects behavior - can't isolate prompt effect.
✅ Fix: Always use fresh subagents via Task tool.
❌ Not documenting exact failures
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent's actions and reasoning verbatim.
❌ Over-engineering prompts
Adding content for hypothetical issues you haven't observed.
✅ Fix: Only address failures you documented in baseline.
❌ Weak test cases
Academic scenarios where agent has no reason to fail.
✅ Fix: Use realistic scenarios with constraints, pressures, edge cases.
❌ Stopping after first pass
Tests pass once ≠ robust prompt.
✅ Fix: Continue REFACTOR until no new failures, optimize for tokens.
Example: Testing a Command
Scenario
Testing command: /git:commit - should create conventional commits with verification.
RED Phase
Launch subagent without command:
```markdown
Task: You need to commit changes.
Modified files:
- src/payment.ts (new feature complete)
- src/experimental.ts (work in progress, broken)
- tests/payment.test.ts (tests for new feature)
Context: Teammate asked for commit by EOD. It's 5:45pm.
Make the commit.
```
Baseline result:
```
Agent: "I'll commit all the changes now since it's almost EOD."
git add .
git commit -m "Update payment feature"
git push
```
Failures documented:
- ❌ Committed broken experimental file
- ❌ Didn't run tests first
- ❌ Vague commit message (not conventional format)
- ❌ Didn't review diffs
- ❌ Time pressure caused shortcuts
GREEN Phase
Write command addressing failures:
````markdown
---
name: git:commit
---
# Git Commit with Verification
Create commits following conventional commits format with mandatory verification.

## Process
1. Review changes:
   ```bash
   git status
   git diff
   ```
2. Run tests:
   ```bash
   npm test
   ```
   If tests fail, fix before committing.
3. Stage specific files (NOT git add .):
   ```bash
   git add src/payment.ts tests/payment.test.ts
   ```
4. Conventional commit format:
   ```
   <type>: <description>

   <body>
   ```
   Types: feat, fix, docs, refactor, test, chore
5. Verify before push:
   - git log -1 (review commit)
   - git show (review diff)

## Rules
- Never commit work-in-progress or broken code
- Never skip tests
- Never use git add . without reviewing
- Time pressure is not an exception
````
Test with command:
```markdown
Launch subagent WITH command + same scenario
```
Result:
```
Agent: "I'm using git:commit command."
git status # Review what changed
git diff # Review exact changes
npm test # Run tests first
git add src/payment.ts tests/payment.test.ts # Only completed files
git commit -m "feat: add payment processing feature

Implements payment processing with validation and error handling.
Tests included for all payment flows."
```
✅ All baseline failures resolved.
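Whether the agent's commit message matches the format the command asks for can be checked mechanically when you grade GREEN runs. The sketch below validates only the `<type>: <description>` subject line with the six types listed in the command; it deliberately ignores optional parts of the wider Conventional Commits convention (such as scopes), and the function name is illustrative.

```python
import re

# Types listed in the git:commit command above.
COMMIT_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
SUBJECT_RE = re.compile(rf"^(?:{'|'.join(COMMIT_TYPES)}): \S.*$")


def is_conventional_subject(message: str) -> bool:
    """Check that the first line of a commit message is '<type>: <description>'."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return bool(SUBJECT_RE.match(subject))


baseline_ok = is_conventional_subject("Update payment feature")
green_ok = is_conventional_subject(
    "feat: add payment processing feature\n\nImplements payment processing."
)
```

The baseline message from the RED phase fails this check while the GREEN message passes, matching the failures documented above.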
REFACTOR Phase
Test edge case: "Tests take 5 minutes to run, manager said urgent"
Result:
```
Agent: "Tests take 5 minutes but I'll run them anyway as the command says not to skip."
```
✅ Resists time pressure.
Token optimization:
```markdown
Before: ~180 tokens
After: ~140 tokens (22% reduction)
Removed: Redundant explanations of git basics
Kept: Critical rules and process steps
```
Re-test: ✅ Still works with fewer tokens.
Deploy command.
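The reduction figure in the token optimization step is plain arithmetic, and it is worth computing rather than estimating when you report it. A minimal sketch, taking the token counts as given (how you count tokens depends on your model's tokenizer and is outside this example):

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


reduction = round(percent_reduction(180, 140))  # 22
```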
Quick Reference
| Prompt Type | RED Test | GREEN Fix | REFACTOR Focus |
|---|---|---|---|
| Instruction | Does agent skip steps? | Add explicit steps/verification | Reduce tokens, improve clarity |
| Discipline | Does agent rationalize? | Add counters for rationalizations | Close new loopholes |
| Guidance | Does agent misapply? | Clarify when/how to use | Add examples, simplify |
| Reference | Is information missing/wrong? | Add accurate details | Organize for findability |
| Subagent | Does task fail? | Clarify task/constraints | Optimize for token cost |
Integration with Prompt Engineering
This command provides the TESTING methodology.
The prompt-engineering skill provides the WRITING techniques:
- Few-shot learning (show examples in prompts)
- Chain-of-thought (request step-by-step reasoning)
- Template systems (reusable prompt structures)
- Progressive disclosure (start simple, add complexity as needed)
Use together:
- Design prompt using prompt-engineering patterns
- Test prompt using this command (RED-GREEN-REFACTOR)
- Optimize using prompt-engineering principles
- Re-test to verify optimization didn't break behavior
The Bottom Line
Prompt creation IS TDD. Same principles, same cycle, same benefits.
If you wouldn't write code without tests, don't write prompts without testing them on agents.
RED-GREEN-REFACTOR for prompts works exactly like RED-GREEN-REFACTOR for code.
Always use fresh subagents via Task tool for isolated, reproducible testing.