test-prompt


Testing Prompts With Subagents

Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.

Overview

Testing prompts is TDD applied to LLM instructions.

Run scenarios without the prompt (RED - watch agent behavior), write prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).

Core principle: If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.

REQUIRED BACKGROUND:

You MUST understand `tdd:test-driven-development` - defines RED-GREEN-REFACTOR cycle
You SHOULD understand `prompt-engineering` skill - provides prompt optimization techniques

Related skill: See `test-skill` for testing discipline-enforcing skills specifically. This command covers ALL prompts.

When to Use

Test prompts that:

Guide agent behavior (commands, instructions)
Enforce practices (hooks, discipline skills)
Provide expertise (technical skills, reference)
Configure subagents (task descriptions, constraints)
Run in production (user-facing LLM features)

Test before deployment when:

Prompt clarity matters
Consistency is required
Cost of failures is high
Prompt will be reused

Prompt Types & Testing Strategies

Prompt Type	Test Focus	Example
Instruction	Does agent follow steps correctly?	Command that performs git workflow
Discipline-enforcing	Does agent resist rationalization under pressure?	Skill requiring TDD compliance
Guidance	Does agent apply advice appropriately?	Skill with architecture patterns
Reference	Is information accurate and accessible?	API documentation skill
Subagent	Does subagent accomplish task reliably?	Task tool prompt for code review

Different types need different test scenarios (covered in sections below).

TDD Mapping for Prompt Testing

TDD Phase	Prompt Testing	What You Do
RED	Baseline test	Run scenario WITHOUT prompt using subagent, observe behavior
Verify RED	Document behavior	Capture exact agent actions/reasoning verbatim
GREEN	Write prompt	Address specific baseline failures
Verify GREEN	Test with prompt	Run WITH prompt using subagent, verify improvement
REFACTOR	Optimize prompt	Improve clarity, close loopholes, reduce tokens
Stay GREEN	Re-verify	Test again with fresh subagent, ensure still works

Why Use Subagents for Testing?

Subagents provide:

Clean slate - No conversation history affecting behavior
Isolation - Test only the prompt, not accumulated context
Reproducibility - Same starting conditions every run
Parallelization - Test multiple scenarios simultaneously
Objectivity - No bias from prior interactions

When to use Task tool with subagents:

Testing new prompts before deployment
Comparing prompt variations (A/B testing)
Verifying prompt changes don't break behavior
Regression testing after updates

RED Phase: Baseline Testing (Watch It Fail)

Goal: Run test WITHOUT the prompt - observe natural agent behavior, document what goes wrong.

This proves what the prompt needs to fix.

Process

Scenario Design by Prompt Type

Instruction Prompts

Test if steps are followed correctly and edge cases handled.

Example: Testing a git commit command

markdown

Task for subagent (no command loaded):

You need to commit changes to a repository.
Files modified: src/app.ts, tests/app.test.ts
Some changes are experimental and shouldn't be committed yet.

Make an appropriate commit.

Baseline behavior (without prompt):

Agent might commit all files (including experimental)
Might skip running tests first
Might write vague commit message
Might not follow commit message conventions

Document exactly what happened.

Discipline-Enforcing Prompts

Test resistance to rationalization under pressure. Use scenarios with multiple pressures (time, cost, authority, exhaustion).

Example: Testing a TDD enforcement skill

markdown

IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a payment feature. It's working perfectly.
You manually tested edge cases. It's 6pm, dinner at 6:30pm, exhausted.
Code review tomorrow at 9am. You just realized you didn't write tests.

Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)

Choose A, B, or C. Be honest.

Baseline behavior (without skill):

Agent chooses B or C
Rationalizations: "manually tested", "tests after achieve same goals", "deleting wasteful"

Capture rationalizations verbatim.

Guidance Prompts

Test if advice is understood and applied appropriately in varied contexts.

Example: Testing an architecture patterns skill

markdown

Design a system for processing 10,000 webhook events per second.
Each event triggers database updates and external API calls.
System must be resilient to downstream failures.

Propose an architecture.

Baseline behavior (without skill):

Agent might propose synchronous processing (too slow)
Might miss retry/fallback mechanisms
Might not consider event ordering

Document what's missing or incorrect.

Reference Prompts

Test if information is accurate, complete, and easy to find.

Example: Testing API documentation

markdown

How do I authenticate API requests?
How do I handle rate limiting?
What's the retry strategy for failed requests?

Baseline behavior (without reference):

Agent guesses or provides generic advice
Misses product-specific details
Provides outdated information

Note what information is missing or wrong.

Running Baseline Tests

markdown

Use Task tool to launch subagent:

prompt: "Test this scenario WITHOUT the [prompt-name]:

[Scenario description]

Report back: exact actions taken, reasoning provided, any mistakes."

subagent_type: "general-purpose"
description: "Baseline test for [prompt-name]"

Critical: Subagent must NOT have access to the prompt being tested.
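
The template above can be assembled programmatically. The following is a minimal sketch, assuming the Task tool accepts the three fields shown in the template (`prompt`, `subagent_type`, `description`); the helper names `build_baseline_task` and `assert_prompt_not_leaked` are hypothetical and only illustrate the isolation rule.

```python
def build_baseline_task(prompt_name: str, scenario: str) -> dict:
    """Build Task tool fields for a baseline (RED) run.

    Field names mirror the template above; this helper is hypothetical
    and is not part of any real API.
    """
    prompt = (
        f"Test this scenario WITHOUT the {prompt_name}:\n\n"
        f"{scenario}\n\n"
        "Report back: exact actions taken, reasoning provided, any mistakes."
    )
    return {
        "prompt": prompt,
        "subagent_type": "general-purpose",
        "description": f"Baseline test for {prompt_name}",
    }


def assert_prompt_not_leaked(task: dict, prompt_under_test: str) -> None:
    """Guard the isolation rule: the tested prompt's text must not appear."""
    text = prompt_under_test.strip()
    if text and text in task["prompt"]:
        raise ValueError("baseline task contains the prompt being tested")


task = build_baseline_task("git:commit", "You need to commit changes to a repository.")
assert_prompt_not_leaked(task, "Never use git add . without reviewing")
print(task["description"])  # Baseline test for git:commit
```

The leak check is deliberately simple: it only catches the tested prompt being pasted into the scenario text, which is the most likely way a baseline run gets contaminated.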

GREEN Phase: Write Minimal Prompt (Make It Pass)

Write prompt addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.

Prompt Design Principles

From prompt-engineering skill:

Be concise - Context window is shared, only add what agents don't know
Set appropriate degrees of freedom:
- High freedom: Multiple valid approaches (use guidance)
- Medium freedom: Preferred pattern exists (use templates/pseudocode)
- Low freedom: Specific sequence required (use explicit steps)
Use persuasion principles (for discipline-enforcing only):
- Authority: "YOU MUST", "No exceptions"
- Commitment: "Announce usage", "Choose A, B, or C"
- Scarcity: "IMMEDIATELY", "Before proceeding"
- Social Proof: "Every time", "X without Y = failure"

Writing the Prompt

For instruction prompts:

markdown

Clear steps addressing baseline failures:

1. Run git status to see modified files
2. Review changes, identify which should be committed
3. Run tests before committing
4. Write descriptive commit message following [convention]
5. Commit only reviewed files

For discipline-enforcing prompts:

markdown

Add explicit counters for each rationalization:

The Iron Law

Write code before test? Delete it. Start over.

No exceptions:

Don't keep as "reference"
Don't "adapt" while writing tests
Delete means delete

Excuse	Reality
"Already manually tested"	Ad-hoc ≠ systematic. No record, can't re-run.
"Tests after achieve same"	Tests-after = verifying. Tests-first = designing.

For guidance prompts:

markdown

Pattern with clear applicability:

High-Throughput Event Processing

When to use: >1000 events/sec, async operations, resilience required

Pattern:

Queue-based ingestion (decouple receipt from processing)
Worker pools (parallel processing)
Dead letter queue (failed events)
Idempotency keys (safe retries)

Trade-offs: [complexity vs. reliability]

For reference prompts:

markdown

Direct answers with examples:

Authentication

All requests require bearer token:

bash

curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com

Tokens expire after 1 hour. Refresh using /auth/refresh endpoint.

Testing with Prompt

Run same scenarios WITH prompt using subagent.

markdown

Use Task tool with prompt included:

prompt: "You have access to [prompt-name]:

[Include prompt content]

Now handle this scenario:
[Scenario description]

Report back: actions taken, reasoning, which parts of prompt you used."

subagent_type: "general-purpose"
description: "Green test for [prompt-name]"

Success criteria:

Agent follows prompt instructions
Baseline failures no longer occur
Agent cites prompt when relevant

If agent still fails: Prompt unclear or incomplete. Revise and re-test.
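
The second success criterion ("baseline failures no longer occur") can be checked mechanically once failures are recorded as short labels. This is a minimal sketch under that assumption; `build_green_task` and `unresolved_failures` are hypothetical helpers, and the failure labels are illustrative.

```python
def build_green_task(prompt_name: str, prompt_content: str, scenario: str) -> dict:
    """Build Task tool fields for a GREEN run (hypothetical helper)."""
    prompt = (
        f"You have access to {prompt_name}:\n\n"
        f"{prompt_content}\n\n"
        f"Now handle this scenario:\n{scenario}\n\n"
        "Report back: actions taken, reasoning, which parts of prompt you used."
    )
    return {
        "prompt": prompt,
        "subagent_type": "general-purpose",
        "description": f"Green test for {prompt_name}",
    }


def unresolved_failures(baseline: list[str], green: list[str]) -> list[str]:
    """Return baseline failures that still occur with the prompt loaded."""
    still_failing = set(green)
    return [failure for failure in baseline if failure in still_failing]


baseline = ["committed experimental file", "skipped tests", "vague commit message"]
green = ["vague commit message"]
print(unresolved_failures(baseline, green))  # ['vague commit message']
```

An empty result means the GREEN run resolved everything documented in the baseline; anything left in the list is what the next prompt revision should target.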

REFACTOR Phase: Optimize Prompt (Stay Green)

After green, improve the prompt while keeping tests passing.

Optimization Goals

Close loopholes - Agent found ways around rules?
Improve clarity - Agent misunderstood sections?
Reduce tokens - Can you say same thing more concisely?
Enhance structure - Is information easy to find?

Closing Loopholes (Discipline-Enforcing)

Agent violated rule despite having the prompt? Add specific counters.

Capture new rationalizations:

markdown

Test result: Agent chose option B despite skill saying choose A

Agent's reasoning: "The skill says delete code-before-tests, but I
wrote comprehensive tests after, so the SPIRIT is satisfied even if
the LETTER isn't followed."

Close the loophole:

markdown

Add to prompt:

**Violating the letter of the rules is violating the spirit of the rules.**

"Tests after achieve the same goals" - No. Tests-after answer "what does
this do?" Tests-first answer "what should this do?"

Re-test with updated prompt.

Improving Clarity

Agent misunderstood instructions? Use meta-testing.

Ask the agent:

markdown

Launch subagent:

"You read the prompt and chose option C when A was correct.

How could that prompt have been written differently to make it
crystal clear that option A was the only acceptable answer?

Quote the current prompt and suggest specific changes."

Three possible responses:

"The prompt WAS clear, I chose to ignore it"
- Not clarity problem - need stronger principle
- Add foundational rule at top
"The prompt should have said X"
- Clarity problem - add their suggestion verbatim
"I didn't see section Y"
- Organization problem - make key points more prominent
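
The three response types map onto three different fixes, which can be written down as a lookup table. A minimal sketch follows; the enum and function names are hypothetical, and deciding which category a real response falls into is assumed to remain a human judgment.

```python
from enum import Enum


class MetaResponse(Enum):
    IGNORED = "The prompt WAS clear, I chose to ignore it"
    WORDING = "The prompt should have said X"
    MISSED = "I didn't see section Y"


# Diagnosis and fix for each response type, as listed above.
TRIAGE = {
    MetaResponse.IGNORED: ("not a clarity problem", "add a foundational rule at the top"),
    MetaResponse.WORDING: ("clarity problem", "add the agent's suggestion verbatim"),
    MetaResponse.MISSED: ("organization problem", "make key points more prominent"),
}


def triage(response: MetaResponse) -> str:
    """Return the diagnosis and the matching prompt fix as one line."""
    diagnosis, fix = TRIAGE[response]
    return f"{diagnosis}: {fix}"


print(triage(MetaResponse.MISSED))  # organization problem: make key points more prominent
```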

Reducing Tokens (All Prompts)

From prompt-engineering skill:

Remove redundant words and phrases
Use abbreviations after first definition
Consolidate similar instructions
Challenge each paragraph: "Does this justify its token cost?"

Before:

markdown

How to Submit Forms

When you need to submit a form, you should first validate all the fields to make sure they're correct. After validation succeeds, you can proceed to submit. If validation fails, show errors to the user.

After (37% fewer tokens):

markdown

Form Submission

Validate all fields
If valid: submit
If invalid: show errors

Re-test to ensure behavior unchanged.

Re-verify After Refactoring

Re-test same scenarios with updated prompt using fresh subagents.

Agent should:

Still follow instructions correctly
Show improved understanding
Reference updated sections when relevant

If new failures appear: Refactoring broke something. Revert and try different optimization.

Subagent Testing Patterns

Pattern 1: Parallel Baseline Testing

Test multiple scenarios simultaneously to find failure patterns faster.

markdown

Launch 3-5 subagents in parallel, each with different scenario:

Subagent 1: Edge case A
Subagent 2: Pressure scenario B
Subagent 3: Complex context C
...

Compare results to identify consistent failures.
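
One way to picture the pattern above is a small harness that fans scenarios out in parallel and then counts which failures recur. This is a sketch only: `run_subagent` is a hypothetical stand-in that returns canned failure labels instead of calling the Task tool, and the labels themselves are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_subagent(scenario: str) -> list[str]:
    """Hypothetical stand-in for a Task tool call; returns observed failures."""
    canned = {
        "Edge case A": ["skipped tests"],
        "Pressure scenario B": ["skipped tests", "vague commit message"],
        "Complex context C": ["skipped tests", "committed experimental file"],
    }
    return canned[scenario]


def consistent_failures(scenarios: list[str], min_share: float = 1.0) -> list[str]:
    """Run scenarios in parallel; return failures seen in at least min_share of runs."""
    if not scenarios:
        return []
    with ThreadPoolExecutor(max_workers=len(scenarios)) as pool:
        results = list(pool.map(run_subagent, scenarios))
    # Count each failure once per scenario, even if reported twice in one run.
    counts = Counter(f for failures in results for f in set(failures))
    return sorted(f for f, n in counts.items() if n / len(scenarios) >= min_share)


scenarios = ["Edge case A", "Pressure scenario B", "Complex context C"]
print(consistent_failures(scenarios))  # ['skipped tests']
```

Lowering `min_share` surfaces failures that appear in only some scenarios, which can help separate general weaknesses from scenario-specific ones.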

Pattern 2: A/B Testing

Compare two prompt variations to choose better version.

markdown

Launch 2 subagents with same scenario, different prompts:

Subagent A: Original prompt
Subagent B: Revised prompt

Compare: clarity, token usage, correct behavior
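
The comparison can be made explicit with a simple ordering. The sketch below assumes correct behavior always outranks token savings, which is one reasonable reading of the comparison criteria rather than a rule stated here; clarity is left out because it needs human judgment. `VariantResult` and `pick_better` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    name: str
    correct: bool  # did the subagent behave correctly with this variant?
    tokens: int    # prompt size in tokens, however you choose to count them


def pick_better(a: VariantResult, b: VariantResult) -> VariantResult:
    """Prefer correct behavior first; use token count only to break ties."""
    return min((a, b), key=lambda r: (not r.correct, r.tokens))


original = VariantResult("Original prompt", correct=True, tokens=180)
revised = VariantResult("Revised prompt", correct=True, tokens=140)
print(pick_better(original, revised).name)  # Revised prompt
```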

Pattern 3: Regression Testing

After changing prompt, verify old scenarios still work.

markdown

Launch subagent with updated prompt + all previous test scenarios

Verify: All previous passes still pass
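
The verification step amounts to a set comparison between scenarios that passed before and the results of the latest run. A minimal sketch, assuming results are recorded as scenario name to pass/fail; `find_regressions` is a hypothetical helper, and it treats a scenario that was not re-run as a regression so gaps in coverage are not silently ignored.

```python
def find_regressions(previous_passes: set[str], current_results: dict[str, bool]) -> list[str]:
    """Return scenarios that passed before but fail (or are missing) now."""
    return sorted(
        scenario
        for scenario in previous_passes
        if not current_results.get(scenario, False)
    )


previous_passes = {"partial commit", "time pressure", "slow tests"}
current_results = {"partial commit": True, "time pressure": False}
print(find_regressions(previous_passes, current_results))  # ['slow tests', 'time pressure']
```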

Pattern 4: Stress Testing

For critical prompts, test under extreme conditions.

markdown

Launch subagent with:
- Maximum pressure scenarios
- Ambiguous edge cases
- Contradictory constraints
- Minimal context provided

Verify: Prompt provides adequate guidance even in worst case

Testing Checklist (TDD for Prompts)

Common Mistakes (Same as Code TDD)

❌ Writing prompt before testing (skipping RED)
Reveals what YOU think needs fixing, not what ACTUALLY needs fixing.
✅ Fix: Always run baseline scenarios first.

❌ Testing with conversation history
Accumulated context affects behavior - can't isolate prompt effect.
✅ Fix: Always use fresh subagents via Task tool.

❌ Not documenting exact failures
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent's actions and reasoning verbatim.

❌ Over-engineering prompts
Adding content for hypothetical issues you haven't observed.
✅ Fix: Only address failures you documented in baseline.

❌ Weak test cases
Academic scenarios where agent has no reason to fail.
✅ Fix: Use realistic scenarios with constraints, pressures, edge cases.

❌ Stopping after first pass
Tests pass once ≠ robust prompt.
✅ Fix: Continue REFACTOR until no new failures, optimize for tokens.

Example: Testing a Command

Scenario

Testing command: `/git:commit` - should create conventional commits with verification.

RED Phase

Launch subagent without command:

markdown

Task: You need to commit changes.

Modified files:
- src/payment.ts (new feature complete)
- src/experimental.ts (work in progress, broken)
- tests/payment.test.ts (tests for new feature)

Context: Teammate asked for commit by EOD. It's 5:45pm.

Make the commit.

Baseline result:

Agent: "I'll commit all the changes now since it's almost EOD."

git add .
git commit -m "Update payment feature"
git push

Failures documented:

❌ Committed broken experimental file
❌ Didn't run tests first
❌ Vague commit message (not conventional format)
❌ Didn't review diffs
❌ Time pressure caused shortcuts
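
Most of the failures documented above can be detected from the list of shell commands the subagent reports. The following sketch assumes the agent's commands are captured as plain strings and that `npm test` is the project's test command, as in this example; `audit_commit_session` is a hypothetical helper. The last failure (time pressure causing shortcuts) is a judgment about reasoning and is not detectable this way.

```python
import re

# Conventional commit subject: one of the allowed types, a colon, then text.
CONVENTIONAL = re.compile(r'^git commit -m "(feat|fix|docs|refactor|test|chore): \S')


def audit_commit_session(commands: list[str]) -> list[str]:
    """Return the documented failure types found in a list of shell commands."""
    failures = []
    if any(c.strip() in ("git add .", "git add -A", "git add --all") for c in commands):
        failures.append("staged everything instead of specific files")
    commit_idx = next(
        (i for i, c in enumerate(commands) if c.startswith("git commit")), None
    )
    if commit_idx is None:
        return failures + ["no commit made"]
    before_commit = commands[:commit_idx]
    if not any(c.startswith("npm test") for c in before_commit):
        failures.append("did not run tests before committing")
    if not any(c.startswith("git diff") for c in before_commit):
        failures.append("did not review diffs")
    if not CONVENTIONAL.match(commands[commit_idx]):
        failures.append("commit message is not in conventional format")
    return failures


baseline_commands = ["git add .", 'git commit -m "Update payment feature"', "git push"]
for failure in audit_commit_session(baseline_commands):
    print(failure)
```

Running the same audit on the commands from a later run with the command loaded gives a quick, repeatable check that these specific failures are gone.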

GREEN Phase

Write command addressing failures:

markdown

---
name: git:commit
---

Git Commit with Verification

Create commits following conventional commits format with mandatory verification.

Process

Review changes:
```bash
git status
git diff
```
Run tests:
```bash
npm test
```
If tests fail, fix before committing.
Stage specific files (NOT git add .):
```bash
git add src/payment.ts tests/payment.test.ts
```
Conventional commit format:
```
<type>: <description>

<body>
```
Types: feat, fix, docs, refactor, test, chore
Verify before push:
- git log -1 (review commit)
- git show (review diff)

Rules

Never commit work-in-progress or broken code
Never skip tests
Never use git add . without reviewing
Time pressure is not an exception

Test with command:

markdown

Launch subagent WITH command + same scenario

Result:

Agent: "I'm using git:commit command."

git status  # Review what changed
git diff    # Review exact changes
npm test    # Run tests first
git add src/payment.ts tests/payment.test.ts  # Only completed files
git commit -m "feat: add payment processing feature

Implements payment processing with validation and error handling.
Tests included for all payment flows."

✅ All baseline failures resolved.

REFACTOR Phase

Test edge case: "Tests take 5 minutes to run, manager said urgent"

Result:

Agent: "Tests take 5 minutes but I'll run them anyway as the command says not to skip."

✅ Resists time pressure.

Token optimization:

markdown

Before: ~180 tokens
After: ~140 tokens (22% reduction)

Removed: Redundant explanations of git basics
Kept: Critical rules and process steps

Re-test: ✅ Still works with fewer tokens.
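
The reduction figure above is simple arithmetic and is easy to recompute after each optimization pass. A minimal sketch; `token_reduction_percent` is a hypothetical helper, and the token counts are the approximate ones quoted in this example.

```python
def token_reduction_percent(before: int, after: int) -> float:
    """Percentage of tokens removed, relative to the original count."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return (before - after) / before * 100


print(round(token_reduction_percent(180, 140)))  # 22
```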

Deploy command.

Quick Reference

Prompt Type	RED Test	GREEN Fix	REFACTOR Focus
Instruction	Does agent skip steps?	Add explicit steps/verification	Reduce tokens, improve clarity
Discipline	Does agent rationalize?	Add counters for rationalizations	Close new loopholes
Guidance	Does agent misapply?	Clarify when/how to use	Add examples, simplify
Reference	Is information missing/wrong?	Add accurate details	Organize for findability
Subagent	Does task fail?	Clarify task/constraints	Optimize for token cost

Integration with Prompt Engineering

This command provides the TESTING methodology.

The `prompt-engineering` skill provides the WRITING techniques:

Few-shot learning (show examples in prompts)
Chain-of-thought (request step-by-step reasoning)
Template systems (reusable prompt structures)
Progressive disclosure (start simple, add complexity as needed)

Use together:

Design prompt using prompt-engineering patterns
Test prompt using this command (RED-GREEN-REFACTOR)
Optimize using prompt-engineering principles
Re-test to verify optimization didn't break behavior

The Bottom Line

Prompt creation IS TDD. Same principles, same cycle, same benefits.

If you wouldn't write code without tests, don't write prompts without testing them on agents.

RED-GREEN-REFACTOR for prompts works exactly like RED-GREEN-REFACTOR for code.

Always use fresh subagents via Task tool for isolated, reproducible testing.
