Testing Agents With Subagents


Operator Context


This skill operates as an operator for agent testing workflows, configuring Claude's behavior for systematic agent validation. It applies TDD methodology to agent development — RED (observe failures), GREEN (fix agent definition), REFACTOR (edge cases and robustness) — with subagent dispatch as the execution mechanism.

Hardcoded Behaviors (Always Apply)


  • CLAUDE.md Compliance: Read and follow repository CLAUDE.md files before testing
  • Over-Engineering Prevention: Only test what's directly needed. No elaborate test harnesses or infrastructure. Keep test cases focused and minimal.
  • Verbatim Output Capture: Document exact agent outputs. NEVER summarize or paraphrase.
  • Isolated Execution: Each test runs in a fresh subagent to avoid context pollution
  • Evidence-Based Claims: Every claim about agent behavior MUST be backed by actual test execution
  • No Self-Exemption: You cannot decide an agent doesn't need testing. Human partner must confirm exemptions.

Default Behaviors (ON unless disabled)


  • Multi-Case Testing: Run at least 3 test cases per agent (success, failure, edge case)
  • Output Schema Validation: Verify agent output matches expected structure and required sections
  • Consistency Testing: Run same input 2+ times to verify deterministic behavior
  • Regression Testing: After fixes, re-run ALL previous test cases before declaring green
  • Temporary File Cleanup: Remove test files and artifacts at completion. Keep only files needed for documentation.
  • Document Findings: Log all observations, hypotheses, and test results in structured format
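The Output Schema Validation behavior can be sketched as a simple check. This is a minimal, hypothetical illustration, not part of the skill; the function name and section headings are assumptions:

```python
def validate_output_schema(output: str, required_sections: list[str]) -> list[str]:
    """Return the required section headings missing from an agent's output."""
    return [s for s in required_sections if s not in output]

# Hypothetical reviewer-agent output with one required section missing
output = "## Summary\nLooks fine.\n## Findings\nNone."
missing = validate_output_schema(output, ["## Summary", "## Findings", "## Verdict"])
```

An empty `missing` list satisfies the schema check; any entries are structural failures to log.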

Optional Behaviors (OFF unless enabled)


  • A/B Testing: Compare agent variants using agent-comparison skill
  • Performance Benchmarking: Measure response time and token usage
  • Stress Testing: Test with large inputs, many iterations, concurrent requests
  • Eval Harness Integration: Use `evals/harness.py skill-test` for YAML-based automated testing

What This Skill CAN Do


  • Systematically validate agents through RED-GREEN-REFACTOR test cycles
  • Dispatch subagents with controlled inputs and capture verbatim outputs
  • Distinguish between output structure issues and behavioral correctness issues
  • Verify fixes don't introduce regressions across the full test suite
  • Test routing logic, skill invocation, and multi-agent workflows

What This Skill CANNOT Do


  • Deploy agents without completing all three test phases
  • Substitute reading agent prompts for executing actual test runs
  • Make claims about agent behavior without evidence from subagent dispatch
  • Evaluate agent quality structurally (use agent-evaluation instead)
  • Skip the RED phase even when "the fix is obvious"


Instructions


Phase 0: PREPARE — Understand the Agent


Goal: Read the agent definition and understand what it claims to do before writing tests.

**Step 1: Read the agent file**

```bash
# Read agent definition
cat agents/{agent-name}.md

# Read any referenced skills
cat skills/{skill-name}/SKILL.md
```

**Step 2: Identify testable claims**

Extract concrete, testable behaviors from the agent definition:
- What inputs does it accept?
- What output structure does it produce?
- What routing triggers should activate it?
- What error conditions does it handle?
- What skills does it invoke?

**Step 3: Determine minimum test count**

| Agent Type | Minimum Tests | Required Coverage |
|------------|---------------|-------------------|
| Reviewer agents | 6 | 2 real issues, 2 clean, 1 edge, 1 ambiguous |
| Implementation agents | 5 | 2 typical, 1 complex, 1 minimal, 1 error |
| Analysis agents | 4 | 2 standard, 1 edge, 1 malformed |
| Routing/orchestration | 4 | 2 correct route, 1 ambiguous, 1 invalid |

No gate — this phase is preparation. Move directly to Phase 1.
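The minimum-test-count table can be expressed as a quick lookup when drafting a test plan. A sketch only; the dictionary keys are shorthand for the agent types above:

```python
# Minimum tests per agent type, taken from the table above
MIN_TESTS = {
    "reviewer": 6,
    "implementation": 5,
    "analysis": 4,
    "routing": 4,
}

def enough_tests(agent_type: str, planned: int) -> bool:
    """Check whether a draft test plan meets the minimum for its agent type."""
    return planned >= MIN_TESTS[agent_type]
```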

Phase 1: RED — Observe Current Behavior


Goal: Run the agent with test inputs and document exact current behavior before any changes.

**Step 1: Define test plan**

```markdown
# Test Plan: {agent-name}

**Agent Purpose**: {what the agent does}
**Agent File**: agents/{agent-name}.md
**Date**: {date}

## Test Cases

| ID | Input | Expected Output | Validates |
|----|-------|-----------------|-----------|
| T1 | {input} | {expected} | Happy path |
| T2 | {input} | {expected} | Error handling |
| T3 | {input} | {expected} | Edge case |
```

Write the test plan to a file before executing. This creates a reproducible baseline.

**Step 2: Dispatch subagent with test inputs**

Use the Task tool to dispatch the agent:
```python
Task(
    prompt="""
    [Test input for the agent]

    Context: [Any required context]

    {Include the actual problem/request the agent should handle}
    """,
    subagent_type="{agent-name}"
)
```

**Step 3: Capture results verbatim**

```markdown
## Test T1: Happy Path

**Input**: {exact input provided}
**Expected Output**: {what you expected}
**Actual Output**: {verbatim output from agent — do not summarize}
**Result**: PASS / FAIL
**Failure Reason**: {if FAIL, exactly what was wrong}
```

**Step 4: Identify failure patterns**
- Which test categories fail (happy path, error, edge)?
- Are failures structural (missing sections) or behavioral (wrong answers)?
- Do failures correlate with input characteristics?

**Gate**: All test cases executed. Exact outputs captured verbatim. Failures documented with specific issues identified. Proceed only when gate passes.
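One way to keep RED-phase captures structured and verbatim is a small record type. This is an illustrative sketch, not part of the skill; the field names and example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """One RED-phase observation; `actual` holds the subagent output verbatim."""
    test_id: str
    input: str
    expected: str
    actual: str
    passed: bool
    failure_reason: str = ""

# Hypothetical failing capture for test T1
r = TestResult("T1", "{exact input}", "{expected output}",
               "{verbatim agent output}", False, "Missing Verdict section")
```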

Phase 2: GREEN — Fix Agent Definition


Goal: Update the agent definition until all test cases pass. One fix at a time.

**Step 1: Prioritize failures**

Triage failures by severity:

| Severity | Description | Priority |
|----------|-------------|----------|
| Critical | Agent produces wrong answers or harmful output | Fix first |
| High | Agent missing required output sections | Fix second |
| Medium | Agent formatting or structure issues | Fix third |
| Low | Agent phrasing or style inconsistencies | Fix last |

**Step 2: Diagnose root cause**

| Failure Type | Fix Approach |
|--------------|--------------|
| Missing output section | Add explicit instruction to include the section |
| Wrong format | Add output schema with examples |
| Missing context handling | Add instructions for handling missing info |
| Incorrect classification | Add calibration examples |
| Hallucinated content | Add constraint to only use provided info |
| Agent asks questions instead of answering | Provide required context in prompt or add default handling |
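The triage order above can be applied mechanically when several tests fail at once. A minimal sketch; the failure descriptions are hypothetical:

```python
# Severity ranking from the triage table
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def triage(failures: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (severity, description) pairs so critical failures are fixed first."""
    return sorted(failures, key=lambda f: SEVERITY_ORDER[f[0]])

fails = [("low", "odd phrasing"), ("critical", "wrong answer"), ("high", "missing section")]
ordered = triage(fails)
```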
**Step 3: Make one fix at a time**

Change one thing in the agent definition. Re-run ALL test cases. Document which tests now pass/fail.

Never make multiple fixes simultaneously — you cannot determine which change was effective. This is the same principle as debugging: one variable at a time.

**Step 4: Iterate until green**

Repeat Step 3 until all test cases pass. If a fix causes a previously passing test to fail, revert and try a different approach.

Track fix iterations:

```markdown
# Fix Log

| Iteration | Change Made | Tests Passed | Tests Failed | Action |
|-----------|-------------|--------------|--------------|--------|
| 1 | Added output schema | T1, T2 | T3 | Continue |
| 2 | Added error handling instruction | T1, T2, T3 | | Green |
```

**Gate**: All test cases pass. No regressions from previously passing tests. Can explain what each fix changed and why. Proceed only when gate passes.
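The no-regressions part of this gate reduces to a set comparison between runs. A sketch with hypothetical test IDs:

```python
def regressions(previously_passing: set[str], now_passing: set[str]) -> set[str]:
    """Tests that passed before a fix but fail after it; any hit means revert."""
    return previously_passing - now_passing

# New passes are fine; losing a previously green test is a regression
assert regressions({"T1", "T2"}, {"T1", "T2", "T3"}) == set()
assert regressions({"T1", "T2"}, {"T1", "T3"}) == {"T2"}
```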

Phase 3: REFACTOR — Edge Cases and Robustness


Goal: Verify the agent handles boundary conditions and produces consistent outputs.

**Step 1: Add edge case tests**

| Category | Test Cases |
|----------|------------|
| Empty Input | Empty string, whitespace only, no context |
| Large Input | Very long content, deeply nested structures |
| Unusual Input | Malformed data, unexpected formats |
| Ambiguous Input | Cases where correct behavior is unclear |
**Step 2: Run consistency tests**

Run the same input 3 times. Outputs should be consistent:
  • Same structure
  • Same key findings (for analysis agents)
  • Acceptable variation in phrasing only

If inconsistent: add more explicit instructions to the agent definition. Re-test.
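The "same structure, phrasing may vary" rule can be checked by comparing a structural fingerprint across runs. A minimal sketch assuming markdown-style outputs; the helper names are illustrative:

```python
def headings(output: str) -> list[str]:
    """Extract markdown headings as a structural fingerprint of one run."""
    return [line for line in output.splitlines() if line.startswith("#")]

def consistent_structure(runs: list[str]) -> bool:
    """True when every run produced the same section structure."""
    return len({tuple(headings(r)) for r in runs}) == 1

runs = ["# Report\n## Findings\nIssue A", "# Report\n## Findings\nIssue A, reworded"]
```

Here `consistent_structure(runs)` holds because only the body phrasing differs, which this skill treats as acceptable variation.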
**Step 3: Run regression suite**

Re-run ALL test cases (original + edge cases) to confirm nothing broke during refactoring.

**Step 4: Document final test report**

```markdown
# Test Report: {agent-name}

| Metric | Result |
|--------|--------|
| Test Cases Run | N |
| Passed | N |
| Failed | N |
| Pass Rate | N% |

## Verdict

READY FOR DEPLOYMENT / NEEDS FIXES / REQUIRES REVIEW
```

**Gate**: Edge cases handled. Consistency verified. Full suite green. Test report documented. Testing is complete.
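The report metrics and verdict can be derived mechanically from the suite results. A sketch; the 100% bar mirrors the gates in this skill, while the review threshold is an illustrative assumption:

```python
def verdict(passed: int, run: int) -> str:
    """Map a pass rate to the test report verdict."""
    rate = passed / run
    if rate == 1.0:
        return "READY FOR DEPLOYMENT"
    if rate >= 0.8:  # assumed cutoff, not specified by the skill
        return "REQUIRES REVIEW"
    return "NEEDS FIXES"
```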

---

Examples


Example 1: Testing a New Reviewer Agent


User says: "Test the new reviewer-security agent"
Actions:
  1. Define 6 test cases: 2 real issues, 2 clean code, 1 edge case, 1 ambiguous (RED)
  2. Dispatch subagent for each, capture verbatim outputs (RED)
  3. Fix agent definition for any failures, re-run all tests (GREEN)
  4. Add edge cases (empty input, malformed code), verify consistency (REFACTOR)
Result: Agent passes all tests, report documents pass rate and verdict

Example 2: Testing After Agent Modification


User says: "I updated the golang-general-engineer, make sure it still works"
Actions:
  1. Run existing test cases against modified agent (RED)
  2. Compare outputs to previous baseline (RED)
  3. Fix any regressions introduced by the modification (GREEN)
  4. Test edge cases to verify robustness not degraded (REFACTOR)
Result: Agent modification validated, no regressions confirmed

Example 3: Testing Routing Logic


User says: "Verify the /do router sends Go requests to the right agent"
Actions:
  1. Define test cases: "Review this Go code", "Fix this .go file", "Write a goroutine" (RED)
  2. Dispatch each through router, verify correct agent handles it (RED)
  3. Fix routing triggers if wrong agent selected (GREEN)
  4. Test ambiguous inputs like "Review this code" with mixed-language context (REFACTOR)
Result: Routing validated for all trigger phrases, ambiguous cases documented


Error Handling


Error: "Agent type not found"


Cause: Agent not registered or name misspelled
Solution:
  1. Verify agent file exists: `ls agents/{agent-name}.md`
  2. Check YAML frontmatter has correct `name` field
  3. Restart Claude Code to pick up new agents

Error: "Inconsistent outputs across runs"


Cause: Agent produces different results for same input
Solution:
  1. Document the inconsistency — this is a valid finding
  2. Add more explicit instructions to agent definition
  3. Re-test consistency after fix
  4. Determine if variation is acceptable (phrasing) or problematic (structure/findings)

Error: "Subagent timeout"


Cause: Agent taking too long to respond
Solution:
  1. Simplify test input to reduce processing
  2. Check agent isn't in an infinite loop or excessive tool use
  3. Increase timeout if agent legitimately needs more time

Error: "Agent asks questions instead of answering"


Cause: Agent needs clarification that test input did not provide
Solution:
  1. This may be correct behavior — agent properly requesting context
  2. Update test input to provide the required context
  3. Or update agent definition to handle ambiguity with defaults
  4. Document whether questioning behavior is acceptable for this agent type


Anti-Patterns


Anti-Pattern 1: Testing Without Capturing Exact Output


**What it looks like**: "Tested the agent, it looks good."
**Why wrong**: No evidence of what was tested. Cannot reproduce or verify results. Subjective assessment instead of objective evidence.
**Do instead**: Capture verbatim output for every test case. Document input, expected, actual, and result.

Anti-Pattern 2: Testing Only Happy Path


**What it looks like**: "Tested with one example, it worked."
**Why wrong**: Agents fail on edge cases most often. One test proves almost nothing. False confidence in agent quality.
**Do instead**: Minimum 3-6 test cases per agent covering success, failure, edge, and ambiguous inputs.

Anti-Pattern 3: Skipping Re-test After Fixes


**What it looks like**: "Fixed the issue, should work now."
**Why wrong**: Fix might have broken other tests. No verification fix actually works. Regression bugs slip through.
**Do instead**: Re-run ALL test cases after any change. Only mark green when full suite passes.

Anti-Pattern 4: Reading Prompts Instead of Running Agents


**What it looks like**: "Checked that agent prompt has the right sections."
**Why wrong**: Reading a prompt is not executing an agent. Prompt structure does not guarantee behavior. Must verify actual output.
**Do instead**: Test what the agent DOES, not what the prompt SAYS. Execute with real inputs via Task tool.

Anti-Pattern 5: Self-Exempting from Testing


**What it looks like**: "This agent is simple, doesn't need testing." or "Simple change, no need to re-test."
**Why wrong**: Simple agents can still fail. Small changes can break behavior. You cannot self-determine exemptions from testing.
**Do instead**: Get human partner confirmation for exemptions. When in doubt, test. Document why testing was skipped if approved.


References


This skill uses these shared patterns:
  • Anti-Rationalization — Prevents shortcut rationalizations
  • Anti-Rationalization: Testing — Testing-specific rationalization blocks
  • Verification Checklist — Pre-completion checks

Domain-Specific Anti-Rationalization


| Rationalization | Why It's Wrong | Required Action |
|-----------------|----------------|-----------------|
| "Agent prompt looks correct" | Reading prompt ≠ executing agent | Dispatch subagent and capture output |
| "Tested manually in conversation" | Not reproducible, no baseline | Use Task tool for formal dispatch |
| "Only a small change" | Small changes can break agent behavior | Run full test suite |
| "Will monitor in production" | Production monitoring ≠ pre-deployment testing | Complete RED-GREEN-REFACTOR first |
| "Based on working template" | Template correctness ≠ instance correctness | Test this specific agent |

Integration


  • agent-comparison: A/B test agent variants
  • agent-evaluation: Structural quality checks
  • test-driven-development: TDD principles applied to agents

Reference Files


  • ${CLAUDE_SKILL_DIR}/references/testing-patterns.md: Dispatch patterns, test scenarios, eval harness integration