# agent-comparison: Agent Comparison Skill
## Operator Context
This skill operates as an operator for agent A/B testing workflows, configuring Claude's behavior for rigorous, evidence-based variant comparison. It implements the Benchmark Pipeline architectural pattern — prepare variants, run identical tasks, measure outcomes, report findings — with Domain Intelligence embedded in the comparison methodology.
### Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before execution
- Over-Engineering Prevention: Keep benchmark scripts simple. No speculative features or configurable frameworks that were not requested
- Identical Task Prompts: Both agents MUST receive the exact same task description, character-for-character
- Isolated Execution: Each agent runs in a separate session to avoid contamination
- Test-Based Validation: All generated code MUST pass the same test suite with the `-race` flag
- Evidence-Based Reporting: Every claim backed by measurable data (tokens, test counts, quality scores)
- Total Session Cost: Measure total tokens to working solution, not just prompt size
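The identical-prompt rule above can be enforced mechanically before any run. A minimal sketch of a byte-identity check with `cmp`, assuming a `benchmark/<task>/{full,compact}/` layout; all paths and the demo prompt are illustrative, created here only so the sketch is self-contained:

```shell
# Write the task prompt once, copy it to both variants, then verify
# byte-identity before spawning either agent.
mkdir -p benchmark/demo-task/full benchmark/demo-task/compact
printf 'Implement an LRU cache with TTL in Go.\n' > benchmark/demo-task/prompt.md
cp benchmark/demo-task/prompt.md benchmark/demo-task/full/prompt.md
cp benchmark/demo-task/prompt.md benchmark/demo-task/compact/prompt.md

if cmp -s benchmark/demo-task/full/prompt.md benchmark/demo-task/compact/prompt.md; then
    echo "prompts identical"
else
    echo "PROMPTS DIFFER - comparison invalid" >&2
    exit 1
fi
```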
### Default Behaviors (ON unless disabled)
- Communication Style: Report facts without self-congratulation. Show command output rather than describing it
- Temporary File Cleanup: Remove temporary benchmark files and debug outputs at completion. Keep only comparison report and generated code
- Two-Tier Benchmarking: Run both simple (algorithmic) and complex (production) tasks
- Token Tracking: Record input/output token counts per turn where visible
- Quality Grading: Score code on correctness, error handling, idioms, documentation, testing
- Comparative Summary: Generate side-by-side comparison report with clear verdict
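Where token counts are visible, the per-turn tracking above reduces to a small log-summing step. A sketch under an assumed `in=<n> out=<n>` per-turn log format (real clients report usage differently; adapt the field parsing to what yours emits):

```shell
# Seed a demo log in the assumed format, then sum input/output tokens with awk.
cat > /tmp/token.log <<'EOF'
turn=1 in=1200 out=450
turn=2 in=300 out=900
turn=3 in=150 out=600
EOF

awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^in=/)  { sub(/^in=/,  "", $i); in_total  += $i }
        if ($i ~ /^out=/) { sub(/^out=/, "", $i); out_total += $i }
    }
} END { printf "input=%d output=%d total=%d\n", in_total, out_total, in_total + out_total }' /tmp/token.log
```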
### Optional Behaviors (OFF unless enabled)
- Multiple Runs: Run each benchmark 3x to account for variance
- Blind Evaluation: Hide agent identity during quality grading
- Extended Benchmark Suite: Run additional domain-specific tests
- Historical Tracking: Compare against previous benchmark runs
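When blind evaluation is enabled, one way to hide agent identity is to copy each solution into anonymized candidate directories and set the unblinding key aside until grading is complete. A bash sketch; the layout, file names, and stub solutions are all assumptions:

```shell
# Copy solutions into candidate-A/candidate-B with a random assignment so the
# grader never sees which variant produced which code.
mkdir -p benchmark/demo/full benchmark/demo/compact blind
echo 'package main' > benchmark/demo/full/main.go
echo 'package main' > benchmark/demo/compact/main.go

if [ $((RANDOM % 2)) -eq 0 ]; then
    cp -r benchmark/demo/full blind/candidate-A
    cp -r benchmark/demo/compact blind/candidate-B
    echo "A=full B=compact" > blind/key.txt
else
    cp -r benchmark/demo/compact blind/candidate-A
    cp -r benchmark/demo/full blind/candidate-B
    echo "A=compact B=full" > blind/key.txt
fi
ls blind
```

Grade everything under `blind/`, then consult `key.txt` only when writing the report.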
## What This Skill CAN Do
- Systematically compare agent variants through controlled benchmarks
- Measure total session token cost (prompt + reasoning + tools + retries)
- Grade code quality using domain-specific checklists
- Reveal quality differences invisible to simple metrics (prompt size, line count)
- Generate comparison reports with evidence-backed verdicts
## What This Skill CANNOT Do
- Compare agents without running identical tasks on both
- Declare a winner based on prompt size alone
- Skip quality grading and rely only on test pass rates
- Evaluate single agents in isolation (use quality-grading skill instead)
- Compare skills or prompts (this is for agent variants only)
## Instructions
### Phase 1: PREPARE
**Goal**: Create benchmark environment and validate both agent variants exist.

**Step 1: Analyze original agent**

```bash
# Count original agent size
wc -l agents/{original-agent}.md

# Identify major sections
grep "^## " agents/{original-agent}.md

# Count code examples (candidates for removal in compact version)
grep -c '```' agents/{original-agent}.md
```
**Step 2: Create or validate compact variant**
If creating a compact variant, preserve:
- YAML frontmatter (name, description, routing)
- Operator Context (Hardcoded/Default/Optional)
- Core patterns and principles
- Error handling philosophy
Remove or condense:
- Lengthy code examples (keep 1-2 representative examples per pattern)
- Verbose explanations (condense to bullet points)
- Redundant instructions and changelogs
Target: 10-15% of original size while keeping essential knowledge. Removing capability (error handling patterns, concurrency patterns) invalidates the comparison. Remove redundancy, not knowledge.
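The 10-15% target can be checked with a quick size-ratio calculation. A sketch using stand-in files in place of `agents/{original-agent}.md` and the compact variant (line counts here are fabricated for the demo):

```shell
# Stand-ins: a 1000-line "original" and a 120-line "compact" variant.
seq 1 1000 > original.md
seq 1 120  > compact.md

orig=$(wc -l < original.md)
comp=$(wc -l < compact.md)
pct=$((100 * comp / orig))
echo "compact is ${pct}% of original"
if [ "$pct" -ge 10 ] && [ "$pct" -le 15 ]; then
    echo "within 10-15% target"
else
    echo "outside target - revisit what was removed"
fi
```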
**Step 3: Validate compact variant structure**

```bash
# Verify YAML frontmatter
head -20 agents/{compact-agent}.md | grep -E "^(name|description):"

# Verify Operator Context preserved
grep -c "### Hardcoded Behaviors" agents/{compact-agent}.md

# Compare sizes
echo "Original: $(wc -l < agents/{original-agent}.md) lines"
echo "Compact: $(wc -l < agents/{compact-agent}.md) lines"
```

**Step 4: Create benchmark directory and prepare prompts**

```bash
mkdir -p benchmark/{task-name}/{full,compact}
```

Write the task prompt ONCE, then copy it for both agents. NEVER customize prompts per agent.

**Gate**: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes.
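The Phase 1 gate lends itself to a small pre-flight script. A sketch, with demo files standing in for the real agent definition and benchmark layout (all names are illustrative):

```shell
# Pre-flight: frontmatter fields present, benchmark directories exist.
mkdir -p benchmark/demo-task/full benchmark/demo-task/compact
printf '%s\n' '---' 'name: demo-agent' 'description: demo' '---' > compact-agent.md

fail=0
grep -q '^name:' compact-agent.md        || { echo "missing name in frontmatter"; fail=1; }
grep -q '^description:' compact-agent.md || { echo "missing description in frontmatter"; fail=1; }
[ -d benchmark/demo-task/full ]          || { echo "missing full dir"; fail=1; }
[ -d benchmark/demo-task/compact ]       || { echo "missing compact dir"; fail=1; }

[ "$fail" -eq 0 ] && echo "gate passed" || echo "gate FAILED"
```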
### Phase 2: BENCHMARK
**Goal**: Run identical tasks on both agents, capturing all metrics.

**Step 1: Run simple task benchmark (2-3 tasks)**

Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Both agents should perform identically on well-defined problems. Simple tasks establish a baseline — if an agent fails here, it has fundamental issues.

Spawn both agents in parallel using the Task tool for fair timing:

```
Task(
    prompt="[exact task prompt]\nSave to: benchmark/{task}/full/",
    subagent_type="{full-agent}"
)
Task(
    prompt="[exact task prompt]\nSave to: benchmark/{task}/compact/",
    subagent_type="{compact-agent}"
)
```

Run in parallel to avoid caching effects or system load variance skewing results.

**Step 2: Run complex task benchmark (1-2 tasks)**

Use production-style problems that require concurrency, error handling, and edge case anticipation. These are where quality differences emerge. See `references/benchmark-tasks.md` for standard tasks.

Recommended complex tasks:

- Worker Pool: Rate limiting, graceful shutdown, panic recovery
- LRU Cache with TTL: Generics, background goroutines, zero-value semantics
- HTTP Service: Middleware chains, structured errors, health checks

**Step 3: Capture metrics for each run**

Record immediately after each agent completes. Do NOT wait until all runs finish.
| Metric | Full Agent | Compact Agent |
|---|---|---|
| Tests pass | X/X | X/X |
| Race conditions | X | X |
| Code lines (main) | X | X |
| Test lines | X | X |
| Session tokens | X | X |
| Wall-clock time | Xm Xs | Xm Xs |
| Retry cycles | X | X |
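Recording metrics immediately is easier if each run appends one row to a shared CSV. A sketch mirroring the table's columns; the values shown are placeholders, not real measurements:

```shell
# Append one CSV row per run; the header is written once.
csv=benchmark/metrics.csv
mkdir -p benchmark
[ -f "$csv" ] || echo "task,agent,tests_pass,races,code_lines,test_lines,tokens,seconds,retries" > "$csv"

echo "worker-pool,full,12/12,0,180,240,69600,312,0" >> "$csv"
echo "worker-pool,compact,8/12,1,95,110,69500,298,2" >> "$csv"
cat "$csv"
```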
**Step 4: Run tests with race detector**

```bash
cd benchmark/{task-name}/full && go test -race -v -count=1
cd benchmark/{task-name}/compact && go test -race -v -count=1
```

Use `-count=1` to disable test caching. Race conditions are automatic quality failures — record them but do NOT fix them for the agent being tested.

**Gate**: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes.
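To keep the two test invocations exactly parallel, the commands can be generated in one loop and reviewed before running. A sketch that prints rather than executes them (the `benchmark/demo-task` path and log names are illustrative):

```shell
# Emit identical go test commands for both variants; -count=1 defeats the
# test cache, and output is tee'd to a per-variant log for later grading.
for variant in full compact; do
    echo "cd benchmark/demo-task/$variant && go test -race -v -count=1 2>&1 | tee ../$variant-test.log"
done
```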
### Phase 3: GRADE
**Goal**: Score code quality beyond pass/fail using domain-specific checklists.

**Step 1: Create quality checklist BEFORE reviewing code**

Define criteria before seeing results to prevent bias. Do NOT invent criteria after seeing one agent's output. See `references/grading-rubric.md` for standard rubrics.

| Criterion | 5/5 | 3/5 | 1/5 |
|---|---|---|---|
| Correctness | All tests pass, no race conditions | Some failures | Broken |
| Error Handling | Comprehensive, production-ready | Adequate | None |
| Idioms | Exemplary for the language | Acceptable | Anti-patterns |
| Documentation | Thorough | Adequate | None |
| Testing | Comprehensive coverage | Basic | Minimal |

**Step 2: Score each solution independently**

Grade each agent's code on all five criteria. Score one agent completely before starting the other.

```markdown
## {Agent} Solution - {Task}

| Criterion | Score | Notes |
|---|---|---|
| Correctness | X/5 | |
| Error Handling | X/5 | |
| Idioms | X/5 | |
| Documentation | X/5 | |
| Testing | X/5 | |
| Total | X/25 | |
```
**Step 3: Document specific bugs with production impact**

For each bug found, record:

```markdown
Bug: {description}
- Agent: {which agent}
- What happened: {behavior}
- Correct behavior: {expected}
- Production impact: {consequence}
- Test coverage: {did tests catch it? why not?}
```

"Tests pass" is necessary but not sufficient. Production bugs often pass tests — Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL.

**Step 4: Calculate effective cost**

```
effective_cost = total_tokens * (1 + bug_count * 0.25)
```
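The formula can be sketched as a shell helper, kept in integer arithmetic by scaling the 25%-per-bug penalty by 100. The token and bug counts fed to it below are the example figures used in this step:

```shell
# effective_cost = total_tokens * (1 + bug_count * 0.25), scaled by 100
# so the penalty stays in integer arithmetic.
effective_cost() {
    tokens=$1
    bugs=$2
    echo $(( tokens * (100 + bugs * 25) / 100 ))
}

echo "194k tokens, 0 bugs -> $(effective_cost 194000 0) effective tokens"
echo "119k tokens, 5 bugs -> $(effective_cost 119000 5) effective tokens"
```

With these inputs the nominally cheaper 119k-token run comes out more expensive once the bug penalty is applied, which is the economics point this step makes.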
An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to a working, production-quality solution.

**Gate**: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. Proceed only when gate passes.

### Phase 4: REPORT
**Goal**: Generate comparison report with evidence-backed verdict.

**Step 1: Generate comparison report**

Use the report template from `references/report-template.md`. Include:

- Executive summary with clear winner per metric
- Per-task results with metrics tables
- Token economics analysis (one-time prompt cost vs session cost)
- Specific bugs found and their production impact
- Verdict based on total evidence

**Step 2: Run comparison analysis**

```bash
# TODO: scripts/compare.py not yet implemented
# Manual alternative: compare benchmark outputs side-by-side
diff benchmark/{task-name}/full/ benchmark/{task-name}/compact/
```
**Step 3: Analyze token economics**
The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn.
| Pattern | Description |
|---------|-------------|
| Large agent, low churn | High initial cost, fewer retries, less debugging |
| Small agent, high churn | Low initial cost, more retries, more debugging |
When a micro agent produces correct code, it uses approximately the same total tokens as the full agent. The savings appear only when it cuts corners.
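The two patterns in the table can be made concrete with a toy cost model: prompt tokens are paid once, per-turn tokens on every turn. All numbers below are illustrative, not measurements:

```shell
# Toy model: session_cost = one-time prompt tokens + per-turn tokens * turns.
session_cost() {
    prompt=$1
    per_turn=$2
    turns=$3
    echo $(( prompt + per_turn * turns ))
}

# Large agent: big prompt, converges in 4 turns.
echo "large agent, low churn:  $(session_cost 12000 8000 4) tokens"
# Small agent: tiny prompt, needs 9 turns of retries and debugging.
echo "small agent, high churn: $(session_cost 500 8000 9) tokens"
```

Under these made-up numbers the small agent's extra churn swamps its prompt savings, which is why total session cost, not prompt size, is the metric to report.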
**Step 4: State verdict with evidence**
The verdict MUST be backed by data. Include:
- Which agent won on simple tasks (expected: equivalent)
- Which agent won on complex tasks (expected: full agent)
- Total session cost comparison
- Effective cost comparison (with bug penalty)
- Clear recommendation for when to use each variant
See `references/methodology.md` for the complete testing methodology with December 2024 data.
**Gate**: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory.
---

## Examples
### Example 1: Creating a Compact Agent
User says: "Create a compact version of golang-general-engineer and test it"

Actions:
- Analyze original, create compact variant at 10-15% size (PREPARE)
- Run simple task (Advent of Code) + complex task (Worker Pool) on both (BENCHMARK)
- Score both with domain-specific checklist, calculate effective cost (GRADE)
- Generate comparison report with verdict (REPORT)

Result: Data-driven recommendation on whether the compact version is viable
### Example 2: Comparing Internal vs External Agent
User says: "Compare our Go agent against go-expert-0xfurai"

Actions:
- Validate both agents exist, prepare identical task prompts (PREPARE)
- Run two-tier benchmarks with token tracking (BENCHMARK)
- Grade with production quality checklist, document all bugs (GRADE)
- Report with token economics showing prompt cost vs session cost (REPORT)

Result: Evidence-based comparison showing true cost of each variant
## Error Handling
### Error: "Agent Type Not Found"

**Cause**: Agent not registered or name misspelled

**Solution**: Verify the agent file exists in the agents/ directory. Restart the Claude Code client to pick up new definitions.
### Error: "Tests Fail with Race Condition"

**Cause**: Concurrent code has data races

**Solution**: This is a real quality difference. Record it as a finding in the grade. Do NOT fix it for the agent being tested.
### Error: "Different Test Counts Between Agents"

**Cause**: Agents wrote different test suites

**Solution**: Valid data point. Grade on test coverage and quality, not raw count. More tests are not always better.
### Error: "Timeout During Agent Execution"

**Cause**: Complex task taking too long or agent stuck in a retry loop

**Solution**: Note the timeout and the number of retries attempted. Record the run as incomplete with partial metrics. Increase the timeout limit if warranted, but excessive retries are a quality signal — an agent that needs many retries is less efficient regardless of final outcome.
## Anti-Patterns
### Anti-Pattern 1: Comparing Only Prompt Size

**What it looks like**: "Compact agent is 90% smaller, therefore 90% more efficient"

**Why wrong**: The prompt is a one-time cost. Session reasoning, retries, and debugging dominate total tokens. Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution.

**Do instead**: Measure total session tokens to a working solution.
### Anti-Pattern 2: Different Task Prompts

**What it looks like**: Giving the full agent harder requirements than the compact agent

**Why wrong**: Creates an unfair comparison. Different requirements produce different solutions, invalidating all measurements.

**Do instead**: Copy-paste identical prompts character-for-character. Verify before running.
### Anti-Pattern 3: Treating Test Failures as Equal Quality

**What it looks like**: "Both agents completed the task" when one passes 12/12 tests and the other passes 8/12

**Why wrong**: Bugs have real cost. This is a false equivalence between producing code and producing working code.

**Do instead**: Grade quality rigorously. Calculate effective cost with the bug penalty multiplier.
### Anti-Pattern 4: Single Benchmark Declaration

**What it looks like**: "Tested on one puzzle. Compact agent wins!"

**Why wrong**: A single data point is sensitive to task selection bias. Simple tasks mask differences in edge case handling. You cannot distinguish luck from systematic quality.

**Do instead**: Run two-tier benchmarking with 2-3 simple tasks and 1-2 complex tasks.
### Anti-Pattern 5: Removing Core Patterns to Create Compact Agent

**What it looks like**: The compact version removes error handling patterns, concurrency guidance, and testing requirements to reduce size

**Why wrong**: Creates an unfair comparison. The compact agent is missing essential knowledge, guaranteeing quality degradation rather than testing whether brevity is possible.

**Do instead**: Remove verbose examples and redundant explanations, not capability. Keep one representative example per pattern. Condense explanations to bullet points but retain key insights.
## References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
### Domain-Specific Anti-Rationalization
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Compact agent saved 50% tokens" | Savings may come from cutting corners, not efficiency | Check quality scores before claiming savings |
| "Tests pass, agents are equal" | Tests can miss production bugs (goroutine leaks, wrong semantics) | Apply domain-specific quality checklist |
| "One benchmark is enough" | Single task is sensitive to selection bias | Run two-tier benchmarks (simple + complex) |
| "Prompt size determines cost" | Prompt is one-time; reasoning tokens dominate sessions | Measure total session cost to working solution |
### Reference Files
- `${CLAUDE_SKILL_DIR}/references/methodology.md`: Complete testing methodology with December 2024 data
- `${CLAUDE_SKILL_DIR}/references/grading-rubric.md`: Detailed grading criteria and quality checklists
- `${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md`: Standard benchmark task descriptions and prompts
- `${CLAUDE_SKILL_DIR}/references/report-template.md`: Comparison report template with all required sections