# agent-comparison: Agent Comparison Skill
## Operator Context
This skill operates as an operator for agent A/B testing workflows, configuring Claude's behavior for rigorous, evidence-based variant comparison. It implements the Benchmark Pipeline architectural pattern — prepare variants, run identical tasks, measure outcomes, report findings — with Domain Intelligence embedded in the comparison methodology.
### Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before execution
- Over-Engineering Prevention: Keep benchmark scripts simple. No speculative features or configurable frameworks that were not requested
- Identical Task Prompts: Both agents MUST receive the exact same task description, character-for-character
- Isolated Execution: Each agent runs in a separate session to avoid contamination
- Test-Based Validation: All generated code MUST pass the same test suite with the `-race` flag
- Evidence-Based Reporting: Every claim backed by measurable data (tokens, test counts, quality scores)
- Total Session Cost: Measure total tokens to working solution, not just prompt size
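The identical-prompt rule above can be enforced mechanically before any run. A minimal sketch of a byte-identity check with `cmp`, assuming a `benchmark/<task>/{full,compact}/` layout; all paths and the demo prompt are illustrative, created here only so the sketch is self-contained:

```shell
# Write the task prompt once, copy it to both variants, then verify
# byte-identity before spawning either agent.
mkdir -p benchmark/demo-task/full benchmark/demo-task/compact
printf 'Implement an LRU cache with TTL in Go.\n' > benchmark/demo-task/prompt.md
cp benchmark/demo-task/prompt.md benchmark/demo-task/full/prompt.md
cp benchmark/demo-task/prompt.md benchmark/demo-task/compact/prompt.md

if cmp -s benchmark/demo-task/full/prompt.md benchmark/demo-task/compact/prompt.md; then
    echo "prompts identical"
else
    echo "PROMPTS DIFFER - comparison invalid" >&2
    exit 1
fi
```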
### Default Behaviors (ON unless disabled)
- Communication Style: Report facts without self-congratulation. Show command output rather than describing it
- Temporary File Cleanup: Remove temporary benchmark files and debug outputs at completion. Keep only comparison report and generated code
- Two-Tier Benchmarking: Run both simple (algorithmic) and complex (production) tasks
- Token Tracking: Record input/output token counts per turn where visible
- Quality Grading: Score code on correctness, error handling, idioms, documentation, testing
- Comparative Summary: Generate side-by-side comparison report with clear verdict
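Where token counts are visible, the per-turn tracking above reduces to a small log-summing step. A sketch under an assumed `in=<n> out=<n>` per-turn log format (real clients report usage differently; adapt the field parsing to what yours emits):

```shell
# Seed a demo log in the assumed format, then sum input/output tokens with awk.
cat > /tmp/token.log <<'EOF'
turn=1 in=1200 out=450
turn=2 in=300 out=900
turn=3 in=150 out=600
EOF

awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^in=/)  { sub(/^in=/,  "", $i); in_total  += $i }
        if ($i ~ /^out=/) { sub(/^out=/, "", $i); out_total += $i }
    }
} END { printf "input=%d output=%d total=%d\n", in_total, out_total, in_total + out_total }' /tmp/token.log
```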
### Optional Behaviors (OFF unless enabled)
- Multiple Runs: Run each benchmark 3x to account for variance
- Blind Evaluation: Hide agent identity during quality grading
- Extended Benchmark Suite: Run additional domain-specific tests
- Historical Tracking: Compare against previous benchmark runs
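When blind evaluation is enabled, one way to hide agent identity is to copy each solution into anonymized candidate directories and set the unblinding key aside until grading is complete. A bash sketch; the layout, file names, and stub solutions are all assumptions:

```shell
# Copy solutions into candidate-A/candidate-B with a random assignment so the
# grader never sees which variant produced which code.
mkdir -p benchmark/demo/full benchmark/demo/compact blind
echo 'package main' > benchmark/demo/full/main.go
echo 'package main' > benchmark/demo/compact/main.go

if [ $((RANDOM % 2)) -eq 0 ]; then
    cp -r benchmark/demo/full blind/candidate-A
    cp -r benchmark/demo/compact blind/candidate-B
    echo "A=full B=compact" > blind/key.txt
else
    cp -r benchmark/demo/compact blind/candidate-A
    cp -r benchmark/demo/full blind/candidate-B
    echo "A=compact B=full" > blind/key.txt
fi
ls blind
```

Grade everything under `blind/`, then consult `key.txt` only when writing the report.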
## What This Skill CAN Do
- Systematically compare agent variants through controlled benchmarks
- Measure total session token cost (prompt + reasoning + tools + retries)
- Grade code quality using domain-specific checklists
- Reveal quality differences invisible to simple metrics (prompt size, line count)
- Generate comparison reports with evidence-backed verdicts
## What This Skill CANNOT Do
- Compare agents without running identical tasks on both
- Declare a winner based on prompt size alone
- Skip quality grading and rely only on test pass rates
- Evaluate single agents in isolation (use quality-grading skill instead)
- Compare skills or prompts (this is for agent variants only)
## Instructions
### Phase 1: PREPARE
**Goal**: Create benchmark environment and validate both agent variants exist.

**Step 1: Analyze original agent**

```bash
# Count original agent size
wc -l agents/{original-agent}.md

# Identify major sections
grep "^## " agents/{original-agent}.md

# Count code examples (candidates for removal in compact version)
grep -c '```' agents/{original-agent}.md
```
**Step 2: Create or validate compact variant**
If creating a compact variant, preserve:
- YAML frontmatter (name, description, routing)
- Operator Context (Hardcoded/Default/Optional)
- Core patterns and principles
- Error handling philosophy
Remove or condense:
- Lengthy code examples (keep 1-2 representative examples per pattern)
- Verbose explanations (condense to bullet points)
- Redundant instructions and changelogs
Target: 10-15% of original size while keeping essential knowledge. Removing capability (error handling patterns, concurrency patterns) invalidates the comparison. Remove redundancy, not knowledge.
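The 10-15% target can be checked with a quick size-ratio calculation. A sketch using stand-in files in place of `agents/{original-agent}.md` and the compact variant (line counts here are fabricated for the demo):

```shell
# Stand-ins: a 1000-line "original" and a 120-line "compact" variant.
seq 1 1000 > original.md
seq 1 120  > compact.md

orig=$(wc -l < original.md)
comp=$(wc -l < compact.md)
pct=$((100 * comp / orig))
echo "compact is ${pct}% of original"
if [ "$pct" -ge 10 ] && [ "$pct" -le 15 ]; then
    echo "within 10-15% target"
else
    echo "outside target - revisit what was removed"
fi
```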
**Step 3: Validate compact variant structure**

```bash
# Verify YAML frontmatter
head -20 agents/{compact-agent}.md | grep -E "^(name|description):"

# Verify Operator Context preserved
grep -c "### Hardcoded Behaviors" agents/{compact-agent}.md

# Compare sizes
echo "Original: $(wc -l < agents/{original-agent}.md) lines"
echo "Compact: $(wc -l < agents/{compact-agent}.md) lines"
```

**Step 4: Create benchmark directory and prepare prompts**

```bash
mkdir -p benchmark/{task-name}/{full,compact}
```

Write the task prompt ONCE, then copy it for both agents. NEVER customize prompts per agent.

**Gate**: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes.
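The Phase 1 gate lends itself to a small pre-flight script. A sketch, with demo files standing in for the real agent definition and benchmark layout (all names are illustrative):

```shell
# Pre-flight: frontmatter fields present, benchmark directories exist.
mkdir -p benchmark/demo-task/full benchmark/demo-task/compact
printf '%s\n' '---' 'name: demo-agent' 'description: demo' '---' > compact-agent.md

fail=0
grep -q '^name:' compact-agent.md        || { echo "missing name in frontmatter"; fail=1; }
grep -q '^description:' compact-agent.md || { echo "missing description in frontmatter"; fail=1; }
[ -d benchmark/demo-task/full ]          || { echo "missing full dir"; fail=1; }
[ -d benchmark/demo-task/compact ]       || { echo "missing compact dir"; fail=1; }

[ "$fail" -eq 0 ] && echo "gate passed" || echo "gate FAILED"
```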
### Phase 2: BENCHMARK
**Goal**: Run identical tasks on both agents, capturing all metrics.

**Step 1: Run simple task benchmark (2-3 tasks)**

Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Both agents should perform identically on well-defined problems. Simple tasks establish a baseline — if an agent fails here, it has fundamental issues.

Spawn both agents in parallel using the Task tool for fair timing:

```
Task(
    prompt="[exact task prompt]\nSave to: benchmark/{task}/full/",
    subagent_type="{full-agent}"
)
Task(
    prompt="[exact task prompt]\nSave to: benchmark/{task}/compact/",
    subagent_type="{compact-agent}"
)
```

Run in parallel to avoid caching effects or system load variance skewing results.

**Step 2: Run complex task benchmark (1-2 tasks)**

Use production-style problems that require concurrency, error handling, and edge case anticipation. These are where quality differences emerge. See `references/benchmark-tasks.md` for standard tasks.

Recommended complex tasks:

- Worker Pool: Rate limiting, graceful shutdown, panic recovery
- LRU Cache with TTL: Generics, background goroutines, zero-value semantics
- HTTP Service: Middleware chains, structured errors, health checks

**Step 3: Capture metrics for each run**

Record immediately after each agent completes. Do NOT wait until all runs finish.
| Metric | Full Agent | Compact Agent |
|---|---|---|
| Tests pass | X/X | X/X |
| Race conditions | X | X |
| Code lines (main) | X | X |
| Test lines | X | X |
| Session tokens | X | X |
| Wall-clock time | Xm Xs | Xm Xs |
| Retry cycles | X | X |
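Recording metrics immediately is easier if each run appends one row to a shared CSV. A sketch mirroring the table's columns; the values shown are placeholders, not real measurements:

```shell
# Append one CSV row per run; the header is written once.
csv=benchmark/metrics.csv
mkdir -p benchmark
[ -f "$csv" ] || echo "task,agent,tests_pass,races,code_lines,test_lines,tokens,seconds,retries" > "$csv"

echo "worker-pool,full,12/12,0,180,240,69600,312,0" >> "$csv"
echo "worker-pool,compact,8/12,1,95,110,69500,298,2" >> "$csv"
cat "$csv"
```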
**Step 4: Run tests with race detector**

```bash
cd benchmark/{task-name}/full && go test -race -v -count=1
cd benchmark/{task-name}/compact && go test -race -v -count=1
```

Use `-count=1` to disable test caching. Race conditions are automatic quality failures — record them but do NOT fix them for the agent being tested.

**Gate**: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes.
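To keep the two test invocations exactly parallel, the commands can be generated in one loop and reviewed before running. A sketch that prints rather than executes them (the `benchmark/demo-task` path and log names are illustrative):

```shell
# Emit identical go test commands for both variants; -count=1 defeats the
# test cache, and output is tee'd to a per-variant log for later grading.
for variant in full compact; do
    echo "cd benchmark/demo-task/$variant && go test -race -v -count=1 2>&1 | tee ../$variant-test.log"
done
```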
### Phase 3: GRADE
**Goal**: Score code quality beyond pass/fail using domain-specific checklists.

**Step 1: Create quality checklist BEFORE reviewing code**

Define criteria before seeing results to prevent bias. Do NOT invent criteria after seeing one agent's output. See `references/grading-rubric.md` for standard rubrics.

| Criterion | 5/5 | 3/5 | 1/5 |
|---|---|---|---|
| Correctness | All tests pass, no race conditions | Some failures | Broken |
| Error Handling | Comprehensive, production-ready | Adequate | None |
| Idioms | Exemplary for the language | Acceptable | Anti-patterns |
| Documentation | Thorough | Adequate | None |
| Testing | Comprehensive coverage | Basic | Minimal |

**Step 2: Score each solution independently**

Grade each agent's code on all five criteria. Score one agent completely before starting the other.

```markdown
## {Agent} Solution - {Task}

| Criterion | Score | Notes |
|---|---|---|
| Correctness | X/5 | |
| Error Handling | X/5 | |
| Idioms | X/5 | |
| Documentation | X/5 | |
| Testing | X/5 | |
| Total | X/25 | |
```
**Step 3: Document specific bugs with production impact**

For each bug found, record:

```markdown
Bug: {description}
- Agent: {which agent}
- What happened: {behavior}
- Correct behavior: {expected}
- Production impact: {consequence}
- Test coverage: {did tests catch it? why not?}
```

"Tests pass" is necessary but not sufficient. Production bugs often pass tests — Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL.

**Step 4: Calculate effective cost**

```
effective_cost = total_tokens * (1 + bug_count * 0.25)
```
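The formula can be sketched as a shell helper, kept in integer arithmetic by scaling the 25%-per-bug penalty by 100. The token and bug counts fed to it below are the example figures used in this step:

```shell
# effective_cost = total_tokens * (1 + bug_count * 0.25), scaled by 100
# so the penalty stays in integer arithmetic.
effective_cost() {
    tokens=$1
    bugs=$2
    echo $(( tokens * (100 + bugs * 25) / 100 ))
}

echo "194k tokens, 0 bugs -> $(effective_cost 194000 0) effective tokens"
echo "119k tokens, 5 bugs -> $(effective_cost 119000 5) effective tokens"
```

With these inputs the nominally cheaper 119k-token run comes out more expensive once the bug penalty is applied, which is the economics point this step makes.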
An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to a working, production-quality solution.

**Gate**: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. Proceed only when gate passes.

### Phase 4: REPORT
**Goal**: Generate comparison report with evidence-backed verdict.

**Step 1: Generate comparison report**

Use the report template from `references/report-template.md`. Include:

- Executive summary with clear winner per metric
- Per-task results with metrics tables
- Token economics analysis (one-time prompt cost vs session cost)
- Specific bugs found and their production impact
- Verdict based on total evidence

**Step 2: Run comparison analysis**

```bash
# TODO: scripts/compare.py not yet implemented
# Manual alternative: compare benchmark outputs side-by-side
diff benchmark/{task-name}/full/ benchmark/{task-name}/compact/
```
**Step 3: Analyze token economics**
The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn.
| Pattern | Description |
|---------|-------------|
| Large agent, low churn | High initial cost, fewer retries, less debugging |
| Small agent, high churn | Low initial cost, more retries, more debugging |
When a micro agent produces correct code, it uses approximately the same total tokens as the full agent. The savings appear only when it cuts corners.
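The two patterns in the table can be made concrete with a toy cost model: prompt tokens are paid once, per-turn tokens on every turn. All numbers below are illustrative, not measurements:

```shell
# Toy model: session_cost = one-time prompt tokens + per-turn tokens * turns.
session_cost() {
    prompt=$1
    per_turn=$2
    turns=$3
    echo $(( prompt + per_turn * turns ))
}

# Large agent: big prompt, converges in 4 turns.
echo "large agent, low churn:  $(session_cost 12000 8000 4) tokens"
# Small agent: tiny prompt, needs 9 turns of retries and debugging.
echo "small agent, high churn: $(session_cost 500 8000 9) tokens"
```

Under these made-up numbers the small agent's extra churn swamps its prompt savings, which is why total session cost, not prompt size, is the metric to report.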
**Step 4: State verdict with evidence**
The verdict MUST be backed by data. Include:
- Which agent won on simple tasks (expected: equivalent)
- Which agent won on complex tasks (expected: full agent)
- Total session cost comparison
- Effective cost comparison (with bug penalty)
- Clear recommendation for when to use each variant
See `references/methodology.md` for the complete testing methodology with December 2024 data.
**Gate**: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory.
---

## Examples
### Example 1: Creating a Compact Agent
User says: "Create a compact version of golang-general-engineer and test it"

Actions:
- Analyze original, create compact variant at 10-15% size (PREPARE)
- Run simple task (Advent of Code) + complex task (Worker Pool) on both (BENCHMARK)
- Score both with domain-specific checklist, calculate effective cost (GRADE)
- Generate comparison report with verdict (REPORT)

Result: Data-driven recommendation on whether the compact version is viable
### Example 2: Comparing Internal vs External Agent
User says: "Compare our Go agent against go-expert-0xfurai"

Actions:
- Validate both agents exist, prepare identical task prompts (PREPARE)
- Run two-tier benchmarks with token tracking (BENCHMARK)
- Grade with production quality checklist, document all bugs (GRADE)
- Report with token economics showing prompt cost vs session cost (REPORT)

Result: Evidence-based comparison showing true cost of each variant
## Error Handling
### Error: "Agent Type Not Found"

**Cause**: Agent not registered or name misspelled

**Solution**: Verify the agent file exists in the agents/ directory. Restart the Claude Code client to pick up new definitions.
### Error: "Tests Fail with Race Condition"

**Cause**: Concurrent code has data races

**Solution**: This is a real quality difference. Record it as a finding in the grade. Do NOT fix it for the agent being tested.
### Error: "Different Test Counts Between Agents"

**Cause**: Agents wrote different test suites

**Solution**: Valid data point. Grade on test coverage and quality, not raw count. More tests are not always better.
### Error: "Timeout During Agent Execution"

**Cause**: Complex task taking too long or agent stuck in a retry loop

**Solution**: Note the timeout and the number of retries attempted. Record the run as incomplete with partial metrics. Increase the timeout limit if warranted, but excessive retries are a quality signal — an agent that needs many retries is less efficient regardless of final outcome.
## Anti-Patterns
### Anti-Pattern 1: Comparing Only Prompt Size

**What it looks like**: "Compact agent is 90% smaller, therefore 90% more efficient"

**Why wrong**: The prompt is a one-time cost. Session reasoning, retries, and debugging dominate total tokens. Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution.

**Do instead**: Measure total session tokens to a working solution.
### Anti-Pattern 2: Different Task Prompts

**What it looks like**: Giving the full agent harder requirements than the compact agent

**Why wrong**: Creates an unfair comparison. Different requirements produce different solutions, invalidating all measurements.

**Do instead**: Copy-paste identical prompts character-for-character. Verify before running.
### Anti-Pattern 3: Treating Test Failures as Equal Quality

**What it looks like**: "Both agents completed the task" when one passes 12/12 tests and the other passes 8/12

**Why wrong**: Bugs have real cost. This is a false equivalence between producing code and producing working code.

**Do instead**: Grade quality rigorously. Calculate effective cost with the bug penalty multiplier.
### Anti-Pattern 4: Single Benchmark Declaration

**What it looks like**: "Tested on one puzzle. Compact agent wins!"

**Why wrong**: A single data point is sensitive to task selection bias. Simple tasks mask differences in edge case handling. You cannot distinguish luck from systematic quality.

**Do instead**: Run two-tier benchmarking with 2-3 simple tasks and 1-2 complex tasks.
### Anti-Pattern 5: Removing Core Patterns to Create Compact Agent

**What it looks like**: The compact version removes error handling patterns, concurrency guidance, and testing requirements to reduce size

**Why wrong**: Creates an unfair comparison. The compact agent is missing essential knowledge, guaranteeing quality degradation rather than testing whether brevity is possible.

**Do instead**: Remove verbose examples and redundant explanations, not capability. Keep one representative example per pattern. Condense explanations to bullet points but retain key insights.
## References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
### Domain-Specific Anti-Rationalization
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "Compact agent saved 50% tokens" | Savings may come from cutting corners, not efficiency | Check quality scores before claiming savings |
| "Tests pass, agents are equal" | Tests can miss production bugs (goroutine leaks, wrong semantics) | Apply domain-specific quality checklist |
| "One benchmark is enough" | Single task is sensitive to selection bias | Run two-tier benchmarks (simple + complex) |
| "Prompt size determines cost" | Prompt is one-time; reasoning tokens dominate sessions | Measure total session cost to working solution |
### Reference Files
- `${CLAUDE_SKILL_DIR}/references/methodology.md`: Complete testing methodology with December 2024 data
- `${CLAUDE_SKILL_DIR}/references/grading-rubric.md`: Detailed grading criteria and quality checklists
- `${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md`: Standard benchmark task descriptions and prompts
- `${CLAUDE_SKILL_DIR}/references/report-template.md`: Comparison report template with all required sections