
# Agent Evaluation Skill

## Operator Context

This skill operates as an operator for agent/skill quality assurance, configuring Claude's behavior for objective, evidence-based evaluation. It implements the Iterative Assessment pattern — identify targets, validate structure, measure depth, score, report — with Domain Intelligence embedded in the scoring rubric.

### Hardcoded Behaviors (Always Apply)

- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before evaluation
- **Over-Engineering Prevention**: Evaluate only what is requested. Do not speculatively analyze additional agents/skills or invent metrics that were not asked for
- **Read-Only Evaluation**: NEVER modify agents or skills during evaluation — only report findings
- **Evidence-Based Findings**: Every issue MUST include file path and line reference
- **Objective Scoring**: Use the rubric consistently across all evaluations — no subjective "looks good" assessments
- **Complete Output**: Show all test results with scores; never summarize as "all tests pass"

### Default Behaviors (ON unless disabled)

- **Full Test Suite**: Run all evaluation categories (structural, content, code, integration)
- **Priority Ranking**: Sort findings by impact (HIGH / MEDIUM / LOW)
- **Score Calculation**: Generate numeric quality scores using the standard rubric
- **Improvement Suggestions**: Provide specific, actionable recommendations with file paths
- **Temporary File Cleanup**: Remove any intermediate analysis files at task completion
- **Comparative Analysis**: Show how evaluated items compare to collection averages

### Optional Behaviors (OFF unless enabled)

- **Historical Comparison**: Compare current scores to previous evaluations (requires baseline)
- **Cross-Reference Validation**: Check all internal links and references resolve
- **Code Example Execution**: Actually run code examples to verify they work

## What This Skill CAN Do

- Score agents and skills against a consistent 100-point rubric
- Detect missing sections, broken references, and structural gaps
- Measure content depth and compare to collection averages
- Generate structured reports with prioritized findings
- Batch-evaluate entire collections with summary statistics

## What This Skill CANNOT Do

- Modify or fix agents/skills (use skill-creator-engineer instead)
- Evaluate external repositories or non-agent/skill files
- Replace human judgment on content accuracy or domain correctness
- Skip rubric categories — all must be scored

## Instructions

### Step 1: Identify Evaluation Targets

**Goal**: Determine what to evaluate and confirm targets exist.

```bash
# List all agents
ls agents/*.md | wc -l

# List all skills
ls -d skills/*/ | wc -l

# Verify specific target
ls agents/{name}.md
ls -la skills/{name}/
```

**Gate**: All targets confirmed to exist on disk. Proceed only when gate passes.

### Step 2: Structural Validation

**Goal**: Check that required components exist and are well-formed.

**For Agents** — check each item and record PASS/FAIL with line number:

1. YAML front matter: `name`, `description`, `color` fields present
2. Operator Context section with all 3 behavior types (Hardcoded, Default, Optional)
3. Hardcoded Behaviors: 5-8 items, MUST include CLAUDE.md Compliance and Over-Engineering Prevention
4. Default Behaviors: 5-8 items
5. Optional Behaviors: 3-5 items
6. Examples in description: 3+ `<example>` blocks with `<commentary>`
7. Error Handling section with 3+ documented errors
8. CAN/CANNOT boundaries section

```bash
# Agent structural checks
head -20 agents/{name}.md | grep -E "^(name|description|color):"
grep -c "## Operator Context" agents/{name}.md
grep -c "### Hardcoded Behaviors" agents/{name}.md
grep -c "### Default Behaviors" agents/{name}.md
grep -c "### Optional Behaviors" agents/{name}.md
grep -c "CLAUDE.md" agents/{name}.md
grep -c "Over-Engineering" agents/{name}.md
grep -c "<example>" agents/{name}.md
grep -c "## Error Handling" agents/{name}.md
grep -c "CAN Do" agents/{name}.md
grep -c "CANNOT Do" agents/{name}.md
```

**For Skills** — check each item and record PASS/FAIL with line number:

1. YAML front matter: `name`, `description`, `version`, `allowed-tools` present
2. `allowed-tools` uses YAML list format (not comma-separated string)
3. `description` uses pipe (`|`) format with WHAT + WHEN + negative constraint, under 1024 chars
4. `version` set to `2.0.0` for migrated skills
5. Operator Context section with all 3 behavior types
6. Hardcoded Behaviors: 5-8 items, MUST include CLAUDE.md Compliance and Over-Engineering Prevention
7. Default Behaviors: 5-8 items
8. Optional Behaviors: 3-5 items
9. Instructions section with gates between phases
10. Error Handling section with 2-4 documented errors
11. Anti-Patterns section with 3-5 patterns
12. `references/` directory with substantive content
13. CAN/CANNOT boundaries section
14. References section with shared patterns and domain-specific anti-rationalization table
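Check 2 is easy to eyeball wrong. A minimal sketch of the list-vs-string test — the here-document stands in for a real SKILL.md and is not part of the skill itself; the idea is that in YAML list format the `allowed-tools:` key line carries no inline value:

```shell
#!/bin/sh
# List format: "allowed-tools:" alone on its line, items on following "- Tool" lines.
# Comma-string format: the value sits inline on the key line.
f=$(mktemp)
cat > "$f" <<'EOF'
allowed-tools: Read, Write, Bash
EOF

grep -qE '^allowed-tools:[[:space:]]*$' "$f" && result=PASS || result=FAIL
echo "$result"  # prints FAIL: comma-separated string, flag as MEDIUM
rm -f "$f"
```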

```bash
# Skill structural checks
head -20 skills/{name}/SKILL.md | grep -E "^(name|description|version|allowed-tools):"
grep -n "allowed-tools:" skills/{name}/SKILL.md  # Check YAML list vs comma format
grep -c "## Operator Context" skills/{name}/SKILL.md
grep -c "CLAUDE.md" skills/{name}/SKILL.md
grep -c "Over-Engineering" skills/{name}/SKILL.md
grep -c "## Instructions" skills/{name}/SKILL.md
grep -c "Gate.*Proceed" skills/{name}/SKILL.md  # Count gates
grep -c "## Error Handling" skills/{name}/SKILL.md
grep -c "## Anti-Patterns" skills/{name}/SKILL.md
grep -c "CAN Do" skills/{name}/SKILL.md
grep -c "CANNOT Do" skills/{name}/SKILL.md
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
ls skills/{name}/references/
```

**Structural Scoring** (60 points):

| Component | Points | Requirement |
|-----------|--------|-------------|
| YAML front matter | 10 | All required fields, list format, pipe description |
| Operator Context | 20 | All 3 behavior types with correct item counts |
| Error Handling | 10 | Section present with documented errors |
| Examples (agents) / References (skills) | 10 | 3+ examples or 2+ reference files |
| CAN/CANNOT | 5 | Both sections present with concrete items |
| Anti-Patterns | 5 | 3-5 domain-specific patterns with 3-part structure |

**Integration Scoring** (10 points):

| Component | Points | Requirement |
|-----------|--------|-------------|
| References and cross-references | 5 | Shared patterns linked, all refs resolve |
| Tool and link consistency | 5 | allowed-tools matches usage, anti-rationalization table present |

See `references/scoring-rubric.md` for full/partial/no credit breakdowns.

**Gate**: All structural checks scored with evidence. Proceed only when gate passes.

### Step 3: Content Depth Analysis

**Goal**: Measure content quality and volume.

```bash
# Skill total lines (SKILL.md + references)
skill_lines=$(wc -l < skills/{name}/SKILL.md)
ref_lines=$(cat skills/{name}/references/*.md 2>/dev/null | wc -l)
total=$((skill_lines + ref_lines))

# Agent total lines
agent_lines=$(wc -l < agents/{name}.md)
```

**Depth Scoring** (30 points max):

| Total Lines | Score | Grade |
|-------------|-------|-------|
| >1500 (skills) / >2000 (agents) | 30 | EXCELLENT |
| 500-1500 / 1000-2000 | 22 | GOOD |
| 300-500 / 500-1000 | 15 | ADEQUATE |
| 150-300 / 200-500 | 8 | THIN |
| <150 / <200 | 0 | INSUFFICIENT |

**Gate**: Depth score calculated. Proceed only when gate passes.
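The mapping can be applied mechanically. A minimal sketch for the skill column of the table — the function name `depth_score` and the reading of the overlapping range boundaries (500, 300, 150 taken as lower bounds of the higher tier) are assumptions, not rubric text:

```shell
#!/bin/sh
# Map a skill's total line count to a depth score per the table above.
depth_score() {
  if   [ "$1" -gt 1500 ]; then echo 30  # EXCELLENT
  elif [ "$1" -ge 500 ];  then echo 22  # GOOD
  elif [ "$1" -ge 300 ];  then echo 15  # ADEQUATE
  elif [ "$1" -ge 150 ];  then echo 8   # THIN
  else                         echo 0   # INSUFFICIENT
  fi
}

depth_score 1600  # prints 30
depth_score 420   # prints 15
```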

### Step 4: Code Quality Checks

**Goal**: Validate that code examples and scripts are functional.

1. Script syntax: Run `python3 -m py_compile` on all `.py` files
2. Placeholder detection: Search for `[TODO]`, `[TBD]`, `[PLACEHOLDER]`, `[INSERT]`
3. Code block tagging: Count untagged (bare `` ``` ``) vs tagged (`` ```language ``) blocks

```bash
# Python syntax check: syntax-check any .py scripts found in the skill's scripts/ directory
python3 -m py_compile scripts/*.py 2>/dev/null

# Placeholder search (brackets escaped so grep matches the literal markers,
# not single-character classes)
grep -nE '\[TODO\]|\[TBD\]|\[PLACEHOLDER\]|\[INSERT\]' {file}

# Untagged code blocks
grep -c '```$' {file}
```

**Gate**: All code checks complete. Proceed only when gate passes.

### Step 5: Integration Verification

**Goal**: Confirm cross-references and tool declarations are consistent.

**Reference Resolution**:

1. Extract all referenced files from SKILL.md (grep for `references/`)
2. Verify each reference exists on disk
3. Check shared pattern links resolve (`../shared-patterns/`)

**Tool Consistency**:

1. Parse `allowed-tools` from YAML front matter
2. Scan instructions for tool usage (Read, Write, Edit, Bash, Grep, Glob, Task, WebSearch)
3. Flag any tool used in instructions but not declared in `allowed-tools`
4. Flag any tool declared but never used in instructions

**Anti-Rationalization Table**:

1. Check that the References section links to `anti-rationalization-core.md`
2. Verify the domain-specific anti-rationalization table is present
3. Table should have 3-5 rows specific to the skill's domain

```bash
# Check referenced files exist (dot escaped so it matches only ".md")
grep -oE 'references/[a-z-]+\.md' skills/{name}/SKILL.md | while read -r ref; do
  ls "skills/{name}/$ref" 2>/dev/null || echo "MISSING: $ref"
done

# Check tool consistency
grep "allowed-tools:" skills/{name}/SKILL.md
grep -oE '(Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch)' skills/{name}/SKILL.md | sort -u

# Check anti-rationalization reference
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
```

**Gate**: All integration checks complete. Proceed only when gate passes.
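The tool-consistency diff in steps 3-4 can be sketched end to end. This is illustrative only: the here-document stands in for a real `skills/{name}/SKILL.md`, and `comm -13` lists tools that appear in the body but not in the declaration:

```shell
#!/bin/sh
# Diff declared vs used tools; the demo file is an assumption, not a real skill.
f=$(mktemp); d=$(mktemp); u=$(mktemp)
cat > "$f" <<'EOF'
---
name: demo
allowed-tools:
  - Read
  - Bash
---
## Instructions
Use Read and Bash, then Glob for file discovery.
EOF

tools='Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch'
# Declared: tool names inside the front matter block
sed -n '2,/^---$/p' "$f" | grep -oE "$tools" | sort -u > "$d"
# Used: tool names in the body after the front matter
sed '1,/^---$/d' "$f" | grep -oE "$tools" | sort -u > "$u"

# Tools used in instructions but never declared: a finding to flag
missing=$(comm -13 "$d" "$u")
echo "Used but not declared: $missing"  # prints: Used but not declared: Glob
rm -f "$f" "$d" "$u"
```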

### Step 6: Generate Quality Report

**Goal**: Compile all findings into the standard report format.

Use the report template from `references/report-templates.md`. The report MUST include:

1. Header: Name, type, date, overall score and grade
2. Structural Validation: Table with check, status, score, and evidence (line numbers)
3. Content Depth: Line counts for main file and references, grade, depth score
4. Code Quality: Script syntax results, placeholder count, untagged block count
5. Issues Found: Grouped by HIGH / MEDIUM / LOW priority
6. Recommendations: Specific, actionable improvements with file paths and line numbers
7. Comparison: Score vs collection average (if batch evaluating)

**Issue Priority Classification**:

| Priority | Criteria | Examples |
|----------|----------|----------|
| HIGH | Missing required section or broken functionality | No Operator Context, syntax errors in scripts |
| MEDIUM | Section present but incomplete or non-compliant | Wrong item counts, old allowed-tools format |
| LOW | Cosmetic or minor quality issues | Untagged code blocks, missing changelog |

**Grade Boundaries**:

| Score | Grade | Interpretation |
|-------|-------|----------------|
| 90-100 | A | Production ready, exemplary |
| 80-89 | B | Good, minor improvements needed |
| 70-79 | C | Adequate, some gaps to address |
| 60-69 | D | Below standard, significant work needed |
| <60 | F | Major overhaul required |

**Gate**: Report generated with all sections populated and evidence cited. Evaluation complete.
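The grade boundaries can be applied mechanically; a minimal sketch (the function name `grade` is illustrative, not part of the skill):

```shell
#!/bin/sh
# Map an overall 0-100 score to a letter grade using the boundaries above.
grade() {
  if   [ "$1" -ge 90 ]; then echo A
  elif [ "$1" -ge 80 ]; then echo B
  elif [ "$1" -ge 70 ]; then echo C
  elif [ "$1" -ge 60 ]; then echo D
  else                       echo F
  fi
}

grade 87  # prints B
grade 54  # prints F
```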

## Examples

### Example 1: Single Skill Evaluation

**User says**: "Evaluate the test-driven-development skill"

**Actions**:

1. Confirm `skills/test-driven-development/` exists (IDENTIFY)
2. Check YAML, Operator Context, Error Handling sections (STRUCTURAL)
3. Count lines in SKILL.md + references (CONTENT)
4. Syntax-check any scripts, find placeholders (CODE)
5. Verify all referenced files exist (INTEGRATION)
6. Generate scored report (REPORT)

**Result**: Structured report with score, grade, and prioritized findings

### Example 2: Collection Batch Evaluation

**User says**: "Audit all agents and skills"

**Actions**:

1. List all `agents/*.md` and `skills/*/SKILL.md` (IDENTIFY)
2. Run Steps 2-5 for each target (EVALUATE)
3. Generate individual reports + collection summary (REPORT)

**Result**: Per-item scores plus distribution, top performers, and improvement areas
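The collection summary in the Result can be sketched with awk. The `name score` line format and the sample data are assumptions about what the batch step emits, not a defined interface:

```shell
#!/bin/sh
# Summary statistics over per-item scores ("name score" per line, assumed format).
scores=$(mktemp)
cat > "$scores" <<'EOF'
code-reviewer 92
test-runner 78
doc-writer 61
refactorer 85
EOF

summary=$(awk '{ sum += $2; n++; if ($2 >= max) { max = $2; best = $1 } }
               END { printf "items=%d mean=%d top=%s (%d)", n, sum/n, best, max }' "$scores")
echo "$summary"  # prints: items=4 mean=79 top=code-reviewer (92)
rm -f "$scores"
```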

### Example 3: V2 Migration Compliance Check

**User says**: "Check if systematic-refactoring skill meets v2 standards"

**Actions**:

1. Confirm `skills/systematic-refactoring/` exists (IDENTIFY)
2. Check YAML uses list `allowed-tools`, pipe description, version 2.0.0 (STRUCTURAL)
3. Verify Operator Context has correct item counts: Hardcoded 5-8, Default 5-8, Optional 3-5 (STRUCTURAL)
4. Confirm CAN/CANNOT sections, gates in Instructions, anti-rationalization table (STRUCTURAL)
5. Count total lines, run code checks (CONTENT + CODE)
6. Generate scored report highlighting v2 gaps (REPORT)

**Result**: Report with specific v2 compliance gaps and required actions

## Error Handling

### Error: "File Not Found"

**Cause**: Agent or skill path incorrect, or item was deleted

**Solution**: Verify path exists with `ls` before evaluation. If truly missing, exclude from batch and note in report.

### Error: "Cannot Parse YAML Front Matter"

**Cause**: Malformed YAML — missing `---` delimiters, bad indentation, or invalid syntax

**Solution**: Flag as HIGH priority structural failure. Score YAML section as 0/10. Include the specific parse error in the report.

### Error: "Python Syntax Error in Script"

**Cause**: Validation script has syntax issues

**Solution**: Run `python3 -m py_compile` and capture the specific error. Score validation script as 0/10. Include error output in report.

### Error: "Operator Context Item Counts Out of Range"

**Cause**: The v2 standard requires Hardcoded 5-8, Default 5-8, and Optional 3-5 items. The skill has too few or too many.

**Solution**:

1. Count actual items per behavior type (bold items starting with `- **`)
2. If too few: flag as MEDIUM priority — behaviors likely need to be split or added
3. If too many: flag as LOW priority — behaviors may need consolidation
4. Score Operator Context at partial credit (10/20) if counts are wrong
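Counting step 1 can be sketched as a grep over just one section. The here-document stands in for a real agent file, and the section-boundary regex assumes `###` behavior headings as used in this skill:

```shell
#!/bin/sh
# Count bold "- **" items in the Hardcoded Behaviors section only.
f=$(mktemp)
cat > "$f" <<'EOF'
### Hardcoded Behaviors (Always Apply)
- **CLAUDE.md Compliance**: ...
- **Over-Engineering Prevention**: ...
- **Read-Only Evaluation**: ...
### Default Behaviors (ON unless disabled)
- **Full Test Suite**: ...
EOF

count=$(sed -n '/^### Hardcoded Behaviors/,/^### Default Behaviors/p' "$f" | grep -c '^- \*\*')
echo "$count"  # prints 3: below the 5-8 range, flag as MEDIUM
rm -f "$f"
```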

## Anti-Patterns

### Anti-Pattern 1: Superficial Evaluation Without Evidence

**What it looks like**: "Structure: Looks good. Content: Seems adequate. Overall: PASS"

**Why wrong**: No file paths, no line references, no specific scores. Cannot verify or reproduce.

**Do instead**: Score every rubric category. Cite file:line for every finding.

### Anti-Pattern 2: Skipping Validation Script Execution

**What it looks like**: "The skill has a validation script present."

**Why wrong**: Presence is not correctness. The script may have syntax errors or do nothing.

**Do instead**: Run `python3 -m py_compile` at minimum. Execute the script and capture output.

### Anti-Pattern 3: Accepting Placeholder Content as Complete

**What it looks like**: "Agent has comprehensive examples section. PASS"

**Why wrong**: Did not check whether examples contain [TODO] or [PLACEHOLDER] text.

**Do instead**: Search for placeholder patterns. Score content on substance, not section headers.

### Anti-Pattern 4: Batch Evaluation Without Summary Statistics

**What it looks like**: "Evaluated all 38 agents. Most are good quality."

**Why wrong**: No quantitative data. Cannot track improvements or identify problem areas.

**Do instead**: Generate a score distribution table, top/bottom performers, and common issue counts. See `references/batch-evaluation.md` for the collection summary template.

### Anti-Pattern 5: Ignoring Repository-Specific Standards

**What it looks like**: "This agent follows standard practices and is well-structured."

**Why wrong**: Did not check CLAUDE.md requirements. May miss v2 standards (YAML list format, pipe description, item count ranges, gates, anti-rationalization table).

**Do instead**: Check CLAUDE.md first. Verify all v2-specific criteria. A generic "well-structured" verdict is meaningless without rubric scores.

## References

This skill uses these shared patterns:

- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks

Domain-Specific Anti-Rationalization

领域特定反合理化

RationalizationWhy It's WrongRequired Action
"YAML looks fine, no need to parse it"Looking is not parsing; fields may be missingCheck each required field explicitly
"Content is long enough, skip counting"Impressions are not measurementsCount lines, calculate score
"Script exists, must work"Existence is not correctnessRun
python3 -m py_compile
"One failing check, rest are probably fine"Partial evaluation is not evaluationComplete all 6 steps
合理化理由错误原因要求操作
"YAML看起来没问题,无需解析"视觉检查不等于解析;可能缺少字段显式检查每个必填字段
"内容足够长,无需统计行数"印象不等于测量统计行数,计算分数
"脚本存在,肯定能工作"存在不等于正确运行
python3 -m py_compile
"一个检查失败,其他可能也没问题"部分评估不等于完整评估完成所有6个步骤

### Reference Files

- `${CLAUDE_SKILL_DIR}/references/scoring-rubric.md` - Full/partial/no credit breakdowns per rubric category
- `${CLAUDE_SKILL_DIR}/references/report-templates.md` - Standard report format templates (single, batch, comparison)
- `${CLAUDE_SKILL_DIR}/references/common-issues.md` - Frequently found issues with fix templates
- `${CLAUDE_SKILL_DIR}/references/batch-evaluation.md` - Batch evaluation procedures and collection summary format