agent-evaluation
# Agent Evaluation Skill
## Operator Context
This skill operates as an operator for agent/skill quality assurance, configuring Claude's behavior for objective, evidence-based evaluation. It implements the Iterative Assessment pattern — identify targets, validate structure, measure depth, score, report — with Domain Intelligence embedded in the scoring rubric.
### Hardcoded Behaviors (Always Apply)
- CLAUDE.md Compliance: Read and follow repository CLAUDE.md before evaluation
- Over-Engineering Prevention: Evaluate only what is requested. Do not speculatively analyze additional agents/skills or invent metrics that were not asked for
- Read-Only Evaluation: NEVER modify agents or skills during evaluation — only report findings
- Evidence-Based Findings: Every issue MUST include file path and line reference
- Objective Scoring: Use the rubric consistently across all evaluations — no subjective "looks good" assessments
- Complete Output: Show all test results with scores; never summarize as "all tests pass"
### Default Behaviors (ON unless disabled)
- Full Test Suite: Run all evaluation categories (structural, content, code, integration)
- Priority Ranking: Sort findings by impact (HIGH / MEDIUM / LOW)
- Score Calculation: Generate numeric quality scores using the standard rubric
- Improvement Suggestions: Provide specific, actionable recommendations with file paths
- Temporary File Cleanup: Remove any intermediate analysis files at task completion
- Comparative Analysis: Show how evaluated items compare to collection averages
### Optional Behaviors (OFF unless enabled)
- Historical Comparison: Compare current scores to previous evaluations (requires baseline)
- Cross-Reference Validation: Check all internal links and references resolve
- Code Example Execution: Actually run code examples to verify they work
## What This Skill CAN Do
- Score agents and skills against a consistent 100-point rubric
- Detect missing sections, broken references, and structural gaps
- Measure content depth and compare to collection averages
- Generate structured reports with prioritized findings
- Batch-evaluate entire collections with summary statistics
## What This Skill CANNOT Do
- Modify or fix agents/skills (use skill-creator-engineer instead)
- Evaluate external repositories or non-agent/skill files
- Replace human judgment on content accuracy or domain correctness
- Skip rubric categories — all must be scored
## Instructions
### Step 1: Identify Evaluation Targets
**Goal**: Determine what to evaluate and confirm targets exist.

```bash
# List all agents
ls agents/*.md | wc -l

# List all skills
ls -d skills/*/ | wc -l

# Verify specific target
ls agents/{name}.md
ls -la skills/{name}/
```

**Gate**: All targets confirmed to exist on disk. Proceed only when gate passes.

### Step 2: Structural Validation
**Goal**: Check that required components exist and are well-formed.

**For Agents** — check each item and record PASS/FAIL with line number:

- YAML front matter: `name`, `description`, `color` fields present
- Operator Context section with all 3 behavior types (Hardcoded, Default, Optional)
- Hardcoded Behaviors: 5-8 items, MUST include CLAUDE.md Compliance and Over-Engineering Prevention
- Default Behaviors: 5-8 items
- Optional Behaviors: 3-5 items
- Examples in description: 3+ `<example>` blocks with `<commentary>`
- Error Handling section with 3+ documented errors
- CAN/CANNOT boundaries section

```bash
# Agent structural checks
head -20 agents/{name}.md | grep -E "^(name|description|color):"
grep -c "## Operator Context" agents/{name}.md
grep -c "### Hardcoded Behaviors" agents/{name}.md
grep -c "### Default Behaviors" agents/{name}.md
grep -c "### Optional Behaviors" agents/{name}.md
grep -c "CLAUDE.md" agents/{name}.md
grep -c "Over-Engineering" agents/{name}.md
grep -c "<example>" agents/{name}.md
grep -c "## Error Handling" agents/{name}.md
grep -c "CAN Do" agents/{name}.md
grep -c "CANNOT Do" agents/{name}.md
```
**For Skills** — check each item and record PASS/FAIL with line number:
1. YAML front matter: `name`, `description`, `version`, `allowed-tools` present
2. `allowed-tools` uses YAML list format (not comma-separated string)
3. `description` uses pipe (`|`) format with WHAT + WHEN + negative constraint, under 1024 chars
4. `version` set to `2.0.0` for migrated skills
5. Operator Context section with all 3 behavior types
6. Hardcoded Behaviors: 5-8 items, MUST include CLAUDE.md Compliance and Over-Engineering Prevention
7. Default Behaviors: 5-8 items
8. Optional Behaviors: 3-5 items
9. Instructions section with gates between phases
10. Error Handling section with 2-4 documented errors
11. Anti-Patterns section with 3-5 patterns
12. `references/` directory with substantive content
13. CAN/CANNOT boundaries section
14. References section with shared patterns and domain-specific anti-rationalization table
```bash
# Skill structural checks
head -20 skills/{name}/SKILL.md | grep -E "^(name|description|version|allowed-tools):"
grep -n "allowed-tools:" skills/{name}/SKILL.md   # Check YAML list vs comma format
grep -c "## Operator Context" skills/{name}/SKILL.md
grep -c "CLAUDE.md" skills/{name}/SKILL.md
grep -c "Over-Engineering" skills/{name}/SKILL.md
grep -c "## Instructions" skills/{name}/SKILL.md
grep -c "Gate.*Proceed" skills/{name}/SKILL.md   # Count gates
grep -c "## Error Handling" skills/{name}/SKILL.md
grep -c "## Anti-Patterns" skills/{name}/SKILL.md
grep -c "CAN Do" skills/{name}/SKILL.md
grep -c "CANNOT Do" skills/{name}/SKILL.md
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
ls skills/{name}/references/
```

**Structural Scoring** (60 points):
| Component | Points | Requirement |
|-----------|--------|-------------|
| YAML front matter | 10 | All required fields, list format, pipe description |
| Operator Context | 20 | All 3 behavior types with correct item counts |
| Error Handling | 10 | Section present with documented errors |
| Examples (agents) / References (skills) | 10 | 3+ examples or 2+ reference files |
| CAN/CANNOT | 5 | Both sections present with concrete items |
| Anti-Patterns | 5 | 3-5 domain-specific patterns with 3-part structure |
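Rolling the per-component PASS/FAIL results up into the 60-point structural score can be sketched as a small helper. This models full credit only; the partial-credit rules live in `references/scoring-rubric.md`, and the 0/1 inputs are assumed to come from the grep checks above:

```shell
# Sum structural points from 0/1 (fail/pass) component results,
# using the point values from the structural scoring table.
structural_score() {
  yaml=$1 operator=$2 errors=$3 examples=$4 can_cannot=$5 anti=$6
  echo $(( yaml*10 + operator*20 + errors*10 + examples*10 + can_cannot*5 + anti*5 ))
}

structural_score 1 1 1 1 1 1   # all components at full credit: 60
structural_score 1 0 1 1 1 1   # Operator Context missing: 40
```

Partial credit (e.g. 10/20 for wrong item counts) would need finer-grained inputs than 0/1.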
**Integration Scoring** (10 points):
| Component | Points | Requirement |
|-----------|--------|-------------|
| References and cross-references | 5 | Shared patterns linked, all refs resolve |
| Tool and link consistency | 5 | allowed-tools matches usage, anti-rationalization table present |
See `references/scoring-rubric.md` for full/partial/no credit breakdowns.
**Gate**: All structural checks scored with evidence. Proceed only when gate passes.

### Step 3: Content Depth Analysis
**Goal**: Measure content quality and volume.

```bash
# Skill total lines (SKILL.md + references)
skill_lines=$(wc -l < skills/{name}/SKILL.md)
ref_lines=$(cat skills/{name}/references/*.md 2>/dev/null | wc -l)
total=$((skill_lines + ref_lines))

# Agent total lines
agent_lines=$(wc -l < agents/{name}.md)
```
**Depth Scoring** (30 points max):
| Total Lines | Score | Grade |
|-------------|-------|-------|
| >1500 (skills) / >2000 (agents) | 30 | EXCELLENT |
| 500-1500 / 1000-2000 | 22 | GOOD |
| 300-500 / 500-1000 | 15 | ADEQUATE |
| 150-300 / 200-500 | 8 | THIN |
| <150 / <200 | 0 | INSUFFICIENT |
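The depth bands translate directly into a threshold lookup; a minimal sketch for the skill thresholds (band-edge handling, e.g. exactly 500 lines scoring as GOOD, is an assumption, since the table's ranges touch at their edges):

```shell
# Map a skill's total line count (SKILL.md + references) to a depth score.
depth_score() {
  if [ "$1" -gt 1500 ]; then echo 30    # EXCELLENT
  elif [ "$1" -ge 500 ]; then echo 22   # GOOD
  elif [ "$1" -ge 300 ]; then echo 15   # ADEQUATE
  elif [ "$1" -ge 150 ]; then echo 8    # THIN
  else echo 0                           # INSUFFICIENT
  fi
}

depth_score "$((skill_lines + ref_lines))"   # score the total from Step 3
```

The agent thresholds follow the same shape with the 2000/1000/500/200 edges.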
**Gate**: Depth score calculated. Proceed only when gate passes.

### Step 4: Code Quality Checks
**Goal**: Validate that code examples and scripts are functional.

- Script syntax: run `python3 -m py_compile` on all `.py` files
- Placeholder detection: search for `[TODO]`, `[TBD]`, `[PLACEHOLDER]`, `[INSERT]`
- Code block tagging: count bare (untagged) vs language-tagged fenced code blocks

```bash
# Python syntax check
# Syntax-check any .py scripts found in the skill's scripts/ directory
python3 -m py_compile scripts/*.py 2>/dev/null

# Placeholder search (brackets escaped so grep matches them literally)
grep -nE '\[TODO\]|\[TBD\]|\[PLACEHOLDER\]|\[INSERT\]' {file}

# Untagged code blocks
grep -c '```$' {file}
```
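Because the fence count above also matches closing fences, the number of bare (untagged) openings has to be derived; a sketch, assuming every code block in the file is properly closed:

```shell
# Estimate bare (untagged) code-block openings in a markdown file.
# Fence lines = tagged openings + bare openings + closings, and
# closings = tagged + bare, so: bare = (total - 2*tagged) / 2.
bare_blocks() {
  total=$(grep -c '^```' "$1")
  tagged=$(grep -cE '^```[A-Za-z]' "$1")
  echo $(( (total - 2*tagged) / 2 ))
}
```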
**Gate**: All code checks complete. Proceed only when gate passes.

### Step 5: Integration Verification
**Goal**: Confirm cross-references and tool declarations are consistent.

**Reference Resolution**:
- Extract all referenced files from SKILL.md (grep for `references/`)
- Verify each reference exists on disk
- Check shared pattern links resolve (`../shared-patterns/`)

**Tool Consistency**:
- Parse `allowed-tools` from YAML front matter
- Scan instructions for tool usage (Read, Write, Edit, Bash, Grep, Glob, Task, WebSearch)
- Flag any tool used in instructions but not declared in `allowed-tools`
- Flag any tool declared but never used in instructions

**Anti-Rationalization Table**:
- Check that the References section links to `anti-rationalization-core.md`
- Verify the domain-specific anti-rationalization table is present
- Table should have 3-5 rows specific to the skill's domain

```bash
# Check referenced files exist
grep -oE 'references/[a-z-]+\.md' skills/{name}/SKILL.md | while read ref; do
  ls "skills/{name}/$ref" 2>/dev/null || echo "MISSING: $ref"
done

# Check tool consistency
grep "allowed-tools:" skills/{name}/SKILL.md
grep -oE '(Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch)' skills/{name}/SKILL.md | sort -u

# Check anti-rationalization reference
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
```
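The used-but-not-declared check can be sketched with `comm`. This assumes the v2 list shape (`allowed-tools:` followed by indented `- Tool` lines) and covers only the used-but-undeclared direction, since declared tool names also appear in the front matter text itself:

```shell
# Tools mentioned in the skill file but missing from allowed-tools.
undeclared_tools() {
  comm -13 \
    <(awk '/^allowed-tools:/{f=1;next} f&&/^ *- /{print $2;next} f{exit}' "$1" | sort -u) \
    <(grep -oE '(Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch)' "$1" | sort -u)
}
```

The reverse direction (declared but never used) needs the body scanned separately from the front matter.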
**Gate**: All integration checks complete. Proceed only when gate passes.

### Step 6: Generate Quality Report
**Goal**: Compile all findings into the standard report format.

Use the report template from `references/report-templates.md`. The report MUST include:

- Header: Name, type, date, overall score and grade
- Structural Validation: Table with check, status, score, and evidence (line numbers)
- Content Depth: Line counts for main file and references, grade, depth score
- Code Quality: Script syntax results, placeholder count, untagged block count
- Issues Found: Grouped by HIGH / MEDIUM / LOW priority
- Recommendations: Specific, actionable improvements with file paths and line numbers
- Comparison: Score vs collection average (if batch evaluating)
**Issue Priority Classification**:
| Priority | Criteria | Examples |
|---|---|---|
| HIGH | Missing required section or broken functionality | No Operator Context, syntax errors in scripts |
| MEDIUM | Section present but incomplete or non-compliant | Wrong item counts, old allowed-tools format |
| LOW | Cosmetic or minor quality issues | Untagged code blocks, missing changelog |
**Grade Boundaries**:
| Score | Grade | Interpretation |
|---|---|---|
| 90-100 | A | Production ready, exemplary |
| 80-89 | B | Good, minor improvements needed |
| 70-79 | C | Adequate, some gaps to address |
| 60-69 | D | Below standard, significant work needed |
| <60 | F | Major overhaul required |
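The grade boundaries reduce to a threshold lookup; a minimal sketch:

```shell
# Map a 0-100 score to a letter grade per the boundaries above.
grade() {
  if [ "$1" -ge 90 ]; then echo A
  elif [ "$1" -ge 80 ]; then echo B
  elif [ "$1" -ge 70 ]; then echo C
  elif [ "$1" -ge 60 ]; then echo D
  else echo F
  fi
}

grade 87   # B: good, minor improvements needed
```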
**Gate**: Report generated with all sections populated and evidence cited. Evaluation complete.
## Examples
### Example 1: Single Skill Evaluation
**User says**: "Evaluate the test-driven-development skill"

**Actions**:
- Confirm `skills/test-driven-development/` exists (IDENTIFY)
- Check YAML, Operator Context, Error Handling sections (STRUCTURAL)
- Count lines in SKILL.md + references (CONTENT)
- Syntax-check any scripts, find placeholders (CODE)
- Verify all referenced files exist (INTEGRATION)
- Generate scored report (REPORT)

**Result**: Structured report with score, grade, and prioritized findings
### Example 2: Collection Batch Evaluation
**User says**: "Audit all agents and skills"

**Actions**:
- List all `agents/*.md` and `skills/*/SKILL.md` (IDENTIFY)
- Run Steps 2-5 for each target (EVALUATE)
- Generate individual reports + collection summary (REPORT)

**Result**: Per-item scores plus distribution, top performers, and improvement areas
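For the collection summary, per-item scores can be reduced to the distribution statistics the report needs; a minimal sketch reading `name score` pairs (the input format is an assumption for illustration):

```shell
# Reduce "name score" lines to count/average/min/max for the batch report.
summarize() {
  awk '{sum+=$2; if(min==""||$2<min)min=$2; if(max==""||$2>max)max=$2; n++}
       END{printf "count=%d avg=%.1f min=%d max=%d\n", n, sum/n, min, max}'
}

printf 'tdd 82\nrefactoring 91\nevaluation 74\n' | summarize
```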
### Example 3: V2 Migration Compliance Check
**User says**: "Check if systematic-refactoring skill meets v2 standards"

**Actions**:
- Confirm `skills/systematic-refactoring/` exists (IDENTIFY)
- Check YAML uses a list-format `allowed-tools`, pipe description, version 2.0.0 (STRUCTURAL)
- Verify Operator Context has correct item counts: Hardcoded 5-8, Default 5-8, Optional 3-5 (STRUCTURAL)
- Confirm CAN/CANNOT sections, gates in Instructions, anti-rationalization table (STRUCTURAL)
- Count total lines, run code checks (CONTENT + CODE)
- Generate scored report highlighting v2 gaps (REPORT)

**Result**: Report with specific v2 compliance gaps and required actions
## Error Handling
### Error: "File Not Found"

**Cause**: Agent or skill path incorrect, or item was deleted
**Solution**: Verify the path exists with `ls` before evaluation. If truly missing, exclude from batch and note in report.

### Error: "Cannot Parse YAML Front Matter"
**Cause**: Malformed YAML — missing `---` delimiters, bad indentation, or invalid syntax
**Solution**: Flag as HIGH priority structural failure. Score YAML section as 0/10. Include the specific parse error in the report.

### Error: "Python Syntax Error in Script"
**Cause**: Validation script has syntax issues
**Solution**: Run `python3 -m py_compile` and capture the specific error. Score validation script as 0/10. Include error output in report.

### Error: "Operator Context Item Counts Out of Range"
**Cause**: v2 standard requires Hardcoded 5-8, Default 5-8, Optional 3-5 items. Skill has too few or too many.
**Solution**:
- Count actual items per behavior type (bold items starting with `**`)
- If too few: flag as MEDIUM priority — behaviors likely need to be split or added
- If too many: flag as LOW priority — behaviors may need consolidation
- Score Operator Context at partial credit (10/20) if counts are wrong
## Anti-Patterns
### Anti-Pattern 1: Superficial Evaluation Without Evidence
**What it looks like**: "Structure: Looks good. Content: Seems adequate. Overall: PASS"
**Why wrong**: No file paths, no line references, no specific scores. Cannot verify or reproduce.
**Do instead**: Score every rubric category. Cite file:line for every finding.
### Anti-Pattern 2: Skipping Validation Script Execution
**What it looks like**: "The skill has a validation script present."
**Why wrong**: Presence is not correctness. Script may have syntax errors or do nothing.
**Do instead**: Run `python3 -m py_compile` at minimum. Execute the script and capture output.

### Anti-Pattern 3: Accepting Placeholder Content as Complete
**What it looks like**: "Agent has comprehensive examples section. PASS"
**Why wrong**: Did not check if examples contain `[TODO]` or `[PLACEHOLDER]` text.
**Do instead**: Search for placeholder patterns. Score content on substance, not section headers.
### Anti-Pattern 4: Batch Evaluation Without Summary Statistics
**What it looks like**: "Evaluated all 38 agents. Most are good quality."
**Why wrong**: No quantitative data. Cannot track improvements or identify problem areas.
**Do instead**: Generate score distribution table, top/bottom performers, common issues count. See `references/batch-evaluation.md` for the collection summary template.

### Anti-Pattern 5: Ignoring Repository-Specific Standards
**What it looks like**: "This agent follows standard practices and is well-structured."
**Why wrong**: Did not check CLAUDE.md requirements. May miss v2 standards (YAML list format, pipe description, item count ranges, gates, anti-rationalization table).
**Do instead**: Check CLAUDE.md first. Verify all v2-specific criteria. A generic "well-structured" verdict is meaningless without rubric scores.
## References
This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks
### Domain-Specific Anti-Rationalization
| Rationalization | Why It's Wrong | Required Action |
|---|---|---|
| "YAML looks fine, no need to parse it" | Looking is not parsing; fields may be missing | Check each required field explicitly |
| "Content is long enough, skip counting" | Impressions are not measurements | Count lines, calculate score |
| "Script exists, must work" | Existence is not correctness | Run `python3 -m py_compile` |
| "One failing check, rest are probably fine" | Partial evaluation is not evaluation | Complete all 6 steps |
### Reference Files
- `${CLAUDE_SKILL_DIR}/references/scoring-rubric.md` - Full/partial/no credit breakdowns per rubric category
- `${CLAUDE_SKILL_DIR}/references/report-templates.md` - Standard report format templates (single, batch, comparison)
- `${CLAUDE_SKILL_DIR}/references/common-issues.md` - Frequently found issues with fix templates
- `${CLAUDE_SKILL_DIR}/references/batch-evaluation.md` - Batch evaluation procedures and collection summary format