Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

Define expected behavior BEFORE implementation
Run evals continuously during development
Track regressions with each change
Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
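
The "Result" line of the regression template can be derived by comparing the current run against the baseline. The following is a minimal Python sketch of that comparison. It assumes results are held as a mapping from test name to "PASS" or "FAIL"; the function name `regression_summary` and the data shape are illustrative and not part of the skill.

```python
def regression_summary(
    baseline: dict[str, str], current: dict[str, str]
) -> tuple[str, list[str]]:
    """Compare current results to a baseline.

    Returns the summary line and the sorted names of tests that passed in
    the baseline but do not pass now. A test missing from `current` is
    treated as not passing (an assumption of this sketch).
    """
    total = len(baseline)
    passed = sum(1 for name in baseline if current.get(name) == "PASS")
    previously = sum(1 for status in baseline.values() if status == "PASS")
    regressed = sorted(
        name
        for name, status in baseline.items()
        if status == "PASS" and current.get(name) != "PASS"
    )
    summary = f"Result: {passed}/{total} passed (previously {previously}/{total})"
    return summary, regressed
```

Returning the regressed names alongside the summary makes it easy to point at the specific tests that need attention.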

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
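
When several deterministic checks need to run together, the same exit-status convention can be wrapped in a small runner. This is a sketch only: the names `grade`, `run_checks`, and `EXAMPLE_CHECKS`, and the 300-second timeout, are assumptions made for illustration, and the example commands simply mirror the shell checks in this section.

```python
import subprocess


def grade(command: list[str], timeout_seconds: float = 300) -> str:
    """Run a command and map its exit status to PASS or FAIL.

    A command that cannot be started or that exceeds the timeout
    is reported as FAIL.
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except (OSError, subprocess.TimeoutExpired):
        return "FAIL"
    return "PASS" if completed.returncode == 0 else "FAIL"


def run_checks(checks: dict[str, list[str]]) -> dict[str, str]:
    """Grade every named check and return a name -> PASS/FAIL mapping."""
    return {name: grade(command) for name, command in checks.items()}


# Hypothetical check table mirroring the shell commands shown above.
# It is defined here but not executed.
EXAMPLE_CHECKS = {
    "pattern": ["grep", "-q", "export function handleAuth", "src/auth.ts"],
    "tests": ["npm", "test", "--", "--testPathPattern=auth"],
    "build": ["npm", "run", "build"],
}
```

Calling `run_checks(EXAMPLE_CHECKS)` from the project root would return one PASS/FAIL entry per check. Passing commands as argument lists avoids shell quoting issues.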

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```
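
A model grader's reply has to be turned into a number before it can be tracked. The sketch below assumes the model answers with a `Score:` line holding a single digit from 1 to 5 and a `Reasoning:` line, as the prompt requests. The function name `parse_grader_reply` is hypothetical, and how the reply is obtained is left out.

```python
import re


def parse_grader_reply(reply: str) -> tuple[int, str]:
    """Extract the 1-5 score and the reasoning text from a grader reply.

    Raises ValueError when no line of the form 'Score: <1-5>' is present,
    so malformed replies are not silently counted.
    """
    score_match = re.search(r"^Score:\s*([1-5])\s*$", reply, flags=re.MULTILINE)
    if score_match is None:
        raise ValueError("no 'Score: <1-5>' line found in grader reply")
    reasoning_match = re.search(
        r"^Reasoning:\s*(.+)", reply, flags=re.MULTILINE | re.DOTALL
    )
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return int(score_match.group(1)), reasoning
```

Rejecting replies without a valid score keeps a probabilistic grader from producing silent passes.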

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

pass@1: First attempt success rate
pass@3: Success within 3 attempts
Typical target: pass@3 > 90%
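
To make the definition concrete, here is a minimal sketch that computes pass@k across a set of evals. It assumes each eval's attempts are recorded in order as a list of booleans. That data shape and the name `pass_at_k` are illustrative choices, not something the skill prescribes.

```python
def pass_at_k(attempts_per_eval: list[list[bool]], k: int) -> float:
    """Fraction of evals with at least one success in the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_per_eval:
        raise ValueError("at least one eval is required")
    solved = sum(1 for attempts in attempts_per_eval if any(attempts[:k]))
    return solved / len(attempts_per_eval)


# Three evals: two pass on the first attempt, one passes on the second.
attempts = [[True], [False, True], [True]]
print(f"pass@1: {pass_at_k(attempts, 1):.0%}")
print(f"pass@3: {pass_at_k(attempts, 3):.0%}")
```

The sample data matches the figures in the sample eval report: two of three evals pass on the first attempt, and all three pass within three attempts.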

pass^k

"All k trials succeed"

Higher bar for reliability
pass^3: 3 consecutive successes
Use for critical paths
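
The gap between the two metrics is easiest to see with a worked example. Assume, purely for illustration, that trials are independent and that each succeeds with the same probability p:

$$
\Pr[\text{at least one of } k \text{ succeeds}] = 1 - (1 - p)^k, \qquad \Pr[\text{all } k \text{ succeed}] = p^k
$$

With p = 0.7 and k = 3:

$$
1 - (1 - 0.7)^3 = 1 - 0.027 = 0.973, \qquad 0.7^3 = 0.343
$$

Under these assumptions, a capability that looks reliable by pass@3 still fails pass^3 roughly two times out of three, which is why pass^k is the stricter bar.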

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
- Can create new user account
- Can validate email format
- Can hash password securely

### Regression Evals
- Existing login still works
- Session management unchanged
- Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
# (run each capability eval and record PASS/FAIL)

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed

Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```
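
A report in this layout can be assembled from recorded results. The sketch below is one possible approach and omits the Metrics section for brevity. It assumes capability results are stored as the attempt number of the first success (or `None` if no attempt succeeded) and regression results as booleans. The function name `format_report` and the fallback status "NEEDS WORK" are hypothetical and do not come from the skill.

```python
from typing import Optional


def format_report(
    feature: str,
    capability: dict[str, Optional[int]],
    regression: dict[str, bool],
) -> str:
    """Render capability and regression results in the sample report layout."""
    title = f"EVAL REPORT: {feature}"
    lines = [title, "=" * len(title), "", "Capability Evals:"]
    for name, first_success in capability.items():
        label = name + ":"
        status = f"PASS (pass@{first_success})" if first_success else "FAIL"
        lines.append(f"  {label:<16} {status}")
    cap_passed = sum(1 for value in capability.values() if value)
    lines.append(f"  {'Overall:':<16} {cap_passed}/{len(capability)} passed")

    lines += ["", "Regression Evals:"]
    for name, ok in regression.items():
        label = name + ":"
        lines.append(f"  {label:<16} {'PASS' if ok else 'FAIL'}")
    reg_passed = sum(1 for ok in regression.values() if ok)
    lines.append(f"  {'Overall:':<16} {reg_passed}/{len(regression)} passed")

    # "NEEDS WORK" is a placeholder status invented for this sketch.
    ready = cap_passed == len(capability) and reg_passed == len(regression)
    lines += ["", f"Status: {'READY FOR REVIEW' if ready else 'NEEDS WORK'}"]
    return "\n".join(lines)
```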

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .claude/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report

Eval Storage

Store evals in project:

.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
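
The source does not specify the format of the log or baseline files. As one possible sketch, the helpers below append each run to the feature's `.log` file as one JSON object per line and read `baseline.json` if it exists. The JSON Lines format, the record fields, and the names `append_run` and `load_baseline` are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_run(evals_dir: Path, feature: str, results: dict[str, str]) -> Path:
    """Append one eval run to <feature>.log as a single JSON line."""
    evals_dir.mkdir(parents=True, exist_ok=True)
    log_path = evals_dir / f"{feature}.log"
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return log_path


def load_baseline(evals_dir: Path) -> dict:
    """Return the parsed baseline.json, or an empty dict if it is missing."""
    baseline_path = evals_dir / "baseline.json"
    if not baseline_path.exists():
        return {}
    return json.loads(baseline_path.read_text(encoding="utf-8"))
```

An append-only log keeps the full run history, which is what tracking pass@k over time requires.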

Best Practices

Define evals BEFORE coding - Forces clear thinking about success criteria
Run evals frequently - Catch regressions early
Track pass@k over time - Monitor reliability trends
Use code graders when possible - Deterministic > probabilistic
Human review for security - Never fully automate security checks
Keep evals fast - Slow evals don't get run
Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- User can register with email/password
- User can log in with valid credentials
- Invalid credentials rejected with proper error
- Sessions persist across page reloads
- Logout clears session

Regression Evals:
- Public routes still accessible
- API responses unchanged
- Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```
