eval-harness

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Eval Harness Skill

评估工具技能

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
基于评估驱动开发(EDD)原则的Claude Code会话正式评估框架。

Philosophy

理念

Eval-Driven Development treats evals as the "unit tests of AI development":
  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement
评估驱动开发(EDD)将评估视为“AI开发的单元测试”:
  • 在实现前定义预期行为
  • 开发过程中持续运行评估
  • 追踪每次变更的回归问题
  • 使用pass@k指标衡量可靠性

Eval Types

评估类型

Capability Evals

能力评估

Test if Claude can do something it couldn't before:
markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
测试Claude是否能完成之前无法实现的功能:
markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result

Regression Evals

回归评估

Ensure changes don't break existing functionality:
markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
确保变更不会破坏现有功能:
markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)

Grader Types

grader类型

1. Code-Based Grader

1. 基于代码的Grader

Deterministic checks using code:
bash
undefined
使用代码进行确定性检查:
bash
undefined

Check if file contains expected pattern

Check if file contains expected pattern

grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

Check if tests pass

Check if tests pass

npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

Check if build succeeds

Check if build succeeds

npm run build && echo "PASS" || echo "FAIL"
undefined
npm run build && echo "PASS" || echo "FAIL"
undefined

2. Model-Based Grader

2. 基于模型的Grader

Use Claude to evaluate open-ended outputs:
markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
使用Claude评估开放式输出:
markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]

3. Human Grader

3. 人工Grader

Flag for manual review:
markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
标记需要人工审核的内容:
markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH

Metrics

指标

pass@k

pass@k

"At least one success in k attempts"
  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%
“k次尝试中至少成功一次”
  • pass@1: 首次尝试成功率
  • pass@3: 3次尝试内成功
  • 典型目标:pass@3 > 90%

pass^k

pass^k

"All k trials succeed"
  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
“k次测试全部成功”
  • 对可靠性要求更高
  • pass^3: 连续3次成功
  • 用于关键路径

Eval Workflow

评估工作流

1. Define (Before Coding)

1. 定义(编码前)

markdown
undefined
markdown
undefined

EVAL DEFINITION: feature-xyz

EVAL DEFINITION: feature-xyz

Capability Evals

Capability Evals

  1. Can create new user account
  2. Can validate email format
  3. Can hash password securely
  1. Can create new user account
  2. Can validate email format
  3. Can hash password securely

Regression Evals

Regression Evals

  1. Existing login still works
  2. Session management unchanged
  3. Logout flow intact
  1. Existing login still works
  2. Session management unchanged
  3. Logout flow intact

Success Metrics

Success Metrics

  • pass@3 > 90% for capability evals
  • pass^3 = 100% for regression evals
undefined
  • pass@3 > 90% for capability evals
  • pass^3 = 100% for regression evals
undefined

2. Implement

2. 实现

Write code to pass the defined evals.
编写代码以通过定义的评估。

3. Evaluate

3. 评估

bash
undefined
bash
undefined

Run capability evals

Run capability evals

[Run each capability eval, record PASS/FAIL]
[Run each capability eval, record PASS/FAIL]

Run regression evals

Run regression evals

npm test -- --testPathPattern="existing"
npm test -- --testPathPattern="existing"

Generate report

Generate report

undefined
undefined

4. Report

4. 报告

markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed

Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user:     PASS (pass@1)
  validate-email:  PASS (pass@2)
  hash-password:   PASS (pass@1)
  Overall:         3/3 passed

Regression Evals:
  login-flow:      PASS
  session-mgmt:    PASS
  logout-flow:     PASS
  Overall:         3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW

Integration Patterns

集成模式

Pre-Implementation

实施前

/eval define feature-name
Creates eval definition file at
.claude/evals/feature-name.md
/eval define feature-name
.claude/evals/feature-name.md
创建评估定义文件

During Implementation

实施中

/eval check feature-name
Runs current evals and reports status
/eval check feature-name
运行当前评估并报告状态

Post-Implementation

实施后

/eval report feature-name
Generates full eval report
/eval report feature-name
生成完整评估报告

Eval Storage

评估存储

Store evals in project:
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
在项目中存储评估:
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines

Best Practices

最佳实践

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts
  1. 编码前定义评估 - 促使明确思考成功标准
  2. 频繁运行评估 - 尽早发现回归问题
  3. 随时间追踪pass@k - 监控可靠性趋势
  4. 尽可能使用代码Grader - 确定性优于概率性
  5. 安全相关内容人工审核 - 永远不要完全自动化安全检查
  6. 保持评估快速 - 缓慢的评估不会被执行
  7. 评估与代码一起版本化 - 评估是一等工件

Example: Adding Authentication

示例:添加认证功能

markdown
undefined
markdown
undefined

EVAL: add-authentication

EVAL: add-authentication

Phase 1: Define (10 min)

Phase 1: Define (10 min)

Capability Evals:
  • User can register with email/password
  • User can login with valid credentials
  • Invalid credentials rejected with proper error
  • Sessions persist across page reloads
  • Logout clears session
Regression Evals:
  • Public routes still accessible
  • API responses unchanged
  • Database schema compatible
Capability Evals:
  • User can register with email/password
  • User can login with valid credentials
  • Invalid credentials rejected with proper error
  • Sessions persist across page reloads
  • Logout clears session
Regression Evals:
  • Public routes still accessible
  • API responses unchanged
  • Database schema compatible

Phase 2: Implement (varies)

Phase 2: Implement (varies)

[Write code]
[Write code]

Phase 3: Evaluate

Phase 3: Evaluate

Run: /eval check add-authentication
Run: /eval check add-authentication

Phase 4: Report

Phase 4: Report

EVAL REPORT: add-authentication

Capability: 5/5 passed (pass@3: 100%) Regression: 3/3 passed (pass^3: 100%) Status: SHIP IT
undefined

EVAL REPORT: add-authentication

Capability: 5/5 passed (pass@3: 100%) Regression: 3/3 passed (pass^3: 100%) Status: SHIP IT
undefined