eval-harness
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseEval Harness Skill
评估工具技能
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
基于评估驱动开发(EDD)原则的Claude Code会话正式评估框架。
Philosophy
理念
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
评估驱动开发(EDD)将评估视为“AI开发的单元测试”:
- 在实现前定义预期行为
- 开发过程中持续运行评估
- 追踪每次变更的回归问题
- 使用pass@k指标衡量可靠性
Eval Types
评估类型
Capability Evals
能力评估
Test if Claude can do something it couldn't before:
markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result测试Claude是否能完成之前无法实现的功能:
markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected resultRegression Evals
回归评估
Ensure changes don't break existing functionality:
markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)确保变更不会破坏现有功能:
markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)Grader Types
grader类型
1. Code-Based Grader
1. 基于代码的Grader
Deterministic checks using code:
bash
undefined使用代码进行确定性检查:
bash
undefinedCheck if file contains expected pattern
Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
Check if tests pass
Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
Check if build succeeds
Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
undefinednpm run build && echo "PASS" || echo "FAIL"
undefined2. Model-Based Grader
2. 基于模型的Grader
Use Claude to evaluate open-ended outputs:
markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]使用Claude评估开放式输出:
markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]3. Human Grader
3. 人工Grader
Flag for manual review:
markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH标记需要人工审核的内容:
markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGHMetrics
指标
pass@k
pass@k
"At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts
- Typical target: pass@3 > 90%
“k次尝试中至少成功一次”
- pass@1: 首次尝试成功率
- pass@3: 3次尝试内成功
- 典型目标:pass@3 > 90%
pass^k
pass^k
"All k trials succeed"
- Higher bar for reliability
- pass^3: 3 consecutive successes
- Use for critical paths
“k次测试全部成功”
- 对可靠性要求更高
- pass^3: 连续3次成功
- 用于关键路径
Eval Workflow
评估工作流
1. Define (Before Coding)
1. 定义(编码前)
markdown
undefinedmarkdown
undefinedEVAL DEFINITION: feature-xyz
EVAL DEFINITION: feature-xyz
Capability Evals
Capability Evals
- Can create new user account
- Can validate email format
- Can hash password securely
- Can create new user account
- Can validate email format
- Can hash password securely
Regression Evals
Regression Evals
- Existing login still works
- Session management unchanged
- Logout flow intact
- Existing login still works
- Session management unchanged
- Logout flow intact
Success Metrics
Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
undefined- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
undefined2. Implement
2. 实现
Write code to pass the defined evals.
编写代码以通过定义的评估。
3. Evaluate
3. 评估
bash
undefinedbash
undefinedRun capability evals
Run capability evals
[Run each capability eval, record PASS/FAIL]
[Run each capability eval, record PASS/FAIL]
Run regression evals
Run regression evals
npm test -- --testPathPattern="existing"
npm test -- --testPathPattern="existing"
Generate report
Generate report
undefinedundefined4. Report
4. 报告
markdown
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEWmarkdown
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEWIntegration Patterns
集成模式
Pre-Implementation
实施前
/eval define feature-nameCreates eval definition file at
.claude/evals/feature-name.md/eval define feature-name在创建评估定义文件
.claude/evals/feature-name.mdDuring Implementation
实施中
/eval check feature-nameRuns current evals and reports status
/eval check feature-name运行当前评估并报告状态
Post-Implementation
实施后
/eval report feature-nameGenerates full eval report
/eval report feature-name生成完整评估报告
Eval Storage
评估存储
Store evals in project:
.claude/
evals/
feature-xyz.md # Eval definition
feature-xyz.log # Eval run history
baseline.json # Regression baselines在项目中存储评估:
.claude/
evals/
feature-xyz.md # Eval definition
feature-xyz.log # Eval run history
baseline.json # Regression baselinesBest Practices
最佳实践
- Define evals BEFORE coding - Forces clear thinking about success criteria
- Run evals frequently - Catch regressions early
- Track pass@k over time - Monitor reliability trends
- Use code graders when possible - Deterministic > probabilistic
- Human review for security - Never fully automate security checks
- Keep evals fast - Slow evals don't get run
- Version evals with code - Evals are first-class artifacts
- 编码前定义评估 - 促使明确思考成功标准
- 频繁运行评估 - 尽早发现回归问题
- 随时间追踪pass@k - 监控可靠性趋势
- 尽可能使用代码Grader - 确定性优于概率性
- 安全相关内容人工审核 - 永远不要完全自动化安全检查
- 保持评估快速 - 缓慢的评估不会被执行
- 评估与代码一起版本化 - 评估是一等工件
Example: Adding Authentication
示例:添加认证功能
markdown
undefinedmarkdown
undefinedEVAL: add-authentication
EVAL: add-authentication
Phase 1: Define (10 min)
Phase 1: Define (10 min)
Capability Evals:
- User can register with email/password
- User can login with valid credentials
- Invalid credentials rejected with proper error
- Sessions persist across page reloads
- Logout clears session
Regression Evals:
- Public routes still accessible
- API responses unchanged
- Database schema compatible
Capability Evals:
- User can register with email/password
- User can login with valid credentials
- Invalid credentials rejected with proper error
- Sessions persist across page reloads
- Logout clears session
Regression Evals:
- Public routes still accessible
- API responses unchanged
- Database schema compatible
Phase 2: Implement (varies)
Phase 2: Implement (varies)
[Write code]
[Write code]
Phase 3: Evaluate
Phase 3: Evaluate
Run: /eval check add-authentication
Run: /eval check add-authentication
Phase 4: Report
Phase 4: Report
EVAL REPORT: add-authentication
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
undefinedEVAL REPORT: add-authentication
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
undefined