Self-Eval: Honest Work Evaluation
ultrathink
Tier: STANDARD
Category: Engineering / Quality
Dependencies: None (prompt-only, no external tools required)
Description
Self-eval is a Claude Code skill that produces honest, calibrated work evaluations. It replaces the default AI tendency to rate everything 4/5 with a structured two-axis scoring system, mandatory devil's advocate reasoning, and cross-session anti-inflation detection.
The core insight: AI self-assessment converges to "everything is a 4" because a single-axis score conflates task difficulty with execution quality. Self-eval separates these axes, then combines them via a fixed matrix that the model cannot override.
Features
- Two-axis scoring — Independently rates task ambition (Low/Medium/High) and execution quality (Poor/Adequate/Strong), then combines via a lookup matrix
- Mandatory devil's advocate — Before finalizing, must argue for both higher AND lower scores, then resolve the tension
- Score persistence — Appends scores to `.self-eval-scores.jsonl` in the working directory, building history across sessions
- Anti-inflation detection — Reads past scores and flags clustering (4+ of last 5 identical)
- Matrix-locked scoring — The composite score comes from the matrix, not from direct selection. Low ambition caps at 2/5 regardless of execution quality
Usage
After completing work in a Claude Code session:
/self-eval
With context about what to evaluate:
/self-eval the authentication refactor we just completed
The skill reviews the conversation history, identifies what was accomplished, scores it on both axes, runs the devil's advocate check, and appends the result to the score history file.
Examples
Example 1: Feature Implementation
/self-eval added pagination to the user list API
Output:
Self-Evaluation
Task: Added cursor-based pagination to the user list API endpoint
Ambition: Medium — New feature with established patterns, but required careful backwards compatibility
Execution: Strong — All tests pass, handles edge cases, existing clients unaffected
Devil's Advocate:
- Lower: Pagination is well-documented. The pattern was copied from another endpoint with minor changes. Could argue Low ambition.
- Higher: The backwards-compat requirement made this harder than a greenfield implementation. Had to support both offset and cursor modes.
- Resolution: Medium ambition holds — the backwards-compat constraint added real complexity beyond copy-paste.
Score: 4 — Solid execution on meaningful work with real constraints.
Example 2: Bug Fix
/self-eval
Output:
Self-Evaluation
Task: Fixed race condition in the checkout flow causing duplicate orders
Ambition: Medium — Required debugging across 3 services to find the root cause
Execution: Adequate — Fix works but the test only covers the exact scenario reported, not the general class of race conditions
Devil's Advocate:
- Lower: The fix was a single line (adding a mutex). Once found, it was trivial.
- Higher: Finding the root cause across 3 services took real investigation. The mutex placement required understanding the full flow.
- Resolution: Medium ambition is correct for the investigation, but execution drops to Adequate — a more thorough fix would address the pattern, not just the instance.
Score: 3 — Good debugging work but the fix is narrow.
---
What to Evaluate
$ARGUMENTS
If no arguments are provided, review the full conversation history to identify what was accomplished this session. Summarize the work in one sentence before scoring.
How to Score — Two-Axis Model
Score on two independent axes, then combine using the matrix. Do NOT pick a number first and rationalize it — rate each axis separately, then read the matrix.
Axis 1: Task Ambition (what was attempted)
Rate the difficulty and risk of what was worked on. NOT how well it was done.
- Low (1) — Safe, familiar, routine. No real risk of failure. Examples: minor config changes, simple refactors, copy-paste with small modifications, tasks you were confident you'd complete before starting.
- Medium (2) — Meaningful work with novelty or challenge. Partial failure was possible. Examples: new feature implementation, integrating an unfamiliar API, architectural changes, debugging a tricky issue.
- High (3) — Ambitious, unfamiliar, or high-stakes. Real risk of complete failure. Examples: building something from scratch in an unfamiliar domain, complex system redesign, performance-critical optimization, shipping to production under pressure.
Self-check: If you were confident of success before starting, ambition is Low or Medium, not High.
Axis 2: Execution Quality (how well it was done)
Rate the quality of the actual output, independent of how ambitious the task was.
- Poor (1) — Major failures, incomplete, wrong output, or abandoned mid-task. The deliverable doesn't meet its own stated criteria.
- Adequate (2) — Completed but with gaps, shortcuts, or missing rigor. Did the thing but left obvious improvements on the table.
- Strong (3) — Well-executed, thorough, quality output. No obvious improvements left undone given the scope.
Composite Score Matrix
| | Poor Exec (1) | Adequate Exec (2) | Strong Exec (3) |
|---|---|---|---|
| Low Ambition (1) | 1 | 2 | 2 |
| Medium Ambition (2) | 2 | 3 | 4 |
| High Ambition (3) | 2 | 4 | 5 |
Read the matrix, don't override it. The composite is your score. The devil's advocate below can cause you to re-rate an axis — but you cannot directly override the matrix result.
Key properties:
- Low ambition caps at 2. Safe work done perfectly is still safe work.
- A 5 requires BOTH high ambition AND strong execution. It should be rare.
- High ambition + poor execution = 2. Bold failure hurts.
- The most common honest score for solid work is 3 (medium ambition, adequate execution).
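The lookup can be sketched in a few lines of Python. This is only an illustration of the matrix, not part of the skill itself: the names `Ambition`, `Execution`, `MATRIX`, and `composite_score` are hypothetical, and the cell values are copied from the table.

```python
from enum import IntEnum


class Ambition(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Execution(IntEnum):
    POOR = 1
    ADEQUATE = 2
    STRONG = 3


# Rows are ambition, columns are execution; values mirror the matrix table.
MATRIX = {
    Ambition.LOW: {Execution.POOR: 1, Execution.ADEQUATE: 2, Execution.STRONG: 2},
    Ambition.MEDIUM: {Execution.POOR: 2, Execution.ADEQUATE: 3, Execution.STRONG: 4},
    Ambition.HIGH: {Execution.POOR: 2, Execution.ADEQUATE: 4, Execution.STRONG: 5},
}


def composite_score(ambition: Ambition, execution: Execution) -> int:
    """Return the composite score; the only way to change it is to re-rate an axis."""
    return MATRIX[ambition][execution]


print(composite_score(Ambition.MEDIUM, Execution.STRONG))  # 4
```

Because the function takes only the two axis ratings, there is no argument through which a preferred number could be passed in, which is the property the matrix is meant to enforce.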
Devil's Advocate (MANDATORY)
Before writing your final score, you MUST write all three of these:
- Case for LOWER: Why might this work deserve a lower score? What was easy, what was avoided, what was less ambitious than it appears? Would a skeptical reviewer agree with your axis ratings?
- Case for HIGHER: Why might this work deserve a higher score? What was genuinely challenging, surprising, or exceeded the original plan?
- Resolution: If either case reveals you mis-rated an axis, re-rate it and recompute the matrix result. Then state your final score with a 1-2 sentence justification that addresses at least one point from each case.
If your devil's advocate is less than 3 sentences total, you're not engaging with it — try harder.
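As a rough illustration of the three-sentence floor, the check below counts sentences across the three parts. It is a crude heuristic under stated assumptions: the function names are hypothetical, and the splitter treats any `.`, `!`, or `?` followed by whitespace as a sentence boundary, so abbreviations will be miscounted.

```python
import re


def count_sentences(text: str) -> int:
    """Count sentences by splitting after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def devils_advocate_is_thin(lower: str, higher: str, resolution: str,
                            minimum: int = 3) -> bool:
    """True when the three parts together contain fewer than `minimum` sentences."""
    total = sum(count_sentences(t) for t in (lower, higher, resolution))
    return total < minimum
```

A length check like this can only catch an empty devil's advocate; it says nothing about whether the arguments actually engage with the work.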
Anti-Inflation Check
Check for a score history file at `.self-eval-scores.jsonl` in the current working directory.
If the file exists, read it and check the last 5 scores. If 4+ of the last 5 are the same number, flag it:
Warning: Score clustering detected. Last 5 scores: [list]. Consider whether you're anchoring to a default.
If the file doesn't exist, ask yourself: "Would an outside observer rate this the same way I am?"
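The clustering rule can be sketched as follows. The helper names are hypothetical, and two behaviours are assumptions the skill text does not specify: malformed lines are skipped, and no warning is raised until at least 5 scores exist.

```python
import json
from collections import Counter
from pathlib import Path

HISTORY_FILE = ".self-eval-scores.jsonl"


def last_scores(path: str = HISTORY_FILE, n: int = 5) -> list:
    """Return up to the last n scores from the history file, oldest first."""
    p = Path(path)
    if not p.exists():
        return []
    scores = []
    for line in p.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            scores.append(json.loads(line)["score"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # assumption: silently skip lines that are not valid records
    return scores[-n:]


def clustering_warning(scores: list, window: int = 5, threshold: int = 4):
    """Return the warning text if threshold+ of the last `window` scores match."""
    recent = scores[-window:]
    if len(recent) < window:
        return None  # assumption: too little history to judge clustering
    _, count = Counter(recent).most_common(1)[0]
    if count >= threshold:
        return (f"Warning: Score clustering detected. Last {window} scores: "
                f"{recent}. Consider whether you're anchoring to a default.")
    return None
```

Calling `clustering_warning(last_scores())` before finalizing a score would surface the warning whenever the recent history has collapsed onto a single value.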
Score Persistence
After presenting your evaluation, append one line to `.self-eval-scores.jsonl` in the current working directory:
{"date":"YYYY-MM-DD","score":N,"ambition":"Low|Medium|High","execution":"Poor|Adequate|Strong","task":"1-sentence summary"}
This enables the anti-inflation check to work across sessions. If the file doesn't exist, create it.
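A minimal sketch of the append step, assuming the record fields shown in the template line. The function name `append_score` and its `path` and `today` parameters are hypothetical; opening the file in append mode creates it when it is missing.

```python
import json
from datetime import date
from pathlib import Path


def append_score(score, ambition, execution, task,
                 path=".self-eval-scores.jsonl", today=None):
    """Append one JSON object per line (JSON Lines) and return the record."""
    record = {
        "date": (today or date.today()).isoformat(),  # YYYY-MM-DD
        "score": score,
        "ambition": ambition,
        "execution": execution,
        "task": task,
    }
    # Mode "a" creates the file if it does not exist and never truncates it.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

For example, `append_score(3, "Medium", "Adequate", "Fixed race condition in checkout")` would add one line to the history file in the current working directory.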
Output Format
Present your evaluation as:
Self-Evaluation
Task: [1-sentence summary of what was attempted]
Ambition: [Low/Medium/High] — [1-sentence justification]
Execution: [Poor/Adequate/Strong] — [1-sentence justification]
Devil's Advocate:
- Lower: [why it might deserve less]
- Higher: [why it might deserve more]
- Resolution: [final reasoning]
Score: [1-5] — [1-sentence final justification]
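For tooling that needs to produce or check this layout, the template can be sketched as a small renderer. The `Evaluation` dataclass, its field names, and `render` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

DASH = "\u2014"  # separator character used between a rating and its justification


@dataclass
class Evaluation:
    task: str
    ambition: str
    ambition_reason: str
    execution: str
    execution_reason: str
    lower: str
    higher: str
    resolution: str
    score: int
    score_reason: str


def render(ev: Evaluation) -> str:
    """Render an evaluation in the nine-line layout of the template."""
    lines = [
        "Self-Evaluation",
        f"Task: {ev.task}",
        f"Ambition: {ev.ambition} {DASH} {ev.ambition_reason}",
        f"Execution: {ev.execution} {DASH} {ev.execution_reason}",
        "Devil's Advocate:",
        f"- Lower: {ev.lower}",
        f"- Higher: {ev.higher}",
        f"- Resolution: {ev.resolution}",
        f"Score: {ev.score} {DASH} {ev.score_reason}",
    ]
    return "\n".join(lines)
```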