self-eval

Self-Eval: Honest Work Evaluation

ultrathink

Tier: STANDARD
Category: Engineering / Quality
Dependencies: None (prompt-only, no external tools required)

Description

Self-eval is a Claude Code skill that produces honest, calibrated work evaluations. It replaces the default AI tendency to rate everything 4/5 with a structured two-axis scoring system, mandatory devil's advocate reasoning, and cross-session anti-inflation detection.

The core insight: AI self-assessment converges to "everything is a 4" because a single-axis score conflates task difficulty with execution quality. Self-eval separates these axes, then combines them via a fixed matrix that the model cannot override.

Features

Two-axis scoring — Independently rates task ambition (Low/Medium/High) and execution quality (Poor/Adequate/Strong), then combines via a lookup matrix
Mandatory devil's advocate — Before finalizing, must argue for both higher AND lower scores, then resolve the tension
Score persistence — Appends scores to .self-eval-scores.jsonl in the working directory, building history across sessions
Anti-inflation detection — Reads past scores and flags clustering (4+ of last 5 identical)
Matrix-locked scoring — The composite score comes from the matrix, not from direct selection. Low ambition caps at 2/5 regardless of execution quality

Usage

After completing work in a Claude Code session:

/self-eval

With context about what to evaluate:

/self-eval the authentication refactor we just completed

The skill reviews the conversation history, identifies what was accomplished, scores it on both axes, runs the devil's advocate check, and appends the result to the score history file.

Examples

Example 1: Feature Implementation

/self-eval added pagination to the user list API

Output:

Self-Evaluation

Task: Added cursor-based pagination to the user list API endpoint
Ambition: Medium — New feature with established patterns, but required careful backwards compatibility
Execution: Strong — All tests pass, handles edge cases, existing clients unaffected

Devil's Advocate:

Lower: Pagination is well-documented. The pattern was copied from another endpoint with minor changes. Could argue Low ambition.
Higher: The backwards-compat requirement made this harder than a greenfield implementation. Had to support both offset and cursor modes.
Resolution: Medium ambition holds — the backwards-compat constraint added real complexity beyond copy-paste.

Score: 4 — Solid execution on meaningful work with real constraints.

Example 2: Bug Fix

/self-eval

Output:

Self-Evaluation

Task: Fixed race condition in the checkout flow causing duplicate orders
Ambition: Medium — Required debugging across 3 services to find the root cause
Execution: Adequate — Fix works but the test only covers the exact scenario reported, not the general class of race conditions

Devil's Advocate:

Lower: The fix was a single line (adding a mutex). Once found, it was trivial.
Higher: Finding the root cause across 3 services took real investigation. The mutex placement required understanding the full flow.
Resolution: Medium ambition is correct for the investigation, but execution drops to Adequate — a more thorough fix would address the pattern, not just the instance.

Score: 3 — Good debugging work but the fix is narrow.

---

What to Evaluate

$ARGUMENTS

If no arguments provided, review the full conversation history to identify what was accomplished this session. Summarize the work in one sentence before scoring.

How to Score — Two-Axis Model

Score on two independent axes, then combine using the matrix. Do NOT pick a number first and rationalize it — rate each axis separately, then read the matrix.

Axis 1: Task Ambition (what was attempted)

Rate the difficulty and risk of what was worked on. NOT how well it was done.

Low (1) — Safe, familiar, routine. No real risk of failure. Examples: minor config changes, simple refactors, copy-paste with small modifications, tasks you were confident you'd complete before starting.
Medium (2) — Meaningful work with novelty or challenge. Partial failure was possible. Examples: new feature implementation, integrating an unfamiliar API, architectural changes, debugging a tricky issue.
High (3) — Ambitious, unfamiliar, or high-stakes. Real risk of complete failure. Examples: building something from scratch in an unfamiliar domain, complex system redesign, performance-critical optimization, shipping to production under pressure.

Self-check: If you were confident of success before starting, ambition is Low or Medium, not High.

Axis 2: Execution Quality (how well it was done)

Rate the quality of the actual output, independent of how ambitious the task was.

Poor (1) — Major failures, incomplete, wrong output, or abandoned mid-task. The deliverable doesn't meet its own stated criteria.
Adequate (2) — Completed but with gaps, shortcuts, or missing rigor. Did the thing but left obvious improvements on the table.
Strong (3) — Well-executed, thorough, quality output. No obvious improvements left undone given the scope.

Composite Score Matrix

	Poor Exec (1)	Adequate Exec (2)	Strong Exec (3)
Low Ambition (1)	1	2	2
Medium Ambition (2)	2	3	4
High Ambition (3)	2	4	5

Read the matrix, don't override it. The composite is your score. The devil's advocate below can cause you to re-rate an axis — but you cannot directly override the matrix result.

Key properties:

Low ambition caps at 2. Safe work done perfectly is still safe work.
A 5 requires BOTH high ambition AND strong execution. It should be rare.
High ambition + poor execution = 2. Bold failure hurts.
The most common honest score for solid work is 3 (medium ambition, adequate execution).
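The skill itself is prompt-only, but the lookup can be illustrated in a few lines of Python. This is a minimal sketch, not part of the skill: the names `Ambition`, `Execution`, `COMPOSITE`, and `composite_score` are hypothetical, and the values are copied from the matrix above.

```python
from enum import IntEnum


class Ambition(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Execution(IntEnum):
    POOR = 1
    ADEQUATE = 2
    STRONG = 3


# Rows are ambition, columns are execution quality.
# Values are copied from the composite score matrix in this document.
COMPOSITE = {
    Ambition.LOW: {Execution.POOR: 1, Execution.ADEQUATE: 2, Execution.STRONG: 2},
    Ambition.MEDIUM: {Execution.POOR: 2, Execution.ADEQUATE: 3, Execution.STRONG: 4},
    Ambition.HIGH: {Execution.POOR: 2, Execution.ADEQUATE: 4, Execution.STRONG: 5},
}


def composite_score(ambition: Ambition, execution: Execution) -> int:
    """Return the composite score for a pair of axis ratings.

    There is deliberately no override parameter: the only way to change
    the result is to re-rate one of the axes.
    """
    return COMPOSITE[ambition][execution]


if __name__ == "__main__":
    # Medium ambition with strong execution, as in Example 1.
    print(composite_score(Ambition.MEDIUM, Execution.STRONG))
```

Keeping the matrix as data rather than as branching logic makes the key properties easy to check by inspection: the Low row never exceeds 2, and 5 appears in exactly one cell.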

Devil's Advocate (MANDATORY)

Before writing your final score, you MUST write all three of these:

Case for LOWER: Why might this work deserve a lower score? What was easy, what was avoided, what was less ambitious than it appears? Would a skeptical reviewer agree with your axis ratings?
Case for HIGHER: Why might this work deserve a higher score? What was genuinely challenging, surprising, or exceeded the original plan?
Resolution: If either case reveals you mis-rated an axis, re-rate it and recompute the matrix result. Then state your final score with a 1-2 sentence justification that addresses at least one point from each case.

If your devil's advocate is fewer than 3 sentences total, you're not engaging with it. Try harder.

Anti-Inflation Check

Check for a score history file at .self-eval-scores.jsonl in the current working directory.

If the file exists, read it and check the last 5 scores. If 4+ of the last 5 are the same number, flag it:

Warning: Score clustering detected. Last 5 scores: [list]. Consider whether you're anchoring to a default.

If the file doesn't exist, ask yourself: "Would an outside observer rate this the same way I am?"
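The clustering rule is simple enough to sketch in Python. The skill performs this check through the prompt, so the code below is only an illustration under assumptions: the function names `recent_scores` and `clustering_warning` are hypothetical, and each line of the history file is assumed to be a JSON object with a numeric `score` field.

```python
import json
from collections import Counter
from pathlib import Path
from typing import List, Optional


def recent_scores(path: Path, window: int = 5) -> List[int]:
    """Return up to `window` most recent scores, oldest first.

    Returns an empty list when the history file does not exist.
    """
    if not path.exists():
        return []
    scores = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:  # skip blank lines
            scores.append(json.loads(line)["score"])
    return scores[-window:]


def clustering_warning(scores: List[int], threshold: int = 4) -> Optional[str]:
    """Return warning text if `threshold` or more of the scores are identical."""
    if not scores:
        return None
    # most_common(1) yields the single most frequent score and its count.
    _, count = Counter(scores).most_common(1)[0]
    if count >= threshold:
        return (
            f"Warning: Score clustering detected. Last {len(scores)} scores: "
            f"{scores}. Consider whether you're anchoring to a default."
        )
    return None
```

A caller would pass `Path(".self-eval-scores.jsonl")` to `recent_scores` and print the result of `clustering_warning` if it is not `None`. With a history shorter than five entries, this sketch still flags four identical scores, which is one reasonable reading of "4+ of the last 5".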

Score Persistence

After presenting your evaluation, append one line to .self-eval-scores.jsonl in the current working directory:

{"date":"YYYY-MM-DD","score":N,"ambition":"Low|Medium|High","execution":"Poor|Adequate|Strong","task":"1-sentence summary"}

This enables the anti-inflation check to work across sessions. If the file doesn't exist, create it.
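For readers who want to see the append step concretely, here is a minimal Python sketch of writing one record in the format above. The helper name `append_score` and its parameters are hypothetical; the skill itself does this through the prompt rather than through a script.

```python
import json
from datetime import date
from pathlib import Path

# File name taken from this document; resolved against the working directory.
SCORES_FILE = Path(".self-eval-scores.jsonl")


def append_score(score, ambition, execution, task, path=SCORES_FILE, today=None):
    """Append one evaluation record as a single JSON line and return it."""
    record = {
        "date": (today or date.today()).isoformat(),  # YYYY-MM-DD
        "score": score,
        "ambition": ambition,    # "Low", "Medium", or "High"
        "execution": execution,  # "Poor", "Adequate", or "Strong"
        "task": task,            # one-sentence summary
    }
    # Mode "a" creates the file if it is missing and never rewrites
    # earlier lines, so the history only grows.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

For instance, `append_score(3, "Medium", "Adequate", "Fixed race condition in checkout flow")` would add one line to the history file. Writing one JSON object per line keeps each session's write independent of the rest of the file.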

Output Format

Present your evaluation as:

Self-Evaluation

Task: [1-sentence summary of what was attempted]
Ambition: [Low/Medium/High] — [1-sentence justification]
Execution: [Poor/Adequate/Strong] — [1-sentence justification]

Devil's Advocate:

Lower: [why it might deserve less]
Higher: [why it might deserve more]
Resolution: [final reasoning]

Score: [1-5] — [1-sentence final justification]
