evaluation-rubrics

Compare original and translation side by side


Evaluation Rubrics

评估量规(Rubrics)


Purpose

用途

Evaluation Rubrics provide explicit criteria and performance scales to assess quality consistently, fairly, and transparently. This skill guides you through rubric design—from identifying meaningful criteria to writing clear performance descriptors—to enable objective evaluation, reduce bias, align teams on standards, and give actionable feedback.
评估量规(Rubrics)提供明确的标准和绩效等级,用于一致、公平、透明地评估质量。本工具将指导您完成量规设计的全流程——从确定有意义的评估标准到撰写清晰的绩效描述——以实现客观评估、减少偏见、统一团队标准,并提供可落地的反馈。

When to Use

适用场景

Use this skill when:
  • Quality assessment: Code reviews, design critiques, writing evaluation, product launches, academic grading
  • Competitive evaluation: Vendor selection, hiring candidates, grant proposals, pitch competitions, award judging
  • Progress tracking: Sprint reviews, skill assessments, training completion, certification exams
  • Standardization: Multiple reviewers need to score consistently (inter-rater reliability), reduce subjective bias
  • Feedback delivery: Provide clear, actionable feedback tied to specific criteria (not just "good" or "needs work")
  • Threshold setting: Define minimum acceptable quality (e.g., "must score ≥3/5 on all criteria to pass")
  • Process improvement: Identify systematic weaknesses (many submissions score low on same criterion → need better guidance)
Trigger phrases: "rubric", "scoring criteria", "evaluation framework", "quality standards", "how do we grade this", "what does good look like", "consistent assessment", "inter-rater reliability"
在以下场景中使用本工具:
  • 质量评估:代码评审、设计评审、写作评估、产品发布、学术评分
  • 竞争性评估:供应商选择、候选人招聘、基金申请、路演比赛、奖项评选
  • 进度跟踪:迭代评审、技能评估、培训结业、认证考试
  • 标准化需求:多名评审者需要保持评分一致性(评分者间信度)、减少主观偏见
  • 反馈交付:提供与具体标准挂钩的清晰、可落地的反馈(而非仅“好”或“需要改进”)
  • 阈值设定:定义最低可接受质量(例如:“所有标准得分≥3/5方可通过”)
  • 流程改进:识别系统性薄弱环节(多项提交在同一标准上得分低→需要优化指导)
触发词:"rubric"、"评分标准"、"评估框架"、"质量标准"、"如何为这项工作评分"、"优秀的标准是什么"、"一致性评估"、"评分者间信度"

What Is It?

什么是评估量规?

An evaluation rubric is a structured scoring tool with:
  • Criteria: What dimensions of quality are being assessed (e.g., clarity, completeness, originality)
  • Scale: Numeric or qualitative levels (e.g., 1-5, Novice-Expert, Below/Meets/Exceeds)
  • Descriptors: Explicit descriptions of what each level looks like for each criterion
  • Weighting (optional): Importance of each criterion (some more critical than others)
Core benefits:
  • Consistency: Same work scored similarly by different reviewers (inter-rater reliability)
  • Transparency: Evaluatees know expectations upfront, can self-assess
  • Actionable feedback: Specific areas for improvement, not vague critique
  • Fairness: Reduces bias, focuses on observable work not subjective impressions
  • Efficiency: Faster evaluation with clear benchmarks, less debate
Quick example:
Scenario: Evaluating technical blog posts
Rubric (1-5 scale):
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
| --- | --- | --- | --- |
| Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous |
| Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized |
| Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable |
| Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring: Post A scores [4, 5, 3, 2] = 3.5 avg. Post B scores [5, 4, 5, 4] = 4.5 avg → Post B higher quality.
Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2). Add code samples and explore edge cases to improve."
评估量规(Rubric)是一种结构化评分工具,包含:
  • 评估标准:评估的质量维度(例如:清晰度、完整性、原创性)
  • 评分等级:数字或定性等级(例如:1-5、新手-专家、未达标/达标/优秀)
  • 绩效描述:每个标准下各等级的明确表现描述
  • 权重设置(可选):各标准的重要性(部分标准比其他标准更关键)
核心优势
  • 一致性:不同评审者对同一工作的评分结果相近(评分者间信度)
  • 透明度:被评估者提前了解预期要求,可进行自我评估
  • 可落地反馈:明确指出改进的具体方向,而非模糊批评
  • 公平性:减少偏见,关注可观察的工作成果而非主观印象
  • 高效性:借助清晰的基准加快评估速度,减少争议
快速示例
场景:评估技术博客文章
量规(1-5分制)
| 评估标准 | 1(差) | 3(合格) | 5(优秀) |
| --- | --- | --- | --- |
| 技术准确性 | 存在多处事实错误、误导性内容 | 基本正确,存在少量不准确之处 | 完全准确,技术严谨 |
| 清晰度 | 内容混乱、术语过多、结构糟糕 | 专业人士可理解,结构尚可 | 目标受众易于理解,组织有序 |
| 实用价值 | 无可行指导,仅为理论内容 | 包含部分示例,适用性有限 | 提供具体示例,可直接应用 |
| 原创性 | 重复常见知识,无新见解 | 有一定新颖视角,基于现有内容拓展 | 采用新颖方法,深化认知 |
评分:文章A得分[4,5,3,2]→平均分3.5。文章B得分[5,4,5,4]→平均分4.5→文章B质量更高。
给文章A的反馈:"清晰度表现出色(5分),准确性良好(4分),但需增加更多实用示例(3分),原创性不足(2分)。建议添加代码示例并探讨边缘案例以提升质量。"
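The averaging in the quick example can be run mechanically. A minimal Python sketch — the criterion names and scores are taken from the example above; the helper itself is illustrative, not part of any prescribed tooling:

```python
# Score two posts against the four-criterion rubric from the example.
CRITERIA = ["Technical Accuracy", "Clarity", "Practical Value", "Originality"]

def average_score(scores):
    """Unweighted mean across criteria, as used in the example."""
    return sum(scores) / len(scores)

post_a = dict(zip(CRITERIA, [4, 5, 3, 2]))
post_b = dict(zip(CRITERIA, [5, 4, 5, 4]))

for name, post in [("Post A", post_a), ("Post B", post_b)]:
    print(f"{name}: {average_score(list(post.values())):.1f}")
# Post A: 3.5
# Post B: 4.5
```

Keeping scores in a criterion-keyed dict rather than a bare list makes the per-criterion feedback ("Originality 2/5") easy to generate alongside the aggregate.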

Workflow

工作流程

Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write observable description of what that performance looks like. See resources/template.md for writing guidelines.
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
复制以下清单并跟踪进度:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
步骤1:定义用途和范围
明确评估对象、评审者、结果使用者,以及评分将影响哪些决策。查看resources/template.md获取范围界定问题。
步骤2:确定评估标准
头脑风暴质量维度,优先选择最重要、可观察的维度,平衡覆盖范围与简洁性(通常4-8个标准)。查看resources/template.md获取头脑风暴框架。
步骤3:设计评分等级
选择等级数量(1-5、1-4、1-10)、等级类型(数字、定性)、锚点(各等级的含义)。查看resources/methodology.md获取评分等级选择指南。
步骤4:撰写绩效描述
为每个标准×等级组合,撰写可观察的表现描述。查看resources/template.md获取撰写指南。
步骤5:测试与校准
安排多名评审者对样本工作评分,对比结果,讨论差异,优化量规。查看resources/methodology.md获取评分者间信度测试方法。
步骤6:使用与迭代
应用量规,收集评审者和被评估者的反馈,根据需要修订标准/描述。使用resources/evaluators/rubric_evaluation_rubrics.json进行验证。最低标准:平均分≥3.5。
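Steps 2-4 can be captured concretely by representing the rubric as data: criteria, a scale, and one descriptor per criterion × level. The structure below is a hypothetical sketch (descriptors borrowed from the blog-post example), not a prescribed format:

```python
# A rubric as plain data: anchor levels plus a descriptor for each cell.
rubric = {
    "scale": [1, 3, 5],  # anchor levels; 2 and 4 interpolate between them
    "criteria": {
        "Clarity": {
            1: "Confusing, jargon-heavy, poor structure",
            3: "Clear to experts, some structure",
            5: "Accessible to target audience, well-organized",
        },
        "Practical Value": {
            1: "No actionable guidance, theoretical only",
            3: "Some examples, limited applicability",
            5: "Concrete examples, immediately applicable",
        },
    },
}

def validate(rubric):
    """Step 5 sanity check: every criterion must describe every anchor level."""
    anchors = set(rubric["scale"])
    for name, descriptors in rubric["criteria"].items():
        missing = anchors - set(descriptors)
        if missing:
            raise ValueError(f"{name} lacks descriptors for levels {sorted(missing)}")

validate(rubric)  # raises ValueError if any criterion x level cell is empty
```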

Common Patterns

常见类型

Pattern 1: Analytic Rubric (Most Common)
  • Structure: Multiple criteria (rows), multiple levels (columns), descriptor for each cell
  • Use case: Detailed feedback needed, want to see performance across dimensions, diagnostic assessment
  • Pros: Specific feedback, identifies strengths/weaknesses by criterion, high reliability
  • Cons: Time-consuming to create and use, can feel reductive
  • Example: Code review rubric (Correctness, Efficiency, Readability, Maintainability × 1-5 scale)
Pattern 2: Holistic Rubric
  • Structure: Single overall score, descriptors integrate multiple criteria
  • Use case: Quick overall judgment, summative assessment, criteria hard to separate
  • Pros: Fast, intuitive, captures gestalt quality
  • Cons: Less actionable feedback, lower reliability, can't diagnose specific weaknesses
  • Example: Essay holistic scoring (1=poor essay, 3=adequate essay, 5=excellent essay with detailed descriptors)
Pattern 3: Single-Point Rubric
  • Structure: Criteria listed with only "meets standard" descriptor, space to note above/below
  • Use case: Growth mindset feedback, encourage self-assessment, less punitive feel
  • Pros: Emphasizes improvement not deficit, simpler to create, encourages dialogue
  • Cons: Less precision, requires written feedback to supplement
  • Example: Design critique (list criteria like "Visual hierarchy", "Accessibility", note "+Clear focal point, -Poor contrast")
Pattern 4: Checklist (Binary)
  • Structure: List of yes/no items, must-haves for acceptance
  • Use case: Compliance checks, minimum quality gates, pass/fail decisions
  • Pros: Very clear, objective, easy to use
  • Cons: No gradations, misses quality beyond basics, can feel rigid
  • Example: Pull request checklist (Tests pass? Code linted? Documentation updated? Security review?)
Pattern 5: Standards-Based Rubric
  • Structure: Criteria tied to learning objectives/competencies, levels = degree of mastery
  • Use case: Educational assessment, skill certification, training evaluation, criterion-referenced
  • Pros: Aligned to standards, shows progress toward mastery, diagnostic
  • Cons: Requires clear standards, can be complex to design
  • Example: Data science skills (Proficiency in: Data cleaning, Modeling, Visualization, Communication × Novice/Competent/Expert)
类型1:分析型量规(最常用)
  • 结构:多个标准(行)、多个等级(列),每个单元格对应一个描述
  • 适用场景:需要详细反馈、希望了解各维度表现、诊断性评估
  • 优势:反馈具体,可按标准识别优势/劣势,信度高
  • 劣势:创建和使用耗时,可能过于简化
  • 示例:代码评审量规(正确性、效率、可读性、可维护性 × 1-5分制)
类型2:整体型量规
  • 结构:单一整体得分,描述整合多个标准
  • 适用场景:快速整体判断、总结性评估、标准难以拆分
  • 优势:快速、直观,能捕捉整体质量
  • 劣势:反馈可落地性差,信度较低,无法诊断具体弱点
  • 示例:论文整体评分(1=差,3=合格,5=优秀,附带详细描述)
类型3:单点量规
  • 结构:列出标准及仅“达标”的描述,留有记录优于/低于达标情况的空间
  • 适用场景:成长型反馈、鼓励自我评估、减少惩罚感
  • 优势:强调改进而非不足,创建更简单,促进对话
  • 劣势:精度较低,需要书面反馈补充
  • 示例:设计评审(列出“视觉层级”“可访问性”等标准,记录“+清晰焦点,-对比度不足”)
类型4:检查清单(二元制)
  • 结构:是/否选项列表,验收必备条件
  • 适用场景:合规检查、最低质量门槛、通过/不通过决策
  • 优势:非常清晰、客观,易于使用
  • 劣势:无等级区分,无法反映基础以上的质量差异,可能过于僵化
  • 示例:拉取请求检查清单(测试通过?代码已Lint?文档已更新?已完成安全评审?)
类型5:基于标准的量规
  • 结构:标准与学习目标/能力挂钩,等级代表掌握程度
  • 适用场景:教育评估、技能认证、培训评估、标准参照评估
  • 优势:与标准对齐,展示向精通迈进的进度,具有诊断性
  • 劣势:需要明确的标准,设计复杂
  • 示例:数据科学技能(熟练程度:数据清洗、建模、可视化、沟通 × 新手/胜任/专家)

Guardrails

注意事项

Critical requirements:
  1. Criteria must be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Vague criteria lead to unreliable scoring. Test: Can two independent reviewers score this criterion consistently?
  2. Descriptors must distinguish levels clearly: Each level should have concrete differences from adjacent levels (not just "better" or "more"). Avoid: "5=very good, 4=good, 3=okay". Better: "5=zero bugs, meets all requirements, 4=1-2 minor bugs, meets 90% requirements, 3=3+ bugs or missing key feature".
  3. Use appropriate scale granularity: 1-3 too coarse (hard to differentiate), 1-10 too fine (false precision, hard to define all levels). Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows neutral middle). Match granularity to actual observable differences.
  4. Balance comprehensiveness with simplicity: More criteria = more detailed feedback but longer to use. Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
  5. Calibrate for inter-rater reliability: Have multiple reviewers score same work, measure agreement (Kappa, ICC). If <70% agreement, refine descriptors. Schedule calibration sessions where reviewers discuss discrepancies.
  6. Provide examples at each level: Abstract descriptors are ambiguous. Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
  7. Make rubric accessible before evaluation: If evaluatees see rubric only after being scored, it's just grading not guidance. Share rubric upfront so people know expectations and can self-assess.
  8. Weight criteria appropriately: Not all criteria equally important. If "Security" matters more than "Code style", weight it (Security ×3, Style ×1). Or use thresholds (must score ≥4 on Security to pass, regardless of other scores).
Common pitfalls:
  • Subjective language: "Shows effort", "creative", "professional" - not observable without concrete descriptors
  • Overlapping criteria: "Clarity" and "Organization" often conflated - define boundaries clearly
  • Hidden expectations: Rubric doesn't mention X, but evaluators penalize for missing X - document all criteria
  • Central tendency bias: Reviewers avoid extremes (always score 3/5) - use even-number scales (1-4) to force choice
  • Halo effect: High score on one criterion biases other scores up - score each criterion independently before looking at others
  • Rubric drift: Descriptors erode over time, reviewers interpret differently - periodic re-calibration required
关键要求
  1. 评估标准必须可观察、可衡量:不能是“态度良好”(主观),而应是“准时到场、主动承担任务、帮助队友”(可观察)。模糊的标准会导致评分不可靠。测试方法:两名独立评审者能否对该标准给出一致评分?
  2. 绩效描述必须清晰区分等级:每个等级与相邻等级应有具体差异(不能仅为“更好”或“更多”)。避免:“5=非常好,4=好,3=一般”。更好的描述:“5=零缺陷,满足所有需求;4=1-2个小缺陷,满足90%需求;3=3个以上缺陷或缺失关键功能”。
  3. 使用合适的评分等级粒度:1-3分制过于粗糙(难以区分),1-10分制过于精细(虚假精确,难以定义所有等级)。最佳选择:1-4分制(强制选择,无中间选项)或1-5分制(允许中立选项)。粒度应与实际可观察的差异匹配。
  4. 平衡全面性与简洁性:标准越多→反馈越详细,但使用耗时越长。目标是4-8个标准,覆盖核心质量维度。如果超过10个标准,考虑分组或优先排序。
  5. 校准以确保评分者间信度:安排多名评审者对同一工作评分,衡量一致性(Kappa系数、组内相关系数ICC)。如果一致性<70%,优化描述。安排校准会议,让评审者讨论差异。
  6. 为每个等级提供示例:抽象描述模糊不清。包含每个等级的具体工作示例(锚定论文、参考设计、代码样本),以校准评审者。
  7. 评估前向被评估者提供量规:如果被评估者仅在评分后才看到量规,那这只是评分而非指导。提前分享量规,让人们了解预期要求并进行自我评估。
  8. 合理设置标准权重:并非所有标准同等重要。如果“安全性”比“代码风格”更重要,可设置权重(安全性×3,风格×1)。或使用阈值(无论其他标准得分如何,安全性得分≥4方可通过)。
常见误区
  • 主观语言:“表现努力”、“有创意”、“专业”——若无具体描述则无法观察
  • 标准重叠:“清晰度”和“组织性”常被混淆——明确界定边界
  • 隐藏预期:量规未提及X,但评审者因缺少X而扣分——记录所有标准
  • 趋中偏差:评审者避免极端评分(总是给3/5)——使用偶数分制(1-4)强制选择
  • 光环效应:某一标准得分高导致其他标准得分被高估——独立评分每个标准后再查看整体
  • 量规偏差:描述随时间被弱化,评审者解读不同——定期重新校准
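Guardrail 8 (weights and thresholds) can be made mechanical. The sketch below is an assumed implementation, with illustrative criterion names, weights, and a Security ≥ 4 gate:

```python
def weighted_verdict(scores, weights, thresholds):
    """Weighted mean with hard gates: any thresholded criterion below its
    minimum fails the submission regardless of the other scores."""
    for criterion, minimum in thresholds.items():
        if scores[criterion] < minimum:
            return 0.0, f"fail: {criterion} below threshold {minimum}"
    total_weight = sum(weights.values())
    weighted = sum(scores[c] * w for c, w in weights.items()) / total_weight
    return weighted, "pass"

weights = {"Security": 3, "Correctness": 2, "Style": 1}
scores = {"Security": 3, "Correctness": 5, "Style": 4}
value, verdict = weighted_verdict(scores, weights, {"Security": 4})
print(verdict)  # fail: Security below threshold 4
```

Note that the high Correctness and Style scores cannot rescue the submission: the threshold gate runs before any averaging, which is the point of guardrail 8.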

Quick Reference

快速参考

Key resources:
  • resources/template.md: Purpose definition, criteria brainstorming, scale selection, descriptor templates, rubric formats
  • resources/methodology.md: Scale design principles, descriptor writing techniques, inter-rater reliability testing, bias mitigation
  • resources/evaluators/rubric_evaluation_rubrics.json: Quality criteria for rubric design (criteria clarity, scale appropriateness, descriptor specificity)
Scale Selection Guide:
| Scale | Use When | Pros | Cons |
| --- | --- | --- | --- |
| 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback |
| 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary |
| 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) |
| 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels |
| Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate |
| Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
  • Product criteria: Evaluate the artifact itself (correctness, clarity, completeness, aesthetics, performance)
  • Process criteria: How work was done (methodology followed, collaboration, iteration, time management)
  • Impact criteria: Outcomes/effects (user satisfaction, business value, learning achieved)
  • Meta criteria: Quality of quality (documentation, testability, maintainability, scalability)
Inter-Rater Reliability Benchmarks:
  • <50% agreement: Rubric unreliable, needs major revision
  • 50-70% agreement: Marginal, refine descriptors and calibrate reviewers
  • 70-85% agreement: Good, acceptable for most uses
  • >85% agreement: Excellent, highly reliable scoring
Typical Rubric Development Time:
  • Simple rubric (3-5 criteria, 1-4 scale, known domain): 2-4 hours
  • Standard rubric (5-7 criteria, 1-5 scale, some complexity): 6-10 hours + calibration session
  • Complex rubric (8+ criteria, multiple scales, novel domain): 15-25 hours + multiple calibration rounds
When to escalate beyond rubrics:
  • High-stakes decisions (hiring, admissions, awards) → Add structured interviews, portfolios, multi-method assessment
  • Subjective/creative work (art, poetry, design) → Supplement rubric with critique, discourse, expert judgment
  • Complex holistic judgment (leadership, cultural fit) → Rubrics help but don't capture everything; use thoughtfully
Rubrics are tools, not replacements for human judgment. Use them to structure thinking, not to mechanize decisions.
Inputs required:
  • Artifact type (what are we evaluating? essays, code, designs, proposals?)
  • Criteria (quality dimensions to assess, 4-8 most common)
  • Scale (1-5 default, or specify 1-4, 1-10, qualitative labels)
Outputs produced:
  • evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notes
核心资源
  • resources/template.md:用途定义、标准头脑风暴、评分等级选择、描述模板、量规格式
  • resources/methodology.md:评分等级设计原则、描述撰写技巧、评分者间信度测试、偏见缓解
  • resources/evaluators/rubric_evaluation_rubrics.json:量规设计的质量标准(标准清晰度、等级适用性、描述特异性)
评分等级选择指南
| 评分等级 | 适用场景 | 优势 | 劣势 |
| --- | --- | --- | --- |
| 1-3分制 | 需要快速分类,等级清晰 | 快速,强制明确决策 | 过于粗糙,反馈少 |
| 1-4分制 | 希望强制选择(无中间选项) | 避免趋中偏差,区分清晰 | 无中立选项,类似二元制 |
| 1-5分制 | 通用场景,最常用 | 允许中立选项,熟悉度高,粒度合适 | 存在趋中偏差(所有人得3分) |
| 1-10分制 | 需要精细区分,样本量大 | 区分度最大,便于统计分析 | 虚假精确,难以区分相邻等级 |
| 定性等级(新手/熟练/专家) | 教育、技能发展 | 直观,以成长为导向 | 量化程度低,难以汇总 |
| 二元制(是/否,通过/不通过) | 合规检查、门槛设置 | 客观、简单 | 无等级区分,无法反映质量差异 |
评估标准类型
  • 产品标准:评估成果本身(正确性、清晰度、完整性、美观性、性能)
  • 流程标准:评估工作完成方式(是否遵循方法论、协作、迭代、时间管理)
  • 影响标准:评估结果/影响(用户满意度、业务价值、学习成果)
  • 元标准:评估质量的质量(文档、可测试性、可维护性、可扩展性)
评分者间信度基准
  • <50%一致性:量规不可靠,需大幅修订
  • 50-70%一致性:边际可靠,优化描述并校准评审者
  • 70-85%一致性:良好,适用于大多数场景
  • >85%一致性:优秀,评分高度可靠
典型量规开发时间
  • 简单量规(3-5个标准,1-4分制,熟悉领域):2-4小时
  • 标准量规(5-7个标准,1-5分制,一定复杂度):6-10小时 + 校准会议
  • 复杂量规(8个以上标准,多等级,新领域):15-25小时 + 多轮校准
何时需超越量规
  • 高风险决策(招聘、录取、奖项)→ 增加结构化面试、作品集、多方法评估
  • 主观/创意工作(艺术、诗歌、设计)→ 用量规补充评审、讨论、专家判断
  • 复杂整体判断(领导力、文化契合度)→ 量规有帮助但无法涵盖所有内容,需谨慎使用
量规是工具,而非人类判断的替代品。用于结构化思考,而非机械化决策。
所需输入
  • 成果类型(评估对象:论文、代码、设计、提案?)
  • 评估标准(评估的质量维度,通常4-8个)
  • 评分等级(默认1-5分制,或指定1-4、1-10、定性标签)
产出内容
  • evaluation-rubrics.md:用途、标准定义、带描述的评分等级、使用说明、权重/阈值、校准说明
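The inter-rater reliability benchmarks in the Quick Reference can be computed directly from paired scores. A self-contained Python sketch (the reviewer data is invented for illustration; for production use, an established statistics library is the safer choice):

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items where two reviewers gave the exact same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both reviewers scored at random with
    # their own observed score distributions.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two reviewers scoring the same eight submissions on a 1-5 scale.
reviewer_1 = [3, 4, 5, 3, 2, 4, 4, 5]
reviewer_2 = [3, 4, 4, 3, 2, 4, 5, 5]

print(f"agreement = {percent_agreement(reviewer_1, reviewer_2):.0%}")  # 75%
print(f"kappa     = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

By the benchmarks above, 75% raw agreement sits in the "good" band; kappa discounts the agreement expected by chance and therefore reads lower, so report both when calibrating.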