
LLM-as-Judge

Overview

Some quality criteria are inherently subjective — tone of voice, visual aesthetics, UX feel, documentation clarity, code readability. These cannot be verified by deterministic tests. The LLM-as-judge pattern provides structured, repeatable evaluation using an LLM reviewer with defined rubrics, ensuring subjective quality is measured consistently.
Announce at start: "I'm using the llm-as-judge skill to evaluate subjective quality."

Phase 1: Determine Evaluation Method

Goal: Decide whether LLM-as-judge is the right tool.

Decision Table: LLM-as-Judge vs Deterministic Tests

| Criterion Type | Method | Example |
| --- | --- | --- |
| Objective, measurable | Deterministic test | "Response time < 200ms" |
| Structural, verifiable | Deterministic test | "Returns valid JSON" |
| Subjective, qualitative | LLM-as-judge | "Error messages are friendly and helpful" |
| Aesthetic, perceptual | LLM-as-judge | "UI feels clean and modern" |
| Linguistic, tonal | LLM-as-judge | "Documentation is clear for beginners" |
| Holistic, experiential | LLM-as-judge | "The onboarding flow feels intuitive" |
Rule of thumb: If you can write a boolean assertion, use a deterministic test. If evaluation requires judgment, use LLM-as-judge.
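The rule of thumb can be sketched in code. The deterministic side is a plain boolean check; the subjective side becomes a judge request (`buildJudgeRequest` is an illustrative helper, not an existing API — its shape follows the Review Request Structure in Phase 3):

```javascript
// Deterministic criterion: a boolean assertion suffices.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Subjective criterion: no boolean assertion exists, so build a judge
// request instead. `buildJudgeRequest` is a hypothetical helper.
function buildJudgeRequest(artifact) {
  return {
    criteria: "Error messages are friendly and helpful",
    artifact,
    rubric: [
      { dimension: "Helpfulness", weight: 0.6, description: "Does it help the user fix the problem?" },
      { dimension: "Tone", weight: 0.4, description: "Is the tone empathetic and non-blaming?" }
    ],
    passing_threshold: 7.5
  };
}
```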

STOP — Do NOT proceed to Phase 2 until:

  • Confirmed that criteria are genuinely subjective
  • Deterministic testing has been ruled out
  • Specific artifacts to evaluate are identified

Phase 2: Define Rubric

Goal: Create evaluation dimensions with weights and anchor points.

Actions

  1. Define 3-5 evaluation dimensions
  2. Assign weights (must sum to 1.0)
  3. Define anchor points for each dimension (1=worst, 5=adequate, 10=best)
  4. Set passing threshold
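Step 2 is easy to get wrong with more than a few dimensions, so it is worth checking mechanically. A minimal sketch (`validateRubric` is an illustrative helper, not part of any existing API):

```javascript
// Verify that rubric weights sum to exactly 1.0 (within floating-point tolerance).
function validateRubric(rubric) {
  const total = rubric.reduce((sum, d) => sum + d.weight, 0);
  if (Math.abs(total - 1.0) > 1e-9) {
    throw new Error(`Rubric weights sum to ${total}, expected 1.0`);
  }
  return true;
}
```

Run this before evaluation so a mis-weighted rubric fails fast rather than silently skewing scores.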

Rubric Structure

| Dimension | Weight | Scale | Anchor: 1 | Anchor: 5 | Anchor: 10 |
| --- | --- | --- | --- | --- | --- |
| [Name] | 0.X | 1-10 | [Worst case] | [Adequate] | [Excellent] |

Threshold Selection Table

| Quality Level | Threshold | Use For |
| --- | --- | --- |
| Minimum viable | 5.0 | Internal docs, draft content |
| Production quality | 7.0 | User-facing content, public APIs |
| Excellence | 8.5 | Marketing, critical UX flows |

STOP — Do NOT proceed to Phase 3 until:

  • 3-5 dimensions are defined with clear descriptions
  • Weights sum to exactly 1.0
  • Anchor points are specific (not vague)
  • Passing threshold is set before evaluation

Phase 3: Evaluate

Goal: Submit the artifact, together with the rubric, to an LLM reviewer.

Review Request Structure

```javascript
{
  criteria: "Description of what to evaluate and the quality standard",
  artifact: "The content to be evaluated (code, text, UI markup, etc.)",
  rubric: [
    { dimension: "Clarity", weight: 0.3, description: "Is the content easy to understand?" },
    { dimension: "Tone", weight: 0.3, description: "Is the tone appropriate for the audience?" },
    { dimension: "Completeness", weight: 0.2, description: "Does it cover all necessary points?" },
    { dimension: "Engagement", weight: 0.2, description: "Does it hold the reader's interest?" }
  ],
  passing_threshold: 7.0,
  intelligence: "opus"
}
```

Review Response Structure

```javascript
{
  scores: [
    { dimension: "Clarity", score: 8, reasoning: "Well-structured with clear headings..." },
    { dimension: "Tone", score: 7, reasoning: "Professional but occasionally too formal..." },
    { dimension: "Completeness", score: 9, reasoning: "Covers all key topics..." },
    { dimension: "Engagement", score: 6, reasoning: "Could use more examples..." }
  ],
  weighted_score: 7.5,
  pass: true,
  summary: "Overall good quality with minor tone and engagement improvements suggested.",
  suggestions: [
    "Add a real-world example in section 3",
    "Use more conversational language in the introduction"
  ]
}
```
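The weighted total is simply the sum of each dimension's score times its rubric weight. A minimal sketch, assuming the request and response shapes shown above (`weightedScore` is an illustrative helper):

```javascript
// Combine per-dimension scores with their rubric weights into one total.
// rubric: [{ dimension, weight, ... }], scores: [{ dimension, score, ... }]
function weightedScore(rubric, scores) {
  return scores.reduce((sum, s) => {
    const dim = rubric.find(d => d.dimension === s.dimension);
    return sum + dim.weight * s.score;
  }, 0);
}

// For the example above: 0.3*8 + 0.3*7 + 0.2*9 + 0.2*6 = 7.5
```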

STOP — Do NOT proceed to Phase 4 until:

  • Artifact has been submitted with full rubric
  • Each dimension has been scored independently
  • Reasoning is provided for every score
  • Weighted total is calculated

Phase 4: Iterate or Accept

Goal: Act on the evaluation results.

Result Action Table

| Result | Action | Max Iterations |
| --- | --- | --- |
| Pass (score >= threshold) | Accept the artifact, proceed | Done |
| Marginal fail (within 1 point) | Apply suggestions, re-evaluate once | 1 |
| Clear fail (> 1 point below) | Significant revision, apply all suggestions | 2 |
| Repeated fail (3+ attempts) | Escalate: rubric or approach may need adjustment | Escalate |
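The result/action table can be wrapped in a simple control loop. In this sketch, `evaluate` and `revise` are hypothetical callbacks standing in for the actual judge call and the revision step:

```javascript
// Iterate up to maxAttempts, then escalate.
// evaluate(artifact) -> { weighted_score, pass, suggestions }
// revise(artifact, suggestions) -> revised artifact
function iterateOrAccept(artifact, threshold, evaluate, revise, maxAttempts = 3) {
  let current = artifact;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = evaluate(current);
    if (result.weighted_score >= threshold) {
      return { status: "accepted", attempts: attempt, artifact: current };
    }
    current = revise(current, result.suggestions);
  }
  // 3 attempts exhausted: the rubric or approach may need adjustment.
  return { status: "escalate", attempts: maxAttempts, artifact: current };
}
```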

STOP — Evaluation complete when:

  • Artifact passes threshold, OR
  • 3 iterations completed and escalation decision made

Common Rubric Templates

Documentation Quality

Clarity (0.3): Is the content easy to understand for the target audience?
  1=incomprehensible  5=adequate but requires re-reading  10=crystal clear
Accuracy (0.3): Is the information technically correct?
  1=factually wrong  5=mostly correct  10=perfectly accurate
Completeness (0.2): Does it cover all necessary topics?
  1=missing critical info  5=covers basics  10=comprehensive
Examples (0.2): Are there sufficient, relevant code examples?
  1=no examples  5=some examples  10=rich, varied examples
Threshold: 7.0

Error Message Quality

Helpfulness (0.4): Does the message help the user fix the problem?
  1=no help at all  5=vague direction  10=exact fix steps
Clarity (0.3): Is the message easy to understand?
  1=cryptic  5=understandable  10=immediately clear
Tone (0.2): Is the tone empathetic and non-blaming?
  1=hostile/blaming  5=neutral  10=empathetic and supportive
Actionability (0.1): Does it suggest a concrete next step?
  1=no suggestion  5=vague suggestion  10=specific actionable step
Threshold: 7.5

Code Readability

Naming (0.3): Are variable/function names descriptive and consistent?
  1=single letters everywhere  5=adequate  10=self-documenting
Structure (0.3): Is the code logically organized?
  1=spaghetti  5=functional  10=elegant and clear
Simplicity (0.2): Is the code as simple as possible (but not simpler)?
  1=over-engineered  5=reasonable  10=minimal and clear
Documentation (0.2): Are complex sections adequately commented?
  1=no comments where needed  5=some comments  10=well-documented why
Threshold: 7.0

UX Copy

Clarity (0.3): Is the copy easy to understand?
  1=confusing  5=understandable  10=immediately clear
Brevity (0.2): Is it concise without losing meaning?
  1=verbose  5=adequate length  10=perfectly concise
Tone (0.2): Does it match the brand voice?
  1=off-brand  5=neutral  10=perfectly on-brand
Actionability (0.2): Do CTAs clearly communicate what happens next?
  1=unclear  5=adequate  10=crystal clear action
Accessibility (0.1): Is the language inclusive and jargon-free?
  1=exclusionary  5=neutral  10=fully inclusive
Threshold: 7.5

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | Correct Approach |
| --- | --- | --- |
| Using LLM-as-judge for measurable criteria | Wastes tokens, less reliable than assertions | Use deterministic tests for anything quantifiable |
| Vague rubric dimensions ("is it good?") | Produces unreliable, inconsistent scores | Specific dimensions with anchored examples |
| No passing threshold defined | No way to determine pass/fail objectively | Always set threshold before evaluation |
| Adjusting rubric to pass failing content | Defeats the purpose of quality gates | Fix the content, not the rubric |
| Single evaluation without reasoning | Cannot improve without understanding why | Always require per-dimension reasoning |
| Using weaker model for evaluation | Lower quality judgments | Use strongest available model (Opus) |
| Skipping re-evaluation after changes | No verification that changes improved quality | Always re-evaluate after revisions |

Integration Points

| Skill | Relationship |
| --- | --- |
| acceptance-testing | LLM-as-judge handles subjective acceptance criteria |
| spec-writing | Specs may include subjective quality criteria |
| code-review | Readability evaluation during code review |
| verification-before-completion | Subjective validation gate before completion |
| senior-prompt-engineer | Prompt quality evaluation uses LLM-as-judge |
| tech-docs-generator | Documentation quality evaluation |

Downstream Steering Pattern

+----------+     +----------+     +----------+     +----------+
|  SPECS   |---->|   CODE   |---->|  TESTS   |---->| LLM-AS-  |
|          |     |          |     |(determin)|     |  JUDGE   |
|          |     |          |     |          |     |(subject) |
+----------+     +----------+     +----------+     +----+-----+
                      ^                                  |
                      |          backpressure             |
                      +----------------------------------+
Deterministic tests validate objective criteria. LLM-as-judge validates subjective criteria. Both must pass.
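The "both must pass" rule at the end of the pipeline reduces to a conjunction. A minimal sketch, assuming a list of deterministic test results and a judge response shaped like the Review Response Structure above (`qualityGate` is an illustrative helper):

```javascript
// Ship only when deterministic tests AND the subjective judge both pass.
// deterministicResults: [{ name, passed }], judgeResponse: { pass, ... }
function qualityGate(deterministicResults, judgeResponse) {
  const testsPass = deterministicResults.every(r => r.passed);
  return testsPass && judgeResponse.pass;
}
```

A judge pass cannot compensate for a failing assertion, and vice versa; either failure feeds back upstream as the backpressure arrow in the diagram.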

Skill Type

FLEXIBLE — Adapt rubric dimensions and thresholds to context. The pattern structure (define rubric, evaluate, score, iterate) is fixed. Always set the threshold before evaluation, never after.