
LLM-as-Judge

Overview

Some quality criteria are inherently subjective — tone of voice, visual aesthetics, UX feel, documentation clarity, code readability. These cannot be verified by deterministic tests. The LLM-as-judge pattern provides structured, repeatable evaluation using an LLM reviewer with defined rubrics, ensuring subjective quality is measured consistently.
Announce at start: "I'm using the llm-as-judge skill to evaluate subjective quality."

Phase 1: Determine Evaluation Method

Goal: Decide whether LLM-as-judge is the right tool.

Decision Table: LLM-as-Judge vs Deterministic Tests

| Criterion Type | Method | Example |
| --- | --- | --- |
| Objective, measurable | Deterministic test | "Response time < 200ms" |
| Structural, verifiable | Deterministic test | "Returns valid JSON" |
| Subjective, qualitative | LLM-as-judge | "Error messages are friendly and helpful" |
| Aesthetic, perceptual | LLM-as-judge | "UI feels clean and modern" |
| Linguistic, tonal | LLM-as-judge | "Documentation is clear for beginners" |
| Holistic, experiential | LLM-as-judge | "The onboarding flow feels intuitive" |
Rule of thumb: If you can write a boolean assertion, use a deterministic test. If evaluation requires judgment, use LLM-as-judge.
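The rule of thumb can be sketched in code. The deterministic side is a plain boolean check; the subjective side becomes a judge request (`buildJudgeRequest` is an illustrative helper, not an existing API — its shape follows the Review Request Structure in Phase 3):

```javascript
// Deterministic criterion: a boolean assertion suffices.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Subjective criterion: no boolean assertion exists, so build a judge
// request instead. `buildJudgeRequest` is a hypothetical helper.
function buildJudgeRequest(artifact) {
  return {
    criteria: "Error messages are friendly and helpful",
    artifact,
    rubric: [
      { dimension: "Helpfulness", weight: 0.6, description: "Does it help the user fix the problem?" },
      { dimension: "Tone", weight: 0.4, description: "Is the tone empathetic and non-blaming?" }
    ],
    passing_threshold: 7.5
  };
}
```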

STOP — Do NOT proceed to Phase 2 until:

  • Confirmed that criteria are genuinely subjective
  • Deterministic testing has been ruled out
  • Specific artifacts to evaluate are identified

Phase 2: Define Rubric

Goal: Create evaluation dimensions with weights and anchor points.

Actions

  1. Define 3-5 evaluation dimensions
  2. Assign weights (must sum to 1.0)
  3. Define anchor points for each dimension (1=worst, 5=adequate, 10=best)
  4. Set passing threshold
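Step 2 is easy to get wrong with more than a few dimensions, so it is worth checking mechanically. A minimal sketch (`validateRubric` is an illustrative helper, not part of any existing API):

```javascript
// Verify that rubric weights sum to exactly 1.0 (within floating-point tolerance).
function validateRubric(rubric) {
  const total = rubric.reduce((sum, d) => sum + d.weight, 0);
  if (Math.abs(total - 1.0) > 1e-9) {
    throw new Error(`Rubric weights sum to ${total}, expected 1.0`);
  }
  return true;
}
```

Run this before evaluation so a mis-weighted rubric fails fast rather than silently skewing scores.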

Rubric Structure

| Dimension | Weight | Scale | Anchor: 1 | Anchor: 5 | Anchor: 10 |
| --- | --- | --- | --- | --- | --- |
| [Name] | 0.X | 1-10 | [Worst case] | [Adequate] | [Excellent] |

Threshold Selection Table

| Quality Level | Threshold | Use For |
| --- | --- | --- |
| Minimum viable | 5.0 | Internal docs, draft content |
| Production quality | 7.0 | User-facing content, public APIs |
| Excellence | 8.5 | Marketing, critical UX flows |

STOP — Do NOT proceed to Phase 3 until:

  • 3-5 dimensions are defined with clear descriptions
  • Weights sum to exactly 1.0
  • Anchor points are specific (not vague)
  • Passing threshold is set before evaluation

Phase 3: Evaluate

Goal: Submit the artifact, together with the rubric, to an LLM reviewer.

Review Request Structure

```javascript
{
  criteria: "Description of what to evaluate and the quality standard",
  artifact: "The content to be evaluated (code, text, UI markup, etc.)",
  rubric: [
    { dimension: "Clarity", weight: 0.3, description: "Is the content easy to understand?" },
    { dimension: "Tone", weight: 0.3, description: "Is the tone appropriate for the audience?" },
    { dimension: "Completeness", weight: 0.2, description: "Does it cover all necessary points?" },
    { dimension: "Engagement", weight: 0.2, description: "Does it hold the reader's interest?" }
  ],
  passing_threshold: 7.0,
  intelligence: "opus"
}
```

Review Response Structure

```javascript
{
  scores: [
    { dimension: "Clarity", score: 8, reasoning: "Well-structured with clear headings..." },
    { dimension: "Tone", score: 7, reasoning: "Professional but occasionally too formal..." },
    { dimension: "Completeness", score: 9, reasoning: "Covers all key topics..." },
    { dimension: "Engagement", score: 6, reasoning: "Could use more examples..." }
  ],
  weighted_score: 7.5,
  pass: true,
  summary: "Overall good quality with minor tone and engagement improvements suggested.",
  suggestions: [
    "Add a real-world example in section 3",
    "Use more conversational language in the introduction"
  ]
}
```
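The weighted total is simply the sum of each dimension's score times its rubric weight. A minimal sketch, assuming the request and response shapes shown above (`weightedScore` is an illustrative helper):

```javascript
// Combine per-dimension scores with their rubric weights into one total.
// rubric: [{ dimension, weight, ... }], scores: [{ dimension, score, ... }]
function weightedScore(rubric, scores) {
  return scores.reduce((sum, s) => {
    const dim = rubric.find(d => d.dimension === s.dimension);
    return sum + dim.weight * s.score;
  }, 0);
}

// For the example above: 0.3*8 + 0.3*7 + 0.2*9 + 0.2*6 = 7.5
```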

STOP — Do NOT proceed to Phase 4 until:

  • Artifact has been submitted with full rubric
  • Each dimension has been scored independently
  • Reasoning is provided for every score
  • Weighted total is calculated

Phase 4: Iterate or Accept

Goal: Act on the evaluation results.

Result Action Table

| Result | Action | Max Iterations |
| --- | --- | --- |
| Pass (score >= threshold) | Accept the artifact, proceed | Done |
| Marginal fail (within 1 point) | Apply suggestions, re-evaluate once | 1 |
| Clear fail (> 1 point below) | Significant revision, apply all suggestions | 2 |
| Repeated fail (3+ attempts) | Escalate: rubric or approach may need adjustment | Escalate |
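The result/action table can be wrapped in a simple control loop. In this sketch, `evaluate` and `revise` are hypothetical callbacks standing in for the actual judge call and the revision step:

```javascript
// Iterate up to maxAttempts, then escalate.
// evaluate(artifact) -> { weighted_score, pass, suggestions }
// revise(artifact, suggestions) -> revised artifact
function iterateOrAccept(artifact, threshold, evaluate, revise, maxAttempts = 3) {
  let current = artifact;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = evaluate(current);
    if (result.weighted_score >= threshold) {
      return { status: "accepted", attempts: attempt, artifact: current };
    }
    current = revise(current, result.suggestions);
  }
  // 3 attempts exhausted: the rubric or approach may need adjustment.
  return { status: "escalate", attempts: maxAttempts, artifact: current };
}
```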

STOP — Evaluation complete when:

  • Artifact passes threshold, OR
  • 3 iterations completed and escalation decision made

Common Rubric Templates

Documentation Quality

Clarity (0.3): Is the content easy to understand for the target audience?
  1=incomprehensible  5=adequate but requires re-reading  10=crystal clear
Accuracy (0.3): Is the information technically correct?
  1=factually wrong  5=mostly correct  10=perfectly accurate
Completeness (0.2): Does it cover all necessary topics?
  1=missing critical info  5=covers basics  10=comprehensive
Examples (0.2): Are there sufficient, relevant code examples?
  1=no examples  5=some examples  10=rich, varied examples
Threshold: 7.0

Error Message Quality

Helpfulness (0.4): Does the message help the user fix the problem?
  1=no help at all  5=vague direction  10=exact fix steps
Clarity (0.3): Is the message easy to understand?
  1=cryptic  5=understandable  10=immediately clear
Tone (0.2): Is the tone empathetic and non-blaming?
  1=hostile/blaming  5=neutral  10=empathetic and supportive
Actionability (0.1): Does it suggest a concrete next step?
  1=no suggestion  5=vague suggestion  10=specific actionable step
Threshold: 7.5

Code Readability

Naming (0.3): Are variable/function names descriptive and consistent?
  1=single letters everywhere  5=adequate  10=self-documenting
Structure (0.3): Is the code logically organized?
  1=spaghetti  5=functional  10=elegant and clear
Simplicity (0.2): Is the code as simple as possible (but not simpler)?
  1=over-engineered  5=reasonable  10=minimal and clear
Documentation (0.2): Are complex sections adequately commented?
  1=no comments where needed  5=some comments  10=well-documented why
Threshold: 7.0

UX Copy

Clarity (0.3): Is the copy easy to understand?
  1=confusing  5=understandable  10=immediately clear
Brevity (0.2): Is it concise without losing meaning?
  1=verbose  5=adequate length  10=perfectly concise
Tone (0.2): Does it match the brand voice?
  1=off-brand  5=neutral  10=perfectly on-brand
Actionability (0.2): Do CTAs clearly communicate what happens next?
  1=unclear  5=adequate  10=crystal clear action
Accessibility (0.1): Is the language inclusive and jargon-free?
  1=exclusionary  5=neutral  10=fully inclusive
Threshold: 7.5

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Is Wrong | Correct Approach |
| --- | --- | --- |
| Using LLM-as-judge for measurable criteria | Wastes tokens, less reliable than assertions | Use deterministic tests for anything quantifiable |
| Vague rubric dimensions ("is it good?") | Produces unreliable, inconsistent scores | Specific dimensions with anchored examples |
| No passing threshold defined | No way to determine pass/fail objectively | Always set threshold before evaluation |
| Adjusting rubric to pass failing content | Defeats the purpose of quality gates | Fix the content, not the rubric |
| Single evaluation without reasoning | Cannot improve without understanding why | Always require per-dimension reasoning |
| Using weaker model for evaluation | Lower quality judgments | Use strongest available model (Opus) |
| Skipping re-evaluation after changes | No verification that changes improved quality | Always re-evaluate after revisions |

Integration Points

| Skill | Relationship |
| --- | --- |
| acceptance-testing | LLM-as-judge handles subjective acceptance criteria |
| spec-writing | Specs may include subjective quality criteria |
| code-review | Readability evaluation during code review |
| verification-before-completion | Subjective validation gate before completion |
| senior-prompt-engineer | Prompt quality evaluation uses LLM-as-judge |
| tech-docs-generator | Documentation quality evaluation |

Downstream Steering Pattern

+----------+     +----------+     +----------+     +----------+
|  SPECS   |---->|   CODE   |---->|  TESTS   |---->| LLM-AS-  |
|          |     |          |     |(determin)|     |  JUDGE   |
|          |     |          |     |          |     |(subject) |
+----------+     +----------+     +----------+     +----+-----+
                      ^                                  |
                      |          backpressure             |
                      +----------------------------------+
Deterministic tests validate objective criteria. LLM-as-judge validates subjective criteria. Both must pass.
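The "both must pass" rule at the end of the pipeline reduces to a conjunction. A minimal sketch, assuming a list of deterministic test results and a judge response shaped like the Review Response Structure above (`qualityGate` is an illustrative helper):

```javascript
// Ship only when deterministic tests AND the subjective judge both pass.
// deterministicResults: [{ name, passed }], judgeResponse: { pass, ... }
function qualityGate(deterministicResults, judgeResponse) {
  const testsPass = deterministicResults.every(r => r.passed);
  return testsPass && judgeResponse.pass;
}
```

A judge pass cannot compensate for a failing assertion, and vice versa; either failure feeds back upstream as the backpressure arrow in the diagram.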

Skill Type

FLEXIBLE — Adapt rubric dimensions and thresholds to context. The pattern structure (define rubric, evaluate, score, iterate) is fixed. Always set the threshold before evaluation, never after.