LLM-as-Judge
Overview
Some quality criteria are inherently subjective — tone of voice, visual aesthetics, UX feel, documentation clarity, code readability. These cannot be verified by deterministic tests. The LLM-as-judge pattern provides structured, repeatable evaluation using an LLM reviewer with defined rubrics, ensuring subjective quality is measured consistently.
Announce at start: "I'm using the llm-as-judge skill to evaluate subjective quality."
Phase 1: Determine Evaluation Method
Goal: Decide whether LLM-as-judge is the right tool.
Decision Table: LLM-as-Judge vs Deterministic Tests
| Criterion Type | Method | Example |
|---|---|---|
| Objective, measurable | Deterministic test | "Response time < 200ms" |
| Structural, verifiable | Deterministic test | "Returns valid JSON" |
| Subjective, qualitative | LLM-as-judge | "Error messages are friendly and helpful" |
| Aesthetic, perceptual | LLM-as-judge | "UI feels clean and modern" |
| Linguistic, tonal | LLM-as-judge | "Documentation is clear for beginners" |
| Holistic, experiential | LLM-as-judge | "The onboarding flow feels intuitive" |
Rule of thumb: If you can write a boolean assertion, use a deterministic test. If evaluation requires judgment, use LLM-as-judge.
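The rule of thumb can be illustrated with a minimal sketch. The `judgeRequest` shape mirrors the review request structure defined later in this skill; the measurement value is illustrative only:

```javascript
// Objective, measurable: a plain boolean assertion suffices.
const responseTimeMs = 142; // illustrative measurement
console.assert(responseTimeMs < 200, "response too slow");

// Structural, verifiable: still deterministic.
function isValidJson(text) {
  try { JSON.parse(text); return true; } catch { return false; }
}

// Subjective, qualitative: no boolean assertion exists, so hand the
// artifact to an LLM reviewer with an explicit criterion instead.
const judgeRequest = {
  criteria: "Error messages are friendly and helpful",
  artifact: "Error: something went wrong (code 500)",
};
```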
STOP — Do NOT proceed to Phase 2 until:
- Confirmed that criteria are genuinely subjective
- Deterministic testing has been ruled out
- Specific artifacts to evaluate are identified
Phase 2: Define Rubric
Goal: Create evaluation dimensions with weights and anchor points.
Actions
- Define 3-5 evaluation dimensions
- Assign weights (must sum to 1.0)
- Define anchor points for each dimension (1=worst, 5=adequate, 10=best)
- Set passing threshold
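The actions above can be folded into a small validation helper that enforces the 3-5 dimension count and the weight sum. A minimal sketch; the dimension names and anchor texts are illustrative only:

```javascript
// Illustrative rubric: 4 dimensions, weights summing to 1.0.
const rubric = [
  { dimension: "Clarity",      weight: 0.3, anchors: { 1: "incomprehensible", 5: "adequate", 10: "crystal clear" } },
  { dimension: "Tone",         weight: 0.3, anchors: { 1: "hostile", 5: "neutral", 10: "empathetic" } },
  { dimension: "Completeness", weight: 0.2, anchors: { 1: "missing critical info", 5: "covers basics", 10: "comprehensive" } },
  { dimension: "Engagement",   weight: 0.2, anchors: { 1: "dry", 5: "readable", 10: "compelling" } },
];

function validateRubric(rubric) {
  if (rubric.length < 3 || rubric.length > 5) {
    throw new Error("use 3-5 dimensions");
  }
  const total = rubric.reduce((sum, d) => sum + d.weight, 0);
  // Tolerance absorbs floating-point noise when checking the 1.0 sum.
  if (Math.abs(total - 1.0) > 1e-9) {
    throw new Error(`weights sum to ${total}, expected 1.0`);
  }
  return true;
}

validateRubric(rubric); // throws if the rubric is malformed
```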
Rubric Structure
| Dimension | Weight | Scale | Anchor: 1 | Anchor: 5 | Anchor: 10 |
|---|---|---|---|---|---|
| [Name] | 0.X | 1-10 | [Worst case] | [Adequate] | [Excellent] |
Threshold Selection Table
| Quality Level | Threshold | Use For |
|---|---|---|
| Minimum viable | 5.0 | Internal docs, draft content |
| Production quality | 7.0 | User-facing content, public APIs |
| Excellence | 8.5 | Marketing, critical UX flows |
STOP — Do NOT proceed to Phase 3 until:
- 3-5 dimensions are defined with clear descriptions
- Weights sum to exactly 1.0
- Anchor points are specific (not vague)
- Passing threshold is set before evaluation
Phase 3: Evaluate
Goal: Submit the artifact with rubric to an LLM reviewer.
Review Request Structure
```javascript
{
  criteria: "Description of what to evaluate and the quality standard",
  artifact: "The content to be evaluated (code, text, UI markup, etc.)",
  rubric: [
    { dimension: "Clarity", weight: 0.3, description: "Is the content easy to understand?" },
    { dimension: "Tone", weight: 0.3, description: "Is the tone appropriate for the audience?" },
    { dimension: "Completeness", weight: 0.2, description: "Does it cover all necessary points?" },
    { dimension: "Engagement", weight: 0.2, description: "Does it hold the reader's interest?" }
  ],
  passing_threshold: 7.0,
  intelligence: "opus"
}
```
Review Response Structure
```javascript
{
  scores: [
    { dimension: "Clarity", score: 8, reasoning: "Well-structured with clear headings..." },
    { dimension: "Tone", score: 7, reasoning: "Professional but occasionally too formal..." },
    { dimension: "Completeness", score: 9, reasoning: "Covers all key topics..." },
    { dimension: "Engagement", score: 6, reasoning: "Could use more examples..." }
  ],
  weighted_score: 7.5,
  pass: true,
  summary: "Overall good quality with minor tone and engagement improvements suggested.",
  suggestions: [
    "Add a real-world example in section 3",
    "Use more conversational language in the introduction"
  ]
}
```
STOP — Do NOT proceed to Phase 4 until:
- Artifact has been submitted with full rubric
- Each dimension has been scored independently
- Reasoning is provided for every score
- Weighted total is calculated
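The weighted total is just the sum of each dimension's score times its weight; with the example scores and weights shown above it works out to 8·0.3 + 7·0.3 + 9·0.2 + 6·0.2 = 7.5. A minimal sketch:

```javascript
// Compute the weighted score from per-dimension results.
// Weights come from the rubric; scores come from the judge's response.
function weightedScore(scores, weights) {
  const total = scores.reduce(
    (sum, s) => sum + s.score * weights[s.dimension], 0);
  return Math.round(total * 10) / 10; // one decimal, as in the response
}

const weights = { Clarity: 0.3, Tone: 0.3, Completeness: 0.2, Engagement: 0.2 };
const scores = [
  { dimension: "Clarity", score: 8 },
  { dimension: "Tone", score: 7 },
  { dimension: "Completeness", score: 9 },
  { dimension: "Engagement", score: 6 },
];

const result = weightedScore(scores, weights); // 7.5
const pass = result >= 7.0; // compare against the pre-set threshold
```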
Phase 4: Iterate or Accept
Goal: Act on the evaluation results.
Result Action Table
| Result | Action | Max Iterations |
|---|---|---|
| Pass (score >= threshold) | Accept the artifact, proceed | Done |
| Marginal fail (within 1 point) | Apply suggestions, re-evaluate once | 1 |
| Clear fail (> 1 point below) | Significant revision, apply all suggestions | 2 |
| Repeated fail (3+ attempts) | Escalate — rubric or approach may need adjustment | Escalate |
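The action table above reduces to a single decision function. A minimal sketch; the action labels are illustrative names, not part of any defined API:

```javascript
// Decide the next action from a judge result, following the table above.
function nextAction(score, threshold, attempts) {
  if (score >= threshold) return "accept";        // pass: accept and proceed
  if (attempts >= 3) return "escalate";           // rubric or approach may need adjustment
  if (threshold - score <= 1) return "revise-minor"; // apply suggestions, re-evaluate once
  return "revise-major";                          // significant revision, apply all suggestions
}
```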
STOP — Evaluation complete when:
- Artifact passes threshold, OR
- 3 iterations completed and escalation decision made
Common Rubric Templates
Documentation Quality
Clarity (0.3): Is the content easy to understand for the target audience?
1=incomprehensible 5=adequate but requires re-reading 10=crystal clear
Accuracy (0.3): Is the information technically correct?
1=factually wrong 5=mostly correct 10=perfectly accurate
Completeness (0.2): Does it cover all necessary topics?
1=missing critical info 5=covers basics 10=comprehensive
Examples (0.2): Are there sufficient, relevant code examples?
1=no examples 5=some examples 10=rich, varied examples
Threshold: 7.0
Error Message Quality
Helpfulness (0.4): Does the message help the user fix the problem?
1=no help at all 5=vague direction 10=exact fix steps
Clarity (0.3): Is the message easy to understand?
1=cryptic 5=understandable 10=immediately clear
Tone (0.2): Is the tone empathetic and non-blaming?
1=hostile/blaming 5=neutral 10=empathetic and supportive
Actionability (0.1): Does it suggest a concrete next step?
1=no suggestion 5=vague suggestion 10=specific actionable step
Threshold: 7.5
Code Readability
Naming (0.3): Are variable/function names descriptive and consistent?
1=single letters everywhere 5=adequate 10=self-documenting
Structure (0.3): Is the code logically organized?
1=spaghetti 5=functional 10=elegant and clear
Simplicity (0.2): Is the code as simple as possible (but not simpler)?
1=over-engineered 5=reasonable 10=minimal and clear
Documentation (0.2): Are complex sections adequately commented?
1=no comments where needed 5=some comments 10=well-documented why
Threshold: 7.0
UX Copy
Clarity (0.3): Is the copy easy to understand?
1=confusing 5=understandable 10=immediately clear
Brevity (0.2): Is it concise without losing meaning?
1=verbose 5=adequate length 10=perfectly concise
Tone (0.2): Does it match the brand voice?
1=off-brand 5=neutral 10=perfectly on-brand
Actionability (0.2): Do CTAs clearly communicate what happens next?
1=unclear 5=adequate 10=crystal clear action
Accessibility (0.1): Is the language inclusive and jargon-free?
1=exclusionary 5=neutral 10=fully inclusive
Threshold: 7.5
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Is Wrong | Correct Approach |
|---|---|---|
| Using LLM-as-judge for measurable criteria | Wastes tokens, less reliable than assertions | Use deterministic tests for anything quantifiable |
| Vague rubric dimensions ("is it good?") | Produces unreliable, inconsistent scores | Specific dimensions with anchored examples |
| No passing threshold defined | No way to determine pass/fail objectively | Always set threshold before evaluation |
| Adjusting rubric to pass failing content | Defeats the purpose of quality gates | Fix the content, not the rubric |
| Single evaluation without reasoning | Cannot improve without understanding why | Always require per-dimension reasoning |
| Using weaker model for evaluation | Lower quality judgments | Use strongest available model (Opus) |
| Skipping re-evaluation after changes | No verification that changes improved quality | Always re-evaluate after revisions |
Integration Points
| Skill | Relationship |
|---|---|
| | LLM-as-judge handles subjective acceptance criteria |
| | Specs may include subjective quality criteria |
| | Readability evaluation during code review |
| | Subjective validation gate before completion |
| | Prompt quality evaluation uses LLM-as-judge |
| | Documentation quality evaluation |
Downstream Steering Pattern
```
+----------+     +----------+     +----------+     +----------+
|  SPECS   |---->|   CODE   |---->|  TESTS   |---->| LLM-AS-  |
|          |     |          |     |(determin)|     |  JUDGE   |
|          |     |          |     |          |     |(subject) |
+----------+     +----------+     +----------+     +----+-----+
     ^                                                  |
     |                   backpressure                   |
     +--------------------------------------------------+
```
Deterministic tests validate objective criteria. LLM-as-judge validates subjective criteria. Both must pass.
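The "both must pass" rule at the end of the pipeline can be expressed directly. A minimal sketch; the result shapes are assumptions modeled on the review response structure above:

```javascript
// Final quality gate: deterministic tests AND the LLM judge must both pass.
// deterministicResults: [{ pass: boolean }, ...] from the test suite.
// judgeResult: { weighted_score: number } from the LLM reviewer.
function qualityGate(deterministicResults, judgeResult, threshold) {
  const testsPass = deterministicResults.every(r => r.pass);
  const judgePasses = judgeResult.weighted_score >= threshold;
  return { pass: testsPass && judgePasses, testsPass, judgePasses };
}
```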
Skill Type
FLEXIBLE — Adapt rubric dimensions and thresholds to context. The pattern structure (define rubric, evaluate, score, iterate) is fixed. Always set the threshold before evaluation, never after.