write-judge-prompt

Write LLM-as-Judge Prompt

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

Prerequisites

  • Error analysis is complete. The failure mode is identified.
  • You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
  • A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls when you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" could work quite well.
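The keyword check from the interviewing-coach example might look like the sketch below. The function name and the exact pattern are illustrative assumptions; "typically" is added alongside the three words named above, and a real list would come from reading your own traces.

```python
import re

# Hedge words from the example above, plus "typically" (an assumption of this sketch).
GENERAL_QUESTION_PATTERN = re.compile(
    r"\b(usually|typical|typically|normally)\b", re.IGNORECASE
)


def is_general_question(question: str) -> bool:
    """Return True when the question asks about typical behavior rather than a specific event."""
    return GENERAL_QUESTION_PATTERN.search(question) is not None


print(is_general_question("How do you usually handle conflict?"))
print(is_general_question("Tell me about the last time you disagreed with a teammate."))
```

If a check like this agrees with your human labels, no judge is needed for that failure mode.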

The Four Components

Every judge prompt requires exactly four components:

1. Task and Evaluation Criterion

State what the judge evaluates. One failure mode per judge.
You are an evaluator assessing whether a real estate assistant's email
uses the appropriate tone for the client's persona.
Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."

2. Pass/Fail Definitions

Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and what constitutes Fail, drawing both definitions from the failure mode descriptions written during error analysis.

Definitions

PASS: The email matches the expected communication style for the client persona:
  • Luxury Buyers: formal language, emphasis on exclusive features, premium market positioning, no casual slang
  • First-Time Homebuyers: warm and encouraging tone, educational explanations, avoids jargon, patient and supportive
  • Investors: data-driven language, ROI-focused, market analytics, concise and professional
FAIL: The email uses a tone mismatched to the client persona. Examples:
  • Using casual slang ("hey, check out this pad!") for a luxury buyer
  • Using heavy financial jargon for a first-time homebuyer
  • Using overly emotional language for an investor

3. Few-Shot Examples

Include labeled Pass and Fail examples from your human-labeled data.

Examples

Example 1: PASS

Client Persona: Luxury Buyer
Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing at 1200 Pacific Heights Drive. This distinguished property features..."
Critique: The email opens with a formal salutation and uses language consistent with luxury positioning: "exclusive listing," "distinguished property." No casual slang or informal phrasing. The tone matches the luxury buyer persona throughout.
Result: Pass

Example 2: FAIL

Client Persona: Luxury Buyer
Email: "Hey! Just found this awesome place you might like. It's got a pool and stuff, super cool neighborhood..."
Critique: The greeting "Hey!" is informal. Phrases like "awesome place," "got a pool and stuff," and "super cool" are casual slang inappropriate for a luxury buyer. The email reads like a text message, not a professional communication for a high-end client.
Result: Fail

Example 3: PASS (borderline)

Client Persona: First-Time Homebuyer
Email: "Hi Sarah, I found a property that might be a great fit for your first home. The neighborhood has good schools nearby, and the monthly payment would be similar to what you're currently paying in rent..."
Critique: The greeting is warm but not overly casual. The email explains the property in relatable terms (comparing mortgage to rent, mentioning schools), which is educational without being condescending. It avoids jargon like "amortization" or "LTV ratio." While not deeply technical, this matches the supportive tone expected for a first-time buyer.
Result: Pass

Rules for selecting examples:
  • Include at least one clear Pass, one clear Fail, and one borderline case. Borderline examples are the most valuable because they teach nuance.
  • Draw examples from the training split (10-20% of labeled data set aside for this purpose).
  • Any example used in the judge prompt must be excluded from dev and test sets. Using dev/test examples is data leakage.
  • 2-4 examples is typical. Performance plateaus after 4-8.
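One way to keep few-shot examples out of the dev and test sets is to split the labeled traces once, up front, and only ever pick examples from the training portion. The sketch below assumes traces are dicts with an `id` field and uses a hypothetical helper name. The 15% training fraction follows the 10-20% guidance above, while splitting the remainder evenly between dev and test is an assumption of this sketch.

```python
import random


def split_labeled_traces(traces, train_fraction=0.15, seed=0):
    """Shuffle labeled traces and split them into train / dev / test lists."""
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = max(1, round(len(shuffled) * train_fraction))
    remainder = shuffled[n_train:]
    n_dev = len(remainder) // 2  # assumption: dev and test share the rest evenly
    return shuffled[:n_train], remainder[:n_dev], remainder[n_dev:]


# Hypothetical labeled data: 20 Pass and 20 Fail traces.
traces = [{"id": i, "label": "Pass" if i % 2 == 0 else "Fail"} for i in range(40)]
train, dev, test = split_labeled_traces(traces)

# Few-shot examples may only come from `train`; this guards against leakage.
few_shot_ids = {t["id"] for t in train}
eval_ids = {t["id"] for t in dev + test}
assert few_shot_ids.isdisjoint(eval_ids)
print(len(train), len(dev), len(test))
```

A random split can leave the small training portion without a borderline case, so you may still need to hand-pick which training traces go into the prompt.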

4. Structured Output Format

Enforce structured output using your LLM provider's schema enforcement (e.g., response_format in OpenAI, tool definitions in Anthropic) or a library like Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.
The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.
```json
{
  "critique": "string — detailed assessment of the output against the criterion",
  "result": "Pass or Fail"
}
```
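When the provider cannot enforce the schema, the judge's reply can be checked after the fact. The following is a minimal sketch using only the standard library; `parse_judge_output` is a hypothetical helper, and rejecting replies where the verdict precedes the critique is one possible way to enforce the ordering rule.

```python
import json

VALID_RESULTS = ("Pass", "Fail")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and reject anything outside the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    critique = data.get("critique")
    result = data.get("result")
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("critique must be a non-empty string")
    if result not in VALID_RESULTS:
        raise ValueError(f"result must be Pass or Fail, got {result!r}")
    keys = list(data)  # json.loads preserves the order the judge wrote the keys in
    if keys.index("critique") > keys.index("result"):
        raise ValueError("critique must come before result")
    return {"critique": critique, "result": result}


reply = '{"critique": "Formal salutation, no slang.", "result": "Pass"}'
print(parse_judge_output(reply)["result"])
```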
Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
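Putting the four components together, a judge prompt could be assembled along these lines. This is a sketch, not a prescribed template: `build_judge_prompt`, the example dict keys, and the wording of the output instruction are all assumptions made for illustration.

```python
def build_judge_prompt(task, pass_definition, fail_definition, examples):
    """Assemble task, definitions, few-shot examples, and output format into one prompt."""
    parts = [
        task,
        "",
        "Definitions",
        f"PASS: {pass_definition}",
        f"FAIL: {fail_definition}",
        "",
        "Examples",
    ]
    for number, example in enumerate(examples, start=1):
        parts.append(f"Example {number}: {example['result'].upper()}")
        parts.append(example["input"])
        # Critique precedes the verdict, mirroring the required output order.
        parts.append(f"Critique: {example['critique']}")
        parts.append(f"Result: {example['result']}")
        parts.append("")
    parts.append(
        'Respond with a JSON object with two keys, in this order: '
        '"critique" (a detailed assessment), then "result" ("Pass" or "Fail").'
    )
    return "\n".join(parts)


prompt = build_judge_prompt(
    task=(
        "You are an evaluator assessing whether a real estate assistant's email "
        "uses the appropriate tone for the client's persona."
    ),
    pass_definition="The email matches the expected communication style for the client persona.",
    fail_definition="The email uses a tone mismatched to the client persona.",
    examples=[
        {
            "input": 'Client Persona: Luxury Buyer\nEmail: "Hey! Just found this awesome place..."',
            "critique": 'The greeting "Hey!" and "awesome place" are casual slang.',
            "result": "Fail",
        }
    ],
)
print(prompt)
```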

Choosing What to Pass to the Judge

Feed only what the judge needs for an accurate decision:
| Failure Mode | What the Judge Needs |
| --- | --- |
| Tone mismatch | Client persona + generated email |
| Answer faithfulness | Retrieved context + generated answer |
| SQL correctness | User query + generated SQL + schema |
| Instruction following | System prompt rules + generated response |
| Tool call justification | Conversation history + tool call + tool result |
For long documents, feed only the relevant snippet, not the entire document.
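The mapping above can be encoded so that each judge receives only its own inputs. In the sketch below the failure-mode keys and trace field names are hypothetical; adapt them to however your traces are actually stored.

```python
# Hypothetical field names; one entry per failure mode listed above.
JUDGE_INPUT_FIELDS = {
    "tone_mismatch": ["client_persona", "generated_email"],
    "answer_faithfulness": ["retrieved_context", "generated_answer"],
    "sql_correctness": ["user_query", "generated_sql", "schema"],
    "instruction_following": ["system_prompt_rules", "generated_response"],
    "tool_call_justification": ["conversation_history", "tool_call", "tool_result"],
}


def select_judge_inputs(failure_mode, trace):
    """Return only the trace fields the judge for this failure mode needs."""
    fields = JUDGE_INPUT_FIELDS[failure_mode]
    missing = [name for name in fields if name not in trace]
    if missing:
        raise KeyError(f"trace is missing fields: {missing}")
    return {name: trace[name] for name in fields}


trace = {
    "client_persona": "Luxury Buyer",
    "generated_email": "Dear Mr. Harrington, ...",
    "retrieved_context": "listing database rows",  # not needed for tone, so dropped
}
print(select_judge_inputs("tone_mismatch", trace))
```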

Model Selection

Start with the most capable model available. The model used for the main task can also serve as the judge, because judging is a different, narrower task. Optimize for cost only after alignment with human labels is confirmed.

Anti-Patterns

  • Vague criteria like "is this helpful?" Target a specific, observable failure mode from error analysis.
  • Holistic judge for the entire trace. A single judge covering multiple dimensions produces unactionable verdicts.
  • No few-shot examples. Without examples, the model won't know what counts as a failure in your application.
  • Dev/test examples used as few-shot. This is data leakage. Use only the training split.
  • Likert scales (1-5, letter grades, etc.). Binary pass/fail only. Likert scales produce scores that sound precise but can't be calibrated: annotators disagree on the difference between a 3 and a 4, and the judge inherits that noise. Binary forces you to define a clear decision boundary upfront, which makes inter-annotator agreement measurable and the judge's errors actionable. If you need to capture severity, use multiple binary judges (e.g., "factually wrong" and "dangerously wrong") rather than one ordinal scale.
  • Skipping validation. Measure alignment with human labels using validate-evaluator before trusting the judge.
  • Judges for specification failures without fixing the prompt first. If the prompt never asked for the behavior, add the instruction before building an evaluator. For critical requirements, a judge can still serve as a regression guard.
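The advice to capture severity with several binary judges rather than one ordinal scale can be sketched as follows. The judge names come from the example above; the severity labels and the convention that "Fail" means the failure mode was detected are assumptions of this sketch.

```python
def severity_from_verdicts(verdicts):
    """Map two binary verdicts onto a severity label.

    `verdicts` holds one "Pass"/"Fail" result per judge; "Fail" means that
    judge detected its failure mode.
    """
    if verdicts["dangerously_wrong"] == "Fail":
        return "dangerous"
    if verdicts["factually_wrong"] == "Fail":
        return "incorrect"
    return "ok"


print(severity_from_verdicts({"factually_wrong": "Fail", "dangerously_wrong": "Pass"}))
```

Each judge keeps its own Pass/Fail definition and can be validated against human labels separately; only the final label combines them.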