validate-evaluator


Validate Evaluator


Calibrate an LLM judge against human judgment.

Overview


  1. Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
  2. Run judge on dev set and measure TPR/TNR
  3. Iterate on the judge until TPR and TNR > 90% on dev set
  4. Run once on held-out test set for final TPR/TNR
  5. Apply bias correction formula to production data

Prerequisites


  • A built LLM judge prompt (from write-judge-prompt)
  • Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
    • Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
    • Labels must come from a domain expert, not outsourced annotators
  • Candidate few-shot examples from your labeled data

Core Instructions


Step 1: Create Data Splits


Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data,
    test_size=0.4,
    stratify=labeled_data['label'],
    random_state=42,
)

# Second split: separate training examples from dev set
train, dev = train_test_split(
    train_dev,
    test_size=0.75,
    stratify=train_dev['label'],
    random_state=42,
)

# Result: ~15% train, ~45% dev, ~40% test
```

Step 2: Run Evaluator on Dev Set


Run the judge on every example in the dev set. Compare predictions to human labels.
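A minimal sketch of this step, assuming the judge is callable as a hypothetical `judge(trace) -> 'Pass' | 'Fail'` function and each dev example is a dict with `trace` and `label` keys (both names illustrative):

```python
def run_on_dev(dev_set, judge):
    """Score every dev example and pair each prediction with its human label."""
    return [
        {'trace': ex['trace'], 'human': ex['label'], 'judge': judge(ex['trace'])}
        for ex in dev_set
    ]

# Toy stand-in judge for illustration: fails any trace mentioning "error"
toy_judge = lambda trace: 'Fail' if 'error' in trace else 'Pass'

results = run_on_dev(
    [{'trace': 'ok response', 'label': 'Pass'},
     {'trace': 'error: timeout', 'label': 'Fail'}],
    toy_judge,
)
```

The paired records feed directly into the TPR/TNR computation in the next step.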

Step 3: Measure TPR and TNR


TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)
```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
                                  labels=['Fail', 'Pass']).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.

Step 4: Inspect Disagreements


Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
For each disagreement, determine whether to:
  • Clarify wording in the judge prompt
  • Swap or add few-shot examples from the training set
  • Add explicit rules for the edge case
  • Split the criterion into more specific sub-checks
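The two disagreement types can be surfaced mechanically before the manual review. A minimal sketch, assuming paired records with hypothetical `human` and `judge` keys:

```python
def find_disagreements(results):
    """Tag each judge/human mismatch as a False Pass or a False Fail."""
    out = []
    for ex in results:
        if ex['judge'] != ex['human']:
            kind = 'False Pass' if ex['judge'] == 'Pass' else 'False Fail'
            out.append({**ex, 'type': kind})
    return out

results = [
    {'trace': 't1', 'human': 'Pass', 'judge': 'Pass'},
    {'trace': 't2', 'human': 'Fail', 'judge': 'Pass'},  # judge too lenient
    {'trace': 't3', 'human': 'Pass', 'judge': 'Fail'},  # judge too strict
]
flagged = find_disagreements(results)
```

Review each flagged record by hand before deciding which fix from the table applies.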

Step 5: Iterate


Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria:
  • Target: TPR > 90% AND TNR > 90%
  • Minimum acceptable: TPR > 80% AND TNR > 80%
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
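One round of the loop can be sketched as follows. Here `run_judge(prompt, dev_set)` is a hypothetical callable returning one 'Pass'/'Fail' prediction per dev example; the prompt revision itself remains a manual step:

```python
def refinement_round(prompt, dev_set, run_judge, target=0.90):
    """Run the judge once over the dev set and report TPR, TNR,
    and whether both clear the stopping target."""
    preds = run_judge(prompt, dev_set)
    human = [ex['label'] for ex in dev_set]
    tp = sum(h == 'Pass' and p == 'Pass' for h, p in zip(human, preds))
    fn = sum(h == 'Pass' and p == 'Fail' for h, p in zip(human, preds))
    tn = sum(h == 'Fail' and p == 'Fail' for h, p in zip(human, preds))
    fp = sum(h == 'Fail' and p == 'Pass' for h, p in zip(human, preds))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr, tpr > target and tnr > target

# Toy illustration: predictions come from a fixed answer key
answers = {'t1': 'Pass', 't2': 'Fail', 't3': 'Pass'}
toy_run_judge = lambda prompt, dev: [answers[ex['trace']] for ex in dev]
dev = [{'trace': 't1', 'label': 'Pass'},
       {'trace': 't2', 'label': 'Fail'},
       {'trace': 't3', 'label': 'Fail'}]
tpr, tnr, done = refinement_round("judge prompt v1", dev, toy_run_judge)
# tnr is 0.5 here, so this prompt would need another revision
```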

Step 6: Final Measurement on Test Set


Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test set results. Go back to step 4 with new dev data if needed.

Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)


Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
Where:
  • p_obs = fraction of unlabeled traces the judge scored as Pass
  • TPR, TNR = from test set measurement
  • theta_hat = corrected estimate of true success rate
Clip to [0, 1]. Invalid when TPR + TNR - 1 is near 0 (judge is no better than random).
Example:
  • Judge TPR = 0.92, TNR = 0.88
  • 500 production traces: 400 scored Pass -> p_obs = 0.80
  • theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = 0.85
  • True success rate is ~85%, not the raw 80%
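The worked example above can be checked with a few lines (a direct transcription of the formula, not a library call):

```python
def rogan_gladen(p_obs, tpr, tnr):
    """Rogan-Gladen corrected estimate of the true success rate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if abs(denom) < 1e-6:
        raise ValueError("TPR + TNR - 1 ~ 0: judge is no better than random")
    return min(max((p_obs + tnr - 1) / denom, 0.0), 1.0)

# Example from above: TPR = 0.92, TNR = 0.88, 400 of 500 traces scored Pass
theta_hat = rogan_gladen(p_obs=0.80, tpr=0.92, tnr=0.88)
print(round(theta_hat, 2))  # → 0.85
```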

Step 8: Confidence Interval


Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for corrected success rate."""
    n = len(human_labels)
    estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        h = np.array(human_labels)[idx]
        e = np.array(eval_labels)[idx]

        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()

        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1

        if abs(denom) < 1e-6:
            continue
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))

    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```
Or use `judgy` (`pip install judgy`):
```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```

Practical Guidance


  • Pin exact model versions for LLM judges (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
  • Re-validate after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
  • Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
  • One trusted domain expert is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
  • Improving TPR narrows the confidence interval more than improving TNR. The correction formula's denominator is TPR + TNR - 1, so a low TPR shrinks the denominator and amplifies estimation errors into wide CIs.

Anti-Patterns


  • Assuming judges "just work" without validation. A judge may consistently miss failures or flag passing traces.
  • Using raw accuracy or percent agreement. Use TPR and TNR. With class imbalance, raw accuracy is misleading.
  • Using dev/test examples as few-shot examples in the judge prompt. This is data leakage.
  • Reporting dev set performance as final accuracy. Dev numbers are optimistic. The test set gives the unbiased estimate.
  • Raw judge scores without bias correction. If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
  • Point estimates without confidence intervals. A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.