Validate Evaluator
Calibrate an LLM judge against human judgment.
Overview
- Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
- Run judge on dev set and measure TPR/TNR
- Iterate on the judge until TPR and TNR > 90% on dev set
- Run once on held-out test set for final TPR/TNR
- Apply bias correction formula to production data
Prerequisites
- A built LLM judge prompt (from write-judge-prompt)
- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
- Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
- Labels must come from a domain expert, not outsourced annotators
- Candidate few-shot examples from your labeled data
Core Instructions
Step 1: Create Data Splits
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)

# Second split: separate training examples from dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)
# Result: ~15% train, ~45% dev, ~40% test
```
Step 2: Run Evaluator on Dev Set
Run the judge on every example in the dev set. Compare predictions to human labels.
Step 3: Measure TPR and TNR
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?

TPR = (judge says Pass AND human says Pass) / (human says Pass)

TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?

TNR = (judge says Fail AND human says Fail) / (human says Fail)

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
                                  labels=['Fail', 'Pass']).ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```

Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
Step 4: Inspect Disagreements
Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
For each disagreement, determine whether to:
- Clarify wording in the judge prompt
- Swap or add few-shot examples from the training set
- Add explicit rules for the edge case
- Split the criterion into more specific sub-checks
Step 5: Iterate
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria:
- Target: TPR > 90% AND TNR > 90%
- Minimum acceptable: TPR > 80% AND TNR > 80%
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
Step 6: Final Measurement on Test Set
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test set results. Go back to step 4 with new dev data if needed.
Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)

Where:
- p_obs = fraction of unlabeled traces the judge scored as Pass
- TPR, TNR = from test set measurement
- theta_hat = corrected estimate of true success rate

Clip theta_hat to [0, 1]. The correction is invalid when TPR + TNR - 1 is near 0 (the judge is no better than random).
Example:
- Judge TPR = 0.92, TNR = 0.88
- 500 production traces: 400 scored Pass -> p_obs = 0.80
- theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = 0.85
- True success rate is ~85%, not the raw 80%
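The formula and the worked example translate directly into a short helper; a sketch (the function name is illustrative), with clipping and the validity check as stated above:

```python
def rogan_gladen(p_obs, tpr, tnr, eps=1e-6):
    """Corrected true success rate from the observed pass rate and judge TPR/TNR."""
    denom = tpr + tnr - 1
    if abs(denom) < eps:
        raise ValueError("judge is no better than random; correction undefined")
    theta_hat = (p_obs + tnr - 1) / denom
    return min(max(theta_hat, 0.0), 1.0)  # clip to [0, 1]

# Worked example from above: p_obs=0.80, TPR=0.92, TNR=0.88 -> ~0.85
print(rogan_gladen(0.80, 0.92, 0.88))
```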
Step 8: Confidence Interval
Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for corrected success rate."""
    n = len(human_labels)
    estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        h = np.array(human_labels)[idx]
        e = np.array(eval_labels)[idx]
        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()
        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1
        if abs(denom) < 1e-6:
            continue
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

Or use judgy (pip install judgy):

```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```

Practical Guidance
- Pin exact model versions for LLM judges (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
- Re-validate after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
- Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
- One trusted domain expert is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
- Improving TPR narrows the confidence interval more than improving TNR. The correction formula divides by TPR, so low TPR amplifies estimation errors into wide CIs.
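The TPR sensitivity claim can be checked numerically: lowering TPR shrinks the denominator TPR + TNR - 1, so the same wobble in p_obs moves theta_hat further. A quick sketch with made-up numbers, not a full bootstrap:

```python
def theta_hat(p_obs, tpr, tnr):
    """Rogan-Gladen point estimate (no clipping, for illustration only)."""
    return (p_obs + tnr - 1) / (tpr + tnr - 1)

# Same +/-0.02 wobble in p_obs, two judges differing only in TPR
for tpr in (0.95, 0.75):
    spread = theta_hat(0.82, tpr, 0.90) - theta_hat(0.78, tpr, 0.90)
    print(f"TPR={tpr}: estimate spread = {spread:.3f}")
# The lower-TPR judge turns the same observation noise into a wider spread
```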
Anti-Patterns
- Assuming judges "just work" without validation. A judge may consistently miss failures or flag passing traces.
- Using raw accuracy or percent agreement. Use TPR and TNR. With class imbalance, raw accuracy is misleading.
- Dev/test examples as few-shot examples. This is data leakage.
- Reporting dev set performance as final accuracy. Dev numbers are optimistic. The test set gives the unbiased estimate.
- Raw judge scores without bias correction. If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
- Point estimates without confidence intervals. A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.