ai-scoring
Build an AI Scorer
Guide the user through building AI that scores, grades, or evaluates work against defined criteria. The pattern: define a rubric, score each criterion independently, calibrate with examples, and validate scorer quality.
Step 1: Define the rubric
Ask the user:
- What are you scoring? (essays, code, support responses, applications, etc.)
- What criteria matter? (clarity, accuracy, completeness, tone, security, etc.)
- What's the scale? (1-5, 1-10, pass/fail, letter grade)
- Are criteria weighted equally? (e.g., accuracy 50%, clarity 30%, formatting 20%)
A good rubric has:
- 3-7 criteria — more than that and scorers lose focus
- Clear scale anchors — what does a "2" vs a "4" look like?
- Observable evidence — criteria should reference things you can point to, not vibes
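Before any AI is involved, the rubric can live as plain data. A minimal sketch of weighted aggregation, where the criterion names and weights are illustrative placeholders, not values this guide prescribes:

```python
# Illustrative rubric: criterion names and weights are example values only.
RUBRIC_WEIGHTS = {
    "accuracy": 0.5,
    "clarity": 0.3,
    "formatting": 0.2,
}

def weighted_overall(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (1-5) into a weighted overall score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return round(sum(scores[name] * w for name, w in weights.items()), 2)
```

For example, `weighted_overall({"accuracy": 4, "clarity": 3, "formatting": 5}, RUBRIC_WEIGHTS)` returns 3.9: a strong accuracy score dominates because it carries half the weight.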
Step 2: Build the scoring signature
```python
import dspy
from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    criterion: str = Field(description="Name of the criterion being scored")
    score: int = Field(ge=1, le=5, description="Score from 1 (poor) to 5 (excellent)")
    justification: str = Field(description="Evidence from the input that supports this score")


class ScoringResult(BaseModel):
    criterion_scores: list[CriterionScore] = Field(description="Score for each criterion")
    overall_score: float = Field(ge=1.0, le=5.0, description="Weighted overall score")
    summary: str = Field(description="Brief overall assessment")
```

Define what's being scored and the criteria:
```python
CRITERIA = [
    "clarity: Is the writing clear and easy to follow? (1=confusing, 5=crystal clear)",
    "argument: Is the argument well-structured and logical? (1=no structure, 5=compelling)",
    "evidence: Does the writing cite relevant evidence? (1=no evidence, 5=strong support)",
]


class ScoreCriterion(dspy.Signature):
    """Score the submission on a single criterion. Be specific — cite evidence from the text."""

    submission: str = dspy.InputField(desc="The work being evaluated")
    criterion: str = dspy.InputField(desc="The criterion to score, including scale description")
    score: int = dspy.OutputField(desc="Score from 1 to 5")
    justification: str = dspy.OutputField(desc="Specific evidence from the submission supporting this score")
```
Step 3: Score per criterion independently
Scoring all criteria at once causes "halo effect" — a strong first impression biases all scores. Instead, score each criterion in its own call:
```python
class RubricScorer(dspy.Module):
    def __init__(self, criteria: list[str], weights: list[float] = None):
        super().__init__()
        self.criteria = criteria
        self.weights = weights or [1.0 / len(criteria)] * len(criteria)
        self.score_criterion = dspy.ChainOfThought(ScoreCriterion)

    def forward(self, submission: str):
        criterion_scores = []
        for criterion in self.criteria:
            result = self.score_criterion(
                submission=submission,
                criterion=criterion,
            )
            dspy.Assert(
                1 <= result.score <= 5,
                f"Score must be 1-5, got {result.score}"
            )
            dspy.Assert(
                len(result.justification) > 20,
                "Justification must cite specific evidence from the submission"
            )
            criterion_scores.append(CriterionScore(
                criterion=criterion.split(":")[0],
                score=result.score,
                justification=result.justification,
            ))
        overall = sum(
            cs.score * w for cs, w in zip(criterion_scores, self.weights)
        )
        return dspy.Prediction(
            criterion_scores=criterion_scores,
            overall_score=round(overall, 2),
        )
```

Using ChainOfThought here is important — reasoning through the evidence before assigning a score produces more calibrated results than jumping straight to a number.
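Wiring the scorer up takes one configuration call plus a model choice. In this sketch the model name is an arbitrary placeholder, and `dspy.configure` assumes credentials are already set for whichever provider you pick:

```python
# Wiring sketch: the model name is a placeholder; any dspy-supported LM works.
# Assumes RubricScorer and CRITERIA are defined as above.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

scorer = RubricScorer(CRITERIA)
result = scorer(submission="The essay text to be scored...")
for cs in result.criterion_scores:
    print(cs.criterion, cs.score, "-", cs.justification)
```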
Step 4: Calibrate with anchor examples
Without anchors, the scorer doesn't know what a "2" vs a "4" looks like. Provide reference examples at each level:
```python
ANCHORS = """
Score 2 example for clarity: "The thing with the data is that it does stuff and the results are what they are."
→ Vague language, no specific referents, reader can't follow what's being described.

Score 4 example for clarity: "The customer churn model reduced false positives by 30% compared to the rule-based approach, though it still struggles with seasonal patterns."
→ Specific claims with numbers, clear comparison, one caveat noted.
"""


class ScoreCriterionCalibrated(dspy.Signature):
    """Score the submission on a single criterion. Use the anchor examples to calibrate your scoring."""

    submission: str = dspy.InputField(desc="The work being evaluated")
    criterion: str = dspy.InputField(desc="The criterion to score, including scale description")
    anchors: str = dspy.InputField(desc="Reference examples showing what different score levels look like")
    score: int = dspy.OutputField(desc="Score from 1 to 5")
    justification: str = dspy.OutputField(desc="Specific evidence from the submission supporting this score")
```

Then pass anchors per criterion:
```python
class CalibratedScorer(dspy.Module):
    def __init__(self, criteria: list[str], anchors: dict[str, str], weights: list[float] = None):
        super().__init__()
        self.criteria = criteria
        self.anchors = anchors
        self.weights = weights or [1.0 / len(criteria)] * len(criteria)
        self.score_criterion = dspy.ChainOfThought(ScoreCriterionCalibrated)

    def forward(self, submission: str):
        criterion_scores = []
        for criterion in self.criteria:
            criterion_name = criterion.split(":")[0]
            result = self.score_criterion(
                submission=submission,
                criterion=criterion,
                anchors=self.anchors.get(criterion_name, "No anchors provided."),
            )
            dspy.Assert(1 <= result.score <= 5, f"Score must be 1-5, got {result.score}")
            criterion_scores.append(CriterionScore(
                criterion=criterion_name,
                score=result.score,
                justification=result.justification,
            ))
        overall = sum(cs.score * w for cs, w in zip(criterion_scores, self.weights))
        return dspy.Prediction(
            criterion_scores=criterion_scores,
            overall_score=round(overall, 2),
        )
```

Writing good anchors takes effort, but it's the single biggest lever for scoring quality. Start with 2-3 anchors per criterion at the low, mid, and high ends of the scale.
Step 5: Handle edge cases
Validate score consistency
The overall score should be consistent with per-criterion scores:
```python
def validate_scores(criterion_scores, weights, overall_score):
    expected = sum(cs.score * w for cs, w in zip(criterion_scores, weights))
    dspy.Assert(
        abs(expected - overall_score) < 0.1,
        f"Overall score {overall_score} doesn't match weighted criteria ({expected:.2f})"
    )
```
Handle "not applicable" criteria
Some criteria don't apply to every submission:
```python
class CriterionScoreOptional(BaseModel):
    criterion: str
    score: int = Field(ge=0, le=5, description="Score 1-5, or 0 if not applicable")
    justification: str
    applicable: bool = Field(description="Whether this criterion applies to this submission")
```
Score ranges for pass/fail decisions
```python
def pass_fail(overall_score: float, threshold: float = 3.0) -> str:
    if overall_score >= threshold:
        return "pass"
    return "fail"
```

Or with a "needs review" band:
```python
def tiered_decision(overall_score: float) -> str:
    if overall_score >= 4.0:
        return "pass"
    elif overall_score >= 2.5:
        return "needs_review"
    return "fail"
```
Step 6: Multi-rater ensemble
For high-stakes scoring, run multiple independent scorers and flag disagreements:
```python
class EnsembleScorer(dspy.Module):
    def __init__(self, criteria, anchors, num_raters=3, weights=None):
        super().__init__()
        self.raters = [
            CalibratedScorer(criteria, anchors, weights)
            for _ in range(num_raters)
        ]

    def forward(self, submission: str):
        all_results = [rater(submission=submission) for rater in self.raters]

        # Check for disagreement per criterion
        flagged = []
        for i, criterion in enumerate(self.raters[0].criteria):
            criterion_name = criterion.split(":")[0]
            scores = [r.criterion_scores[i].score for r in all_results]
            spread = max(scores) - min(scores)
            if spread > 1:
                flagged.append({
                    "criterion": criterion_name,
                    "scores": scores,
                    "spread": spread,
                })

        # Average the overall scores
        avg_overall = sum(r.overall_score for r in all_results) / len(all_results)
        return dspy.Prediction(
            overall_score=round(avg_overall, 2),
            all_results=all_results,
            flagged_disagreements=flagged,
            needs_human_review=len(flagged) > 0,
        )
```

When raters disagree by more than 1 point on any criterion, flag it for human review. This catches the submissions that are genuinely ambiguous — exactly where human judgment matters most.
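The disagreement check itself is pure arithmetic and can be exercised without any LM calls. A standalone mirror of the same spread rule, shown here only to make the flagging behavior concrete:

```python
# Standalone version of the ensemble's disagreement rule: flag any criterion
# where the gap between the highest and lowest rater score exceeds 1 point.
def flag_disagreements(scores_by_criterion: dict[str, list[int]]) -> list[str]:
    return [
        name for name, scores in scores_by_criterion.items()
        if max(scores) - min(scores) > 1
    ]
```

With raters at [2, 4, 3] on evidence the spread is 2, so that criterion gets flagged; [4, 4, 5] on clarity stays within tolerance.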
Step 7: Evaluate scorer quality
Prepare gold-standard scores
You need human-scored examples to evaluate your AI scorer:
```python
scored_examples = [
    dspy.Example(
        submission="...",
        gold_scores={"clarity": 4, "argument": 3, "evidence": 5},
        gold_overall=4.0,
    ).with_inputs("submission"),
    # 20-50+ scored examples
]
```
]需要人工评分的示例来评估AI评分器:
python
scored_examples = [
dspy.Example(
submission="...",
gold_scores={"clarity": 4, "argument": 3, "evidence": 5},
gold_overall=4.0,
).with_inputs("submission"),
# 20-50+ scored examples
]Mean absolute error metric
```python
def scoring_metric(example, prediction, trace=None):
    """Measures how close AI scores are to human gold scores."""
    errors = []
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None:
            errors.append(abs(cs.score - gold))
    if not errors:
        return 0.0
    mae = sum(errors) / len(errors)
    # Convert to 0-1 scale (0 error = 1.0, 4 error = 0.0)
    return max(0.0, 1.0 - mae / 4.0)
```
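The error-to-score conversion at the heart of that metric can be sanity-checked in isolation. This standalone mirror of the arithmetic is an illustration, not part of the guide's pipeline:

```python
# Mirrors scoring_metric's MAE-to-[0,1] conversion: 0 error → 1.0, 4 error → 0.0.
def mae_to_metric(errors: list[int]) -> float:
    if not errors:
        return 0.0
    mae = sum(errors) / len(errors)
    return max(0.0, 1.0 - mae / 4.0)
```

A perfect scorer lands at 1.0, being off by 1 point everywhere gives 0.75, and an empty criterion list scores 0.0, which deliberately treats "nothing scored" as failure rather than success.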
Agreement rate metric
```python
def agreement_metric(example, prediction, trace=None):
    """Score is 1.0 if all criteria are within 1 point of gold."""
    for cs in prediction.criterion_scores:
        gold = example.gold_scores.get(cs.criterion)
        if gold is not None and abs(cs.score - gold) > 1:
            return 0.0
    return 1.0
```
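The within-1-point rule is easy to test on plain dicts before plugging it into an evaluator. A standalone mirror of the same logic, shown only for illustration:

```python
# Standalone mirror of agreement_metric: 1.0 only if every AI score is
# within 1 point of the human gold score for that criterion.
def within_one(ai_scores: dict[str, int], gold_scores: dict[str, int]) -> float:
    for name, score in ai_scores.items():
        gold = gold_scores.get(name)
        if gold is not None and abs(score - gold) > 1:
            return 0.0
    return 1.0
```

Note this is all-or-nothing: a single 2-point miss zeroes the example, which makes the metric strict but easy to interpret as an agreement rate across the devset.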
Optimize the scorer
```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(devset=scored_examples, metric=scoring_metric, num_threads=4)
baseline = evaluator(scorer)

# trainset: a separate split of human-scored examples, disjoint from the devset
optimizer = dspy.MIPROv2(metric=scoring_metric, auto="medium")
optimized_scorer = optimizer.compile(scorer, trainset=trainset)
optimized_score = evaluator(optimized_scorer)

# Evaluate reports the metric as a percentage (higher is better), not raw MAE
print(f"Baseline score: {baseline:.1f}%")
print(f"Optimized score: {optimized_score:.1f}%")
```
Key patterns
- Score per criterion independently — prevents halo effect where one strong dimension inflates all scores
- Use anchor examples — the single biggest lever for calibration quality
- ChainOfThought for scoring — reasoning before scoring produces better-calibrated results
- Require justifications — forces the scorer to cite evidence, catches lazy scoring
- Multi-rater for high stakes — flag disagreements for human review
- Validate consistency — overall score should match weighted criterion scores
- Pydantic for structure — `Field(ge=1, le=5)` enforces valid score ranges automatically
Additional resources
- For worked examples (essay grading, code review, support QA), see examples.md
- Need discrete categories instead of scores? Use /ai-sorting
- Need to validate AI output (not score human work)? Use /ai-checking-outputs
- Need to improve scorer accuracy? Use /ai-improving-accuracy
- Next: use /ai-improving-accuracy to measure and optimize your scorer