# LLM Judge Skill

Compare code implementations across 2+ repositories using structured evaluation.

## Overview
This skill implements a two-phase LLM-as-judge evaluation:

1. **Phase 1: Fact Gathering** - Parallel agents explore each repo and extract structured facts
2. **Phase 2: Judging** - Parallel judges score each dimension using consistent rubrics
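The two phases above can be sketched as a small orchestration loop. This is a minimal illustration, not the skill's actual implementation: `spawn_agent` is a hypothetical stand-in for whatever mechanism runs a Task agent and returns its text reply.

```python
import json

def evaluate(repos, spec_content, spawn_agent):
    """Two-phase LLM-as-judge pipeline (sketch).

    repos: {repo_label: repo_path}
    spawn_agent(prompt) -> str is a hypothetical stand-in for the
    real Task-agent API; each call blocks until the agent replies.
    """
    # Phase 1: one fact-gathering agent per repo, results keyed by label
    facts = {
        label: json.loads(spawn_agent(
            f"Gather facts for {label} at {path}\n\nSpec:\n{spec_content}"
        ))
        for label, path in repos.items()
    }

    # Phase 2: one judge per scoring dimension, all fed the same fact set
    dimensions = [
        "Functionality", "Security", "Test Quality",
        "Overengineering", "Dead Code",
    ]
    scores = {
        dim: json.loads(spawn_agent(
            f"Score each repo on {dim}\n\nFacts:\n{json.dumps(facts)}"
        ))
        for dim in dimensions
    }
    return facts, scores
```

In the real skill the phases run in parallel; the sequential comprehension here is only to show the data flow (facts from Phase 1 feed every Phase 2 judge).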

## Reference Files
| File | Purpose |
|------|---------|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

## Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|-----------|----------------|-----------|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
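Because these weights are defaults, a run that overrides them should still keep the total at 100. A minimal sketch of that check (the `WEIGHTS` name and `validate_weights` helper are illustrative, not part of the skill's API):

```python
# Default dimension weights (percent); may be overridden per run.
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

def validate_weights(weights):
    """Reject weight sets whose percentages do not sum to 100."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
    return weights
```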

## Scoring Scale
| Score | Meaning |
|-------|---------|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |

## Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:

```
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
```
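Since the template uses `$VARIABLE` placeholders, filling it per repo maps naturally onto `string.Template`. A sketch, assuming the orchestrator substitutes the three variables shown in the template (the `build_repo_prompt` helper is hypothetical):

```python
from string import Template

# Same placeholder names as the prompt template above.
REPO_AGENT_PROMPT = Template("""\
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
""")

def build_repo_prompt(label, path, spec):
    """Fill the Phase 1 template for one repository."""
    return REPO_AGENT_PROMPT.substitute(
        REPO_LABEL=label, REPO_PATH=path, SPEC_CONTENT=spec,
    )
```

`substitute` (rather than `safe_substitute`) raises `KeyError` if a placeholder is left unfilled, which surfaces template drift early.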

## Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):

```
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.
```
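Because judges are instructed to return only valid JSON, the orchestrator can fail fast on anything else. A hedged sketch of such a guard (the `parse_judge_output` helper and its fence-stripping tolerance are illustrative, not part of the skill):

```python
import json

def parse_judge_output(raw: str) -> dict:
    """Parse a judge's reply, rejecting anything that is not valid JSON."""
    text = raw.strip()
    # Tolerate an accidental ```json fence, since models sometimes add one
    # despite the "no markdown" instruction.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
```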

## Aggregation
After Phase 2 completes:

1. Collect scores from all 5 judges
2. For each repo, compute the weighted total:
   `weighted_total = sum(score[dim] * weight[dim]) / 100`
3. Rank repos by weighted total (descending)
4. Generate a verdict explaining the ranking
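Steps 1-3 above reduce to a few lines. A sketch, assuming judge outputs have been collected into `scores[dimension][repo_label]` with 1-5 values (that shape is an assumption for illustration, not the skill's actual output schema):

```python
def aggregate(scores, weights):
    """Compute weighted totals and a descending ranking.

    scores:  {dimension: {repo_label: score_1_to_5}}
    weights: {dimension: weight_percent}  # assumed to sum to 100
    """
    repos = next(iter(scores.values())).keys()
    totals = {
        repo: sum(scores[dim][repo] * weights[dim] for dim in weights) / 100
        for repo in repos
    }
    # Rank repos by weighted total, highest first
    ranking = sorted(totals, key=totals.get, reverse=True)
    return totals, ranking
```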

## Output

Write results to `.beagle/llm-judge-report.json` and display a markdown summary.

## Dependencies

- @beagle:llm-artifacts-detection - reused by repo agents for dead code/overengineering analysis