llm-judge
LLM Judge Skill
Compare code implementations across 2+ repositories using structured evaluation.
Overview
This skill implements a two-phase LLM-as-judge evaluation:
- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
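The two-phase pipeline above can be sketched as a small orchestration loop. This is a minimal illustration, not the skill's actual implementation: `gather_facts` and `judge_dimension` are hypothetical stand-ins for spawning the Task agents described below and parsing their JSON replies.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]

def gather_facts(repo):
    # Placeholder: would spawn a Phase 1 repo agent and parse its JSON reply.
    return {"repo": repo, "facts": {}}

def judge_dimension(dimension, all_facts):
    # Placeholder: would spawn a Phase 2 judge and parse its JSON scores.
    return {"dimension": dimension,
            "scores": {f["repo"]: 3 for f in all_facts}}

def evaluate(repos):
    # Phase 1: explore every repo in parallel.
    with ThreadPoolExecutor() as pool:
        facts = list(pool.map(gather_facts, repos))
    # Phase 2: one judge per dimension, also in parallel.
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(lambda d: judge_dimension(d, facts), DIMENSIONS))
    return verdicts
```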
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Dimensions
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
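One way to encode the default weights (names and structure assumed here for illustration) and sanity-check that they sum to 100%:

```python
# Default dimension weights, in percent; override per evaluation as needed.
DEFAULT_WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}

# Custom weights should preserve this invariant.
assert sum(DEFAULT_WEIGHTS.values()) == 100
```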
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
```
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
```
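Filling in the `$REPO_LABEL`, `$REPO_PATH`, and `$SPEC_CONTENT` placeholders can be done with `string.Template` from the standard library; the values below are hypothetical examples.

```python
from string import Template

REPO_AGENT_PROMPT = Template("""\
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.""")

# Hypothetical values; real values come from the evaluation request.
prompt = REPO_AGENT_PROMPT.substitute(
    REPO_LABEL="repo-a",
    REPO_PATH="/work/repo-a",
    SPEC_CONTENT="(spec text here)",
)
```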
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):
```
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
```
Aggregation
After Phase 2 completes:
- Collect scores from all 5 judges
- For each repo, compute weighted total:
  `weighted_total = sum(score[dim] * weight[dim]) / 100`
- Rank repos by weighted total (descending)
- Generate verdict explaining the ranking
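The aggregation steps above can be sketched as follows; the shape of the judge outputs is assumed here for illustration, not fixed by this document.

```python
def aggregate(judge_outputs, weights):
    """judge_outputs: list of {"dimension": str, "scores": {repo: score 1-5}}.
    weights: {dimension: weight in percent}.
    Returns repos ranked by weighted total, descending."""
    totals = {}
    for out in judge_outputs:
        w = weights[out["dimension"]]
        for repo, score in out["scores"].items():
            totals[repo] = totals.get(repo, 0.0) + score * w
    # Divide by 100 so a perfect 5 in every dimension yields 5.0.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(repo, total / 100) for repo, total in ranked]

# Hypothetical two-dimension, two-repo example:
judges = [
    {"dimension": "Functionality", "scores": {"repo-a": 4, "repo-b": 5}},
    {"dimension": "Security", "scores": {"repo-a": 5, "repo-b": 3}},
]
weights = {"Functionality": 30, "Security": 25}
```
With these inputs, repo-a totals (4·30 + 5·25)/100 = 2.45 and repo-b (5·30 + 3·25)/100 = 2.25, so repo-a ranks first.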
Output
Write results to .beagle/llm-judge-report.json and display a markdown summary.
Dependencies
- @beagle:llm-artifacts-detection - Reused by repo agents for dead code/overengineering analysis