/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

```bash
/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)
```

What It Does


Metric Mode (eval command configured)


Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:

```
RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
```
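The ranking step can be sketched as a small pure function: sort agents by the measured metric in the configured direction and report each delta against a baseline measurement. This is an illustrative sketch, not the actual `result_ranker.py` logic; the function name, the `baseline` argument, and the example's 180 ms baseline are all assumptions.

```python
def rank_results(metrics, baseline, direction="minimize"):
    """Rank agents by metric value; delta is measured against the baseline run.

    metrics   -- dict mapping agent name to its measured metric (e.g. ms)
    baseline  -- the metric measured on the unmodified base branch (assumed input)
    direction -- "minimize" (lower is better) or "maximize"
    """
    reverse = direction == "maximize"
    ranked = sorted(metrics.items(), key=lambda kv: kv[1], reverse=reverse)
    # Attach 1-based ranks and the delta versus the baseline.
    return [
        (rank, agent, value, value - baseline)
        for rank, (agent, value) in enumerate(ranked, start=1)
    ]
```

With the numbers from the table above and an assumed 180 ms baseline, this puts agent-2 first with a -38 ms delta.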

LLM Judge Mode (no eval command, or --judge flag)


For each agent:
  1. Get the diff:
     `git diff {base_branch}...{agent_branch}`
  2. Read the agent's result post from
     `.agenthub/board/results/agent-{i}-result.md`
  3. Compare all diffs and rank by:
    • Correctness — Does it solve the task?
    • Simplicity — Fewer lines changed is better (when equal correctness)
    • Quality — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

```
RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```
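The three steps above amount to assembling one comparison prompt that holds every agent's diff and result post. A minimal sketch, assuming the diffs and result-post texts have already been collected (the function name and section layout are illustrative; only the diff command and the result-post path come from the steps above):

```python
def build_judge_prompt(task, agents):
    """Assemble a single ranking prompt from per-agent material.

    agents -- dict mapping agent name to {"diff": ..., "result_post": ...},
              where "diff" is the output of
              `git diff {base_branch}...{agent_branch}` and "result_post" is
              the text of .agenthub/board/results/agent-{i}-result.md
    """
    sections = [
        f"Task: {task}",
        "Rank the agents by correctness, then simplicity "
        "(fewer lines changed), then quality. Justify each rank.",
    ]
    for name in sorted(agents):
        material = agents[name]
        # One section per agent: its own report, then the actual change.
        sections.append(
            f"### {name}\n"
            f"Result post:\n{material['result_post']}\n"
            f"Diff:\n{material['diff']}"
        )
    return "\n\n".join(sections)
```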

Hybrid Mode


  1. Run metric evaluation first
  2. If top agents are within 10% of each other, use LLM judge to break ties
  3. Present both metric and qualitative rankings
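The 10% tie-break rule in step 2 can be sketched as a hypothetical helper (the name and signature are illustrative; `ranked` is assumed to be a best-first list of `(agent, metric)` pairs from the metric evaluation):

```python
def needs_tiebreak(ranked, threshold=0.10):
    """Return True when the top two agents are within `threshold` of each
    other, i.e. the metric alone cannot pick a winner and the LLM judge
    should break the tie.

    ranked -- list of (agent, metric) pairs, best first
    """
    if len(ranked) < 2:
        return False  # one candidate: nothing to break
    best = ranked[0][1]
    runner_up = ranked[1][1]
    if best == 0:
        return runner_up == 0  # avoid dividing by a zero metric
    # Relative gap between the top two, measured against the best score.
    return abs(runner_up - best) / abs(best) <= threshold
```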

After Eval


  1. Update session state:

     ```bash
     python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
     ```

  2. Tell the user:
    • Ranked results with the winner highlighted
    • Next step: `/hub:merge` to merge the winner
    • Or `/hub:merge {session-id} --agent {winner}` to be explicit