Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
Install the skill:

```
npx skill4agent add alirezarezvani/claude-skills eval
```

Usage:

```
/hub:eval                    # Eval latest session using configured criteria
/hub:eval 20260317-143022    # Eval a specific session
/hub:eval --judge            # Force LLM judge mode (ignore metric config)
```

In metric mode, the ranker script scores each agent's result:

```
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Example metric-mode output:

| Rank | Agent   | Metric | Delta | Files |
|------|---------|--------|-------|-------|
| 1    | agent-2 | 142ms  | -38ms | 2     |
| 2    | agent-1 | 165ms  | -15ms | 3     |
| 3    | agent-3 | 190ms  | +10ms | 1     |
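The metric-mode ranking above could be implemented roughly as follows. This is a hypothetical sketch, not the actual `result_ranker.py`: the function name, the "first number in the eval output is the metric" convention, and the `{agent}` placeholder in the eval command are all assumptions.

```python
# Hypothetical sketch of metric-mode ranking: run the eval command per
# agent, extract the first number from its output as the metric, then
# sort ascending (minimize) or descending (maximize).
import re
import subprocess

def rank_agents(agents, eval_cmd, direction="minimize"):
    """Return (agent, score) pairs, best first.

    eval_cmd may contain an {agent} placeholder (an assumption made
    for this sketch, not necessarily the skill's real contract).
    """
    scores = {}
    for agent in agents:
        out = subprocess.run(eval_cmd.format(agent=agent), shell=True,
                             capture_output=True, text=True).stdout
        match = re.search(r"[-+]?\d+(?:\.\d+)?", out)
        if match:
            scores[agent] = float(match.group())
    reverse = direction == "maximize"
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=reverse)
```

With `direction="minimize"` (e.g. latency), the lowest metric wins, matching the table above where agent-2's 142ms takes rank 1.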
Winner: agent-2 (142ms)

Judge mode instead reviews each agent's branch diff and written result, then ranks qualitatively:

```
git diff {base_branch}...{agent_branch}
```

Each agent's write-up lives at `.agenthub/board/results/agent-{i}-result.md`.

Example judge-mode output:

| Rank | Agent   | Verdict                          | Word count |
|------|---------|----------------------------------|------------|
| 1    | agent-1 | Strong narrative, clear CTA      | 1480       |
| 2    | agent-3 | Good data points, weak intro     | 1520       |
| 3    | agent-2 | Generic tone, no differentiation | 1350       |
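The judge-mode input (one result file per agent, plus a word count) could be assembled like this hypothetical sketch. The directory layout under the session dir and the prompt wording are assumptions, not the skill's actual implementation:

```python
# Hypothetical sketch: collect agent-{i}-result.md files and build a
# single comparison prompt for an LLM judge. The judge call itself is
# out of scope here; this only shows the input assembly.
from pathlib import Path

def build_judge_prompt(session_dir, criteria):
    sections = []
    results = sorted(Path(session_dir, "board", "results").glob("agent-*-result.md"))
    for result in results:
        body = result.read_text()
        # Word count reported alongside each result, as in the table above.
        sections.append(f"## {result.stem}\n{body}\nWord count: {len(body.split())}")
    return (f"Rank the following agent results against these criteria: {criteria}\n\n"
            + "\n\n".join(sections))
```

Feeding the returned prompt to a judge model would yield a ranked verdict per agent, as in the judge-mode table above.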
Winner: agent-1 (strongest narrative arc and call-to-action)

While evaluation runs, the session state is marked accordingly:

```
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
```

Once a winner is chosen, merge it with `/hub:merge`:

```
/hub:merge
/hub:merge {session-id} --agent {winner}
```
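The state update step can be pictured with a minimal sketch. The `session.json` layout and the set of valid state transitions are assumptions for illustration; `session_manager.py` may track state differently:

```python
# Hypothetical sketch of a guarded session-state transition, assuming
# session state is a "state" field in a JSON file and that only certain
# transitions (e.g. running -> evaluating) are legal.
import json
from pathlib import Path

VALID_TRANSITIONS = {
    "running": {"evaluating"},
    "evaluating": {"merging", "done"},
}

def update_state(session_file, new_state):
    session = json.loads(Path(session_file).read_text())
    current = session["state"]
    if new_state not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new_state!r}")
    session["state"] = new_state
    Path(session_file).write_text(json.dumps(session, indent=2))
    return session
```

Guarding transitions this way prevents, for example, merging a session that was never evaluated.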