Search Results: llm-judging

Found 3 Skills

AI & Machine Learning | alirezarezvani/claude-ski...

eval

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

skill-benchmark

Benchmark any agent skill to measure whether it actually improves performance. Use when the user wants to evaluate, test, or compare a skill against baseline, or when they mention "benchmark", "eval", "skill performance", or "does this skill help". Runs isolated eval sessions with and without the skill, grades outputs via layered grading (deterministic checks + LLM-as-judge), analyzes behavioral signals, and generates a comparison report with a USE / DON'T USE verdict.

3 scripts | Attention

AI & Machine Learning | glebis/claude-skills

vision-bench

Score and compare images using vision LLMs as judges. YAML-defined criteria presets for 11 use cases (text-to-image, photorealism, document OCR, charts, UI, portrait, product, scientific, invoice, alt-text, artistic style). Supports OpenAI, Anthropic, Gemini, Mistral, and OpenRouter as judge providers. Keys auto-decrypted via SOPS + age.

4 scripts | Checked