Found 43 Skills
Benchmark any agent skill to measure whether it actually improves performance. Use when the user wants to evaluate, test, or compare a skill against a baseline, or when they mention "benchmark", "eval", "skill performance", or "does this skill help". Runs isolated eval sessions with and without the skill, scores outputs with layered grading (deterministic checks plus an LLM-as-judge), analyzes behavioral signals, and generates a comparison report with a USE / DON'T USE verdict.
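A minimal sketch of what the with/without comparison and layered grading could look like; run_agent() and llm_judge() below are hypothetical stubs standing in for the isolated eval sessions and the judge model call, not part of any real API.

```python
# Sketch (assumptions flagged): run_agent() and llm_judge() are placeholder
# stubs for the isolated eval sessions and judge calls this skill manages.
from statistics import mean

def run_agent(prompt: str, skill: str | None) -> str:
    # Replace with a real isolated agent session (with or without the skill).
    return f"[{'skill' if skill else 'baseline'}] answer to: {prompt}"

def llm_judge(prompt: str, output: str) -> float:
    # Replace with a real LLM-as-judge call returning a 0.0-1.0 score.
    return 0.5

def deterministic_checks(output: str, case: dict) -> float:
    """Cheap layer: fraction of required substrings present in the output."""
    required = case.get("must_contain", [])
    return 1.0 if not required else sum(s in output for s in required) / len(required)

def grade(output: str, case: dict) -> float:
    """Layered grading: deterministic checks blended with the judge score."""
    return 0.5 * deterministic_checks(output, case) + 0.5 * llm_judge(case["prompt"], output)

def benchmark(cases: list[dict]) -> str:
    baseline = mean(grade(run_agent(c["prompt"], None), c) for c in cases)
    with_skill = mean(grade(run_agent(c["prompt"], "my-skill"), c) for c in cases)
    verdict = "USE" if with_skill > baseline else "DON'T USE"
    return f"baseline={baseline:.2f}  with_skill={with_skill:.2f}  -> {verdict}"

print(benchmark([{"prompt": "Summarize the README", "must_contain": ["install"]}]))
```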
Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.
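A sketch of the "collect pre-computed responses" step under stated assumptions: the record fields and the my-eval-bucket S3 location are illustrative, and the exact JSONL schema plus the create_evaluation_job configuration should be taken from the Amazon Bedrock documentation.

```python
# Sketch: package pre-computed prompt/response pairs as JSONL and stage them
# in S3 for a Bedrock evaluation job. The record layout below is illustrative;
# use the schema from the Amazon Bedrock docs for your job type and metrics.
import json
import boto3

records = [
    {"prompt": "What is the capital of France?",
     "referenceResponse": "Paris",
     "modelResponse": "Paris is the capital of France."},  # pre-computed output
]

with open("precomputed_dataset.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

s3 = boto3.client("s3")
s3.upload_file("precomputed_dataset.jsonl", "my-eval-bucket",          # placeholder bucket
               "datasets/precomputed_dataset.jsonl")

# The evaluation job itself is created against the Bedrock control-plane client
# via create_evaluation_job (or in the console), pointing its evaluation and
# inference configuration at the staged dataset; the full config is omitted here.
bedrock = boto3.client("bedrock")
```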
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and LLM-as-a-judge verification.
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use optimize-prompt) or when failure modes are unknown (use analyze-trace-failures first).
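A sketch of the TPR/TNR validation step, treating Pass as the positive class (an assumption) and using a placeholder judge(); swap in the real LLM-as-judge call and human-labeled traces.

```python
# Sketch: validate a binary Pass/Fail judge against human-labeled traces by
# measuring TPR (judge passes what humans passed) and TNR (judge fails what
# humans failed). judge() is a placeholder for the real LLM-as-judge call.
def judge(trace: str) -> bool:
    # Replace with the actual judge prompt/model call: True = Pass, False = Fail.
    return "refund" in trace

def tpr_tnr(labeled: list[tuple[str, bool]]) -> tuple[float, float]:
    tp = sum(judge(t) for t, label in labeled if label)          # judge agrees on Pass
    tn = sum(not judge(t) for t, label in labeled if not label)  # judge agrees on Fail
    pos = sum(1 for _, label in labeled if label)
    neg = len(labeled) - pos
    return (tp / pos if pos else 0.0), (tn / neg if neg else 0.0)

labeled = [("agent offered a refund", True), ("agent ignored the request", False)]
tpr, tnr = tpr_tnr(labeled)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")  # both should be high before trusting the judge
```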
Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and meta-judge → LLM-as-a-judge verification.
Execute a task with sub-agent implementation and LLM-as-a-judge verification in an automatic retry loop.
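A sketch of the retry loop under assumed implement() and judge() helpers standing in for the sub-agent and the LLM-as-a-judge call.

```python
# Sketch of the verify-and-retry loop; implement() and judge() are placeholders
# for a sub-agent invocation and an LLM-as-a-judge call.
def implement(task: str, feedback: str | None) -> str:
    # Replace with a sub-agent invocation; feedback carries the judge's critique.
    return f"attempt at: {task}" + (f" (addressing: {feedback})" if feedback else "")

def judge(task: str, result: str) -> tuple[bool, str]:
    # Replace with an LLM-as-a-judge call returning (passed, critique).
    return True, "looks complete"

def run_with_retries(task: str, max_attempts: int = 3) -> str:
    feedback = None
    for attempt in range(1, max_attempts + 1):
        result = implement(task, feedback)
        passed, feedback = judge(task, result)
        if passed:
            return result  # judge accepted the result
    raise RuntimeError(f"judge rejected all {max_attempts} attempts: {feedback}")

print(run_with_retries("write a changelog entry"))
```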
Launch a sub-agent judge to evaluate results produced in the current conversation.
Attach judges to AI Config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.
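A generic sketch of sampling-rate-gated judging (not a vendor SDK call): only a configured fraction of outputs from each variation is scored, and per-variation scores are kept for monitoring. The variation names and rates are assumptions.

```python
# Generic sketch of per-variation judge sampling; not tied to any specific SDK.
import random
from collections import defaultdict

SAMPLING_RATE = {"variation-a": 0.1, "variation-b": 1.0}  # assumed per-variation rates
scores: dict[str, list[float]] = defaultdict(list)

def judge(prompt: str, output: str) -> float:
    # Replace with the attached judge model; returns a 0.0-1.0 quality score.
    return 0.8

def maybe_evaluate(variation: str, prompt: str, output: str) -> None:
    # Send only the sampled fraction of outputs to the judge.
    if random.random() < SAMPLING_RATE.get(variation, 0.0):
        scores[variation].append(judge(prompt, output))

maybe_evaluate("variation-b", "summarize this ticket", "Short summary ...")
print({v: sum(s) / len(s) for v, s in scores.items() if s})  # monitored quality scores
```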
Launch a meta-judge and then a judge sub-agent to evaluate results produced in the current conversation.
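A sketch of the meta-judge → judge hand-off, with both stages as placeholder functions: the meta-judge derives grading criteria from the task, then the judge scores the conversation's result against them.

```python
# Sketch of the two-stage pattern; both helpers stand in for sub-agent calls.
def meta_judge(task: str) -> list[str]:
    # Replace with a sub-agent that derives grading criteria from the task.
    return ["answers the question", "cites evidence from the conversation"]

def judge(result: str, rubric: list[str]) -> dict[str, bool]:
    # Replace with a sub-agent that checks the result against each criterion.
    return {criterion: True for criterion in rubric}

def evaluate(task: str, result: str) -> dict[str, bool]:
    rubric = meta_judge(task)          # stage 1: build/validate the rubric
    return judge(result, rubric)       # stage 2: apply it to the result

print(evaluate("explain the bug fix", "The fix guards against a nil pointer ..."))
```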
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.