Search Results: llm-as-a-judge

Found 16 Skills

AI & Machine Learning | maragudk/evals-skills

llm-as-a-judge

Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.

AI & Machine Learning | launchdarkly/agent-skills

aiconfig-online-evals

Attach judges to AI Config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.

AI & Machine Learning | neolabhq/context-engineer...

do-and-judge

Execute a task with a sub-agent implementation, LLM-as-a-judge verification, and an automatic retry loop.

AI & Machine Learning | neolabhq/context-engineer...

sadd:do-in-steps

Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and LLM-as-a-judge verification.

AI & Machine Learning | launchdarkly/agent-skills

online-evals

Attach judges to AI Config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.

AI & Machine Learning | neolabhq/context-engineer...

do-in-steps

Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and meta-judge → LLM-as-a-judge verification.

AI & Machine Learning | awslabs/agent-plugins

use-case-specification

Creates a reusable use case specification file that defines the business problem, stakeholders, and measurable success criteria for model customization, as recommended by the AWS Responsible AI Lens. Use as the default first step in any model customization plan. Skip only if the user explicitly declines or already has a use case specification to reuse. Captures problem statement, primary users, and LLM-as-a-Judge success tenets.

AI & Machine Learning | neolabhq/context-engineer...

sadd:do-and-judge

Execute a task with a sub-agent implementation, LLM-as-a-judge verification, and an automatic retry loop.

AI & Machine Learning | shipshitdev/library

advanced-evaluation

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.

1 script | Checked

AI & Machine Learning | orq-ai/assistant-plugins

compare-agents

Run cross-framework agent comparisons using evaluatorq from orqkit — compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when comparing agents, benchmarking, or wanting side-by-side evaluation. Do NOT use when comparing only orq.ai configurations with no external agents (use run-experiment instead).

AI & Machine Learning | orq-ai/assistant-plugins

build-evaluator

Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use optimize-prompt) or when failure modes are unknown (use analyze-trace-failures first).

AI & Machine Learning | awslabs/agent-plugins

model-evaluation

Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.

2 scripts | Checked