Search Results: llm-as-judge

Found 44 Skills

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of evaluating LLM output quality.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learninglangchain-ai/langsmith-sk...

langsmith-evaluator

INVOKE THIS SKILL when building evaluation pipelines for LangSmith. Covers three core components: (1) Creating Evaluators - LLM-as-Judge, custom code; (2) Defining Run Functions - how to capture outputs and trajectories from your agent; (3) Running Evaluations - locally with evaluate() or auto-run via LangSmith. Uses the langsmith CLI tool.

🇺🇸|EnglishTranslated

Testing & QAravnhq/ai-toolkit

eval-agent-md

Behavioral compliance testing for any CLAUDE.md or agent definition file. Auto-generates test scenarios from your rules, runs them via LLM-as-judge scoring, and reports compliance. Optionally improves failing rules via automated mutation loop.

🇺🇸|EnglishTranslated

3 scripts/Attention

AI & Machine Learningneolabhq/context-engineer...

sdd:implement

Implement a task with automated LLM-as-Judge verification for critical steps

🇺🇸|EnglishTranslated

AI & Machine Learninggoogle/agents-cli

google-agents-cli-eval

This skill should be used when the user wants to "run an evaluation", "evaluate my ADK agent", "write an evalset", "debug eval scores", "compare eval results", or needs guidance on ADK (Agent Development Kit) evaluation methodology and the eval-fix loop. Covers eval metrics, evalset schema, LLM-as-judge, tool trajectory scoring, and common failure causes. Part of the Google ADK (Agent Development Kit) skills suite. Do NOT use for API code patterns (use google-agents-cli-adk-code), deployment (use google-agents-cli-deploy), or project scaffolding (use google-agents-cli-scaffold).

🇺🇸|EnglishTranslated

29.1k

AI & Machine Learninggoogle/skills

agent-platform-eval-flywheel

Measure and improve the quality of AI models and agents on Google Cloud using the Eval Quality Flywheel methodology. Use when evaluating an agent or model, building an eval dataset, picking or writing evaluation metrics, analyzing failures, comparing results before and after a fix, or when guidance is needed on Agent Platform eval methodology — including dataset schema, LLM-as-judge scoring, and common failure causes. For fine-tuning, use agent-platform-tuning. For deployment, use agent-platform-deploy.

🇺🇸|EnglishTranslated

5 scripts/Checked

Testing & QApedronauck/skills

testing-boss

Comprehensive testing doctrine for software and AI systems — covers positive patterns, anti-patterns, gates for coding agents writing tests, CI discipline, and an LLM/agent evaluation primer. Use when authoring or reviewing tests, adding mocks, deciding test placement, generating tests via agents, debugging flaky CI, designing eval suites for LLM features, or rebuilding a brittle test suite. Contains 12 positive patterns (selector hierarchy, table-driven, builders, real-system gates), 25 anti-patterns across Brittleness, Flakiness, Mock-misuse, Process, and AI-specific families, 7 mandatory gates for agents writing tests, flaky-test taxonomy with quarantine workflow, contract / property / mutation testing patterns, and an oracle-ladder primer for LLM-as-judge and agent eval. Language-agnostic — pseudo-code only. Don't use for general code review, library-specific debugging unrelated to tests, non-testing CI pipeline design, or production observability.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

🇺🇸|EnglishTranslated

AI & Machine Learninglubu-labs/langchain-agent...

langgraph-testing-evaluation

Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.

🇺🇸|EnglishTranslated

11 scripts/Attention

AI & Machine Learningmrgoonie/claudekit-skills

context-engineering

Master context engineering for AI agent systems. Use when designing agent architectures, debugging context failures, optimizing token usage, implementing memory systems, building multi-agent coordination, evaluating agent performance, or developing LLM-powered pipelines. Covers context fundamentals, degradation patterns, optimization techniques (compaction, masking, caching), compression strategies, memory architectures, multi-agent patterns, LLM-as-Judge evaluation, tool design, and project development.

🇺🇸|EnglishTranslated

2 scripts/Checked

AI & Machine Learningantstackio/skills

aws-bedrock-evals

Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.

🇺🇸|EnglishTranslated

AI & Machine Learningzrosenbauer/skills

skill-eval

This skill should be used when the user wants to run baseline evaluations on existing agent skills, regenerate transcripts after a model upgrade, or check whether a skill still solves the gap it was authored for. Common triggers include "rerun the baselines", "re-eval skill X", "test all the skills", "check for skill drift", and "run the evals". Bakes in verbatim transcript capture (no paraphrasing), deterministic-only grading (regex / contains / file_exists — no LLM-as-judge), and the iteration-N workspace convention. Skip when authoring a new skill (use skill-creator) or modifying skill content directly.

🇺🇸|EnglishTranslated