Loading...
Loading...
Found 47 Skills
Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Comprehensive multi-perspective review using specialized judges with debate and consensus building
Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Includes basic usage of evaluators to run evaluations.
Execute complex tasks through sequential sub-agent orchestration with intelligent model selection, meta-judge → LLM-as-a-judge verification
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
Launch a sub-agent judge to evaluate results produced in the current conversation
Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation
Comprehensive multi-perspective review using specialized judges with debate and consensus building
Execute a task with sub-agent implementation and LLM-as-a-judge verification with automatic retry loop
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.