Search Results: agent-evaluation

Found 55 Skills

AI & Machine Learning | supercent-io/skills-templ...

agent-evaluation

Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, an 8-step roadmap, and production integration.

10.1k

Testing & QA | davila7/claude-code-templ...

agent-evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

AI & Machine Learning | eyadsibai/ltk

agent-evaluation

Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", or "agent benchmarks".

AI & Machine Learning | oimiragieo/agent-studio

agent-evaluation

LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring the quality of AI-generated content, with weighted composite scores and evidence citations.

1 script | Checked

AI & Machine Learning | notque/claude-code-toolki...

agent-evaluation

Evaluate agents and skills for quality, completeness, and standards compliance using a 6-step rubric: Identify, Structural, Content, Code, Integration, Report. Use when auditing agents/skills, checking quality after creation or update, or reviewing collection health. Triggers: "evaluate", "audit", "check quality", "review agent", "score skill". Do NOT use for creating or modifying agents/skills — only for read-only assessment and scoring.

1 script | Checked

AI & Machine Learning | mlflow/skills

agent-evaluation

Use this when you need to EVALUATE, IMPROVE, or OPTIMIZE an existing LLM agent's output quality, including improving tool selection accuracy, improving answer quality, reducing costs, or fixing issues where the agent gives wrong or incomplete responses. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. Covers the end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).

12 scripts | Attention

AI & Machine Learning | glennguilloux/context-eng...

agent-evaluation

Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.

AI & Machine Learning | neolabhq/context-engineer...

customaize-agent:agent-evaluation

Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.

AI & Machine Learning | google/agents-cli

google-agents-cli-workflow

This skill should be used when the user wants to "develop an agent", "build an agent using ADK", "run the agent locally", "debug agent code", "test an agent", "deploy an agent", "publish an agent", "monitor an agent", or needs the ADK (Agent Development Kit) development lifecycle and coding guidelines. Entrypoint for building ADK agents. Always active — provides the full workflow (scaffold, build, evaluate, deploy, publish, observe), code preservation rules, model selection guidance, and troubleshooting steps for ADK or any agent development.

28.9k

AI & Machine Learning | microsoft/agent-skills

azure-ai-evaluation-py

Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, agent, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics", "RedTeam", "agent evaluation".

1 script | Checked

Testing & QA | microsoft/eval-guide

eval-generator

Generates eval test cases from an eval suite plan (output of /eval-suite-planner) or a plain-English agent description. Supports both single-response and conversation (multi-turn) evaluation modes. Outputs a Copilot Studio test set table, a CSV file for import (single-response only), and a docx report for human review.

AI & Machine Learning | microsoft/eval-guide

eval-guide

Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases they can use immediately. No running agent required. Works from a description, an idea, or even a vague goal. Use when anyone mentions agent evaluation, eval planning, "what should we test", "how do we know if the agent is good", test case generation, or interpreting eval results.

2 scripts | Attention