Search Results: ai-agent-evaluation

Found 14 Skills

AI & Machine Learning | microsoft/eval-guide

eval-result-interpreter

Analyzes Copilot Studio evaluation CSV results using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.

AI & Machine Learning | confident-ai/deepeval

deepeval

DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; add tracing or @observe; send results to Confident AI; monitor production; run online evals; inspect traces; or iterate on prompts, tools, retrieval, or agent behavior from eval failures. AI agents are the primary use case. Covers Python SDK, pytest eval suites, CLI generation, tracing, Confident AI reporting, and agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval.

4 scripts/Checked

AI & Machine Learning | microsoft/eval-guide

eval-faq

Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.

AI & Machine Learning | github/awesome-copilot

agentic-eval

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: implementing self-critique and reflection loops; building evaluator-optimizer pipelines for quality-critical generation; creating test-driven code refinement workflows; designing rubric-based or LLM-as-judge evaluation systems; adding iterative improvement to agent outputs (code, reports, analysis); measuring and improving agent response quality.

AI & Machine Learning | microsoft/eval-guide

eval-triage-and-improvement

Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.

AI & Machine Learning | adaptationio/skrillz

bedrock-agentcore-evaluations

Amazon Bedrock AgentCore Evaluations for testing and monitoring AI agent quality. 13 built-in evaluators plus custom LLM-as-Judge patterns. Use when testing agents, monitoring production quality, setting up alerts, or validating agent behavior.

AI & Machine Learning | aradotso/trending-skills

future-agi-platform

Expert skill for using Future AGI — the open-source end-to-end platform for evaluating, observing, and improving LLM and AI agent applications with tracing, evals, simulations, datasets, gateway, and guardrails.

AI & Machine Learning | tyler-r-kendrick/agent-sk...

microsoft-foundry

Use this skill to work with Microsoft Foundry (Azure AI Foundry): deploy AI models from catalog, build RAG applications with knowledge indexes, create and evaluate AI agents. USE FOR: Microsoft Foundry, AI Foundry, deploy model, model catalog, RAG, knowledge index, create agent, evaluate agent, agent monitoring. DO NOT USE FOR: Azure Functions (use azure-functions), App Service (use azure-create-app).

Testing & QA | coval-ai/coval-external-s...

build-test-suite

Build a complete test suite with test set and test cases for evaluating an AI agent. Guides through test set type selection, scenario design using vertical-specific templates, expected behavior crafting, and bulk creation. Use when user says "create test cases", "build test suite", "add test scenarios", "set up evaluation tests", or "design test cases".

AI & Machine Learning | bagelhole/devops-security...

agent-evals

Build automated evaluation suites for AI agents using golden datasets, rubrics, and regression gates.

AI & Machine Learning | coval-ai/coval-external-s...

huggingface-import

Import datasets from HuggingFace and convert them to Coval test sets. Use when the user wants to create test cases from a HuggingFace dataset or repository.

1 scripts/Checked

AI & Machine Learning | cekura-ai/cekura-skills

cekura-predefined-metrics

Use when the user asks "what predefined metrics are available", "which built-in metrics should I use", "what does CSAT measure", "how does hallucination detection work", "what's the difference between Interruption Score and AI Interrupting User", "which metrics are free", "which metrics need audio", "configure silence threshold", "set up sentiment metric", or any question about Cekura's out-of-the-box metrics. Covers the full catalog of predefined metrics — what each does, costs, constraints, configuration options, and when to use each one.
