Search Results: agent-evaluation

Found 20 Skills

google-agents-cli-workflow

This skill should be used when the user wants to "develop an agent", "build an agent using ADK", "run the agent locally", "debug agent code", "test an agent", "deploy an agent", "publish an agent", "monitor an agent", or needs the ADK (Agent Development Kit) development lifecycle and coding guidelines. Entrypoint for building ADK agents. Always active — provides the full workflow (scaffold, build, evaluate, deploy, publish, observe), code preservation rules, model selection guidance, and troubleshooting steps for ADK or any agent development.

26.4k

AI & Machine Learning | microsoft/agent-skills

azure-ai-evaluation-py

Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, agent, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics", "RedTeam", "agent evaluation".

1 script/Checked

AI & Machine Learning | sammcj/agentic-coding

deepeval

Use when discussing or working with DeepEval (the Python AI evaluation framework).

AI & Machine Learning | google/adk-docs

adk-dev-guide

ALWAYS ACTIVE — read at the start of any ADK agent development session. ADK development lifecycle and mandatory coding guidelines — spec-driven workflow, code preservation rules, model selection, and troubleshooting.

AI & Machine Learning | cekura-ai/cekura-skills

cekura-coordinator

Use when the user asks "what can Cekura do", "what commands are available", "help me with Cekura", "what skills do I have", "show me Cekura features", "what's available", "how do I use Cekura", or needs guidance on which Cekura skill to use for their task. Also relevant as the entry point when a user has just installed cekura-skills for the first time.

AI & Machine Learning | jackjin1997/clawforge

langsmith-dataset

Use this skill for ANY question about creating test or evaluation datasets for LangChain agents. Covers generating datasets from traces (final_response, single_step, trajectory, RAG types), uploading to LangSmith, and managing evaluation data.

2 scripts/Checked

AI & Machine Learning | aws/agent-toolkit-for-aws

agents-optimize

Use when measuring or improving agent quality and performance — set up evaluators, online monitoring, CI/CD quality gates, observability, or cost optimization. Triggers on: "evaluate my agent", "add evaluator", "measure quality", "quality gate", "run evals", "agent too slow", "why is it slow", "reduce latency", "set up observability", "CloudWatch dashboard", "how much does my agent cost", "cost optimization", "logs not showing up", "logs missing", "spans not found", "eval failing", "eval error", "dev traces", "local traces", "agentcore dev traces", "traces to CloudWatch". Not for debugging errors or crashes — use agents-debug. Slow but correct routes here; broken routes to debug.

Tools & Utilities | colbymchenry/codegraph

agent-eval

Benchmark CodeGraph retrieval quality on a real codebase by comparing agent behavior with vs without CodeGraph. Use when the user runs /agent-eval or asks to test, benchmark, audit, or validate a codegraph version (the local dev build or a published npm version) against a language's repo.

AI & Machine Learning | mizchi/chezmoi-dotfiles

empirical-prompt-tuning

A method for iteratively improving text instructions for agents (skills / slash commands / task prompts / CLAUDE.md sections / code generation prompts) by having unbiased executors run them, then evaluating from both perspectives (executor self-report + instruction-side metrics). Repeat until improvement plateaus. Use immediately after creating or significantly revising a prompt or skill, or when you suspect the reason an agent isn't behaving as expected is due to ambiguity in the instructions.

🇨🇳 | Chinese

AI & Machine Learning | googlecloudplatform/cxas-...

cxas-agent-foundry

End-to-end GECX/CXAS/CES conversational agent lifecycle -- build agents from requirements (PRD-to-agent), create and run evals (goldens, simulations, tool tests, callback tests), debug failures, and iterate to production quality. Use this skill whenever the user mentions GECX, CXAS, CES, SCRAPI, conversational agents, voice agents, audio agents, agent evals, pushing/pulling/linting agents, or agent instructions/callbacks/tools on the Google Customer Engagement Suite platform.

27 scripts/Attention

AI & Machine Learning | the-perfect-developer/the...

google-adk

This skill should be used when the user asks to "build an agent with Google ADK", "use the Agent Development Kit", "create a Google ADK agent", "set up ADK tools", or needs guidance on Google's Agent Development Kit best practices, multi-agent systems, or agent evaluation.

AI & Machine Learning | affaan-m/ecc

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, reporting pass rate, cost, time, and consistency metrics.
