Search Results: agent-evaluation

Found 55 Skills

AI & Machine Learning | googlecloudplatform/cxas-...

cxas-agent-foundry

End-to-end GECX/CXAS/CES conversational agent lifecycle -- build agents from requirements (PRD-to-agent), create and run evals (goldens, simulations, tool tests, callback tests), debug failures, and iterate to production quality. Use this skill whenever the user mentions GECX, CXAS, CES, SCRAPI, conversational agents, voice agents, audio agents, agent evals, pushing/pulling/linting agents, or agent instructions/callbacks/tools on the Google Customer Engagement Suite platform.

27 scripts/Attention

AI & Machine Learning | coval-ai/coval-external-s...

quick-eval

Full evaluation workflow - launch a run, watch progress, and summarize results. Use for end-to-end agent testing.

AI & Machine Learning | bagelhole/devops-security...

agent-evals

Build automated evaluation suites for AI agents using golden datasets, rubrics, and regression gates.

AI & Machine Learning | cobosteven/cobo-agent-wal...

caw-eval

Evaluate the quality of the CAW (Cobo Agentic Wallet) agent in a local Claude Code environment and generate scoring data and analysis reports. Use when users want to run or conduct a CAW evaluation, test the skill, assess agent quality, generate evaluation reports, or say "run evaluation", "evaluate CAW", "eval", or "score". For weak-model / openclaw evaluation, use caw-eval-openclaw (installed only on openclaw servers).

8 scripts/Attention

Testing & QA | coval-ai/coval-external-s...

build-test-suite

Build a complete test suite with test set and test cases for evaluating an AI agent. Guides through test set type selection, scenario design using vertical-specific templates, expected behavior crafting, and bulk creation. Use when user says "create test cases", "build test suite", "add test scenarios", "set up evaluation tests", or "design test cases".

AI & Machine Learning | affaan-m/ecc

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics.

AI & Machine Learning | cekura-ai/cekura-skills

cekura-eval-design

Use when the user asks to "create an evaluator", "create evals", "create a scenario", "write a test scenario", "design a test case", "test my agent", "build eval coverage", "plan a test suite", "create red team tests", "set up test profiles", "configure conditional actions", "write a conditional action evaluator", "build a deterministic test", "design an IVR test", "IVR navigation test", "write a unit test for a voice agent", "build a regression test", "scripted scenario", "scripted voice test", "structured evaluator", "exact flow test", "sequential conditions", "fixed sequence test", or "run evals". Covers individual evaluator design, suite coverage strategy, test profiles, mock-tool data design, conditional actions (deterministic / unit test / regression / IVR navigation flows), and best practices for workflow / red-team / edge-case / deterministic test types.

AI & Machine Learning | langchain-ai/lca-skills

langsmith-code-eval

Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.

1 scripts/Checked

AI & Machine Learning | workersio/spec

skill-benchmark

Benchmark any agent skill to measure whether it actually improves performance. Use when the user wants to evaluate, test, or compare a skill against baseline, or when they mention "benchmark", "eval", "skill performance", or "does this skill help". Runs isolated eval sessions with and without the skill, grades outputs via layered grading (deterministic checks + LLM-as-judge), analyzes behavioral signals, and generates a comparison report with a USE / DON'T USE verdict.

3 scripts/Attention

AI & Machine Learning | shentufoundation/openmath...

openmath-lean-theorem

Configures Lean environments, installs external proof skills, runs preflight checks, and guides the workflow for proving downloaded OpenMath Lean theorems locally.

1 scripts/Checked

AI & Machine Learning | gotalab/skillport

skill-evaluator

Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.

1 scripts/Attention

AI & Machine Learning | 10xchengtu/harness-engine...

harness-engineering

Set up and improve harness engineering (AGENTS.md, docs/, lint rules, eval systems, project-level prompt engineering) for AI-agent-friendly codebases. Triggers on: new/empty project setup for AI agents, AGENTS.md or CLAUDE.md creation, harness engineering questions, making agents work better on a codebase. ALSO triggers when users are frustrated or complaining about agent quality — e.g. 'the agent keeps ignoring conventions', 'it never follows instructions', 'why does it keep doing X', 'the agent is broken' — because poor agent output almost always signals harness gaps, not model problems. Covers: context engineering, architectural constraints, multi-agent coordination, evaluation, long-running agent harness, and diagnosis of agent quality issues.
