Search Results: ai-testing

Found 14 Skills

eval-generator

Generates eval test cases from an eval suite plan (output of /eval-suite-planner) or a plain-English agent description. Supports both single-response and conversation (multi-turn) evaluation modes. Outputs a Copilot Studio test set table, a CSV file for import (single-response only), and a docx report for human review.

🇺🇸 | English | Translated

AI & Machine Learning | microsoft/eval-guide

eval-suite-planner

Produces a concrete eval suite plan grounded in Microsoft's Eval Scenario Library and MS Learn agent evaluation guidance — scenario types, evaluation methods, quality signals, thresholds, and priority order — before any test cases are generated or evals are run.

🇺🇸 | English | Translated

AI & Machine Learning | livekit/agent-skills

livekit-agents

Build voice AI agents with LiveKit Cloud and the Agents SDK. Use when the user asks to "build a voice agent", "create a LiveKit agent", "add voice AI", "implement handoffs", "structure agent workflows", or is working with LiveKit Agents SDK. Provides opinionated guidance for the recommended path: LiveKit Cloud + LiveKit Inference. REQUIRES writing tests for all implementations.

🇺🇸 | English | Translated

AI & Machine Learning | coval-ai/coval-external-s...

migrate-bluejay

Migrate configuration from Bluejay voice AI testing platform to Coval. Use when customer says "migrate from bluejay", "bluejay migration", "import bluejay config", or needs to transfer agents, simulations, metrics, and schedules from Bluejay to Coval.

🇺🇸 | English | Translated

AI & Machine Learning | github/awesome-copilot

eval-driven-dev

Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.

🇺🇸 | English | Translated

Testing & QA | cinience/alicloud-skills

alicloud-ai-multimodal-qwen-vl-test

Minimal image-understanding smoke test for Model Studio Qwen VL.

🇨🇳 | Chinese | Translated

2 scripts | Attention

Testing & QA | stablyai/agent-skills

stably-sdk-rules

AI rules for writing tests with Stably Playwright SDK. Use this skill when writing or modifying Playwright tests with Stably AI features. Covers when to use Playwright vs Stably methods, plus minimal patterns for aiAssert, extract, getLocatorsByAI, agent.act, Inbox, and Google auth.

🇺🇸 | English | Translated

AI & Machine Learning | promptfoo/promptfoo

promptfoo-evals

Write, refine, run, and QA promptfoo evaluation suites: promptfooconfig.yaml, prompts, providers, vars, tests, assertions, model-graded rubrics, transforms, datasets, exports, and CI gates. Use for non-redteam eval coverage, regression tests, or new eval matrices. Do not use for adversarial redteam plugin or strategy setup.

🇺🇸 | English | Translated

AI & Machine Learning | coval-ai/coval-external-s...

quick-eval

Full evaluation workflow: launch a run, watch progress, and summarize results. Use for end-to-end agent testing.

🇺🇸 | English | Translated

AI & Machine Learning | nvidia/skills

onboard-gb200-1node-tests

Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.

🇺🇸 | English | Translated

AI & Machine Learning | garrytan/gbrain

skillify

The meta skill. Turn any raw feature into a properly-skilled, tested, resolvable unit of agent capability. Cross-modal eval is the recommended Phase 3 quality gate: 3 frontier models from different providers critique the output, you iterate to quality, THEN write tests that lock in the proven-good behavior.

🇺🇸 | English | Translated

AI & Machine Learning | axiomhq/skills

writing-evals

Scaffolds evaluation suites for the Axiom AI SDK. Generates eval files, scorers, flag schemas, and config from natural-language descriptions. Use when creating evals, writing scorers, setting up flag schemas, or configuring axiom.config.ts.

🇺🇸 | English | Translated

16 scripts | Attention