Loading...
Loading...
Found 35 Skills
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
Migrate an application with hardcoded LLM prompts to a full LaunchDarkly AgentControl implementation in five stages: audit the code, wrap the call, move the tools, add tracking, attach evaluators. Use when the user wants to externalize model/prompt configuration, move from direct provider calls (OpenAI, Anthropic, Bedrock, Gemini, Strands) to a managed config, or stage a full hardcoded-to-LaunchDarkly migration.
Complete reference for the Galileo AI platform TypeScript/JS SDK for evaluating, observing, and protecting GenAI applications. Use when building Node.js or TypeScript applications that need LLM evaluation, production observability, tracing, or runtime guardrails with Galileo.
Compare Replicate models by cost, speed, quality, and capabilities.
Run cross-framework agent comparisons using evaluatorq from orqkit — compares any combination of agents (orq.ai, LangGraph, CrewAI, OpenAI Agents SDK, Vercel AI SDK) head-to-head on the same dataset with LLM-as-a-judge scoring. Use when comparing agents, benchmarking, or wanting side-by-side evaluation. Do NOT use when comparing only orq.ai configurations with no external agents (use run-experiment instead).
Evaluates ML models for performance, fairness, and reliability. Use for metric selection, cross-validation strategies, overfitting/underfitting diagnosis, hyperparameter tuning, LLM evaluation, A/B testing, and production monitoring for model drift.
Write, refine, run, and QA promptfoo evaluation suites: promptfooconfig.yaml, prompts, providers, vars, tests, assertions, model-graded rubrics, transforms, datasets, exports, and CI gates. Use for non-redteam eval coverage, regression tests, or new eval matrices. Do not use for adversarial redteam plugin or strategy setup.
Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.
Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge — use validate-evaluator instead.
Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).
Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.
Cross-model benchmark for gstack skills. Runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost, and optionally quality via LLM judge. Answers "which model is actually best for this skill?" with data instead of vibes. Separate from /benchmark, which measures web page performance. Use when: "benchmark models", "compare models", "which model is best for X", "cross-model comparison", "model shootout". (gstack) Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best".