Search Results: agent-evaluation

Found 55 Skills

AI & Machine Learning | harbor-framework/harbor

create-task

Create a new Harbor task for evaluating agents. Use when the user wants to scaffold, build, or design a new task, benchmark problem, or eval. Guides through instruction writing, environment setup, verifier design (pytest vs Reward Kit vs custom), and solution scripting.

AI & Machine Learning | coval-ai/coval-external-s...

huggingface-import

Import datasets from HuggingFace and convert them to Coval test sets. Use when the user wants to create test cases from a HuggingFace dataset or repository.

Scripts: 1 (checked)

AI & Machine Learning | cekura-ai/cekura-skills

cekura-predefined-metrics

Use when the user asks "what predefined metrics are available", "which built-in metrics should I use", "what does CSAT measure", "how does hallucination detection work", "what's the difference between Interruption Score and AI Interrupting User", "which metrics are free", "which metrics need audio", "configure silence threshold", "set up sentiment metric", or any question about Cekura's out-of-the-box metrics. Covers the full catalog of predefined metrics — what each does, costs, constraints, configuration options, and when to use each one.

AI & Machine Learning | neolabhq/context-engineer...

judge

Launch a meta-judge, then a judge sub-agent, to evaluate results produced in the current conversation.

AI & Machine Learning | coval-ai/coval-external-s...

configure-metrics

Select and configure evaluation metrics for an AI agent. Guides through metric selection using use-case recommendations, custom LLM-based metric creation with prompt engineering, and agent default attachment. Use when the user says "set up metrics", "configure metrics", "create a metric", "what metrics should I use", "add evaluation criteria", or "customize scoring".

AI & Machine Learning | cekura-ai/cekura-skills

cekura-metric-design

Use when the user asks to "create a metric", "write a metric", "design a metric", "build a metric for", "evaluate agent performance", "measure call quality", "track a KPI", "add a workflow metric", "improve my metric", "fix a metric", "debug metric results", "set up quality scoring", or "what metrics do I need". Also relevant when discussing LLM judge prompts, custom code metrics, evaluation triggers, VALID_SKIP patterns, section extraction, or metric best practices for Cekura voice AI agents. Covers both creating new metrics and reviewing, iterating on, or troubleshooting existing ones.

Scripts: 2 (checked)

AI & Machine Learning | aradotso/ai-agent-skills

datawhale-agent-learning-hub

AI agent learning roadmap and curated resources for building production-ready agents with modern tools and patterns such as Claude Code, OpenClaw, skills, MCP, and evaluation.
