Search Results: agent-evaluation

Found 55 Skills

AI & Machine Learning | cekura-ai/cekura-skills

cekura-coordinator

Use when the user asks "what can Cekura do", "what commands are available", "help me with Cekura", "what skills do I have", "show me Cekura features", "what's available", "how do I use Cekura", or needs guidance on which Cekura skill to use for their task. Also relevant as the entry point when a user has just installed cekura-skills for the first time.

AI & Machine Learning | alirezarezvani/claude-ski...

eval

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Testing & QA | pedronauck/skills

testing-boss

Comprehensive testing doctrine for software and AI systems — covers positive patterns, anti-patterns, gates for coding agents writing tests, CI discipline, and an LLM/agent evaluation primer. Use when authoring or reviewing tests, adding mocks, deciding test placement, generating tests via agents, debugging flaky CI, designing eval suites for LLM features, or rebuilding a brittle test suite. Contains 12 positive patterns (selector hierarchy, table-driven, builders, real-system gates), 25 anti-patterns across Brittleness, Flakiness, Mock-misuse, Process, and AI-specific families, 7 mandatory gates for agents writing tests, flaky-test taxonomy with quarantine workflow, contract / property / mutation testing patterns, and an oracle-ladder primer for LLM-as-judge and agent eval. Language-agnostic — pseudo-code only. Don't use for general code review, library-specific debugging unrelated to tests, non-testing CI pipeline design, or production observability.

AI & Machine Learning | jackjin1997/clawforge

langsmith-dataset

Use this skill for ANY question about creating test or evaluation datasets for LangChain agents. Covers generating datasets from traces (final_response, single_step, trajectory, RAG types), uploading to LangSmith, and managing evaluation data.

2 scripts/Checked

AI & Machine Learning | aws/agent-toolkit-for-aws

agents-optimize

Use when measuring or improving agent quality and performance — set up evaluators, online monitoring, CI/CD quality gates, observability, or cost optimization. Triggers on: "evaluate my agent", "add evaluator", "measure quality", "quality gate", "run evals", "agent too slow", "why is it slow", "reduce latency", "set up observability", "CloudWatch dashboard", "how much does my agent cost", "cost optimization", "logs not showing up", "logs missing", "spans not found", "eval failing", "eval error", "dev traces", "local traces", "agentcore dev traces", "traces to CloudWatch". Not for debugging errors or crashes — use agents-debug. Slow but correct routes here; broken routes to debug.

AI & Machine Learning | lubu-labs/langchain-agent...

langgraph-testing-evaluation

Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.

11 scripts/Attention

AI & Machine Learning | mizchi/chezmoi-dotfiles

empirical-prompt-tuning

A method for iteratively improving text instructions for agents (skills / slash commands / task prompts / CLAUDE.md sections / code generation prompts) by having unbiased executors run them, then evaluating from both perspectives (executor self-report + instruction-side metrics). Repeat until improvement plateaus. Use immediately after creating or significantly revising a prompt or skill, or when you suspect the reason an agent isn't behaving as expected is due to ambiguity in the instructions.

🇨🇳 | Chinese | Translated

Tools & Utilities | colbymchenry/codegraph

agent-eval

Benchmark CodeGraph retrieval quality on a real codebase by comparing agent behavior with vs without CodeGraph. Use when the user runs /agent-eval or asks to test, benchmark, audit, or validate a codegraph version (the local dev build or a published npm version) against a language's repo.

AI & Machine Learning | tyler-r-kendrick/agent-sk...

microsoft-foundry

Use this skill to work with Microsoft Foundry (Azure AI Foundry): deploy AI models from catalog, build RAG applications with knowledge indexes, create and evaluate AI agents. USE FOR: Microsoft Foundry, AI Foundry, deploy model, model catalog, RAG, knowledge index, create agent, evaluate agent, agent monitoring. DO NOT USE FOR: Azure Functions (use azure-functions), App Service (use azure-create-app).

AI & Machine Learning | neolabhq/context-engineer...

sadd:tree-of-thoughts

Execute tasks through systematic exploration, pruning, and expansion using the Tree of Thoughts methodology with multi-agent evaluation.

AI & Machine Learning | the-perfect-developer/the...

google-adk

This skill should be used when the user asks to "build an agent with Google ADK", "use the Agent Development Kit", "create a Google ADK agent", "set up ADK tools", or needs guidance on Google's Agent Development Kit best practices, multi-agent systems, or agent evaluation.

AI & Machine Learning | aradotso/trending-skills

future-agi-platform

Expert skill for using Future AGI — the open-source end-to-end platform for evaluating, observing, and improving LLM and AI agent applications with tracing, evals, simulations, datasets, gateway, and guardrails.
