Search Results: llm-evaluation

Found 54 Skills

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

AI & Machine Learning | phrazzld/claude-config

llm-evaluation

LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. Invoke when:
- Setting up prompt evaluation or regression testing
- Integrating LLM testing into CI/CD pipelines
- Configuring security testing (red teaming, jailbreaks)
- Comparing prompt or model performance
- Building evaluation suites for RAG, factuality, or safety

Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing

AI & Machine Learning | posthog/skills

exploring-llm-evaluations

Investigate LLM analytics evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).

AI & Machine Learning | google/agents-cli

google-agents-cli-eval

This skill should be used when the user wants to "run an evaluation", "evaluate my ADK agent", "write an evalset", "debug eval scores", "compare eval results", or needs guidance on ADK (Agent Development Kit) evaluation methodology and the eval-fix loop. Covers eval metrics, evalset schema, LLM-as-judge, tool trajectory scoring, and common failure causes. Part of the Google ADK (Agent Development Kit) skills suite. Do NOT use for API code patterns (use google-agents-cli-adk-code), deployment (use google-agents-cli-deploy), or project scaffolding (use google-agents-cli-scaffold).

26.2k

AI & Machine Learning | datocms/agent-skills

eval-triggers

Run the trigger evaluation pipeline — classify, analyze, and optionally compare against a baseline. Only run when explicitly asked — evals are expensive.

AI & Machine Learning | nvidia/skills

nemo-evaluator-plugin

Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, or plugin-owned Evaluator skills.

AI & Machine Learning | sammcj/agentic-coding

deepeval

Use when discussing or working with DeepEval (the Python AI evaluation framework).

AI & Machine Learning | openai/skills

codex-readiness-integration-test

Run the Codex Readiness integration test. Use when you need an end-to-end agentic loop with build/test scoring.

AI & Machine Learning | hamelsmu/evals-skills

eval-audit

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).

AI & Machine Learning | akillness/oh-my-skills

langsmith

Instrument, trace, evaluate, and monitor LLM applications and AI agents with LangSmith. Use when setting up observability for LLM pipelines, running offline or online evaluations, managing prompts in the Prompt Hub, creating datasets for regression testing, or deploying agent servers. Triggers on: langsmith, langchain tracing, llm tracing, llm observability, llm evaluation, trace llm calls, @traceable, wrap_openai, langsmith evaluate, langsmith dataset, langsmith feedback, langsmith prompt hub, langsmith project, llm monitoring, llm debugging, llm quality, openevals, langsmith cli, langsmith experiment, annotate llm, llm judge.

AI & Machine Learning | gyanesh-m/skills

galileo-python-sdk

Complete reference for the Galileo AI platform Python SDK for evaluating, observing, and protecting GenAI applications. Use when building Python applications that need LLM evaluation, production observability, tracing, or runtime guardrails with Galileo.

AI & Machine Learning | orq-ai/assistant-plugins

run-experiment

Create and run orq.ai experiments — compare configurations against datasets using evaluators, analyze results, and generate prioritized action plans. Use when evaluating LLM agents, deployments, conversations, or RAG pipelines end-to-end. Do NOT use without a dataset and evaluators. Do NOT use for cross-framework comparisons with external agents (use compare-agents).
