Search Results: llm-evaluation

Found 35 Skills

building-with-llms

Produce an LLM Build Pack (prompt+tool contract, data/eval plan, architecture+safety, launch checklist). Use for building with LLMs, GPT/Claude apps, prompt engineering, RAG, and tool-using agents.

AI & Machine Learning | g1joshi/agent-skills

mlflow

MLflow ML lifecycle management. Use for ML experiment tracking.

AI & Machine Learning | nvidia/skills

nel-assistant

Interactive config wizard for NeMo Evaluator Launcher (NEL). Use when the user wants to create a new evaluation config from scratch, set up an evaluation from existing configs, or modify a NEL config (deployment, tasks, multi-node, interceptors). ALWAYS triggers on mentions of creating configs, setting up evaluations, configuring models for evaluation, or modifying NEL YAML files. Do NOT use for monitoring, debugging, or analyzing already-running evaluations.

AI & Machine Learning | nvidia/skills

launching-evals

Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.

AI & Machine Learning | phrazzld/claude-config

llm-evaluation

LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo.
Invoke when:
- Setting up prompt evaluation or regression testing
- Integrating LLM testing into CI/CD pipelines
- Configuring security testing (red teaming, jailbreaks)
- Comparing prompt or model performance
- Building evaluation suites for RAG, factuality, or safety
Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing

AI & Machine Learning | bobmatnyc/claude-mpm-skil...

mcp-builder

MCP (Model Context Protocol) server build and evaluation guide, including local conventions for tool surfaces, config, and testing.

AI & Machine Learning | aradotso/ai-agent-skills

agent-skills-context-engineering

Master context engineering principles for building production-grade AI agent systems with effective context management, multi-agent architectures, and memory systems.

AI & Machine Learning | maragudk/evals-skills

failure-taxonomy

Build a structured taxonomy of failure modes from open-coded trace annotations. Use this skill whenever the user has freeform annotations from reviewing LLM traces and wants to cluster them into a coherent, non-overlapping set of binary failure categories (axial coding). Also use when the user mentions "failure modes", "error taxonomy", "axial coding", "cluster annotations", "categorize errors", "failure analysis", or wants to go from raw observation notes to structured evaluation criteria. This skill covers the full pipeline: grouping open codes, defining failure modes, re-labeling traces, and quantifying error rates.

AI & Machine Learning | alirezarezvani/claude-ski...

senior-prompt-engineer

This skill should be used when the user asks to "optimize prompts", "design prompt templates", "evaluate LLM outputs", "build agentic systems", "implement RAG", "create few-shot examples", "analyze token usage", or "design AI workflows". Use for prompt engineering patterns, LLM evaluation frameworks, agent architectures, and structured output design.

3 scripts/Checked

AI & Machine Learning | yonatangross/orchestkit

langfuse-observability

LLM observability platform for tracing, evaluation, prompt management, and cost tracking. Use when setting up Langfuse, monitoring LLM costs, tracking token usage, or implementing prompt versioning.

2 scripts/Attention

AI & Machine Learning | nvidia/skills

ad-accuracy-debug

Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.
