Search Results: llm-evaluation

Found 43 Skills

llm

Large Language Model development, training, fine-tuning, and deployment best practices.

ai-evals

Create an AI Evals Pack (eval PRD, test set, rubric, judge plan, results + iteration loop). Use for LLM evaluation, benchmarks, rubrics, error analysis/open coding, and ship/no-ship quality gates for AI features.

🇺🇸|EnglishTranslated

AI & Machine Learningsammcj/agentic-coding

deepeval

Use when discussing or working with DeepEval (the python AI evaluation framework)

🇺🇸|EnglishTranslated

AI & Machine Learninggithub/awesome-copilot

eval-driven-dev

Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.

🇺🇸|EnglishTranslated

AI & Machine Learningorq-ai/assistant-plugins

build-evaluator

Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use optimize-prompt) or when failure modes are unknown (use analyze-trace-failures first).

🇺🇸|EnglishTranslated

AI & Machine Learningorq-ai/assistant-plugins

generate-synthetic-dataset

Generate and curate evaluation datasets — structured generation via dimensions-tuples-NL, quick from description, expansion from existing data, plus dataset maintenance through deduplication, rebalancing, and gap-filling. Use when creating eval data, expanding test coverage, or cleaning datasets. Do NOT use when sufficient real production data exists (use analyze-trace-failures instead). Do NOT use for evaluator creation (use build-evaluator).

🇺🇸|EnglishTranslated

AI & Machine Learningbobmatnyc/claude-mpm-skil...

mcp-builder

MCP (Model Context Protocol) server build and evaluation guide, including local conventions for tool surfaces, config, and testing

🇺🇸|EnglishTranslated

AI & Machine Learningniracler/skill

note-to-blog

Use when user wants to find a note to publish as a blog post. Triggers on「选一篇笔记发博客」「note to blog」「写博客」「博客选题」. Scans Obsidian notes via Python script, evaluates blog-readiness, supports batch selection with fast/deep dual-track and parallel Agent dispatch.

🇺🇸|EnglishTranslated

1 scripts/Attention

AI & Machine Learningg1joshi/agent-skills

mlflow

MLflow ML lifecycle management. Use for ML experiment tracking.

🇺🇸|EnglishTranslated

AI & Machine Learningflora131/atomic

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of evaluating LLM output quality.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningshipshitdev/library

advanced-evaluation

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningoldwinter/skills

building-with-llms

Produce an LLM Build Pack (prompt+tool contract, data/eval plan, architecture+safety, launch checklist). Use for building with LLMs, GPT/Claude apps, prompt engineering, RAG, and tool-using agents.

🇺🇸|EnglishTranslated