Found 1,746 Skills
Use when evaluating LLMs, running benchmarks like MMLU/HumanEval/GSM8K, setting up evaluation pipelines, or asking about "NeMo Evaluator", "LLM benchmarking", "model evaluation", "MMLU", "HumanEval", "GSM8K", "benchmark harnesses"
Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.
Professionally evaluate story outlines, scoring them on market potential, innovation, and content highlights. Suitable for assessing story outline quality, judging IP adaptation potential, and making project approval decisions.
Defines evaluation criteria and scoring methodologies for deliverable assessment
Evaluate educational chapters from dual student and teacher perspectives. This skill should be used when analyzing chapter quality, identifying content gaps, or planning chapter improvements. Reads all lessons in a chapter directory and provides structured analysis with ratings, gap identification, and prioritized recommendations.
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
Evaluate and improve code modularization using the Balanced Coupling Model. Analyzes coupling strength, connascence types, and distance to identify refactoring opportunities and architectural improvements. Use when reviewing code architecture, refactoring modules, or designing new systems.
Consult this skill when building evaluation or scoring systems. Use when implementing evaluation systems, creating quality gates, designing scoring rubrics, or building decision frameworks. Do not use when a simple pass/fail check without scoring is all that is needed.
Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.
Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).
Comprehensive security and privacy evaluation system for MCP (Model Context Protocol) servers. Use when users provide GitHub URLs to MCP servers and request security assessment, privacy evaluation, or ask "is this MCP safe to use." Evaluates security vulnerabilities, privacy risks, code quality, community feedback, and provides actionable recommendations with risk scoring.
Comprehensive security and safety evaluation system for agent skills (.skill files). Use when users provide GitHub URLs, website links, or .skill files for download and request security assessment, safety evaluation, or ask "is this skill safe to use." Evaluates prompt injection risks, malicious code patterns, hidden instructions, data exfiltration attempts, and provides actionable recommendations with risk scoring.