Search Results: llm

Found 1,288 Skills

AI & Machine Learningorchestra-research/ai-res...

llamaguard

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

write-judge-prompt

Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge — use validate-evaluator instead.

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

build-review-interface

Build a custom browser-based annotation interface tailored to your data for reviewing LLM traces and collecting structured feedback. Use when you need to build an annotation tool, review traces, or collect human labels.

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

validate-evaluator

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

generate-synthetic-data

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

eval-audit

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).

🇺🇸|EnglishTranslated

AI & Machine Learninghamelsmu/evals-skills

error-analysis

Help the user systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.

🇺🇸|EnglishTranslated

AI & Machine Learningcascade-protocol/agentbox

agentbox-inference

LLM inference via paid API: OpenAI-compatible chat completions proxied through x402 providers. Supports Kimi K2.5, MiniMax M2.5. Uses x_payment tool for automatic USDC micropayments ($0.001-$0.003/call). Use when: (1) generating text with a specific model, (2) running chat completions through a pay-per-request LLM endpoint, (3) comparing outputs across models.

🇺🇸|EnglishTranslated

AI & Machine Learningarize-ai/arize-skills

arize-trace

INVOKE THIS SKILL when downloading or exporting Arize traces and spans. Covers exporting traces by ID, sessions by ID, and debugging LLM application issues using the ax CLI.

🇺🇸|EnglishTranslated

AI & Machine Learningmathews-tom/praxis-skills

prompt-lab

Systematic LLM prompt engineering: analyzes existing prompts for failure modes, generates structured variants (direct, few-shot, chain-of-thought), designs evaluation rubrics with weighted criteria, and produces test case suites for comparing prompt performance. Triggers on: "prompt engineering", "prompt lab", "generate prompt variants", "A/B test prompts", "evaluate prompt", "optimize prompt", "write a better prompt", "prompt design", "prompt iteration", "few-shot examples", "chain-of-thought prompt", "prompt failure modes", "improve this prompt". Use this skill when designing, improving, or evaluating LLM prompts specifically. NOT for evaluating Claude Code skills or SKILL.md files — use skill-evaluator instead.

🇺🇸|EnglishTranslated

AI & Machine Learningahmad-tomy/google-adk-go

google-adk-go

Build, debug, and deploy Google Agent Development Kit (ADK) applications in Go using the exact adk-go v0.6.0 APIs and patterns. Use when a task involves ADK Go agent architecture, llmagent configuration, tools/toolsets, sessions/state, memory/artifacts, workflow agents, A2A/REST/web serving, telemetry/plugins, or migration/troubleshooting for google.golang.org/adk@v0.6.0.

🇺🇸|EnglishTranslated

AI & Machine Learningmodelslab/skills

modelslab-chat-generation

Chat with LLM models using ModelsLab's OpenAI-compatible Chat Completions API. Supports 60+ models including DeepSeek R1, Meta Llama, Google Gemini, Qwen, and Mistral with streaming, function calling, and structured outputs.

🇺🇸|EnglishTranslated