Search Results: llm-deployment

Found 19 Skills

AI & Machine Learning | davila7/claude-code-templ...

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

100

AI & Machine Learning | ancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

5 scripts/Attention

AI & Machine Learning | orq-ai/assistant-plugins

invoke-deployment

Invoke orq.ai deployments, agents, and models via the Python SDK or HTTP API. Use when a user wants to call a deployment with prompt variables, invoke an agent in a conversation, or call a model directly through the AI Router. Do NOT use for creating or editing deployments/agents (use optimize-prompt or build-agent). Do NOT use for running evaluations (use run-experiment).

AI & Machine Learning | martinholovsky/claude-ski...

model-quantization

Expert skill for AI model quantization and optimization. Covers 4-bit/8-bit quantization, GGUF conversion, memory optimization, and quality-performance tradeoffs for deploying LLMs in resource-constrained JARVIS environments.

AI & Machine Learning | jackspace/claudeskillz

cloudflare-workers-ai

Complete knowledge domain for Cloudflare Workers AI - Run AI models on serverless GPUs across Cloudflare's global network. Use when: implementing AI inference on Workers, running LLM models, generating text/images with AI, configuring Workers AI bindings, implementing AI streaming, using AI Gateway, integrating with embeddings/RAG systems, or encountering "AI_ERROR", rate limit errors, model not found, token limit exceeded, or neurons exceeded errors. Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, stable diffusion workers, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, token limit exceeded, neurons exceeded, ai quota exceeded, streaming failed, model unavailable, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize

AI & Machine Learning | g1joshi/agent-skills

mistral

Mistral AI efficient open models. Use for efficient AI.

AI & Machine Learning | pluginagentmarketplace/cu...

model-deployment

LLM deployment strategies including vLLM, TGI, and cloud inference endpoints.

1 scripts/Checked

AI & Machine Learning | vllm-project/vllm-skills

vllm-deploy-docker

Deploy vLLM using Docker (pre-built images or build-from-source) with NVIDIA GPU support and run the OpenAI-compatible server.

AI & Machine Learning | oaustegard/claude-skills

reviewing-ai-papers

Analyze AI/ML technical content (papers, articles, blog posts) and extract actionable insights filtered through enterprise AI engineering lens. Use when user provides URL/document for AI/ML content analysis, asks to "review this paper", or mentions technical content in domains like RAG, embeddings, fine-tuning, prompt engineering, LLM deployment.

AI & Machine Learning | scientiacapital/skills

unsloth-training

Fine-tune LLMs with Unsloth using GRPO or SFT. Supports FP8, vision models, mobile deployment, Docker, packing, GGUF export. Use when: train with GRPO, fine-tune, reward functions, SFT training, FP8 training, vision fine-tuning, phone deployment, docker training, packing, export to GGUF.

5 scripts/Checked

AI & Machine Learning | orchestra-research/ai-res...

llamaguard

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

AI & Machine Learning | truefoundry/tfy-deploy-sk...

truefoundry-llm-deploy

Deploys ML and LLM models on TrueFoundry with GPU inference servers (vLLM, TGI, NVIDIA NIM). Uses YAML manifests with `tfy apply`. Use when serving language models, deploying Hugging Face models, or hosting GPU-accelerated inference endpoints.

2 scripts/Attention