Search Results: quantization

Found 26 Skills

vector-index-tuning

Optimize vector index performance for latency, recall, and memory. Use when tuning HNSW parameters, selecting quantization strategies, or scaling vector search infrastructure.

🇺🇸|EnglishTranslated

AI & Machine Learningomer-metin/skills-for-ant...

model-optimization

Use when reducing model size, improving inference speed, or deploying to edge devices - covers quantization, pruning, knowledge distillation, ONNX export, and TensorRT optimizationUse when ", " mentioned.

🇺🇸|EnglishTranslated

AI & Machine Learningmartinholovsky/claude-ski...

model-quantization

Expert skill for AI model quantization and optimization. Covers 4-bit/8-bit quantization, GGUF conversion, memory optimization, and quality-performance tradeoffs for deploying LLMs in resource-constrained JARVIS environments.

🇺🇸|EnglishTranslated

AI & Machine Learningdoanchienthangdev/omgkit

efficient-ai

Efficient AI techniques including model compression, quantization, pruning, knowledge distillation, and hardware-aware optimization for production systems.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

awq-quantization

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

hqq-quantization

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

🇺🇸|EnglishTranslated

AI & Machine Learningtimescale/pg-aiguide

pgvector-semantic-search

Use this skill for setting up vector similarity search with pgvector for AI/ML embeddings, RAG applications, or semantic search. **Trigger when user asks to:** - Store or search vector embeddings in PostgreSQL - Set up semantic search, similarity search, or nearest neighbor search - Create HNSW or IVFFlat indexes for vectors - Implement RAG (Retrieval Augmented Generation) with PostgreSQL - Optimize pgvector performance, recall, or memory usage - Use binary quantization for large vector datasets **Keywords:** pgvector, embeddings, semantic search, vector similarity, HNSW, IVFFlat, halfvec, cosine distance, nearest neighbor, RAG, LLM, AI search Covers: halfvec storage, HNSW index configuration (m, ef_construction, ef_search), quantization strategies, filtered search, bulk loading, and performance tuning.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

serving-llms-vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

🇺🇸|EnglishTranslated

AI & Machine Learningbagelhole/devops-security...

llm-cost-optimization

Reduce LLM API and infrastructure costs through model selection, prompt caching, batching, caching, quantization, and self-hosting strategies. Track spend by team and model, set budgets, and implement cost-aware routing.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

🇺🇸|EnglishTranslated

AI & Machine Learningdavila7/claude-code-templ...

gguf-quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

🇺🇸|EnglishTranslated