Found 26 Skills
Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds available memory. Builds on the lora skill.
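A minimal sketch of the setup this skill automates, assuming the Hugging Face transformers, bitsandbytes, and peft packages; the model id, target modules, and LoRA hyperparameters below are illustrative, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization keeps the frozen base weights small in VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters are added on top of the quantized base.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```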
Qdrant vector database: collections, points, payload filtering, indexing, quantization, snapshots, and Docker/Kubernetes deployment.
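A sketch of the basic collection/point/filter workflow with the qdrant-client Python package (query_points assumes client 1.10+); the collection name, vector size, and payload fields are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

# Create a collection with cosine-similarity vectors.
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert points with a payload for later filtering.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1] * 384, payload={"lang": "en"})],
)

# Vector search restricted by a payload filter.
hits = client.query_points(
    collection_name="docs",
    query=[0.1] * 384,
    query_filter=Filter(
        must=[FieldCondition(key="lang", match=MatchValue(value="en"))]
    ),
    limit=5,
)
```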
Apply quantization to reduce memory usage by 4-32x and enable HNSW indexing for up to 150x faster search. Configure caching strategies and implement batch operations. Use when optimizing memory usage, improving search speed, or scaling to millions of vectors. Combined, these optimizations are claimed to deliver up to 12,500x end-to-end performance gains.
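A hedged sketch of turning on scalar quantization and tuning HNSW on an existing collection with qdrant-client; the collection name and parameter values are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.update_collection(
    collection_name="docs",
    # int8 scalar quantization: roughly 4x memory reduction, with the
    # quantized vectors pinned in RAM for fast first-pass scoring.
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
    # HNSW graph parameters trade index size and build time for search speed.
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),
)
```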
Use when integrating Foundation Models framework, implementing on-device AI with Apple Intelligence, building tool-calling AI features, working with guided generation schemas, converting models with Core ML and coremltools, or running open-source LLMs on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, MLTensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM).
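For the Core ML portion of this skill, a hedged sketch of converting a traced PyTorch module and palettizing its weights with coremltools (assumes coremltools 7+ for the ct.optimize.coreml API; the toy model, layer sizes, and output path are placeholders):

```python
import torch
import coremltools as ct

# A toy network standing in for a real model; layer sizes are illustrative.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example = torch.rand(1, 128)
traced = torch.jit.trace(model, example)

# Convert the traced module to a Core ML package.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    minimum_deployment_target=ct.target.iOS17,
)

# 6-bit k-means weight palettization to shrink the model for on-device use.
op_config = ct.optimize.coreml.OpPalettizerConfig(mode="kmeans", nbits=6)
opt_config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
compressed = ct.optimize.coreml.palettize_weights(mlmodel, opt_config)
compressed.save("TinyNet.mlpackage")
```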
PyTorch implementation of TurboQuant for LLM KV cache compression using two-stage vector quantization (random rotation + Lloyd-Max + QJL residual correction).
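To make the two-stage idea concrete, here is a simplified PyTorch sketch, not the repository's implementation: a uniform scalar quantizer stands in for the Lloyd-Max codebook, and a 1-bit sign code stands in for the QJL-style residual correction.

```python
import torch

def random_rotation(d: int, seed: int = 0) -> torch.Tensor:
    # Random orthogonal matrix via QR of a Gaussian matrix.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    return q

def quantize(x: torch.Tensor, rot: torch.Tensor, bits: int = 4):
    # Stage 1: rotate, then coarse scalar quantization
    # (uniform levels here, standing in for Lloyd-Max).
    xr = x @ rot
    scale = xr.abs().amax(dim=-1, keepdim=True) / (2 ** (bits - 1) - 1)
    coarse = torch.round(xr / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # Stage 2: 1-bit correction of the residual, scaled by its mean magnitude
    # (standing in for the QJL-style residual stage).
    residual = xr - coarse * scale
    res_scale = residual.abs().mean(dim=-1, keepdim=True)
    return coarse, scale, torch.sign(residual), res_scale

def dequantize(coarse, scale, res_sign, res_scale, rot):
    xr = coarse * scale + res_sign * res_scale
    return xr @ rot.T  # undo the rotation

x = torch.randn(8, 64)                # e.g. a slice of KV cache entries
rot = random_rotation(64)
parts = quantize(x, rot)
x_hat = dequantize(*parts, rot)
print((x - x_hat).norm() / x.norm())  # relative reconstruction error
```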
Convert HuggingFace transformer models to ONNX format for browser inference with Transformers.js and WebGPU. Use when given a HuggingFace model link to convert to ONNX, when setting up optimum-cli for ONNX export, when quantizing models (fp16, q8, q4) for web deployment, when configuring Transformers.js with WebGPU acceleration, or when troubleshooting ONNX conversion errors. Triggers on mentions of ONNX conversion, Transformers.js, WebGPU inference, optimum export, model quantization for browser, or running ML models in the browser.
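A hedged Python sketch of the export-and-quantize flow, assuming the optimum package with its onnxruntime extras; the usual route is the optimum-cli command line, and the model id, output paths, and quantization preset below are illustrative:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative model

# Export the HuggingFace checkpoint to ONNX.
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained("onnx-model")
tokenizer.save_pretrained("onnx-model")

# Dynamic int8 quantization, one way to produce a "q8" variant for the web.
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-model-q8", quantization_config=qconfig)
```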
Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?'. Also use when search quality degrades after quantization, model change, or data growth.
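One concrete knob this skill reaches for when quality drops after quantization is rescoring: fetch extra candidates from the quantized index and re-rank them against the original vectors. A qdrant-client sketch, with illustrative values:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

hits = client.query_points(
    collection_name="docs",
    query=[0.1] * 384,
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # still use the quantized index for speed
            rescore=True,      # re-rank candidates with the original vectors
            oversampling=2.0,  # fetch 2x candidates before rescoring
        )
    ),
)
```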
Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
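A minimal offline-inference sketch using TensorRT-LLM's high-level LLM API, assuming a recent tensorrt_llm release that ships it; the model id and sampling settings are illustrative, and engine build options, FP8/INT4 quantization, and multi-GPU parallelism are configured through additional arguments not shown here:

```python
from tensorrt_llm import LLM, SamplingParams

# The TensorRT engine is built on first load for the given model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Explain in-flight batching in one sentence."]
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```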
Expert skill for using TileKernels, a library of optimized GPU kernels for LLM operations (MoE routing, quantization, transpose, engram gating, Manifold HyperConnection) built with TileLang.
Create and manage Neo4j vector indexes, run vector similarity search (ANN/kNN), store embeddings on nodes or relationships, use SEARCH clause (Neo4j 2026.01+, preferred) or db.index.vector.queryNodes() procedure (deprecated 2026.04, still works on 2025.x), configure HNSW and quantization options, pick similarity function and embedding provider dimensions, and batch-update embeddings. Use when tasks involve CREATE VECTOR INDEX, vector.dimensions, cosine/euclidean search, embedding ingestion pipelines, or semantic nearest-neighbor lookup. Does NOT handle GraphRAG retrieval_query graph traversal — use neo4j-graphrag-skill. Does NOT handle fulltext/keyword indexes (FULLTEXT INDEX, db.index.fulltext) — use neo4j-cypher-skill. Does NOT handle GDS graph embeddings (FastRP, Node2Vec) — use neo4j-gds-skill.
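A sketch of the core Cypher this skill works with, run through the official neo4j Python driver; the index name, label, property, and dimensions are illustrative, and the procedure form is shown because it still works on 2025.x:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# Create a cosine-similarity vector index on Chunk.embedding.
driver.execute_query("""
    CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
    FOR (c:Chunk) ON (c.embedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 384,
        `vector.similarity_function`: 'cosine'
    }}
""")

# Approximate nearest-neighbor lookup via the procedure (deprecated 2026.04).
records, _, _ = driver.execute_query(
    """
    CALL db.index.vector.queryNodes('chunk_embeddings', $k, $embedding)
    YIELD node, score
    RETURN node.text AS text, score
    """,
    k=5,
    embedding=[0.1] * 384,
)
for record in records:
    print(record["score"], record["text"])

driver.close()
```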
vLLM Ascend plugin for LLM inference serving on Huawei Ascend NPU. Use for offline batch inference, API server deployment, quantization inference (with msmodelslim quantized models), tensor/pipeline parallelism for distributed serving, and OpenAI-compatible API endpoints. Supports Qwen, DeepSeek, GLM, LLaMA models with Ascend-optimized kernels.
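Offline batch inference looks the same on Ascend as on GPU once the vllm-ascend plugin is installed; a minimal sketch with an illustrative model id and parallelism setting:

```python
from vllm import LLM, SamplingParams

# With the vllm-ascend plugin installed, NPU device placement is handled
# by the platform plugin; the Python API itself is unchanged.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)

prompts = ["Summarize tensor parallelism in one sentence."]
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```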
Develop, debug, and optimize SGLang LLM serving engine. Use when the user mentions SGLang, sglang, srt, sgl-kernel, LLM serving, model inference, KV cache, attention backend, FlashInfer, MLA, MoE routing, speculative decoding, disaggregated serving, TP/PP/EP, radix cache, continuous batching, chunked prefill, CUDA graph, model loading, quantization FP8/GPTQ/AWQ, JIT kernel, triton kernel SGLang, or asks about serving LLMs with SGLang.
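Once an SGLang server is running (for example via `python -m sglang.launch_server --model-path <model> --port 30000`), it exposes an OpenAI-compatible API; a minimal client sketch using the openai package, with host, port, and model name as illustrative placeholders:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # the model the server was launched with (illustrative)
    messages=[{"role": "user", "content": "What is radix cache reuse?"}],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```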