Use when "LLM inference", "serving LLM", "vLLM", "llama.cpp", "GGUF", "text generation", "model serving", "inference optimization", "KV cache", "continuous batching", "speculative decoding", "local LLM", "CPU inference"
npx skill4agent add eyadsibai/ltk llm-inference

| Engine | Best For | Hardware | Throughput | Setup |
|---|---|---|---|---|
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |

| Scenario | Recommendation |
|---|---|
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |
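
For the production API server scenarios above, vLLM, TGI, Ollama, and llama.cpp's server all expose an OpenAI-compatible HTTP endpoint, so the same client code works regardless of engine. A minimal sketch, assuming a server is already running locally on port 8000 (the model name is an example):

```python
# Sketch: querying a locally running OpenAI-compatible endpoint (vLLM, TGI,
# Ollama, or llama.cpp's server). Assumes a server is listening on
# localhost:8000 and serving the model named below -- adjust both as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model name
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    max_tokens=200,
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Swapping engines only changes the base_url and the model name.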
| Feature | What It Does |
|---|---|
| PagedAttention | Non-contiguous KV cache, better memory utilization |
| Continuous batching | Dynamic request grouping for throughput |
| Speculative decoding | Small model drafts, large model verifies |
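
PagedAttention and continuous batching are applied automatically when you use vLLM; you only pass prompts and sampling parameters. A minimal offline-inference sketch (the model name is an example, any Hugging Face model you have access to works):

```python
# Minimal vLLM offline-inference sketch. PagedAttention and continuous batching
# happen internally; requests are grouped dynamically for throughput.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model, pulled from HF Hub
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain KV caching in two sentences.",
    "What is speculative decoding?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```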
| Format | Size (7B) | Quality | Use Case |
|---|---|---|---|
| Q8_0 | ~7 GB | Highest | When you have RAM |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |

| Model Size | Q4_K_M File Size | RAM Needed |
|---|---|---|
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
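
A sketch of fetching a quant that fits your RAM budget using llama-cpp-python's Llama.from_pretrained (requires huggingface-hub; the repo and filename pattern below are examples):

```python
# Sketch: downloading and loading a specific GGUF quant from the Hugging Face
# Hub. Pick the quant to match your RAM budget -- Q4_K_M for ~8 GB machines,
# Q5_K_M or Q6_K if you have headroom. Repo and filename are illustrative.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # example repo
    filename="*Q4_K_M.gguf",                           # glob selects the quant
    n_ctx=4096,
)
print(llm("Q: What is a KV cache? A:", max_tokens=128)["choices"][0]["text"])
```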
| Platform | Key Setting |
|---|---|
| Apple Silicon | Metal backend, n_gpu_layers=-1 |
| CUDA GPU | CUDA build, n_gpu_layers=-1 |
| CPU only | n_threads = physical core count |

| Technique | What It Does | When to Use |
|---|---|---|
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16→int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |
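
A sketch combining two of these techniques in vLLM, tensor parallelism plus quantized (AWQ) weights; tensor_parallel_size=2 assumes two visible GPUs and the checkpoint name is an example:

```python
# Sketch: serving a larger model by splitting it across GPUs and loading
# pre-quantized AWQ weights. Values below are illustrative, not a recipe.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # example pre-quantized checkpoint
    quantization="awq",
    tensor_parallel_size=2,        # shard layers across two GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU's memory vLLM may claim
)
```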
| Parameter | Purpose | Typical Value |
|---|---|---|
| n_ctx | Context window size | 2048-8192 |
| n_gpu_layers | Layers to offload | -1 (all) or 0 (none) |
| temperature | Randomness | 0.0-1.0 |
| max_tokens | Output limit | 100-2000 |
| n_threads | CPU threads | Match core count |
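
These parameters map directly onto llama-cpp-python's Llama constructor and call arguments; a sketch with a placeholder model path and example values:

```python
# Sketch: the table's knobs as llama-cpp-python arguments. n_gpu_layers=-1
# offloads every layer when a CUDA/Metal backend is available (use 0 for pure
# CPU); n_threads should roughly match your physical core count.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers
    n_threads=8,       # CPU threads for non-offloaded work
)

out = llm(
    "Write a haiku about memory bandwidth.",
    max_tokens=100,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```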
| Issue | Solution |
|---|---|
| Out of memory | Reduce n_ctx, use smaller quant |
| Slow inference | Enable GPU offload, use faster quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall llama-cpp-python with Metal support (CMAKE_ARGS="-DGGML_METAL=on") |
| Poor quality | Use higher quant (Q5_K_M, Q6_K) |