llm-inference

Use when "LLM inference", "serving LLM", "vLLM", "llama.cpp", "GGUF", "text generation", "model serving", "inference optimization", "KV cache", "continuous batching", "speculative decoding", "local LLM", "CPU inference"

NPX Install

npx skill4agent add eyadsibai/ltk llm-inference

LLM Inference

High-performance inference engines for serving large language models.

Engine Comparison

| Engine | Best For | Hardware | Throughput | Setup |
| --- | --- | --- | --- | --- |
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |

Decision Guide

| Scenario | Recommendation |
| --- | --- |
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |

vLLM

Production-grade serving with PagedAttention for optimal GPU memory usage.

Key Innovations

| Feature | What It Does |
| --- | --- |
| PagedAttention | Non-contiguous KV cache, better memory utilization |
| Continuous batching | Dynamic request grouping for throughput |
| Speculative decoding | Small model drafts, large model verifies |

Strengths: Highest throughput, OpenAI-compatible API, multi-GPU
Limitations: GPU required, more complex setup
Key concept: Serves OpenAI-compatible endpoints—drop-in replacement for OpenAI API.
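
Because the endpoint follows the OpenAI schema, existing OpenAI client code usually only needs a `base_url` change. A minimal sketch, assuming a vLLM server has already been started separately (for example with `vllm serve <model>`) and is listening on the default port 8000; the model name and port are illustrative assumptions:

```python
# Query a locally running vLLM server through the standard OpenAI client.
# Assumes the server was launched separately, e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`;
# the model name and port 8000 are placeholders for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the client at vLLM instead of api.openai.com
    api_key="EMPTY",                      # vLLM does not check the key unless one is configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server is serving
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```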

llama.cpp

C++ inference for running models anywhere—laptops, phones, Raspberry Pi.

Quantization Formats (GGUF)

| Format | Size (7B) | Quality | Use Case |
| --- | --- | --- | --- |
| Q8_0 | ~7 GB | Highest | When you have RAM |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |

Recommendation: Q4_K_M for best quality/size balance.
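
Quantized GGUF files are typically fetched from a model hub. A small sketch using `huggingface_hub`, where the repository and filename are placeholders rather than a specific recommendation:

```python
# Download a Q4_K_M GGUF and check its size on disk.
# repo_id and filename are illustrative placeholders -- substitute a real GGUF release.
import os
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(f"{path}: {os.path.getsize(path) / 1e9:.1f} GB")  # roughly 4 GB for a 7B model at Q4_K_M
```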

Memory Requirements

| Model Size | Q4_K_M File | RAM Needed |
| --- | --- | --- |
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
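
These numbers follow a rough rule of thumb: file size ≈ parameter count × bits per weight ÷ 8, plus headroom for the KV cache, context, and OS. A back-of-the-envelope sketch (the bits-per-weight figures below are approximate averages, not exact values):

```python
# Rough GGUF file-size estimate: params (billions) x approx bits-per-weight / 8 -> GB.
# The bits-per-weight values are approximate averages per quant, not exact figures.
APPROX_BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 3.0}

def approx_file_gb(params_billion: float, quant: str) -> float:
    # billions of parameters and gigabytes cancel, so no 1e9 factors are needed
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8

print(approx_file_gb(7, "Q4_K_M"))   # ~4.2 GB, in line with the ~4 GB row above
print(approx_file_gb(70, "Q4_K_M"))  # ~42 GB, same ballpark as the ~38 GB row; the RAM column adds headroom
```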

Platform Optimization

| Platform | Key Setting |
| --- | --- |
| Apple Silicon | `n_gpu_layers=-1` (Metal offload) |
| CUDA GPU | `n_gpu_layers=-1` + `offload_kqv=True` |
| CPU only | `n_gpu_layers=0` + set `n_threads` to core count |

Strengths: Runs anywhere, GGUF format, Metal/CUDA support
Limitations: Lower throughput than vLLM, single-user focused
Key concept: GGUF format + quantization = run large models on consumer hardware.
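
A minimal llama-cpp-python sketch of these settings; the model path is a placeholder, and a Metal- or CUDA-enabled build of the package is assumed where GPU offload is used:

```python
import os
from llama_cpp import Llama

MODEL_PATH = "models/llama-7b.Q4_K_M.gguf"  # placeholder: point at a local GGUF file

# Apple Silicon (Metal) or CUDA build: offload every layer to the GPU.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,            # -1 offloads all layers; set 0 for CPU-only
    n_ctx=4096,                 # context window
    n_threads=os.cpu_count(),   # mainly matters when running on CPU
)

# CPU-only variant:
#   Llama(model_path=MODEL_PATH, n_gpu_layers=0, n_threads=os.cpu_count())
```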

Key Optimization Concepts

| Technique | What It Does | When to Use |
| --- | --- | --- |
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16→int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |
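
Several of these can be combined in one engine configuration. A hedged sketch using vLLM's offline Python API with tensor parallelism and quantized weights; the model name, GPU count, and quantization method are assumptions for illustration:

```python
from vllm import LLM, SamplingParams

# Placeholder configuration: an AWQ-quantized 13B model sharded across 2 GPUs.
llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed quantized checkpoint
    tensor_parallel_size=2,                 # tensor parallelism: split weights across 2 GPUs
    quantization="awq",                     # int4 weights to reduce memory pressure
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

KV caching and request batching are handled automatically by vLLM; the explicit settings only need to cover parallelism and quantization.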

Common Parameters

| Parameter | Purpose | Typical Value |
| --- | --- | --- |
| `n_ctx` | Context window size | 2048-8192 |
| `n_gpu_layers` | Layers to offload | -1 (all) or 0 (none) |
| `temperature` | Randomness | 0.0-1.0 |
| `max_tokens` | Output limit | 100-2000 |
| `n_threads` | CPU threads | Match core count |
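
These map directly onto a llama-cpp-python generation call; a short sketch with a placeholder model path:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-7b.Q4_K_M.gguf",  # placeholder
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers when a GPU build is available
    n_threads=8,       # match physical core count for CPU-only runs
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "List three uses of a KV cache."}],
    temperature=0.7,   # randomness
    max_tokens=256,    # output limit
)
print(out["choices"][0]["message"]["content"])
```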

Troubleshooting

| Issue | Solution |
| --- | --- |
| Out of memory | Reduce `n_ctx`, use a smaller quant |
| Slow inference | Enable GPU offload, use a faster quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall with `-DLLAMA_METAL=on` |
| Poor quality | Use a higher quant (Q5_K_M, Q6_K) |

Resources