LLM Inference

High-performance inference engines for serving large language models.

Engine Comparison

| Engine | Best For | Hardware | Throughput | Setup |
| --- | --- | --- | --- | --- |
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |

Decision Guide

| Scenario | Recommendation |
| --- | --- |
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |

vLLM

Production-grade serving with PagedAttention for optimal GPU memory usage.

Key Innovations

| Feature | What It Does |
| --- | --- |
| PagedAttention | Non-contiguous KV cache for better memory utilization |
| Continuous batching | Dynamic request grouping for higher throughput |
| Speculative decoding | Small model drafts, large model verifies |

Strengths: Highest throughput, OpenAI-compatible API, multi-GPU support.
Limitations: GPU required, more complex setup.

Key concept: Serves OpenAI-compatible endpoints, making it a drop-in replacement for the OpenAI API.

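As a minimal sketch of that drop-in compatibility: the request body below is the same shape the OpenAI API uses, so any OpenAI-compatible client works once pointed at the vLLM server. The model name, port, and `vllm serve` invocation in the comments are illustrative, not prescribed by this document.

```python
import json

# Hypothetical setup: a vLLM server started with something like
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# exposes /v1/chat/completions, the same route the OpenAI API serves.

def chat_payload(model: str, prompt: str,
                 max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = chat_payload("meta-llama/Llama-3.1-8B-Instruct",
                       "Summarize PagedAttention in one sentence.")
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any
# OpenAI-compatible client; only the base URL (and a dummy API key) change.
```

Because only the base URL changes, existing OpenAI client code can usually be redirected to vLLM without edits to the request-building logic.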
llama.cpp

C++ inference for running models anywhere—laptops, phones, Raspberry Pi.

Quantization Formats (GGUF)

| Format | Size (7B) | Quality | Use Case |
| --- | --- | --- | --- |
| Q8_0 | ~7 GB | Highest | When you have the RAM |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |

Recommendation: Q4_K_M for the best quality/size balance.

Memory Requirements

| Model Size | Q4_K_M Size | RAM Needed |
| --- | --- | --- |
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
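The file sizes above follow directly from bits per weight. A back-of-envelope estimator (the ~4.5 effective bits/weight for Q4_K_M is an approximation; K-quants mix bit widths across layers):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB: parameter count times
    bits per weight, divided by 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# Q4_K_M averages roughly 4.5 effective bits/weight (approximation)
print(round(quant_size_gb(7, 4.5), 1))   # → 3.9, close to the ~4 GB above
print(round(quant_size_gb(70, 4.5), 1))  # → 39.4, near the ~38 GB above
```

Note that the RAM column is roughly double the file size, leaving headroom for the KV cache, activations, and the OS.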

Platform Optimization

| Platform | Key Setting |
| --- | --- |
| Apple Silicon | `n_gpu_layers=-1` (Metal offload) |
| CUDA GPU | `n_gpu_layers=-1` + `offload_kqv=True` |
| CPU only | `n_gpu_layers=0` + set `n_threads` to core count |

Strengths: Runs anywhere, GGUF format, Metal/CUDA support.
Limitations: Lower throughput than vLLM, single-user focused.
Key concept: GGUF format + quantization = run large models on consumer hardware.

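A sketch of selecting these settings at runtime, assuming the llama-cpp-python keyword names from the table; the CUDA check via an environment variable is a deliberate simplification, not robust GPU detection:

```python
import os
import platform

def llama_cpp_kwargs() -> dict:
    """Choose llama.cpp loader settings per the platform table above
    (a sketch; real CUDA detection would query the driver)."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return {"n_gpu_layers": -1}  # Apple Silicon: full Metal offload
    if os.environ.get("CUDA_VISIBLE_DEVICES"):  # crude CUDA probe
        return {"n_gpu_layers": -1, "offload_kqv": True}
    return {"n_gpu_layers": 0, "n_threads": os.cpu_count()}  # CPU only

print(llama_cpp_kwargs())
```

The returned dict can be splatted into the `Llama(...)` constructor, so one code path serves all three platforms.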
Key Optimization Concepts

| Technique | What It Does | When to Use |
| --- | --- | --- |
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16→int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |

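To make the speculative-decoding row concrete, here is a toy sketch of one draft-and-verify round. Both "models" are stand-ins (real implementations compare draft and target token probabilities and resample on rejection), but the control flow is the point: several tokens can be committed per expensive target-model pass.

```python
def draft_model(prefix: str, k: int) -> list[str]:
    # Stand-in for a small, fast model: propose k candidate tokens.
    return ["a", "a", "b", "a"][:k]

def target_accepts(prefix: str, token: str) -> bool:
    # Stand-in for the large model's verification of a single token.
    return token == "a"

def speculative_step(prefix: str, k: int = 4) -> list[str]:
    """One round: the draft model proposes k tokens; the target model
    verifies them left to right and keeps the accepted run, stopping
    at the first rejection."""
    accepted = []
    for tok in draft_model(prefix, k):
        if not target_accepts(prefix + "".join(accepted), tok):
            break
        accepted.append(tok)
    return accepted

print(speculative_step(""))  # → ['a', 'a'] (the 'b' draft is rejected)
```

When the draft model agrees with the target most of the time, the target does one verification pass per several generated tokens, which is where the latency win comes from.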
Common Parameters

| Parameter | Purpose | Typical Value |
| --- | --- | --- |
| `n_ctx` | Context window size | 2048-8192 |
| `n_gpu_layers` | Layers to offload | -1 (all) or 0 (none) |
| `temperature` | Randomness | 0.0-1.0 |
| `max_tokens` | Output limit | 100-2000 |
| `n_threads` | CPU threads | Match core count |

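A minimal llama-cpp-python sketch wiring these parameters together. It assumes `pip install llama-cpp-python` and a local GGUF file; the model path is a placeholder, and the loading call is left commented out since it needs a real model on disk:

```python
# Loader settings, per the parameter table above.
params = {
    "n_ctx": 4096,       # context window size
    "n_gpu_layers": -1,  # offload all layers (use 0 for CPU-only)
    "n_threads": 8,      # set to your physical core count
}

# from llama_cpp import Llama
# llm = Llama(model_path="model.Q4_K_M.gguf", **params)  # placeholder path
# out = llm("Q: What is GGUF? A:", max_tokens=200, temperature=0.2)
# print(out["choices"][0]["text"])
```

Loader settings (`n_ctx`, `n_gpu_layers`, `n_threads`) go to the constructor; sampling settings (`max_tokens`, `temperature`) go to each generation call.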
Troubleshooting

| Issue | Solution |
| --- | --- |
| Out of memory | Reduce `n_ctx`, use a smaller quant |
| Slow inference | Enable GPU offload, use a faster quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall with `-DLLAMA_METAL=on` |
| Poor quality | Use a higher quant (Q5_K_M, Q6_K) |

Resources
