LLM Inference
High-performance inference engines for serving large language models.
Engine Comparison
| Engine | Best For | Hardware | Throughput | Setup |
|---|---|---|---|---|
| vLLM | Production serving | GPU | Highest | Medium |
| llama.cpp | Local/edge, CPU | CPU/GPU | Good | Easy |
| TGI | HuggingFace models | GPU | High | Easy |
| Ollama | Local desktop | CPU/GPU | Good | Easiest |
| TensorRT-LLM | NVIDIA production | NVIDIA GPU | Highest | Complex |
Decision Guide
| Scenario | Recommendation |
|---|---|
| Production API server | vLLM or TGI |
| Maximum throughput | vLLM |
| Local development | Ollama or llama.cpp |
| CPU-only deployment | llama.cpp |
| Edge/embedded | llama.cpp |
| Apple Silicon | llama.cpp with Metal |
| Quick experimentation | Ollama |
| Privacy-sensitive (no cloud) | llama.cpp |
vLLM
Production-grade serving with PagedAttention for optimal GPU memory usage.
Key Innovations
| Feature | What It Does |
|---|---|
| PagedAttention | Non-contiguous KV cache, better memory utilization |
| Continuous batching | Dynamic request grouping for throughput |
| Speculative decoding | Small model drafts, large model verifies |
Strengths: Highest throughput, OpenAI-compatible API, multi-GPU
Limitations: GPU required, more complex setup
Key concept: Serves OpenAI-compatible endpoints—drop-in replacement for OpenAI API.
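Below is a minimal sketch of that drop-in pattern, assuming vLLM is installed and serving a model locally; the model name and port are illustrative, not prescriptive.

```python
# Start the server in a separate shell, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Then point the standard OpenAI client at it; only base_url and api_key change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not check the key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```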
llama.cpp
C++ inference for running models anywhere—laptops, phones, Raspberry Pi.
Quantization Formats (GGUF)
| Format | Size (7B) | Quality | Use Case |
|---|---|---|---|
| Q8_0 | ~7 GB | Highest | When you have RAM |
| Q6_K | ~6 GB | High | Good balance |
| Q5_K_M | ~5 GB | Good | Balanced |
| Q4_K_M | ~4 GB | OK | Memory constrained |
| Q2_K | ~2.5 GB | Low | Minimum viable |
Recommendation: Q4_K_M for best quality/size balance.
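One way to get such a file is to pull a single quantization from the Hugging Face Hub; a sketch assuming `huggingface_hub` is installed, with an illustrative repository and filename:

```python
from huggingface_hub import hf_hub_download

# Download only the Q4_K_M file from a GGUF repository (names are illustrative).
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(model_path)  # cached local path, ready to pass to llama.cpp
```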
Memory Requirements
| Model Size | Q4_K_M | RAM Needed |
|---|---|---|
| 7B | ~4 GB | 8 GB |
| 13B | ~7 GB | 16 GB |
| 30B | ~17 GB | 32 GB |
| 70B | ~38 GB | 64 GB |
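These figures follow roughly from parameter count times bits per weight; a back-of-the-envelope sketch, assuming ~4.5 bits/weight for Q4_K_M (an approximation, not an exact GGUF figure):

```python
def estimated_gguf_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough quantized file size in GB: parameters * bits per weight / 8."""
    return params_billion * bits_per_weight / 8  # billions of params -> GB of weights

for size_b in (7, 13, 30, 70):
    print(f"{size_b}B ~ {estimated_gguf_gb(size_b):.1f} GB as Q4_K_M")
# Budget noticeably more system RAM than the file size (see the RAM column above)
# to leave room for the KV cache and runtime overhead.
```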
Platform Optimization
| Platform | Key Setting |
|---|---|
| Apple Silicon | Build with Metal support; offload all layers (n_gpu_layers=-1) |
| CUDA GPU | Build with CUDA support; set n_gpu_layers to offload layers to the GPU |
| CPU only | Set n_threads to match the physical core count |
Strengths: Runs anywhere, GGUF format, Metal/CUDA support
Limitations: Lower throughput than vLLM, single-user focused
Key concept: GGUF format + quantization = run large models on consumer hardware.
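A sketch of the corresponding llama-cpp-python settings, assuming the package was installed with the right backend for your platform; the model path is illustrative:

```python
from llama_cpp import Llama

# GPU backends are enabled at install time, e.g.:
#   CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python   # Apple Silicon
#   CMAKE_ARGS="-DGGML_CUDA=on"  pip install llama-cpp-python   # NVIDIA CUDA
llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,  # offload every layer when a Metal/CUDA build is available
    n_threads=8,      # for CPU-only runs, match your physical core count
    n_ctx=4096,
)
```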
Key Optimization Concepts
| Technique | What It Does | When to Use |
|---|---|---|
| KV Cache | Reuse attention computations | Always (automatic) |
| Continuous Batching | Group requests dynamically | High-throughput serving |
| Tensor Parallelism | Split model across GPUs | Large models |
| Quantization | Reduce precision (fp16→int4) | Memory constrained |
| Speculative Decoding | Small model drafts, large verifies | Latency sensitive |
| GPU Offloading | Move layers to GPU | When GPU available |
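As one concrete case, tensor parallelism and quantization are both constructor arguments in vLLM's offline API; a minimal sketch, assuming two GPUs and an AWQ-quantized checkpoint (model name is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # illustrative AWQ checkpoint
    tensor_parallel_size=2,  # split the model across two GPUs
    quantization="awq",      # run the 4-bit AWQ weights
)

outputs = llm.generate(
    ["Explain continuous batching in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=96),
)
print(outputs[0].outputs[0].text)
```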
Common Parameters
| Parameter | Purpose | Typical Value |
|---|---|---|
| n_ctx | Context window size | 2048-8192 |
| n_gpu_layers | Layers to offload | -1 (all) or 0 (none) |
| temperature | Randomness | 0.0-1.0 |
| max_tokens | Output limit | 100-2000 |
| n_threads | CPU threads | Match core count |
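In llama-cpp-python, n_ctx, n_gpu_layers, and n_threads map onto the constructor while temperature and max_tokens go to the completion call; a short sketch extending the loading example above (model path illustrative):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # illustrative path
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers; use 0 for CPU-only
    n_threads=8,      # CPU threads
)

out = llm.create_completion(
    "List three uses of weight quantization.",
    max_tokens=256,   # output limit
    temperature=0.7,  # randomness
)
print(out["choices"][0]["text"])
```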
Troubleshooting
| Issue | Solution |
|---|---|
| Out of memory | Reduce n_ctx, use a smaller quant |
| Slow inference | Enable GPU offloading, use a faster (lower-bit) quant |
| Model won't load | Check GGUF integrity, check RAM |
| Metal not working | Reinstall llama-cpp-python with the Metal backend enabled (e.g. CMAKE_ARGS="-DGGML_METAL=on") |
| Poor quality | Use higher quant (Q5_K_M, Q6_K) |
Resources
- vLLM: https://docs.vllm.ai
- llama.cpp: https://github.com/ggerganov/llama.cpp
- TGI: https://huggingface.co/docs/text-generation-inference
- Ollama: https://ollama.ai
- GGUF Models: https://huggingface.co/TheBloke