TensorRT-LLM
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
When to use TensorRT-LLM
Use TensorRT-LLM when:
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes
Use vLLM instead when:
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware
Use llama.cpp instead when:
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format
Quick start
Installation
```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.text)
```
Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# --tp_size 4: tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
Key features
Performance optimizations
- In-flight batching: Dynamic batching during generation
- Paged KV cache: Efficient memory management
- Flash Attention: Optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: Reduced kernel launch overhead
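The paged KV cache works much like virtual memory: the cache is carved into fixed-size blocks, and each sequence holds a page table of block IDs instead of one contiguous allocation, so memory is never reserved for tokens that were not generated. A toy allocator illustrating the idea (not TensorRT-LLM's actual implementation):

```python
class PagedKVCache:
    """Toy block allocator mimicking a paged KV cache."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size          # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}                 # seq_id -> list of block ids

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.page_tables.setdefault(seq_id, [])
        if num_tokens_so_far % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))


cache = PagedKVCache(num_blocks=8, block_size=16)
for t in range(40):                      # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token("seq0", t)
print(len(cache.page_tables["seq0"]))    # 3
cache.free("seq0")
print(len(cache.free_blocks))            # 8 -- all blocks reusable
```

In-flight batching builds on the same pool: finished sequences release blocks mid-batch, and new requests claim them without waiting for the whole batch to drain.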
Parallelism
- Tensor parallelism (TP): Split model across GPUs
- Pipeline parallelism (PP): Layer-wise distribution
- Expert parallelism: For Mixture-of-Experts models
- Multi-node: Scale beyond single machine
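Tensor parallelism splits individual weight matrices across GPUs: a linear layer's weight is sharded column-wise, each rank computes a partial output from its shard, and the shards are concatenated (an all-gather in practice). A minimal single-process sketch of the math, using nested lists in place of GPU tensors:

```python
def matmul(x, w):
    # x: (m, k), w: (k, n), both as nested lists
    return [[sum(x[i][p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, tp_size):
    # Shard a (k, n) weight column-wise into tp_size pieces
    n = len(w[0])
    per_rank = n // tp_size
    return [[row[r * per_rank:(r + 1) * per_rank] for row in w]
            for r in range(tp_size)]

x = [[1.0, 2.0]]                        # one input row, k=2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]              # k=2, n=4
shards = split_columns(w, tp_size=2)    # each "GPU" holds a (2, 2) shard
partials = [matmul(x, shard) for shard in shards]
# All-gather: concatenate the partial outputs along the column axis
out = [sum((p[0] for p in partials), [])]
assert out == matmul(x, w)              # sharded result matches the full matmul
```

Pipeline parallelism instead assigns whole layers to different ranks, trading the per-layer communication above for activations passed between pipeline stages.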
Advanced features
- Speculative decoding: Faster generation with draft models
- LoRA serving: Efficient multi-adapter deployment
- Disaggregated serving: Separate prefill and generation
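Speculative decoding lets a small draft model propose several tokens cheaply, then the large target model verifies them in a single forward pass, keeping the longest agreeing prefix plus one corrected token. A toy greedy version with lookup tables standing in for models (the table contents are made-up data; real implementations accept/reject probabilistically rather than by exact match):

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """Draft k tokens greedily, then keep the prefix the target agrees with."""
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(tuple(ctx))
        draft.append(tok)
        ctx.append(tok)

    accepted = []
    ctx = list(prefix)
    for tok in draft:
        target_tok = target_model(tuple(ctx))
        if target_tok != tok:
            accepted.append(target_tok)   # take the target's correction and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Stand-in "models": next-token lookup tables (hypothetical data)
target = {("the",): "quick", ("the", "quick"): "brown",
          ("the", "quick", "brown"): "fox", ("the", "quick", "brown", "fox"): "jumps"}
draft  = {("the",): "quick", ("the", "quick"): "brown",
          ("the", "quick", "brown"): "cat", ("the", "quick", "brown", "cat"): "ran"}

out = speculative_step(draft.get, target.get, ["the"], k=4)
print(out)   # ['quick', 'brown', 'fox'] -- two draft tokens accepted, one corrected
```

One target-model pass here yields three tokens instead of one, which is where the speedup comes from when the draft model agrees often.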
Common patterns
Quantized model (FP8)
```python
from tensorrt_llm import LLM

# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192
)

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```
Multi-GPU deployment

```python
# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8"
)
```
Batch inference

```python
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)
# Automatic in-flight batching for maximum throughput
```
Performance benchmarks
Meta Llama 3-8B (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
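The 50% figure follows directly from bytes per parameter: FP16 stores each weight in 2 bytes, FP8 in 1. A back-of-the-envelope estimate of weight memory only (KV cache and activations add more on top):

```python
def weight_gb(num_params_b, bytes_per_param):
    # num_params_b: parameter count in billions
    return num_params_b * 1e9 * bytes_per_param / 1e9

for params in (8, 70, 405):
    fp16 = weight_gb(params, 2)
    fp8 = weight_gb(params, 1)
    print(f"{params}B params: FP16 ~ {fp16:.0f} GB, FP8 ~ {fp8:.0f} GB")
# 70B: FP16 ~ 140 GB vs FP8 ~ 70 GB of weights -- the 50% reduction above
```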
Supported models
- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on HuggingFace
References
- Optimization Guide - Quantization, batching, KV cache tuning
- Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
- Serving Guide - Production deployment, monitoring, autoscaling