
TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:
  • Deploying on NVIDIA GPUs (A100, H100, GB200)
  • Need maximum throughput (24,000+ tokens/sec on Llama 3)
  • Require low latency for real-time applications
  • Working with quantized models (FP8, INT4, FP4)
  • Scaling across multiple GPUs or nodes
Use vLLM instead when:
  • Need simpler setup and Python-first API
  • Want PagedAttention without TensorRT compilation
  • Working with AMD GPUs or non-NVIDIA hardware
Use llama.cpp instead when:
  • Deploying on CPU or Apple Silicon
  • Need edge deployment without NVIDIA GPUs
  • Want simpler GGUF quantization format

Quick start

Installation

Docker (recommended)

```bash
docker pull nvidia/tensorrt_llm:latest
```

pip install

```bash
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, and Python 3.10-3.12.

Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(max_tokens=100, temperature=0.7, top_p=0.9)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Serving with trtllm-serve

```bash
# Start the server (the model is downloaded and compiled automatically).
# --tp_size 4 enables tensor parallelism across 4 GPUs.
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096
```

Client request

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
      }'
```
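The same request can be made from Python with only the standard library, since trtllm-serve exposes an OpenAI-compatible endpoint. A minimal sketch; `build_chat_payload` is an illustrative helper, not a TensorRT-LLM API:

```python
import json
from urllib import request

def build_chat_payload(model, user_message, temperature=0.7, max_tokens=100):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("meta-llama/Meta-Llama-3-8B", "Hello!")
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```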

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation
  • Paged KV cache: Efficient memory management
  • Flash Attention: Optimized attention kernels
  • Quantization: FP8, INT4, FP4 for 2-4× faster inference
  • CUDA graphs: Reduced kernel launch overhead
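How the paged KV cache and quantization interact is easy to quantify with a little arithmetic. A back-of-the-envelope sketch assuming Llama-3-8B-like dimensions (32 layers, 8 grouped-query KV heads, head dim 128); `kv_cache_bytes` is an illustrative helper, not a TensorRT-LLM API:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    # K and V each store one (seq_len, num_kv_heads, head_dim) tensor
    # per layer per sequence, hence the factor of 2.
    return 2 * num_layers * batch * seq_len * num_kv_heads * head_dim * dtype_bytes

# Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, 8K context
fp16_cache = kv_cache_bytes(32, 8, 128, 8192, batch=1, dtype_bytes=2)
fp8_cache = kv_cache_bytes(32, 8, 128, 8192, batch=1, dtype_bytes=1)
print(fp16_cache / 2**30, "GiB vs", fp8_cache / 2**30, "GiB")  # 1.0 GiB vs 0.5 GiB
```

The cache grows linearly with both batch size and sequence length, which is why paging matters: the runtime can allocate it block by block instead of reserving the worst case up front.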

Parallelism

  • Tensor parallelism (TP): Split model across GPUs
  • Pipeline parallelism (PP): Layer-wise distribution
  • Expert parallelism: For Mixture-of-Experts models
  • Multi-node: Scale beyond single machine
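The tensor-parallel idea can be sketched in plain Python: split a weight matrix along its output dimension, let each "GPU" compute its shard independently, then concatenate the slices (an all-gather in a real deployment). A toy illustration, not TensorRT-LLM code:

```python
def matvec(W, x):
    # Reference matrix-vector product: one dot product per output row.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, tp_size):
    # Split W's output rows into tp_size shards; each rank computes its
    # slice independently, and the slices are concatenated at the end.
    shard = len(W) // tp_size
    parts = [matvec(W[r * shard:(r + 1) * shard], x) for r in range(tp_size)]
    return [y for part in parts for y in part]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 1]
assert tensor_parallel_matvec(W, x, tp_size=2) == matvec(W, x)
print(tensor_parallel_matvec(W, x, tp_size=2))  # [12, 34, 56, 78]
```

Each shard touches only its own rows of W, which is what lets a model larger than one GPU's memory be split across several.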

Advanced features

  • Speculative decoding: Faster generation with draft models
  • LoRA serving: Efficient multi-adapter deployment
  • Disaggregated serving: Separate prefill and generation
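Speculative decoding deserves a sketch: a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the agreed prefix plus its own correction. A toy greedy version with stand-in "models" (real implementations verify probabilistically and batch the target's checks):

```python
def speculative_step(target, draft, prefix, k=4):
    # Draft proposes k tokens greedily.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target verifies: keep the longest agreeing prefix, then emit the
    # target's own token at the first mismatch (or one bonus token).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        want = target(ctx)
        if want != t:
            accepted.append(want)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))
    return accepted

# Toy "models": next token depends only on the last token.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if ctx[-1] != 3 else 7  # diverges after 3

print(speculative_step(target, draft, [0]))  # [1, 2, 3, 4]
```

One verification pass yields four tokens here instead of one, and the output is identical to what the target alone would have produced greedily.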

Common patterns

Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load an FP8-quantized model (2× faster, 50% of the memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```

Multi-GPU deployment

```python
from tensorrt_llm import LLM

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```

Batch inference

```python
from tensorrt_llm import SamplingParams

# Process 100 prompts efficiently (llm initialized as above)
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
# In-flight batching is applied automatically for maximum throughput
```

Performance benchmarks

Meta Llama 3-8B (H100 GPU):
  • Throughput: 24,000 tokens/sec
  • Latency: ~10ms per token
  • vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):
  • FP8 quantization: 2× faster than FP16
  • Memory: 50% reduction with FP8
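The throughput and latency figures above are consistent with heavy batching. A back-of-the-envelope check, assuming the ~10 ms figure is the per-stream inter-token latency:

```python
aggregate_throughput = 24_000  # tokens/sec across all requests (H100, Llama 3-8B)
per_token_latency_s = 0.010    # ~10 ms between tokens for a single stream

# Little's law: concurrent streams ≈ aggregate rate × per-stream token interval
concurrent_streams = aggregate_throughput * per_token_latency_s
print(round(concurrent_streams))  # ~240 sequences in flight
```

In other words, the headline throughput comes from serving on the order of hundreds of sequences concurrently, not from a single stream generating 24,000 tokens per second.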

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama
  • GPT family: GPT-2, GPT-J, GPT-NeoX
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Mixtral: Mixtral-8x7B, Mixtral-8x22B
  • Vision: LLaVA, Phi-3-vision
  • 100+ models on HuggingFace

References

  • Optimization Guide - Quantization, batching, KV cache tuning
  • Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
  • Serving Guide - Production deployment, monitoring, autoscaling

Resources