
TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

When to use TensorRT-LLM

Use TensorRT-LLM when:
  • Deploying on NVIDIA GPUs (A100, H100, GB200)
  • Need maximum throughput (24,000+ tokens/sec on Llama 3)
  • Require low latency for real-time applications
  • Working with quantized models (FP8, INT4, FP4)
  • Scaling across multiple GPUs or nodes
Use vLLM instead when:
  • Need simpler setup and Python-first API
  • Want PagedAttention without TensorRT compilation
  • Working with AMD GPUs or non-NVIDIA hardware
Use llama.cpp instead when:
  • Deploying on CPU or Apple Silicon
  • Need edge deployment without NVIDIA GPUs
  • Want simpler GGUF quantization format

Quick start

Installation

Docker (recommended)

```bash
docker pull nvidia/tensorrt_llm:latest
```

pip install

```bash
pip install tensorrt_llm==1.2.0rc3
```

Requires CUDA 13.0.0, TensorRT 10.13.2, and Python 3.10-3.12.

Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(max_tokens=100, temperature=0.7, top_p=0.9)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Serving with trtllm-serve

```bash
# Start the server (the model is downloaded and compiled automatically).
# --tp_size 4 enables tensor parallelism across 4 GPUs.
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --tp_size 4 \
  --max_batch_size 256 \
  --max_num_tokens 4096
```

Client request

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
      }'
```
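The same request can be made from Python with only the standard library, since trtllm-serve exposes an OpenAI-compatible endpoint. A minimal sketch; `build_chat_payload` is an illustrative helper, not a TensorRT-LLM API:

```python
import json
from urllib import request

def build_chat_payload(model, user_message, temperature=0.7, max_tokens=100):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("meta-llama/Meta-Llama-3-8B", "Hello!")
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```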

Key features

Performance optimizations

  • In-flight batching: Dynamic batching during generation
  • Paged KV cache: Efficient memory management
  • Flash Attention: Optimized attention kernels
  • Quantization: FP8, INT4, FP4 for 2-4× faster inference
  • CUDA graphs: Reduced kernel launch overhead
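How the paged KV cache and quantization interact is easy to quantify with a little arithmetic. A back-of-the-envelope sketch assuming Llama-3-8B-like dimensions (32 layers, 8 grouped-query KV heads, head dim 128); `kv_cache_bytes` is an illustrative helper, not a TensorRT-LLM API:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    # K and V each store one (seq_len, num_kv_heads, head_dim) tensor
    # per layer per sequence, hence the factor of 2.
    return 2 * num_layers * batch * seq_len * num_kv_heads * head_dim * dtype_bytes

# Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, 8K context
fp16_cache = kv_cache_bytes(32, 8, 128, 8192, batch=1, dtype_bytes=2)
fp8_cache = kv_cache_bytes(32, 8, 128, 8192, batch=1, dtype_bytes=1)
print(fp16_cache / 2**30, "GiB vs", fp8_cache / 2**30, "GiB")  # 1.0 GiB vs 0.5 GiB
```

The cache grows linearly with both batch size and sequence length, which is why paging matters: the runtime can allocate it block by block instead of reserving the worst case up front.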

Parallelism

  • Tensor parallelism (TP): Split model across GPUs
  • Pipeline parallelism (PP): Layer-wise distribution
  • Expert parallelism: For Mixture-of-Experts models
  • Multi-node: Scale beyond single machine
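The tensor-parallel idea can be sketched in plain Python: split a weight matrix along its output dimension, let each "GPU" compute its shard independently, then concatenate the slices (an all-gather in a real deployment). A toy illustration, not TensorRT-LLM code:

```python
def matvec(W, x):
    # Reference matrix-vector product: one dot product per output row.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, tp_size):
    # Split W's output rows into tp_size shards; each rank computes its
    # slice independently, and the slices are concatenated at the end.
    shard = len(W) // tp_size
    parts = [matvec(W[r * shard:(r + 1) * shard], x) for r in range(tp_size)]
    return [y for part in parts for y in part]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [10, 1]
assert tensor_parallel_matvec(W, x, tp_size=2) == matvec(W, x)
print(tensor_parallel_matvec(W, x, tp_size=2))  # [12, 34, 56, 78]
```

Each shard touches only its own rows of W, which is what lets a model larger than one GPU's memory be split across several.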

Advanced features

  • Speculative decoding: Faster generation with draft models
  • LoRA serving: Efficient multi-adapter deployment
  • Disaggregated serving: Separate prefill and generation
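Speculative decoding deserves a sketch: a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the agreed prefix plus its own correction. A toy greedy version with stand-in "models" (real implementations verify probabilistically and batch the target's checks):

```python
def speculative_step(target, draft, prefix, k=4):
    # Draft proposes k tokens greedily.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target verifies: keep the longest agreeing prefix, then emit the
    # target's own token at the first mismatch (or one bonus token).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        want = target(ctx)
        if want != t:
            accepted.append(want)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))
    return accepted

# Toy "models": next token depends only on the last token.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if ctx[-1] != 3 else 7  # diverges after 3

print(speculative_step(target, draft, [0]))  # [1, 2, 3, 4]
```

One verification pass yields four tokens here instead of one, and the output is identical to what the target alone would have produced greedily.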

Common patterns

Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load an FP8-quantized model (2× faster, 50% of the memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```

Multi-GPU deployment

```python
from tensorrt_llm import LLM

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```

Batch inference

```python
from tensorrt_llm import SamplingParams

# Process 100 prompts efficiently (llm initialized as above)
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
# In-flight batching is applied automatically for maximum throughput
```

Performance benchmarks

Meta Llama 3-8B (H100 GPU):
  • Throughput: 24,000 tokens/sec
  • Latency: ~10ms per token
  • vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):
  • FP8 quantization: 2× faster than FP16
  • Memory: 50% reduction with FP8
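The throughput and latency figures above are consistent with heavy batching. A back-of-the-envelope check, assuming the ~10 ms figure is the per-stream inter-token latency:

```python
aggregate_throughput = 24_000  # tokens/sec across all requests (H100, Llama 3-8B)
per_token_latency_s = 0.010    # ~10 ms between tokens for a single stream

# Little's law: concurrent streams ≈ aggregate rate × per-stream token interval
concurrent_streams = aggregate_throughput * per_token_latency_s
print(round(concurrent_streams))  # ~240 sequences in flight
```

In other words, the headline throughput comes from serving on the order of hundreds of sequences concurrently, not from a single stream generating 24,000 tokens per second.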

Supported models

  • LLaMA family: Llama 2, Llama 3, CodeLlama
  • GPT family: GPT-2, GPT-J, GPT-NeoX
  • Qwen: Qwen, Qwen2, QwQ
  • DeepSeek: DeepSeek-V2, DeepSeek-V3
  • Mixtral: Mixtral-8x7B, Mixtral-8x22B
  • Vision: LLaVA, Phi-3-vision
  • 100+ models on HuggingFace

References

  • Optimization Guide - Quantization, batching, KV cache tuning
  • Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
  • Serving Guide - Production deployment, monitoring, autoscaling

Resources