trtllm-serve-config-guide

Serve Config Guide

Scope: aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; DeepSeek-R1 MTP is the standard mode (all checked-in configs include it).

Input: model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified). Output: repo-grounded starting YAML for `trtllm-serve --config`.

If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.

Constraints

1. Speculative exclusion: Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim; never interpolate speculative fields.
2. Objective preservation: Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency` / `High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point; do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.
3. Source preference: Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
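
The speculative exclusion rule above can be sketched as a small filter. This is a minimal illustration, assuming each config has already been parsed into a Python dict; the helper name `is_allowed` and the substring test on the model name are hypothetical and not part of the repository.

```python
from typing import Any


def is_allowed(config: dict[str, Any], model: str) -> bool:
    """Return True if a parsed config passes the speculative exclusion rule.

    Hypothetical helper: identifying DeepSeek-R1 by substring is an assumption.
    """
    spec = config.get("speculative_config")
    if spec is None:
        # Non-speculative configs are always in scope.
        return True
    # Exception: DeepSeek-R1 configs whose speculative block declares MTP.
    is_r1 = "deepseek-r1" in model.lower()
    is_mtp = isinstance(spec, dict) and spec.get("decoding_type") == "MTP"
    return is_r1 and is_mtp
```

A config that passes through the MTP exception should still have its `speculative_config` block copied unchanged; the filter only decides whether the config is eligible.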

Response Format

For exact matches: Config → Source → Launch command

For interpolated configs: Config → Source used as starting point → What to benchmark (single list of knobs worth sweeping, not per-field unverified tags)

Step 0: Lock Objective and Decode Mode

Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per Constraint 1). Preserve both through the remaining steps.

Step 1: Exact Database Match

Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.

Apply speculative exclusion.
When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per objective preservation.
Prefer an exact match that also matches the stated objective over manual tuning.
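
The exact-match step can be sketched as a scan over lookup entries. This is an illustration only: it assumes the entries are already loaded into a list of dicts keyed by the six fields named above, and the `profile` key used for the objective label is a hypothetical name, since the actual schema of `lookup.yaml` is not shown here.

```python
from typing import Any, Optional

# The six fields that define an exact match in Step 1.
KEY_FIELDS = ("model", "gpu", "isl", "osl", "concurrency", "num_gpus")


def find_exact(
    entries: list[dict[str, Any]],
    request: dict[str, Any],
    objective: Optional[str] = None,
) -> tuple[Optional[dict[str, Any]], Optional[bool]]:
    """Return (entry, objective_matches) for the first exact match.

    objective_matches is None when the objective is unspecified or the
    entry carries no label, so the caller never claims an unverified match.
    """
    for entry in entries:
        if all(entry.get(k) == request.get(k) for k in KEY_FIELDS):
            label = entry.get("profile")  # hypothetical label field
            if objective is None or label is None:
                return entry, None
            return entry, label == objective
    return None, None
```

A `False` second value is the case where the only match conflicts with the stated objective, which the response should call out rather than hide.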

Step 2: Nearest Checked-In Config

If there is no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.

Apply the same constraints as Step 1. Additionally:

A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
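
One way to picture the nearest-config selection is a simple ranking over same-model candidates. The guide does not define a distance metric, so the scoring below (count of matching fields, with `database/` winning ties) is purely an assumption, as are the `source` and `path` keys and the filename-based exclusion of disaggregated or prefill-only entries.

```python
from typing import Any, Optional

MATCH_FIELDS = ("gpu", "isl", "osl", "concurrency", "num_gpus")


def nearest(
    candidates: list[dict[str, Any]], request: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Pick a same-model starting point, preferring database/ over curated/."""
    pool = [
        c
        for c in candidates
        if c["model"] == request["model"]
        # Hypothetical heuristic: spot out-of-scope entries by filename.
        and "disagg" not in c["path"]
        and "prefill" not in c["path"]
    ]
    if not pool:
        return None

    def rank(c: dict[str, Any]) -> tuple[int, bool]:
        hits = sum(c.get(f) == request.get(f) for f in MATCH_FIELDS)
        # On equal hits, a database/ entry outranks a curated/ entry.
        return (hits, c["source"] == "database")

    return max(pool, key=rank)
```

Whatever the ranking returns is still a partial match, so the result should be presented as a starting point with the differing fields listed for benchmarking.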

Step 3: Read Model Docs

Check `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.

Excluded sources: Do NOT use `docs/source/legacy/` tuning values or benchmark numbers. Those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.

DeepSeek-V3 caveat: For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.

Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source): `max_batch_size`, `max_num_tokens`, `max_seq_len`, `enable_attention_dp`, `attention_dp_config.*`, `kv_cache_config.free_gpu_memory_fraction`, `moe_expert_parallel_size` (MoE), `moe_config.backend` (when guide specifies), `stream_interval`, `num_postprocess_workers`, `cuda_graph_config.max_batch_size` / `batch_sizes`, and MTP-specific fields when using DeepSeek-R1 MTP configs.

Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
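
To keep adjustments inside the field list above, a diff against the starting config can flag anything else that changed. This is a rough sketch under stated assumptions: both configs are parsed dicts, nested keys are compared as dotted paths, and `unexpected_changes` is a hypothetical helper. MTP-specific fields are not enumerated in this guide, so they are left out of the allowlist and would be reported.

```python
from fnmatch import fnmatch
from typing import Any

# Dotted-path patterns taken from the Step 4 field list.
ADJUSTABLE = (
    "max_batch_size",
    "max_num_tokens",
    "max_seq_len",
    "enable_attention_dp",
    "attention_dp_config.*",
    "kv_cache_config.free_gpu_memory_fraction",
    "moe_expert_parallel_size",
    "moe_config.backend",
    "stream_interval",
    "num_postprocess_workers",
    "cuda_graph_config.max_batch_size",
    "cuda_graph_config.batch_sizes",
)


def flatten(d: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {'a.b': value} form."""
    out: dict[str, Any] = {}
    for key, value in d.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out


def unexpected_changes(base: dict[str, Any], tuned: dict[str, Any]) -> list[str]:
    """List changed dotted paths that are outside the adjustable set."""
    a, b = flatten(base), flatten(tuned)
    changed = {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}
    return sorted(k for k in changed if not any(fnmatch(k, p) for p in ADJUSTABLE))
```

An empty result means every edit stayed within the listed knobs; a non-empty result is a prompt to revert the change or justify it from a checked-in source.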

Validation Checklist

`trust_remote_code: true` called out as a trust boundary when present
`max_num_tokens` >= ISL + chat template overhead (requests are rejected if violated)
If interpolated: a single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
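
The first two checklist items lend themselves to a mechanical check. The sketch below assumes the config is a parsed dict and that the caller supplies an estimate of the chat template overhead in tokens; `validate` and `template_overhead` are hypothetical names, and the check is skipped when `max_num_tokens` is absent because the default value is not stated in this guide.

```python
from typing import Any


def validate(config: dict[str, Any], isl: int, template_overhead: int) -> list[str]:
    """Return human-readable warnings for the checklist items."""
    warnings: list[str] = []
    if config.get("trust_remote_code") is True:
        warnings.append("trust_remote_code: true is a trust boundary")
    max_num_tokens = config.get("max_num_tokens")
    required = isl + template_overhead
    # Only checked when the field is set explicitly in the config.
    if max_num_tokens is not None and max_num_tokens < required:
        warnings.append(
            f"max_num_tokens={max_num_tokens} is below ISL + overhead ({required}); "
            "requests would be rejected"
        )
    return warnings
```

For example, `validate({"max_num_tokens": 1024}, isl=1024, template_overhead=32)` reports the token budget problem, while raising `max_num_tokens` to 1056 or more clears it.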
