trtllm-serve-config-guide

Serve Config Guide

Scope: aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; DeepSeek-R1 MTP is the standard mode (all checked-in configs include it).
Input: model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (Min Latency | Balanced | Max Throughput | unspecified). Output: repo-grounded starting YAML for trtllm-serve --config.
If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in docs/source/features/ (e.g., speculative-decoding, disagg-serving, parallel-strategy) or examples/llm-api/.

Constraints

  1. Speculative exclusion: Exclude configs containing speculative_config by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with decoding_type: MTP in examples/configs/). When including MTP, copy the full speculative_config block verbatim; never interpolate speculative fields.
  2. Objective preservation: Preserve the user's stated objective through config selection. Use database.py profile labels (Min Latency, Balanced, Max Throughput; plus Low Latency / High Throughput in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point; do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.
  3. Source preference: Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
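
The speculative exclusion rule can be sketched as a small filter. This is an illustration only, not code from the repo: the function names are hypothetical, the configs are assumed to be already loaded into plain dicts, decoding_type is assumed to sit inside the speculative_config block, and matching DeepSeek-R1 by substring on the model name is a simplification.

```python
import copy


def is_selectable(config: dict, model: str, allow_mtp: bool = False) -> bool:
    """Return True if a checked-in config may be offered under Constraint 1."""
    spec = config.get("speculative_config")
    if spec is None:
        # Non-speculative configs are always in scope.
        return True
    # Assumption: DeepSeek-R1 is recognizable from the model name.
    is_r1 = "deepseek-r1" in model.lower()
    # Only the DeepSeek-R1 MTP exception passes, and only when requested.
    return allow_mtp and is_r1 and spec.get("decoding_type") == "MTP"


def carry_speculative_block(source: dict, target: dict) -> dict:
    """Copy the whole speculative_config block from source into a copy of target."""
    result = copy.deepcopy(target)
    # The block is copied as a unit; no individual field is edited.
    result["speculative_config"] = copy.deepcopy(source["speculative_config"])
    return result
```

Copying the block as a unit keeps the MTP fields exactly as checked in, which is the point of the "never interpolate" rule.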

Response Format

For exact matches:
  1. Config
  2. Source
  3. Launch command
For interpolated configs:
  1. Config
  2. Source used as starting point
  3. What to benchmark (single list of knobs worth sweeping, not per-field unverified tags)

Step 0: Lock Objective and Decode Mode

Identify the user's objective (Min Latency | Balanced | Max Throughput | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per Constraint 1). Preserve both through the remaining steps.
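
One way to make "lock" concrete is to parse the objective once and carry it in an immutable record. The sketch below is hypothetical (Objective, LockedRequest, parse_objective, and lock_request are names invented for illustration, not part of the repo); it only shows the idea of fixing both values up front so later steps cannot silently change them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Objective(Enum):
    MIN_LATENCY = "Min Latency"
    BALANCED = "Balanced"
    MAX_THROUGHPUT = "Max Throughput"


@dataclass(frozen=True)
class LockedRequest:
    objective: Optional[Objective]  # None means the user left it unspecified
    use_mtp: bool                   # True only for the DeepSeek-R1 MTP exception


def parse_objective(text: Optional[str]) -> Optional[Objective]:
    """Map user text to an Objective; empty or missing text stays unspecified."""
    if text is None or not text.strip():
        return None
    wanted = text.strip().lower()
    for objective in Objective:
        if objective.value.lower() == wanted:
            return objective
    raise ValueError(f"unknown objective: {text!r}")


def lock_request(objective_text: Optional[str], use_mtp: bool) -> LockedRequest:
    """Build the frozen record that the remaining steps read from."""
    return LockedRequest(parse_objective(objective_text), use_mtp)
```

Because the dataclass is frozen, an accidental reassignment in a later step raises an error instead of changing the objective.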

Step 1: Exact Database Match

Search examples/configs/database/lookup.yaml for an exact (model, gpu, isl, osl, concurrency, num_gpus) match. Use database.py as a loader/helper.
  • Apply speculative exclusion (Constraint 1).
  • When multiple recipes exist at different concurrency points, use profile labels to match the user's objective, per objective preservation (Constraint 2).
  • Prefer an exact match that also matches the stated objective over manual tuning.
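
The lookup above can be sketched as follows. The actual schema of lookup.yaml and the API of database.py are not shown here, so the sketch assumes the entries have already been loaded into a list of dicts carrying the six key fields plus an optional profile label; the function names and the record layout are hypothetical.

```python
from typing import List, Optional, Tuple

KEY_FIELDS = ("model", "gpu", "isl", "osl", "concurrency", "num_gpus")


def exact_matches(records: List[dict], request: dict) -> List[dict]:
    """Keep records whose six key fields all equal the request's."""
    return [
        record for record in records
        if all(record.get(field) == request[field] for field in KEY_FIELDS)
    ]


def pick_for_objective(
    candidates: List[dict], objective: Optional[str]
) -> Tuple[Optional[dict], bool]:
    """Return (record, objective_confirmed).

    objective_confirmed is True only when the record's profile label equals
    the stated objective; an unlabeled or differently labeled record is
    returned as a starting point with the flag set to False.
    """
    if not candidates:
        return None, False
    if objective is not None:
        for record in candidates:
            if record.get("profile") == objective:
                return record, True
    return candidates[0], False
```

When the flag comes back False for a stated objective, the response should call out the mismatch rather than claim the config fits.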

Step 2: Nearest Checked-In Config

If no exact match, widen the search to also include examples/configs/curated/lookup.yaml.
Apply the same constraints as Step 1. Additionally:
  • A partial match from database/ is preferred over a partial match from curated/ for the same model (database configs are benchmark-tuned).
  • Exclude disaggregated-only or prefill-only entries (e.g., qwen3-disagg-prefill.yaml).
  • For curated configs, only treat intent as explicit when the repo labels it (e.g., *-latency.yaml, *-throughput.yaml, or guide text).
  • If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
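
A rough sketch of the ranking described above. Everything here is an assumption made for illustration: the record layout (model, source, path, and optional isl/osl/concurrency), the filename-based scope check, and especially the log-ratio distance, which is just one plausible notion of "nearest" and is not defined by the repo.

```python
import math
from typing import List, Optional


def curated_intent(path: str) -> Optional[str]:
    """Read intent from the filename only when the repo labels it."""
    if path.endswith("-latency.yaml"):
        return "latency"
    if path.endswith("-throughput.yaml"):
        return "throughput"
    return None  # unlabeled: treat as a default starting point


def in_scope(record: dict) -> bool:
    """Heuristic: drop entries whose filename marks them disagg or prefill-only."""
    name = record["path"].rsplit("/", 1)[-1]
    return "disagg" not in name and "prefill" not in name


def distance(record: dict, request: dict) -> float:
    """Sum of |log2(have / want)| over isl, osl, concurrency (assumed metric)."""
    total = 0.0
    for field in ("isl", "osl", "concurrency"):
        have, want = record.get(field), request.get(field)
        if have is None or want is None:
            total += 1.0  # fixed penalty when a field cannot be compared
        else:
            total += abs(math.log2(have / want))
    return total


def nearest(records: List[dict], request: dict) -> Optional[dict]:
    """Same model only; database/ entries sort ahead of curated/ entries."""
    pool = [r for r in records if r["model"] == request["model"] and in_scope(r)]
    if not pool:
        return None
    return min(pool, key=lambda r: (r["source"] != "database", distance(r, request)))
```

The sort key puts the source preference first, so a distant database/ entry still wins over a closer curated/ entry for the same model, as the first bullet requires. Whatever is returned here is a starting point; any field changed afterwards is unverified.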

Step 3: Read Model Docs

Search docs/source/deployment-guide/ and examples/models/core/ for the model's deployment guide and README. Read both before adjusting knobs.
Excluded sources: Do NOT use docs/source/legacy/ tuning values or benchmark numbers; those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.
DeepSeek-V3 caveat: For DeepSeek-V3/V3.2-Exp, use examples/models/core/deepseek_v3/README.md, not the R1 deployment guide.

Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source): max_batch_size, max_num_tokens, max_seq_len, enable_attention_dp, attention_dp_config.*, kv_cache_config.free_gpu_memory_fraction, moe_expert_parallel_size (MoE), moe_config.backend (when guide specifies), stream_interval, num_postprocess_workers, cuda_graph_config.max_batch_size / batch_sizes, and MTP-specific fields when using DeepSeek-R1 MTP configs.
Do not assume other fields are constant across models/GPUs. For tuning notes, read references/knob-heuristics.md.
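
The "adjust only these" rule can be enforced mechanically with an allow-list. The sketch below is hypothetical (apply_overrides is not a repo function): it takes a base config as a dict, accepts overrides keyed by dotted path, and rejects any path outside the list above. It reads batch_sizes as cuda_graph_config.batch_sizes and leaves out the MTP-specific fields, which are meant to be copied verbatim rather than edited.

```python
import copy
from fnmatch import fnmatchcase

# Dotted paths taken from the field list above; "*" matches any sub-field.
ADJUSTABLE = (
    "max_batch_size",
    "max_num_tokens",
    "max_seq_len",
    "enable_attention_dp",
    "attention_dp_config.*",
    "kv_cache_config.free_gpu_memory_fraction",
    "moe_expert_parallel_size",
    "moe_config.backend",
    "stream_interval",
    "num_postprocess_workers",
    "cuda_graph_config.max_batch_size",
    "cuda_graph_config.batch_sizes",
)


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of base with allow-listed dotted-path overrides applied."""
    result = copy.deepcopy(base)
    for dotted, value in overrides.items():
        if not any(fnmatchcase(dotted, pattern) for pattern in ADJUSTABLE):
            raise KeyError(f"{dotted} is not a scenario-dependent field")
        *parents, leaf = dotted.split(".")
        node = result
        for key in parents:
            # Create intermediate mappings when the base config lacks them.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result
```

Every override applied this way still counts as interpolation, so the changed knobs belong in the "What to benchmark" list.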
Validation Checklist

  • trust_remote_code: true called out as trust boundary when present
  • max_num_tokens >= ISL + chat template overhead (requests rejected if violated)
  • If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
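
The first two checks lend themselves to a short script. This is a minimal sketch under stated assumptions: check_config is a hypothetical helper, the config is a plain dict, and template_overhead is a caller-supplied estimate of the chat template's token count (the guide gives no number for it). When max_num_tokens is absent the sketch says nothing, since the effective default is not stated here.

```python
from typing import List


def check_config(config: dict, isl: int, template_overhead: int = 0) -> List[str]:
    """Return human-readable findings for the first two checklist items."""
    findings = []
    if config.get("trust_remote_code") is True:
        findings.append(
            "trust_remote_code: true is a trust boundary; call it out in the response"
        )
    max_num_tokens = config.get("max_num_tokens")
    needed = isl + template_overhead
    if max_num_tokens is not None and max_num_tokens < needed:
        findings.append(
            f"max_num_tokens={max_num_tokens} is below ISL + template overhead "
            f"({needed}); requests would be rejected"
        )
    return findings
```

An empty list means neither item fired; it does not mean the config is verified for the scenario.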