trtllm-serve-config-guide
Serve Config Guide
Scope: aggregate/IFB (in-flight batching) serving with colocated prefill and decode, single node, PyTorch backend, non-speculative by default. For DeepSeek-R1, MTP is the standard mode (all of its checked-in configs include it).
Input: model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified).
Output: repo-grounded starting YAML for `trtllm-serve --config`.
If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.
Constraints
- Speculative exclusion: Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim; never interpolate speculative fields.
- Objective preservation: Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency`/`High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point and do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.
- Source preference: Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
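The first two constraints can be read as a filter followed by a labeling step. The sketch below is only an illustration under stated assumptions: the entry layout (`model`, `config`, `profile` keys) and the DeepSeek-R1 name check are hypothetical and are not the actual `database.py` schema or API.

```python
from typing import Any, Optional


def is_mtp_exception(entry: dict[str, Any]) -> bool:
    """True for a DeepSeek-R1 entry whose speculative block declares MTP.

    Assumption: the model is identified by a name containing "deepseek-r1".
    """
    spec = entry["config"].get("speculative_config") or {}
    is_r1 = "deepseek-r1" in entry["model"].lower()
    return is_r1 and spec.get("decoding_type") == "MTP"


def apply_speculative_exclusion(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop entries with a speculative_config unless they are the MTP exception."""
    kept = []
    for entry in entries:
        has_spec = "speculative_config" in entry["config"]
        if not has_spec or is_mtp_exception(entry):
            kept.append(entry)
    return kept


def objective_note(entry: dict[str, Any], objective: Optional[str]) -> str:
    """Describe how an entry's profile label relates to the stated objective."""
    label = entry.get("profile")
    if objective is None:
        return "no objective stated"
    if label is None:
        # Unlabeled configs are default starting points, never claimed matches.
        return "unlabeled: default starting point"
    if label == objective:
        return "matches objective"
    return f"mismatch: config is labeled {label}, requested {objective}"
```

The filter never edits the `speculative_config` block of a kept entry, which keeps the "copy verbatim" rule intact.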
Response Format
For exact matches: Config → Source → Launch command
For interpolated configs: Config → Source used as starting point → What to benchmark (a single list of knobs worth sweeping, not per-field unverified tags)
Step 0: Lock Objective and Decode Mode
Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative, or DeepSeek-R1 MTP per the speculative exclusion constraint). Preserve both through the remaining steps.
Step 1: Exact Database Match
Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.
- Apply speculative exclusion.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per objective preservation.
- Prefer an exact match that also matches the stated objective over manual tuning.
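As a rough illustration of the exact-match step, the sketch below matches on the six key fields and, when the request leaves a field such as concurrency open, uses the profile label to choose among the remaining recipes. The entry layout and the `profile` key are assumptions made for this example; the real `lookup.yaml` schema and `database.py` helpers may differ.

```python
from typing import Any, Optional

KEY_FIELDS = ("model", "gpu", "isl", "osl", "concurrency", "num_gpus")


def find_exact(
    entries: list[dict[str, Any]],
    request: dict[str, Any],
    objective: Optional[str] = None,
) -> Optional[dict[str, Any]]:
    """Return an entry matching every key field the request specifies.

    Fields the request leaves as None (for example concurrency) are not
    constrained, so several recipes may match; the profile label then
    breaks the tie in favor of the stated objective.
    """
    wanted = {f: request[f] for f in KEY_FIELDS if request.get(f) is not None}
    matches = [e for e in entries if all(e.get(f) == v for f, v in wanted.items())]
    if objective is not None:
        labeled = [e for e in matches if e.get("profile") == objective]
        if labeled:
            return labeled[0]
    # No labeled match: the caller should call out any objective mismatch.
    return matches[0] if matches else None
```

Speculative exclusion is expected to run on `entries` before this function is called, so the lookup itself stays a plain field comparison.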
Step 2: Nearest Checked-In Config
If no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.
Apply the same constraints as Step 1. Additionally:
- A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
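One way to picture the nearest-config search is as a sort over same-model candidates, with the source directory ranked first. This is a sketch under assumptions: the `source` and `disagg_or_prefill_only` keys are hypothetical, and the distance measure (sum of absolute differences in ISL, OSL, and concurrency) is an illustrative choice, not something the guide prescribes.

```python
from typing import Any, Optional


def nearest_config(
    entries: list[dict[str, Any]], request: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Pick the nearest same-model starting point, preferring database/ entries."""
    candidates = [
        e
        for e in entries
        if e["model"] == request["model"]
        and not e.get("disagg_or_prefill_only", False)  # out of scope here
    ]

    def sort_key(entry: dict[str, Any]) -> tuple[int, int, int]:
        # database/ configs are benchmark-tuned, so they rank ahead of curated/.
        source_rank = 0 if entry["source"] == "database" else 1
        gpu_rank = 0 if entry.get("gpu") == request.get("gpu") else 1
        # Assumed distance: curated entries lacking a field are treated as 0.
        distance = sum(
            abs(entry.get(f, 0) - request[f]) for f in ("isl", "osl", "concurrency")
        )
        return (source_rank, gpu_rank, distance)

    return min(candidates, key=sort_key, default=None)
```

Whatever this returns is a partial match, so any fields adjusted afterwards should still be reported as unverified.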
Step 3: Read Model Docs
Search `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.
Excluded sources: Do NOT use `docs/source/legacy/` tuning values or benchmark numbers; those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.
DeepSeek-V3 caveat: For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.
Step 4: Adjust Source-Backed Fields
Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):
- `max_batch_size`
- `max_num_tokens`
- `max_seq_len`
- `enable_attention_dp`
- `attention_dp_config.*`
- `kv_cache_config.free_gpu_memory_fraction`
- `moe_expert_parallel_size`
- `moe_config.backend`
- `stream_interval`
- `num_postprocess_workers`
- `cuda_graph_config.max_batch_size` / `batch_sizes`
Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
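To keep an adjusted config within the field list above, it can help to diff it against the checked-in starting point and flag anything outside that list. The following is a minimal sketch, assuming configs are already loaded as nested dictionaries. The helper names are hypothetical, and placing `batch_sizes` under `cuda_graph_config` is an assumption about how the last list item should be read.

```python
from typing import Any

ADJUSTABLE = {
    "max_batch_size",
    "max_num_tokens",
    "max_seq_len",
    "enable_attention_dp",
    "kv_cache_config.free_gpu_memory_fraction",
    "moe_expert_parallel_size",
    "moe_config.backend",
    "stream_interval",
    "num_postprocess_workers",
    "cuda_graph_config.max_batch_size",
    "cuda_graph_config.batch_sizes",  # assumed nesting for `batch_sizes`
}
ADJUSTABLE_PREFIXES = ("attention_dp_config.",)


def flatten(cfg: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. kv_cache_config.dtype."""
    flat: dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def unexpected_changes(base: dict[str, Any], tuned: dict[str, Any]) -> list[str]:
    """List dotted paths that differ from the base but are not adjustable."""
    a, b = flatten(base), flatten(tuned)
    changed = {p for p in a.keys() | b.keys() if a.get(p) != b.get(p)}
    return sorted(
        p
        for p in changed
        if p not in ADJUSTABLE and not p.startswith(ADJUSTABLE_PREFIXES)
    )
```

An empty result means every edit stayed within the scenario-dependent fields; it does not mean the new values are verified.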
Validation Checklist
- `trust_remote_code: true` called out as trust boundary when present
- `max_num_tokens` >= ISL + chat template overhead (requests rejected if violated)
- If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
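The checklist can be expressed as a small function that returns findings rather than raising. This is a sketch only: `chat_template_overhead` is a hypothetical caller-supplied token count (the guide does not say how large the overhead is), and the two boolean flags are illustrative stand-ins for inspecting the drafted response.

```python
from typing import Any


def checklist_findings(
    config: dict[str, Any],
    isl: int,
    chat_template_overhead: int,
    interpolated: bool = False,
    has_benchmark_section: bool = False,
) -> list[str]:
    """Return checklist items that still need attention for this config."""
    findings = []
    if config.get("trust_remote_code") is True:
        findings.append("trust_remote_code: true is a trust boundary; call it out")
    max_num_tokens = config.get("max_num_tokens")
    # If the field is absent, the effective default is unknown here, so skip.
    if max_num_tokens is not None and max_num_tokens < isl + chat_template_overhead:
        findings.append(
            f"max_num_tokens={max_num_tokens} is below ISL + overhead "
            f"({isl + chat_template_overhead}); requests would be rejected"
        )
    if interpolated and not has_benchmark_section:
        findings.append('interpolated config needs a "What to benchmark" section')
    return findings
```

For instance, a config with `max_num_tokens: 1024` fails the second check for an ISL of 1024 as soon as any positive template overhead is assumed.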