Serve Config Guide
Scope: aggregate/IFB (in-flight batching) serving with colocated prefill and decode, single node, PyTorch backend, non-speculative by default. Exception: for DeepSeek-R1, MTP is the standard mode (all checked-in DeepSeek-R1 configs include it).
Input: model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (
|
|
| unspecified).
Output: repo-grounded starting YAML for
.
If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in
(e.g., speculative-decoding, disagg-serving, parallel-strategy) or
.
Constraints
- Speculative exclusion: Exclude configs containing
by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with
in
). When including MTP, copy the full
block verbatim — never interpolate speculative fields.
- Objective preservation: Preserve the user's stated objective through config selection. Use
profile labels (
,
,
; plus
/
in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point — do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.
- Source preference: Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
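The objective-preservation rule can be read as a small decision procedure. The sketch below is illustrative only: the `profile` key, the function name, and the labels `objective_a` / `objective_b` are hypothetical placeholders, not the repo's actual label names or lookup schema.

```python
def pick_for_objective(candidates, objective):
    """Return (config, note) for a list of candidate config records.

    Assumption: each candidate is a dict with an optional 'profile' key
    holding its profile label. The key name is hypothetical.
    """
    if not candidates:
        return None, "no candidate"

    # 1. A labeled config that matches the stated objective wins.
    if objective is not None:
        for cand in candidates:
            if cand.get("profile") == objective:
                return cand, "matches stated objective"

    # 2. An unlabeled config is only a default starting point.
    for cand in candidates:
        if cand.get("profile") is None:
            return cand, "unlabeled: default starting point, objective not verified"

    # 3. Only labeled configs remain and none matched.
    first = candidates[0]
    if objective is None:
        return first, f"objective unspecified: using config labeled {first['profile']!r}"
    return first, (
        f"MISMATCH: config is labeled {first['profile']!r}, "
        f"user asked for {objective!r}"
    )
```

The returned note is meant to be surfaced to the user, so that a mismatch or an unverified default is stated rather than silently accepted.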
Response Format
For
interpolated configs:
→
Source used as starting point
→
(single list of knobs worth sweeping, not per-field unverified tags)
Step 0: Lock Objective and Decode Mode
Identify the user's objective (
|
|
| unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP, per the speculative exclusion constraint). Preserve both through the remaining steps.
Step 1: Exact Database Match
Search
examples/configs/database/lookup.yaml
for an exact
(model, gpu, isl, osl, concurrency, num_gpus)
match. Use
as a loader/helper.
- Apply speculative exclusion.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per objective preservation.
- Prefer an exact match that also matches the stated objective over manual tuning.
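As a rough illustration of Step 1, the sketch below filters already-loaded lookup entries by the exact six-field key and applies speculative exclusion. It assumes each entry is a flat dict carrying the key fields by name plus a boolean `speculative` flag; that flag, the function name, and the sample values are hypothetical, and the real `lookup.yaml` schema may differ.

```python
KEY_FIELDS = ("model", "gpu", "isl", "osl", "concurrency", "num_gpus")


def exact_matches(entries, request, allow_checked_in_mtp=False):
    """Return entries whose six-field key equals the request's key.

    Assumptions (hypothetical schema): every entry is a dict with the
    KEY_FIELDS as keys and an optional boolean 'speculative' flag.
    """
    wanted = tuple(request[field] for field in KEY_FIELDS)
    hits = []
    for entry in entries:
        if tuple(entry.get(field) for field in KEY_FIELDS) != wanted:
            continue
        # Speculative exclusion: drop speculative configs unless the caller
        # has confirmed this is an exact checked-in DeepSeek-R1 MTP config.
        if entry.get("speculative", False) and not allow_checked_in_mtp:
            continue
        hits.append(entry)
    return hits
```

If more than one entry survives, the profile labels decide between them, as described in the bullets above.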
Step 2: Nearest Checked-In Config
If no exact match, widen the search to also include
examples/configs/curated/lookup.yaml
.
Apply the same constraints as Step 1. Additionally:
- A partial match from the database set is preferred over a partial match from the curated set for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g.,
qwen3-disagg-prefill.yaml
).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., , , or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
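One way to picture the Step 2 ordering is as a sort key: same-model entries only, out-of-scope modes removed, database before curated, then closeness of shape. In the sketch below the `source` and `mode` keys and the log-ratio distance over ISL, OSL, and concurrency are assumptions made for illustration; the guide itself does not define a distance metric.

```python
import math

# Assumption: each entry records which lookup file it came from.
SOURCE_RANK = {"database": 0, "curated": 1}


def nearest_config(entries, request):
    """Pick the nearest in-scope, same-model starting point, or None.

    Hypothetical schema: entries are dicts with 'model', 'gpu', 'isl',
    'osl', 'concurrency', 'source', and an optional 'mode' that defaults
    to 'aggregate'.
    """
    pool = [
        e for e in entries
        if e["model"] == request["model"]
        and e.get("mode", "aggregate") == "aggregate"  # drop disagg/prefill-only
    ]

    def shape_distance(e):
        # Illustrative metric: sum of absolute log2 ratios, so a 2x gap
        # in ISL counts the same as a 2x gap in concurrency.
        return sum(
            abs(math.log2(e[k] / request[k]))
            for k in ("isl", "osl", "concurrency")
        )

    def sort_key(e):
        gpu_mismatch = 0 if e["gpu"] == request["gpu"] else 1
        return (gpu_mismatch, SOURCE_RANK[e["source"]], shape_distance(e))

    return min(pool, key=sort_key) if pool else None
```

Whatever this returns is a partial match, so the response should still name the source and state which fields differ from the request.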
Step 3: Read Model Docs
Search
docs/source/deployment-guide/
and
for the model's deployment guide and README. Read both before adjusting knobs.
Excluded sources: Do NOT use
tuning values or benchmark numbers — those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.
DeepSeek-V3 caveat: For DeepSeek-V3/V3.2-Exp, use
examples/models/core/deepseek_v3/README.md
, not the R1 deployment guide.
Step 4: Adjust Source-Backed Fields
Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):
,
,
,
,
,
kv_cache_config.free_gpu_memory_fraction
,
(MoE),
(when guide specifies),
,
,
cuda_graph_config.max_batch_size
/
, and MTP-specific fields when using DeepSeek-R1 MTP configs.
Do not assume other fields are constant across models/GPUs. For tuning notes, read
references/knob-heuristics.md
.
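The "adjust only these" rule in Step 4 can be enforced mechanically with an allow-list of dotted field paths. The sketch below is a minimal illustration: the allow-list holds only the two paths spelled out in full above (the remaining fields would be added from the checked-in source), and the function name and the numeric values in the usage example are hypothetical.

```python
import copy

# Only the two field paths named in full in this guide; extend from the
# checked-in source config, not from guesswork.
ADJUSTABLE = {
    "kv_cache_config.free_gpu_memory_fraction",
    "cuda_graph_config.max_batch_size",
}


def apply_overrides(base_config, overrides):
    """Return a copy of base_config with allow-listed dotted paths updated."""
    result = copy.deepcopy(base_config)  # never mutate the checked-in source
    for path, value in overrides.items():
        if path not in ADJUSTABLE:
            raise ValueError(f"{path} is not a source-backed adjustable field")
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result
```

For example, `apply_overrides(base, {"kv_cache_config.free_gpu_memory_fraction": 0.8})` returns a modified copy, while an override for any path outside the allow-list raises `ValueError` instead of silently changing a field the source does not back.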
Validation Checklist