llm-serving-capacity-planner

LLM Serving Capacity Planner

Overview

Use this when a serving log has enough memory lines to explain where GPU HBM went. The analyzer reads SGLang/vLLM startup logs, extracts weight load, KV pool, CUDA graph, framework overhead, and token-capacity lines, then estimates concurrent requests for common token lengths.

Confirmation Required

Before running analysis, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Log file path | Primary input; all memory data comes from here | Ask user for the serving startup log | None (required) |
| GPU type | Determines total HBM for decomposition validation | Ask user or infer from log | Auto-detected from log if possible |
| nvidia-smi output | Provides per-rank actual memory for cross-validation | Capture with `nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt` | None (optional, but recommended) |
| Model config.json | Enables theoretical KV cache byte calculation and replication factor analysis | Ask user for the model's config.json path | None (optional, log data used instead) |
| Request token length | Determines concurrency estimate denominator | Ask user | 4096, 6144, 8192 |
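
To illustrate why the GPU type matters, the sketch below checks whether per-category memory recovered from a log fits inside the total HBM of one card. The `NOMINAL_HBM_GB` table and the `check_decomposition` helper are hypothetical names introduced here, the capacities are nominal vendor figures, and the sample numbers are made up; the analyzer's own values live in `references/gpu-specs.json`.

```python
# Nominal HBM capacities in GB (assumed values, for illustration only).
NOMINAL_HBM_GB = {"h100": 80.0, "h200": 141.0}


def check_decomposition(gpu: str, categories_gb: dict, tolerance_gb: float = 2.0) -> dict:
    """Compare the sum of per-category memory against total HBM for one rank."""
    total = NOMINAL_HBM_GB[gpu.lower()]
    accounted = sum(categories_gb.values())
    unaccounted = total - accounted
    return {
        "total_gb": total,
        "accounted_gb": accounted,
        "unaccounted_gb": unaccounted,
        # Flag only the case where categories exceed HBM by more than the tolerance.
        "plausible": unaccounted >= -tolerance_gb,
    }


# Made-up category sizes for a single H200 rank.
result = check_decomposition(
    "h200", {"weights": 40.0, "kv_pool": 80.0, "cuda_graph": 3.0, "free": 10.0}
)
print(result)
```

A large positive `unaccounted_gb` suggests memory that the log does not report explicitly; a strongly negative one suggests the wrong GPU type was supplied.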

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:

`Load weight begin. avail mem=XX GB`
`Memory profiling: available_gpu_memory=XX GB, ...` (newer SGLang)
`SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX` (SWA models like DeepSeek-V4)
`Memory pool end. avail mem=XX GB`
`Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.`
`max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB`
`server_args=ServerArgs(...)` (for serving parameters)

If the log is from a running instance, capture it by redirecting stdout/stderr to a file at launch time.
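
As a rough illustration of how such lines can be turned into numbers, the sketch below applies regular expressions modeled on the patterns listed above. The `PATTERNS` table and `extract_memory_fields` are hypothetical names, not the analyzer's actual code, and the sample log uses made-up values with the elided parts left as `...`; real log wording may vary between framework versions.

```python
import re

# Patterns modeled on the log lines listed above (assumed; may differ by version).
PATTERNS = {
    "avail_before_weights_gb": re.compile(r"Load weight begin\. avail mem=([\d.]+) GB"),
    "avail_after_pool_gb": re.compile(r"Memory pool end\. avail mem=([\d.]+) GB"),
    "cuda_graph_gb": re.compile(r"Capture cuda graph end\..*?mem usage=([\d.]+) GB"),
    "max_total_num_tokens": re.compile(r"max_total_num_tokens=(\d+)"),
    "max_running_requests": re.compile(r"max_running_requests=(\d+)"),
}


def extract_memory_fields(log_text: str) -> dict:
    """Return the first match of each pattern, converted to a float."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            fields[name] = float(match.group(1))
    return fields


# Made-up sample values; "..." stands for text this document elides.
sample_log = """\
Load weight begin. avail mem=139.20 GB
Memory pool end. avail mem=12.50 GB
Capture cuda graph end. ... mem usage=3.40 GB. avail mem=9.10 GB.
max_total_num_tokens=524288, ... max_running_requests=128, ... available_gpu_mem=9.10 GB
"""
fields = extract_memory_fields(sample_log)
print(fields)
```

Fields whose line is missing from the log are simply absent from the result, which makes it easy to report which inputs could not be recovered.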

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:

```bash
docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
```
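
The resulting `smi.txt` holds one CSV row per GPU, with memory values in MiB. A minimal sketch of reading it follows; `parse_smi` is a hypothetical helper, the sample rows are made up, and treating the GPU index as the TP rank is an assumption that depends on how the container maps devices.

```python
import csv
import io


def parse_smi(text: str) -> list:
    """Parse `index, memory.used, memory.free` rows; values are reported in MiB."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, used, free = (cell.strip() for cell in record)
        rows.append({
            "gpu_index": int(index),
            "used_mib": int(used.removesuffix(" MiB")),
            "free_mib": int(free.removesuffix(" MiB")),
        })
    return rows


# Made-up sample in the shape produced by --format=csv,noheader.
sample = "0, 130112 MiB, 13659 MiB\n1, 129980 MiB, 13791 MiB\n"
gpus = parse_smi(sample)

# Spread of used memory across GPUs; a large spread hints at uneven ranks.
spread_mib = max(g["used_mib"] for g in gpus) - min(g["used_mib"] for g in gpus)
print(gpus, spread_mib)
```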

Step 3: Run the analyzer

```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json
```

For JSON output (automation):

```bash
python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json
```
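
For automation, the JSON mode can be driven from another script. The sketch below uses only the flags shown above; `build_command` and `run_analyzer` are hypothetical wrappers, and it assumes the analyzer writes a single JSON document to stdout. The structure of that JSON is not documented here, so the result is returned as an opaque dictionary.

```python
import json
import subprocess
from typing import Optional

SCRIPT = "skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py"


def build_command(log_file: str, smi_file: Optional[str] = None,
                  gpu: Optional[str] = None) -> list:
    """Assemble the analyzer command line with JSON output enabled."""
    cmd = ["python3", SCRIPT, "--log-file", log_file, "--format", "json"]
    if smi_file:
        cmd += ["--nvidia-smi-file", smi_file]
    if gpu:
        cmd += ["--gpu", gpu]
    return cmd


def run_analyzer(log_file: str, **options) -> dict:
    """Run the analyzer and parse stdout as JSON (assumes JSON goes to stdout)."""
    completed = subprocess.run(
        build_command(log_file, **options),
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


print(build_command("/path/to/sglang.log", gpu="h200"))
```

With `check=True`, a non-zero exit from the analyzer raises `subprocess.CalledProcessError` rather than producing a partial result.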

Step 4: Review and interpret results

The analyzer prints:

Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
Per-rank comparison: nvidia-smi data across all TP ranks
KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
Concurrency estimate: max concurrent requests for different token lengths
Tuning notes: configuration changes that may increase capacity
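
The concurrency estimate can be reasoned about with simple arithmetic. The sketch below is one plausible reading, not necessarily the analyzer's exact method: it assumes every request occupies its full token length in the KV pool, so concurrency is bounded both by `max_total_num_tokens` divided by the request length and by `max_running_requests`. The function name and sample values are illustrative.

```python
def estimate_concurrency(max_total_num_tokens: int, max_running_requests: int,
                         token_lengths=(4096, 6144, 8192)) -> list:
    """Bound concurrency by KV token capacity and by the request cap."""
    rows = []
    for length in token_lengths:
        token_limit = max_total_num_tokens // length  # requests that fit in the KV pool
        rows.append({
            "token_length": length,
            "token_limit": token_limit,
            "request_limit": max_running_requests,
            "max_concurrent": min(token_limit, max_running_requests),
        })
    return rows


# Made-up values in the shape of the max_total_num_tokens log line.
rows = estimate_concurrency(max_total_num_tokens=524288, max_running_requests=96)
for row in rows:
    print(row)
```

In this example the 4096-token case is capped by `max_running_requests`, while the longer lengths are capped by KV capacity, which indicates which setting is worth tuning.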

When To Use It

After launching an LLM serving instance, to understand how GPU memory is distributed
When comparing different `--mem-fraction-static` values and their impact on KV pool capacity
When planning deployment capacity: how many concurrent requests can a given GPU configuration support
When investigating OOM issues: identifying which memory category is consuming the most
When evaluating whether fp8 KV cache or EP can improve concurrency

Key Concepts

mem-fraction-static

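In SGLang, `--mem-fraction-static` sets the fraction of GPU memory reserved for static allocations: model weights plus the KV cache pool. Since the weights are a fixed cost, a higher value mostly enlarges the KV pool, but it leaves less headroom for CUDA graphs, activations, and other runtime buffers.

`0.88` (a typical default; the exact default depends on the SGLang version and parallelism settings): aggressive, with most of the memory left after weight loading going to the KV pool
`0.60`: conservative, with more free memory left for the runtime but significantly less KV capacity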
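
One way to compare two settings is to read the KV pool size from each run's log and convert it to token capacity. The sketch below assumes the log's "GB" figures are binary gigabytes (GiB) and uses made-up pool sizes and a made-up per-token cost; `kv_token_capacity` is a hypothetical helper.

```python
GIB = 1 << 30


def kv_token_capacity(kv_pool_gib: float, bytes_per_token: int) -> int:
    """Number of tokens that fit in a KV pool of the given size."""
    return int(kv_pool_gib * GIB) // bytes_per_token


# Made-up KV pool sizes observed under two --mem-fraction-static values.
pool_gib_by_fraction = {0.88: 96.0, 0.60: 55.0}
bytes_per_token = 65536  # made-up per-token KV cost (64 KiB)

capacity = {
    fraction: kv_token_capacity(pool_gib, bytes_per_token)
    for fraction, pool_gib in pool_gib_by_fraction.items()
}
print(capacity)
```

Dividing each capacity by the expected request length then shows how much concurrency the lower setting gives up in exchange for runtime headroom.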

KV Head Replication

When `num_key_value_heads < tp_size`, KV heads cannot be split evenly across TP ranks, so they are replicated instead. For example, with `kv_heads=1, tp=8` each of the 8 cards stores a full copy of the KV cache: 8x the per-card KV memory compared to a split scenario.
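
The replication factor can be computed directly from the two values. This is a minimal sketch with hypothetical helper names, assuming the larger of `num_key_value_heads` and `tp_size` is a multiple of the smaller, which is the usual configuration.

```python
def kv_replication_factor(num_key_value_heads: int, tp_size: int) -> int:
    """How many TP ranks hold a copy of each KV head (1 means a clean split)."""
    if num_key_value_heads >= tp_size:
        return 1
    return tp_size // num_key_value_heads


def kv_heads_per_rank(num_key_value_heads: int, tp_size: int) -> int:
    """KV heads stored on each rank; never fewer than one."""
    return max(1, num_key_value_heads // tp_size)


for heads, tp in [(1, 8), (4, 8), (8, 8), (16, 8)]:
    print(heads, tp, kv_replication_factor(heads, tp), kv_heads_per_rank(heads, tp))
```

A factor above 1 means that adding TP ranks no longer reduces per-card KV memory, so aggregate KV memory across the group grows by that factor.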

SWA (Sliding Window Attention) Compression

Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA (Hierarchical Context Attention) with sliding windows. This reduces per-token KV cache bytes compared to the theoretical full-attention calculation. The `bytes_per_full_token` reported in the log already accounts for this compression.
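
To see how large the reduction is, the log value can be compared with a full-attention baseline. The sketch below uses the standard per-token formula for MHA/GQA caches (keys plus values, per layer, per KV head); it is a baseline only and does not model CSA/HCA. The function name, the model dimensions, and the log value are all made up for illustration.

```python
def full_attention_kv_bytes_per_token(num_hidden_layers: int, num_key_value_heads: int,
                                      head_dim: int, dtype_bytes: int,
                                      tp_size: int = 1) -> int:
    """Per-rank KV bytes per token for plain MHA/GQA attention."""
    heads_per_rank = max(1, num_key_value_heads // tp_size)
    # Factor 2 covers the key and the value tensors.
    return 2 * num_hidden_layers * heads_per_rank * head_dim * dtype_bytes


# Made-up dimensions: 32 layers, 8 KV heads, head_dim 128, 2-byte KV dtype.
theoretical = full_attention_kv_bytes_per_token(32, 8, 128, 2)
reported = 32768  # made-up bytes_per_full_token value from a log
compression_ratio = theoretical / reported
print(theoretical, compression_ratio)
```

A ratio well above 1 confirms that the theoretical figure would underestimate token capacity, which is why the log value is the one to trust for these models.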

Reporting Checklist

Include:

Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
Memory breakdown table: category / GiB / MiB / percentage / derivation source
Per-rank nvidia-smi comparison: used and free memory per TP rank
KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
Tuning notes based on free memory and configuration
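
A memory breakdown table in the category / GiB / MiB / percentage shape can be assembled as sketched below. The helper names and all numbers are illustrative assumptions; here "other" is taken as whatever remains of the total after the explicitly reported categories.

```python
def build_breakdown(total_gib: float, known_gib: dict) -> list:
    """Return (category, GiB, MiB, percent) rows, with 'other' as the residual."""
    other = total_gib - sum(known_gib.values())
    categories = dict(known_gib, other=other)
    return [
        (name, gib, gib * 1024, 100 * gib / total_gib)
        for name, gib in categories.items()
    ]


def render(rows: list) -> str:
    """Format the rows as a fixed-width text table."""
    lines = [f"{'category':<12}{'GiB':>8}{'MiB':>10}{'%':>7}"]
    for name, gib, mib, pct in rows:
        lines.append(f"{name:<12}{gib:>8.2f}{mib:>10.0f}{pct:>7.1f}")
    return "\n".join(lines)


# Made-up per-rank figures.
rows = build_breakdown(
    140.0, {"weights": 40.0, "kv_pool": 80.0, "cuda_graph": 4.0, "framework": 2.0}
)
print(render(rows))
```

Because "other" is a residual, the percentages always sum to 100, and an unexpectedly large "other" row is a cue to cross-check against nvidia-smi.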

Known Limitations

| Limitation | Detail | Workaround |
|---|---|---|
| SGLang-specific patterns | Currently only SGLang log patterns are fully supported | vLLM patterns to be added as encountered |
| SWA compression models | Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention; the framework's internal SWA window parameters are needed | Use `bytes_per_full_token` from the log directly |
| DeepGEMM JIT memory | The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log | Compare with nvidia-smi total for accurate accounting |
| PP (Pipeline Parallelism) | Memory decomposition is per-rank; PP configurations may have uneven memory across stages | Specify `--target-rank` for each PP stage |
| MoE expert buffer | Some frameworks allocate additional buffers for expert routing that are not separately reported | Included in "model weights" or "other" depending on when allocated |

References

`references/log-patterns.md`: log line patterns and their semantics for memory analysis.
`references/gpu-specs.json`: GPU HBM specifications for `h20`, `h100`, `h200`, and `b200` aliases.
`scripts/capacity_analyzer.py`: the core analysis script.
