profile-kernel


GPU Kernel Profiling


GPU Clock Protocol


Always lock clocks before profiling. Always run one profiling session at a time.

```bash
bash scripts/lock-gpu-clock.sh           # before profiling
bash scripts/reset-gpu-clock.sh          # after profiling
```
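
The lock/profile/reset discipline can be wrapped in a context manager so the clocks are always reset even when a profiling run fails. A minimal sketch, assuming the two scripts above are run from the repository root:

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def locked_gpu_clocks():
    """Lock GPU clocks for the duration of a profiling run, always resetting after."""
    subprocess.run(["bash", "scripts/lock-gpu-clock.sh"], check=True)
    try:
        yield
    finally:
        # Reset even if profiling raised, so the GPU never stays locked.
        subprocess.run(["bash", "scripts/reset-gpu-clock.sh"], check=True)

# Usage (one profiling session at a time):
# with locked_gpu_clocks():
#     run_one_profiling_pass()
```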

Device Peaks


Shared profiling guidance should not hard-code a single device. Read the active device context from `params.json`, runtime profiling artifacts, and the configured peak-spec sources in `src/mla_var3/conf/devices.json`. Use device-specific reference files in `docs/devices/` only when the measurement context matters. Ridge point remains:

```text
ridge_point = peak_tflops / peak_gbps
```
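
As a worked example (the peak numbers below are illustrative, not any device's actual specs), the ridge point is the arithmetic intensity at which a kernel moves from memory-bound to compute-bound:

```python
def ridge_point(peak_tflops: float, peak_gbps: float) -> float:
    # Same formula as above. Because tera/giga differ by 1000x, the result
    # is in kilo-FLOP per byte; multiply by 1000 for FLOP/byte.
    return peak_tflops / peak_gbps

def classify(kernel_flops: float, kernel_bytes: float,
             peak_tflops: float, peak_gbps: float) -> str:
    """Compare a kernel's arithmetic intensity to the ridge point."""
    intensity_kflop_per_byte = kernel_flops / kernel_bytes / 1000.0
    if intensity_kflop_per_byte < ridge_point(peak_tflops, peak_gbps):
        return "memory-bound"
    return "compute-bound"

# Illustrative peaks only: 1000 TFLOP/s and 3000 GB/s give a ridge point
# of ~0.333 (i.e. ~333 FLOP/byte).
```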

Profiling Mode Decision Tree


| Mode | `--prof_type` | When to use | Output root |
| --- | --- | --- | --- |
| Annotation | `annotation` | Default first pass. Roofline + NCU metrics + comparison tables | `out/profiles/annotation/` |
| Event | `event` | Quick iteration timing (carries Python overhead) | `out/profiles/event/` |
| NCU | `ncu` | Deep investigation: full NCU sections, source annotations, optimization suggestions | `out/profiles/ncu/` |
| NSYS | `nsys` | Pipeline overlap, stream concurrency, launch ordering | `out/profiles/nsys/` |

Selection rule: Start with `annotation`. Use `event` for fast A/B comparison (not suitable for KernelPipeline/ConcurrentKernels). Use `ncu` when you need source-level analysis, Nsight Compute optimization suggestions, or CUDA insights. Use `nsys` for `ConcurrentKernels` overlap and stream concurrency analysis.
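
When scripting profiling sweeps, the selection rule can be captured as a small lookup. The goal keywords here are our own shorthand, not part of the CLI:

```python
# Maps an analysis goal to the --prof_type value from the table above.
PROF_TYPE_FOR_GOAL = {
    "first_pass": "annotation",   # roofline + NCU metrics + comparison tables
    "quick_ab": "event",          # fast timing; not for KernelPipeline/ConcurrentKernels
    "source_level": "ncu",        # full NCU sections, optimization suggestions
    "overlap": "nsys",            # stream concurrency, launch ordering
}

def prof_type_for(goal: str) -> str:
    # Unknown goals fall back to the default first-pass mode.
    return PROF_TYPE_FOR_GOAL.get(goal, "annotation")
```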

Profiling Commands


CLI entry point (all modes)


```bash
python -m mla_var3.kernel <kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>
```

Language-specific path when needed:


```bash
python -m mla_var3.kernel.<language_python_package>.mla.<kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>
```

Python API



Annotation


```python
from mla_var3.runtime.profiling.annotation import annotation_profile_plan

prof_result = annotation_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```

NCU


```python
from mla_var3.runtime.profiling.ncu import ncu_profile_plan

prof_result = ncu_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```

NSYS


```python
from mla_var3.runtime.profiling.nsys import nsys_profile_plan

prof_result = nsys_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```
Note that `version` is optional and lets you specify a custom version. When given, it retrieves the `KernelPlan` of the specified version and uses that, NOT the `plan` passed to the function; this is purely a convenience to automate profiling across versions. Alternatively, build the plan object directly with the class of the version to be profiled and omit the `version` argument.

**Important**: profiling can occasionally fail right after auto-tuning. In that case, simply re-run the command; the best autotune config is already cached, and the rerun should succeed.
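
Given that note, a driver script can simply retry once after an autotune-related failure. A minimal sketch; `profile_fn` stands for any of the `*_profile_plan` functions above, and the wrapper itself is our own, not part of the toolkit:

```python
def profile_with_retry(profile_fn, *args, retries: int = 1, **kwargs):
    """Run a profiling callable, retrying on failure.

    A first run after auto-tuning may fail; the best autotune config is
    already cached by then, so an immediate re-run usually succeeds.
    """
    last_err = None
    for _ in range(retries + 1):
        try:
            return profile_fn(*args, **kwargs)
        except Exception as err:  # the actual failure type depends on the toolchain
            last_err = err
    raise last_err
```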

Benchmark scripts


Benchmark scripts in `tests/benchmark/` are curated experiment drivers. Some accept CLI args; others are fixed scripts to copy and edit.

```bash
python -m tests.benchmark.bench_mla_var6_plus       # base version
python -m tests.benchmark.bench_mla_var6_plus_v4    # specific version
python -m tests.benchmark.bench_all_mla             # compare all kernels (slow)
```

Output Artifacts


All modes write to `out/profiles/<mode>/<kernel>/<params>/<timestamp>/` if no output directory is specified.

Common artifacts (all modes)


  • `params.json` — Full kernel parameter set (dtype, b, s, t, h, d, k, p). Always read this to confirm the profiling configuration, especially for NCU and NSYS modes, whose `report.md` does not embed parameters.
  • `tiling/<stage>.json` — Autotuned tiling config per stage
  • `tiling/<stage>/autotuning.json` — Autotuning history copied from `.cache/kernel-autotune/...`, colocated with the selected tiling so you can inspect the explored configurations without leaving the profiling output directory
  • `compiled/<stage>/` — MLIR and bytecode compilation artifacts
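
To confirm the configuration of the newest run programmatically, a small helper can read `params.json` from the most recent timestamped directory. The directory layout is as described above; the helper itself is our own and assumes timestamp names sort lexicographically:

```python
import json
from pathlib import Path

def latest_params(run_root: str) -> dict:
    """Read params.json from the newest timestamped run under
    out/profiles/<mode>/<kernel>/<params>/."""
    root = Path(run_root)
    # Timestamped directory names sort lexicographically, so the last is newest.
    runs = sorted(d for d in root.iterdir() if d.is_dir())
    if not runs:
        raise FileNotFoundError(f"no profiling runs under {root}")
    return json.loads((runs[-1] / "params.json").read_text())
```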

Per-mode artifacts


Each mode produces different artifacts. See the per-mode reference for full details:
  • mode-annotation.md — report.md with per-kernel NCU metrics, roofline comparison, tiling;
  • mode-event.md — report.md with end-to-end roofline summary;
  • mode-ncu.md — report.md (compact diagnosis + NCU recommendations), report-verbose.md (full NCU sections), annotated source, per-kernel SASS metrics, PTX source files;
  • mode-nsys.md — report.md with timeline and overlap analysis.

Metric Interpretation


| Metric | Healthy | If unhealthy → Issue | Optimization hint |
| --- | --- | --- | --- |
| TC Util | >60% | Memory or latency bound | Check DRAM%, occupancy |
| DRAM Throughput | >70% | Compute or latency bound | Check TC%, occupancy |
| Achieved Occupancy | >25% | Register/smem pressure | Reduce tile size, occupancy hint |
| L2 Hit Rate | >80% | Poor data reuse | Swizzle, larger tiles, data layout |
| Local Spilling | 0 bytes | Register overflow | Smaller tiles, fewer accumulators |
| Waves/SM | >1.0 | Underfilled GPU | More blocks, reduce per-block resources |

Bottleneck Classification


| Pattern | Classification | Focus |
| --- | --- | --- |
| DRAM% high + TC% low | Memory-bound | Data reuse, TMA hints, head grouping |
| TC% high + DRAM% low | Compute-bound | Kernel efficiency, tile sizes |
| Both low | Latency-bound | Occupancy, reduce spilling, more blocks |
| L2 hit < 80% | Locality issue | Swizzle scheduling, tile size adjustment |
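
The pattern table maps directly onto a small function. The thresholds come from the metric table (60% TC, 70% DRAM, 80% L2); the function is our own reading aid, not part of the toolkit:

```python
def classify_bottleneck(tc_pct: float, dram_pct: float, l2_hit_pct: float) -> str:
    """Apply the pattern table: high DRAM + low TC -> memory-bound, etc."""
    # Check locality first: a cold L2 is the most actionable signal.
    if l2_hit_pct < 80.0:
        return "locality issue"
    tc_high = tc_pct > 60.0
    dram_high = dram_pct > 70.0
    if dram_high and not tc_high:
        return "memory-bound"
    if tc_high and not dram_high:
        return "compute-bound"
    if not tc_high and not dram_high:
        return "latency-bound"
    return "balanced"  # both high: near roofline, not covered by the table
```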

Profiler Output Contract


After profiling, return results to the orchestrator in this exact format. Always read `params.json` to populate the configuration fields accurately.

Profile: [kernel] [version]


Configuration


| b | s | t | dtype |
| --- | --- | --- | --- |
| X | X | X | bfloat16 |

Stages


| Stage | Duration (us) | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
| --- | --- | --- | --- | --- | --- | --- |

Bottleneck: [Memory/Compute/Latency]-bound


Root cause: [2 sentences max]

Top 3 Opportunities (ranked by estimated impact)


  1. [name] — est. X% gain — trigger: [metric=value]
  2. ...
  3. ...

vs Baseline (if applicable)


| Metric | Previous | Current | Change |
| --- | --- | --- | --- |

Development Log Performance Template


Update the kernel's devlog (e.g., `docs/kernels/mla-var6-plus.md`) with:

```markdown
**Performance** (<device>, locked clocks if applicable, bfloat16, b=X, s=X, t=X):

| Metric            | Value   | vs Previous |
| ----------------- | ------- | ----------- |
| Duration          | X.XX μs | Y% faster   |
| Achieved TFLOPs/s | X.XX    | +Z%         |
| Achieved GB/s     | X.XX    | +Z%         |
| Occupancy         | XX%     | --          |
| TC Util           | XX%     | --          |

**Bottleneck**: [Memory-bound / Compute-bound / Latency-bound]

**Issues**:
- [Remaining problems]

**Insights**:
- [Key lessons — why optimization worked or didn't]
- [Guidance for next iteration]
```

Detailed References


  • Annotation mode: references/mode-annotation.md — artifacts, report structure, reading guide
  • Event mode: references/mode-event.md — artifacts, limitations, reading guide
  • NCU mode: references/mode-ncu.md — artifacts, annotated source, per-kernel SASS metrics and PTX source files, reading guide
  • NSYS mode: references/mode-nsys.md — artifacts, CSVs, visualization, reading guide