GPU Kernel Profiling
GPU Clock Protocol
Always lock clocks before profiling. Run only one profiling job at a time.
```bash
bash scripts/lock-gpu-clock.sh    # before profiling
bash scripts/reset-gpu-clock.sh   # after profiling
```
Device Peaks
Shared profiling guidance should not hard-code a single device. Read the active device context from `params.json`, runtime profiling artifacts, and the configured peak-spec sources in `src/mla_var3/conf/devices.json`. Use device-specific reference files in `docs/devices/` only when the measurement context matters. The ridge point remains:
```text
ridge_point = peak_tflops / peak_gbps
```
Profiling Mode Decision Tree
| Mode | When to use | Output root |
|---|---|---|
| `annotation` | Default first pass. Roofline + NCU metrics + comparison tables | `out/profiles/annotation/...` |
| `event` | Quick iteration timing (carries Python overhead) | `out/profiles/event/...` |
| `ncu` | Deep investigation: full NCU sections, source annotations, optimization suggestions | `out/profiles/ncu/...` |
| `nsys` | Pipeline overlap, stream concurrency, launch ordering | `out/profiles/nsys/...` |
Selection rule: Start with `annotation`. Use `event` for fast A/B comparison (not suitable for KernelPipeline/ConcurrentKernels). Use `ncu` when you need source-level analysis, Nsight Compute optimization suggestions, or CUDA insights. Use `nsys` for overlap and stream-concurrency analysis.
Profiling Commands
CLI entry point (all modes)
```bash
python -m mla_var3.kernel <kernel_package> [<version>] \
  --b=32 --s=16 --t=4096 --prof_type=<mode>
```
Language-specific path when needed:
```bash
python -m mla_var3.kernel.<language_python_package>.mla.<kernel_package> [<version>] \
  --b=32 --s=16 --t=4096 --prof_type=<mode>
```
Python API
Annotation
```python
from mla_var3.runtime.profiling.annotation import annotation_profile_plan
prof_result = annotation_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```
NCU
```python
from mla_var3.runtime.profiling.ncu import ncu_profile_plan
prof_result = ncu_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```
NSYS
```python
from mla_var3.runtime.profiling.nsys import nsys_profile_plan
prof_result = nsys_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```
Note that `version` is optional and lets you specify a custom version. When given, it retrieves the `KernelPlan` of the specified version and uses it, NOT the `plan` passed to the function. This is just a convenience to automate version profiling.
Alternatively, you can build the plan object directly using the class of the version to be profiled and ignore the `version` argument.
**Important**: sometimes errors occur right after auto-tuning. In that case, simply re-run the command; the best autotune config is already cached and everything should work.
Benchmark scripts
Benchmark scripts in `tests/benchmark/` are curated experiment drivers. Some accept CLI args; others are fixed scripts to copy and edit.
```bash
python -m tests.benchmark.bench_mla_var6_plus      # base version
python -m tests.benchmark.bench_mla_var6_plus_v4   # specific version
python -m tests.benchmark.bench_all_mla            # compare all kernels (slow)
```
Output Artifacts
If no output directory is specified, all modes write to `out/profiles/<mode>/<kernel>/<params>/<timestamp>/`.
Common artifacts (all modes)
- `params.json` — Full kernel parameter set (dtype, b, s, t, h, d, k, p). Always read this to confirm the profiling configuration, especially for NCU and NSYS modes, whose `report.md` does not embed parameters.
- `tiling/<stage>.json` — Autotuned tiling config per stage
- `tiling/<stage>/autotuning.json` — Copied autotuning history from `cache/kernel-autotune/...`, colocated with the selected tiling so you can inspect the explored configurations without leaving the profiling output directory.
- `compiled/<stage>/` — MLIR and bytecode compilation artifacts
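As a minimal sketch of consuming this layout, the helper below locates the newest run directory and reads its `params.json`. The function names and the assumption that timestamps sort lexicographically are mine, not part of the toolkit:

```python
import json
from pathlib import Path

def latest_profile(out_root: Path, mode: str, kernel: str) -> Path:
    """Newest <params>/<timestamp> run dir under out/profiles/<mode>/<kernel>/.

    Assumes timestamp directory names sort lexicographically
    (e.g. YYYYMMDD-HHMMSS), so the last glob hit is the newest run.
    """
    runs = sorted(out_root.glob(f"profiles/{mode}/{kernel}/*/*"))
    if not runs:
        raise FileNotFoundError(f"no {mode} profiles for {kernel} in {out_root}")
    return runs[-1]

def load_params(run_dir: Path) -> dict:
    """Read params.json to confirm the exact profiling configuration."""
    return json.loads((run_dir / "params.json").read_text())
```

This is useful for scripting comparisons across runs without hand-copying paths.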
Per-mode artifacts
Each mode produces different artifacts. See the per-mode reference for full details:
- mode-annotation.md — report.md with per-kernel NCU metrics, roofline comparison, tiling;
- mode-event.md — report.md with end-to-end roofline summary;
- mode-ncu.md — report.md (compact diagnosis + NCU recommendations), report-verbose.md (full NCU sections), annotated source, per-kernel SASS metrics, PTX source files;
- mode-nsys.md — report.md with timeline and overlap analysis;
Metric Interpretation
| Metric | Healthy | If unhealthy → Issue | Optimization hint |
|---|---|---|---|
| TC Util | >60% | Memory or latency bound | Check DRAM%, occupancy |
| DRAM Throughput | >70% | Compute or latency bound | Check TC%, occupancy |
| Achieved Occupancy | >25% | Register/smem pressure | Reduce tile size, occupancy hint |
| L2 Hit Rate | >80% | Poor data reuse | Swizzle, larger tiles, data layout |
| Local Spilling | 0 bytes | Register overflow | Smaller tiles, fewer accumulators |
| Waves/SM | >1.0 | Underfilled GPU | More blocks, reduce per-block resources |
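These thresholds complement the roofline view from Device Peaks: a kernel's achieved arithmetic intensity relative to the ridge point predicts which side of the roofline it sits on. A minimal sketch, expressing the ridge point in FLOP/byte (hence the unit-conversion factor); the device numbers in the usage note are invented for illustration:

```python
def ridge_point(peak_tflops: float, peak_gbps: float) -> float:
    """Roofline knee in FLOP/byte: peak_tflops / peak_gbps.

    The factor of 1000 converts (TFLOP/s) / (GB/s) into FLOP per byte.
    """
    return peak_tflops * 1000.0 / peak_gbps

def roofline_regime(intensity_flop_per_byte: float,
                    peak_tflops: float, peak_gbps: float) -> str:
    """Below the ridge point the kernel cannot saturate compute: memory-bound."""
    rp = ridge_point(peak_tflops, peak_gbps)
    return "memory-bound" if intensity_flop_per_byte < rp else "compute-bound"
```

For a hypothetical 100 TFLOP/s, 2000 GB/s device the ridge point is 50 FLOP/byte, so a kernel achieving 10 FLOP/byte is memory-bound regardless of its TC utilization.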
Bottleneck Classification
| Pattern | Classification | Focus |
|---|---|---|
| DRAM% high + TC% low | Memory-bound | Data reuse, TMA hints, head grouping |
| TC% high + DRAM% low | Compute-bound | Kernel efficiency, tile sizes |
| Both low | Latency-bound | Occupancy, reduce spilling, more blocks |
| L2 hit < 80% | Locality issue | Swizzle scheduling, tile size adjustment |
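The classification table can be folded into a small triage helper. This is a sketch of the decision logic only; the high/low cutoffs other than the 80% L2 target are illustrative assumptions, not fixed rules:

```python
def classify_bottleneck(tc_pct: float, dram_pct: float, l2_hit_pct: float,
                        high: float = 60.0, low: float = 30.0,
                        l2_ok: float = 80.0) -> list[str]:
    """Map (TC%, DRAM%, L2 hit%) to the bottleneck classes above."""
    issues = []
    if l2_hit_pct < l2_ok:
        issues.append("locality")          # swizzle scheduling, tile-size tuning
    if dram_pct >= high and tc_pct < low:
        issues.insert(0, "memory-bound")   # data reuse, TMA hints, head grouping
    elif tc_pct >= high and dram_pct < low:
        issues.insert(0, "compute-bound")  # kernel efficiency, tile sizes
    elif tc_pct < low and dram_pct < low:
        issues.insert(0, "latency-bound")  # occupancy, spilling, more blocks
    else:
        issues.insert(0, "mixed")          # no single dominant pattern
    return issues
```

A kernel can land in two buckets at once (e.g. latency-bound with a locality issue), which is why this returns a list rather than a single label.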
Profiler Output Contract
After profiling, return results to the orchestrator in this exact format. Always read `params.json` to populate the configuration fields accurately.
```markdown
Profile: [kernel] [version]

Configuration
| b | s | t | dtype |
|---|---|---|---|
| X | X | X | bfloat16 |

Stages
| Stage | Duration (us) | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
|---|---|---|---|---|---|---|

Bottleneck: [Memory/Compute/Latency]-bound
Root cause: [2 sentences max]

Top 3 Opportunities (ranked by estimated impact)
- [name] — est. X% gain — trigger: [metric=value]
- ...
- ...

vs Baseline (if applicable)
| Metric | Previous | Current | Change |
|---|---|---|---|
```
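When populating the contract programmatically, a tiny formatter keeps the Stages rows consistent. This is a hypothetical helper for illustration, not part of the toolkit:

```python
def stage_row(stage: str, duration_us: float, tc_pct: float, dram_pct: float,
              occ_pct: float, bottleneck: str, key_issue: str) -> str:
    """Render one Stages row of the profiler output contract as markdown."""
    return (f"| {stage} | {duration_us:.2f} | {tc_pct:.0f}% | {dram_pct:.0f}% | "
            f"{occ_pct:.0f}% | {bottleneck} | {key_issue} |")
```

Keeping duration to two decimals and utilizations to whole percents matches the precision NCU reports meaningfully at these kernel sizes.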
Development Log Performance Template
Update the kernel's devlog (e.g., `docs/kernels/mla-var6-plus.md`) with:
```markdown
**Performance** (<device>, locked clocks if applicable, bfloat16, b=X, s=X, t=X):
| Metric | Value | vs Previous |
| ----------------- | ------- | ----------- |
| Duration | X.XX μs | Y% faster |
| Achieved TFLOPs/s | X.XX | +Z% |
| Achieved GB/s | X.XX | +Z% |
| Occupancy | XX% | -- |
| TC Util | XX% | -- |

**Bottleneck**: [Memory-bound / Compute-bound / Latency-bound]
**Issues**:
- [Remaining problems]
**Insights**:
- [Key lessons — why optimization worked or didn't]
- [Guidance for next iteration]
```
Detailed References
- Annotation mode: references/mode-annotation.md — artifacts, report structure, reading guide
- Event mode: references/mode-event.md — artifacts, limitations, reading guide
- NCU mode: references/mode-ncu.md — artifacts, annotated source, per-kernel SASS metrics and PTX source files, reading guide
- NSYS mode: references/mode-nsys.md — artifacts, CSVs, visualization, reading guide