profile-kernel


GPU Kernel Profiling


GPU Clock Protocol


Always lock clocks before profiling. Always run one profiling session at a time.

```bash
bash scripts/lock-gpu-clock.sh           # before profiling
bash scripts/reset-gpu-clock.sh          # after profiling
```
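
The lock/profile/reset discipline can be wrapped in a context manager so the clocks are always reset even when a profiling run fails. A minimal sketch, assuming the two scripts above are run from the repository root:

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def locked_gpu_clocks():
    """Lock GPU clocks for the duration of a profiling run, always resetting after."""
    subprocess.run(["bash", "scripts/lock-gpu-clock.sh"], check=True)
    try:
        yield
    finally:
        # Reset even if profiling raised, so the GPU never stays locked.
        subprocess.run(["bash", "scripts/reset-gpu-clock.sh"], check=True)

# Usage (one profiling session at a time):
# with locked_gpu_clocks():
#     run_one_profiling_pass()
```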

Device Peaks


Shared profiling guidance should not hard-code a single device. Read the active device context from `params.json`, runtime profiling artifacts, and the configured peak-spec sources in `src/mla_var3/conf/devices.json`. Use device-specific reference files in `docs/devices/` only when the measurement context matters. Ridge point remains:

```text
ridge_point = peak_tflops / peak_gbps
```
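
As a worked example (the peak numbers below are illustrative, not any device's actual specs), the ridge point is the arithmetic intensity at which a kernel moves from memory-bound to compute-bound:

```python
def ridge_point(peak_tflops: float, peak_gbps: float) -> float:
    # Same formula as above. Because tera/giga differ by 1000x, the result
    # is in kilo-FLOP per byte; multiply by 1000 for FLOP/byte.
    return peak_tflops / peak_gbps

def classify(kernel_flops: float, kernel_bytes: float,
             peak_tflops: float, peak_gbps: float) -> str:
    """Compare a kernel's arithmetic intensity to the ridge point."""
    intensity_kflop_per_byte = kernel_flops / kernel_bytes / 1000.0
    if intensity_kflop_per_byte < ridge_point(peak_tflops, peak_gbps):
        return "memory-bound"
    return "compute-bound"

# Illustrative peaks only: 1000 TFLOP/s and 3000 GB/s give a ridge point
# of ~0.333 (i.e. ~333 FLOP/byte).
```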

Profiling Mode Decision Tree


| Mode | `--prof_type` | When to use | Output root |
| --- | --- | --- | --- |
| Annotation | `annotation` | Default first pass. Roofline + NCU metrics + comparison tables | `out/profiles/annotation/` |
| Event | `event` | Quick iteration timing (carries Python overhead) | `out/profiles/event/` |
| NCU | `ncu` | Deep investigation: full NCU sections, source annotations, optimization suggestions | `out/profiles/ncu/` |
| NSYS | `nsys` | Pipeline overlap, stream concurrency, launch ordering | `out/profiles/nsys/` |

Selection rule: Start with `annotation`. Use `event` for fast A/B comparison (not suitable for KernelPipeline/ConcurrentKernels). Use `ncu` when you need source-level analysis, Nsight Compute optimization suggestions, or CUDA insights. Use `nsys` for `ConcurrentKernels` overlap and stream concurrency analysis.
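
When scripting profiling sweeps, the selection rule can be captured as a small lookup. The goal keywords here are our own shorthand, not part of the CLI:

```python
# Maps an analysis goal to the --prof_type value from the table above.
PROF_TYPE_FOR_GOAL = {
    "first_pass": "annotation",   # roofline + NCU metrics + comparison tables
    "quick_ab": "event",          # fast timing; not for KernelPipeline/ConcurrentKernels
    "source_level": "ncu",        # full NCU sections, optimization suggestions
    "overlap": "nsys",            # stream concurrency, launch ordering
}

def prof_type_for(goal: str) -> str:
    # Unknown goals fall back to the default first-pass mode.
    return PROF_TYPE_FOR_GOAL.get(goal, "annotation")
```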

Profiling Commands


CLI entry point (all modes)


```bash
python -m mla_var3.kernel <kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>
```

Language-specific path when needed:


```bash
python -m mla_var3.kernel.<language_python_package>.mla.<kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>
```

Python API



Annotation


```python
from mla_var3.runtime.profiling.annotation import annotation_profile_plan

prof_result = annotation_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```

NCU


```python
from mla_var3.runtime.profiling.ncu import ncu_profile_plan

prof_result = ncu_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```

NSYS


```python
from mla_var3.runtime.profiling.nsys import nsys_profile_plan

prof_result = nsys_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```
Note that `version` is optional and lets you specify a custom version. When given, it retrieves the `KernelPlan` of the specified version and uses that, NOT the `plan` passed to the function; this is purely a convenience to automate profiling across versions. Alternatively, build the plan object directly with the class of the version to be profiled and omit the `version` argument.

**Important**: profiling can occasionally fail right after auto-tuning. In that case, simply re-run the command; the best autotune config is already cached, and the rerun should succeed.
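
Given that note, a driver script can simply retry once after an autotune-related failure. A minimal sketch; `profile_fn` stands for any of the `*_profile_plan` functions above, and the wrapper itself is our own, not part of the toolkit:

```python
def profile_with_retry(profile_fn, *args, retries: int = 1, **kwargs):
    """Run a profiling callable, retrying on failure.

    A first run after auto-tuning may fail; the best autotune config is
    already cached by then, so an immediate re-run usually succeeds.
    """
    last_err = None
    for _ in range(retries + 1):
        try:
            return profile_fn(*args, **kwargs)
        except Exception as err:  # the actual failure type depends on the toolchain
            last_err = err
    raise last_err
```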

Benchmark scripts


Benchmark scripts in `tests/benchmark/` are curated experiment drivers. Some accept CLI args; others are fixed scripts to copy and edit.

```bash
python -m tests.benchmark.bench_mla_var6_plus       # base version
python -m tests.benchmark.bench_mla_var6_plus_v4    # specific version
python -m tests.benchmark.bench_all_mla             # compare all kernels (slow)
```

Output Artifacts


All modes write to `out/profiles/<mode>/<kernel>/<params>/<timestamp>/` if no output directory is specified.

Common artifacts (all modes)


  • `params.json` — Full kernel parameter set (dtype, b, s, t, h, d, k, p). Always read this to confirm the profiling configuration, especially for NCU and NSYS modes, whose `report.md` does not embed parameters.
  • `tiling/<stage>.json` — Autotuned tiling config per stage
  • `tiling/<stage>/autotuning.json` — Autotuning history copied from `.cache/kernel-autotune/...`, colocated with the selected tiling so you can inspect the explored configurations without leaving the profiling output directory
  • `compiled/<stage>/` — MLIR and bytecode compilation artifacts
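
To confirm the configuration of the newest run programmatically, a small helper can read `params.json` from the most recent timestamped directory. The directory layout is as described above; the helper itself is our own and assumes timestamp names sort lexicographically:

```python
import json
from pathlib import Path

def latest_params(run_root: str) -> dict:
    """Read params.json from the newest timestamped run under
    out/profiles/<mode>/<kernel>/<params>/."""
    root = Path(run_root)
    # Timestamped directory names sort lexicographically, so the last is newest.
    runs = sorted(d for d in root.iterdir() if d.is_dir())
    if not runs:
        raise FileNotFoundError(f"no profiling runs under {root}")
    return json.loads((runs[-1] / "params.json").read_text())
```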

Per-mode artifacts


Each mode produces different artifacts. See the per-mode reference for full details:
  • mode-annotation.md — report.md with per-kernel NCU metrics, roofline comparison, tiling;
  • mode-event.md — report.md with end-to-end roofline summary;
  • mode-ncu.md — report.md (compact diagnosis + NCU recommendations), report-verbose.md (full NCU sections), annotated source, per-kernel SASS metrics, PTX source files;
  • mode-nsys.md — report.md with timeline and overlap analysis.

Metric Interpretation


| Metric | Healthy | If unhealthy → Issue | Optimization hint |
| --- | --- | --- | --- |
| TC Util | >60% | Memory or latency bound | Check DRAM%, occupancy |
| DRAM Throughput | >70% | Compute or latency bound | Check TC%, occupancy |
| Achieved Occupancy | >25% | Register/smem pressure | Reduce tile size, occupancy hint |
| L2 Hit Rate | >80% | Poor data reuse | Swizzle, larger tiles, data layout |
| Local Spilling | 0 bytes | Register overflow | Smaller tiles, fewer accumulators |
| Waves/SM | >1.0 | Underfilled GPU | More blocks, reduce per-block resources |

Bottleneck Classification


| Pattern | Classification | Focus |
| --- | --- | --- |
| DRAM% high + TC% low | Memory-bound | Data reuse, TMA hints, head grouping |
| TC% high + DRAM% low | Compute-bound | Kernel efficiency, tile sizes |
| Both low | Latency-bound | Occupancy, reduce spilling, more blocks |
| L2 hit < 80% | Locality issue | Swizzle scheduling, tile size adjustment |
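
The pattern table maps directly onto a small function. The thresholds come from the metric table (60% TC, 70% DRAM, 80% L2); the function is our own reading aid, not part of the toolkit:

```python
def classify_bottleneck(tc_pct: float, dram_pct: float, l2_hit_pct: float) -> str:
    """Apply the pattern table: high DRAM + low TC -> memory-bound, etc."""
    # Check locality first: a cold L2 is the most actionable signal.
    if l2_hit_pct < 80.0:
        return "locality issue"
    tc_high = tc_pct > 60.0
    dram_high = dram_pct > 70.0
    if dram_high and not tc_high:
        return "memory-bound"
    if tc_high and not dram_high:
        return "compute-bound"
    if not tc_high and not dram_high:
        return "latency-bound"
    return "balanced"  # both high: near roofline, not covered by the table
```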

Profiler Output Contract


After profiling, return results to the orchestrator in this exact format. Always read `params.json` to populate the configuration fields accurately.

Profile: [kernel] [version]


Configuration


| b | s | t | dtype |
| --- | --- | --- | --- |
| X | X | X | bfloat16 |

Stages


| Stage | Duration (us) | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
| --- | --- | --- | --- | --- | --- | --- |

Bottleneck: [Memory/Compute/Latency]-bound


Root cause: [2 sentences max]

Top 3 Opportunities (ranked by estimated impact)


  1. [name] — est. X% gain — trigger: [metric=value]
  2. ...
  3. ...

vs Baseline (if applicable)


| Metric | Previous | Current | Change |
| --- | --- | --- | --- |

Development Log Performance Template


Update the kernel's devlog (e.g., `docs/kernels/mla-var6-plus.md`) with:

```markdown
**Performance** (<device>, locked clocks if applicable, bfloat16, b=X, s=X, t=X):

| Metric            | Value   | vs Previous |
| ----------------- | ------- | ----------- |
| Duration          | X.XX μs | Y% faster   |
| Achieved TFLOPs/s | X.XX    | +Z%         |
| Achieved GB/s     | X.XX    | +Z%         |
| Occupancy         | XX%     | --          |
| TC Util           | XX%     | --          |

**Bottleneck**: [Memory-bound / Compute-bound / Latency-bound]

**Issues**:
- [Remaining problems]

**Insights**:
- [Key lessons — why optimization worked or didn't]
- [Guidance for next iteration]
```

Detailed References


  • Annotation mode: references/mode-annotation.md — artifacts, report structure, reading guide
  • Event mode: references/mode-event.md — artifacts, limitations, reading guide
  • NCU mode: references/mode-ncu.md — artifacts, annotated source, per-kernel SASS metrics and PTX source files, reading guide
  • NSYS mode: references/mode-nsys.md — artifacts, CSVs, visualization, reading guide