GPU kernel profiling workflow across supported kernel implementation languages. Provides commands for all 4 profiling modes (annotation, event, ncu, nsys), metric interpretation tables, bottleneck identification rules, and the output contract for returning compact results to the orchestrator. Use when: (1) profiling a kernel version, (2) interpreting profiling artifacts/reports, (3) comparing kernel versions, (4) identifying bottlenecks and optimization opportunities, (5) documenting performance in the development log.
Install the skill:

```shell
npx skill4agent add pepperu96/hyper-mla profile-kernel
```

Lock GPU clocks for stable measurements (the lock state is recorded in `params.json`):

```shell
bash scripts/lock-gpu-clock.sh    # before profiling
bash scripts/reset-gpu-clock.sh   # after done
```

Device peak specs live in `src/mla_var3/conf/devices.json` (documented under `docs/devices/`); the roofline ridge point is `ridge_point = peak_tflops / peak_gbps`.

| Mode | `--prof_type` | When to use | Output root |
|---|---|---|---|
| Annotation | `annotation` | Default first pass: roofline + NCU metrics + comparison tables | `out/profiles/annotation/…` |
| Event | `event` | Quick iteration timing (carries Python overhead) | `out/profiles/event/…` |
| NCU | `ncu` | Deep investigation: full NCU sections, source annotations, optimization suggestions | `out/profiles/ncu/…` |
| NSYS | `nsys` | Pipeline overlap, stream concurrency, launch ordering | `out/profiles/nsys/…` |
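The ridge point formula above can be sketched as follows; the peak numbers used here are hypothetical placeholders, not values from `devices.json`:

```python
def ridge_point(peak_tflops: float, peak_gbps: float) -> float:
    """Roofline ridge point, as in the docs: peak_tflops / peak_gbps.

    With peaks in TFLOP/s and GB/s the ratio is in TFLOP per GB — the
    same units as achieved_tflops / achieved_gbps below, so the two
    compare directly.
    """
    return peak_tflops / peak_gbps


def classify(achieved_tflops: float, achieved_gbps: float,
             peak_tflops: float, peak_gbps: float) -> str:
    """Left of the ridge the memory roof binds; right of it, the compute roof."""
    intensity = achieved_tflops / achieved_gbps  # arithmetic intensity, TFLOP/GB
    if intensity < ridge_point(peak_tflops, peak_gbps):
        return "memory-bound"
    return "compute-bound"


# Hypothetical device peaks (NOT from devices.json): 990 TFLOP/s, 3350 GB/s.
ridge = ridge_point(990, 3350)         # ≈ 0.2955 TFLOP/GB (~295 FLOP/byte)
print(classify(400, 2800, 990, 3350))  # memory-bound: intensity ≈ 0.143 < ridge
```

A kernel's achieved-TFLOPs-to-achieved-GB/s ratio against this ridge point says which roof it sits under, which is the first question the annotation report answers.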
CLI: one invocation per mode; `--prof_type` is one of `annotation`, `event`, `ncu`, `nsys` (for NSYS, look at `ConcurrentKernels` for stream overlap):

```shell
python -m mla_var3.kernel <kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>

# Language-specific path when needed:
python -m mla_var3.kernel.<language_python_package>.mla.<kernel_package> [<version>] \
    --b=32 --s=16 --t=4096 --prof_type=<mode>
```

Python API, where `plan` is a `KernelPlan`:

```python
# Annotation
from mla_var3.runtime.profiling.annotation import annotation_profile_plan
prof_result = annotation_profile_plan(plan, out_dir=..., version=..., bench_runs=...)

# NCU
from mla_var3.runtime.profiling.ncu import ncu_profile_plan
prof_result = ncu_profile_plan(plan, out_dir=..., version=..., bench_runs=...)

# NSYS
from mla_var3.runtime.profiling.nsys import nsys_profile_plan
prof_result = nsys_profile_plan(plan, out_dir=..., version=..., bench_runs=...)
```

Benchmark entry points live under `tests/benchmark/`:

```shell
python -m tests.benchmark.bench_mla_var6_plus      # base version
python -m tests.benchmark.bench_mla_var6_plus_v4   # specific version
python -m tests.benchmark.bench_all_mla            # compare all kernels (slow)
```

Each run writes to `out/profiles/<mode>/<kernel>/<params>/<timestamp>/`:

- `params.json`
- `report.md`
- `tiling/<stage>.json` and `tiling/<stage>/autotuning.json` (autotune cache under `.cache/kernel-autotune/...`)
- `compiled/<stage>/`

| Metric | Healthy | If unhealthy → Issue | Optimization hint |
|---|---|---|---|
| TC Util | >60% | Memory or latency bound | Check DRAM%, occupancy |
| DRAM Throughput | >70% | Compute or latency bound | Check TC%, occupancy |
| Achieved Occupancy | >25% | Register/smem pressure | Reduce tile size, occupancy hint |
| L2 Hit Rate | >80% | Poor data reuse | Swizzle, larger tiles, data layout |
| Local Spilling | 0 bytes | Register overflow | Smaller tiles, fewer accumulators |
| Waves/SM | >1.0 | Underfilled GPU | More blocks, reduce per-block resources |
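The thresholds in the table above translate directly into an automated health check. A minimal sketch — the metric names and dict layout are assumptions, not the repo's actual report schema:

```python
# One (healthy_predicate, issue) pair per row of the table above.
CHECKS = {
    "tc_util_pct":         (lambda v: v > 60,  "memory or latency bound"),
    "dram_throughput_pct": (lambda v: v > 70,  "compute or latency bound"),
    "achieved_occ_pct":    (lambda v: v > 25,  "register/smem pressure"),
    "l2_hit_pct":          (lambda v: v > 80,  "poor data reuse"),
    "local_spill_bytes":   (lambda v: v == 0,  "register overflow"),
    "waves_per_sm":        (lambda v: v > 1.0, "underfilled GPU"),
}


def unhealthy(metrics: dict) -> list:
    """Issues flagged for a set of measured metrics (missing keys are skipped)."""
    return [issue for name, (healthy, issue) in CHECKS.items()
            if name in metrics and not healthy(metrics[name])]


example = {"tc_util_pct": 45, "dram_throughput_pct": 85,
           "achieved_occ_pct": 31, "l2_hit_pct": 92,
           "local_spill_bytes": 0, "waves_per_sm": 2.3}
print(unhealthy(example))  # ['memory or latency bound']
```

Several failures at once are themselves a signal — e.g. TC Util and DRAM Throughput both failing matches the latency-bound pattern in the next table.
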
Bottleneck identification rules:

| Pattern | Classification | Focus |
|---|---|---|
| DRAM% high + TC% low | Memory-bound | Data reuse, TMA hints, head grouping |
| TC% high + DRAM% low | Compute-bound | Kernel efficiency, tile sizes |
| Both low | Latency-bound | Occupancy, reduce spilling, more blocks |
| L2 hit < 80% | Locality issue | Swizzle scheduling, tile size adjustment |
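These rules can be sketched as a classifier. The numeric `hi`/`lo` cutoffs and the `mixed` fallback are our additions — the table itself only says "high" and "low":

```python
def classify_bottleneck(tc_pct: float, dram_pct: float, l2_hit_pct: float,
                        hi: float = 60.0, lo: float = 40.0):
    """Return (classification, focus) per the pattern table above."""
    if dram_pct >= hi and tc_pct <= lo:
        result = ("memory-bound", "data reuse, TMA hints, head grouping")
    elif tc_pct >= hi and dram_pct <= lo:
        result = ("compute-bound", "kernel efficiency, tile sizes")
    elif tc_pct <= lo and dram_pct <= lo:
        result = ("latency-bound", "occupancy, reduce spilling, more blocks")
    else:
        result = ("mixed", "inspect full NCU sections")
    if l2_hit_pct < 80:  # locality issue stacks on top of any classification
        result = (result[0],
                  result[1] + "; swizzle scheduling, tile size adjustment")
    return result


print(classify_bottleneck(20, 85, 92)[0])  # memory-bound
```

The L2 rule is applied on top of the others because poor locality can coexist with any of the three primary classifications.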
Output contract: return this compact report to the orchestrator (configuration values come from `params.json`):

## Profile: [kernel] [version]
### Configuration
| b | s | t | dtype |
|---|---|---|-------|
| X | X | X | bfloat16 |
### Stages
| Stage | Duration (us) | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
|-------|---------------|-----|-------|------|------------|-----------|
### Bottleneck: [Memory/Compute/Latency]-bound
Root cause: [2 sentences max]
### Top 3 Opportunities (ranked by estimated impact)
1. [name] — est. X% gain — trigger: [metric=value]
2. ...
3. ...
### vs Baseline (if applicable)
| Metric | Previous | Current | Change |
|--------|----------|---------|--------|

Development log entry (e.g. `docs/kernels/mla-var6-plus.md`):

**Performance** (<device>, locked clocks if applicable, bfloat16, b=X, s=X, t=X):
| Metric | Value | vs Previous |
| ----------------- | ------- | ----------- |
| Duration | X.XX μs | Y% faster |
| Achieved TFLOPs/s | X.XX | +Z% |
| Achieved GB/s | X.XX | +Z% |
| Occupancy | XX% | -- |
| TC Util | XX% | -- |
**Bottleneck**: [Memory-bound / Compute-bound / Latency-bound]
**Issues**:
- [Remaining problems]
**Insights**:
- [Key lessons — why optimization worked or didn't]
- [Guidance for next iteration]
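The percentage columns in the tables above follow the usual conventions; a small sketch of the arithmetic (the exact formatting is assumed, since the templates don't spell it out):

```python
def pct_faster(prev_us: float, curr_us: float) -> str:
    """Duration row: 'Y% faster' relative to the previous version."""
    return f"{(prev_us - curr_us) / prev_us * 100:.1f}% faster"


def pct_change(prev: float, curr: float) -> str:
    """Throughput rows (TFLOPs/s, GB/s): signed percentage change."""
    return f"{(curr - prev) / prev * 100:+.1f}%"


print(pct_faster(12.0, 9.0))     # 25.0% faster
print(pct_change(450.0, 495.0))  # +10.0%
```

Note the asymmetry: duration improves by going down, throughput by going up, so the two rows use different formulas.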