Performance analysis coordination workflow. Guides profiling delegation, bottleneck classification (compute/memory/launch/communication/sync), and structured report generation. Use when the user asks to analyze performance, profile a workload, check MFU/SOL, or diagnose bottlenecks.
npx skill4agent add nvidia/skills perf-analysis

| Type | Indicator | Description |
|---|---|---|
| Compute-bound | High GPU utilization, low memory bandwidth usage | Limited by compute capacity (FLOPs) |
| Memory-bound | High memory bandwidth, low compute utilization | Limited by DRAM throughput |
| Launch-overhead | Many small kernels, high CPU time | CPU becoming bottleneck from kernel launch overhead |
| Communication-bound | Significant time in collective operations | Limited by inter-GPU or inter-node communication |
| Sync-bound | Excessive CPU-GPU synchronization points | Stalls from unnecessary synchronization |
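
The classification in the table above can be sketched as a small rule-based function. This is only an illustration: the `ProfileSummary` fields, the `classify` helper, and every threshold value are assumptions made for this example and are not part of the skill; real cut-offs depend on the hardware and workload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfileSummary:
    """Hypothetical per-iteration summary distilled from profiler output."""
    compute_util_pct: float   # compute throughput as % of peak
    mem_bw_util_pct: float    # DRAM bandwidth as % of peak
    comm_time_pct: float      # share of step time spent in collectives
    sync_count: int           # CPU-GPU synchronization points per iteration
    kernels_per_iter: int     # number of kernel launches per iteration
    cpu_time_pct: float       # share of step time spent on the CPU side


def classify(p: ProfileSummary,
             high: float = 70.0,      # assumed "high utilization" cut-off
             low: float = 50.0,       # assumed "low utilization" cut-off
             comm_high: float = 30.0,
             cpu_high: float = 30.0,
             many_kernels: int = 1000,
             many_syncs: int = 10) -> List[str]:
    """Return every bottleneck type whose indicator matches the summary."""
    labels = []
    if p.compute_util_pct >= high and p.mem_bw_util_pct < low:
        labels.append("compute-bound")
    if p.mem_bw_util_pct >= high and p.compute_util_pct < low:
        labels.append("memory-bound")
    if p.kernels_per_iter >= many_kernels and p.cpu_time_pct >= cpu_high:
        labels.append("launch-overhead")
    if p.comm_time_pct >= comm_high:
        labels.append("communication-bound")
    if p.sync_count >= many_syncs:
        labels.append("sync-bound")
    return labels or ["unclassified"]
```

The function returns a list rather than a single label because a workload can match more than one indicator at once, for example a memory-bound model that also stalls on collectives.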
--set=full, --section SpeedOfLight, --set=full

Profile the batched GEMM kernel in bmm_workload.py with NCU.
The workload uses cudaProfilerStart/Stop markers to isolate the region of interest.
Collect kernel-level metrics: SOL%, compute/memory throughput, DRAM bandwidth,
tensor core utilization, occupancy, warp stall reasons, and roofline classification.
The batched GEMM performs 68.72 GFLOP per call (B=32, M=512, N=1024, K=2048, FP16).
Calculate MFU against the GPU's peak FP16 tensor core TFLOP/s.

Run NCU with --set=full --profile-from-start off --target-processes all.
If --set=full fails, try --set=detailed. Parse the CSV output for
sm__throughput.avg.pct_of_peak_sustained_elapsed.
Save raw NCU output to /workspace/.../ncu_output.txt.

remote-slurm

## Summary
Training at 42% MFU, memory-bound due to large attention tensors.
## Metrics
- Throughput: 1,247 samples/sec
- MFU: 42% (vs 65% theoretical for this model)
- % of SOL: 58% (room for 1.7x improvement)
- GPU Utilization: 45%
- Memory Bandwidth: 850 GB/s (89% of peak)
- Kernel Count: 1,247 per iteration
## Findings
1. Self-attention consumes 60% of memory bandwidth
2. Optimizer step has 3 unnecessary synchronizations
3. Batch size could be increased by 2x
## Recommendations
1. Enable FlashAttention (expected: +15% MFU)
2. Remove synchronizations in optimizer (expected: +5% throughput)
3. Increase batch size to improve GPU utilization
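
The arithmetic behind two figures quoted above, the 68.72 GFLOP per batched GEMM call and the 1.7x headroom at 58% of SOL, can be checked with a short sketch. The helper names (`bmm_flops`, `mfu`, `headroom`) are hypothetical, and the kernel time and peak TFLOP/s passed to `mfu` are placeholders to be replaced with measured and vendor-published values.

```python
def bmm_flops(b: int, m: int, n: int, k: int) -> int:
    """FLOPs for a batched GEMM: one multiply and one add per (b, m, n, k) element."""
    return 2 * b * m * n * k


def mfu(flops_per_call: float, seconds_per_call: float, peak_tflops: float) -> float:
    """Achieved FLOP/s divided by the peak FLOP/s, as a fraction in [0, 1]."""
    achieved = flops_per_call / seconds_per_call
    return achieved / (peak_tflops * 1e12)


def headroom(sol_fraction: float) -> float:
    """Potential speedup if the workload reached 100% of speed-of-light."""
    return 1.0 / sol_fraction


flops = bmm_flops(32, 512, 1024, 2048)
print(f"{flops / 1e9:.2f} GFLOP per call")           # shape from the prompt above
print(f"{headroom(0.58):.2f}x headroom at 58% SOL")  # figure from the report above

# Placeholder inputs: 1 ms per call on a GPU with an assumed 100 TFLOP/s peak.
print(f"MFU = {mfu(flops, 1e-3, 100.0):.1%}")
```

Keeping the FLOP count, the timing, and the peak rate as separate inputs makes it clear which numbers come from the workload definition, which from the profiler, and which from the hardware specification.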