Evaluate the performance of Triton operators on Ascend NPU. Use this skill to analyze operator performance bottlenecks, collect and compare operator performance with msprof/msprof op, diagnose Memory-Bound vs Compute-Bound behavior, measure hardware utilization metrics, and generate performance evaluation reports.

## Installation

```bash
npx skill4agent add ascend/agent-skills triton-operator-performance-eval
```

## Measurement Methods

- Quick timing in Python: `time.time()` (together with `torch.npu.synchronize()`), `torch.npu.Event`, or `triton.testing.do_bench`.
- Function-level collection with `msprof`: `msprof --application="python my_script.py" --output=./profiling_result` (usage and output analysis in `msprof-function-level.md`).
- Operator-level collection with `msprof op`: `msprof op --kernel-name={jit_kernel_name} {application}` (usage and hardware metrics in `msprof-op-level.md`).
- Data analysis methodology: `performance-data-analysis.md`.

## Performance Evaluation Summary
### Basic Information
| Item | Value |
|------|-----|
| Operator Name | {kernel_name} |
| Input Scale | {shape, dtype} |
| Test Hardware | {Ascend Model} |
| Measurement Method | {msprof / msprof op} |
### Performance Metrics
| Metric | Value | Reference Value | Utilization |
|------|-----|--------|--------|
| Execution Time | {X} us | - | - |
| Memory Bandwidth | {X} GB/s | {Theoretical Peak} GB/s | {X}% |
| Cube Utilization | - | - | {X}% |
| Vector Utilization | - | - | {X}% |
| L2 Cache Hit Rate | - | - | {X}% |
| Bank Conflict Ratio | - | - | {X}% |
### Bottleneck Diagnosis
- **Bottleneck Type**: {Memory-Bound / Compute-Bound}
- **Judgment Basis**: {Analysis based on Arithmetic Intensity and hardware utilization data}
- **Key Evidence**: {Cite specific CSV data}
### Performance Issue List
| Priority | Problem | Evidence | Optimization Direction |
|--------|------|------|----------|
| P0 | {Most Critical Issue} | {Data Source} | {Specific Suggestions} |
| P1 | ... | ... | ... |
### Optimization Suggestions
1. {Highest Priority Optimization Suggestion and Expected Benefit}
2. ...

### Reference Loading Strategy

| Task Type | Must Load | Do Not Load |
|---|---|---|
| Function-Level Performance Operator Comparison | `msprof-function-level.md` | |
| Operator-Level Hardware Analysis | `msprof-op-level.md` | |
| Performance Bottleneck Diagnosis | `performance-data-analysis.md` | - |
| Understand Hardware Terminology | | - |
| Complete Performance Optimization Process | All references | - |
When timing manually with `time.time()` or `torch.npu.Event`, always call `torch.npu.synchronize()` before reading timestamps, since NPU kernels launch asynchronously. The key parameter for `msprof` is `--application`; for `msprof op` it is `--kernel-name` (plus `--aic-metrics` for hardware counters).

| Pitfall | Manifestation | Correct Approach |
|---|---|---|
| Timing with `time.time()` / `torch.npu.Event` without synchronization | Measured time reflects only kernel launch, not execution | Use `triton.testing.do_bench`, or call `torch.npu.synchronize()` before reading timestamps |
| Using `msprof op` when comparing operators | Only a single kernel can be seen, no comparison possible | Use `msprof` for whole-application comparison |
| Incorrect `--kernel-name` | `msprof op` collects no data for the target kernel | Confirm the kernel name matches the Triton function definition |
| Failing to distinguish first-time compilation and steady-state performance | Abnormally high execution time in the first run | Collect data only after at least 5 warm-up runs |
| Testing performance with small-scale inputs | Startup overhead accounts for a large proportion, conclusions have no reference value | Use production-scale inputs for evaluation |
| Ignoring the impact of dtype on performance | Significant performance difference between FP16 and FP32 | Fix dtype for comparison and evaluate separately |
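The warm-up and synchronization points above can be combined into a small timing harness. This is an illustrative sketch, not part of the skill itself; the `sync` hook stands in for `torch.npu.synchronize()` and is passed in as a parameter so the harness also runs off-device:

```python
import time

def bench(fn, warmup=5, iters=20, sync=lambda: None):
    """Return the mean execution time of fn in microseconds.

    warmup: discarded first runs, so first-time Triton compilation
            is excluded from the measurement (see pitfalls above).
    sync:   barrier called before reading timestamps; on an Ascend
            device pass torch.npu.synchronize (kernels launch async).
    """
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters * 1e6
```

`triton.testing.do_bench` implements the same warm-up/repeat pattern and is usually preferable for real kernels.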
| Bottleneck Type | Judgment Criteria | Core Optimization Direction |
|---|---|---|
| Memory-Bound | AI < hardware balance point; high bandwidth utilization, low computing utilization | Reduce data transfer volume, improve data reuse, optimize memory access patterns |
| Compute-Bound | AI > hardware balance point; high computing utilization, low bandwidth utilization | Optimize computation instruction efficiency, improve Cube/Vector utilization |
| Latency-Bound | Both bandwidth and computing utilization are low | Increase parallelism (Grid Size), reduce synchronization overhead |
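The judgment criteria in the table above can be sketched as a small helper. All peak numbers in the example are made-up placeholders (not Ascend specifications), and the 30% "low utilization" threshold is an assumption, not a value defined by this skill:

```python
def classify_bottleneck(flops, bytes_moved, peak_tflops, peak_gbs,
                        compute_util, bw_util, low_util=0.30):
    """Classify per the table: compare Arithmetic Intensity (AI)
    against the hardware balance point (peak compute / peak bandwidth)."""
    ai = flops / bytes_moved                        # FLOP per byte
    balance = (peak_tflops * 1e12) / (peak_gbs * 1e9)
    if compute_util < low_util and bw_util < low_util:
        return "Latency-Bound"                      # both pipes mostly idle
    return "Memory-Bound" if ai < balance else "Compute-Bound"

# FP32 vector add: N FLOPs vs 12*N bytes moved -> AI ~ 0.083,
# far below a placeholder balance point of 100 TFLOPS / 1000 GB/s.
kind = classify_bottleneck(flops=1e9, bytes_moved=12e9,
                           peak_tflops=100, peak_gbs=1000,
                           compute_util=0.05, bw_util=0.85)
# -> "Memory-Bound"
```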
| Problem | Symptom | Diagnosis Data Source | Solution Direction |
|---|---|---|---|
| UB Overflow | Compilation error/runtime OOM | Check BLOCK_SIZE configuration | Reduce BLOCK_SIZE or split blocks within the kernel |
| Cube Miss | Performance is only 10% of theoretical value | ArithmeticUtilization.csv | Force BLOCK_M/N/K to be multiples of 16 |
| Precision Loss | Large deviation in FP16 results | Compare with PyTorch results | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Memory.csv | Adjust data layout to be contiguous |
| Low Parallelism | Low AI Core utilization | PipeUtilization.csv | Increase Grid Size |
| High Bank Conflict | Resource conflict rate > 10% | ResourceConflictRatio.csv | Adjust data block size and alignment method |
| Low L2 Cache Hit Rate | Frequent GM access | L2Cache.csv | Optimize Tiling strategy to improve data locality |
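Evidence such as the "bandwidth utilization is only 20%" symptom above can be computed from the collected CSVs. A minimal sketch, assuming a hypothetical column name (`main_mem_bw(GB/s)`); check the actual headers of your Memory.csv before reusing it:

```python
import csv
import io

def bandwidth_utilization(csv_text, column, peak_gbs):
    """Peak-relative utilization (0-1) of the highest bandwidth
    sample found in one column of an msprof-op-style CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    measured = max(float(r[column]) for r in rows)
    return measured / peak_gbs

# Hypothetical two-sample Memory.csv fragment
sample = ("kernel,main_mem_bw(GB/s)\n"
          "triton_add_kernel,320.0\n"
          "triton_add_kernel,280.0\n")
util = bandwidth_utilization(sample, "main_mem_bw(GB/s)", peak_gbs=1600.0)
# ~20% of peak -> matches the non-contiguous-access symptom above
```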
| Dimension | `msprof` | `msprof op` |
|---|---|---|
| Command Format | `msprof --application="python my_script.py" --output=./profiling_result` | `msprof op --kernel-name={jit_kernel_name} {application}` |
| Analysis Granularity | All operators of the entire application | Specified single kernel |
| Core Output | op_summary.csv, timeline_trace.json, report.html | ArithmeticUtilization.csv, Memory.csv, PipeUtilization.csv, etc. |
| Provided Information | Operator execution time ranking, Host/Device full-link timeline | Hardware utilization, memory bandwidth, Bank Conflict and other microarchitecture metrics |
| Typical Usage | Compare overall performance of PyTorch vs Triton operators | Diagnose hardware bottlenecks of a single kernel |
| Required Parameter | `--application` | `--kernel-name` |
| Document | Content | Associated Command |
|---|---|---|
| `msprof-function-level.md` | Function-level performance collection usage and output analysis | `msprof` |
| `msprof-op-level.md` | Operator-level in-depth analysis usage and hardware metrics | `msprof op` |
| `performance-data-analysis.md` | Detailed analysis methods for `msprof` / `msprof op` performance data | `msprof op` |
| | Overview of performance analysis toolchain and workflow | Both |
| | Ascend hardware terminology and architecture concepts | - |