triton-operator-performance-optim

Optimize the performance of Triton operators on Ascend NPU. This guide is for users who need to tune Triton operators on Ascend NPU: resolving UB overflow, improving Cube unit utilization, and designing Tiling strategies.

NPX Install

npx skill4agent add ascend/agent-skills triton-operator-performance-optim

Triton Operator Performance Optimization (Ascend NPU)

Golden Rules: Accuracy and Generalization Are Non-Negotiable Bottom Lines

No performance optimization shall breach the following two bottom lines:
  1. Accuracy Bottom Line: Optimized operators must align with the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3). Reduction operations must be upcast to FP32, and matrix multiplication accumulators must use FP32. Any optimization that fails accuracy verification must be rolled back.
  2. Generalization Bottom Line: Optimized operators must support all original input shapes and data types. Optimization shall not hardcode for specific sizes at the cost of losing support for non-aligned dimensions and edge cases. Optimization methods must keep operator interfaces and semantics unchanged.
Priority: Correctness > Generalization > Performance. In case of conflicts among the three, make trade-offs in this order.
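
The FP32-accumulator requirement can be illustrated without an NPU. The following is a minimal sketch in plain Python (not Triton code), assuming that the half- and single-precision formats of the standard `struct` module are an adequate stand-in for hardware rounding. The helper names `round_to` and `accumulate` are hypothetical.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to IEEE half ('e') or single ('f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(values, acc_fmt: str) -> float:
    """Sum values, rounding the running total to acc_fmt after every add."""
    total = 0.0
    for v in values:
        total = round_to(acc_fmt, total + v)
    return total

# 4096 FP16 inputs, each the FP16 value nearest to 0.1
inputs = [round_to("e", 0.1)] * 4096

fp16_sum = accumulate(inputs, "e")  # accumulator kept in FP16
fp32_sum = accumulate(inputs, "f")  # accumulator upcast to FP32
exact = sum(inputs)                 # float64 reference

print(fp16_sum, fp32_sum, exact)
```

With the FP16 accumulator, the running total stops growing once the gap between adjacent FP16 values exceeds twice the addend, so the result falls far short of the reference. The FP32 accumulator stays within the stated tolerance.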

Core Principles

The compiler optimizes as best it can, but it may act on wrong assumptions. Your job is to "provide unambiguous instructions that the compiler cannot misinterpret."
Optimization should aim to let the hardware do what it is good at (Cube for matrix multiplication, Vector for element-wise operations), eliminate redundant GM access, and maximize UB data reuse.

Optimization Workflow

Phase 1: Receive Performance Evaluation Conclusions (MANDATORY)

You must obtain performance evaluation results before starting optimization. Performance collection and analysis are handled by the triton-operator-performance-eval Skill.
⚠️ The only trusted sources of performance data: msprof and msprof op
Performance data collected by any method other than msprof / msprof op (including but not limited to Python time.time(), torch.npu.Event timing, triton.testing.do_bench, and custom timing decorators) is unacceptable and must not be used as a basis for optimization. Such data is inaccurate, cannot exclude interference from system scheduling and JIT compilation, and has no reference value. Reject optimization requests based on such data and require re-collection with msprof / msprof op first.
Extract the following from the evaluation conclusions:
  • Bottleneck Type: Memory-Bound / Compute-Bound / Latency-Bound
  • Key Bottleneck Metrics: Which hardware utilizations are low, which resources have conflicts
  • Performance Issue List: Prioritized list of issues to be optimized
Reference to Load: Read ascend-terminology.md to understand hardware architecture and terminology.

Phase 2: Select Optimization Strategies Based on Bottlenecks

| Bottleneck Type | Optimization Focus | Key Methods |
| --- | --- | --- |
| Memory-Bound | Memory access patterns and data reuse | Vectorized memory access, UB cache reuse, operator fusion |
| Compute-Bound | Compute unit utilization | Cube unit adaptation, Block size tuning |
| Latency-Bound | Parallelism and synchronization overhead | Increase parallelism, reduce CPU-NPU synchronization |

Four Basic Tuning Steps (check in order):
  1. Block Size and Grid Size: Adapt to UB capacity and Cube granularity
  2. Vectorized Memory Access: Contiguous access + Mask + Alignment
  3. UB Cache and Data Reuse: Intra-core re-blocking to fit the 192KB UB
  4. Compile-Time Constants and Loop Unrolling: tl.constexpr + tl.static_range
Checkpoint: After each tuning step, verify that accuracy has not degraded before proceeding to the next step.
Reference to Load: Read optimization-patterns.md to get detailed code patterns for the four steps.
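
Step 1 and step 3 both come down to a byte budget. The following is a minimal sketch in plain Python that picks the largest power-of-two block size whose buffers stay strictly below the 192KB UB capacity, with each buffer rounded up to 32B. The buffer list is a hypothetical kernel, and the model ignores any temporaries the compiler may allocate, so treat the result as an upper bound.

```python
UB_BYTES = 192 * 1024  # UB capacity stated in this guide
ALIGN = 32             # buffer alignment stated in this guide

def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align

def ub_usage(block_size: int, elem_bytes_per_buffer) -> int:
    """Total UB bytes for one tile: one buffer of block_size elements per entry."""
    return sum(align_up(block_size * b) for b in elem_bytes_per_buffer)

def largest_block(elem_bytes_per_buffer, start: int = 1 << 20) -> int:
    """Largest power-of-two block size whose buffers stay strictly below UB_BYTES."""
    block = start
    while block > 1 and ub_usage(block, elem_bytes_per_buffer) >= UB_BYTES:
        block //= 2
    return block

# Hypothetical kernel: two FP16 inputs, one FP32 intermediate, one FP16 output.
buffers = [2, 2, 4, 2]
block = largest_block(buffers)
print(block, ub_usage(block, buffers))
```

Three FP32 buffers of 16384 elements total exactly 192KB, which is not strictly below the limit, so `largest_block([4, 4, 4])` drops to the next power of two.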

Phase 3: Hardware-Specific Optimization

  • Cube Unit Adaptation: BLOCK_M/N/K must be multiples of 16, and accumulators must use FP32
  • UB Space Management: Calculate the total size of all buffers and ensure it is < 192KB, with single-value buffers aligned to 32B
  • Grid Configuration: Grid dimensions shall not exceed the number of physical cores. Use driver.active.utils.get_device_properties("npu") to get the number of cores
python
# Example: query the number of physical cores
from triton.runtime import driver

props = driver.active.utils.get_device_properties("npu")
# Pick one, depending on the operator:
core_num = props["num_aicore"]      # operators that contain tl.dot
core_num = props["num_vectorcore"]  # all other operators
Checkpoint: After hardware-specific optimization, verify that generalization is not compromised using multiple input shapes.
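
One way to keep the grid within the physical core count is to cap it and let each program walk several tiles with a stride. The following is a minimal sketch in plain Python of that partitioning. The name `launch_plan` and the core count of 40 are hypothetical, and inside a real kernel the inner list would correspond to a loop over tile indices.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return (a + b - 1) // b

def launch_plan(n_elements: int, block_size: int, core_num: int):
    """Return (grid, tiles) with grid capped at core_num."""
    num_tiles = cdiv(n_elements, block_size)
    grid = max(1, min(core_num, num_tiles))
    # Program pid handles tiles pid, pid + grid, pid + 2 * grid, ...
    tiles = [list(range(pid, num_tiles, grid)) for pid in range(grid)]
    return grid, tiles

grid, tiles = launch_plan(n_elements=100_000, block_size=1024, core_num=40)
covered = sorted(t for per_program in tiles for t in per_program)
print(grid, len(covered))
```

Every tile is assigned to exactly one program, so the cap changes how work is distributed but not which elements are processed.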
References to Load:
  • Read triton-ascend-api.md to get Ascend-specific APIs and high-performance implementation patterns
  • Read tiling-strategies.md to understand Tiling strategy design
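
Keeping block sizes at multiples of 16 does not have to cost support for non-aligned shapes, because the tail tile can be masked. The following is a minimal sketch in plain Python of that idea for one dimension. The helper names and the cap of 64 are hypothetical, and each boolean list plays the role of the mask that would be passed to tl.load / tl.store.

```python
CUBE_GRANULE = 16  # block-size granularity stated in this guide

def round_up(x: int, multiple: int = CUBE_GRANULE) -> int:
    """Round x up to the next multiple."""
    return (x + multiple - 1) // multiple * multiple

def tile_masks(dim: int, block: int):
    """For each tile along one dimension, list which lanes are in bounds."""
    assert block % CUBE_GRANULE == 0, "block must be a multiple of 16"
    masks = []
    for start in range(0, dim, block):
        masks.append([start + i < dim for i in range(block)])
    return masks

M = 127                         # non-aligned dimension
BLOCK_M = min(64, round_up(M))  # aligned block size, capped at 64
masks = tile_masks(M, BLOCK_M)
print(BLOCK_M, len(masks), sum(masks[-1]))
```

The last tile has fewer valid lanes than `BLOCK_M`, and the masks across all tiles cover exactly `M` elements.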

Phase 4: Advanced Optimization (As Needed)

  • Operator Fusion: Merge multiple GM accesses into one and reuse UB intermediate results
  • Double Buffer: Ping-pong loading to hide memory access latency
Reference to Load: Read the "Advanced Optimization Techniques" section in optimization-patterns.md.
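
The benefit of ping-pong loading can be estimated with a small timeline model. The following is a minimal sketch in plain Python, assuming one load engine, one compute engine, fixed per-tile costs, and `n_buffers` UB slots. The function name and the timings are hypothetical and say nothing about real hardware.

```python
def pipeline_time(n_tiles: int, t_load: float, t_compute: float, n_buffers: int) -> float:
    """Finish time when loads and computes overlap, limited by n_buffers UB slots."""
    load_end, compute_end = [], []
    for i in range(n_tiles):
        prev_load = load_end[i - 1] if i >= 1 else 0.0
        # Slot i % n_buffers is free once tile i - n_buffers has been computed.
        slot_free = compute_end[i - n_buffers] if i >= n_buffers else 0.0
        load_end.append(max(prev_load, slot_free) + t_load)
        prev_compute = compute_end[i - 1] if i >= 1 else 0.0
        compute_end.append(max(load_end[i], prev_compute) + t_compute)
    return compute_end[-1] if compute_end else 0.0

single = pipeline_time(n_tiles=8, t_load=3, t_compute=2, n_buffers=1)
double = pipeline_time(n_tiles=8, t_load=3, t_compute=2, n_buffers=2)
print(single, double)
```

With one buffer every load waits for the previous compute, so the costs add up. With two buffers the longer of the two stages sets the steady-state rate, which is why double buffering helps most when load and compute times are comparable.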

Phase 5: Verification (MANDATORY)

  1. Accuracy Verification: Compare with the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3)
  2. Generalization Verification: Test non-aligned dimensions and edge cases (such as non-2^n sizes like 127, 255, 1023)
  3. Performance Verification: Re-run triton-operator-performance-eval to verify optimization effects
  4. End-to-End Performance Regression: If the optimization involves operator fusion that changes the computation graph, or adds tensor preprocessing outside the kernel (such as reshape, transpose, contiguous, etc.), you must compare the end-to-end latency before and after optimization using the msprof function-level profiling of triton-operator-performance-eval, so that the speedup inside the kernel is not offset by overhead outside the kernel
  5. Regression Check: Ensure that the optimization does not change operator interfaces and semantics
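
The accuracy and generalization checks can be combined into one sweep. The following is a minimal sketch in plain Python: `allclose` applies the element-wise criterion |a - e| <= atol + rtol * |e| (the same criterion torch.allclose uses), and `edge_sizes` generates sizes on and around powers of two. The `reference` and `candidate` functions are hypothetical stand-ins for the native PyTorch-NPU implementation and the optimized operator.

```python
def allclose(actual, expected, rtol=1e-3, atol=1e-3) -> bool:
    """Element-wise |a - e| <= atol + rtol * |e| over two equal-length sequences."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )

def edge_sizes(max_pow: int = 10):
    """Sizes around powers of two: 2^k - 1, 2^k, 2^k + 1, plus 1."""
    sizes = {1}
    for k in range(1, max_pow + 1):
        sizes.update({2**k - 1, 2**k, 2**k + 1})
    return sorted(sizes)

def reference(xs):
    """Hypothetical stand-in for the native implementation."""
    return [2.0 * x for x in xs]

def candidate(xs):
    """Hypothetical stand-in for the optimized operator (0.05% relative error)."""
    return [2.0 * x * (1 + 5e-4) for x in xs]

failures = []
for n in edge_sizes():
    xs = [float(i) for i in range(n)]
    if not allclose(candidate(xs), reference(xs)):
        failures.append(n)
print(failures)
```

Any size that lands in `failures` means the optimization must be rolled back or fixed before performance is considered.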

Anti-Pattern List (NEVER)

  • NEVER optimize from evaluation conclusions that did not come from the triton-operator-performance-eval Skill (bottleneck diagnosis data must be obtained first)
  • NEVER accept or use performance data collected by any method other than msprof / msprof op (including time.time(), torch.npu.Event, triton.testing.do_bench, custom timers, etc.); such data is unacceptable and must not be used for performance evaluation or optimization decisions
  • NEVER sacrifice accuracy for performance (accuracy is a non-negotiable bottom line)
  • NEVER hardcode for specific sizes to compromise generalization (optimization must be valid for all legal inputs)
  • NEVER perform direct reduction in FP16 (must upcast to FP32)
  • NEVER use BLOCK sizes that are not multiples of 16 for matrix multiplication
  • NEVER forget Mask (Ascend has zero tolerance for out-of-bounds access)
  • NEVER let the buffers implied by BLOCK_SIZE exceed UB capacity (192KB)
  • NEVER use non-contiguous memory access patterns
  • NEVER use tensor.item() in hot paths (triggers CPU-NPU synchronization)
  • NEVER submit optimized code that fails accuracy verification

Common Pitfalls and Troubleshooting

| Issue | Symptom | Root Cause | Solution |
| --- | --- | --- | --- |
| UB Overflow | Compilation error / runtime OOM | BLOCK_SIZE is too large | Reduce BLOCK_SIZE or perform intra-core re-blocking |
| Cube Miss | Performance is only 10% of theoretical value | BLOCK is not a multiple of 16 | Force BLOCK_M/N/K to be multiples of 16 |
| Accuracy Loss | Large deviation in FP16 results | Reduction not upcast to FP32 | Use FP32 for accumulators |
| Non-Contiguous Memory Access | Bandwidth utilization is only 20% | Address jumping | Adjust data layout to be contiguous |
| Inter-Core Communication Overhead | Performance degrades with multiple Grids | Data transfer between AI Core clusters | Increase Block granularity |
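
The "intra-core re-blocking" fix for UB overflow can be sketched as follows in plain Python. Instead of holding a whole row at once, the reduction walks the row in `sub_block`-sized chunks with a masked tail, so only one chunk is resident at a time. The function name is hypothetical, and the Python float accumulator stands in for an FP32 accumulator.

```python
def row_sum_reblocked(row, sub_block: int) -> float:
    """Reduce one row in sub_block-sized chunks so only one chunk is resident at a time."""
    acc = 0.0  # stands in for an FP32 accumulator
    for start in range(0, len(row), sub_block):
        # Masked load: lanes past the end of the row contribute 0.
        chunk = [
            row[start + i] if start + i < len(row) else 0.0
            for i in range(sub_block)
        ]
        acc += sum(chunk)
    return acc

row = [1.0] * 1023  # non-aligned row length
total = row_sum_reblocked(row, 256)
print(total)
```

The result is the same as reducing the whole row at once, while the peak buffer size depends on `sub_block` rather than on the row length.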

Optimization Checklist

Bottom Line Checks (MANDATORY)

  • Is accuracy aligned with the native PyTorch-NPU implementation (rtol=1e-3, atol=1e-3)?
  • Have non-aligned dimensions and edge cases passed tests?
  • Have operator interfaces and semantics remained unchanged?

Compile-Time

  • Is grid guaranteed to be less than or equal to the number of hardware cores?
  • Is BLOCK_SIZE a compile-time constant (tl.constexpr)?
  • Are loops using tl.static_range?

Memory

  • Is the total size of all buffers < UB capacity (192KB)?
  • Are single-value buffers allocated with 32B?
  • Are addresses 32-byte aligned?
  • Have Masks been added to all load/store operations?

Computation

  • Have reduction operations been upcast to FP32?
  • Are BLOCK sizes for matrix multiplication multiples of 16?
  • Is data reuse fully utilized?

Verification

  • Has performance evaluation been re-run after optimization?
  • Have multiple input sizes including edge cases been tested?
  • If the computation graph has changed or tensor preprocessing has been added, has end-to-end non-degradation been confirmed through msprof function-level profiling?

Reference Resources

On-Demand Document Loading

| Scenario | Document to Load | Documents Not to Load |
| --- | --- | --- |
| Understand hardware architecture and terminology | ascend-terminology.md | All others |
| Basic tuning and advanced optimization code patterns | optimization-patterns.md | ascend-terminology.md |
| Tiling strategy design | tiling-strategies.md | triton-ascend-api.md |
| Ascend-specific APIs and implementation patterns | triton-ascend-api.md | tiling-strategies.md |

Related Skills

  • triton-operator-performance-eval: Performance collection and evaluation (msprof usage, bottleneck diagnosis, performance report generation)

Official Resources