triton-operator-design

Generates Triton operator requirement documents for the Ascend NPU. Use this skill when you need to design a new Triton operator, write an operator requirement document, or design an operator performance optimization.

NPX Install

npx skill4agent add ascend/agent-skills triton-operator-design

SKILL.md Content (translated from Chinese)


Generation of Triton Operator Requirement Documents

Workflow Overview

Generating Triton operator requirement documents is divided into the following phases:
  1. Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
  2. Prototype Design → Deliverables: API Interface Definition
  3. Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
  4. Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme

Phase 1: Requirement Analysis

1.1 Functional Analysis

Must include:
  • Operator Function Description: Clearly describe the role and application scenarios of the operator
  • Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
  • Variable Description Table:
| Variable | Type | Meaning | Constraints |
|----------|------|---------|-------------|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |
Key Terms (must be used accurately):
  • GM (Global Memory): Global memory, large-capacity storage on DDR
  • UB (Unified Buffer): Unified buffer, high-speed cache inside the Vector Core of AI Core
  • L1 Buffer: Level 1 buffer, cache inside the Cube Core of AI Core
  • AI Core: Computing core of Ascend processor, A2/A3 usually has 24 units, each containing 1 Cube computing core and 2 Vector computing cores
  • Tiling: Data splitting strategy, decomposing large tasks into small chunks
  • Reduction Operation: Dimensionality reduction calculations such as sum, mean, max, etc.
  • Precision Upcasting/Downcasting: Type conversion such as FP16→FP32 (upcast) or FP32→FP16 (downcast)

1.2 Competitor Solution Analysis

Must include:
  • Competitor Operator List:
| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|-----------------|------------------|----------------------|----------------------|-------------|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |
  • Comparison Analysis:
    • Function Comparison: Differences in functions supported by each framework
    • Performance Comparison: Performance on different hardware platforms
    • Design Reference: Excellent designs that can be referenced

Phase 2: Prototype Design

2.1 Interface Definition

Triton Interface Features:
  • Defined using Python functions
  • Supports automatic differentiation
  • Supports multiple data types
Interface Example:
```python
from typing import Optional

import torch


def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant for numerical stability

    Returns:
        Output tensor with the same shape as input
    """
    pass
```

2.2 Interface Description Table

| Parameter Name | Type | Input/Output | Description | Constraints |
|----------------|------|--------------|-------------|-------------|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |

2.3 Data Type Support

| Interface Type | Supported Data Types | Data Format |
|----------------|----------------------|-------------|
| Triton | FLOAT16, BF16, FLOAT | ND |

Phase 3: Specification Constraints

3.1 Input Tensor Constraints

| Constraint Item | Constraint Condition | Description |
|-----------------|----------------------|-------------|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | Unified use of ND format |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |

3.2 Output Tensor Constraints

| Constraint Item | Constraint Condition | Description |
|-----------------|----------------------|-------------|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |

3.3 Hardware Constraints

Hardware Limitations that must be considered:
  • AI Core Architecture:
    • A2/A3 usually has 24 AI Cores
    • Each AI Core contains 1 Cube computing core and 2 Vector computing cores
    • Cube Core is dedicated to matrix computation, Vector Core is dedicated to vector computation
  • UB Buffer Size: 192KB (A2/A3), dedicated to Vector Core
  • L1 Buffer Size: Usually 1MB (A2/A3), dedicated to Cube Core
  • Memory Alignment Requirements:
    • UB buffer must be 32-byte aligned
    • Single-value buffers (such as mean) require 32B space (even if only 4B is needed logically)
  • Data Type Size: FP16=2B, BF16=2B, FP32=4B
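
The constraints above can be captured as a small budget-check sketch. The constant names and the `fits_in_ub` helper are this document's own illustration (not an official API); the figures are the A2/A3 values quoted above:

```python
# Illustrative hardware constants for Ascend A2/A3 (figures from the
# constraints above; names are hypothetical, not an official API).
UB_SIZE_BYTES = 192 * 1024       # Unified Buffer per Vector Core
L1_SIZE_BYTES = 1 * 1024 * 1024  # L1 Buffer per Cube Core
NUM_AI_CORES = 24
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4}


def fits_in_ub(buffers: dict) -> bool:
    """Check that the sum of all per-loop buffer sizes (in bytes)
    stays within the UB capacity."""
    return sum(buffers.values()) <= UB_SIZE_BYTES
```

Running such a check while drafting the Tiling section catches schemes that exceed UB capacity before any kernel code is written.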

Phase 4: Feature Implementation Scheme

4.1 Tiling Splitting

This is the most critical part and must be explained in detail.

4.1.1 Inter-Core Splitting Strategy

Must include:
  1. Splitting Principles:
    • How to divide tasks into multiple AI Cores
    • Why this splitting method is chosen
    • How to ensure load balancing
  2. Calculation Method:
    Input: x[B, D]
    
    // Step 1: Calculate the amount of data processed by each Core
    data_per_core = ceil(total_size / num_cores)
    
    // Step 2: Calculate the data range of the current Core
    core_start = core_id * data_per_core
    core_end = min((core_id + 1) * data_per_core, total_size)
  3. Example:
    • Provide specific input shapes
    • Show splitting results
    • Explain the data range processed by each Core
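
The calculation method above can be sketched as a plain-Python reference (function name and return shape are illustrative):

```python
import math


def split_across_cores(total_size, num_cores=24):
    """Evenly split `total_size` work items across AI Cores using the
    ceil-based formula above. Returns a (start, end) range per core;
    trailing cores whose start exceeds total_size get an empty range."""
    data_per_core = math.ceil(total_size / num_cores)
    ranges = []
    for core_id in range(num_cores):
        core_start = min(core_id * data_per_core, total_size)
        core_end = min((core_id + 1) * data_per_core, total_size)
        ranges.append((core_start, core_end))
    return ranges
```

For example, 100 rows on 24 cores gives `data_per_core = 5`: cores 0 to 19 each process 5 rows and cores 20 to 23 are idle, which is the load-balancing trade-off the document should call out explicitly.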

4.1.2 Intra-Core Loop Strategy

Must include:
  1. UB Space Calculation:
    Total UB size: 192KB
    Data type size: FP16=2B, FP32=4B
    
    Buffers required for a single loop:
    - Input buffer: [Size] × [Type Size]
    - Intermediate buffer: [Size] × [Type Size]
    - Output buffer: [Size] × [Type Size]
    
    Amount of data processed per loop = Total UB size / Total space per loop
  2. Buffer Allocation Strategy:
    • List all required buffers
    • Explain the size and purpose of each buffer
    • Consider alignment requirements
  3. Precision Processing Strategy:
    • Whether precision upcasting (FP16→FP32) is required
    • At which stage to upcast precision
    • At which stage to downcast precision

4.2 Kernel Implementation

4.2.1 Computational Flow Diagram

Must draw a data flow diagram:
Input Tensor (GM)        Parameter Tensor (GM)
      │                        │
      ▼                        ▼
 [Load to UB]             [Load to UB]
      │                        │
      ▼                        ▼
[Calculation Step 1]     [Preprocessing]
      │                        │
      ▼                        │
[Calculation Step 2] ◄─────────┘
      │
      ▼
[Final Calculation]
      │
      ▼
Output Tensor (GM)
Key Points:
  • Label the data type of each step
  • Label data transmission between GM↔UB
  • Label the location of precision conversion

4.2.2 Core Implementation Logic

Explain separately according to input data types:
FP32 Input Type:
  1. Inter-core task allocation
  2. UB buffer management (list all buffers)
  3. Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
  1. Inter-core task allocation
  2. UB buffer management (including upcasting/downcasting buffers)
  3. Calculation process (including precision conversion steps)
Hardware Optimization Points:
  • Vectorized computation
  • Data reuse
  • Memory access optimization
  • Alignment processing
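
The FP16/BF16 flow described above can be sketched as a pure-Python reference (no Triton or NPU code; Python floats stand in for FP32 arithmetic, and the core/loop structure is illustrative):

```python
import math


def fp16_rowwise_mean(rows, num_cores=24):
    """Reference of the FP16 flow: allocate rows across cores, upcast
    each row before the reduction, reduce in high precision, and only
    downcast when storing the result."""
    total = len(rows)
    per_core = math.ceil(total / num_cores)
    out = [None] * total
    for core_id in range(num_cores):        # inter-core task allocation
        start = min(core_id * per_core, total)
        end = min((core_id + 1) * per_core, total)
        for r in range(start, end):         # intra-core loop
            row_fp32 = [float(v) for v in rows[r]]     # upcast FP16→FP32
            mean = sum(row_fp32) / len(row_fp32)       # reduce in FP32
            out[r] = mean                              # downcast on store
    return out
```

A real kernel replaces the inner loop with vectorized UB computation, but the document's per-dtype explanation should follow the same three-step structure: allocation, buffer management, calculation.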

Things Absolutely Not to Do

  • ❌ Use vague terms (such as "appropriate splitting", "reasonable allocation")
  • ❌ Ignore hardware constraints (UB size, alignment requirements)
  • ❌ Omit the specific calculation method of the Tiling strategy
  • ❌ Fail to distinguish processing strategies for different data types
  • ❌ Leave data types unlabeled in the data flow diagram
  • ❌ Ignore alignment requirements for reduction operations (must be 32B)
  • ❌ Confuse the uses of Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
  • ❌ Ignore the difference between UB and L1 (UB is dedicated to Vector Core, L1 is dedicated to Cube Core)

Common Pitfalls

Pitfall 1: Ignoring UB Size Limitations

Symptom: The designed scheme exceeds UB capacity
Solution:
  1. Calculate the total size of all buffers
  2. Ensure the total size < Total UB size
  3. If exceeded, adjust the amount of data processed per loop

Pitfall 2: Ignoring Memory Alignment

Symptom: Hardware errors or performance degradation
Solution:
  1. UB buffers are aligned to 32 bytes
  2. Allocate 32B space for single-value buffers (mean, variance, etc.)
  3. Use align_up(actual_size, 32), i.e. ceil(actual_size / 32) × 32, to calculate the allocated space
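
The alignment rule can be written as a one-line helper (the name `align_up` is this sketch's own):

```python
def align_up(size, alignment=32):
    """Round `size` up to the next multiple of `alignment` bytes,
    e.g. a 4-byte mean buffer still occupies 32 bytes in UB."""
    return (size + alignment - 1) // alignment * alignment
```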

Pitfall 3: Precision Loss

Symptom: Inaccurate calculation results when input is FP16
Solution:
  1. Upcast precision to FP32 before reduction operations
  2. Complete all calculations in FP32 precision
  3. Finally downcast precision to the output type
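
Why upcasting matters can be demonstrated without any NPU: the `struct` module's `'e'` format rounds to IEEE 754 half precision, so a running FP16 sum can be simulated in plain Python. Once the sum reaches 256, the FP16 spacing (0.25) exceeds the addend (≈0.1), and every further add rounds to no change:

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision
    value (struct's 'e' format), simulating an FP16 register."""
    return struct.unpack("e", struct.pack("e", x))[0]


fp16_tenth = to_fp16(0.1)  # 0.0999755859375

# Naive strategy: keep the running sum in FP16 — it stagnates at 256.
acc16 = 0.0
for _ in range(4096):
    acc16 = to_fp16(acc16 + fp16_tenth)

# Upcast strategy: accumulate in full precision, downcast once at the end.
acc32 = 0.0
for _ in range(4096):
    acc32 += fp16_tenth
result = to_fp16(acc32)  # ≈ 409.5, close to the exact 409.6
```

The same effect is why the document mandates FP32 accumulation before any reduction over FP16/BF16 inputs.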

Pitfall 4: Unreasonable Tiling Strategy

Symptom: Poor performance or inability to handle large shapes
Solution:
  1. Choose splitting dimensions based on operator characteristics
  2. Ensure each Core completes calculations independently
  3. Avoid cross-Core data dependencies

Quality Check Checklist

After completing the document, check the following items:

Requirement Analysis

  • Operator function description is clear
  • Mathematical formulas are correct and complete
  • Variable description table includes all key variables
  • Competitor analysis covers mainstream frameworks
  • Terms are used accurately (GM, UB, AI Core, etc.)

Prototype Design

  • Interface definition is complete
  • Parameter descriptions are detailed
  • Data type support is clear

Specification Constraints

  • Input/output constraints are complete
  • Hardware constraints are clear (UB size, L1 size, alignment requirements)
  • Distinguish the uses of Vector Core and Cube Core
  • Boundary conditions are clearly explained

Tiling Splitting

  • Inter-core splitting strategy has specific calculation methods
  • Intra-core loop strategy includes UB space calculation
  • Buffer allocation is detailed
  • Has specific examples

Kernel Implementation

  • Computational flow diagram is clear
  • Data type labels are complete
  • Different input types are explained separately
  • Hardware optimization points are clear

Reference Resources

For detailed design guidelines and examples, please refer to:
  • triton-operator-template.md - Complete document template
  • ascend-terminology.md - Ascend Terminology Glossary
  • tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: