triton-operator-design

Generates Triton operator requirement documents for Ascend NPUs. Use it when designing a new Triton operator, writing an operator requirement document, or planning an operator performance optimization.

11 installs

Source: ascend/agent-skills

Added on: 2026-04-15

NPX Install

npx skill4agent add ascend/agent-skills triton-operator-design

Generation of Triton Operator Requirement Documents

Workflow Overview

Generating Triton operator requirement documents is divided into the following phases:

1. Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
2. Prototype Design → Deliverables: API Interface Definition
3. Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
4. Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme

Phase 1: Requirement Analysis

1.1 Functional Analysis

Must include:

Operator Function Description: Clearly describe the role and application scenarios of the operator
Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
Variable Description Table:

Variable	Type	Meaning	Constraints
[Variable Name]	[Type]	[Meaning]	[Constraints]

Key Terms (must be used accurately):

GM (Global Memory): Large-capacity memory residing in DDR
UB (Unified Buffer): High-speed on-chip buffer inside the Vector Core of an AI Core
L1 Buffer: Level-1 on-chip buffer inside the Cube Core of an AI Core
AI Core: Computing core of an Ascend processor; A2/A3 usually has 24 AI Cores, each containing 1 Cube computing core and 2 Vector computing cores
Tiling: Data splitting strategy that decomposes a large task into small chunks
Reduction Operation: An operation that collapses a dimension, such as sum, mean, or max
Precision Upcasting/Downcasting: Type conversion from FP16→FP32 (upcasting) or FP32→FP16 (downcasting)

1.2 Competitor Solution Analysis

Must include:

Competitor Operator List:

Competitor Name	Source Framework	Interface Definition	Implemented Function	Constraints
[Name]	[Framework]	[Interface]	[Function]	[Constraints]

Comparison Analysis:
- Function Comparison: Differences in functions supported by each framework
- Performance Comparison: Performance on different hardware platforms
- Design Reference: Excellent designs that can be referenced

Phase 2: Prototype Design

2.1 Interface Definition

Triton Interface Features:

Defined using Python functions
Supports automatic differentiation when wrapped in a torch.autograd.Function with an explicit backward implementation
Supports multiple data types

Interface Example:

```python
from typing import Optional

import torch


def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant

    Returns:
        Output tensor with the same shape as input
    """
    # Template placeholder: replace with the kernel launch for the real operator.
    raise NotImplementedError
```

2.2 Interface Description Table

Parameter Name	Type	Input/Output	Description	Constraints
[Parameter Name]	[Type]	[Input/Output]	[Description]	[Constraints]

2.3 Data Type Support

Interface Type	Supported Data Types	Data Format
Triton	FLOAT16, BF16, FLOAT	ND

Phase 3: Specification Constraints

3.1 Input Tensor Constraints

Constraint Item	Constraint Conditions	Description
Shape	[Specific Constraints]	[Description]
Data Type	[Supported Types]	[Description]
Data Format	ND	Unified use of ND format
Memory Alignment	16-byte or 32-byte	Hardware requirement

3.2 Output Tensor Constraints

Constraint Item	Constraint Conditions	Description
Shape	[Specific Constraints]	[Description]
Data Type	[Specific Constraints]	[Description]

3.3 Hardware Constraints

Hardware Limitations that must be considered:

AI Core Architecture:
- A2/A3 usually has 24 AI Cores
- Each AI Core contains 1 Cube computing core and 2 Vector computing cores
- The Cube Core is dedicated to matrix computation; the Vector Core is dedicated to vector computation

UB Size: 192KB (A2/A3), dedicated to Vector Core

L1 Buffer Size: Usually 1MB (A2/A3), dedicated to Cube Core

Memory Alignment Requirements:
- UB buffers must be 32-byte aligned
- Single-value buffers (such as mean) occupy 32B of space (even if only 4B is needed logically)

Data Type Size: FP16=2B, BF16=2B, FP32=4B

Phase 4: Feature Implementation Scheme

4.1 Tiling Splitting

This is the most critical part and must be explained in detail.

4.1.1 Inter-Core Splitting Strategy

Must include:

Splitting Principles:
- How to divide tasks into multiple AI Cores
- Why this splitting method is chosen
- How to ensure load balancing

Calculation Method:

Input: x[B, D]

// Step 1: Calculate the amount of data processed by each Core
data_per_core = ceil(total_size / num_cores)

// Step 2: Calculate the data range of the current Core
core_start = core_id * data_per_core
core_end = min((core_id + 1) * data_per_core, total_size)

Example:
- Provide specific input shapes
- Show splitting results
- Explain the data range processed by each Core
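
The calculation method above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the input `x[B, D]` is split along `B` (whole rows per core), the core count of 24 is the typical A2/A3 figure quoted in this document, and `split_rows` is a hypothetical helper name. Unlike the pseudocode, the start index is also clamped, so cores beyond the end of the data get an empty range instead of a start that exceeds the end.

```python
import math
from typing import List, Tuple

NUM_CORES = 24  # typical AI Core count on A2/A3, as stated in this document


def split_rows(total_rows: int, num_cores: int = NUM_CORES) -> List[Tuple[int, int]]:
    """Return the half-open row range [start, end) assigned to each core."""
    rows_per_core = math.ceil(total_rows / num_cores)
    ranges = []
    for core_id in range(num_cores):
        # Clamp both ends so trailing cores receive an empty range.
        start = min(core_id * rows_per_core, total_rows)
        end = min((core_id + 1) * rows_per_core, total_rows)
        ranges.append((start, end))
    return ranges


# Hypothetical input x[B, D] with B = 1000 rows.
ranges = split_rows(1000)
print(ranges[0], ranges[23])  # (0, 42) (966, 1000)
```

With 1000 rows, cores 0 to 22 each take 42 rows and core 23 takes the remaining 34, which is the kind of concrete per-core range the example in a requirement document should state.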

4.1.2 Intra-Core Loop Strategy

Must include:

UB Space Calculation:

Total UB size: 192KB
Data type size: FP16=2B, FP32=4B

Buffers required for a single loop:
- Input buffer: [Size] × [Type Size]
- Intermediate buffer: [Size] × [Type Size]
- Output buffer: [Size] × [Type Size]

Number of elements processed per loop = Total UB size / Total buffer bytes required per element (rounded down to satisfy alignment)

Buffer Allocation Strategy:
- List all required buffers
- Explain the size and purpose of each buffer
- Consider alignment requirements

Precision Processing Strategy:
- Whether precision upcasting (FP16→FP32) is required
- At which stage to upcast precision
- At which stage to downcast precision
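
As a worked instance of the UB space calculation, the sketch below assumes a hypothetical FP16 operator that keeps three buffers per element (an FP16 input buffer, an FP32 working buffer after upcasting, and an FP16 output buffer) plus one 32B single-value buffer. The buffer plan and the helper name `elements_per_loop` are illustrative assumptions; the 192KB UB size, the 32-byte alignment, and the data type sizes come from this document.

```python
UB_BYTES = 192 * 1024  # UB size on A2/A3, from this document
ALIGN_BYTES = 32       # UB alignment requirement, from this document
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4}


def elements_per_loop(buffer_dtypes, reserved_bytes=0):
    """Largest element count per loop that fits every buffer in UB."""
    bytes_per_element = sum(DTYPE_BYTES[d] for d in buffer_dtypes)
    raw = (UB_BYTES - reserved_bytes) // bytes_per_element
    # Round down so that even the narrowest buffer is a multiple of 32 bytes.
    granule = ALIGN_BYTES // min(DTYPE_BYTES[d] for d in buffer_dtypes)
    return (raw // granule) * granule


# Assumed plan: FP16 input + FP32 working buffer + FP16 output, 32B reserved.
n = elements_per_loop(["fp16", "fp32", "fp16"], reserved_bytes=32)
print(n)  # 24560
```

Here each element costs 2 + 4 + 2 = 8 bytes, so (196608 - 32) / 8 gives 24572 elements, which rounds down to 24560 to keep the FP16 buffers on a 32-byte boundary.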

4.2 Kernel Implementation

4.2.1 Computational Flow Diagram

Must draw a data flow diagram:

Input Tensor (GM)        Parameter Tensor (GM)
    │                        │
    ▼                        ▼
[Load to UB]             [Load to UB]
    │                        │
    ▼                        ▼
[Calculation Step 1]     [Preprocessing]
    │                        │
    ▼                        │
[Calculation Step 2] ◄───────┘
    │
    ▼
[Final Calculation]
    │
    ▼
Output Tensor (GM)

Key Points:

Label the data type of each step
Label data transmission between GM↔UB
Label the location of precision conversion

4.2.2 Core Implementation Logic

Explain separately according to input data types:

FP32 Input Type:

Inter-core task allocation
UB buffer management (list all buffers)
Calculation process (explain in detail step by step)

FP16/BF16 Input Type:

Inter-core task allocation
UB buffer management (including upcasting/downcasting buffers)
Calculation process (including precision conversion steps)

Hardware Optimization Points:

Vectorized computation
Data reuse
Memory access optimization
Alignment processing

Things Absolutely Not to Do

❌ Use vague terms (such as "appropriate splitting", "reasonable allocation")
❌ Ignore hardware constraints (UB size, alignment requirements)
❌ Omit the specific calculation method of Tiling
❌ Fail to distinguish processing strategies for different data types
❌ Leave data types unlabeled in the data flow diagram
❌ Ignore alignment requirements for reduction operations (buffers must be 32B-aligned)
❌ Confuse the uses of Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
❌ Ignore the difference between UB and L1 (UB is dedicated to Vector Core, L1 is dedicated to Cube Core)

Common Pitfalls

Pitfall 1: Ignoring UB Size Limitations

Symptom: The designed scheme exceeds UB capacity

Solution:

Calculate the total size of all buffers
Ensure the total size is less than the total UB size (192KB on A2/A3)
If exceeded, adjust the amount of data processed per loop
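
One possible way to apply these three steps is to halve a candidate block size until the buffer total fits. The sketch below is an assumption-laden illustration, not a prescribed method: `fit_block` and `ub_usage` are hypothetical helpers, and halving is chosen because upstream Triton expects power-of-two sizes for `tl.arange`, so a power-of-two starting block stays a power of two.

```python
UB_BYTES = 192 * 1024  # UB size on A2/A3, from this document


def ub_usage(block, bytes_per_element, single_value_buffers=0):
    """Total UB bytes: per-element buffers plus 32B per single-value buffer."""
    return block * bytes_per_element + single_value_buffers * 32


def fit_block(block, bytes_per_element, single_value_buffers=0):
    """Halve the block size until the total is strictly below the UB size."""
    while block > 1 and ub_usage(block, bytes_per_element, single_value_buffers) >= UB_BYTES:
        block //= 2
    return block


# Assumed plan: 8 bytes per element, two single-value buffers (mean, variance).
block = fit_block(65536, bytes_per_element=8, single_value_buffers=2)
print(block, ub_usage(block, 8, 2))  # 16384 131136
```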

Pitfall 2: Ignoring Memory Alignment

Symptom: Hardware errors or performance degradation

Solution:

UB buffers are aligned to 32 bytes
Allocate 32B space for single-value buffers (mean, variance, etc.)
Use `ceil(actual_size / 32) * 32` to calculate the allocated space
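
The round-up-to-32-bytes rule can be written with integer arithmetic alone. A minimal sketch, where `align_up` is a hypothetical helper name:

```python
def align_up(nbytes: int, alignment: int = 32) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


print(align_up(4))        # 32: a single FP32 mean still occupies 32B
print(align_up(100 * 2))  # 224: a row of 100 FP16 values (200B)
```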

Pitfall 3: Precision Loss

Symptom: Inaccurate calculation results when input is FP16

Solution:

Upcast precision to FP32 before reduction operations
Complete all calculations in FP32 precision
Finally downcast precision to the output type
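
The effect of accumulating in FP16 can be reproduced on a host machine with only the Python standard library, which rounds to IEEE half precision through the `struct` format code `"e"`. The sketch below assumes a simple sequential sum; a real reduction on hardware may use a different summation order, so the exact error will differ, but the loss of precision is the same in kind.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, round_fn):
    """Sequential sum where the accumulator is rounded after every add."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


values = [1.0] * 4096
fp16_sum = accumulate(values, to_fp16)
fp32_sum = accumulate(values, to_fp32)
print(fp16_sum, fp32_sum)  # 2048.0 4096.0
```

The FP16 accumulator stalls at 2048 because the spacing between adjacent FP16 values there is 2, so adding 1.0 rounds back to 2048. The FP32 accumulator reaches the exact result, which is why the upcast should happen before the reduction.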

Pitfall 4: Poorly Chosen Tiling Strategy

Symptom: Poor performance or inability to handle large shapes

Solution:

Choose splitting dimensions based on operator characteristics
Ensure each Core completes calculations independently
Avoid cross-Core data dependencies

Quality Checklist

After completing the document, check the following items:

Requirement Analysis

Operator function description is clear
Mathematical formulas are correct and complete
Variable description table includes all key variables
Competitor analysis covers mainstream frameworks
Terms are used accurately (GM, UB, AI Core, etc.)

Prototype Design

Interface definition is complete
Parameter descriptions are detailed
Data type support is clear

Specification Constraints

Input/output constraints are complete
Hardware constraints are clear (UB size, L1 size, alignment requirements)
The uses of Vector Core and Cube Core are distinguished
Boundary conditions are clearly explained

Tiling Splitting

Inter-core splitting strategy has specific calculation methods
Intra-core loop strategy includes UB space calculation
Buffer allocation is detailed
Specific examples are provided

Kernel Implementation

Computational flow diagram is clear
Data type labels are complete
Different input types are explained separately
Hardware optimization points are clear

Reference Resources

For detailed design guidelines and examples, please refer to:

triton-operator-template.md - Complete document template
ascend-terminology.md - Ascend Terminology Glossary
tiling-strategies.md - Detailed Tiling Strategies

Official Documentation:
