triton-operator-design
Generates Triton operator requirement documents for the Ascend NPU. Use when users need to design new Triton operators, write operator requirement documents, or design operator performance optimizations.
Source: ascend/agent-skills
Generation of Triton Operator Requirement Documents
Workflow Overview
Generating Triton operator requirement documents is divided into the following phases:
- Requirement Analysis → Deliverables: Function Definition, Competitor Comparison
- Prototype Design → Deliverables: API Interface Definition
- Specification Constraints → Deliverables: Input/Output Constraints, Hardware Limitations
- Feature Implementation → Deliverables: Tiling Strategy, Kernel Implementation Scheme
Phase 1: Requirement Analysis
1.1 Functional Analysis
Must include:
- Operator Function Description: Clearly describe the role and application scenarios of the operator
- Mathematical Formula: Provide core calculation formulas using standard mathematical symbols
- Variable Description Table:
| Variable | Type | Meaning | Constraints |
|---|---|---|---|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |
Key Terms (must be used accurately):
- GM (Global Memory): Global memory, large-capacity storage on DDR
- UB (Unified Buffer): Unified buffer, high-speed cache inside the Vector Core of AI Core
- L1 Buffer: Level 1 buffer, cache inside the Cube Core of AI Core
- AI Core: Computing core of Ascend processor, A2/A3 usually has 24 units, each containing 1 Cube computing core and 2 Vector computing cores
- Tiling: Data splitting strategy, decomposing large tasks into small chunks
- Reduction Operation: Dimensionality reduction calculations such as sum, mean, max, etc.
- Precision Upcasting/Downcasting: Type conversion from FP16→FP32 or FP32→FP16
1.2 Competitor Solution Analysis
Must include:
- Competitor Operator List:
| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|---|---|---|---|---|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |
- Comparison Analysis:
- Function Comparison: Differences in functions supported by each framework
- Performance Comparison: Performance on different hardware platforms
- Design Reference: Excellent designs that can be referenced
Phase 2: Prototype Design
2.1 Interface Definition
Triton Interface Features:
- Defined using Python functions
- Supports automatic differentiation
- Supports multiple data types
Interface Example:
```python
import torch
from typing import Optional

def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant for numerical stability
    Returns:
        Output tensor with the same shape as input
    """
    pass
```
2.2 Interface Description Table
| Parameter Name | Type | Input/Output | Description | Constraints |
|---|---|---|---|---|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |
2.3 Data Type Support
| Interface Type | Supported Data Types | Data Format |
|---|---|---|
| Triton | FLOAT16, BF16, FLOAT | ND |
Phase 3: Specification Constraints
3.1 Input Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | Unified use of ND format |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |
3.2 Output Tensor Constraints
| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |
3.3 Hardware Constraints
Hardware Limitations that must be considered:
- AI Core Architecture:
- A2/A3 usually has 24 AI Cores
- Each AI Core contains 1 Cube computing core and 2 Vector computing cores
- Cube Core is dedicated to matrix computation, Vector Core is dedicated to vector computation
- UB Buffer Size: 192KB (A2/A3), dedicated to Vector Core
- L1 Buffer Size: Usually 1MB (A2/A3), dedicated to Cube Core
- Memory Alignment Requirements:
- UB buffer must be 32-byte aligned
- Single-value buffers (such as mean) require 32B space (even if only 4B is needed logically)
- Data Type Size: FP16=2B, BF16=2B, FP32=4B
Phase 4: Feature Implementation Scheme
4.1 Tiling Splitting
This is the most critical part and must be explained in detail.
4.1.1 Inter-Core Splitting Strategy
Must include:
- Splitting Principles:
  - How to divide tasks across multiple AI Cores
  - Why this splitting method is chosen
  - How to ensure load balancing
- Calculation Method:
  Input: x[B, D]
  // Step 1: Calculate the amount of data processed by each Core
  data_per_core = ceil(total_size / num_cores)
  // Step 2: Calculate the data range of the current Core
  core_start = core_id * data_per_core
  core_end = min((core_id + 1) * data_per_core, total_size)
- Example:
  - Provide specific input shapes
  - Show the splitting results
  - Explain the data range processed by each Core
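The ceil-based inter-core split can be sketched in host-side Python. The helper name `split_across_cores` and the surrounding framing are illustrative assumptions; only the two formulas come from the calculation method above.

```python
import math

def split_across_cores(total_size: int, num_cores: int):
    """Split `total_size` rows across `num_cores` AI Cores.

    Returns one half-open (core_start, core_end) range per core,
    computed with the ceil-based formulas described above.
    """
    data_per_core = math.ceil(total_size / num_cores)
    ranges = []
    for core_id in range(num_cores):
        core_start = min(core_id * data_per_core, total_size)
        core_end = min(core_start + data_per_core, total_size)
        ranges.append((core_start, core_end))
    return ranges

# Example: 100 rows across the 24 AI Cores of an A2/A3
ranges = split_across_cores(100, 24)
print(ranges[0])   # (0, 5)
print(ranges[19])  # (95, 100) — the last busy core
print(ranges[23])  # (100, 100) — trailing cores receive no work
```

Note the load-balancing consequence visible in the example: with ceil-based splitting, some trailing cores may receive less work, or none at all, which the requirement document should call out explicitly.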
4.1.2 Intra-Core Loop Strategy
Must include:
- UB Space Calculation:
  Total UB size: 192KB
  Data type sizes: FP16=2B, FP32=4B
  Buffers required per loop iteration:
  - Input buffer: [Size] × [Type Size]
  - Intermediate buffer: [Size] × [Type Size]
  - Output buffer: [Size] × [Type Size]
  Data processed per loop = Total UB size / Total buffer space per loop
- Buffer Allocation Strategy:
  - List all required buffers
  - Explain the size and purpose of each buffer
  - Consider alignment requirements
- Precision Processing Strategy:
  - Whether precision upcasting (FP16→FP32) is required
  - At which stage to upcast precision
  - At which stage to downcast precision
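The UB space calculation above can be sketched in Python. The specific buffer set (FP16 input, FP32 intermediate after upcasting, FP16 output) and the helper names are illustrative assumptions; the 192KB capacity and 32B alignment come from the hardware constraints in Phase 3.

```python
UB_SIZE = 192 * 1024  # UB capacity per Vector Core on A2/A3, in bytes

def align_up(size: int, alignment: int = 32) -> int:
    # Round a buffer size up to the next multiple of the alignment
    return (size + alignment - 1) // alignment * alignment

def rows_per_loop(D: int, in_bytes: int = 2, compute_bytes: int = 4,
                  out_bytes: int = 2) -> int:
    """How many rows of length D fit in UB per loop iteration.

    Assumes three 32B-aligned buffers per row: FP16 input,
    FP32 intermediate (after precision upcasting), FP16 output.
    """
    per_row = (align_up(D * in_bytes)
               + align_up(D * compute_bytes)
               + align_up(D * out_bytes))
    return UB_SIZE // per_row

print(rows_per_loop(1024))  # 24 rows of D=1024 fit per loop iteration
```

A requirement document would list each buffer explicitly and then divide the UB budget by the aligned per-row total, exactly as this sketch does.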
4.2 Kernel Implementation
4.2.1 Computational Flow Diagram
Must draw a data flow diagram:
Input Tensor (GM)      Parameter Tensor (GM)
        │                      │
        ▼                      ▼
   [Load to UB]           [Load to UB]
        │                      │
        ▼                      ▼
[Calculation Step 1]    [Preprocessing]
        │                      │
        ▼                      │
[Calculation Step 2] ──────────┘
        │
        ▼
 [Final Calculation]
        │
        ▼
 Output Tensor (GM)
Key Points:
- Label the data type of each step
- Label data transmission between GM↔UB
- Label the location of precision conversion
4.2.2 Core Implementation Logic
Explain separately according to input data types:
FP32 Input Type:
- Inter-core task allocation
- UB buffer management (list all buffers)
- Calculation process (explain in detail step by step)
FP16/BF16 Input Type:
- Inter-core task allocation
- UB buffer management (including upcasting/downcasting buffers)
- Calculation process (including precision conversion steps)
Hardware Optimization Points:
- Vectorized computation
- Data reuse
- Memory access optimization
- Alignment processing
Things Absolutely Not to Do
- ❌ Using vague terms (such as "appropriate splitting", "reasonable allocation")
- ❌ Ignoring hardware constraints (UB size, alignment requirements)
- ❌ Omitting the concrete calculation method for Tiling
- ❌ Failing to distinguish processing strategies for different data types
- ❌ Leaving data types unlabeled in the data flow diagram
- ❌ Ignoring alignment requirements for reduction operations (must be 32B)
- ❌ Confusing the roles of the Vector Core and Cube Core (Vector Core for vector computation, Cube Core for matrix computation)
- ❌ Ignoring the difference between UB and L1 (UB serves the Vector Core, L1 serves the Cube Core)
Common Pitfalls
Pitfall 1: Ignoring UB Size Limitations
Symptom: The designed scheme exceeds UB capacity
Solution:
- Calculate the total size of all buffers
- Ensure the total size < Total UB size
- If exceeded, adjust the amount of data processed per loop
Pitfall 2: Ignoring Memory Alignment
Symptom: Hardware errors or performance degradation
Solution:
- Align all UB buffers to 32 bytes
- Allocate 32B of space for single-value buffers (mean, variance, etc.)
- Compute each allocation by rounding the actual size up to a multiple of 32: allocated_size = ceil(actual_size / 32) × 32
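The round-up rule can be written as a one-line helper (the name `align_up` is an illustrative assumption):

```python
def align_up(size: int, alignment: int = 32) -> int:
    """Round `size` up to the next multiple of `alignment` bytes."""
    return (size + alignment - 1) // alignment * alignment

print(align_up(4))     # 32 — a single FP32 value still occupies 32B in UB
print(align_up(33))    # 64
print(align_up(2048))  # 2048 — already aligned sizes are unchanged
```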
Pitfall 3: Precision Loss
Symptom: Inaccurate calculation results when input is FP16
Solution:
- Upcast precision to FP32 before reduction operations
- Complete all calculations in FP32 precision
- Finally downcast precision to the output type
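The pitfall can be demonstrated with a small host-side NumPy simulation (not an NPU kernel): once a naive FP16 partial sum grows large enough, the FP16 rounding step exceeds the addend and the sum stops growing, while accumulating in FP32 and downcasting only at the end stays close to the true value.

```python
import numpy as np

# 10,000 values of ~0.1 stored in FP16; the true sum is ~999.76
x = np.full(10_000, 0.1, dtype=np.float16)

# Naive FP16 accumulation: every partial sum is rounded back to FP16,
# so the sum stalls once its rounding step exceeds the addend
acc16 = np.float16(0.0)
for v in x:
    acc16 = np.float16(acc16 + v)

# Upcast to FP32 before the reduction, downcast only at the end
acc32 = np.float32(0.0)
for v in x:
    acc32 += np.float32(v)
result = np.float16(acc32)

print(float(acc16))   # stalls far below the true sum
print(float(result))  # close to the true sum of ~999.76
```

The same reasoning is why the strategy above requires upcasting before reduction operations and downcasting only when writing the output.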
Pitfall 4: Unreasonable Tiling Strategy
Symptom: Poor performance or inability to handle large shapes
Solution:
- Choose splitting dimensions based on operator characteristics
- Ensure each Core completes calculations independently
- Avoid cross-Core data dependencies
Quality Check Checklist
After completing the document, check the following items:
Requirement Analysis
- Operator function description is clear
- Mathematical formulas are correct and complete
- Variable description table includes all key variables
- Competitor analysis covers mainstream frameworks
- Terms are used accurately (GM, UB, AI Core, etc.)
Prototype Design
- Interface definition is complete
- Parameter descriptions are detailed
- Data type support is clear
Specification Constraints
- Input/output constraints are complete
- Hardware constraints are clear (UB size, L1 size, alignment requirements)
- Distinguish the uses of Vector Core and Cube Core
- Boundary conditions are clearly explained
Tiling Splitting
- Inter-core splitting strategy has specific calculation methods
- Intra-core loop strategy includes UB space calculation
- Buffer allocation is detailed
- Has specific examples
Kernel Implementation
- Computational flow diagram is clear
- Data type labels are complete
- Different input types are explained separately
- Hardware optimization points are clear
Reference Resources
For detailed design guidelines and examples, please refer to:
- triton-operator-template.md - Complete document template
- ascend-terminology.md - Ascend Terminology Glossary
- tiling-strategies.md - Detailed Tiling Strategies
Official Documentation: