Generate Triton operator requirement documents suitable for Ascend NPU. Used when users need to design new Triton operators, write operator requirement documents, or perform operator performance optimization design.
npx skill4agent add ascend/agent-skills triton-operator-design

| Variable | Type | Meaning | Constraints |
|---|---|---|---|
| [Variable Name] | [Type] | [Meaning] | [Constraints] |

| Competitor Name | Source Framework | Interface Definition | Implemented Function | Constraints |
|---|---|---|---|---|
| [Name] | [Framework] | [Interface] | [Function] | [Constraints] |

from typing import Optional

import torch


def triton_operator(
    input: torch.Tensor,
    param1: torch.Tensor,
    param2: Optional[torch.Tensor] = None,
    eps: float = 1e-6,
) -> torch.Tensor:
    """
    [Operator Function Description]

    Args:
        input: Input tensor with shape [..., D]
        param1: Parameter 1 with shape [D]
        param2: Parameter 2 (optional) with shape [D]
        eps: Small constant
    Returns:
        Output tensor with the same shape as input
    """
    pass

| Parameter Name | Type | Input/Output | Description | Constraints |
|---|---|---|---|---|
| [Parameter Name] | [Type] | [Input/Output] | [Description] | [Constraints] |

| Interface Type | Supported Data Types | Data Format |
|---|---|---|
| Triton | FLOAT16, BF16, FLOAT | ND |

| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Supported Types] | [Description] |
| Data Format | ND | Unified use of ND format |
| Memory Alignment | 16-byte or 32-byte | Hardware requirement |

| Constraint Item | Constraint Conditions | Description |
|---|---|---|
| Shape | [Specific Constraints] | [Description] |
| Data Type | [Specific Constraints] | [Description] |

Input: x[B, D]
// Step 1: Calculate the amount of data processed by each Core
data_per_core = ceil(total_size / num_cores)
// Step 2: Calculate the data range of the current Core
core_start = core_id * data_per_core
core_end = min((core_id + 1) * data_per_core, total_size)

Total UB size: 192KB
Data type size: FP16=2B, FP32=4B
Buffers required for a single loop:
- Input buffer: [Size] × [Type Size]
- Intermediate buffer: [Size] × [Type Size]
- Output buffer: [Size] × [Type Size]
Amount of data processed per loop = Total UB size / Total space per loop

 Input Tensor (GM)        Parameter Tensor (GM)
         │                          │
         ▼                          ▼
    [Load to UB]              [Load to UB]
         │                          │
         ▼                          ▼
[Calculation Step 1]         [Preprocessing]
         │                          │
         ▼                          │
[Calculation Step 2] ───────────────┘
         │
         ▼
[Final Calculation]
         │
         ▼
 Output Tensor (GM)

ceil(actual_size, 32)
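The per-core split, the UB budget, and the `ceil(actual_size, 32)` alignment step above can be sketched in plain Python as a quick way to sanity-check tiling numbers before writing the kernel. This is a minimal sketch, not the skill's own implementation. The helper names (`core_range`, `elements_per_loop`, `align_up`) are hypothetical. It assumes that "Total space per loop" means the bytes one element occupies across all live buffers, and that `ceil(actual_size, 32)` means rounding up to a multiple of 32. It also clamps `core_start` to `total_size`, which the pseudocode does not do, so that trailing cores get an empty range instead of an out-of-bounds one.

```python
from typing import List, Tuple

UB_BYTES = 192 * 1024                    # total UB size quoted above (192KB)
DTYPE_BYTES = {"fp16": 2, "fp32": 4}     # data type sizes quoted above


def core_range(core_id: int, num_cores: int, total_size: int) -> Tuple[int, int]:
    """Half-open [start, end) element range handled by one core."""
    data_per_core = (total_size + num_cores - 1) // num_cores  # integer ceil
    # Clamping the start is an addition to the pseudocode: it keeps
    # trailing cores in bounds when there are more cores than chunks.
    start = min(core_id * data_per_core, total_size)
    end = min((core_id + 1) * data_per_core, total_size)
    return start, end


def elements_per_loop(buffers: List[Tuple[int, str]], ub_bytes: int = UB_BYTES) -> int:
    """Elements that fit in UB per loop.

    `buffers` lists (copies_per_element, dtype) for every live buffer,
    e.g. input, intermediate, and output (assumed interpretation).
    """
    bytes_per_element = sum(n * DTYPE_BYTES[dtype] for n, dtype in buffers)
    return ub_bytes // bytes_per_element


def align_up(size: int, alignment: int = 32) -> int:
    """Round `size` up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment


if __name__ == "__main__":
    # FP16 input, FP32 intermediate, FP16 output: 2 + 4 + 2 bytes per element.
    buffers = [(1, "fp16"), (1, "fp32"), (1, "fp16")]
    print(core_range(3, 4, 10))
    print(elements_per_loop(buffers))
    print(align_up(100))
```

In practice the result of `elements_per_loop` would usually be rounded down to an aligned block size before use, since the figure here ignores any UB space reserved by the compiler or runtime.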