Loading...
Loading...
Compare original and translation side by side
| 阶段 | 必须加载 | 不要加载 |
|---|---|---|
| 理解需求文档 | 无 | 所有 references |
| 确认计算逻辑 | 无 | 所有 references |
| 设计 Tiling 策略 | | |
| 生成 Kernel 代码 | | |
| 生成测试代码 | 无 | 所有 references |
| Phase | Must Load | Do Not Load |
|---|---|---|
| Understand Requirement Documents | None | All references |
| Confirm Computing Logic | None | All references |
| Design Tiling Strategy | | |
| Generate Kernel Code | | |
| Generate Test Code | None | All references |
hardware-architecture.mdcore_num = get_npu_aicore_num() # 或 get_npu_vectorcore_num()
grid = (core_num,) # 原则1:grid必须等于物理核数
@triton.jit
def xxx_fwd(
......
M, N,
BLOCK_M: tl.constexpr,
BLOCK_N: tl.constexpr,
):
pid = tl.program_id(0)
num_core = tl.num_programs(0)
num_block_m = tl.cdiv(M, BLOCK_M)
num_block_n = tl.cdiv(N, BLOCK_N)
total_blocks = num_block_m * num_block_n
# 原则2:核内循环处理多任务,每个核自己计算要处理的数据
for block_idx in range(pid, total_blocks, num_core):
pid_m = block_idx // num_block_n
pid_n = block_idx % num_block_nUB 总大小: 192KB (A2/A3)
安全 BLOCK_SIZE = (196608 - 32) / (缓冲区数量 × 数据类型大小) × 0.8hardware-architecture.mdcore_num = get_npu_aicore_num() # or get_npu_vectorcore_num()
grid = (core_num,) # Principle 1: grid must equal the number of physical cores
@triton.jit
def xxx_fwd(
......
M, N,
BLOCK_M: tl.constexpr,
BLOCK_N: tl.constexpr,
):
pid = tl.program_id(0)
num_core = tl.num_programs(0)
num_block_m = tl.cdiv(M, BLOCK_M)
num_block_n = tl.cdiv(N, BLOCK_N)
total_blocks = num_block_m * num_block_n
# Principle 2: Intra-core loop handles multiple tasks, each core calculates the data it needs to process
for block_idx in range(pid, total_blocks, num_core):
pid_m = block_idx // num_block_n
pid_n = block_idx % num_block_nTotal UB size: 192KB (A2/A3)
Safe BLOCK_SIZE = (196608 - 32) / (number of buffers × data type size) × 0.8templates.md| 算子类型 | 特征 | 核心类型 | 模板 |
|---|---|---|---|
| 归约类 | sum/max/min 归约 | vector core | 模板 1 |
| GEMM | tl.dot() 矩阵乘法 | AI core | 模板 2 |
| 激活函数 | 逐元素计算 | vector core | 模板 3 |
| 损失函数 | softmax + reduction | vector core | 模板 4 |
| 索引变换 | 索引计算、条件分支 | vector core | 模板 5 |
| 注意力 | QK^T + SV 多阶段 | AI core | 模板 6 |
| MoE | 门控机制 | vector core | 模板 7 |
| 后处理 | 简单数据变换 | vector core | 模板 8 |
| 卷积 | 状态更新、滑动窗口 | AI core | 模板 9 |
templates.md| Operator Type | Features | Core Type | Template |
|---|---|---|---|
| Reduction | sum/max/min reduction | vector core | Template 1 |
| GEMM | tl.dot() matrix multiplication | AI core | Template 2 |
| Activation Function | Element-wise calculation | vector core | Template 3 |
| Loss Function | softmax + reduction | vector core | Template 4 |
| Index Transformation | Index calculation, conditional branching | vector core | Template 5 |
| Attention | QK^T + SV multi-stage | AI core | Template 6 |
| MoE | Gating mechanism | vector core | Template 7 |
| Post-processing | Simple data transformation | vector core | Template 8 |
| Convolution | State update, sliding window | AI core | Template 9 |
| 陷阱 | 症状 | 解决方案 |
|---|---|---|
| 计算逻辑错误 | 输出结果与预期不符 | 用伪代码描述计算过程,与用户确认 |
| UB 溢出 | 运行时报错 "ub overflow" | 计算缓冲区总大小,减小 BLOCK_SIZE |
| coreDim 超限 | 运行时报错 "coreDim can't be greater than UINT16_MAX" | 增大 BLOCK_SIZE 或设置 |
| 精度损失 | FP16 输入时结果不准确 | 归约操作前升精度到 FP32 |
| 索引长度不够 | D-cache报错 | 在超大shape下int32索引长度不足,需要换成int64 |
| Pitfall | Symptom | Solution |
|---|---|---|
| Incorrect computing logic | Output results do not match expectations | Describe the computing process with pseudocode and confirm with the user |
| UB overflow | Runtime error "ub overflow" | Calculate total buffer size and reduce BLOCK_SIZE |
| coreDim exceeded | Runtime error "coreDim can't be greater than UINT16_MAX" | Increase BLOCK_SIZE or set |
| Precision loss | Inaccurate results with FP16 input | Upgrade precision to FP32 before reduction operations |
| Insufficient index length | D-cache error | Replace int32 with int64 for index when dealing with super-large shapes |