ascendc-operator-precision-debug
AscendC Operator Precision Debugging

Locate root causes in five phases, from shallow to deep: first analyze the data distribution, then review error-prone code, isolate variables through experiments, and finally pinpoint the fault with instrumentation.
Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation → Phase 5: Fix Verification

Phase 1: Error Analysis

Principle: look at the data before the code. First establish "where it is wrong, by how much, and in what pattern."
Collect the shape, dtype, and MaxAbsErr/MeanAbsErr/CosineSim of the failing cases, then create `csrc/ops/<op_name>/test/debug_<op_name>_precision.py` from `scripts/debug_precision_template.py` (replace the placeholders and run) to analyze automatically:
  1. Error statistics: MaxAbsErr, MeanAbsErr, MaxRelErr
  2. First error element: multi-dimensional coordinates + linear index + NPU value vs. reference value
  3. Error distribution: count/proportion of erroneous elements; whether the gaps between errors are periodic
  4. Special values: whether the output is all zeros or contains NaN/Inf
  5. Automatic comparisons: fixed vs. random input; binary search over reduced shapes
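The statistics in steps 1–2 can be sketched in plain Python. This is an illustrative stand-in for what the debug script computes, not the template's actual API; the names `error_stats` and `unravel` are ours.

```python
import math

def error_stats(npu, ref, atol=1e-3):
    """Per-element error statistics over two flat float lists."""
    abs_err = [abs(a - b) for a, b in zip(npu, ref)]
    max_abs = max(abs_err)
    mean_abs = sum(abs_err) / len(abs_err)
    # Relative error, guarding against a zero reference value.
    max_rel = max(e / max(abs(b), 1e-12) for e, b in zip(abs_err, ref))
    # Cosine similarity of the two flattened tensors.
    dot = sum(a * b for a, b in zip(npu, ref))
    cos = dot / (math.sqrt(sum(a * a for a in npu)) *
                 math.sqrt(sum(b * b for b in ref)) + 1e-12)
    bad = [i for i, e in enumerate(abs_err) if e > atol]
    return {"MaxAbsErr": max_abs, "MeanAbsErr": mean_abs,
            "MaxRelErr": max_rel, "CosineSim": cos, "bad_indices": bad}

def unravel(idx, shape):
    """Linear index -> multi-dimensional coordinates (row-major)."""
    coords = []
    for dim in reversed(shape):
        coords.append(idx % dim)
        idx //= dim
    return tuple(reversed(coords))
```

For example, `unravel(5, (2, 4))` gives `(1, 1)`, which is how the script reports the first bad element's coordinates alongside its linear index.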

Error Characteristics → Preliminary Judgment

| Phenomenon | Most Likely Cause | Next Step |
| --- | --- | --- |
| FP16 fails, FP32 passes | No upcast to FP32 for computation | Phase 2: check Cast |
| All-zero output | CopyOut not executed / wrong GM offset | Phase 2: check CopyOut |
| Output contains NaN/Inf | Division by zero / log of a negative / overflow | Phase 2: check Compute |
| Everything deviates, CosineSim ≈ 1 | Systematic precision loss | Phase 2: check upcasting |
| Periodic/striped errors | Tile boundary / transfer offset | Phase 3 experiments |
| Only tail elements wrong | Tail-tile length / alignment | Phase 2: check tail tile |
| Results differ across runs | Insufficient synchronization | Phase 3, Experiment B |
| Small shapes pass, large shapes fail | Multi-core/tiling boundary | Phase 3, Experiment A |
| Fixed input passes, random input fails | Address/stride/offset error | Phase 3, Experiment C |

Phase 2: Code Review

MANDATORY: Read `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, and `design.md` (if it exists), then work through the following checklist from shallow to deep.

Layer 1: Basic Correctness (Most Frequent)

  • FP16/BF16 not upcast: in Compute, is half-precision data first `Cast` to FP32 for the computation and then `Cast` back? This is the most frequent precision bug.
  • Incorrect formula: compare the API call sequence against the design document/PyTorch step by step — operation order, scalar signs, missing steps.
  • GM offset unit confusion: `xGm[progress * tileLength]` is an element offset; do not multiply by `sizeof(T)` on top of it.
  • tileLength vs. curTileLength: use `tileLength` for offsets and `curTileLength` for computation/transfer (the tail tile may be smaller).
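The offset rule in the last two bullets can be modeled on the host side. A minimal sketch, with names of our own choosing (`tile_plan` is not a real kernel symbol): the GM offset always advances by the full `tileLength` in elements, while only the length used for compute/copy shrinks on the tail tile.

```python
def tile_plan(total_len, tile_length):
    """Yield (gm_offset_elems, cur_tile_length) per loop iteration."""
    plan = []
    progress = 0
    while progress * tile_length < total_len:
        offset = progress * tile_length             # elements, NOT bytes
        cur = min(tile_length, total_len - offset)  # tail tile may be shorter
        plan.append((offset, cur))
        progress += 1
    return plan
```

For instance, `tile_plan(100, 32)` yields `[(0, 32), (32, 32), (64, 32), (96, 4)]`: the last offset still advances by 32 while the tail length drops to 4. Mixing the two (e.g., advancing the offset by `curTileLength`) is exactly the bug the bullet warns about.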

Layer 2: Data Transfer and Alignment

  • DataCopyPad copyLen: `copyLen` in `DataCopyExtParams` is a byte count = `curTileLength * sizeof(T)`.
  • Tail-tile alignment: when the tail tile is not 32B-aligned, is `alignedTailLen` computed and used correctly?
  • Inconsistent offsets across inputs: when input tensors have different shapes (e.g., x vs. cos/sin in RoPE), is each one's offset computed correctly?

Layer 3: Tiling and Multi-core

  • Host/kernel tiling mismatch: does the same symbol (e.g., `tileLength`) mean the same thing on the host and in the kernel?
  • Inter-core boundary overlap/gap: does formerNum × formerLength + tailNum × tailLength cover all the data exactly?
  • Wrong bufferCoefficient: check against the UB allocation table in the design document; a wrong coefficient skews tileLength.

Layer 4: API Traps

  • ReduceSum/Max modifies the source: reductions may overwrite the source tensor; if it is reused later, back it up first with `Adds(backup, src, 0.0f, len)`.
  • AllocTensor/FreeTensor unpaired: they must be strictly paired with EnQue/DeQue, or buffers leak.
  • Vector length parameter: the length argument of AscendC vector APIs is a count of elements, not bytes.

Layer 5: Boundary Cases

  • Division by zero / domain violations: guard Div and Reciprocal against zero; Ln requires positive input; Sqrt requires non-negative input.
  • Tiling integer overflow: can any multiplication overflow int32? Prefer int64_t.
Checkpoint: produce a review report — a list of suspected issues ranked by likelihood. If the root cause is pinned down, jump to Phase 5; otherwise proceed to Phase 3.
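To make the int32 overflow hazard concrete: Python integers do not wrap, so the sketch below emulates 32-bit wrap-around to show what host tiling code would compute if it multiplied two large dimensions in int32. The shape is an arbitrary illustration.

```python
def as_int32(x):
    """Emulate C int32 wrap-around semantics for a Python int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

rows, cols = 65536, 40000          # a plausible large shape
true_total = rows * cols           # 2,621,440,000 > INT32_MAX (2,147,483,647)
wrapped = as_int32(true_total)     # what an int32 multiply would yield
```

The product wraps to a negative element count, which then silently corrupts every downstream tiling quantity — hence the recommendation to compute these products in int64_t.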

Phase 3: Experimental Isolation

When Phase 2 cannot pin down the root cause directly, narrow the scope with controlled experiments. Change only one variable at a time.

Experiment A: block_dim → 1 (Multi-core Isolation)

Temporarily hardcode `blockDim = 1` in op_host, then recompile and test. This can be combined with a reduced shape.

| Result | Conclusion |
| --- | --- |
| Single-core passes, multi-core fails | Inter-core issue: GM range overlap / tiling mapping / inter-core synchronization |
| Single-core also fails | Not a multi-core issue → Experiment B |

Experiment B: PipeBarrier<PIPE_ALL> (Synchronization Isolation)

Temporarily replace all synchronization in the kernel's Process with `AscendC::PipeBarrier<PIPE_ALL>()` (one between each of CopyIn / Compute / CopyOut).

| Result | Conclusion |
| --- | --- |
| Passes with full barriers | Insufficient intra-core synchronization → restore fine-grained sync step by step to localize |
| Still fails | Not a synchronization issue → Experiment C |

`PIPE_ALL` is for experimental isolation only; never ship it as the final fix.

Experiment C: Fixed/Regular Input (Address Isolation)

Test with all-ones, an arithmetic sequence (`torch.arange`), and random input in turn.

| Result | Conclusion |
| --- | --- |
| All-ones passes, arithmetic/random fails | Address/offset/stride error (constant input masks offset bugs) |
| All fail | Computation logic or global tiling error |
| All pass | A specific value range triggers the precision issue → check boundary/extreme values |

Experiment D: Shrink the Shape (Boundary Isolation)

shape=(32,) → (tileLength,) → (tileLength*2,) → original shape. Locate the exact point where failure begins and work backwards to the tile/core boundary.
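The sweep above can be scripted. In this sketch, `fails` stands in for "rebuild the test with this shape and check precision"; the mock below simply models an op that breaks as soon as a second tile is needed (assuming a hypothetical tileLength of 256):

```python
def first_failing_shape(fails, candidates):
    """Return the first shape in the sweep that fails, or None."""
    for shape in candidates:
        if fails(shape):
            return shape
    return None

TILE = 256
sweep = [32, TILE, TILE * 2, 10000]     # (32,) → (tileLength,) → ...
mock_fails = lambda n: n > TILE         # pretend: breaks past one full tile
```

Here the sweep pinpoints `TILE * 2` as the first failing shape, which points straight at the tile-boundary handling between the first and second tiles.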

Reverse-Engineering from First-Error Index + Tiling

First-error linear index → which tile → which core → that core's GM start offset → expected byte count of the transfer.
Period = tileLength → transfer/offset issue; period = vector width → compute-flow issue; aligned with a core boundary → multi-core/offset issue.
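The chain above can be walked mechanically for the simple case where each core owns a contiguous block of `per_core` elements split into tiles of `tile_length` (a simplification — real tilings may interleave; the function names are ours):

```python
def locate(first_err_idx, per_core, tile_length):
    """Linear index -> (core, that core's GM start offset, tile within core)."""
    core = first_err_idx // per_core
    core_start = core * per_core                    # GM offset in elements
    tile = (first_err_idx - core_start) // tile_length
    return core, core_start, tile

def error_period(bad_indices):
    """Common gap between consecutive error indices, or None if irregular."""
    gaps = {b - a for a, b in zip(bad_indices, bad_indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None
```

For example, with 4096 elements per core and tileLength 256, a first error at linear index 5000 lands on core 1 (GM start 4096), tile 3; and error indices spaced exactly 256 apart confirm a period equal to tileLength, i.e., a transfer/offset issue.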

Phase 4: Instrumentation

Once the problem has been narrowed to a specific stage/tile, use `AscendC::printf` and `AscendC::DumpTensor` to pinpoint it.

Core Rules

  1. Print on core 0 only: when every core runs the same logic, add `if (AscendC::GetBlockIdx() == 0)` to cut the output volume.
  2. Read only after synchronization: a `LocalTensor` may be read only after `DeQue`/`PipeBarrier`; otherwise you read dirty data from an unfinished transfer.
  3. Convert FP16 to float first: `AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx)));` — printing half directly produces garbage.
  4. Use desc to tag stages: the desc parameter of DumpTensor (0 = after CopyIn, 1 = mid-Compute, 2 = before CopyOut).
  5. Start small: begin with a small dumpSize for DumpTensor; too large a value fills or truncates the buffer.

printf vs. DumpTensor Selection

| Scenario | Tool |
| --- | --- |
| Scalars, branch decisions, single indices | `AscendC::printf` |
| Quick scan of a contiguous tensor segment | `AscendC::DumpTensor(tensor, desc, dumpSize)` |
| Full element-by-element comparison | Not inside the kernel — read GM on the host + a Python script |

Instrumentation Strategy

Instrument step by step inside the Compute function after DeQue, and compare each step against intermediate results hand-computed on the Python side with the same input. The first step that deviates is the root cause.

```cpp
// Example: core 0, tile 0
if (AscendC::GetBlockIdx() == 0 && progress == 0) {
    AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
}
```

Phase 5: Fix Verification

Common Fix Patterns

| Root Cause | Fix |
| --- | --- |
| FP16 not upcast | Add Cast(fp16→fp32) + compute + Cast(fp32→fp16) |
| GM offset error | Correct the offset formula (elements vs. bytes) |
| Wrong tail-tile length | Use curTileLength for compute/transfer, tileLength for offsets |
| Wrong tiling parameters | Fix the host-side tiling computation |
| Missing synchronization | Add the proper EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites source | Back up with Adds first, then ReduceSum |
| Wrong transfer length | Fix copyLen in DataCopyExtParams |

After the Fix

  1. Remove all debug instrumentation (printf/DumpTensor), or wrap it in `#ifdef DEBUG_PRECISION`
  2. Recompile and install
  3. Run the originally failing case plus the full precision test suite
  4. Still failing → return to Phase 1 (at most 3 rounds); after 3 rounds, report to the user

Output Requirements (MANDATORY)

After debugging, you MUST present in the conversation: issue summary, root-cause analysis, fix contents, verification results, and at least 2 key takeaways. NEVER reply with only "fixed."

Typical Cases (Load On Demand)

Once a suspected root cause is located, load the matching case to see the full troubleshooting walkthrough:

| Error Phenomenon | Case File | When to Load |
| --- | --- | --- |
| FP16 fails, FP32 passes, everything deviates | examples/fp16-no-upcast.md | Suspected missing upcast |
| First error at a tile boundary, period = tileLength | examples/gm-offset-error.md | Suspected GM offset error |
| Only a few tail elements wrong | examples/tail-tile-misalign.md | Suspected tail-tile handling |
| block_dim=1 passes, multi-core fails | examples/multicore-tiling-overlap.md | Suspected inter-core tiling |
| Results differ across runs | examples/async-sync-missing.md | Suspected missing synchronization |

Do not load every case at once. Load a case only when the error signature matches.

Anti-Patterns (NEVER)

  • NEVER change code before analyzing the error distribution
  • NEVER printf a full tensor in a loop inside the kernel — use DumpTensor or host-side comparison
  • NEVER print heavily from multiple cores at once — add `GetBlockIdx() == 0` to print from core 0 only
  • NEVER read a LocalTensor at an unsynchronized point — only after DeQue/PipeBarrier
  • NEVER ship `PIPE_ALL` as the final fix — it is for experimental isolation only
  • NEVER leave debug code in place after the fix
  • NEVER fix only the known failing case without running the full precision suite
  • NEVER keep trying after 3 failed rounds — report to the user