ascendc-operator-precision-debug
AscendC Operator Precision Debugging

Locate root causes in five phases, from shallow to deep: first analyze the data distribution, then review error-prone code, isolate variables through experiments, and finally pinpoint the fault with instrumentation.
Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation → Phase 5: Fix Verification

Phase 1: Error Analysis

Principle: look at the data before the code. First establish "where it is wrong, by how much, and in what pattern."
Collect the shape, dtype, and MaxAbsErr/MeanAbsErr/CosineSim of the failing cases, then create `csrc/ops/<op_name>/test/debug_<op_name>_precision.py` from `scripts/debug_precision_template.py` (replace the placeholders and run) to analyze automatically:
  1. Error statistics: MaxAbsErr, MeanAbsErr, MaxRelErr
  2. First error element: multi-dimensional coordinates + linear index + NPU value vs. reference value
  3. Error distribution: count/proportion of erroneous elements; whether the gaps between errors are periodic
  4. Special values: whether the output is all zeros or contains NaN/Inf
  5. Automatic comparisons: fixed vs. random input; binary search over reduced shapes
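The statistics in steps 1–2 can be sketched in plain Python. This is an illustrative stand-in for what the debug script computes, not the template's actual API; the names `error_stats` and `unravel` are ours.

```python
import math

def error_stats(npu, ref, atol=1e-3):
    """Per-element error statistics over two flat float lists."""
    abs_err = [abs(a - b) for a, b in zip(npu, ref)]
    max_abs = max(abs_err)
    mean_abs = sum(abs_err) / len(abs_err)
    # Relative error, guarding against a zero reference value.
    max_rel = max(e / max(abs(b), 1e-12) for e, b in zip(abs_err, ref))
    # Cosine similarity of the two flattened tensors.
    dot = sum(a * b for a, b in zip(npu, ref))
    cos = dot / (math.sqrt(sum(a * a for a in npu)) *
                 math.sqrt(sum(b * b for b in ref)) + 1e-12)
    bad = [i for i, e in enumerate(abs_err) if e > atol]
    return {"MaxAbsErr": max_abs, "MeanAbsErr": mean_abs,
            "MaxRelErr": max_rel, "CosineSim": cos, "bad_indices": bad}

def unravel(idx, shape):
    """Linear index -> multi-dimensional coordinates (row-major)."""
    coords = []
    for dim in reversed(shape):
        coords.append(idx % dim)
        idx //= dim
    return tuple(reversed(coords))
```

For example, `unravel(5, (2, 4))` gives `(1, 1)`, which is how the script reports the first bad element's coordinates alongside its linear index.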

Error Characteristics → Preliminary Judgment

| Phenomenon | Most Likely Cause | Next Step |
| --- | --- | --- |
| FP16 fails, FP32 passes | No upcast to FP32 for computation | Phase 2: check Cast |
| All-zero output | CopyOut not executed / wrong GM offset | Phase 2: check CopyOut |
| Output contains NaN/Inf | Division by zero / log of a negative / overflow | Phase 2: check Compute |
| Everything deviates, CosineSim ≈ 1 | Systematic precision loss | Phase 2: check upcasting |
| Periodic/striped errors | Tile boundary / transfer offset | Phase 3 experiments |
| Only tail elements wrong | Tail-tile length / alignment | Phase 2: check tail tile |
| Results differ across runs | Insufficient synchronization | Phase 3, Experiment B |
| Small shapes pass, large shapes fail | Multi-core/tiling boundary | Phase 3, Experiment A |
| Fixed input passes, random input fails | Address/stride/offset error | Phase 3, Experiment C |

Phase 2: Code Review

MANDATORY: Read `op_host/<op_name>.cpp`, `op_kernel/<op_name>.cpp`, and `design.md` (if it exists), then work through the following checklist from shallow to deep.

Layer 1: Basic Correctness (Most Frequent)

  • FP16/BF16 not upcast: in Compute, is half-precision data first `Cast` to FP32 for the computation and then `Cast` back? This is the most frequent precision bug.
  • Incorrect formula: compare the API call sequence against the design document/PyTorch step by step — operation order, scalar signs, missing steps.
  • GM offset unit confusion: `xGm[progress * tileLength]` is an element offset; do not multiply by `sizeof(T)` on top of it.
  • tileLength vs. curTileLength: use `tileLength` for offsets and `curTileLength` for computation/transfer (the tail tile may be smaller).
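The offset rule in the last two bullets can be modeled on the host side. A minimal sketch, with names of our own choosing (`tile_plan` is not a real kernel symbol): the GM offset always advances by the full `tileLength` in elements, while only the length used for compute/copy shrinks on the tail tile.

```python
def tile_plan(total_len, tile_length):
    """Yield (gm_offset_elems, cur_tile_length) per loop iteration."""
    plan = []
    progress = 0
    while progress * tile_length < total_len:
        offset = progress * tile_length             # elements, NOT bytes
        cur = min(tile_length, total_len - offset)  # tail tile may be shorter
        plan.append((offset, cur))
        progress += 1
    return plan
```

For instance, `tile_plan(100, 32)` yields `[(0, 32), (32, 32), (64, 32), (96, 4)]`: the last offset still advances by 32 while the tail length drops to 4. Mixing the two (e.g., advancing the offset by `curTileLength`) is exactly the bug the bullet warns about.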

Layer 2: Data Transfer and Alignment

  • DataCopyPad copyLen: `copyLen` in `DataCopyExtParams` is a byte count = `curTileLength * sizeof(T)`.
  • Tail-tile alignment: when the tail tile is not 32B-aligned, is `alignedTailLen` computed and used correctly?
  • Inconsistent offsets across inputs: when input tensors have different shapes (e.g., x vs. cos/sin in RoPE), is each one's offset computed correctly?

Layer 3: Tiling and Multi-core

  • Host/kernel tiling mismatch: does the same symbol (e.g., `tileLength`) mean the same thing on the host and in the kernel?
  • Inter-core boundary overlap/gap: does formerNum × formerLength + tailNum × tailLength cover all the data exactly?
  • Wrong bufferCoefficient: check against the UB allocation table in the design document; a wrong coefficient skews tileLength.

Layer 4: API Traps

  • ReduceSum/Max modifies the source: reductions may overwrite the source tensor; if it is reused later, back it up first with `Adds(backup, src, 0.0f, len)`.
  • AllocTensor/FreeTensor unpaired: they must be strictly paired with EnQue/DeQue, or buffers leak.
  • Vector length parameter: the length argument of AscendC vector APIs is a count of elements, not bytes.

Layer 5: Boundary Cases

  • Division by zero / domain violations: guard Div and Reciprocal against zero; Ln requires positive input; Sqrt requires non-negative input.
  • Tiling integer overflow: can any multiplication overflow int32? Prefer int64_t.
Checkpoint: produce a review report — a list of suspected issues ranked by likelihood. If the root cause is pinned down, jump to Phase 5; otherwise proceed to Phase 3.
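To make the int32 overflow hazard concrete: Python integers do not wrap, so the sketch below emulates 32-bit wrap-around to show what host tiling code would compute if it multiplied two large dimensions in int32. The shape is an arbitrary illustration.

```python
def as_int32(x):
    """Emulate C int32 wrap-around semantics for a Python int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

rows, cols = 65536, 40000          # a plausible large shape
true_total = rows * cols           # 2,621,440,000 > INT32_MAX (2,147,483,647)
wrapped = as_int32(true_total)     # what an int32 multiply would yield
```

The product wraps to a negative element count, which then silently corrupts every downstream tiling quantity — hence the recommendation to compute these products in int64_t.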

Phase 3: Experimental Isolation

When Phase 2 cannot pin down the root cause directly, narrow the scope with controlled experiments. Change only one variable at a time.

Experiment A: block_dim → 1 (Multi-core Isolation)

Temporarily hardcode `blockDim = 1` in op_host, then recompile and test. This can be combined with a reduced shape.

| Result | Conclusion |
| --- | --- |
| Single-core passes, multi-core fails | Inter-core issue: GM range overlap / tiling mapping / inter-core synchronization |
| Single-core also fails | Not a multi-core issue → Experiment B |

Experiment B: PipeBarrier<PIPE_ALL> (Synchronization Isolation)

Temporarily replace all synchronization in the kernel's Process with `AscendC::PipeBarrier<PIPE_ALL>()` (one between each of CopyIn / Compute / CopyOut).

| Result | Conclusion |
| --- | --- |
| Passes with full barriers | Insufficient intra-core synchronization → restore fine-grained sync step by step to localize |
| Still fails | Not a synchronization issue → Experiment C |

`PIPE_ALL` is for experimental isolation only; never ship it as the final fix.

Experiment C: Fixed/Regular Input (Address Isolation)

Test with all-ones, an arithmetic sequence (`torch.arange`), and random input in turn.

| Result | Conclusion |
| --- | --- |
| All-ones passes, arithmetic/random fails | Address/offset/stride error (constant input masks offset bugs) |
| All fail | Computation logic or global tiling error |
| All pass | A specific value range triggers the precision issue → check boundary/extreme values |

Experiment D: Shrink the Shape (Boundary Isolation)

shape=(32,) → (tileLength,) → (tileLength*2,) → original shape. Locate the exact point where failure begins and work backwards to the tile/core boundary.
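The sweep above can be scripted. In this sketch, `fails` stands in for "rebuild the test with this shape and check precision"; the mock below simply models an op that breaks as soon as a second tile is needed (assuming a hypothetical tileLength of 256):

```python
def first_failing_shape(fails, candidates):
    """Return the first shape in the sweep that fails, or None."""
    for shape in candidates:
        if fails(shape):
            return shape
    return None

TILE = 256
sweep = [32, TILE, TILE * 2, 10000]     # (32,) → (tileLength,) → ...
mock_fails = lambda n: n > TILE         # pretend: breaks past one full tile
```

Here the sweep pinpoints `TILE * 2` as the first failing shape, which points straight at the tile-boundary handling between the first and second tiles.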

Reverse-Engineering from First-Error Index + Tiling

First-error linear index → which tile → which core → that core's GM start offset → expected byte count of the transfer.
Period = tileLength → transfer/offset issue; period = vector width → compute-flow issue; aligned with a core boundary → multi-core/offset issue.
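The chain above can be walked mechanically for the simple case where each core owns a contiguous block of `per_core` elements split into tiles of `tile_length` (a simplification — real tilings may interleave; the function names are ours):

```python
def locate(first_err_idx, per_core, tile_length):
    """Linear index -> (core, that core's GM start offset, tile within core)."""
    core = first_err_idx // per_core
    core_start = core * per_core                    # GM offset in elements
    tile = (first_err_idx - core_start) // tile_length
    return core, core_start, tile

def error_period(bad_indices):
    """Common gap between consecutive error indices, or None if irregular."""
    gaps = {b - a for a, b in zip(bad_indices, bad_indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None
```

For example, with 4096 elements per core and tileLength 256, a first error at linear index 5000 lands on core 1 (GM start 4096), tile 3; and error indices spaced exactly 256 apart confirm a period equal to tileLength, i.e., a transfer/offset issue.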

Phase 4: Instrumentation

Once the problem has been narrowed to a specific stage/tile, use `AscendC::printf` and `AscendC::DumpTensor` to pinpoint it.

Core Rules

  1. Print on core 0 only: when every core runs the same logic, add `if (AscendC::GetBlockIdx() == 0)` to cut the output volume.
  2. Read only after synchronization: a `LocalTensor` may be read only after `DeQue`/`PipeBarrier`; otherwise you read dirty data from an unfinished transfer.
  3. Convert FP16 to float first: `AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx)));` — printing half directly produces garbage.
  4. Use desc to tag stages: the desc parameter of DumpTensor (0 = after CopyIn, 1 = mid-Compute, 2 = before CopyOut).
  5. Start small: begin with a small dumpSize for DumpTensor; too large a value fills or truncates the buffer.

printf vs. DumpTensor Selection

| Scenario | Tool |
| --- | --- |
| Scalars, branch decisions, single indices | `AscendC::printf` |
| Quick scan of a contiguous tensor segment | `AscendC::DumpTensor(tensor, desc, dumpSize)` |
| Full element-by-element comparison | Not inside the kernel — read GM on the host + a Python script |

Instrumentation Strategy

Instrument step by step inside the Compute function after DeQue, and compare each step against intermediate results hand-computed on the Python side with the same input. The first step that deviates is the root cause.

```cpp
// Example: core 0, tile 0
if (AscendC::GetBlockIdx() == 0 && progress == 0) {
    AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
}
```

Phase 5: Fix Verification

Common Fix Patterns

| Root Cause | Fix |
| --- | --- |
| FP16 not upcast | Add Cast(fp16→fp32) + compute + Cast(fp32→fp16) |
| GM offset error | Correct the offset formula (elements vs. bytes) |
| Wrong tail-tile length | Use curTileLength for compute/transfer, tileLength for offsets |
| Wrong tiling parameters | Fix the host-side tiling computation |
| Missing synchronization | Add the proper EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites source | Back up with Adds first, then ReduceSum |
| Wrong transfer length | Fix copyLen in DataCopyExtParams |

After the Fix

  1. Remove all debug instrumentation (printf/DumpTensor), or wrap it in `#ifdef DEBUG_PRECISION`
  2. Recompile and install
  3. Run the originally failing case plus the full precision test suite
  4. Still failing → return to Phase 1 (at most 3 rounds); after 3 rounds, report to the user

Output Requirements (MANDATORY)

After debugging, you MUST present in the conversation: issue summary, root-cause analysis, fix contents, verification results, and at least 2 key takeaways. NEVER reply with only "fixed."

Typical Cases (Load On Demand)

Once a suspected root cause is located, load the matching case to see the full troubleshooting walkthrough:

| Error Phenomenon | Case File | When to Load |
| --- | --- | --- |
| FP16 fails, FP32 passes, everything deviates | examples/fp16-no-upcast.md | Suspected missing upcast |
| First error at a tile boundary, period = tileLength | examples/gm-offset-error.md | Suspected GM offset error |
| Only a few tail elements wrong | examples/tail-tile-misalign.md | Suspected tail-tile handling |
| block_dim=1 passes, multi-core fails | examples/multicore-tiling-overlap.md | Suspected inter-core tiling |
| Results differ across runs | examples/async-sync-missing.md | Suspected missing synchronization |

Do not load every case at once. Load a case only when the error signature matches.

Anti-Patterns (NEVER)

  • NEVER change code before analyzing the error distribution
  • NEVER printf a full tensor in a loop inside the kernel — use DumpTensor or host-side comparison
  • NEVER print heavily from multiple cores at once — add `GetBlockIdx() == 0` to print from core 0 only
  • NEVER read a LocalTensor at an unsynchronized point — only after DeQue/PipeBarrier
  • NEVER ship `PIPE_ALL` as the final fix — it is for experimental isolation only
  • NEVER leave debug code in place after the fix
  • NEVER fix only the known failing case without running the full precision suite
  • NEVER keep trying after 3 failed rounds — report to the user