ascendc-operator-precision-debug
AscendC Operator Precision Debugging
Locate the root cause in five phases, from shallow to deep: first analyze the data distribution, then review error-prone code, then isolate with controlled experiments, and finally pinpoint with instrumentation.
Phase 1: Error Analysis → Phase 2: Code Review → Phase 3: Experimental Isolation → Phase 4: Instrumentation Localization → Phase 5: Fix Verification
Phase 1: Error Analysis
Principle: look at the data first, then the code. First establish "where it is wrong, by how much, and in what pattern".
Collect the shape, dtype, and MaxAbsErr/MeanAbsErr/CosineSim of the failing cases, then create csrc/ops/<op_name>/test/debug_<op_name>_precision.py from scripts/debug_precision_template.py (replace the placeholders and run it) for automatic analysis:
- Error statistics: MaxAbsErr, MeanAbsErr, MaxRelErr
- First error element: multi-dimensional coordinates + linear index + NPU value vs reference value
- Error distribution: number/proportion of error elements, and whether error intervals are periodic
- Special values: whether the output is all-zero or contains NaN/Inf
- Automatic comparison: fixed vs random input, binary search over reduced shapes
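The statistics above are straightforward to compute on the host. A minimal NumPy sketch (function and field names are illustrative, not the template's actual API; the 1e-3 tolerance is an assumption):

```python
import numpy as np

def precision_report(npu: np.ndarray, ref: np.ndarray) -> dict:
    """Basic error statistics plus the first mismatching element."""
    npu64 = npu.astype(np.float64)
    ref64 = ref.astype(np.float64)
    abs_err = np.abs(npu64 - ref64)
    rel_err = abs_err / (np.abs(ref64) + 1e-12)
    cos = float(np.dot(npu64.ravel(), ref64.ravel()) /
                (np.linalg.norm(npu64) * np.linalg.norm(ref64) + 1e-12))
    # Linear indices of elements exceeding an (assumed) absolute tolerance.
    bad = np.flatnonzero(abs_err.ravel() > 1e-3)
    first = int(bad[0]) if bad.size else None
    return {
        "MaxAbsErr": float(abs_err.max()),
        "MeanAbsErr": float(abs_err.mean()),
        "MaxRelErr": float(rel_err.max()),
        "CosineSim": cos,
        "first_bad_linear": first,
        "first_bad_coords": np.unravel_index(first, npu.shape) if first is not None else None,
        "bad_ratio": bad.size / npu.size,
    }
```

The first-bad linear index and multi-dimensional coordinates feed directly into the tile/core reverse computation used later in Phase 3.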
Error Characteristics → Preliminary Judgment
| Phenomenon | Most Likely Cause | Next Step |
|---|---|---|
| FP16 fails, FP32 passes | Failure to upcast to FP32 for computation | Check Cast in Phase 2 |
| All-zero output | CopyOut not executed / GM offset error | Check CopyOut in Phase 2 |
| Output contains NaN/Inf | Division by zero / log of negative number / overflow | Check Compute in Phase 2 |
| All deviations, CosineSim≈1 | Systematic precision loss | Check upcasting in Phase 2 |
| Periodic/striped errors | Tile boundary / data transfer offset | Conduct experiments in Phase 3 |
| Only tail elements are wrong | Tail tile length / alignment | Check tail tile in Phase 2 |
| Different results in multiple runs | Insufficient asynchronous synchronization | Conduct Experiment B in Phase 3 |
| Small shape passes, large shape fails | Multi-core/tiling boundary | Conduct Experiment A in Phase 3 |
| Fixed input passes, random input fails | Address/stride/offset error | Conduct Experiment C in Phase 3 |
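The "periodic/striped errors" row can be checked mechanically: look at the spacing between error indices. A hypothetical sketch:

```python
import numpy as np

def error_period(bad_indices: np.ndarray):
    """Return the stride between consecutive error indices if it is constant,
    else None. A constant stride equal to tileLength hints at a per-tile
    transfer/offset bug; a stride equal to the vector width hints at a
    compute-stage bug."""
    if bad_indices.size < 2:
        return None
    gaps = np.diff(np.sort(bad_indices))
    return int(gaps[0]) if np.all(gaps == gaps[0]) else None
```

For example, errors at indices 127, 255, 383 give a period of 128, which you would then compare against tileLength and the vector width as in the table above.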
Phase 2: Code Review
MANDATORY: Read op_host/<op_name>.cpp, op_kernel/<op_name>.cpp, and design.md (if it exists), then work through the following checklist from shallow to deep.
Layer 1: Basic Correctness (Highest Frequency)
- FP16/BF16 not upcast: in Compute, is half-precision data first Cast to FP32 for computation and then Cast back? This is the most frequent precision bug.
- Incorrect calculation formula: compare the API call sequence against the design document/PyTorch step by step: operation order, scalar signs, missing steps.
- GM offset unit confusion: xGm[progress * tileLength] is an element offset; do not multiply by sizeof(T) again.
- tileLength vs curTileLength: use tileLength for offsets and curTileLength for computation/transfer (the tail tile may be smaller).
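The first bullet can be reproduced on the host side: accumulating many FP16 values without upcasting loses precision that FP32 accumulation keeps, which is the same effect as skipping the Cast in Compute. A minimal NumPy illustration:

```python
import numpy as np

x = np.full(4096, 0.1, dtype=np.float16)

# Accumulating in FP16: once the running sum is large, the FP16 grid spacing
# exceeds the addend and further additions round away to nothing.
sum_fp16 = np.float16(0.0)
for v in x:
    sum_fp16 = np.float16(sum_fp16 + v)

# Upcast once, accumulate in FP32, cast back only at the end.
sum_fp32 = x.astype(np.float32).sum()

# sum_fp32 stays near the true ~409.6; sum_fp16 stalls far below it.
```

The same pattern holds inside a kernel: Cast to FP32, compute, Cast back.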
Layer 2: Data Transfer and Alignment
- DataCopyPad copyLen: copyLen in DataCopyExtParams is a byte count = curTileLength * sizeof(T).
- Tail tile alignment: when the tail tile is not 32B-aligned, are the calculation and usage of alignedTailLen correct?
- Inconsistent offsets across inputs: when input tensor shapes differ (e.g., x vs cos/sin in RoPE), is each tensor's offset computed correctly?
Layer 3: Tiling and Multi-core
- Host/Kernel tiling inconsistency: does the same symbol (e.g., tileLength) mean the same thing on the host and in the kernel?
- Inter-core boundary overlap/omission: does formerNum × formerLength + tailNum × tailLength cover all the data exactly?
- bufferCoefficient error: check against the UB allocation table in the design document; a wrong coefficient skews tileLength.
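The boundary bullet is a pure arithmetic invariant, so it can be asserted on the host with the tiling parameters before any kernel debugging. A hypothetical sketch:

```python
def check_core_coverage(total: int, former_num: int, former_length: int,
                        tail_num: int, tail_length: int) -> bool:
    """Former cores plus tail cores must tile the data exactly:
    no gap, no overlap."""
    return former_num * former_length + tail_num * tail_length == total
```

For example, splitting 1000 elements as 4 former cores × 126 plus 4 tail cores × 124 covers exactly 504 + 496 = 1000; a naive 8 × 126 split overshoots and would make adjacent cores write overlapping GM.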
Layer 4: API Traps
- ReduceSum/Max modifies source data: reductions may overwrite the source tensor; back it up first with Adds(backup, src, 0.0f, len) if it is reused later.
- AllocTensor/FreeTensor not paired: these must be strictly paired with EnQue/DeQue, otherwise buffers leak.
- Vector length parameter: the length argument of AscendC vector APIs is an element count, not a byte count.
Layer 5: Boundary Cases
- Division by zero / domain violations: guard against zero for Div and Reciprocal; Ln requires positive inputs; Sqrt requires non-negative inputs.
- Tiling integer overflow: can a multiplication overflow int32? Prefer int64_t.
Checkpoint: produce a review report listing the suspected issues sorted by likelihood. If the root cause is pinned down, jump to Phase 5; otherwise continue to Phase 3.
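The int32-overflow bullet above is easy to demonstrate: a tiling product such as rows × cols × sizeof(dtype) silently wraps in 32-bit arithmetic. An illustration (the wrap is simulated with a modulo, since Python integers do not overflow):

```python
import numpy as np

rows, cols, dtype_size = 50_000, 50_000, 4            # 10^10 bytes of data

# 64-bit arithmetic (int64_t on the host side): exact.
total_ok = np.int64(rows) * np.int64(cols) * dtype_size

# What 32-bit arithmetic would produce: the product wraps modulo 2**32.
wrapped = (rows * cols * dtype_size) % (2 ** 32)
```

The wrapped value is a plausible-looking but wrong byte count, which then silently corrupts every downstream tiling parameter derived from it.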
Phase 3: Experimental Isolation
When Phase 2 cannot pin down the root cause directly, narrow the scope with controlled experiments. Change only one variable at a time.
Experiment A: block_dim → 1 (Multi-core Isolation)
Temporarily hardcode blockDim = 1 in op_host, recompile, and test. This can be combined with a reduced shape.
| Result | Conclusion |
|---|---|
| Single-core passes, multi-core fails | Inter-core issue: GM interval overlap / tiling mapping / inter-core synchronization |
| Single-core also fails | Not a multi-core issue → Experiment B |
Experiment B: PipeBarrier<PIPE_ALL> (Synchronization Isolation)
Temporarily replace all synchronization in the kernel's Process with AscendC::PipeBarrier<PIPE_ALL>() (one between each of CopyIn / Compute / CopyOut).
| Result | Conclusion |
|---|---|
| Passes with full barriers | Insufficient intra-core synchronization → gradually restore fine-grained synchronization to localize |
| Still fails | Not a synchronization issue → Experiment C |
PIPE_ALL is for experimental isolation only; never ship it as the final fix.
Experiment C: Fixed/Regular Input (Address Isolation)
Test with all-ones, an arithmetic sequence (torch.arange), and random input in turn.
| Result | Conclusion |
|---|---|
| All-ones passes, arithmetic/random fails | Address/offset/stride error (constant input masks offset bugs) |
| All fail | Calculation logic or global tiling error |
| All pass | A specific value range triggers the precision issue → check boundary/extreme values |
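The three input classes of Experiment C can come from one helper; the torch variants would be analogous. A sketch in NumPy (names are illustrative):

```python
import numpy as np

def make_input(kind: str, shape, dtype=np.float32) -> np.ndarray:
    """Inputs for Experiment C: 'ones' masks offset bugs (every element is
    identical), 'arange' exposes them (every element is distinct), 'random'
    adds value-range coverage."""
    n = int(np.prod(shape))
    if kind == "ones":
        return np.ones(shape, dtype=dtype)
    if kind == "arange":
        return np.arange(n, dtype=dtype).reshape(shape)
    if kind == "random":
        return np.random.default_rng(0).standard_normal(n).astype(dtype).reshape(shape)
    raise ValueError(kind)
```

The fixed seed for the random case matters: a reproducible failing input is what allows the step-by-step comparison in Phase 4.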
Experiment D: Reduce Shape (Boundary Isolation)
Test an increasing shape ladder in order: shape=(32,), then (tileLength,), then (tileLength*2,).
Reverse-engineering from First Error Index + Tiling
First-error linear index → which tile → which core → that core's GM start offset → expected transfer byte count.
Period = tileLength → transfer/offset issue; period = vector width → computation issue; first error aligned with a core boundary → multi-core/offset issue.
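The reverse computation above is mechanical and worth scripting. A sketch assuming the simple even split where each core owns per_core contiguous elements, tiled into tileLength chunks (real former/tail tilings need the actual parameters):

```python
def locate_error(linear_idx: int, per_core: int, tile_length: int) -> dict:
    """Map a first-error linear index back to the core, the tile within that
    core, the offset within the tile, and the core's GM start offset."""
    core = linear_idx // per_core
    in_core = linear_idx % per_core
    return {
        "core": core,
        "tile": in_core // tile_length,
        "in_tile": in_core % tile_length,
        "gm_start": core * per_core,
    }
```

For example, a first error at linear index 5000 with per_core=4096 and tile_length=1024 lands on core 1, tile 0, element 904, so instrumentation in Phase 4 should target that core and tile.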
Phase 4: Instrumentation Localization
Once the issue has been narrowed to a specific phase/tile, use AscendC::printf and AscendC::DumpTensor to pinpoint it.
Core Rules
- Print only on core 0: when every core runs the same logic, add if (AscendC::GetBlockIdx() == 0) to cut the output volume.
- Read only after synchronization: a LocalTensor may be read only after DeQue / PipeBarrier; otherwise you read dirty data from an unfinished transfer.
- Convert FP16 to float first: AscendC::printf("v=%.6f\n", static_cast<float>(tensor.GetValue(idx))); printing half-precision directly produces garbage.
- Use desc to distinguish phases: the desc parameter of DumpTensor (0 = after CopyIn, 1 = mid-Compute, 2 = before CopyOut).
- Start small: begin with a small dumpSize for DumpTensor; an oversized dump fills the buffer or truncates output.
printf vs DumpTensor Selection
| Scenario | Tool |
|---|---|
| Scalar, branch decision, single index | AscendC::printf |
| Quick scan of a contiguous tensor segment | AscendC::DumpTensor |
| Full element-by-element comparison | Not inside the kernel — read GM on the host + Python script |
Instrumentation Strategy
Instrument step by step inside the Compute function after DeQue, comparing against intermediate results hand-computed on the Python side with the same input. The first step where a deviation appears is where the root cause lives.

```cpp
// Example: core 0, tile 0
if (AscendC::GetBlockIdx() == 0 && progress == 0) {
    AscendC::printf("[step1] tmp[0]=%.6f\n", static_cast<float>(tmp.GetValue(0)));
}
```
Phase 5: Fix Verification
Common Fix Patterns
| Root Cause | Fix |
|---|---|
| FP16 not upcast | Add Cast(fp16→fp32) + computation + Cast(fp32→fp16) |
| GM offset error | Correct offset formula (element vs byte) |
| Wrong tail tile length | Use curTileLength for computation/transfer, use tileLength for offset |
| Incorrect tiling parameter | Correct tiling calculation on Host side |
| Missing synchronization | Add correct EnQue/DeQue or PipeBarrier |
| ReduceSum overwrites source | Backup with Adds first then ReduceSum |
| Wrong transfer length | Correct copyLen in DataCopyExtParams |
After Fix
- Remove all debug instrumentation (printf/DumpTensor), or wrap it in #ifdef DEBUG_PRECISION
- Recompile and install
- Run the originally failing case + the full precision test
- Still failing → return to Phase 1 (at most 3 rounds); if it still fails after 3 rounds, report to the user
Output Requirements (MANDATORY)
After debugging, you MUST present in the conversation: an issue summary, root-cause analysis, the fix, verification results, and at least 2 key lessons. NEVER reply only "Fixed".
Typical Cases (Load On Demand)
After locating the suspected root cause, load the corresponding case file to see the complete troubleshooting process:
| Error Phenomenon | Case File | When to Load |
|---|---|---|
| FP16 fails, FP32 passes, all deviations | | Suspect missing upcasting |
| First error at tile boundary, period = tileLength | | Suspect GM offset error |
| Only a few tail elements are wrong | | Suspect tail tile handling |
| block_dim=1 passes, multi-core fails | | Suspect inter-core tiling |
| Different results in multiple runs | | Suspect missing synchronization |
Do not load all cases at once. Load a case only when the error characteristics match.
Anti-Patterns (NEVER)
- NEVER modify code directly without analyzing the error distribution
- NEVER printf a full tensor in a loop inside the kernel — use DumpTensor or host-side comparison
- NEVER print heavily on multiple cores at once — add GetBlockIdx() == 0 to print only on core 0
- NEVER read a LocalTensor at an unsynchronized point — it must come after DeQue/PipeBarrier
- NEVER use PIPE_ALL as the final fix — it is for experimental isolation only
- NEVER leave debug code in after the fix
- NEVER fix only the known failing case without running the full precision test
- NEVER keep trying after 3 failed rounds — report to the user