# AOTI Debugging Guide
This skill helps diagnose and fix common AOTInductor issues.
## Error Pattern Routing
Check the error message and route to the appropriate sub-guide:
### Triton Index Out of Bounds

If the error matches this pattern:

```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```

→ Follow the guide in `triton-index-out-of-bounds.md`.

### All Other Errors

Continue with the sections below.
## First Step: Always Check Device and Shape Matching
For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:
- Compile device == Load device: The model must be loaded on the same device type it was compiled on
- Input devices match: Runtime inputs must be on the same device as the compiled model
- Input shapes match: Runtime input shapes must match the shapes used during compilation (or satisfy dynamic shape constraints)
```python
# During compilation - note the device and shapes
model = MyModel().eval()  # What device? CPU or .cuda()?
inp = torch.randn(2, 10)  # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```

**If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.**

## Key Constraint: Device Type Matching
AOTI requires compile and load to use the same device type.
- If you compile on CUDA, you must load on CUDA (device index can differ)
- If you compile on CPU, you must load on CPU
- Cross-device loading (e.g., compile on GPU, load on CPU) is NOT supported
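The constraint above can be expressed as a small pre-flight check. This is an illustrative, hypothetical helper (`check_device_compat` is not part of any PyTorch API), but it captures the rule: device type must match, device index may differ.

```python
def check_device_compat(compile_device: str, load_device: str) -> None:
    """Hypothetical pre-flight check: AOTI requires the same device *type*
    at compile and load time; only the device index may differ."""
    compile_type = compile_device.split(":")[0]
    load_type = load_device.split(":")[0]
    if compile_type != load_type:
        raise ValueError(
            f"Cross-device loading is not supported: compiled on "
            f"'{compile_type}', attempted load on '{load_type}'"
        )

check_device_compat("cuda:0", "cuda:1")  # OK: only the index differs
check_device_compat("cpu", "cpu")        # OK
# check_device_compat("cuda:0", "cpu")   # would raise ValueError
```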
## Common Error Patterns
### 1. Device Mismatch Segfault
**Symptom**: Segfault, exception, or crash during `aot_load()` or model execution.

Example error messages:
- `The specified pointer resides on host memory and is not registered with any CUDA device`
- Crash during constant loading in `AOTInductorModelBase`
- `Expected out tensor to have device cuda:0, but got cpu instead`

**Cause**: Compile and load device types don't match (see "First Step" above).

**Solution**: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.
### 2. Input Device Mismatch at Runtime
**Symptom**: RuntimeError during model execution.

**Cause**: Input device doesn't match compile device (see "First Step" above).

**Better Debugging**: Run with `AOTI_RUNTIME_CHECK_INPUTS=1` for clearer errors. This flag validates all input properties including device type, dtype, sizes, and strides:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```

This produces actionable error messages like:

```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```

## Debugging CUDA Illegal Memory Access (IMA) Errors
If you encounter CUDA illegal memory access errors, follow this systematic approach:
### Step 1: Sanity Checks
Before diving deep, try these debugging flags:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```

These flags take effect at compilation time (at codegen time):
- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN
### Step 2: Pinpoint the CUDA IMA
CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:

```bash
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```

These flags take effect at runtime:
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` disables PyTorch's caching allocator, which allocates bigger buffers than immediately needed. This is usually why CUDA IMA errors are non-deterministic.
- `CUDA_LAUNCH_BLOCKING=1` forces kernels to launch one at a time. Without this, you get "CUDA kernel errors might be asynchronously reported" warnings since kernels launch asynchronously.
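If it's easier to keep the flags with the repro script itself, they can equally be set from Python, as long as this happens before CUDA is initialized (a sketch; the shell form above is the more common route):

```python
import os

# Must be set before CUDA is initialized -- in practice, before importing
# torch -- or they will have no effect on the current process.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```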
### Step 3: Identify Problematic Kernels with Intermediate Value Debugger
Use the AOTI Intermediate Value Debugger to pinpoint the problematic kernel:

```bash
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3
```

This prints kernels one by one at runtime. Together with the previous flags, this shows which kernel was launched right before the error.

To inspect inputs to a specific kernel:

```bash
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_add_ge_logical_and_logical_or_lt_231,_add_position_embeddings_kernel_5" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2
```

If inputs to the kernel are unexpected, inspect the kernel that produces the bad input.
## Additional Debugging Tools
### Logging and Tracing
- **tlparse / TORCH_TRACE**: Provides the complete output code and records the guards used
- **TORCH_LOGS**: Use `TORCH_LOGS="+inductor,output_code"` to see more PT2 internal logs
- **TORCH_SHOW_CPP_STACKTRACES**: Set to `1` to see more stack traces
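For example, a typical tlparse workflow looks like the following (assuming `tlparse` is installed, e.g. via `pip install tlparse`, and `your_script.py` stands in for your repro):

```shell
# Record structured PT2 trace logs while reproducing the issue...
TORCH_TRACE=/tmp/trace_dir python your_script.py

# ...then render them into a browsable report (output code, guards, etc.)
tlparse /tmp/trace_dir
```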
### Common Sources of Issues
- Dynamic shapes: Historically a source of many IMAs. Pay special attention when debugging dynamic shape scenarios.
- Custom ops: Especially when implemented in C++ with dynamic shapes. The meta function may need to be Symint'ified.
## API Notes
### Deprecated API
```python
torch._export.aot_compile()  # Deprecated
torch._export.aot_load()     # Deprecated
```

### Current API
```python
torch._inductor.aoti_compile_and_package()
torch._inductor.aoti_load_package()
```

The new API stores device metadata in the package, so `aoti_load_package()` automatically uses the correct device type. You can only change the device index (e.g., `cuda:0` vs `cuda:1`), not the device type.
## Environment Variables Summary
| Variable | When | Purpose |
|---|---|---|
| `AOTI_RUNTIME_CHECK_INPUTS=1` | Compile time | Validate inputs match compilation guards |
| `TORCHINDUCTOR_NAN_ASSERTS=1` | Compile time | Check for NaN before/after kernels |
| `PYTORCH_NO_CUDA_MEMORY_CACHING=1` | Runtime | Make IMA errors deterministic |
| `CUDA_LAUNCH_BLOCKING=1` | Runtime | Force synchronous kernel launches |
| `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER` | Compile time | Print kernels at runtime |
| `TORCH_LOGS` | Runtime | See PT2 internal logs |
| `TORCH_SHOW_CPP_STACKTRACES=1` | Runtime | Show C++ stack traces |