# AOTI Debugging Guide
This skill helps diagnose and fix common AOTInductor issues.
## Error Pattern Routing
Check the error message and route to the appropriate sub-guide:
### Triton Index Out of Bounds

If the error matches this pattern:

```
Assertion `index out of bounds: 0 <= tmpN < ksM` failed
```

→ Follow the guide in `triton-index-out-of-bounds.md`.

### All Other Errors

Continue with the sections below.
## First Step: Always Check Device and Shape Matching
For ANY AOTI error (segfault, exception, crash, wrong output), ALWAYS check these first:
- Compile device == Load device: The model must be loaded on the same device type it was compiled on
- Input devices match: Runtime inputs must be on the same device as the compiled model
- Input shapes match: Runtime input shapes must match the shapes used during compilation (or satisfy dynamic shape constraints)
```python
# During compilation - note the device and shapes
model = MyModel().eval()  # What device? CPU or .cuda()?
inp = torch.randn(2, 10)  # What device? What shape?
compiled_so = torch._inductor.aot_compile(model, (inp,))

# During loading - device type MUST match compilation
loaded = torch._export.aot_load(compiled_so, "???")  # Must match model/input device above

# During inference - device and shapes MUST match
out = loaded(inp.to("???"))  # Must match compile device, shape must match
```

**If any of these don't match, you will get errors ranging from segfaults to exceptions to wrong outputs.**

## Key Constraint: Device Type Matching
AOTI requires compile and load to use the same device type.
- If you compile on CUDA, you must load on CUDA (device index can differ)
- If you compile on CPU, you must load on CPU
- Cross-device loading (e.g., compile on GPU, load on CPU) is NOT supported
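The constraint above can be expressed as a small pre-flight check. This is an illustrative, hypothetical helper (`check_device_compat` is not part of any PyTorch API), but it captures the rule: device type must match, device index may differ.

```python
def check_device_compat(compile_device: str, load_device: str) -> None:
    """Hypothetical pre-flight check: AOTI requires the same device *type*
    at compile and load time; only the device index may differ."""
    compile_type = compile_device.split(":")[0]
    load_type = load_device.split(":")[0]
    if compile_type != load_type:
        raise ValueError(
            f"Cross-device loading is not supported: compiled on "
            f"'{compile_type}', attempted load on '{load_type}'"
        )

check_device_compat("cuda:0", "cuda:1")  # OK: only the index differs
check_device_compat("cpu", "cpu")        # OK
# check_device_compat("cuda:0", "cpu")   # would raise ValueError
```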
## Common Error Patterns
### 1. Device Mismatch Segfault
**Symptom**: Segfault, exception, or crash during `aot_load()` or model execution.

Example error messages:
- `The specified pointer resides on host memory and is not registered with any CUDA device`
- Crash during constant loading in `AOTInductorModelBase`
- `Expected out tensor to have device cuda:0, but got cpu instead`

**Cause**: Compile and load device types don't match (see "First Step" above).

**Solution**: Ensure compile and load use the same device type. If compiled on CPU, load on CPU. If compiled on CUDA, load on CUDA.
### 2. Input Device Mismatch at Runtime
**Symptom**: RuntimeError during model execution.

**Cause**: Input device doesn't match compile device (see "First Step" above).

**Better Debugging**: Run with `AOTI_RUNTIME_CHECK_INPUTS=1` for clearer errors. This flag validates all input properties including device type, dtype, sizes, and strides:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1 python your_script.py
```

This produces actionable error messages like:

```
Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
```

## Debugging CUDA Illegal Memory Access (IMA) Errors
If you encounter CUDA illegal memory access errors, follow this systematic approach:
### Step 1: Sanity Checks
Before diving deep, try these debugging flags:

```bash
AOTI_RUNTIME_CHECK_INPUTS=1
TORCHINDUCTOR_NAN_ASSERTS=1
```

These flags take effect at compilation time (at codegen time):
- `AOTI_RUNTIME_CHECK_INPUTS=1` checks if inputs satisfy the same guards used during compilation
- `TORCHINDUCTOR_NAN_ASSERTS=1` adds codegen before and after each kernel to check for NaN
### Step 2: Pinpoint the CUDA IMA
CUDA IMA errors can be non-deterministic. Use these flags to trigger the error deterministically:

```bash
PYTORCH_NO_CUDA_MEMORY_CACHING=1
CUDA_LAUNCH_BLOCKING=1
```

These flags take effect at runtime:
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` disables PyTorch's caching allocator, which allocates bigger buffers than immediately needed. This is usually why CUDA IMA errors are non-deterministic.
- `CUDA_LAUNCH_BLOCKING=1` forces kernels to launch one at a time. Without this, you get "CUDA kernel errors might be asynchronously reported" warnings since kernels launch asynchronously.
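If it's easier to keep the flags with the repro script itself, they can equally be set from Python, as long as this happens before CUDA is initialized (a sketch; the shell form above is the more common route):

```python
import os

# Must be set before CUDA is initialized -- in practice, before importing
# torch -- or they will have no effect on the current process.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```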
### Step 3: Identify Problematic Kernels with Intermediate Value Debugger
Use the AOTI Intermediate Value Debugger to pinpoint the problematic kernel:

```bash
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3
```

This prints kernels one by one at runtime. Together with the previous flags, this shows which kernel was launched right before the error.

To inspect inputs to a specific kernel:

```bash
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_add_ge_logical_and_logical_or_lt_231,_add_position_embeddings_kernel_5" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2
```

If inputs to the kernel are unexpected, inspect the kernel that produces the bad input.
## Additional Debugging Tools
### Logging and Tracing
- **tlparse / TORCH_TRACE**: Provides the complete output code and records the guards used
- **TORCH_LOGS**: Use `TORCH_LOGS="+inductor,output_code"` to see more PT2 internal logs
- **TORCH_SHOW_CPP_STACKTRACES**: Set to `1` to see more stack traces
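For example, a typical tlparse workflow looks like the following (assuming `tlparse` is installed, e.g. via `pip install tlparse`, and `your_script.py` stands in for your repro):

```shell
# Record structured PT2 trace logs while reproducing the issue...
TORCH_TRACE=/tmp/trace_dir python your_script.py

# ...then render them into a browsable report (output code, guards, etc.)
tlparse /tmp/trace_dir
```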
### Common Sources of Issues
- Dynamic shapes: Historically a source of many IMAs. Pay special attention when debugging dynamic shape scenarios.
- Custom ops: Especially when implemented in C++ with dynamic shapes. The meta function may need to be Symint'ified.
## API Notes
### Deprecated API
```python
torch._export.aot_compile()  # Deprecated
torch._export.aot_load()     # Deprecated
```

### Current API
```python
torch._inductor.aoti_compile_and_package()
torch._inductor.aoti_load_package()
```

The new API stores device metadata in the package, so `aoti_load_package()` automatically uses the correct device type. You can only change the device index (e.g., `cuda:0` vs `cuda:1`), not the device type.
## Environment Variables Summary
| Variable | When | Purpose |
|---|---|---|
| `AOTI_RUNTIME_CHECK_INPUTS=1` | Compile time | Validate inputs match compilation guards |
| `TORCHINDUCTOR_NAN_ASSERTS=1` | Compile time | Check for NaN before/after kernels |
| `PYTORCH_NO_CUDA_MEMORY_CACHING=1` | Runtime | Make IMA errors deterministic |
| `CUDA_LAUNCH_BLOCKING=1` | Runtime | Force synchronous kernel launches |
| `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER` | Compile time | Print kernels at runtime |
| `TORCH_LOGS` | Runtime | See PT2 internal logs |
| `TORCH_SHOW_CPP_STACKTRACES=1` | Runtime | Show C++ stack traces |