ad-accuracy-debug

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

AutoDeploy Accuracy Debugging

AutoDeploy精度调试

Where this skill applies

本技能适用场景

This file is part of trtllm-agent-toolkit. Paths such as

tensorrt_llm/

tests/

, and

examples/auto_deploy/

are relative to a TensorRT-LLM source checkout on the user's machine, not the plugin repository.

本文档是trtllm-agent-toolkit的一部分。

tensorrt_llm/

、

tests/

和

examples/auto_deploy/

等路径相对于用户机器上的TensorRT-LLM源码检出目录，而非插件仓库。

Related skills in this plugin

本插件中的相关技能

```
trtllm-agent-toolkit:ad-graph-dump
```
— inspect per-transform FX graph snapshots when Phase 2 suggests a transform was applied incorrectly or is corrupting activations.
```
trtllm-agent-toolkit:ad-conf-check
```
— verify that precision or config settings (FP8, sharding, chunked prefill, etc.) were actually applied at runtime before attributing an accuracy gap to a kernel or weight bug.

Input: model name, failing accuracy score, reference score, eval task (e.g. MMLU, GSM8K). Output: identified root cause, minimal reproducer, and a code fix.

```
trtllm-agent-toolkit:ad-graph-dump
```
— 当Phase 2显示某个变换应用错误或激活值被破坏时，检查每个变换的FX图快照。
```
trtllm-agent-toolkit:ad-conf-check
```
— 在将精度差异归因于内核或权重bug之前，验证精度或配置设置（FP8、分片、分块预填充等）是否在运行时实际生效。

输入： 模型名称、不合格的精度分数、参考分数、评估任务（如MMLU、GSM8K）。 输出： 已识别的根本原因、最小复现用例和代码修复方案。

Situation Assessment

情况评估

Before debugging, confirm:

What is the reference score? Is it from the PyTorch backend test, a published leaderboard, or set manually?
How large is the gap? A 1-2% gap may be within statistical noise; a 5%+ gap is a real bug.
Is the eval framework itself suspect? Run the same eval on the PyTorch backend to validate the harness before blaming AutoDeploy.

调试前，请确认：

参考分数来源是什么？ 是来自PyTorch后端测试、已发布的排行榜还是手动设置的？
分数差距有多大？ 1-2%的差距可能在统计噪声范围内；5%以上的差距则是真实bug。
评估框架本身是否可疑？ 在PyTorch后端运行相同的评估，先验证测试工具是否可靠，再归咎于AutoDeploy。

Abbreviations

缩写说明

AD — AutoDeploy, TRT-LLM with
```
_autodeploy
```
backend
PT — PyTorch, TRT-LLM with
```
pytorch
```
backend (manual deployment)

AD — AutoDeploy，使用
```
_autodeploy
```
后端的TRT-LLM
PT — PyTorch，使用
```
pytorch
```
后端的TRT-LLM（手动部署）

Phase 0 — Validate the Test Harness

Phase 0 — 验证测试工具

Run the equivalent PyTorch backend test on the same model and same eval task. If PT also fails or scores lower than expected, the issue is in the eval framework (prompt format, chat template, sampling params), not AD-specific.

Key things to verify in the eval harness:

Prompt format /
apply_chat_template
: does the evaluator send raw prompts or apply a chat template? The relationship is two-sided for reasoning/chat models:
- Applying
```
apply_chat_template
```
  to a concatenated few-shot prompt (without
```
fewshot_as_multiturn
```
  ) collapses the examples into a malformed single turn and can produce 0% accuracy.
- Omitting
```
apply_chat_template
```
  for a chat-first model can be equally wrong. For chat models on few-shot benchmarks, consider whether
```
apply_chat_template=True
```
  paired with
```
fewshot_as_multiturn=True
```
  is appropriate — the latter turns each few-shot example into an explicit user/assistant exchange before the template is applied. (Reference: Qwen3.5-MoE accuracy fix in
```
test_llm_api_autodeploy.py
```
  .)
max_output_len
for generation tasks: for benchmarks where the model must generate a full reasoning chain before the answer (e.g. GSM8K with a reasoning model), the default
```
MAX_OUTPUT_LEN
```
may truncate the response before the final answer is reached. Consider patching it up (e.g. 512) if outputs appear cut off. This is distinct from capping
```
max_tokens
```
for classification tasks like MMLU where you want to prevent long generations.
max_tokens
for classification tasks: must be capped (e.g. 2 for MMLU) to prevent the model generating a full reasoning chain.
Dataset path: confirm
```
LLM_MODELS_ROOT
```
is set correctly and the dataset directory exists.

If PT passes: the harness is fine. Proceed to Phase 1.

在相同模型和相同评估任务上运行等效的PyTorch后端测试。如果PT也失败或分数低于预期，问题出在评估框架（提示格式、聊天模板、采样参数），而非AD特有问题。

评估工具中需要验证的关键事项：

提示格式 /
apply_chat_template
：评估器是发送原始提示还是应用聊天模板？对于推理/聊天模型，这是双向关联的：
- 对拼接的few-shot提示应用
```
apply_chat_template
```
  （未设置
```
fewshot_as_multiturn
```
  ）会将示例折叠为格式错误的单轮对话，可能导致0%的精度。
- 针对优先支持聊天的模型省略
```
apply_chat_template
```
  同样会出错。对于few-shot基准测试中的聊天模型，考虑是否适合将
```
apply_chat_template=True
```
  与
```
fewshot_as_multiturn=True
```
  配合使用——后者会在应用模板前将每个few-shot示例转换为显式的用户/助手对话。（参考：
```
test_llm_api_autodeploy.py
```
  中Qwen3.5-MoE的精度修复方案。）
生成任务的
max_output_len
：对于模型必须在生成最终答案前生成完整推理链的基准测试（如使用推理模型的GSM8K），默认的
```
MAX_OUTPUT_LEN
```
可能在生成最终答案前截断响应。如果输出看起来被截断，考虑将其调高（如512）。这与MMLU等分类任务中限制
```
max_tokens
```
以防止长生成的情况不同。
分类任务的
max_tokens
：必须限制长度（如MMLU设置为2），以防止模型生成完整的推理链。
数据集路径：确认
```
LLM_MODELS_ROOT
```
已正确设置且数据集目录存在。

如果PT测试通过：工具可靠，进入Phase 1。

Phase 1 — Quick Diagnostic with a Small Sample

Phase 1 — 小样本快速诊断

Write a standalone diagnostic script that:

Loads the AD model directly via

from tensorrt_llm._torch.auto_deploy import LLM as AutoDeployLLM

Reproduces the exact prompt format the evaluator uses (not a simplified variant), including few-shot examples if any
Runs ~50-100 samples
Prints per-sample
```
(ref, output, correct)
```
and overall accuracy

Critical: reproduce the evaluator's exact prompt format. Deviating — for example, using a 0-shot prompt when the evaluator uses 5-shot — can cause thinking models to produce "Okay" or other meta-responses instead of the expected answer, making results uninterpretable. Verify the first printed prompt matches what the evaluator sends.

Typical evaluator sources:

```
tensorrt_llm/evaluate/mmlu.py
```
— 5-shot format with dev examples
```
tensorrt_llm/evaluate/gsm8k.py
```
— few-shot with CoT references

tests/integration/defs/accuracy/accuracy_core.py

—

MAX_INPUT_LEN

MAX_OUTPUT_LEN

NUM_SAMPLES

per task

编写独立的诊断脚本，实现以下功能：

通过

from tensorrt_llm._torch.auto_deploy import LLM as AutoDeployLLM

直接加载AD模型

复现评估器使用的完全一致的提示格式（而非简化版本），包括可能存在的few-shot示例
运行约50-100个样本

打印每个样本的

(参考输出, 模型输出, 是否正确)

以及整体精度

关键注意事项： 必须复现评估器的精确提示格式。如果偏离——例如评估器使用5-shot提示而你使用0-shot——会导致推理模型生成“好的”或其他元响应，而非预期答案，使结果无法解释。验证打印的第一个提示是否与评估器发送的一致。

典型评估器来源：

```
tensorrt_llm/evaluate/mmlu.py
```
— 带开发示例的5-shot格式
```
tensorrt_llm/evaluate/gsm8k.py
```
— 带CoT参考的few-shot格式

tests/integration/defs/accuracy/accuracy_core.py

— 每个任务的

MAX_INPUT_LEN

、

MAX_OUTPUT_LEN

、

NUM_SAMPLES

设置

Phase 2 — Classify the Error Pattern

Phase 2 — 错误模式分类

From the diagnostic output, determine what the model is generating:

Output Pattern	Likely Root Cause
Coherent but consistently wrong letter / answer	Numerical accuracy bug (attention, FP8 kernel, weight corruption)
Generates meta-text ("The user wants...", "The answer is...", "Let me think...")	Prompt format issue — model not primed to answer directly
Outputs empty string or EOS immediately	KV cache garbage (uninitialized cache, scale overflow), or `end_id` matching first token
Completely random tokens / gibberish	Transformation applied incorrectly, load hook missing or applied twice, corrupted weights
Correct on easy subjects, wrong on hard subjects	Subtle numerical precision bug (FP8 kernel mismatch, attention scale wrong)
NaN in logits, especially on prefill	FX graph transform produced a node without shape metadata — enable `AD_DUMP_GRAPHS_DIR` and look for nodes missing `meta["val"]` ; often caused by an opaque Python closure inside a transform
Passes at `world_size=1` , fails at `world_size>1`	Sharding bug — see Phase 4c

根据诊断输出，判断模型生成内容的模式：

输出模式	可能的根本原因
内容连贯但持续输出错误选项/答案	数值精度bug（注意力机制、FP8内核、权重损坏）
生成元文本（“用户想要...”, “答案是...”, “让我想想...”）	提示格式问题——模型未被正确引导直接作答
输出空字符串或立即输出EOS	KV缓存异常（未初始化缓存、缩放溢出），或 `end_id` 与第一个token匹配
完全随机的token/乱码	变换应用错误、加载钩子缺失或重复应用、权重损坏
简单主题正确，复杂主题错误	细微的数值精度bug（FP8内核不匹配、注意力缩放错误）
logits中出现NaN，尤其是在预填充阶段	FX图变换生成了缺少形状元数据的节点——启用 `AD_DUMP_GRAPHS_DIR` 并查找缺少 `meta["val"]` 的节点；通常由变换内部的不透明Python闭包导致
`world_size=1` 时正常， `world_size>1` 时失败	分片bug——见Phase 4c

Phase 3 — Configuration Isolation

Phase 3 — 配置隔离

Narrow down which part of the setup is responsible by reducing the environment to its simplest form, then re-enabling components one at a time until the regression reappears.

Step 1 — Strip to a minimal configuration:

Where feasible, reduce complexity along each axis, re-running the Phase 1 diagnostic after each change:

Remove sharding or reduce to TP=1 / single GPU
Disable multi-streaming
Disable non-default transform passes in the YAML config (
```
enabled: false
```
)
Revert to
torch-simple

compile_backend
: AutoDeploy currently supports two backends —
```
torch-cudagraph
```
(CUDA graphs, the typical production setting) and
```
torch-simple
```
(no CUDA graphs, significantly slower). If the model is configured with
```
torch-cudagraph
```
, revert to
```
torch-simple
```
and check whether the accuracy issue persists. Note the slower throughput will make the validation loop take longer. If accuracy recovers at
```
torch-simple
```
, CUDA graph capture or replay is the suspect.

If the issue disappears when a component is removed, that component is the suspect — note it and proceed to Step 2 targeting it. If the issue persists even at minimal config, the bug is in a core path (weight loading, attention, KV cache) — proceed to Phase 4.

Step 2 — Re-enable one component at a time:

Starting from the stripped-down configuration that still reproduces the issue, re-enable the suspected components individually — one per diagnostic run. Stop as soon as accuracy drops: the last re-enabled component is the offending pass or backend. Carry this finding into Phase 4 to investigate the root cause.

通过将环境简化到最基础形式，然后逐个重新启用组件，直到回归问题再次出现，从而缩小问题范围。

步骤1 — 简化为最小配置：

在可行的情况下，沿每个维度降低复杂度，每次更改后重新运行Phase 1的诊断：

移除分片或减少到TP=1 / 单GPU
禁用多流功能
在YAML配置中禁用非默认的变换通道（设置
```
enabled: false
```
）
切换回
torch-simple
编译后端：AutoDeploy目前支持两种后端——
```
torch-cudagraph
```
（CUDA图，典型生产设置）和
```
torch-simple
```
（无CUDA图，速度显著较慢）。如果模型配置为
```
torch-cudagraph
```
，切换到
```
torch-simple
```
并检查精度问题是否仍然存在。注意较慢的吞吐量会使验证循环耗时更长。如果切换到
```
torch-simple
```
后精度恢复，问题可能出在CUDA图捕获或重放环节。

如果移除某个组件后问题消失，该组件就是可疑对象——记录下来并进入步骤2针对性排查。即使在最小配置下问题仍然存在，说明bug位于核心路径（权重加载、注意力机制、KV缓存）——进入Phase 4。

步骤2 — 逐个重新启用组件：

从仍能复现问题的简化配置开始，逐个重新启用可疑组件——每次诊断运行只启用一个。一旦精度下降就停止：最后启用的组件就是有问题的通道或后端。将此发现带入Phase 4以调查根本原因。

Phase 4 — Root Cause Investigation

Phase 4 — 根本原因调查

This phase contains targeted investigation paths for known root-cause categories. Add model-specific or error-pattern-specific steps here as they are discovered.

本阶段包含针对已知根本原因类别的定向调查路径。随着新的根本原因或错误模式被发现，可在此添加模型特定或错误模式特定的步骤。

4a — Quantized Model Accuracy

4a — 量化模型精度

If the failing model is quantized (e.g. FP8, NVFP4), first verify whether the issue is in the quantization itself or in the quantized kernel path:

Step 1 — Test an unquantized baseline.

Ask the user for an unquantized (BF16/FP16) version of the same model. Run the Phase 1 diagnostic against it with an identical configuration (same

compile_backend

, same TP, same eval format).

If the unquantized model also fails: the bug is not quantization-related — the issue is in a transform pass, attention implementation, or weight loading. Continue with Phase 3 isolation against the unquantized model.
If the unquantized model passes: the accuracy gap is introduced by quantization or the quantized kernel path. Proceed to Step 2.

Step 2 — Suspect classification.

When quantization is confirmed as the source, the likely causes are (in rough order of severity):

Suspect	Symptom	How to isolate
Missing scale during dequantization	Near-zero or astronomically large logits; catastrophic accuracy loss (≈ random chance or worse)	Log a few raw logits; they will be wildly out of range
Inverted scale (multiplied instead of divided, or vice versa)	Similarly catastrophic; outputs plausible tokens but systematically wrong	Same logit inspection; compare scale values in the checkpoint vs what the kernel receives
Incorrect block-scale computation	Major but not catastrophic degradation; typically 5–20% below unquantized reference	Compare per-block scales against a reference quantizer on a few weight tensors
`.to(dtype)` used instead of `.view(dtype)` for packed format reinterpretation	Wrong scale or weight values without an error — `.to()` converts values numerically while `.view()` reinterprets the raw bits	Grep the quantization transform for `.to(` on quantized weight/scale tensors of packed types (FP4, FP8); the intent is bit-level reinterpretation, which requires `.view()`
Quantized kernel bug (wrong accumulation, wrong cast)	Non-catastrophic; may be input-dependent or shape-dependent	Step 3 below

Step 3 — Isolate quantized kernels via fake quantization.

AutoDeploy's transform pipeline has a built-in fake-quantization path that implements exactly Q→DQ→high-precision-matmul. Understanding the two stages helps:

Stage 1 (
pattern_matcher
): Replaces
```
nn.Linear
```
nodes with
```
torch.ops.auto_deploy.torch_fake_quant_fp8_linear
```
/
```
torch_fake_quant_nvfp4_linear
```
etc. These ops quantize the input, immediately dequantize both input and weight back to BF16/FP16, then run a standard
```
torch.matmul
```
. Scales are exercised but all arithmetic is in high precision. Implementation:
```
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
```
, lines 178–286.
Stage 2 (
post_load_fusion
):
```
fuse_fp8_linear
```
,
```
fuse_nvfp4_linear
```
, and
```
fuse_finegrained_fp8_linear
```
transforms replace the fake-quant ops with optimized low-precision kernels. This is where a kernel bug would be introduced.

To run inference in fake-quantization mode (bypassing the low-precision kernels), add the following to the YAML config file the test is already using (passed via

--config

--extra_llm_api_options

yaml

transforms:
  fuse_fp8_linear:
    enabled: false
  fuse_nvfp4_linear:
    enabled: false
  fuse_finegrained_fp8_linear:
    enabled: false
  # For MoE models also add:
  fuse_fp8_moe:
    enabled: false
  fuse_finegrained_fp8_moe:
    enabled: false
  fuse_nvfp4_moe:
    enabled: false

If there is no existing config file, create one with only the above content and pass it via

--extra_llm_api_options /path/to/fake_quant_debug.yaml

. The

transforms

key maps directly to

LlmArgs.transforms

; any transform not listed inherits its default from

tensorrt_llm/_torch/auto_deploy/config/default.yaml

If accuracy recovers with fake quantization, the quantized kernel (not the scales) is the bug. If accuracy is still wrong, the scales or weight data are the likely culprit.

如果失败的模型是量化模型（如FP8、NVFP4），首先验证问题是出在量化本身还是量化内核路径：

步骤1 — 测试未量化基准。

向用户索要同一模型的未量化（BF16/FP16）版本。使用完全相同的配置（相同的

compile_backend

、相同的TP、相同的评估格式）对其运行Phase 1诊断。

如果未量化模型也失败：bug与量化无关——问题出在变换通道、注意力实现或权重加载。针对未量化模型继续Phase 3的隔离排查。
如果未量化模型通过：精度差距由量化或量化内核路径引入。进入步骤2。

步骤2 — 可疑对象分类。

当确认量化是问题来源时，可能的原因按严重程度大致排序如下：

可疑对象	症状	隔离方法
反量化时缺少缩放因子	logits接近零或异常巨大；精度灾难性损失（≈随机水平或更差）	记录几个原始logits；它们会严重超出正常范围
缩放因子反转（本该乘却除，反之亦然）	同样是灾难性损失；输出看似合理的token但系统性错误	同样检查logits；对比检查点中的缩放值与内核接收的缩放值
块级缩放计算错误	严重但非灾难性的精度下降；通常比未量化参考值低5–20%	对比几个权重张量的块级缩放值与参考量化器的结果
对打包格式使用 `.to(dtype)` 而非 `.view(dtype)` 进行重新解释	缩放或权重值错误但无报错—— `.to()` 会数值转换值，而 `.view()` 会重新解释原始比特	在量化变换中搜索针对打包类型（FP4、FP8）的量化权重/缩放张量的 `.to(` 操作；如果意图是比特级重新解释，需要使用 `.view()`
量化内核bug（累加错误、类型转换错误）	非灾难性；可能依赖输入或形状	见下方步骤3

步骤3 — 通过伪量化隔离量化内核。

AutoDeploy的变换流水线内置了伪量化路径，实现精确的Q→DQ→高精度矩阵乘法。理解以下两个阶段会有所帮助：

阶段1（
pattern_matcher
）：将
```
nn.Linear
```
节点替换为
```
torch.ops.auto_deploy.torch_fake_quant_fp8_linear
```
/
```
torch_fake_quant_nvfp4_linear
```
等操作。这些操作会量化输入，立即将输入和权重反量化回BF16/FP16，然后运行标准的
```
torch.matmul
```
。缩放因子会被使用，但所有运算都在高精度下进行。实现位置：
```
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
```
，第178–286行。
阶段2（
post_load_fusion
）：
```
fuse_fp8_linear
```
、
```
fuse_nvfp4_linear
```
和
```
fuse_finegrained_fp8_linear
```
变换会替换伪量化操作为优化的低精度内核。内核bug通常在此阶段引入。

要在伪量化模式下运行推理（绕过低精度内核），在测试已使用的YAML配置文件中添加以下内容（通过

--config

或

--extra_llm_api_options

传入）：

yaml

transforms:
  fuse_fp8_linear:
    enabled: false
  fuse_nvfp4_linear:
    enabled: false
  fuse_finegrained_fp8_linear:
    enabled: false
  # 对于MoE模型还需添加：
  fuse_fp8_moe:
    enabled: false
  fuse_finegrained_fp8_moe:
    enabled: false
  fuse_nvfp4_moe:
    enabled: false

如果没有现有配置文件，创建仅包含上述内容的文件并通过

--extra_llm_api_options /path/to/fake_quant_debug.yaml

传入。

transforms

键直接映射到

LlmArgs.transforms

；未列出的变换会继承

tensorrt_llm/_torch/auto_deploy/config/default.yaml

中的默认设置。

如果伪量化模式下精度恢复，说明问题出在量化内核（而非缩放因子）。如果精度仍然错误，缩放因子或权重数据可能是罪魁祸首。

4b — Kernel Wrapper Hardcoded Assumptions

4b — 内核包装器硬编码假设

Symptom: coherent but systematically wrong output on one specific model family (or one model variant within a family), while a structurally similar model is fine.

Root cause pattern: The C++ kernel or its Python wrapper has a constant where it should read from the model config or from the actual tensor. Two common forms:

Config value hardcoded: a kernel wrapper assumes a default value instead of reading it from the HF config loaded at runtime.
Tensor stride/shape hardcoded: a C++ kernel assumes a specific memory layout that differs from what AutoDeploy actually passes. The kernel then reads the wrong memory locations silently.

How to investigate:

Identify which kernel is dispatched for the failing op (search for the op's Python entry point in
```
tensorrt_llm/_torch/auto_deploy/
```
).
Compare every constant or default in the kernel wrapper call against the corresponding field in the model's HF config (
```
config.json
```
). Flag any value that is not read from the config.
For stride bugs: print the actual strides of the tensors being passed to the kernel and compare against what the kernel expects (check the C++ kernel source for stride assumptions or parameters).
If a mismatch is found, the fix is to pass the config or tensor property as an explicit parameter rather than using a hardcoded constant.

症状： 在某一特定模型家族（或家族内的某一模型变体）上输出连贯但系统性错误，而结构相似的模型运行正常。

根本原因模式： C++内核或其Python包装器使用了常量，而实际上应该从模型配置或实际张量中读取值。常见形式有两种：

配置值硬编码：内核包装器假设使用默认值，而非从运行时加载的HF配置中读取。
张量步长/形状硬编码：C++内核假设特定的内存布局，但与AutoDeploy实际传递的布局不同。内核会静默读取错误的内存位置。

调查方法：

确定失败操作调度的内核（在
```
tensorrt_llm/_torch/auto_deploy/
```
中搜索操作的Python入口点）。
将内核包装器调用中的每个常量或默认值与模型HF配置（
```
config.json
```
）中的对应字段进行比较。标记任何未从配置中读取的值。
对于步长bug：打印传递给内核的张量的实际步长，并与内核预期的步长进行比较（检查C++内核源码中的步长假设或参数）。
如果发现不匹配，修复方法是将配置或张量属性作为显式参数传递，而非使用硬编码常量。

4c — Sharding-Related Accuracy (world_size > 1)

4c — 分片相关精度问题（world_size > 1）

First step: reproduce the issue at

world_size=1

. If accuracy recovers, the bug is in the sharding path. If it fails at

world_size=1

too, sharding is not the cause — return to Phase 3.

To run at

world_size=1

, set

world_size: 1

in the model's YAML config or pass

--extra_llm_api_options

with

world_size: 1

Known sharding bug patterns (check in order):

Suspect	Symptom	How to isolate
Wrong allreduce strategy	Non-deterministic or rank-dependent outputs; may appear only at TP≥4	Set `allreduce_strategy: NCCL` in the sharding transform config; the `AUTO` default has caused correctness issues in the past
Double `all_reduce` in MoE	MoE output doubled in magnitude; accuracy catastrophic	Inspect the exported graph; there should be exactly one `all_reduce` after the sum of routed and shared expert outputs, not one per branch
Head reshape with wrong stride after TP	Attention output garbage at TP>1, correct at TP=1	Reshapes that use concrete head counts from `torch.export` become wrong after TP splits the head dimension; these must use `torch.ops.auto_deploy.view` with `tp_scaled_dim`
Sharding a projection that must not be sharded	Dim-1 gating projections or latent projections sharded → wrong results	Check `tp_mode` on small-output projections (e.g. MoE router, MLA latent q_a/kv_a); they must be `"none"`
Nested parameter deletion breaking weight loading	Some weights missing after sharding, silently defaulting to zero or random	If sharding deletes parent module params and child params are looked up by the old path, the load hook may silently skip them

Validating a sharding fix:

If model size permits, run it at

world_size=1

(baseline), then

world_size=2

, then the target

world_size

. If accuracy is correct at TP=1 and TP=2 but wrong at TP=8, the bug is likely a head-count divisibility assumption (head dim must be divisible by the TP degree). If it is wrong at all TP>1, it is a structural sharding bug (missing allreduce, wrong split point, wrong stride).

第一步： 在

world_size=1

时复现问题。如果精度恢复，bug出在分片路径。如果

world_size=1

时也失败，分片不是原因——返回Phase 3。

要在

world_size=1

下运行，在模型的YAML配置中设置

world_size: 1

，或通过

--extra_llm_api_options

传入

world_size: 1

。

已知分片bug模式（按顺序检查）：

可疑对象	症状	隔离方法
错误的allreduce策略	输出非确定性或依赖rank；可能仅在TP≥4时出现	在分片变换配置中设置 `allreduce_strategy: NCCL` ；过去 `AUTO` 默认值曾导致正确性问题
MoE中重复 `all_reduce`	MoE输出幅度翻倍；精度灾难性损失	检查导出的图；在路由和共享专家输出求和后应该只有一个 `all_reduce` ，而非每个分支一个
TP后头部重塑使用错误步长	TP>1时注意力输出乱码，TP=1时正常	使用 `torch.export` 中具体头部数量的重塑在TP拆分头部维度后会出错；这些必须使用带 `tp_scaled_dim` 的 `torch.ops.auto_deploy.view`
对不应分片的投影进行分片	Dim-1门控投影或潜在投影被分片→结果错误	检查小输出投影（如MoE路由器、MLA潜在q_a/kv_a）的 `tp_mode` ；必须设置为 `"none"`
嵌套参数删除破坏权重加载	分片后部分权重缺失，静默默认值为零或随机值	如果分片删除了父模块参数，而子参数通过旧路径查找，加载钩子可能会静默跳过它们

验证分片修复：

如果模型大小允许，依次在

world_size=1

（基准）、

world_size=2

和目标

world_size

下运行。如果TP=1和TP=2时精度正确，但TP=8时错误，bug可能是头部数量可分性假设（头部维度必须能被TP度数整除）。如果所有TP>1时都错误，说明是结构性分片bug（缺少allreduce、错误的拆分点、错误的步长）。

Phase 5 — Per-Subject / Per-Category Breakdown

Phase 5 — 按主题/类别细分

When the overall score is lower than expected but not catastrophically wrong, look at per-subject or per-category breakdowns in the eval logs. Patterns to look for:

Pattern	Implication
All subjects uniformly ~N% below reference	Uniform precision loss — suspect FP8 kernel or attention scale
Specific subjects near 25% (random chance for 4-choice MCQ)	Those subjects have a systematic error — suspect subject length or chunked prefill
Easy subjects correct, hard subjects wrong	Near-decision-boundary sensitivity — suspect subtle numerical error
Subject-correlated errors	Prompt-length correlation — verify truncation behavior

For MCQ tasks like MMLU, random chance is 25%. Subjects scoring 25-35% may be genuinely hard for the model even in the PT backend — verify against PT per-subject scores before concluding an AD-specific bug.

当整体分数低于预期但未出现灾难性错误时，查看评估日志中的按主题或类别细分结果。需要关注的模式：

模式	含义
所有主题均比参考值低约N%	均匀精度损失——怀疑FP8内核或注意力缩放问题
特定主题分数接近25%（4选MCQ的随机水平）	这些主题存在系统性错误——怀疑主题长度或分块预填充问题
简单主题正确，复杂主题错误	决策边界附近敏感性问题——怀疑细微数值错误
与主题相关的错误	与提示长度相关——验证截断行为

对于MMLU等MCQ任务，随机水平为25%。分数在25-35%之间的主题可能即使在PT后端对模型来说也确实很难——在断定是AD特有bug之前，先与PT的按主题分数进行对比。

Phase 6 — Iterative Ablation

Phase 6 — 迭代消融测试

Once a hypothesis is formed, verify it by toggling one change at a time and re-running the diagnostic (50-100 samples is sufficient for 5%+ gaps).

Each ablation should be a separate diagnostic run. Do not batch multiple hypotheses in one run — it makes results ambiguous.

形成假设后，通过每次切换一个变更并重新运行诊断（对于5%以上的差距，50-100个样本足够）来验证。

每个消融测试应单独运行一次诊断。不要在一次运行中同时测试多个假设——这会导致结果模糊不清。

Anti-Patterns

反模式

Do not use 0-shot prompts to diagnose a 5-shot evaluator. Meta-responses ("Okay", "The answer is...") from a 0-shot run are a prompt-format artifact, not an AD inference bug.
Do not invoke
torchrun
for AD tests. The AD LLM API spawns MPI workers internally;
```
torchrun
```
adds a second layer of distributed init that deadlocks.
Do not override
LLM_MODELS_ROOT
. If it is already set in the environment (CI sets it to
```
/path/to/llm-models
```
), unsetting or overriding it breaks dataset lookups. Check
```
echo $LLM_MODELS_ROOT
```
before assuming it needs to be set.
Do not lower the reference threshold as a "fix." The reference value must be validated against the PT backend before being accepted. If PT also fails, re-examine the harness, not the threshold.
Do not apply the same load hook twice. If a hook converts interleaved → NeoX, applying it again corrupts the weights (it is not idempotent). Check the git log for reverts/restores before adding a hook that might already exist elsewhere in the call chain.

不要用0-shot提示诊断5-shot评估器。 0-shot运行产生的元响应（“好的”、“答案是...”）是提示格式的产物，而非AD推理bug。
不要为AD测试调用
torchrun
。 AD LLM API会在内部生成MPI工作进程；
```
torchrun
```
会添加第二层分布式初始化，导致死锁。
不要覆盖
LLM_MODELS_ROOT
。如果环境中已设置（CI中设置为
```
/path/to/llm-models
```
），取消设置或覆盖会破坏数据集查找。在假设需要设置之前，先检查
```
echo $LLM_MODELS_ROOT
```
。
不要降低参考阈值作为“修复”。 参考值必须先通过PT后端验证才能被接受。如果PT也失败，重新检查测试工具，而非阈值。
不要重复应用相同的加载钩子。 如果钩子将交错格式转换为NeoX格式，重复应用会损坏权重（操作不具备幂等性）。在添加可能已存在于调用链其他位置的钩子之前，检查git日志中的回滚/恢复记录。

Keeping This Skill Up to Date

保持本技能更新

Whenever this skill is used and the debugging session uncovers a new root cause or error pattern that is not yet described here, update the skill before closing the session.

Where to add new findings:

Phase 2 — Classify the Error Pattern: add a new row to the table if the session revealed a symptom → root cause mapping that is not already listed. Keep the "Output Pattern" column observable (something visible in diagnostic output), and the "Likely Root Cause" column actionable (points to a concrete next step or Phase 4 subsection).
Phase 4 — Root Cause Investigation: add the investigation steps under the most fitting existing subsection (4a quantization, 4b kernel wrapper assumptions, 4c sharding). If the finding does not fit any existing subsection, create a new one numbered sequentially (4d, 4e, …). Each subsection should follow the same structure: symptom, root cause pattern, and a numbered investigation procedure.

What is worth capturing:

A root cause that is not already represented in Phase 4.
A symptom pattern that allows earlier classification (Phase 2).
A configuration or environment condition that reproduces or masks the bug (Phase 3).
A new anti-pattern discovered during the session.

What is not worth capturing:

Model-specific quirks with no generalization potential.
Findings that duplicate what is already written.
Workarounds that paper over a bug rather than identifying it.

每当使用本技能且调试会话发现此处未描述的新根本原因或错误模式时，在关闭会话前更新本技能。

新增发现的添加位置：

Phase 2 — 错误模式分类：如果会话揭示了未列出的症状→根本原因映射，在表格中添加新行。保持“输出模式”列可观测（在诊断输出中可见的内容），“可能的根本原因”列可操作（指向具体的下一步或Phase 4小节）。
Phase 4 — 根本原因调查：将调查步骤添加到最合适的现有小节下（4a量化、4b内核包装器假设、4c分片）。如果发现的内容不适合任何现有小节，按顺序创建新的小节（4d、4e等）。每个小节应遵循相同的结构：症状、根本原因模式和编号的调查步骤。

值得记录的内容：

Phase 4中未涵盖的根本原因。
可更早分类的症状模式（Phase 2）。
可复现或掩盖bug的配置或环境条件（Phase 3）。
会话中发现的新反模式。

不值得记录的内容：

不具备泛化潜力的模型特定 quirks。
与已有内容重复的发现。
掩盖bug而非识别bug的临时解决方案。