ad-accuracy-debug
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseAutoDeploy Accuracy Debugging
AutoDeploy精度调试
Where this skill applies
本技能适用场景
This file is part of trtllm-agent-toolkit. Paths such as , , and
are relative to a TensorRT-LLM source checkout on the user's machine,
not the plugin repository.
tensorrt_llm/tests/examples/auto_deploy/本文档是trtllm-agent-toolkit的一部分。、和等路径相对于用户机器上的TensorRT-LLM源码检出目录,而非插件仓库。
tensorrt_llm/tests/examples/auto_deploy/Related skills in this plugin
本插件中的相关技能
- — inspect per-transform FX graph snapshots when Phase 2 suggests a transform was applied incorrectly or is corrupting activations.
trtllm-agent-toolkit:ad-graph-dump - — verify that precision or config settings (FP8, sharding, chunked prefill, etc.) were actually applied at runtime before attributing an accuracy gap to a kernel or weight bug.
trtllm-agent-toolkit:ad-conf-check
Input: model name, failing accuracy score, reference score, eval task (e.g. MMLU, GSM8K).
Output: identified root cause, minimal reproducer, and a code fix.
- — 当Phase 2显示某个变换应用错误或激活值被破坏时,检查每个变换的FX图快照。
trtllm-agent-toolkit:ad-graph-dump - — 在将精度差异归因于内核或权重bug之前,验证精度或配置设置(FP8、分片、分块预填充等)是否在运行时实际生效。
trtllm-agent-toolkit:ad-conf-check
输入: 模型名称、不合格的精度分数、参考分数、评估任务(如MMLU、GSM8K)。
输出: 已识别的根本原因、最小复现用例和代码修复方案。
Situation Assessment
情况评估
Before debugging, confirm:
- What is the reference score? Is it from the PyTorch backend test, a published leaderboard, or set manually?
- How large is the gap? A 1-2% gap may be within statistical noise; a 5%+ gap is a real bug.
- Is the eval framework itself suspect? Run the same eval on the PyTorch backend to validate the harness before blaming AutoDeploy.
调试前,请确认:
- 参考分数来源是什么? 是来自PyTorch后端测试、已发布的排行榜还是手动设置的?
- 分数差距有多大? 1-2%的差距可能在统计噪声范围内;5%以上的差距则是真实bug。
- 评估框架本身是否可疑? 在PyTorch后端运行相同的评估,先验证测试工具是否可靠,再归咎于AutoDeploy。
Abbreviations
缩写说明
- AD — AutoDeploy, TRT-LLM with backend
_autodeploy - PT — PyTorch, TRT-LLM with backend (manual deployment)
pytorch
- AD — AutoDeploy,使用后端的TRT-LLM
_autodeploy - PT — PyTorch,使用后端的TRT-LLM(手动部署)
pytorch
Phase 0 — Validate the Test Harness
Phase 0 — 验证测试工具
Run the equivalent PyTorch backend test on the same model and same eval task. If PT also fails or scores lower than expected, the issue is in the eval framework (prompt format, chat template, sampling params), not AD-specific.
Key things to verify in the eval harness:
- Prompt format / : does the evaluator send raw prompts or apply a chat template? The relationship is two-sided for reasoning/chat models:
apply_chat_template- Applying to a concatenated few-shot prompt (without
apply_chat_template) collapses the examples into a malformed single turn and can produce 0% accuracy.fewshot_as_multiturn - Omitting for a chat-first model can be equally wrong. For chat models on few-shot benchmarks, consider whether
apply_chat_templatepaired withapply_chat_template=Trueis appropriate — the latter turns each few-shot example into an explicit user/assistant exchange before the template is applied. (Reference: Qwen3.5-MoE accuracy fix infewshot_as_multiturn=True.)test_llm_api_autodeploy.py
- Applying
- for generation tasks: for benchmarks where the model must generate a full reasoning chain before the answer (e.g. GSM8K with a reasoning model), the default
max_output_lenmay truncate the response before the final answer is reached. Consider patching it up (e.g. 512) if outputs appear cut off. This is distinct from cappingMAX_OUTPUT_LENfor classification tasks like MMLU where you want to prevent long generations.max_tokens - for classification tasks: must be capped (e.g. 2 for MMLU) to prevent the model generating a full reasoning chain.
max_tokens - Dataset path: confirm is set correctly and the dataset directory exists.
LLM_MODELS_ROOT
If PT passes: the harness is fine. Proceed to Phase 1.
在相同模型和相同评估任务上运行等效的PyTorch后端测试。如果PT也失败或分数低于预期,问题出在评估框架(提示格式、聊天模板、采样参数),而非AD特有问题。
评估工具中需要验证的关键事项:
- 提示格式 / :评估器是发送原始提示还是应用聊天模板? 对于推理/聊天模型,这是双向关联的:
apply_chat_template- 对拼接的few-shot提示应用(未设置
apply_chat_template)会将示例折叠为格式错误的单轮对话,可能导致0%的精度。fewshot_as_multiturn - 针对优先支持聊天的模型省略同样会出错。 对于few-shot基准测试中的聊天模型,考虑是否适合将
apply_chat_template与apply_chat_template=True配合使用——后者会在应用模板前将每个few-shot示例转换为显式的用户/助手对话。 (参考:fewshot_as_multiturn=True中Qwen3.5-MoE的精度修复方案。)test_llm_api_autodeploy.py
- 对拼接的few-shot提示应用
- 生成任务的:对于模型必须在生成最终答案前生成完整推理链的基准测试(如使用推理模型的GSM8K),默认的
max_output_len可能在生成最终答案前截断响应。如果输出看起来被截断,考虑将其调高(如512)。这与MMLU等分类任务中限制MAX_OUTPUT_LEN以防止长生成的情况不同。max_tokens - 分类任务的:必须限制长度(如MMLU设置为2),以防止模型生成完整的推理链。
max_tokens - 数据集路径:确认已正确设置且数据集目录存在。
LLM_MODELS_ROOT
如果PT测试通过:工具可靠,进入Phase 1。
Phase 1 — Quick Diagnostic with a Small Sample
Phase 1 — 小样本快速诊断
Write a standalone diagnostic script that:
- Loads the AD model directly via
from tensorrt_llm._torch.auto_deploy import LLM as AutoDeployLLM - Reproduces the exact prompt format the evaluator uses (not a simplified variant), including few-shot examples if any
- Runs ~50-100 samples
- Prints per-sample and overall accuracy
(ref, output, correct)
Critical: reproduce the evaluator's exact prompt format. Deviating — for example, using a 0-shot prompt when the evaluator uses 5-shot — can cause thinking models to produce "Okay" or other meta-responses instead of the expected answer, making results uninterpretable. Verify the first printed prompt matches what the evaluator sends.
Typical evaluator sources:
- — 5-shot format with dev examples
tensorrt_llm/evaluate/mmlu.py - — few-shot with CoT references
tensorrt_llm/evaluate/gsm8k.py - —
tests/integration/defs/accuracy/accuracy_core.py,MAX_INPUT_LEN,MAX_OUTPUT_LENper taskNUM_SAMPLES
编写独立的诊断脚本,实现以下功能:
- 通过直接加载AD模型
from tensorrt_llm._torch.auto_deploy import LLM as AutoDeployLLM - 复现评估器使用的完全一致的提示格式(而非简化版本),包括可能存在的few-shot示例
- 运行约50-100个样本
- 打印每个样本的以及整体精度
(参考输出, 模型输出, 是否正确)
关键注意事项: 必须复现评估器的精确提示格式。如果偏离——例如评估器使用5-shot提示而你使用0-shot——会导致推理模型生成“好的”或其他元响应,而非预期答案,使结果无法解释。验证打印的第一个提示是否与评估器发送的一致。
典型评估器来源:
- — 带开发示例的5-shot格式
tensorrt_llm/evaluate/mmlu.py - — 带CoT参考的few-shot格式
tensorrt_llm/evaluate/gsm8k.py - — 每个任务的
tests/integration/defs/accuracy/accuracy_core.py、MAX_INPUT_LEN、MAX_OUTPUT_LEN设置NUM_SAMPLES
Phase 2 — Classify the Error Pattern
Phase 2 — 错误模式分类
From the diagnostic output, determine what the model is generating:
| Output Pattern | Likely Root Cause |
|---|---|
| Coherent but consistently wrong letter / answer | Numerical accuracy bug (attention, FP8 kernel, weight corruption) |
| Generates meta-text ("The user wants...", "The answer is...", "Let me think...") | Prompt format issue — model not primed to answer directly |
| Outputs empty string or EOS immediately | KV cache garbage (uninitialized cache, scale overflow), or |
| Completely random tokens / gibberish | Transformation applied incorrectly, load hook missing or applied twice, corrupted weights |
| Correct on easy subjects, wrong on hard subjects | Subtle numerical precision bug (FP8 kernel mismatch, attention scale wrong) |
| NaN in logits, especially on prefill | FX graph transform produced a node without shape metadata — enable |
Passes at | Sharding bug — see Phase 4c |
根据诊断输出,判断模型生成内容的模式:
| 输出模式 | 可能的根本原因 |
|---|---|
| 内容连贯但持续输出错误选项/答案 | 数值精度bug(注意力机制、FP8内核、权重损坏) |
| 生成元文本(“用户想要...”, “答案是...”, “让我想想...”) | 提示格式问题——模型未被正确引导直接作答 |
| 输出空字符串或立即输出EOS | KV缓存异常(未初始化缓存、缩放溢出),或 |
| 完全随机的token/乱码 | 变换应用错误、加载钩子缺失或重复应用、权重损坏 |
| 简单主题正确,复杂主题错误 | 细微的数值精度bug(FP8内核不匹配、注意力缩放错误) |
| logits中出现NaN,尤其是在预填充阶段 | FX图变换生成了缺少形状元数据的节点——启用 |
| 分片bug——见Phase 4c |
Phase 3 — Configuration Isolation
Phase 3 — 配置隔离
Narrow down which part of the setup is responsible by reducing the environment to its simplest
form, then re-enabling components one at a time until the regression reappears.
Step 1 — Strip to a minimal configuration:
Where feasible, reduce complexity along each axis, re-running the Phase 1 diagnostic after each
change:
- Remove sharding or reduce to TP=1 / single GPU
- Disable multi-streaming
- Disable non-default transform passes in the YAML config ()
enabled: false - Revert to
torch-simple: AutoDeploy currently supports two backends —compile_backend(CUDA graphs, the typical production setting) andtorch-cudagraph(no CUDA graphs, significantly slower). If the model is configured withtorch-simple, revert totorch-cudagraphand check whether the accuracy issue persists. Note the slower throughput will make the validation loop take longer. If accuracy recovers attorch-simple, CUDA graph capture or replay is the suspect.torch-simple
If the issue disappears when a component is removed, that component is the suspect — note it and
proceed to Step 2 targeting it. If the issue persists even at minimal config, the bug is in a core
path (weight loading, attention, KV cache) — proceed to Phase 4.
Step 2 — Re-enable one component at a time:
Starting from the stripped-down configuration that still reproduces the issue, re-enable the
suspected components individually — one per diagnostic run. Stop as soon as accuracy drops: the
last re-enabled component is the offending pass or backend. Carry this finding into Phase 4 to
investigate the root cause.
通过将环境简化到最基础形式,然后逐个重新启用组件,直到回归问题再次出现,从而缩小问题范围。
步骤1 — 简化为最小配置:
在可行的情况下,沿每个维度降低复杂度,每次更改后重新运行Phase 1的诊断:
- 移除分片或减少到TP=1 / 单GPU
- 禁用多流功能
- 在YAML配置中禁用非默认的变换通道(设置)
enabled: false - 切换回编译后端:AutoDeploy目前支持两种后端——
torch-simple(CUDA图,典型生产设置)和torch-cudagraph(无CUDA图,速度显著较慢)。如果模型配置为torch-simple,切换到torch-cudagraph并检查精度问题是否仍然存在。注意较慢的吞吐量会使验证循环耗时更长。如果切换到torch-simple后精度恢复,问题可能出在CUDA图捕获或重放环节。torch-simple
如果移除某个组件后问题消失,该组件就是可疑对象——记录下来并进入步骤2针对性排查。即使在最小配置下问题仍然存在,说明bug位于核心路径(权重加载、注意力机制、KV缓存)——进入Phase 4。
步骤2 — 逐个重新启用组件:
从仍能复现问题的简化配置开始,逐个重新启用可疑组件——每次诊断运行只启用一个。一旦精度下降就停止:最后启用的组件就是有问题的通道或后端。将此发现带入Phase 4以调查根本原因。
Phase 4 — Root Cause Investigation
Phase 4 — 根本原因调查
This phase contains targeted investigation paths for known root-cause categories. Add model-specific or error-pattern-specific steps here as they are discovered.
本阶段包含针对已知根本原因类别的定向调查路径。随着新的根本原因或错误模式被发现,可在此添加模型特定或错误模式特定的步骤。
4a — Quantized Model Accuracy
4a — 量化模型精度
If the failing model is quantized (e.g. FP8, NVFP4), first verify whether the issue
is in the quantization itself or in the quantized kernel path:
Step 1 — Test an unquantized baseline.
Ask the user for an unquantized (BF16/FP16) version of the same model. Run the Phase 1 diagnostic
against it with an identical configuration (same , same TP, same eval format).
compile_backend- If the unquantized model also fails: the bug is not quantization-related — the issue is in a transform pass, attention implementation, or weight loading. Continue with Phase 3 isolation against the unquantized model.
- If the unquantized model passes: the accuracy gap is introduced by quantization or the quantized kernel path. Proceed to Step 2.
Step 2 — Suspect classification.
When quantization is confirmed as the source, the likely causes are (in rough order of severity):
| Suspect | Symptom | How to isolate |
|---|---|---|
| Missing scale during dequantization | Near-zero or astronomically large logits; catastrophic accuracy loss (≈ random chance or worse) | Log a few raw logits; they will be wildly out of range |
| Inverted scale (multiplied instead of divided, or vice versa) | Similarly catastrophic; outputs plausible tokens but systematically wrong | Same logit inspection; compare scale values in the checkpoint vs what the kernel receives |
| Incorrect block-scale computation | Major but not catastrophic degradation; typically 5–20% below unquantized reference | Compare per-block scales against a reference quantizer on a few weight tensors |
| Wrong scale or weight values without an error — | Grep the quantization transform for |
| Quantized kernel bug (wrong accumulation, wrong cast) | Non-catastrophic; may be input-dependent or shape-dependent | Step 3 below |
Step 3 — Isolate quantized kernels via fake quantization.
AutoDeploy's transform pipeline has a built-in fake-quantization path that implements exactly
Q→DQ→high-precision-matmul. Understanding the two stages helps:
-
Stage 1 (): Replaces
pattern_matchernodes withnn.Linear/torch.ops.auto_deploy.torch_fake_quant_fp8_linearetc. These ops quantize the input, immediately dequantize both input and weight back to BF16/FP16, then run a standardtorch_fake_quant_nvfp4_linear. Scales are exercised but all arithmetic is in high precision. Implementation:torch.matmul, lines 178–286.tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py -
Stage 2 ():
post_load_fusion,fuse_fp8_linear, andfuse_nvfp4_lineartransforms replace the fake-quant ops with optimized low-precision kernels. This is where a kernel bug would be introduced.fuse_finegrained_fp8_linear
To run inference in fake-quantization mode (bypassing the low-precision kernels), add the
following to the YAML config file the test is already using (passed via or
):
--config--extra_llm_api_optionsyaml
transforms:
fuse_fp8_linear:
enabled: false
fuse_nvfp4_linear:
enabled: false
fuse_finegrained_fp8_linear:
enabled: false
# For MoE models also add:
fuse_fp8_moe:
enabled: false
fuse_finegrained_fp8_moe:
enabled: false
fuse_nvfp4_moe:
enabled: falseIf there is no existing config file, create one with only the above content and pass it via
. The key maps directly to
; any transform not listed inherits its default from
.
--extra_llm_api_options /path/to/fake_quant_debug.yamltransformsLlmArgs.transformstensorrt_llm/_torch/auto_deploy/config/default.yamlIf accuracy recovers with fake quantization, the quantized kernel (not the scales) is the bug.
If accuracy is still wrong, the scales or weight data are the likely culprit.
如果失败的模型是量化模型(如FP8、NVFP4),首先验证问题是出在量化本身还是量化内核路径:
步骤1 — 测试未量化基准。
向用户索要同一模型的未量化(BF16/FP16)版本。使用完全相同的配置(相同的、相同的TP、相同的评估格式)对其运行Phase 1诊断。
compile_backend- 如果未量化模型也失败:bug与量化无关——问题出在变换通道、注意力实现或权重加载。针对未量化模型继续Phase 3的隔离排查。
- 如果未量化模型通过:精度差距由量化或量化内核路径引入。进入步骤2。
步骤2 — 可疑对象分类。
当确认量化是问题来源时,可能的原因按严重程度大致排序如下:
| 可疑对象 | 症状 | 隔离方法 |
|---|---|---|
| 反量化时缺少缩放因子 | logits接近零或异常巨大;精度灾难性损失(≈随机水平或更差) | 记录几个原始logits;它们会严重超出正常范围 |
| 缩放因子反转(本该乘却除,反之亦然) | 同样是灾难性损失;输出看似合理的token但系统性错误 | 同样检查logits;对比检查点中的缩放值与内核接收的缩放值 |
| 块级缩放计算错误 | 严重但非灾难性的精度下降;通常比未量化参考值低5–20% | 对比几个权重张量的块级缩放值与参考量化器的结果 |
对打包格式使用 | 缩放或权重值错误但无报错—— | 在量化变换中搜索针对打包类型(FP4、FP8)的量化权重/缩放张量的 |
| 量化内核bug(累加错误、类型转换错误) | 非灾难性;可能依赖输入或形状 | 见下方步骤3 |
步骤3 — 通过伪量化隔离量化内核。
AutoDeploy的变换流水线内置了伪量化路径,实现精确的Q→DQ→高精度矩阵乘法。理解以下两个阶段会有所帮助:
-
阶段1():将
pattern_matcher节点替换为nn.Linear/torch.ops.auto_deploy.torch_fake_quant_fp8_linear等操作。这些操作会量化输入,立即将输入和权重反量化回BF16/FP16,然后运行标准的torch_fake_quant_nvfp4_linear。缩放因子会被使用,但所有运算都在高精度下进行。实现位置:torch.matmul,第178–286行。tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py -
阶段2():
post_load_fusion、fuse_fp8_linear和fuse_nvfp4_linear变换会替换伪量化操作为优化的低精度内核。内核bug通常在此阶段引入。fuse_finegrained_fp8_linear
要在伪量化模式下运行推理(绕过低精度内核),在测试已使用的YAML配置文件中添加以下内容(通过或传入):
--config--extra_llm_api_optionsyaml
transforms:
fuse_fp8_linear:
enabled: false
fuse_nvfp4_linear:
enabled: false
fuse_finegrained_fp8_linear:
enabled: false
# 对于MoE模型还需添加:
fuse_fp8_moe:
enabled: false
fuse_finegrained_fp8_moe:
enabled: false
fuse_nvfp4_moe:
enabled: false如果没有现有配置文件,创建仅包含上述内容的文件并通过传入。键直接映射到;未列出的变换会继承中的默认设置。
--extra_llm_api_options /path/to/fake_quant_debug.yamltransformsLlmArgs.transformstensorrt_llm/_torch/auto_deploy/config/default.yaml如果伪量化模式下精度恢复,说明问题出在量化内核(而非缩放因子)。如果精度仍然错误,缩放因子或权重数据可能是罪魁祸首。
4b — Kernel Wrapper Hardcoded Assumptions
4b — 内核包装器硬编码假设
Symptom: coherent but systematically wrong output on one specific model family (or one model
variant within a family), while a structurally similar model is fine.
Root cause pattern: The C++ kernel or its Python wrapper has a constant where it should read
from the model config or from the actual tensor. Two common forms:
- Config value hardcoded: a kernel wrapper assumes a default value instead of reading it from the HF config loaded at runtime.
- Tensor stride/shape hardcoded: a C++ kernel assumes a specific memory layout that differs from what AutoDeploy actually passes. The kernel then reads the wrong memory locations silently.
How to investigate:
- Identify which kernel is dispatched for the failing op (search for the op's Python entry point
in ).
tensorrt_llm/_torch/auto_deploy/ - Compare every constant or default in the kernel wrapper call against the corresponding field
in the model's HF config (). Flag any value that is not read from the config.
config.json - For stride bugs: print the actual strides of the tensors being passed to the kernel and compare against what the kernel expects (check the C++ kernel source for stride assumptions or parameters).
- If a mismatch is found, the fix is to pass the config or tensor property as an explicit parameter rather than using a hardcoded constant.
症状: 在某一特定模型家族(或家族内的某一模型变体)上输出连贯但系统性错误,而结构相似的模型运行正常。
根本原因模式: C++内核或其Python包装器使用了常量,而实际上应该从模型配置或实际张量中读取值。常见形式有两种:
- 配置值硬编码:内核包装器假设使用默认值,而非从运行时加载的HF配置中读取。
- 张量步长/形状硬编码:C++内核假设特定的内存布局,但与AutoDeploy实际传递的布局不同。内核会静默读取错误的内存位置。
调查方法:
- 确定失败操作调度的内核(在中搜索操作的Python入口点)。
tensorrt_llm/_torch/auto_deploy/ - 将内核包装器调用中的每个常量或默认值与模型HF配置()中的对应字段进行比较。标记任何未从配置中读取的值。
config.json - 对于步长bug:打印传递给内核的张量的实际步长,并与内核预期的步长进行比较(检查C++内核源码中的步长假设或参数)。
- 如果发现不匹配,修复方法是将配置或张量属性作为显式参数传递,而非使用硬编码常量。
4c — Sharding-Related Accuracy (world_size > 1)
4c — 分片相关精度问题(world_size > 1)
First step: reproduce the issue at . If accuracy recovers, the bug is in the
sharding path. If it fails at too, sharding is not the cause — return to Phase 3.
world_size=1world_size=1To run at , set in the model's YAML config or pass
with .
world_size=1world_size: 1--extra_llm_api_optionsworld_size: 1Known sharding bug patterns (check in order):
| Suspect | Symptom | How to isolate |
|---|---|---|
| Wrong allreduce strategy | Non-deterministic or rank-dependent outputs; may appear only at TP≥4 | Set |
Double | MoE output doubled in magnitude; accuracy catastrophic | Inspect the exported graph; there should be exactly one |
| Head reshape with wrong stride after TP | Attention output garbage at TP>1, correct at TP=1 | Reshapes that use concrete head counts from |
| Sharding a projection that must not be sharded | Dim-1 gating projections or latent projections sharded → wrong results | Check |
| Nested parameter deletion breaking weight loading | Some weights missing after sharding, silently defaulting to zero or random | If sharding deletes parent module params and child params are looked up by the old path, the load hook may silently skip them |
Validating a sharding fix:
If model size permits, run it at (baseline), then , then the target .
If accuracy is correct at TP=1 and TP=2 but wrong at TP=8, the bug is likely a head-count divisibility
assumption (head dim must be divisible by the TP degree). If it is wrong at all TP>1, it is a
structural sharding bug (missing allreduce, wrong split point, wrong stride).
world_size=1world_size=2world_size第一步: 在时复现问题。如果精度恢复,bug出在分片路径。如果时也失败,分片不是原因——返回Phase 3。
world_size=1world_size=1要在下运行,在模型的YAML配置中设置,或通过传入。
world_size=1world_size: 1--extra_llm_api_optionsworld_size: 1已知分片bug模式(按顺序检查):
| 可疑对象 | 症状 | 隔离方法 |
|---|---|---|
| 错误的allreduce策略 | 输出非确定性或依赖rank;可能仅在TP≥4时出现 | 在分片变换配置中设置 |
MoE中重复 | MoE输出幅度翻倍;精度灾难性损失 | 检查导出的图;在路由和共享专家输出求和后应该只有一个 |
| TP后头部重塑使用错误步长 | TP>1时注意力输出乱码,TP=1时正常 | 使用 |
| 对不应分片的投影进行分片 | Dim-1门控投影或潜在投影被分片→结果错误 | 检查小输出投影(如MoE路由器、MLA潜在q_a/kv_a)的 |
| 嵌套参数删除破坏权重加载 | 分片后部分权重缺失,静默默认值为零或随机值 | 如果分片删除了父模块参数,而子参数通过旧路径查找,加载钩子可能会静默跳过它们 |
验证分片修复:
如果模型大小允许,依次在(基准)、和目标下运行。如果TP=1和TP=2时精度正确,但TP=8时错误,bug可能是头部数量可分性假设(头部维度必须能被TP度数整除)。如果所有TP>1时都错误,说明是结构性分片bug(缺少allreduce、错误的拆分点、错误的步长)。
world_size=1world_size=2world_sizePhase 5 — Per-Subject / Per-Category Breakdown
Phase 5 — 按主题/类别细分
When the overall score is lower than expected but not catastrophically wrong, look at per-subject or per-category breakdowns in the eval logs. Patterns to look for:
| Pattern | Implication |
|---|---|
| All subjects uniformly ~N% below reference | Uniform precision loss — suspect FP8 kernel or attention scale |
| Specific subjects near 25% (random chance for 4-choice MCQ) | Those subjects have a systematic error — suspect subject length or chunked prefill |
| Easy subjects correct, hard subjects wrong | Near-decision-boundary sensitivity — suspect subtle numerical error |
| Subject-correlated errors | Prompt-length correlation — verify truncation behavior |
For MCQ tasks like MMLU, random chance is 25%. Subjects scoring 25-35% may be genuinely hard for the model even in the PT backend — verify against PT per-subject scores before concluding an AD-specific bug.
当整体分数低于预期但未出现灾难性错误时,查看评估日志中的按主题或类别细分结果。需要关注的模式:
| 模式 | 含义 |
|---|---|
| 所有主题均比参考值低约N% | 均匀精度损失——怀疑FP8内核或注意力缩放问题 |
| 特定主题分数接近25%(4选MCQ的随机水平) | 这些主题存在系统性错误——怀疑主题长度或分块预填充问题 |
| 简单主题正确,复杂主题错误 | 决策边界附近敏感性问题——怀疑细微数值错误 |
| 与主题相关的错误 | 与提示长度相关——验证截断行为 |
对于MMLU等MCQ任务,随机水平为25%。分数在25-35%之间的主题可能即使在PT后端对模型来说也确实很难——在断定是AD特有bug之前,先与PT的按主题分数进行对比。
Phase 6 — Iterative Ablation
Phase 6 — 迭代消融测试
Once a hypothesis is formed, verify it by toggling one change at a time and re-running the diagnostic (50-100 samples is sufficient for 5%+ gaps).
Each ablation should be a separate diagnostic run. Do not batch multiple hypotheses in one run — it makes results ambiguous.
形成假设后,通过每次切换一个变更并重新运行诊断(对于5%以上的差距,50-100个样本足够)来验证。
每个消融测试应单独运行一次诊断。不要在一次运行中同时测试多个假设——这会导致结果模糊不清。
Anti-Patterns
反模式
- Do not use 0-shot prompts to diagnose a 5-shot evaluator. Meta-responses ("Okay", "The answer is...") from a 0-shot run are a prompt-format artifact, not an AD inference bug.
- Do not invoke for AD tests. The AD LLM API spawns MPI workers internally;
torchrunadds a second layer of distributed init that deadlocks.torchrun - Do not override . If it is already set in the environment (CI sets it to
LLM_MODELS_ROOT), unsetting or overriding it breaks dataset lookups. Check/path/to/llm-modelsbefore assuming it needs to be set.echo $LLM_MODELS_ROOT - Do not lower the reference threshold as a "fix." The reference value must be validated against the PT backend before being accepted. If PT also fails, re-examine the harness, not the threshold.
- Do not apply the same load hook twice. If a hook converts interleaved → NeoX, applying it again corrupts the weights (it is not idempotent). Check the git log for reverts/restores before adding a hook that might already exist elsewhere in the call chain.
- 不要用0-shot提示诊断5-shot评估器。 0-shot运行产生的元响应(“好的”、“答案是...”)是提示格式的产物,而非AD推理bug。
- 不要为AD测试调用。 AD LLM API会在内部生成MPI工作进程;
torchrun会添加第二层分布式初始化,导致死锁。torchrun - 不要覆盖。 如果环境中已设置(CI中设置为
LLM_MODELS_ROOT),取消设置或覆盖会破坏数据集查找。在假设需要设置之前,先检查/path/to/llm-models。echo $LLM_MODELS_ROOT - 不要降低参考阈值作为“修复”。 参考值必须先通过PT后端验证才能被接受。如果PT也失败,重新检查测试工具,而非阈值。
- 不要重复应用相同的加载钩子。 如果钩子将交错格式转换为NeoX格式,重复应用会损坏权重(操作不具备幂等性)。在添加可能已存在于调用链其他位置的钩子之前,检查git日志中的回滚/恢复记录。
Keeping This Skill Up to Date
保持本技能更新
Whenever this skill is used and the debugging session uncovers a new root cause or error pattern
that is not yet described here, update the skill before closing the session.
Where to add new findings:
-
Phase 2 — Classify the Error Pattern: add a new row to the table if the session revealed a symptom → root cause mapping that is not already listed. Keep the "Output Pattern" column observable (something visible in diagnostic output), and the "Likely Root Cause" column actionable (points to a concrete next step or Phase 4 subsection).
-
Phase 4 — Root Cause Investigation: add the investigation steps under the most fitting existing subsection (4a quantization, 4b kernel wrapper assumptions, 4c sharding). If the finding does not fit any existing subsection, create a new one numbered sequentially (4d, 4e, …). Each subsection should follow the same structure: symptom, root cause pattern, and a numbered investigation procedure.
What is worth capturing:
- A root cause that is not already represented in Phase 4.
- A symptom pattern that allows earlier classification (Phase 2).
- A configuration or environment condition that reproduces or masks the bug (Phase 3).
- A new anti-pattern discovered during the session.
What is not worth capturing:
- Model-specific quirks with no generalization potential.
- Findings that duplicate what is already written.
- Workarounds that paper over a bug rather than identifying it.
每当使用本技能且调试会话发现此处未描述的新根本原因或错误模式时,在关闭会话前更新本技能。
新增发现的添加位置:
-
Phase 2 — 错误模式分类:如果会话揭示了未列出的症状→根本原因映射,在表格中添加新行。保持“输出模式”列可观测(在诊断输出中可见的内容),“可能的根本原因”列可操作(指向具体的下一步或Phase 4小节)。
-
Phase 4 — 根本原因调查:将调查步骤添加到最合适的现有小节下(4a量化、4b内核包装器假设、4c分片)。如果发现的内容不适合任何现有小节,按顺序创建新的小节(4d、4e等)。每个小节应遵循相同的结构:症状、根本原因模式和编号的调查步骤。
值得记录的内容:
- Phase 4中未涵盖的根本原因。
- 可更早分类的症状模式(Phase 2)。
- 可复现或掩盖bug的配置或环境条件(Phase 3)。
- 会话中发现的新反模式。
不值得记录的内容:
- 不具备泛化潜力的模型特定 quirks。
- 与已有内容重复的发现。
- 掩盖bug而非识别bug的临时解决方案。