ad-model-onboard


AutoDeploy Model Onboarding

Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.

Phase 0 — Gather All Resources Upfront

Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

Step 0 — GPU memory sanity check

Before anything else, check whether the model can fit on the current system.

Run `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` to print each GPU's total VRAM in MiB, one line per GPU, then sum the lines to get the system total.

Estimate the model's memory footprint from the HuggingFace model card or config (number of parameters × bytes per parameter, e.g. 7B params × 2 bytes ≈ 14 GB for bfloat16). This counts weights only, so treat it as a lower bound.
If the estimated size exceeds total system VRAM, stop and report this to the user. Do not proceed with onboarding until the user acknowledges and decides how to proceed. Example message: "This model requires ~X GB but the system only has Y GB across N GPUs. Onboarding is likely to fail at the e2e run stage."
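The check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the AutoDeploy tooling: the function names are hypothetical, the sample `nvidia-smi` output is made up, and the estimate covers weights only.

```python
def total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum the per-GPU MiB values printed by the nvidia-smi query (one value per line)."""
    return sum(int(value) for value in nvidia_smi_output.split())


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GiB: parameters x bytes per parameter."""
    return num_params * bytes_per_param / 2**30


def fits_in_vram(num_params: float, nvidia_smi_output: str, bytes_per_param: int = 2) -> bool:
    """True when the weights-only estimate does not exceed total system VRAM."""
    return weights_gib(num_params, bytes_per_param) * 1024 <= total_vram_mib(nvidia_smi_output)


sample = "81559\n81559\n"  # made-up query output for a two-GPU system
print(total_vram_mib(sample), round(weights_gib(7e9), 1), fits_in_vram(7e9, sample))
```

A `False` result is the signal to stop and report to the user. A `True` result only means the weights fit; activations and runtime buffers need additional headroom.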

Step 1 — Check local transformers install first:

`python -c "import transformers; print(transformers.__file__)"`

Look for `models/{model_type}/modeling_*.py` under the directory that contains the printed file. If found, use it directly; no network access is needed.
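The lookup can also be scripted. The sketch below assumes the path printed by the command above is passed in as a string; `find_modeling_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_modeling_files(transformers_init: str, model_type: str) -> list[Path]:
    """Return modeling_*.py files for model_type inside a transformers install.

    transformers_init is the value of transformers.__file__, i.e. the path
    to the package's __init__.py.
    """
    model_dir = Path(transformers_init).parent / "models" / model_type
    if not model_dir.is_dir():
        return []  # model_type is not shipped with this install
    return sorted(model_dir.glob("modeling_*.py"))


# Usage (requires transformers to be installed):
#   import transformers
#   print(find_modeling_files(transformers.__file__, "llama"))
```

An empty list means the architecture is not in the local install, which is the cue to move on to Step 2.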

Step 2 — If not found, download the HF repo (code only, skip weights):

`huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"`

This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are found automatically by `transformers.AutoConfig.from_pretrained` and similar APIs, so no extra path wiring is needed. Once downloaded, you can work fully offline: read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.

Phase 1 — Survey Existing Coverage & Analyze HF Model

Step 1 — Check for existing AD custom modeling code

Before writing anything, check if an AD custom model already covers this architecture:

Read the model's `config.json` to find its `model_type` and `architectures` fields.
Search `tensorrt_llm/_torch/auto_deploy/models/custom/` for existing `modeling_*.py` files that register the same config class name (grep for the `architectures` value or `model_type`).

Also check `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` for existing registrations.
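As a rough sketch of that search, the helper below reads `model_type` and `architectures` from a `config.json` and reports which `modeling_*.py` files mention either value. The function name is hypothetical and the plain substring match is only a first-pass filter; any hit still needs to be read by hand.

```python
import json
from pathlib import Path


def find_existing_coverage(config_json: str, custom_dir: str) -> list[str]:
    """List modeling_*.py files in custom_dir that mention the model's identifiers."""
    cfg = json.loads(Path(config_json).read_text(encoding="utf-8"))
    needles = [cfg.get("model_type")] + list(cfg.get("architectures") or [])
    needles = [n for n in needles if n]  # drop missing fields
    hits = []
    for path in sorted(Path(custom_dir).glob("modeling_*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if any(needle in text for needle in needles):
            hits.append(path.name)
    return hits


# Usage, with the repo path from the text above:
#   find_existing_coverage("config.json", "tensorrt_llm/_torch/auto_deploy/models/custom")
```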

If existing code is found:

Read it carefully. It may already handle this exact model — in which case no new modeling file is needed, only registry entries and possibly tests.
If the existing code covers a closely related model in the same family but needs adaptation (e.g., the family added MoE in a newer variant, or changed the attention type), decide whether to extend the existing file or create a new one. Prefer extending if the changes are minor; create a new file if the architecture diverges significantly. Report the decision and rationale to the user before proceeding.

If no existing code is found: proceed to write a new model file in Phase 2.

Step 2 — Survey the model family in the registry

Check `examples/auto_deploy/model_registry/models.yaml` for other models from the same family (e.g., if asked to onboard `Qwen/Qwen3-8B`, look for `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-32B`, `Qwen/Qwen3-235B-A22B`, etc.). Also check HuggingFace for the full set of model sizes/variants in the family.

Identify which family members already have registry entries and which are missing.
Identify which family members share the same architecture (same `model_type` / `architectures` in their config); these can all use a single modeling file.
Plan to onboard the entire family cohesively: one modeling file + one test file should cover all members that share an architecture. The registry should have entries for all commonly-used sizes.
Report the family survey findings to the user: which models exist, which are missing, and the proposed plan for covering them all.
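One way to organize the survey is to group family members by the `(model_type, architectures)` pair from their configs. The sketch below assumes the configs have already been loaded into dictionaries; the model IDs and config values are placeholders, not real registry data.

```python
from collections import defaultdict


def group_by_architecture(configs: dict[str, dict]) -> dict[tuple, list[str]]:
    """Group model IDs whose configs share model_type and architectures."""
    groups = defaultdict(list)
    for model_id, cfg in configs.items():
        key = (cfg.get("model_type"), tuple(cfg.get("architectures") or ()))
        groups[key].append(model_id)
    return dict(groups)


# Placeholder configs for an imaginary family with dense and MoE variants.
family = {
    "org/family-small": {"model_type": "fam", "architectures": ["FamForCausalLM"]},
    "org/family-large": {"model_type": "fam", "architectures": ["FamForCausalLM"]},
    "org/family-moe": {"model_type": "fam_moe", "architectures": ["FamMoeForCausalLM"]},
}
for key, members in group_by_architecture(family).items():
    print(key, members)
```

Each group corresponds to one modeling file plus one test file in the plan reported to the user.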

Step 3 — Analyze HF model architecture

Study the locally available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if` statements).

Phase 2 — Write a Lean Prefill-Only Model

Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a structural template only (class layout, dataclass outputs, forward signature).

The goal is a minimal prefill-only model for `torch.export` with AD canonical IR ops. Keep the code as lean as possible: every line should serve the export path. Do not port HF features that AD doesn't need.

Strip: KV cache, training paths, dropout, flash attention variants, `repeat_interleave` / `repeat_kv` for GQA (AD attention ops handle this natively), fallback logic for generating `position_ids` (assert instead), optional code paths gated on config flags irrelevant to prefill export.

Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.

Critical: Make sure the `nn.Module` hierarchy of the custom modeling code produces the parameter names that the checkpoint's safetensors JSON index expects.

Critical rule: Do NOT import or reuse existing AD custom model code (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source (`$CLONE_DIR/modeling_*.py`) as the source of truth for the model's logic and translate it fresh, even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.
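The module-hierarchy requirement in this phase can be checked mechanically. The sketch below assumes a sharded checkpoint that ships a `model.safetensors.index.json` with a `weight_map` (the usual HF layout) and that the custom model's parameter names come from `model.state_dict().keys()`; `diff_checkpoint_keys` is a hypothetical helper.

```python
import json


def diff_checkpoint_keys(index_json_text: str, model_keys: set[str]) -> tuple[set[str], set[str]]:
    """Compare checkpoint tensor names with the custom model's parameter names.

    Returns (in checkpoint but not in model, in model but not in checkpoint).
    """
    checkpoint_keys = set(json.loads(index_json_text)["weight_map"])
    return checkpoint_keys - model_keys, model_keys - checkpoint_keys


# Made-up index and parameter names, for illustration only.
index_text = json.dumps({"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
}})
model_keys = {"model.embed_tokens.weight", "model.layers.0.attn.q_proj.weight"}
missing, unexpected = diff_checkpoint_keys(index_text, model_keys)
print("missing in model:", sorted(missing))
print("unexpected in model:", sorted(unexpected))
```

Non-empty sets are not always errors, since `load_state_dict` pre-hooks may remap keys on purpose, but each difference should be explainable.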

Phase 3 — Use AutoDeploy Canonical Ops (CRITICAL)

Use `torch.ops.auto_deploy.torch_*` canonical ops WHENEVER POSSIBLE. These are the IR nodes that AD transforms later replace with optimized backends (triton, flashinfer, trtllm) at deployment time. If a canonical op exists for an operation, you MUST use it; do not reimplement the logic in plain PyTorch.

Available canonical ops (see `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md` for full list):

Attention: `torch_attention`, `torch_attention_sdpa`, `torch_attention_repeat_kv`
MLA: `torch_mla`
RoPE: `torch_rope_with_explicit_cos_sin`, `torch_rope_with_complex_freqs`, `torch_rope_with_qk_interleaving`
MoE: `torch_moe`, `torch_moe_fused`, `torch_moe_router`, `torch_moe_dense_mlp`
Normalization: `torch_rmsnorm`, `torch_rmsnorm_gated`, `torch_l2norm`
Linear: `torch_linear_simple`
SSM/Mamba: `torch_ssm`, `torch_causal_conv1d`
FLA: `torch_gated_delta_rule`
Quantization: `torch_quant_fp8_linear`, `torch_quant_nvfp4_linear`, etc.

Never use `triton_*`, `flashinfer_*`, or `trtllm_*` ops; backend selection happens later in AD transforms. Plain PyTorch is acceptable ONLY for operations where no canonical op exists (e.g., simple activation functions, embedding lookups, basic tensor arithmetic). If you find yourself writing manual attention, MoE routing, RoPE, or normalization in plain PyTorch, stop and use the canonical op instead.

Do NOT use `repeat_interleave` or `repeat_kv` for GQA. HF reference code often repeats K/V heads to match the Q head count before attention. The AD canonical attention ops (`torch_attention`, `torch_attention_sdpa`) handle GQA natively: they accept Q, K, V with different head counts and do the right thing internally. Manually repeating K/V heads is unnecessary bloat and prevents AD from optimizing the attention path.
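The two prohibitions above lend themselves to a quick lint pass over the new modeling file. The sketch below is a crude, regex-based illustration: `find_banned_ops` is a hypothetical helper, comment stripping is naive, and the op names in the sample source are invented.

```python
import re

# Backend-specific op prefixes and manual GQA head repetition, per the rules above.
BANNED = re.compile(r"\b(triton_\w+|flashinfer_\w+|trtllm_\w+|repeat_interleave|repeat_kv)\b")


def find_banned_ops(source: str) -> list[tuple[int, str]]:
    """Return (line number, identifier) for each banned name found outside comments."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: ignores '#' inside string literals
        hits.extend((lineno, match.group(1)) for match in BANNED.finditer(code))
    return hits


sample_source = (
    "k = k.repeat_interleave(4, dim=1)\n"
    "out = torch.ops.auto_deploy.torch_attention_repeat_kv(q, k, v)\n"
    "y = torch.ops.auto_deploy.triton_example_norm(x)  # invented op name\n"
)
print(find_banned_ops(sample_source))
```

The word-boundary anchors keep the canonical `torch_attention_repeat_kv` from being flagged, because `repeat_kv` there is preceded by an underscore rather than a boundary.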

Phase 4 — Register

Bottom of model file:

AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)

Add import + `__all__` entry in `models/custom/__init__.py`.
Prefer reusing the existing config class: if the config can be loaded via `AutoConfig.from_pretrained(model_id)` (either from the installed `transformers` or from files in the HF cache downloaded in Phase 0), import it from `transformers` and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls `AutoConfig.from_pretrained(model_id, trust_remote_code=True)` and passes the result to your model, so you rarely need to import the config at all. If you find yourself doing so, sanity-check that it's genuinely needed.
Only if the config is truly not available (not in `transformers` and not bundled with the checkpoint), define a minimal config class in the modeling file and call `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`. A good sanity check: if the E2E test passes without a custom config class, you don't need one, because `AutoConfig.from_pretrained` already picked up the right class.

Phase 5 — Model Input Contract

The custom model's forward signature must follow these rules:

Always `input_ids`: The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
Always `position_ids`: Vanilla sequential `position_ids` are always provided. Assert `position_ids is not None` at the top of the forward method; it is a required input, never optional. Do not include fallback logic to generate `position_ids` from `input_ids` (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla `position_ids`.
Multi-modal inputs: If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
No attention mask, no cache inputs, no HF-runtime features: Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
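The contract can be spot-checked with `inspect` before any export is attempted. This is an illustrative sketch only: `check_forward_signature` and the two stand-in `forward` functions are hypothetical, and the check covers named parameters, not what flows through `**kwargs`.

```python
import inspect

REQUIRED = ("input_ids", "position_ids")
FORBIDDEN = ("attention_mask", "past_key_values", "use_cache")


def check_forward_signature(forward) -> list[str]:
    """Return a list of contract violations for a forward function or method."""
    params = inspect.signature(forward).parameters
    problems = [f"missing required parameter: {name}" for name in REQUIRED if name not in params]
    problems += [f"forbidden HF-runtime parameter: {name}" for name in FORBIDDEN if name in params]
    return problems


def good_forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
    assert position_ids is not None, "position_ids is a required input"


def bad_forward(self, input_ids, attention_mask=None, past_key_values=None):
    pass


print(check_forward_signature(good_forward))
print(check_forward_signature(bad_forward))
```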

Phase 6 — Hierarchical Tests

Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as a template. No smoke tests. Use a small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if the HF class is unavailable.

HF Reference Strategy: Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs. Use actual HF classes if they exist — prefer importing directly over standalone HF-like implementations for unit tests. Standalone "reference" implementations are effectively alternative AD IR models and defeat the purpose of the reference test; they also tend to silently agree with whatever bugs exist in the custom model.

If HF modules exist in the installed `transformers`: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when a helper returns `None`.
If HF modules are NOT in the installed `transformers`: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation. Do NOT tweak the functionality of the reference. The same applies to config classes that use `trust_remote_code` (i.e., not available in `transformers`): copy a minimal faithful version into the test file. The modeling file should NOT import the config class; AD loads it at runtime via `AutoConfig.from_pretrained(..., trust_remote_code=True)`. The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
Weight conversion helpers: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using `load_state_dict` pre-hooks already registered on the custom model.

Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:

`from _model_test_utils import assert_rmse_close`

This computes `rmse(actual - expected) / rmse(expected)`, which is more robust than per-element `torch.testing.assert_close` since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., plain MLP with no custom ops).

Recommended `rmse_ratio_tol` values for bfloat16:

Identical math (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
MoE block (fused routing): `0.02`
Decoder layer / MoE layer / full model: `0.05`
Attention: `0.10`
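To make the metric concrete, here is a dependency-free illustration of the ratio `rmse(actual - expected) / rmse(expected)` on plain Python lists. It is not the `assert_rmse_close` implementation, whose exact signature is not shown in this document; the numbers are invented to show how a single outlier affects the ratio.

```python
import math


def rmse(values: list[float]) -> float:
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmse_ratio(actual: list[float], expected: list[float]) -> float:
    """Relative error: rmse(actual - expected) / rmse(expected)."""
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)


expected = [1.0] * 100
actual = [1.0] * 99 + [1.3]  # one outlier element, 30% off

ratio = rmse_ratio(actual, expected)
worst_element_error = max(abs(a - e) for a, e in zip(actual, expected))
print(f"rmse ratio: {ratio:.3f}, worst per-element error: {worst_element_error:.3f}")
```

With these values the ratio stays under the `0.05` layer-level tolerance even though the outlier would fail a per-element check at `1e-3`.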

Bottom-up levels (each must pass before next):

Block equivalence: Test MLP, Attention, MoE, Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
Layer equivalence: Full decoder layer. If the model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
Full model equivalence: End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
Export test: `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq, verify finite output, test a second shape.

Phase 7 — Independent Review (MANDATORY)

Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:

Model name
Path to the model file created
Path to the test file created

Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.

If the reviewer returns FAIL on any item:

Read the reviewer's specific failure reasons and file:line references
Fix each failed item
Invoke the reviewer again with the same minimal inputs
Repeat until you get a full PASS

Do NOT proceed to Phase 8 until the reviewer returns PASS.

Phase 8 — Create or Update Model Registry Entries (Including Family)

Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at `examples/auto_deploy/model_registry/`.

For each model (the requested model + any family members identified in Phase 1 Step 2):

Check `examples/auto_deploy/model_registry/models.yaml` for an existing entry matching the model's HF id.
If the entry is missing, add it with the appropriate `yaml_extra` list:
- Always include `dashboard_default.yaml` first.
- Pick `world_size_N.yaml` based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The `world_size` determines how many GPUs are needed for the run.
- Add model-specific YAML if the model needs custom settings (e.g., `model_kwargs`, non-default transforms).
If a model-specific config YAML is needed and doesn't exist, create it under `examples/auto_deploy/model_registry/configs/`. See existing configs for format examples.
If the entry exists but needs changes (e.g., wrong world_size, missing model-specific config), update it.

Family members that share the same architecture should all use the same modeling code. Different sizes only need different `world_size_N.yaml` entries and maybe different sharding configurations.

See `examples/auto_deploy/model_registry/README.md` for full documentation on the registry format and best practices.

Phase 9 — AutoDeploy End-to-End Run

⚠️ MANDATORY: You MUST use the standalone config YAML with `--args.yaml-extra` ⚠️

You MUST run the model using the standalone config YAML created in Phase 8. The same YAML will be referenced by the cookbook's `trtllm-serve` command in Phase 11. The command is:

`CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml`

The standalone config YAML under `examples/auto_deploy/model_registry/configs/` is self-contained: it includes all settings needed for running the model (compile backend, batch size, seq len, transforms, world_size, etc.). This is the same YAML that `trtllm-serve --extra_llm_api_options` will use in the cookbook, so validating it here ensures the cookbook works out of the box.
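When the run is launched from a script rather than a shell, the same command can be assembled as an argument list plus environment. The sketch below only builds the invocation; `build_ad_run` is a hypothetical helper, and the model ID and YAML path in the usage are placeholders.

```python
import os


def build_ad_run(model_id: str, config_yaml: str, gpus: list[int]) -> tuple[dict, list[str]]:
    """Return (env, argv) matching the build_and_run_ad.py command shown above."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpus))
    argv = [
        "python",
        "examples/auto_deploy/build_and_run_ad.py",
        "--model", model_id,
        "--args.yaml-extra", config_yaml,
    ]
    return env, argv


env, argv = build_ad_run(
    "org/model",  # placeholder HF model ID
    "examples/auto_deploy/model_registry/configs/model.yaml",  # placeholder path
    gpus=[0, 1],
)
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
# To execute from the repo root: subprocess.run(argv, env=env, check=True)
```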

If the run FAILS:

Fix the standalone config YAML: update settings in `examples/auto_deploy/model_registry/configs/<model>.yaml` and re-run.
The standalone config YAML is the source of truth. If it is wrong, fix it. If it is missing settings, add them. The model MUST work via this YAML before you are done.

Invoke the `ad-run-agent` subagent to run the model through AutoDeploy on GPU, in two steps:

Step 1: Reduced num layers. Run with reduced num layers to test the e2e flow for issues and iterate faster. The generation will be bad in step 1 because we are not loading all layers.

Step 2: Full layers. Run with full num layers. The generation should be coherent in step 2.

Pass it:

Model HF ID: the HuggingFace model-id (or local checkpoint path) used throughout onboarding
Standalone config YAML path: the path to the config YAML under `examples/auto_deploy/model_registry/configs/`
Description: a short description of the current state, e.g.:
- "first try after onboarding"
- "updated yaml with reduced layers"
- "changed attention backend to torch_mha"
- "fixed weight loading hooks"

The model is run via:

`CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml`

The `ad-run-agent` will determine the required `world_size` from the config YAML, check GPU availability via `nvidia-smi`, select free GPUs, and wait if not enough are available.

The `ad-run-agent` will build and run the model, check generation quality, archive logs, and update its worklog.

If the run fails or produces bad generation:

Read the ad-run-agent's worklog and log file to understand the error
Fix the issue (model code, standalone config YAML, weight hooks, etc.)
Re-invoke the ad-run-agent with an updated description reflecting the change (e.g., "retry after fixing RoPE scaling in config")
Always re-run with `--args.yaml-extra`. Fix the standalone config YAML; don't work around it.
Repeat until the run succeeds with meaningful generation

Do NOT proceed to Phase 10 until step 2 (full layers) reports a successful run with coherent generation.

Important: The successful E2E run outputs (prompts and generated text) will be needed for the cookbook notebook in Phase 11 and the summary report in Phase 12. Save them.

Phase 10 — Update Model Support Matrix

After a successful E2E run, update the TensorRT-LLM model support matrix at `docs/source/models/supported-models.md` to include the newly onboarded model.

Read the current support matrix to understand the format and existing entries.
Add a row to the "Supported Models" table (the first table in the file) with:
- `Architecture`: The model's architecture class name (e.g., `MiniMaxM2ForCausalLM`); use the class name registered in Phase 4.
- `Model`: The model family/display name (e.g., `MiniMax M2/M2.1/M2.7`).
- `HuggingFace Example`: A representative HF model ID (e.g., `MiniMaxAI/MiniMax-M2.7`).
- Place the new row alphabetically by architecture class name to keep the table sorted.
If the model is AutoDeploy-only (i.e., it does NOT have native PyTorch backend support in `tensorrt_llm/_torch/models/`), add a footnote indicating AutoDeploy support with a link to the AD config YAML, following the pattern of existing AD-only models (e.g., `[^N]: Supported via the [AutoDeploy](../features/auto_deploy/auto-deploy.md) backend. See [AD config](../../../examples/auto_deploy/model_registry/configs/<model>.yaml).`).
If the model warrants an entry in the Model-Feature Support Matrix (second table, typically for key/flagship models), add a row there too. For newly onboarded AD models, most advanced features should be marked `Untested` unless you have verified them. Use existing AD model entries (e.g., `Glm4MoeLiteForCausalLM`) as a reference for which features to mark as supported vs untested.
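Keeping the first table sorted can be done programmatically. The sketch below assumes each row is an `(architecture, model, hf_example)` tuple and renders a three-column Markdown row; the exact cell formatting of `supported-models.md` is an assumption, and the `Alpha`/`Zeta` rows are placeholders.

```python
Row = tuple[str, str, str]  # (Architecture, Model, HuggingFace Example)


def insert_row_sorted(rows: list[Row], new_row: Row) -> list[str]:
    """Insert new_row and return Markdown table rows sorted by architecture class name."""
    merged = sorted(rows + [new_row], key=lambda row: row[0].lower())
    return [f"| `{arch}` | {model} | `{hf_id}` |" for arch, model, hf_id in merged]


existing = [
    ("AlphaForCausalLM", "Alpha", "org/alpha"),  # placeholder row
    ("ZetaForCausalLM", "Zeta", "org/zeta"),  # placeholder row
]
new = ("MiniMaxM2ForCausalLM", "MiniMax M2/M2.1/M2.7", "MiniMaxAI/MiniMax-M2.7")
for line in insert_row_sorted(existing, new):
    print(line)
```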

Phase 11 — Create AutoDeploy Cookbook

Create an AutoDeploy cookbook notebook for the model, following the pattern of existing cookbooks.

Use `examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb` as the template. Copy its structure exactly.

Create the new notebook at `examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb`, using a snake_case version of the model name (e.g., `minimax_m2.7_trtllm_cookbook.ipynb`).

Adapt all model-specific content:
- Title and description: update the model name, HF model ID, and description.
- Model Resources: update links to the model's HuggingFace card, blog posts, technical reports, API platform, and community links. Search the web or the model's HF card for relevant URLs.
- Model Highlights: update architecture details (e.g., MoE params, context length, special features like tool calling, interleaved thinking, etc.) from the model card.
- Prerequisites: update VRAM requirements based on model size and precision.
- `trtllm-serve` command: update the model ID and use `--extra_llm_api_options` pointing to the standalone AD config YAML under `examples/auto_deploy/model_registry/configs/` (e.g., `examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml`). This is the same standalone config YAML validated in Phase 9 via `build_and_run_ad.py --args.yaml-extra`. It is self-contained: it includes all the settings `trtllm-serve` needs (compile backend, batch size, seq len, transforms, etc.).
- OpenAI client `MODEL_ID`: update to the correct HF model ID.
- Evaluation Parameters: update recommended inference parameters from the model's documentation/model card.
- Additional Resources: update all links to be model-specific.

Do NOT include cell outputs in the committed notebook. The notebook should be clean with no pre-run outputs, so users run it themselves. (Exception: if the model was already run and outputs were captured during Phase 9, you may include them for reference, but this is optional.)

Verify the notebook is valid JSON: malformed `.ipynb` files will not render on GitHub or in Jupyter.
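
The two closing checks (valid JSON, no pre-run outputs) can be scripted. The following is a minimal sketch using only the standard library; it assumes the usual nbformat v4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`), and the helper name `clean_notebook` is illustrative rather than something that exists in the repository.

```python
import json


def clean_notebook(text: str) -> str:
    """Parse notebook JSON, clear code-cell outputs, and return the cleaned JSON."""
    nb = json.loads(text)  # raises json.JSONDecodeError on malformed notebooks
    if not isinstance(nb, dict) or not isinstance(nb.get("cells"), list):
        raise ValueError("notebook has no top-level 'cells' list")
    for cell in nb["cells"]:
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop pre-run outputs
            cell["execution_count"] = None  # reset the execution counter
    return json.dumps(nb, indent=1, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    sample = json.dumps({
        "cells": [{
            "cell_type": "code",
            "source": ["print(1)"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
            "execution_count": 3,
            "metadata": {},
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    })
    cleaned = json.loads(clean_notebook(sample))
    print(cleaned["cells"][0]["outputs"])  # []
```

If you deliberately keep the Phase 9 outputs for reference, skip the clearing step and use only the `json.loads` call as the validity check.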

Phase 12 — Summary Report

⚠️ MANDATORY: You MUST include ALL raw prompts and generated outputs from the final `build_and_run_ad.py --args.yaml-extra` run ⚠️

Print (not file) after completion:

1. Model overview + unique features
2. Tricky parts needing human review
3. Files created/modified (including any new registry configs)
4. Test results table (name | validates | PASS/FAIL)
5. Known limitations
6. Reviewer result (PASS + how many review iterations it took)
7. AD end-to-end run result (success/fail, number of iterations, final generation quality)
8. Registry entry added/updated in `models.yaml` and any new config YAMLs created
9. ALL raw prompts and their corresponding generated outputs from the final successful `build_and_run_ad.py --args.yaml-extra` run. Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.
10. Model support matrix update: confirm the row was added to `docs/source/models/supported-models.md` and which footnote (if any) was used.
11. AutoDeploy cookbook created: path to the new notebook file (`examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb`).

Phase 13 — Prepare a Pull Request

GitHub CLI config: Before running any `gh` command, confirm which `GH_CONFIG_DIR` to use. The default is `~/.config/gh`, but a different directory may be needed when targeting a fork (e.g., `nv-auto-deploy/TensorRT-LLM` vs `NVIDIA/TensorRT-LLM`). Check if the user has specified a custom `GH_CONFIG_DIR` (e.g., in `CLAUDE.local.md` or environment). If not, ask the user before proceeding. Prefix all `gh` commands with:

GH_CONFIG_DIR=<path> gh ...

Prepare a pull request against `upstream` (https://github.com/NVIDIA/TensorRT-LLM) targeting branch `main`. Then ask the user to provide feedback on the PR and wait until the user tells you the feedback has been posted; continue iterating according to that feedback. Prepend every comment or other post with "[AGENT]" so that it is clear a coding agent posted it. When you post a PR, you MUST include:

ALL raw prompts and their complete generated outputs from the final successful `build_and_run_ad.py --args.yaml-extra` run. Copy-paste the COMPLETE prompt→output pairs verbatim; do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
A reproducible command:

bash

python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

A detailed pytest command for the unit tests you added so they can be run by the reviewer as well. Make sure you have run this pytest command on the latest commit that you are pushing, and include these results in the PR.

⚠️ MANDATORY: Re-run and re-post logs on EVERY PR update — NO EXCEPTIONS ⚠️

Every single time you push changes to the PR (whether it is a new commit, a rebase, an amendment, a fixup, or any other update) you MUST:

1. Re-run `build_and_run_ad.py --args.yaml-extra` using the `ad-run-agent` subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
2. Re-run the full unit test suite (`pytest <test_file> -v`) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
3. Post ALL raw output from both runs as a PR comment:
   - The COMPLETE prompt→output pairs from `build_and_run_ad.py` verbatim; do NOT summarize, truncate, or paraphrase.
   - The COMPLETE pytest output verbatim: every test name, every PASSED/FAILED line, every error traceback if any. Do NOT summarize or cherry-pick.

This is not optional. There are no exceptions. Even if the change seems trivial (a typo fix, a comment edit, a formatting change), both runs must be re-executed and the full raw logs must be posted. The reviewer cannot verify correctness without seeing generation output AND test results from the exact code that is currently on the branch.

Workflow for every PR update cycle:

1. Make the requested code changes
2. Commit the changes
3. Before pushing, always rebase onto the target branch to check for conflicts: `git fetch upstream && git rebase upstream/main`. If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first; the branch must be up-to-date with the target branch.
4. Push (or force-push if rebase rewrote history)
5. Re-invoke the `ad-run-agent` to run `build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml` on the updated code
6. Re-run the unit tests: `pytest <test_file> -v`
7. Wait for both runs to complete
8. Post a reply to every PR comment containing:
   - A brief description of what changed in this update
   - The COMPLETE raw prompts and generated outputs from the `build_and_run_ad.py` run
   - The COMPLETE raw pytest output (full verbatim log)
   - The reproducible commands used for both runs
9. Resume polling for new comments (see below)

⚠️ MANDATORY: Poll PR for new comments every 5 minutes ⚠️

After opening the PR and after every PR update you post, you MUST set up a polling loop that checks for new PR comments every 5 minutes. Do not simply post and walk away — actively monitor the PR for reviewer feedback.

How to poll:

bash

# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"

# Also check issue-level comments (top-level PR comments, not inline review comments)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"

# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state


**Polling loop behavior:**
1. After posting your PR (or posting an update comment), immediately start polling every 5 minutes.
2. On each poll, check for:
   - **New review comments** (inline or top-level) posted after your last comment's timestamp
   - **PR approval status** — check if the PR has been approved
   - **Termination signals** — any comment clearly indicating the agent's work is done (e.g., "LGTM", "looks good, we're done", "no more changes needed", "agent work complete", or similar)
3. If **new actionable comments are found**: stop polling, process the feedback, and execute the full PR update cycle (steps 1–8 above). After posting the update, resume polling.
4. If the **PR is approved** or a **termination signal** is found: stop polling, report to the user that the PR review cycle is complete, and end.
5. If **no new comments** are found: sleep 5 minutes and poll again.

**Do NOT stop polling prematurely.** The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.
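
The per-poll decision in the list above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: comments are assumed to be the JSON objects returned by the `gh api` calls (with `created_at` and `body` fields), the termination phrases are taken from the examples in the text, and the function names `fetch_comments` and `classify` are hypothetical.

```python
import json
import os
import subprocess
from datetime import datetime

# Example phrases from the text; extend as needed (assumption, not a fixed list).
TERMINATION_PHRASES = ("lgtm", "no more changes needed", "agent work complete")


def parse_ts(ts: str) -> datetime:
    """Parse a GitHub-style timestamp such as 2024-01-01T12:00:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def fetch_comments(endpoint: str, gh_config_dir: str) -> list:
    """Run `gh api <endpoint>` with GH_CONFIG_DIR set and parse the JSON reply."""
    env = dict(os.environ, GH_CONFIG_DIR=gh_config_dir)
    result = subprocess.run(["gh", "api", endpoint], env=env, check=True,
                            capture_output=True, text=True)
    return json.loads(result.stdout)


def classify(comments: list, last_posted: str) -> str:
    """Return 'terminate', 'update', or 'sleep' for one poll."""
    cutoff = parse_ts(last_posted)
    new = [c for c in comments
           if parse_ts(c["created_at"]) > cutoff
           and not c["body"].startswith("[AGENT]")]  # ignore the agent's own posts
    if any(p in c["body"].lower() for c in new for p in TERMINATION_PHRASES):
        return "terminate"
    return "update" if new else "sleep"
```

A full loop would call `fetch_comments` for both endpoints, also inspect the `reviews` field from `gh pr view` for an approval, and call `time.sleep(300)` whenever `classify` returns `"sleep"`.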

Sharding-aware IR model porting (`modeling_*_ir.py`)

Use this when porting an existing AutoDeploy custom model (`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py`) to explicit sharding hint ops in `modeling_*_ir.py` in the same directory (no separate `new_sharding/` tree). The exported FX graph must fully specify how the model should be sharded: the `apply_sharding_hints` transform combines hints with a runtime `DistConfig` for deterministic, node-local sharding.

Argument reference: Do not duplicate operator tables here. Refer to the custom op docstrings in `tensorrt_llm/_torch/auto_deploy/custom_ops/` for the complete argument reference (including sharding hints, `tp_mode`, `layer_type`, and which ops accept hints).

Reference examples (study before porting)

| Original | IR / sharding-aware | Layer types |
| --- | --- | --- |
| `modeling_nemotron_h.py` | `modeling_nemotron_h_ir.py` | Mamba SSM, MHA, SwiGLU MLP, MoE |
| `modeling_qwen3_5_moe.py` | `modeling_qwen3_5_moe_ir.py` | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| `modeling_mistral.py` | `modeling_mistral_ir.py` | MHA, SwiGLU MLP (simplest) |
| `modeling_deepseek_v2.py` | `modeling_deepseek_v2_ir.py` | MLA, SwiGLU MLP, MoE |

Step-by-step porting procedure

Step 1: Copy the source file

bash

cp tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo.py \
   tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo_ir.py

Step 2: Update the module docstring and add imports

At the top of the IR file:

python

import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401 -- register all ops

Do not add global `SHARD_*` flags. Layer-level control uses the `layer_type` hint on each op and `shard_layers` in YAML.

Step 3: Replace linear projections

For every `self.proj(x)` / `nn.Linear` call, use `torch.ops.auto_deploy.torch_linear_simple` with explicit `tp_mode` and `layer_type`. Always set `tp_mode` unconditionally (no `if _s else "none"`). Rules: opening projections (Q/K/V/gate/up/in_proj) → `"colwise"`; closing (O/down/out_proj) → `"rowwise"`; tiny outputs (e.g. `shared_expert_gate`, dim 1) → `"none"`; MLA latent projections (q_a, kv_a) → `"none"`. For fused weights split later, pass `output_sizes=[...]`. For GQA, use `tp_min_local_shape=self.head_dim` on K/V colwise lines.
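
The rules above amount to a lookup from projection role to `tp_mode`. The sketch below encodes them as plain Python so they can be sanity-checked while porting. The projection names in the three sets are illustrative assumptions (real attribute names vary per model), and `choose_tp_mode` is a hypothetical helper, not part of AutoDeploy.

```python
# Illustrative projection names; adjust to the attribute names in your model.
OPENING = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "in_proj"}
CLOSING = {"o_proj", "down_proj", "out_proj"}
REPLICATED = {"shared_expert_gate", "q_a_proj", "kv_a_proj"}  # tiny or MLA-latent


def choose_tp_mode(proj_name: str) -> str:
    """Map a projection name to the tp_mode hint prescribed by the rules above."""
    if proj_name in REPLICATED:
        return "none"
    if proj_name in OPENING:
        return "colwise"
    if proj_name in CLOSING:
        return "rowwise"
    raise KeyError(f"no sharding rule for projection {proj_name!r}")


if __name__ == "__main__":
    for name in ("q_proj", "down_proj", "shared_expert_gate"):
        print(name, "->", choose_tp_mode(name))
```

Raising on unknown names is deliberate: a projection without an explicit rule should be reviewed rather than silently left unsharded.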

Step 4: Replace split / chunk after fused colwise projections

Use `torch.ops.auto_deploy.split_with_sizes` with `shardable` / `layer_type` where sizes scale with TP.

Step 5: Replace view / reshape with concrete head counts

During `torch.export`, `-1` becomes a concrete value; after TP sharding, that baked-in value no longer matches the local tensor and the reshape breaks. Any reshape whose dimension is a head count that scales with TP must use `torch.ops.auto_deploy.view` with `tp_scaled_dim` set appropriately. Safe cases: flat-to-2D, or `[B,S,-1]` when the input is already correctly sharded.
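
The failure mode is plain shape arithmetic, which the sketch below reproduces without any AutoDeploy code. It models only the bookkeeping: `head_view_shape` is a hypothetical helper, and the real semantics of `tp_scaled_dim` are defined by the op's docstring, not by this example.

```python
from math import prod


def head_view_shape(batch, seq, num_heads, head_dim, tp_size=1, tp_scaled_dim=None):
    """Return the target view shape; optionally divide one dimension by tp_size."""
    shape = [batch, seq, num_heads, head_dim]
    if tp_scaled_dim is not None:
        if shape[tp_scaled_dim] % tp_size:
            raise ValueError("scaled dimension is not divisible by tp_size")
        shape[tp_scaled_dim] //= tp_size
    return shape


B, S, HEADS, HEAD_DIM, TP = 2, 16, 32, 128, 4
# Elements one rank holds after a colwise-sharded projection.
local_numel = B * S * (HEADS * HEAD_DIM // TP)

baked = head_view_shape(B, S, HEADS, HEAD_DIM)                       # frozen at export
scaled = head_view_shape(B, S, HEADS, HEAD_DIM, TP, tp_scaled_dim=2)  # head count / TP

print(prod(baked) == local_numel)            # False: the exported shape no longer fits
print(scaled, prod(scaled) == local_numel)   # [2, 16, 8, 128] True
```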

Step 6: Insert `all_reduce`

After every rowwise projection, add `torch.ops.auto_deploy.all_reduce(..., layer_type=...)`. Parallel branch rule: when branches merge by addition, use a single `all_reduce` after the sum (e.g. MoE routed + shared expert; parallel attention + MLP residual branches).
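
The parallel branch rule works because a sum all-reduce is linear: reducing the sum of two partial results gives the same answer as summing two separate reductions, at half the communication. A small simulation with Python lists standing in for per-rank tensors (the values are made up for illustration):

```python
def all_reduce(partials):
    """Element-wise sum across ranks, i.e. what a sum all-reduce computes."""
    return [sum(vals) for vals in zip(*partials)]


# Per-rank partial outputs of two branches (rank 0, rank 1).
routed = [[1.0, 2.0], [3.0, 4.0]]
shared = [[0.5, 0.5], [1.5, 2.5]]

# Two collectives: reduce each branch, then add.
two_reduces = [r + s for r, s in zip(all_reduce(routed), all_reduce(shared))]

# One collective: add locally on each rank, then reduce once.
merged = [[r + s for r, s in zip(rr, ss)] for rr, ss in zip(routed, shared)]
one_reduce = all_reduce(merged)

print(one_reduce, one_reduce == two_reduces)  # [6.0, 9.0] True
```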

Step 7: Special ops (Conv1d, SSM, GatedDeltaNet, gated RMSNorm)

Add sharding hints on `torch_causal_conv1d`, `torch_ssm`, `torch_gated_delta_rule`, and `torch_rmsnorm_gated` per their docstrings, typically `shardable` / `output_sizes` / `tp_mode` as required.

Step 8: MoE

Pass `layer_type="moe"` into `torch_moe`; `apply_sharding_hints` handles EP/TP.

Step 9: Register the IR model

Bottom of the IR file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)` (same pattern as Phase 4).

Add a side-effect import in `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` (e.g. `from . import modeling_foo_ir  # noqa: F401`) and extend `__all__` if you export symbols. Without this import, worker processes may not load your class and `apply_sharding_hints` can report 0 nodes processed. Do not use a separate `register_sharded_models.py` indirection.

Step 10: YAML — composable registry pattern

Prefer the model registry (`examples/auto_deploy/model_registry/models.yaml`) and compose shared fragments under `examples/auto_deploy/model_registry/configs/`, same as other models: list `dashboard_default.yaml`, the right `world_size_N.yaml`, then a dedicated fragment (e.g. `enable_sharder_ir.yaml`) that holds IR sharding transforms. That fragment should disable legacy sharding passes and enable hint-driven sharding. Registry fragments are deep-merged in `yaml_extra` order (see `DynamicYamlMixInForSettings` in `tensorrt_llm/_torch/auto_deploy/utils/_config.py`); place transform keys under `transforms:` so they merge with `dashboard_default.yaml`. Standalone experiment YAMLs for `build_and_run_ad` may wrap the same fields under a top-level `args:` block matching `LlmArgs`.

Example transform block:

yaml

# Typical contents for enable_sharder_ir.yaml (registry composable fragment)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2  # often required when expert count is large (>64)
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp']  # optional selective sharding
  gather_logits_before_lm_head:
    enabled: true


Use `world_size: 8` when validating TP head-divisibility. Optional `shard_layers` limits which `layer_type` hints are processed; unset means shard all shardable nodes.
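
To see why transform keys must sit under `transforms:`, it helps to picture what a deep merge in `yaml_extra` order does. The sketch below is a generic recursive merge over already-parsed dictionaries, written as an illustration; it is not the code in `_config.py`, and the contents of the `dashboard_default` dictionary are invented for the example.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; later values win."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical parsed fragments, listed in yaml_extra order.
dashboard_default = {
    "transforms": {"detect_sharding": {"stage": "sharding", "enabled": True}},
}
enable_sharder_ir = {
    "transforms": {
        "detect_sharding": {"enabled": False},
        "apply_sharding_hints": {"stage": "sharding", "enabled": True},
    },
}

config = {}
for fragment in (dashboard_default, enable_sharder_ir):
    config = deep_merge(config, fragment)

print(config["transforms"]["detect_sharding"])  # {'stage': 'sharding', 'enabled': False}
```

Because both fragments nest their keys under `transforms`, the later fragment overrides only `enabled` and leaves `stage` intact; a key placed at the top level instead would not merge with the defaults at all.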

Step 11: Validate

Do not report success until a run completes successfully.

- Prefer `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry` after adding/updating the registry entry and composable YAMLs (Phase 8–9 style).
- `apply_sharding_hints` logs should show N nodes processed with N > 0.
- If validation fails with infrastructure limits (e.g. head count not divisible by `world_size`), document the assert and compatible sizes; do not "fix" core `sharding.py` / custom op schemas without owner review.
- If blocked by missing infrastructure support, rename artifacts to `broken_modeling_*_ir.py` / broken YAML and file a short error report for humans (do not silently patch core transforms).
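
When documenting compatible sizes after a head-divisibility failure, a quick enumeration helps. This is a minimal sketch that assumes the only constraint is that the attention head count divides evenly by `world_size`; the actual asserts in the sharding code may impose further conditions (for example on KV heads), and `compatible_world_sizes` is a hypothetical helper.

```python
def compatible_world_sizes(num_heads: int, candidates=(1, 2, 4, 8)) -> list:
    """Return the candidate world sizes that divide the head count evenly."""
    return [ws for ws in candidates if num_heads % ws == 0]


if __name__ == "__main__":
    print(compatible_world_sizes(20))  # [1, 2, 4]
    print(compatible_world_sizes(48))  # [1, 2, 4, 8]
```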

Layer type strings (for `layer_type` / `shard_layers`): use `"mha"`, `"mla"`, `"mlp"`, `"moe"`, `"ssm"`, `"delta"`, or `"unknown"` (default; skipped when `shard_layers` is set). Match the conventions used in `apply_sharding_hints` and project enums.
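
The interaction between `layer_type` and `shard_layers` can be summarized as a small predicate. The sketch below reflects one reading of the text (unset `shard_layers` processes every hinted node; once it is set, only listed types are processed and `"unknown"` is always skipped). The function `should_process` is hypothetical and does not mirror the transform's actual code.

```python
VALID_LAYER_TYPES = {"mha", "mla", "mlp", "moe", "ssm", "delta", "unknown"}


def should_process(layer_type: str, shard_layers=None) -> bool:
    """Decide whether a hinted node is sharded under the given shard_layers setting."""
    if layer_type not in VALID_LAYER_TYPES:
        raise ValueError(f"unexpected layer_type {layer_type!r}")
    if shard_layers is None:
        return True  # unset: shard all shardable nodes
    return layer_type != "unknown" and layer_type in shard_layers


if __name__ == "__main__":
    print(should_process("moe", ["moe"]))      # True
    print(should_process("mlp", ["moe"]))      # False
    print(should_process("unknown", ["moe"]))  # False
```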

Layer-specific sharding patterns

MHA (standard or gated): `layer_type="mha"`: q/k/v colwise (GQA: `tp_min_local_shape`), `view` with `tp_scaled_dim` for head dim, o rowwise + `all_reduce`. Fused Q+gate interleaved per head: colwise without `output_sizes`; contiguous Q|K|V fused blocks need `output_sizes`.

SwiGLU MLP: `layer_type="mlp"`: gate/up colwise, down rowwise + `all_reduce`.

Mamba / SSM: `layer_type="ssm"`: in_proj colwise + `output_sizes`, splits shardable, conv1d shardable + `output_sizes`, views, `torch_ssm` shardable, norm gated colwise if weight scales, out rowwise + `all_reduce`.

GatedDeltaNet: `layer_type="delta"`: in_proj_qkv with `output_sizes`, other in_projs colwise, conv1d/splits/views as above, `torch_gated_delta_rule` shardable, out rowwise + `all_reduce`.

MoE + shared expert: `layer_type="moe"`: router replicated; one `all_reduce` after `routed + shared`, not two.

MLA (DeepSeek): `layer_type="mla"`: keep `torch_mla` intact with `shardable=True`; do not decompose into separate linears + `torch_attention` (introduces bad `expand` / `view` with concrete head counts). q_a/kv_a latent: `tp_mode="none"`; q_b colwise; `o_proj` rowwise + `all_reduce`.

Common pitfalls (sharding IR)

1. Missing `auto_deploy::view` for head reshapes: concrete shapes from export break after sharding.
2. Sharding tiny projections: dim-1 gates take `tp_mode="none"`.
3. Double `all_reduce` in MoE: use one merge-point reduction for routed + shared.
4. Cross-layer parameter contamination: in `_apply_hint_*` handlers using `get_source_nodes()`, restrict with `allowed_ops` so residual links do not pull weights from other layers.
5. Missing `num_moe_experts_for_export` for very large expert counts: export can hang.
6. Decomposing ops that absorb weights (e.g. `torch_mla`): use `shardable` + handler instead of splitting into plain linears.
7. Interleaved vs contiguous fused weights: interleaved per-head groups are colwise only; contiguous Q|K|V blocks require `output_sizes`.
8. Omitting `layer_type` when using `shard_layers`: `"unknown"` nodes are skipped; set hints explicitly on sharding-aware ops.
9. `layer_type` on non-hint ops: do not pass `layer_type` to ops that are not designed for sharding hints (e.g. `torch_attention`, `torch_l2norm`, `torch_rope_*`); extra positional args break calls. Confirm in `custom_ops/` docstrings which ops accept hints.
10. Conditional hint values: no `if _s else "none"`; use unconditional hints and rely on `shard_layers` / transform config.
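
The interleaved-versus-contiguous pitfall is easiest to see on a toy weight. In the sketch below, string labels stand in for weight rows and `shard_rows` is a hypothetical model of a colwise split, with `output_sizes` interpreted as "split each fused block separately" (my reading of the text, not the op's documented behavior).

```python
def shard_rows(rows, tp_size, rank, output_sizes=None):
    """Return the rows one rank keeps from a fused colwise weight."""
    sizes = output_sizes or [len(rows)]  # no sizes: treat the weight as one block
    kept, start = [], 0
    for size in sizes:
        block = rows[start:start + size]
        per_rank = size // tp_size
        kept += block[rank * per_rank:(rank + 1) * per_rank]
        start += size
    return kept


contiguous = ["Q0", "Q1", "K0", "K1", "V0", "V1"]   # Q|K|V blocks back to back
interleaved = ["Q0", "G0", "Q1", "G1"]              # per-head Q+gate groups

print(shard_rows(contiguous, 2, 0))                          # ['Q0', 'Q1', 'K0']  (wrong mix)
print(shard_rows(contiguous, 2, 0, output_sizes=[2, 2, 2]))  # ['Q0', 'K0', 'V0']
print(shard_rows(interleaved, 2, 1))                         # ['Q1', 'G1']
```

A naive split of the contiguous layout hands rank 0 all of Q and part of K, while the per-block split gives each rank one head of Q, K, and V. The interleaved layout already groups each head's rows together, so the plain split is correct.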

Sharding IR validation checklist (human review)

- `world_size=1`: unsharded path; hints should not break correctness.
- `world_size=2` and `8`: shape checks and coherent output.
- `apply_sharding_hints` node count vs expectation.
- Optional: `shard_layers: ['moe']` to verify selective sharding.

Key Gotchas

- Canonical ops first: Always use `torch.ops.auto_deploy.torch_*` canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
- No `repeat_interleave`: AD attention ops handle GQA natively. Never repeat K/V heads manually.
- Lean code: Every line should serve prefill export. No optional HF features, no dead code paths, no fallback logic.
- Reuse config classes: Import from `transformers` or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
- Assert `position_ids`: Always assert `position_ids is not None`; it is a required input, never optional.
- Self-contained files only: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from HF source.
- RoPE cos/sin: slice ONCE, not per layer. `_ad_` prefix for RoPE buffers. `RotaryEmbedding.forward(x, position_ids)` MUST slice by `position_ids` once and return pre-sliced `(cos, sin)`. Pass those tensors to all layers. NEVER pass `position_ids` through to each layer/attention forward to re-index; that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
- MoE weights: use `nn.ModuleList` per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused `trtllm` kernels at deployment time.
- Vision towers are typically not exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only `torch_*` prefixed reference ops in AutoDeploy, never `triton_*`, `flashinfer_*`, or `trtllm_*`.
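
As an illustration of the "test-only state_dict converter" mentioned for MoE weights, the sketch below unstacks one stacked expert entry into per-expert keys. The key names and the template are hypothetical (real checkpoints differ per model), and nested lists stand in for tensors; iterating a tensor along its first dimension behaves the same way.

```python
def unstack_experts(state_dict: dict, stacked_key: str, per_expert_template: str) -> dict:
    """Replace one stacked [num_experts, ...] entry with one entry per expert."""
    out = {k: v for k, v in state_dict.items() if k != stacked_key}
    for idx, weight in enumerate(state_dict[stacked_key]):
        out[per_expert_template.format(idx)] = weight
    return out


if __name__ == "__main__":
    # Hypothetical HF-style stacked weight with two experts.
    hf_state = {
        "mlp.experts.down_proj": [[[1, 2]], [[3, 4]]],
        "mlp.router.weight": [[0.1, 0.9]],
    }
    converted = unstack_experts(
        hf_state, "mlp.experts.down_proj", "mlp.experts.{}.down_proj.weight"
    )
    print(sorted(converted))
```

Keeping the converter in the test file (rather than in the model) matches the guidance above: the model stays checkpoint-compatible, and only the HF reference comparison needs the remapping.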
