ad-model-onboard

AutoDeploy Model Onboarding

Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.

Phase 0 — Gather All Resources Upfront

Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

Step 0 — GPU memory sanity check

Before anything else, check whether the model can fit on the current system.
  1. Run
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
    This prints one line per GPU with that GPU's total VRAM in MiB. Sum the lines to get the total VRAM across all GPUs on the system.
  2. Estimate the model's memory footprint from the HuggingFace model card or config (number of parameters × bytes per parameter, e.g. 7B params × 2 bytes = ~14 GB for bfloat16).
  3. If the estimated size exceeds total system VRAM, stop and report this to the user. Do not proceed with onboarding until the user acknowledges and decides how to proceed. Example message: "This model requires ~X GB but the system only has Y GB across N GPUs. Onboarding is likely to fail at the e2e run stage."
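The check above can be sketched in a few lines of Python. This is a minimal illustration, not part of AutoDeploy: the helper names (total_vram_mib, estimated_weights_gib, fits) are hypothetical, and the estimate covers weights only, so activations and KV cache would need extra headroom.

```python
import subprocess


def total_vram_mib(smi_output: str) -> int:
    """Sum the per-GPU memory.total values (one MiB integer per line)."""
    return sum(int(line) for line in smi_output.splitlines() if line.strip())


def estimated_weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only estimate: parameters x bytes per parameter, in GiB."""
    return num_params * bytes_per_param / 2**30


def fits(num_params: float, smi_output: str, bytes_per_param: int = 2) -> bool:
    """True if the weights-only estimate is within the summed VRAM."""
    needed_mib = estimated_weights_gib(num_params, bytes_per_param) * 1024
    return needed_mib <= total_vram_mib(smi_output)


def query_nvidia_smi() -> str:
    """Run the command from step 1 (needs an NVIDIA driver on the host)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
```

On a GPU host, fits(7e9, query_nvidia_smi()) would apply the bfloat16 example from step 2. A False result is the signal to stop and report to the user.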
Step 1 — Check local transformers install first:
bash
python -c "import transformers; print(transformers.__file__)"
Look for models/{model_type}/modeling_*.py under the directory that contains the printed file. If found, use it directly; no network access is needed.
Step 2 — If not found, download the HF repo (code only, skip weights):
bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
This downloads config, code, and tokenizer files into the standard HF cache ($HF_HOME or ~/.cache/huggingface/) while skipping large weight files. Files cached here are automatically found by transformers.AutoConfig.from_pretrained and similar APIs, so no extra path wiring is needed. Once downloaded you can work fully offline: read config.json and modeling_*.py from the cache snapshot directory printed by the command.
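The Step 1 lookup can also be scripted. The sketch below assumes the models/{model_type}/modeling_*.py layout described above; find_modeling_files is a hypothetical helper name.

```python
from pathlib import Path


def find_modeling_files(package_dir: Path, model_type: str) -> list[Path]:
    """Return modeling_*.py files under <package_dir>/models/<model_type>/."""
    model_dir = package_dir / "models" / model_type
    if not model_dir.is_dir():
        return []  # nothing local: fall through to the Step 2 download
    return sorted(model_dir.glob("modeling_*.py"))
```

A typical call would pass Path(transformers.__file__).parent as package_dir and the model_type value from config.json. An empty result means Step 2 is needed.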

Phase 1 — Survey Existing Coverage & Analyze HF Model

Step 1 — Check for existing AD custom modeling code

Before writing anything, check if an AD custom model already covers this architecture:
  1. Read the model's config.json to find its model_type and architectures fields.
  2. Search tensorrt_llm/_torch/auto_deploy/models/custom/ for existing modeling_*.py files that register the same config class name (grep for the architectures value or model_type).
  3. Also check tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py for existing registrations.
If existing code is found:
  • Read it carefully. It may already handle this exact model — in which case no new modeling file is needed, only registry entries and possibly tests.
  • If the existing code covers a closely related model in the same family but needs adaptation (e.g., the family added MoE in a newer variant, or changed the attention type), decide whether to extend the existing file or create a new one. Prefer extending if the changes are minor; create a new file if the architecture diverges significantly. Report the decision and rationale to the user before proceeding.
If no existing code is found: proceed to write a new model file in Phase 2.
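The search in steps 1 to 3 amounts to a text scan over the custom models directory. A minimal sketch, assuming a plain substring match is good enough for a first pass (find_existing_coverage is a hypothetical helper, not an AutoDeploy utility):

```python
from pathlib import Path


def find_existing_coverage(custom_dir: Path, needles: list[str]) -> dict[str, list[str]]:
    """Map each modeling_*.py file name to the search strings found in it."""
    hits: dict[str, list[str]] = {}
    for path in sorted(custom_dir.glob("modeling_*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = [needle for needle in needles if needle in text]
        if found:
            hits[path.name] = found
    return hits
```

The needles would be the architectures entries and the model_type read from config.json. Any hit still has to be read by hand to decide between reusing, extending, or writing a new file.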

Step 2 — Survey the model family in the registry

Check examples/auto_deploy/model_registry/models.yaml for other models from the same family (e.g., if asked to onboard Qwen/Qwen3-8B, look for Qwen/Qwen3-0.6B, Qwen/Qwen3-32B, Qwen/Qwen3-235B-A22B, etc.). Also check HuggingFace for the full set of model sizes/variants in the family.
  • Identify which family members already have registry entries and which are missing.
  • Identify which family members share the same architecture (same model_type / architectures in their config). These can all use a single modeling file.
  • Plan to onboard the entire family cohesively: one modeling file + one test file should cover all members that share an architecture. The registry should have entries for all commonly-used sizes.
  • Report the family survey findings to the user: which models exist, which are missing, and the proposed plan for covering them all.
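The family survey can be organized as two small computations: group members by architecture, then diff against the registry. A sketch under the assumption that each member's config.json has already been loaded into a dict (both function names are hypothetical):

```python
from collections import defaultdict


def group_by_architecture(configs: dict[str, dict]) -> dict[tuple, list[str]]:
    """Group HF model IDs by (model_type, architectures) from their configs."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for model_id, cfg in configs.items():
        key = (cfg.get("model_type"), tuple(cfg.get("architectures", [])))
        groups[key].append(model_id)
    return dict(groups)


def missing_from_registry(family: list[str], registered: list[str]) -> list[str]:
    """Family members that have no registry entry yet."""
    return sorted(set(family) - set(registered))
```

Each group returned by group_by_architecture corresponds to one modeling file plus one test file. The output of missing_from_registry is the list of entries to add later in Phase 8.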

Step 3 — Analyze HF model architecture

Study the locally-available config.json and modeling_*.py (NOT from tensorrt_llm/_torch/models/). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break torch.export (e.g. torch.nonzero, data-conditioned if).
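A quick text scan can surface candidate data-dependent ops before reading the file in detail. The sketch below is a heuristic only. It flags nonzero calls (named above) plus .item() and .tolist(), which are additions here because they also pull tensor values into Python. It cannot detect data-conditioned if statements on its own, so manual review is still needed.

```python
import re

# Heuristic patterns; labels and pattern choices are assumptions of this sketch.
SUSPECT_PATTERNS = {
    "nonzero": re.compile(r"\bnonzero\("),
    ".item()": re.compile(r"\.item\(\)"),
    ".tolist()": re.compile(r"\.tolist\(\)"),
}


def scan_source(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, pattern label, stripped line) for each suspect hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits
```

Running scan_source on the text of the HF modeling_*.py gives a short list of lines to inspect first.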

Phase 2 — Write a Lean Prefill-Only Model

Create tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py. Use modeling_glm4_moe_lite.py as a structural template only (class layout, dataclass outputs, forward signature).
The goal is a minimal prefill-only model for torch.export with AD canonical IR ops.
Keep the code as lean as possible: every line should serve the export path. Do not port HF features that AD doesn't need.
Strip: KV cache, training paths, dropout, flash attention variants, repeat_interleave / repeat_kv for GQA (AD attention ops handle this natively), fallback logic for generating position_ids (assert instead), optional code paths gated on config flags irrelevant to prefill export.
Keep: PreTrainedModel hierarchy, ModelOutput dataclass, minimal forward(input_ids, position_ids, inputs_embeds=None, **kwargs).
Critical: Make sure the nn.Module hierarchy of the custom modeling code matches what the checkpoint's safetensors JSON index expects.
Critical rule: Do NOT import or reuse existing AD custom model code (e.g. from .modeling_deepseek import ...). Every modeling_{name}.py must be self-contained. Use the HF source ($CLONE_DIR/modeling_*.py) as the source of truth for the model's logic and translate it fresh, even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.
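One way to check the module-hierarchy requirement is to compare parameter names. The sketch below assumes the checkpoint ships a sharded-checkpoint index (model.safetensors.index.json) whose weight_map maps parameter names to shard files; checkpoint_key_mismatch is a hypothetical helper.

```python
import json


def checkpoint_key_mismatch(index_json: str, model_keys: set[str]) -> tuple[set[str], set[str]]:
    """Compare checkpoint parameter names with the custom model's state_dict keys.

    Returns (keys only in the checkpoint, keys only in the model).
    """
    ckpt_keys = set(json.loads(index_json)["weight_map"])
    return ckpt_keys - model_keys, model_keys - ckpt_keys
```

Here model_keys would be set(model.state_dict().keys()) from the custom model. Keys that a load_state_dict pre-hook remaps on purpose will show up as differences, so treat the output as a list to explain, not as automatic failures.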

Phase 3 — Use AutoDeploy Canonical Ops (CRITICAL)

Use torch.ops.auto_deploy.torch_* canonical ops WHENEVER POSSIBLE.
These are the IR nodes that AD transforms later replace with optimized backends (triton, flashinfer, trtllm) at deployment time. If a canonical op exists for an operation, you MUST use it; do not reimplement the logic in plain PyTorch.
Available canonical ops (see tensorrt_llm/_torch/auto_deploy/custom_ops/README.md for full list):
  • Attention: torch_attention, torch_attention_sdpa, torch_attention_repeat_kv
  • MLA: torch_mla
  • RoPE: torch_rope_with_explicit_cos_sin, torch_rope_with_complex_freqs, torch_rope_with_qk_interleaving
  • MoE: torch_moe, torch_moe_fused, torch_moe_router, torch_moe_dense_mlp
  • Normalization: torch_rmsnorm, torch_rmsnorm_gated, torch_l2norm
  • Linear: torch_linear_simple
  • SSM/Mamba: torch_ssm, torch_causal_conv1d
  • FLA: torch_gated_delta_rule
  • Quantization: torch_quant_fp8_linear, torch_quant_nvfp4_linear, etc.
Never use triton_* / flashinfer_* / trtllm_* ops: backend selection happens later in AD transforms. Plain PyTorch is acceptable ONLY for operations where no canonical op exists (e.g., simple activation functions, embedding lookups, basic tensor arithmetic). If you find yourself writing manual attention, MoE routing, RoPE, or normalization in plain PyTorch, stop and use the canonical op instead.
Do NOT use repeat_interleave or repeat_kv for GQA.
HF reference code often repeats K/V heads to match the Q head count before attention. The AD canonical attention ops (torch_attention, torch_attention_sdpa) handle GQA natively: they accept Q, K, V with different head counts and do the right thing internally. Manually repeating K/V heads is unnecessary bloat and prevents AD from optimizing the attention path.
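The two prohibitions above (backend-specific ops and manual K/V repetition) lend themselves to a simple lint pass over the new modeling file. This is a hypothetical self-check, not an AutoDeploy tool. It matches names only, so a legitimate repeat_interleave unrelated to GQA would also be flagged and has to be judged by hand.

```python
import re

# Backend-prefixed op names, plus the helpers HF code uses to repeat K/V heads.
# The word boundaries keep canonical names such as torch_attention_repeat_kv
# from matching.
FORBIDDEN = re.compile(
    r"\b(?:triton|flashinfer|trtllm)_\w+|\brepeat_kv\b|\brepeat_interleave\b"
)


def find_forbidden(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched name) for every forbidden identifier."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in FORBIDDEN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

An empty result is necessary but not sufficient: hand-written attention or normalization in plain PyTorch will not be caught by a name-based scan.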

Phase 4 — Register

  1. Bottom of model file: AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM).
  2. Add import + __all__ entry in models/custom/__init__.py.
  3. Prefer reusing the existing config class. If the config can be loaded via AutoConfig.from_pretrained(model_id) (either from the installed transformers or from files in the HF cache downloaded in Phase 0), import it from transformers and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls AutoConfig.from_pretrained(model_id, trust_remote_code=True) and passes the result to your model, so you rarely need to import the config at all. If you find yourself doing so, sanity-check that it's genuinely needed.
  4. Only if the config is truly not available (not in transformers and not bundled with the checkpoint), define a minimal config class in the modeling file and call AutoConfig.register(model_type, ConfigCls, exist_ok=True). A good sanity check: if the E2E test passes without a custom config class, you don't need one, because AutoConfig.from_pretrained already picked up the right class.

Phase 5 — Model Input Contract

The custom model's forward signature must follow these rules:
  1. Always input_ids: The top-level model always receives input_ids. A submodule graph may internally receive inputs_embeds (e.g., after the embedding layer), but the exported entry point takes token IDs.
  2. Always position_ids: Vanilla sequential position_ids are always provided. Assert position_ids is not None at the top of the forward method. It is a required input, never optional. Do not include fallback logic to generate position_ids from input_ids (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla position_ids.
  3. Multi-modal inputs: If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside input_ids.
  4. No attention mask, no cache inputs, no HF-runtime features: Do not accept attention_mask, past_key_values, use_cache, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
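The signature rules above can be verified mechanically with the standard inspect module. A minimal sketch with a hypothetical helper name (check_forward_signature). It checks argument names only; it does not verify the in-body assert on position_ids.

```python
import inspect

REQUIRED_ARGS = ("input_ids", "position_ids")
FORBIDDEN_ARGS = ("attention_mask", "past_key_values", "use_cache")


def check_forward_signature(forward) -> list[str]:
    """Return a list of contract violations for a forward() callable."""
    params = inspect.signature(forward).parameters
    problems = []
    for name in REQUIRED_ARGS:
        if name not in params:
            problems.append(f"missing required argument: {name}")
    for name in FORBIDDEN_ARGS:
        if name in params:
            problems.append(f"forbidden HF-runtime argument: {name}")
    return problems
```

Calling check_forward_signature(MyModelForCausalLM.forward), where MyModelForCausalLM stands in for the new class, should return an empty list. A **kwargs catch-all is allowed by the contract and is not flagged.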

Phase 6 — Hierarchical Tests

Create tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py. Use test_glm4_moe_lite_modeling.py as template. No smoke tests. Small config (hidden=64, layers=2-3, vocab=1000). Use pytest.skip if HF class unavailable.
HF Reference Strategy: Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs. Use actual HF classes if they exist — prefer importing directly over standalone HF-like implementations for unit tests. Standalone "reference" implementations are effectively alternative AD IR models and defeat the purpose of the reference test; they also tend to silently agree with whatever bugs exist in the custom model.
  • If HF modules exist in the installed transformers: import them directly (e.g., from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM). Wrap imports in _get_hf_*_class() try/except helpers that return None on ImportError, and use pytest.skip when None.
  • If HF modules are NOT in the installed transformers: copy the minimal module definitions from the HF modeling_*.py source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific transformers version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation only. Do NOT tweak the functionality of the reference. The same applies to config classes that use trust_remote_code (i.e., not available in transformers): copy a minimal faithful version into the test file. The modeling file should NOT import the config class; AD loads it at runtime via AutoConfig.from_pretrained(..., trust_remote_code=True). The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
  • Weight conversion helpers: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using load_state_dict pre-hooks already registered on the custom model.
Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared assert_rmse_close utility from _model_test_utils:
python
from _model_test_utils import assert_rmse_close
This computes rmse(actual - expected) / rmse(expected), which is more robust than per-element torch.testing.assert_close since a few outlier elements won't fail the test. Use torch.testing.assert_close only for blocks with identical math (e.g., plain MLP with no custom ops).
Recommended rmse_ratio_tol values for bfloat16:
  • Identical math (MLP, Norm): use torch.testing.assert_close with tight rtol/atol (1e-3)
  • MoE block (fused routing): 0.02
  • Decoder layer / MoE layer / full model: 0.05
  • Attention: 0.10
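To make the metric concrete, here is a pure-Python sketch of the ratio rmse(actual - expected) / rmse(expected) on flat lists. It illustrates the formula only and is not the assert_rmse_close implementation from _model_test_utils, which operates on tensors.

```python
import math


def rmse(values: list[float]) -> float:
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmse_ratio(actual: list[float], expected: list[float]) -> float:
    """rmse(actual - expected) / rmse(expected)."""
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)


# One outlier among 100 elements: the element-wise error is 50%, which a
# per-element check at rtol=1e-3 would reject, while the ratio stays near 0.05.
expected = [1.0] * 100
actual = [1.5] + [1.0] * 99
ratio = rmse_ratio(actual, expected)
```

The example shows why the ratio tolerates isolated outliers: a single large deviation is averaged over all elements before it is compared with the tolerance.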
Bottom-up levels (each must pass before next):
  1. Block equivalence: Test MLP, Attention, MoE, Norm individually: same weights + same input → assert_rmse_close (or torch.testing.assert_close for identical-math blocks).
  2. Layer equivalence: Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
  3. Full model equivalence: End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
  4. Export test: Run torch_export_to_gm with Dim.DYNAMIC for batch+seq, verify finite output, and test a second shape.

Phase 7 — Independent Review (MANDATORY)

Invoke the ad-onboard-reviewer subagent with ONLY the following information:
  • Model name
  • Path to the model file created
  • Path to the test file created
Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.
If the reviewer returns FAIL on any item:
  1. Read the reviewer's specific failure reasons and file:line references
  2. Fix each failed item
  3. Invoke the reviewer again with the same minimal inputs
  4. Repeat until you get a full PASS
Do NOT proceed to Phase 8 until the reviewer returns PASS.

Phase 8 — Create or Update Model Registry Entries (Including Family)

Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at examples/auto_deploy/model_registry/.
For each model (the requested model + any family members identified in Phase 1 Step 2):
  1. Check examples/auto_deploy/model_registry/models.yaml for an existing entry matching the model's HF id.
  2. If the entry is missing, add it with the appropriate yaml_extra list:
    • Always include dashboard_default.yaml first.
    • Pick world_size_N.yaml based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The world_size determines how many GPUs are needed for the run.
    • Add model-specific YAML if the model needs custom settings (e.g., model_kwargs, non-default transforms).
  3. If a model-specific config YAML is needed and doesn't exist, create it under examples/auto_deploy/model_registry/configs/. See existing configs for format examples.
  4. If the entry exists but needs changes (e.g., wrong world_size, missing model-specific config), update it.
Family members that share the same architecture should all use the same modeling code. Different sizes only need different world_size_N.yaml entries and maybe different sharding configurations.
See examples/auto_deploy/model_registry/README.md for full documentation on the registry format and best practices.
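The world_size rule of thumb from step 2 can be written as a small lookup. This is a sketch with a hypothetical function name. The guidance above leaves 15-20B unspecified, so rounding that range up to 4 GPUs, and treating exactly 80B as 8, are assumptions made here.

```python
def pick_world_size(params_billions: float) -> int:
    """Map a parameter count (in billions) to a world_size_N.yaml suffix."""
    if params_billions < 2:
        return 1
    if params_billions <= 15:
        return 2
    if params_billions < 80:
        # Covers 20-80B from the guidance; 15-20B is an assumption (rounded up).
        return 4
    return 8
```

For the Qwen3 family used as an example earlier, pick_world_size(8) selects world_size_2.yaml and pick_world_size(235) selects world_size_8.yaml. The Step 0 memory estimate should still be checked against the GPUs actually available.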

Phase 9 — AutoDeploy End-to-End Run

⚠️ MANDATORY: You MUST use the standalone config YAML with --args.yaml-extra ⚠️

You MUST run the model using the standalone config YAML created in Phase 8. The same YAML will be referenced by the cookbook's trtllm-serve command in Phase 11. The command is:
bash
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
The standalone config YAML under examples/auto_deploy/model_registry/configs/ is self-contained: it includes all settings needed for running the model (compile backend, batch size, seq len, transforms, world_size, etc.). This is the same YAML that trtllm-serve --extra_llm_api_options will use in the cookbook, so validating it here ensures the cookbook works out of the box.
If the run FAILS:
  1. Fix the standalone config YAML: update settings in examples/auto_deploy/model_registry/configs/<model>.yaml and re-run.
  2. The standalone config YAML is the source of truth. If it is wrong, fix it. If it is missing settings, add them. The model MUST work via this YAML before you are done.
Invoke the ad-run-agent subagent to run the model through AutoDeploy on GPU, in two steps:
  • Step 1: Reduced num layers. Run with reduced num layers to test the e2e flow for issues and iterate faster. The generation will be bad in step 1 because we are not loading all layers.
  • Step 2: Full layers. Run with full num layers. The generation should be coherent in step 2.
Pass it:
  • Model HF ID: the HuggingFace model-id (or local checkpoint path) used throughout onboarding
  • Standalone config YAML path: the path to the config YAML under examples/auto_deploy/model_registry/configs/
  • Description: a short description of the current state, e.g.:
    • "first try after onboarding"
    • "updated yaml with reduced layers"
    • "changed attention backend to torch_mha"
    • "fixed weight loading hooks"
The model is run via:
bash
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
The ad-run-agent will determine the required world_size from the config YAML, check GPU availability via nvidia-smi, select free GPUs, and wait if not enough are available.
The ad-run-agent will build+run the model, check generation quality, archive logs, and update its worklog.
If the run fails or produces bad generation:
  1. Read the ad-run-agent's worklog and log file to understand the error
  2. Fix the issue (model code, standalone config YAML, weight hooks, etc.)
  3. Re-invoke the ad-run-agent with an updated description reflecting the change (e.g., "retry after fixing RoPE scaling in config")
  4. Always re-run with --args.yaml-extra. Fix the standalone config YAML, don't work around it.
  5. Repeat until the run succeeds with meaningful generation
Do NOT proceed to Phase 10 until step 2 (full layers) reports a successful run with coherent generation.
Important: The successful E2E run outputs (prompts and generated text) will be needed for the cookbook notebook in Phase 11 and the summary report in Phase 12. Save them.

Phase 10 — Update Model Support Matrix

After a successful E2E run, update the TensorRT-LLM model support matrix at docs/source/models/supported-models.md to include the newly onboarded model.
  1. Read the current support matrix to understand the format and existing entries.
  2. Add a row to the "Supported Models" table (the first table in the file) with:
    • Architecture: The model's architecture class name (e.g., MiniMaxM2ForCausalLM). Use the class name registered in Phase 4.
    • Model: The model family/display name (e.g., MiniMax M2/M2.1/M2.7).
    • HuggingFace Example: A representative HF model ID (e.g., MiniMaxAI/MiniMax-M2.7).
    • Place the new row alphabetically by architecture class name to keep the table sorted.
  3. If the model is AutoDeploy-only (i.e., it does NOT have native PyTorch backend support in tensorrt_llm/_torch/models/), add a footnote indicating AutoDeploy support with a link to the AD config YAML, following the pattern of existing AD-only models (e.g., [^N]: Supported via the [AutoDeploy](../features/auto_deploy/auto-deploy.md) backend. See [AD config](../../../examples/auto_deploy/model_registry/configs/<model>.yaml).).
  4. If the model warrants an entry in the Model-Feature Support Matrix (second table, typically for key/flagship models), add a row there too. For newly onboarded AD models, most advanced features should be marked Untested unless you have verified them. Use existing AD model entries (e.g., Glm4MoeLiteForCausalLM) as a reference for which features to mark as supported vs untested.
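Keeping the table sorted when adding the row in step 2 is a sorted insert. The sketch below assumes the table body is a list of markdown pipe-table rows already sorted by the first cell, and that the comparison is case-insensitive. Neither the row layout nor the case handling is confirmed by the support matrix itself, and insert_row_sorted is a hypothetical helper.

```python
import bisect


def _architecture_key(row: str) -> str:
    """First cell of a pipe-table row, without backticks, lowercased."""
    first_cell = row.strip().strip("|").split("|")[0]
    return first_cell.strip().strip("`").lower()


def insert_row_sorted(rows: list[str], new_row: str) -> list[str]:
    """Return a new row list with new_row placed in alphabetical position."""
    keys = [_architecture_key(row) for row in rows]
    position = bisect.bisect_left(keys, _architecture_key(new_row))
    return rows[:position] + [new_row] + rows[position:]
```

The function leaves the header and separator rows to the caller, so only the body rows should be passed in.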

Phase 11 — Create AutoDeploy Cookbook

Create an AutoDeploy cookbook notebook for the model, following the pattern of existing cookbooks.
  1. Use
    examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb
    as the template.
    Copy its structure exactly.
  2. Create the new notebook at
    examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb
    , using a snake_case version of the model name (e.g.,
    minimax_m2.7_trtllm_cookbook.ipynb
    ).
  3. Adapt all model-specific content:
    • Title and description: update the model name, HF model ID, and description.
    • Model Resources: update links to the model's HuggingFace card, blog posts, technical reports, API platform, and community links. Search the web or the model's HF card for relevant URLs.
    • Model Highlights: update architecture details (e.g., MoE params, context length, special features like tool calling, interleaved thinking, etc.) from the model card.
    • Prerequisites: update VRAM requirements based on model size and precision.
    • trtllm-serve
      command: update the model ID and use
      --extra_llm_api_options
      pointing to the standalone AD config YAML under
      examples/auto_deploy/model_registry/configs/
      (e.g.,
      examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml
      ). This is the same standalone config YAML validated in Phase 9 via
      build_and_run_ad.py --args.yaml-extra
      . It is self-contained — it includes all the settings
      trtllm-serve
      needs (compile backend, batch size, seq len, transforms, etc.).
    • OpenAI client
      MODEL_ID
      : update to the correct HF model ID.
    • Evaluation Parameters: update recommended inference parameters from the model's documentation/model card.
    • Additional Resources: update all links to be model-specific.
  4. Do NOT include cell outputs in the committed notebook — the notebook should be clean with no pre-run outputs, so users run it themselves. (Exception: if the model was already run and outputs were captured during Phase 9, you may include them for reference, but this is optional.)
  5. Verify the notebook is valid JSON — malformed
    .ipynb
    files will not render on GitHub or in Jupyter.
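Steps 4 and 5 can be checked mechanically. The sketch below uses only the Python standard library and relies on the standard notebook JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The function names are hypothetical. The output check reflects the default rule of committing a clean notebook; skip it if you deliberately keep Phase 9 outputs.

```python
import json


def check_notebook_text(text):
    """Return a list of problems for the raw text of a .ipynb file ([] means clean)."""
    try:
        nb = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(nb, dict) or not isinstance(nb.get("cells"), list):
        return ["missing top-level 'cells' list"]

    problems = []
    for i, cell in enumerate(nb["cells"]):
        if cell.get("cell_type") != "code":
            continue  # only code cells carry outputs
        if cell.get("outputs"):
            problems.append(f"cell {i} has saved outputs")
        if cell.get("execution_count") is not None:
            problems.append(f"cell {i} has an execution_count")
    return problems


def check_notebook_file(path):
    """Read a notebook from disk and run the same checks."""
    with open(path, encoding="utf-8") as f:
        return check_notebook_text(f.read())
```

For example, `check_notebook_file("examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb")` (with the placeholder filled in) should return an empty list before the notebook is committed.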

Phase 12 — Summary Report

⚠️ MANDATORY: You MUST include ALL raw prompts and generated outputs from the final
build_and_run_ad.py --args.yaml-extra
run ⚠️

After completion, print the following to the console (do not write it to a file):
  1. Model overview + unique features
  2. Tricky parts needing human review
  3. Files created/modified (including any new registry configs)
  4. Test results table (name | validates | PASS/FAIL)
  5. Known limitations
  6. Reviewer result (PASS + how many review iterations it took)
  7. AD end-to-end run result (success/fail, number of iterations, final generation quality)
  8. Registry entry added/updated in
    models.yaml
    and any new config YAMLs created
  9. ALL raw prompts and their corresponding generated outputs from the final successful
    build_and_run_ad.py --args.yaml-extra
    run.
    Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.
  10. Model support matrix update — confirm the row was added to
    docs/source/models/supported-models.md
    and which footnote (if any) was used.
  11. AutoDeploy cookbook created — path to the new notebook file (
    examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb
    ).

Phase 13 — Prepare a Pull Request

GitHub CLI config: Before running any
gh
command, confirm which
GH_CONFIG_DIR
to use. The default is
~/.config/gh
, but a different directory may be needed when targeting a fork (e.g.,
nv-auto-deploy/TensorRT-LLM
vs
NVIDIA/TensorRT-LLM
). Check if the user has specified a custom
GH_CONFIG_DIR
(e.g., in
CLAUDE.local.md
or environment). If not, ask the user before proceeding. Prefix all
gh
commands with:
GH_CONFIG_DIR=<path> gh ...
Prepare a pull request against
upstream
(https://github.com/NVIDIA/TensorRT-LLM) targeting branch
main
. Then, ask the user to provide feedback on the PR and wait for the user to get back to you when the feedback has been posted. Then continue iterating according to the user's feedback. For any comment or other post, please prepend your message with "[AGENT]" so that it is clear that this was a coding agent posting the comment. When you post a PR, you MUST include:
  1. ALL raw prompts and their complete generated outputs from the final successful
    build_and_run_ad.py --args.yaml-extra
    run. Copy-paste the COMPLETE prompt→output pairs verbatim — do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
  2. A reproducible command:
bash
python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
  3. A detailed pytest command for the unit tests you added so they can be run by the reviewer as well. Make sure you have run this pytest command on the latest commit that you are pushing, and include these results in the PR.

⚠️ MANDATORY: Re-run and re-post logs on EVERY PR update — NO EXCEPTIONS ⚠️

Every single time you push changes to the PR — whether it is a new commit, a rebase, an amendment, a fixup, or any other update — you MUST:
  1. Re-run
    build_and_run_ad.py --args.yaml-extra
    using the
    ad-run-agent
    subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
  2. Re-run the full unit test suite (
    pytest <test_file> -v
    ) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
  3. Post ALL raw output from both runs as a PR comment:
    • The COMPLETE prompt→output pairs from
      build_and_run_ad.py
      verbatim — do NOT summarize, truncate, or paraphrase.
    • The COMPLETE pytest output verbatim — every test name, every PASSED/FAILED line, every error traceback if any. Do NOT summarize or cherry-pick.
This is not optional. There are no exceptions. Even if the change seems trivial (a typo fix, a comment edit, a formatting change), both runs must be re-executed and the full raw logs must be posted. The reviewer cannot verify correctness without seeing generation output AND test results from the exact code that is currently on the branch.
Workflow for every PR update cycle:
  1. Make the requested code changes
  2. Commit the changes
  3. Before pushing, always rebase onto the target branch to check for conflicts:
    git fetch upstream && git rebase upstream/main
    . If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first — the branch must be up-to-date with the target branch.
  4. Push (or force-push if rebase rewrote history)
  5. Re-invoke the
    ad-run-agent
    to run
    build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
    on the updated code
  6. Re-run the unit tests:
    pytest <test_file> -v
  7. Wait for both runs to complete
  8. Post a reply to every PR comment containing:
    • A brief description of what changed in this update
    • The COMPLETE raw prompts and generated outputs from the
      build_and_run_ad.py
      run
    • The COMPLETE raw pytest output (full verbatim log)
    • The reproducible commands used for both runs
  9. Resume polling for new comments (see below)

⚠️ MANDATORY: Poll PR for new comments every 5 minutes ⚠️

After opening the PR and after every PR update you post, you MUST set up a polling loop that checks for new PR comments every 5 minutes. Do not simply post and walk away — actively monitor the PR for reviewer feedback.
How to poll:
bash
# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"

# Also check issue-level comments (top-level PR comments, not inline review comments)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"

# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state

**Polling loop behavior:**
1. After posting your PR (or posting an update comment), immediately start polling every 5 minutes.
2. On each poll, check for:
   - **New review comments** (inline or top-level) posted after your last comment's timestamp
   - **PR approval status** — check if the PR has been approved
   - **Termination signals** — any comment clearly indicating the agent's work is done (e.g., "LGTM", "looks good, we're done", "no more changes needed", "agent work complete", or similar)
3. If **new actionable comments are found**: stop polling, process the feedback, and execute the full PR update cycle (steps 1–8 above). After posting the update, resume polling.
4. If the **PR is approved** or a **termination signal** is found: stop polling, report to the user that the PR review cycle is complete, and end.
5. If **no new comments** are found: sleep 5 minutes and poll again.

**Do NOT stop polling prematurely.** The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.
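The decision logic of one poll can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation. It assumes the comment JSON from `gh api` has been parsed into dicts with `created_at` and `body` fields, and that the `reviews` list from `gh pr view --json reviews,state` carries a `state` field. The function names, the phrase list, and the injected `fetch`/`sleep` callables are hypothetical. Substring matching on termination phrases is deliberately crude (it would also match "not LGTM"), so treat a match as a prompt to read the comment rather than as proof.

```python
from datetime import datetime

# Illustrative subset of the termination signals listed above.
TERMINATION_PHRASES = ("lgtm", "no more changes needed", "agent work complete")


def parse_ts(ts):
    # GitHub REST timestamps look like "2024-01-01T12:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def classify_poll(comments, reviews, last_posted_at):
    """Classify one poll result as 'done', 'act', or 'wait'."""
    cutoff = parse_ts(last_posted_at)
    new = [
        c for c in comments
        if parse_ts(c["created_at"]) > cutoff
        and not c["body"].startswith("[AGENT]")  # ignore the agent's own posts
    ]
    if any(r.get("state") == "APPROVED" for r in reviews):
        return "done", new
    if any(p in c["body"].lower() for c in new for p in TERMINATION_PHRASES):
        return "done", new
    return ("act", new) if new else ("wait", new)


def poll(fetch, last_posted_at, sleep, interval_s=300):
    """Poll until there is something to do; fetch() returns (comments, reviews)."""
    while True:
        decision, new = classify_poll(*fetch(), last_posted_at)
        if decision != "wait":
            return decision, new
        sleep(interval_s)  # 5 minutes between polls
```

In real use, `fetch` would wrap the three `gh` commands above and `sleep` would be `time.sleep`. Injecting both keeps the loop testable without network access. The two-hour check-in with the user is not modeled here.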

Sharding-aware IR model porting (modeling_*_ir.py)

Use this when porting an existing AutoDeploy custom model (
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py
) to explicit sharding hint ops in
modeling_*_ir.py
in the same directory (no separate
new_sharding/
tree). The exported FX graph must fully specify how the model should be sharded: the
apply_sharding_hints
transform combines hints with a runtime
DistConfig
for deterministic, node-local sharding.
Argument reference: Do not duplicate operator tables here. Refer to the custom op docstrings in
tensorrt_llm/_torch/auto_deploy/custom_ops/
for the complete argument reference (including sharding hints,
tp_mode
,
layer_type
, and which ops accept hints).

Reference examples (study before porting)

| Original | IR / sharding-aware | Layer types |
|---|---|---|
| modeling_nemotron_h.py | modeling_nemotron_h_ir.py | Mamba SSM, MHA, SwiGLU MLP, MoE |
| modeling_qwen3_5_moe.py | modeling_qwen3_5_moe_ir.py | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| modeling_mistral.py | modeling_mistral_ir.py | MHA, SwiGLU MLP (simplest) |
| modeling_deepseek_v2.py | modeling_deepseek_v2_ir.py | MLA, SwiGLU MLP, MoE |

Step-by-step porting procedure

Step 1: Copy the source file

bash
cp tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo.py \
   tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo_ir.py

Step 2: Update the module docstring and add imports

At the top of the IR file:
python
import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401 -- register all ops
Do not add global
SHARD_*
flags. Layer-level control uses the
layer_type
hint on each op and
shard_layers
in YAML.
Step 3: Replace linear projections

For every
self.proj(x)
or
nn.Linear
call, use
torch.ops.auto_deploy.torch_linear_simple
with explicit
tp_mode
and
layer_type
. Always set
tp_mode
unconditionally (no
if _s else "none"
). Rules: opening projections (Q/K/V/gate/up/in_proj) →
"colwise"
; closing (O/down/out_proj) →
"rowwise"
; tiny outputs (e.g.
shared_expert_gate
dim 1) →
"none"
; MLA latent projections (q_a, kv_a) →
"none"
. For fused weights split later, pass
output_sizes=[...]
. For GQA, use
tp_min_local_shape=self.head_dim
on K/V colwise lines.
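The rules above can be condensed into a lookup table for reviewing a port. This is a review aid only, not something to call from model code, where each `tp_mode` must stay a literal, unconditional argument. The module-name suffixes are hypothetical examples in the common HuggingFace naming style; substitute the names used by the model being ported.

```python
# Hypothetical module-name suffixes mapped to the tp_mode the rules above expect.
EXPECTED_TP_MODE = {
    # opening projections
    "q_proj": "colwise",
    "k_proj": "colwise",
    "v_proj": "colwise",
    "gate_proj": "colwise",
    "up_proj": "colwise",
    "in_proj": "colwise",
    # closing projections
    "o_proj": "rowwise",
    "down_proj": "rowwise",
    "out_proj": "rowwise",
    # tiny outputs and MLA latent projections stay replicated
    "shared_expert_gate": "none",
    "q_a_proj": "none",
    "kv_a_proj": "none",
}


def expected_tp_mode(module_path):
    """Look up the expected tp_mode from the last component of a dotted module path."""
    return EXPECTED_TP_MODE.get(module_path.rsplit(".", 1)[-1])


print(expected_tp_mode("model.layers.0.self_attn.q_proj"))
print(expected_tp_mode("model.layers.0.mlp.down_proj"))
```

A `None` result means the name is not covered by the table and needs a manual decision.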
Step 4: Replace split / chunk after fused colwise projections

Use
torch.ops.auto_deploy.split_with_sizes
with
shardable
/
layer_type
where sizes scale with TP.
Step 5: Replace view / reshape with concrete head counts

During
torch.export
,
-1
becomes concrete; after TP, wrong values break. Any reshape whose dimension is a head count that scales with TP must use
torch.ops.auto_deploy.view
with
tp_scaled_dim
set appropriately. Safe cases: flat-to-2D, or
[B,S,-1]
when the input is already correctly sharded.
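A small arithmetic example shows why a frozen head count breaks after tensor parallelism. It uses plain Python element counts rather than real tensors, and the sizes (32 heads, head dim 128, TP degree 8) are made up for illustration.

```python
def reshape_ok(numel, shape):
    """True if a tensor with `numel` elements can take `shape` (no -1 entries)."""
    total = 1
    for d in shape:
        total *= d
    return total == numel


B, S, num_heads, head_dim, tp = 2, 16, 32, 128, 8

full_numel = B * S * num_heads * head_dim  # unsharded projection output
local_numel = full_numel // tp             # per-rank output after colwise sharding

exported_shape = (B, S, num_heads, head_dim)      # -1 frozen to 32 heads at export
scaled_shape = (B, S, num_heads // tp, head_dim)  # head count divided by TP degree

print(reshape_ok(full_numel, exported_shape))   # unsharded: shape fits
print(reshape_ok(local_numel, exported_shape))  # sharded input, frozen shape: mismatch
print(reshape_ok(local_numel, scaled_shape))    # sharded input, TP-scaled head count: fits
```

The second case is the failure this step guards against: the per-rank tensor holds only `num_heads // tp` heads, so a reshape that still names all 32 heads cannot succeed.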
Step 6: Insert all_reduce

After every rowwise projection, add
torch.ops.auto_deploy.all_reduce(..., layer_type=...)
. Parallel branch rule: when branches merge by addition, use a single
all_reduce
after the sum (e.g. MoE routed + shared expert; parallel attention + MLP residual branches).
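The parallel branch rule follows from the linearity of a sum reduction: reducing each branch and then adding gives the same result as adding per rank and reducing once. The sketch below simulates four ranks with plain Python lists; `all_reduce` here is a stand-in written for this example, not the AutoDeploy op.

```python
def all_reduce(per_rank_values):
    """Simulated sum all-reduce: every rank ends up holding the global total."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)


# Per-rank partial results of two rowwise branches (made-up numbers).
routed = [1, 2, 3, 4]
shared = [10, 20, 30, 40]

# Two collectives: reduce each branch, then add.
two_reductions = [r + s for r, s in zip(all_reduce(routed), all_reduce(shared))]

# One collective: add the partial results on each rank, then reduce once.
one_reduction = all_reduce([r + s for r, s in zip(routed, shared)])

print(two_reductions == one_reduction)
```

Both paths leave the same value on every rank, so the single reduction after the sum saves one collective per merge point.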

Step 7: Special ops (Conv1d, SSM, GatedDeltaNet, gated RMSNorm)

Add sharding hints on
torch_causal_conv1d
,
torch_ssm
,
torch_gated_delta_rule
,
torch_rmsnorm_gated
per docstrings—typically
shardable
/
output_sizes
/
tp_mode
as required.
Step 8: MoE

Pass
layer_type="moe"
into
torch_moe
;
apply_sharding_hints
handles EP/TP.

Step 9: Register the IR model

  1. Bottom of the IR file:
    AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)
    (same pattern as Phase 4).
  2. Add a side-effect import in
    tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
    (e.g.
    from . import modeling_foo_ir  # noqa: F401
    ) and extend
    __all__
    if you export symbols. Without this import, worker processes may not load your class and
    apply_sharding_hints
    can report 0 nodes processed. Do not use a separate
    register_sharded_models.py
    indirection.

Step 10: YAML — composable registry pattern

Prefer the model registry (
examples/auto_deploy/model_registry/models.yaml
) and compose shared fragments under
examples/auto_deploy/model_registry/configs/
, same as other models: list
dashboard_default.yaml
, the right
world_size_N.yaml
, then a dedicated fragment (e.g.
enable_sharder_ir.yaml
) that holds IR sharding transforms. That fragment should disable legacy sharding passes and enable hint-driven sharding. Registry fragments are deep-merged in
yaml_extra
order (see
DynamicYamlMixInForSettings
in
tensorrt_llm/_torch/auto_deploy/utils/_config.py
); place transform keys under
transforms:
so they merge with
dashboard_default.yaml
. Standalone experiment YAMLs for
build_and_run_ad
may wrap the same fields under a top-level
args:
block matching
LlmArgs
.
Example transform block:

yaml
# Typical contents for enable_sharder_ir.yaml (registry composable fragment)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2  # often required when expert count is large (>64)
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp']  # optional selective sharding
  gather_logits_before_lm_head:
    enabled: true

Use `world_size: 8` when validating TP head-divisibility. Optional `shard_layers` limits which `layer_type` hints are processed; unset means shard all shardable nodes.
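The deep-merge behavior described in this step can be illustrated with a generic recursive dictionary merge. This is a sketch of the general technique, not the code in `DynamicYamlMixInForSettings`, and the fragment contents below are made-up stand-ins rather than the real `dashboard_default.yaml`. It shows why transform keys need to sit under `transforms:` so that a later fragment overrides individual fields instead of replacing whole sections.

```python
def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`; later values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up stand-ins for parsed YAML fragments, listed in yaml_extra order.
base_fragment = {
    "transforms": {
        "detect_sharding": {"stage": "sharding", "enabled": True},
    },
}
ir_fragment = {
    "transforms": {
        "detect_sharding": {"enabled": False},
        "apply_sharding_hints": {"stage": "sharding", "enabled": True},
    },
}

config = {}
for fragment in (base_fragment, ir_fragment):
    config = deep_merge(config, fragment)

print(config["transforms"]["detect_sharding"])
```

The later fragment flips `enabled` while the earlier fragment's `stage` survives, which is the behavior a composable fragment relies on.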

Step 11: Validate

Do not report success until a run completes successfully.
  1. Prefer
    python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry
    after adding/updating the registry entry and composable YAMLs (Phase 8–9 style).
  2. apply_sharding_hints
    logs should show
    N nodes processed
    with N > 0
    .
  3. If validation fails with infrastructure limits (e.g. head count not divisible by
    world_size
    ), document the assert and compatible sizes; do not "fix" core
    sharding.py
    / custom op schemas without owner review.
  4. If blocked by missing infrastructure support, rename artifacts to
    broken_modeling_*_ir.py
    / broken YAML and file a short error report for humans (do not silently patch core transforms).
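The check in item 2 can be automated over a captured run log. The sketch below assumes the count appears in the log as a number followed by the words "nodes processed"; the exact log format is an assumption, so adjust the pattern to what the run actually prints. The function name and the sample log lines are hypothetical.

```python
import re


def nodes_processed(log_text):
    """Return the largest 'N nodes processed' count found in the log, or None."""
    counts = [int(m) for m in re.findall(r"(\d+)\s+nodes processed", log_text)]
    return max(counts) if counts else None


# Invented sample log text, for illustration only.
sample_log = "export done\napply_sharding_hints: 42 nodes processed\n"
count = nodes_processed(sample_log)
print("OK" if count else "FAIL: no sharding hints were applied")
```

A result of `0` or `None` points to the situation described in Step 9, where the IR class was not loaded in the worker process.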
Layer type strings (for
layer_type
/
shard_layers
): use
"mha"
,
"mla"
,
"mlp"
,
"moe"
,
"ssm"
,
"delta"
, or
"unknown"
(default; skipped when
shard_layers
is set). Match the conventions used in
apply_sharding_hints
and project enums.

Layer-specific sharding patterns

MHA (standard or gated):
layer_type="mha"
: q/k/v colwise (GQA:
tp_min_local_shape
),
view
with
tp_scaled_dim
for head dim, o rowwise +
all_reduce
. Fused Q+gate interleaved per head: colwise without
output_sizes
; contiguous Q|K|V fused blocks need
output_sizes
.
SwiGLU MLP:
layer_type="mlp"
: gate/up colwise, down rowwise +
all_reduce
.
Mamba / SSM:
layer_type="ssm"
: in_proj colwise +
output_sizes
, splits shardable, conv1d shardable +
output_sizes
, views,
torch_ssm
shardable, norm gated colwise if weight scales, out rowwise +
all_reduce
.
GatedDeltaNet:
layer_type="delta"
: in_proj_qkv with
output_sizes
, other in_projs colwise, conv1d/splits/views as above,
torch_gated_delta_rule
shardable, out rowwise +
all_reduce
.
MoE + shared expert:
layer_type="moe"
: router replicated; one
all_reduce
after
routed + shared
, not two.
MLA (DeepSeek):
layer_type="mla"
: keep
torch_mla
intact with
shardable=True
—do not decompose into separate linears +
torch_attention
(introduces bad
expand
/
view
with concrete head counts). q_a/kv_a latent:
tp_mode="none"
; q_b colwise;
o_proj
rowwise +
all_reduce
.
Common pitfalls (sharding IR)

  1. Missing
    auto_deploy::view
    for head reshapes
    — concrete shapes from export break after sharding.
  2. Sharding tiny projections — dim-1 gates:
    tp_mode="none"
    .
  3. Double
    all_reduce
    in MoE
    — one merge-point reduction for routed + shared.
  4. Cross-layer parameter contamination — in
    _apply_hint_*
    handlers using
    get_source_nodes()
    , restrict with
    allowed_ops
    so residual links do not pull weights from other layers.
  5. Missing
    num_moe_experts_for_export
    for very large expert counts — export can hang.
  6. Decomposing ops that absorb weights (e.g.
    torch_mla
    ) — use
    shardable
    + handler instead of splitting into plain linears.
  7. Interleaved vs contiguous fused weights — interleaved per-head groups: colwise only; contiguous Q|K|V blocks: require
    output_sizes
    .
  8. Omitting layer_type when using shard_layers: "unknown" nodes are skipped; set hints explicitly on sharding-aware ops.
  9. layer_type
    on non-hint ops
    — do not pass
    layer_type
    to ops that are not designed for sharding hints (e.g.
    torch_attention
    ,
    torch_l2norm
    ,
    torch_rope_*
    ); extra positional args break calls. Confirm in
    custom_ops/
    docstrings which ops accept hints.
  10. Conditional hint values — no
    if _s else "none"
    ; use unconditional hints and rely on
    shard_layers
    / transform config.

Sharding IR validation checklist (human review)

  • world_size=1
    : unsharded path; hints should not break correctness.
  • world_size=2
    and
    8
    : shape checks and coherent output.
  • apply_sharding_hints
    node count vs expectation.
  • Optional:
    shard_layers: ['moe']
    to verify selective sharding.

Key Gotchas

  • Canonical ops first: Always use
    torch.ops.auto_deploy.torch_*
    canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
  • No
    repeat_interleave
    :
    AD attention ops handle GQA natively. Never repeat K/V heads manually.
  • Lean code: Every line should serve prefill export. No optional HF features, no dead code paths, no fallback logic.
  • Reuse config classes: Import from
    transformers
    or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
  • Assert
    position_ids
    :
    Always assert
    position_ids is not None
    — it is a required input, never optional.
  • Self-contained files only: Never import from other AD custom models. Each
    modeling_{name}.py
    is a standalone translation from HF source.
  • RoPE cos/sin: slice ONCE, not per layer.
    _ad_
    prefix for RoPE buffers.
    RotaryEmbedding.forward(x, position_ids)
    MUST slice by
    position_ids
    once and return pre-sliced
    (cos, sin)
    . Pass those tensors to all layers. NEVER pass
    position_ids
    through to each layer/attention forward to re-index — that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
  • MoE weights: use
    nn.ModuleList
    per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
  • noaux_tc
    routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused
    trtllm
    kernels at deployment time.
  • Vision towers are typically not exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
  • Model code and tests must run on CPU. Use only
    torch_*
    prefixed reference ops in AutoDeploy — never
    triton_*
    ,
    flashinfer_*
    , or
    trtllm_*
    .