ad-model-onboard
AutoDeploy Model Onboarding
Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.
Phase 0 — Gather All Resources Upfront
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.
Step 0 — GPU memory sanity check
Before anything else, check whether the model can fit on the current system.
- Run `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` to get the total VRAM (in MiB) across all GPUs on the system.
- Estimate the model's memory footprint from the HuggingFace model card or config (number of parameters × bytes per parameter, e.g. 7B params × 2 bytes = ~14 GB for bfloat16).
- If the estimated size exceeds total system VRAM, stop and report this to the user. Do not proceed with onboarding until the user acknowledges and decides how to proceed. Example message: "This model requires ~X GB but the system only has Y GB across N GPUs. Onboarding is likely to fail at the e2e run stage."
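The check above can be sketched in a few lines of Python. This is a minimal illustration, not part of AutoDeploy: the function names are hypothetical, the sample `nvidia-smi` output is made up, and the estimate covers weights only (no activations or runtime overhead).

```python
def total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum the per-GPU memory.total values (MiB), one integer per line."""
    return sum(int(line) for line in nvidia_smi_output.splitlines() if line.strip())


def estimate_weights_mib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint: parameters x bytes per parameter, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


def fits(num_params: float, nvidia_smi_output: str, bytes_per_param: int = 2) -> bool:
    """True if the weights-only estimate does not exceed total VRAM."""
    return estimate_weights_mib(num_params, bytes_per_param) <= total_vram_mib(nvidia_smi_output)


# Hypothetical output of the nvidia-smi query on a two-GPU machine.
sample_output = "81559\n81559\n"
small_fits = fits(7e9, sample_output)    # 7B params in bfloat16
large_fits = fits(180e9, sample_output)  # 180B params in bfloat16
print(small_fits, large_fits)
```

Because the estimate ignores everything except weights, a model that only barely fits should still be reported to the user.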
Step 1 — Check local transformers install first:
```bash
python -c "import transformers; print(transformers.__file__)"
```
Look for `models/{model_type}/modeling_*.py` under the directory containing that file. If found, use it directly; no network access is needed.
Step 2 — If not found, download the HF repo (code only, skip weights):
```bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
```
This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are automatically found by `transformers.AutoConfig.from_pretrained` and similar APIs, so no extra path wiring is needed. Once downloaded you can work fully offline: read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.
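The Step 1 lookup can be sketched as a small helper. The function name `find_modeling_files` and the throwaway `demo` model type are illustrative assumptions; only the `models/{model_type}/modeling_*.py` layout comes from the text above.

```python
import tempfile
from pathlib import Path


def find_modeling_files(package_dir, model_type):
    """Return sorted modeling_*.py files under <package_dir>/models/<model_type>/."""
    model_dir = Path(package_dir) / "models" / model_type
    if not model_dir.is_dir():
        return []  # not shipped with this install: fall back to Step 2
    return sorted(model_dir.glob("modeling_*.py"))


# Demo on a throwaway directory tree (hypothetical model type "demo").
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "models" / "demo"
    demo_dir.mkdir(parents=True)
    (demo_dir / "modeling_demo.py").write_text("# stub\n")
    (demo_dir / "configuration_demo.py").write_text("# stub\n")
    found = [p.name for p in find_modeling_files(tmp, "demo")]
    missing = find_modeling_files(tmp, "absent")

print(found, missing)
```

Against a real install, the first argument would be `Path(transformers.__file__).parent`; an empty result means the download in Step 2 is needed.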
Phase 1 — Survey Existing Coverage & Analyze HF Model
Step 1 — Check for existing AD custom modeling code
Before writing anything, check if an AD custom model already covers this architecture:
- Read the model's `config.json` to find its `model_type` and `architectures` fields.
- Search `tensorrt_llm/_torch/auto_deploy/models/custom/` for existing `modeling_*.py` files that register the same config class name (grep for the `architectures` value or `model_type`).
- Also check `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` for existing registrations.
If existing code is found:
- Read it carefully. It may already handle this exact model — in which case no new modeling file is needed, only registry entries and possibly tests.
- If the existing code covers a closely related model in the same family but needs adaptation (e.g., the family added MoE in a newer variant, or changed the attention type), decide whether to extend the existing file or create a new one. Prefer extending if the changes are minor; create a new file if the architecture diverges significantly. Report the decision and rationale to the user before proceeding.
If no existing code is found: proceed to write a new model file in Phase 2.
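The coverage check in this step amounts to reading two config fields and grepping the custom-model directory for them. A minimal sketch, assuming plain substring matching is good enough; the helper names and the `demo_moe` files are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def read_arch_keys(config_path):
    """Collect model_type and architectures values from a config.json."""
    cfg = json.loads(Path(config_path).read_text(encoding="utf-8"))
    keys = [cfg.get("model_type"), *(cfg.get("architectures") or [])]
    return [k for k in keys if k]


def find_existing_coverage(custom_dir, keys):
    """Map each modeling_*.py file to the keys it mentions (substring match)."""
    hits = {}
    for path in sorted(Path(custom_dir).glob("modeling_*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        matched = [k for k in keys if k in text]
        if matched:
            hits[path.name] = matched
    return hits


# Demo with made-up files standing in for config.json and the custom/ directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(
        json.dumps({"model_type": "demo_moe", "architectures": ["DemoMoeForCausalLM"]})
    )
    (root / "modeling_demo.py").write_text("class DemoMoeForCausalLM: ...\n")
    (root / "modeling_other.py").write_text("class OtherForCausalLM: ...\n")
    keys = read_arch_keys(root / "config.json")
    hits = find_existing_coverage(root, keys)

print(keys, hits)
```

A hit only says where to start reading. Whether the existing file really covers the model still needs the manual review described above.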
Step 2 — Survey the model family in the registry
Check `examples/auto_deploy/model_registry/models.yaml` for other models from the same family (e.g., if asked to onboard `Qwen/Qwen3-8B`, look for `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-32B`, `Qwen/Qwen3-235B-A22B`, etc.). Also check HuggingFace for the full set of model sizes/variants in the family.
- Identify which family members already have registry entries and which are missing.
- Identify which family members share the same architecture (same `model_type` / `architectures` in their config). These can all use a single modeling file.
- Plan to onboard the entire family cohesively: one modeling file + one test file should cover all members that share an architecture. The registry should have entries for all commonly-used sizes.
- Report the family survey findings to the user: which models exist, which are missing, and the proposed plan for covering them all.
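The family survey can be sketched as a grouping step over already-loaded configs. The helper names are hypothetical, and the config values below are illustrative stand-ins; read the real `model_type` and `architectures` from each model's `config.json`.

```python
from collections import defaultdict


def group_by_architecture(configs):
    """Group model IDs that share (model_type, architectures)."""
    groups = defaultdict(list)
    for model_id, cfg in configs.items():
        key = (cfg["model_type"], tuple(cfg["architectures"]))
        groups[key].append(model_id)
    return dict(groups)


def missing_from_registry(family_ids, registered_ids):
    """Family members that have no registry entry yet."""
    return sorted(set(family_ids) - set(registered_ids))


# Illustrative config excerpts (assumed values, not fetched from HuggingFace).
family = {
    "Qwen/Qwen3-0.6B": {"model_type": "qwen3", "architectures": ["Qwen3ForCausalLM"]},
    "Qwen/Qwen3-8B": {"model_type": "qwen3", "architectures": ["Qwen3ForCausalLM"]},
    "Qwen/Qwen3-32B": {"model_type": "qwen3", "architectures": ["Qwen3ForCausalLM"]},
    "Qwen/Qwen3-235B-A22B": {"model_type": "qwen3_moe", "architectures": ["Qwen3MoeForCausalLM"]},
}
groups = group_by_architecture(family)
missing = missing_from_registry(family, ["Qwen/Qwen3-8B"])
print(groups)
print(missing)
```

Each group corresponds to one modeling file plus one test file; the `missing` list is what the registry still needs.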
Step 3 — Analyze HF model architecture
Study the locally-available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if`).
Phase 2 — Write a Lean Prefill-Only Model
Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a structural template only (class layout, dataclass outputs, forward signature).
The goal is a minimal prefill-only model for `torch.export` with AD canonical IR ops. Keep the code as lean as possible: every line should serve the export path. Do not port HF features that AD doesn't need.
Strip: KV cache, training paths, dropout, flash attention variants, `repeat_interleave`/`repeat_kv` for GQA (AD attention ops handle this natively), fallback logic for generating `position_ids` (assert instead), optional code paths gated on config flags irrelevant to prefill export.
Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.
Critical: Make sure the `nn.Module` hierarchy of the custom modeling code matches what the checkpoint's safetensors JSON expects.
Critical rule: Do NOT import or reuse existing AD custom model code (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source (`$CLONE_DIR/modeling_*.py`) as the source of truth for the model's logic and translate it fresh, even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.
Phase 3 — Use AutoDeploy Canonical Ops (CRITICAL)
Use `torch.ops.auto_deploy.torch_*` canonical ops WHENEVER POSSIBLE. These are the IR nodes that AD transforms later replace with optimized backends (triton, flashinfer, trtllm) at deployment time. If a canonical op exists for an operation, you MUST use it; do not reimplement the logic in plain PyTorch.
Available canonical ops (see `tensorrt_llm/_torch/auto_deploy/custom_ops/README.md` for full list):
- Attention: `torch_attention`, `torch_attention_sdpa`, `torch_attention_repeat_kv`
- MLA: `torch_mla`
- RoPE: `torch_rope_with_explicit_cos_sin`, `torch_rope_with_complex_freqs`, `torch_rope_with_qk_interleaving`
- MoE: `torch_moe`, `torch_moe_fused`, `torch_moe_router`, `torch_moe_dense_mlp`
- Normalization: `torch_rmsnorm`, `torch_rmsnorm_gated`, `torch_l2norm`
- Linear: `torch_linear_simple`
- SSM/Mamba: `torch_ssm`, `torch_causal_conv1d`
- FLA: `torch_gated_delta_rule`
- Quantization: `torch_quant_fp8_linear`, `torch_quant_nvfp4_linear`, etc.
Never use `triton_*`/`flashinfer_*`/`trtllm_*` ops: backend selection happens later in AD transforms. Plain PyTorch is acceptable ONLY for operations where no canonical op exists (e.g., simple activation functions, embedding lookups, basic tensor arithmetic). If you find yourself writing manual attention, MoE routing, RoPE, or normalization in plain PyTorch, stop and use the canonical op instead.
Do NOT use `repeat_interleave` or `repeat_kv` for GQA. HF reference code often repeats K/V heads to match the Q head count before attention. The AD canonical attention ops (`torch_attention`, `torch_attention_sdpa`) handle GQA natively: they accept Q, K, V with different head counts and do the right thing internally. Manually repeating K/V heads is unnecessary bloat and prevents AD from optimizing the attention path.
Phase 4 — Register
- Bottom of model file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`.
- Add import + `__all__` entry in `models/custom/__init__.py`.
- Prefer reusing the existing config class: if the config can be loaded via `AutoConfig.from_pretrained(model_id)` (either from the installed `transformers` or from files in the HF cache downloaded in Phase 0), import it from `transformers` and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls `AutoConfig.from_pretrained(model_id, trust_remote_code=True)` and passes the result to your model, so you rarely need to import the config at all. If you find yourself doing so, sanity-check that it's genuinely needed.
- Only if the config is truly not available (not in `transformers` and not bundled with the checkpoint), define a minimal config class in the modeling file and call `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`. A good sanity check: if the E2E test passes without a custom config class, you don't need one, because `AutoConfig.from_pretrained` already picked up the right class.
Phase 5 — Model Input Contract
The custom model's forward signature must follow these rules:
- Always `input_ids`: The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
- Always `position_ids`: Vanilla sequential `position_ids` are always provided. Assert `position_ids is not None` at the top of the forward method; it is a required input, never optional. Do not include fallback logic to generate `position_ids` from `input_ids` (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla `position_ids`.
- Multi-modal inputs: If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
- No attention mask, no cache inputs, no HF-runtime features: Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
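The contract above can be checked mechanically with `inspect`. This is a rough sketch, not an AutoDeploy utility: `check_forward_contract` and the two sample `forward` functions are hypothetical, and the check only sees explicitly named parameters (anything swallowed by `**kwargs` goes unnoticed).

```python
import inspect

REQUIRED = ("input_ids", "position_ids")
FORBIDDEN = ("attention_mask", "past_key_values", "use_cache")


def check_forward_contract(forward):
    """Return a list of contract violations for a forward function's signature."""
    params = inspect.signature(forward).parameters
    problems = [f"missing required argument: {n}" for n in REQUIRED if n not in params]
    problems += [f"forbidden HF-runtime argument: {n}" for n in FORBIDDEN if n in params]
    return problems


def good_forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
    # position_ids is required: assert instead of regenerating it from input_ids.
    assert position_ids is not None, "position_ids is required"
    return input_ids


def bad_forward(self, input_ids, attention_mask=None, past_key_values=None):
    return input_ids


good_problems = check_forward_contract(good_forward)
bad_problems = check_forward_contract(bad_forward)
print(good_problems)
print(bad_problems)
```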
Phase 6 — Hierarchical Tests
Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as template. No smoke tests. Small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if HF class unavailable.
HF Reference Strategy: Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs. Use actual HF classes if they exist: prefer importing directly over standalone HF-like implementations for unit tests. Standalone "reference" implementations are effectively alternative AD IR models and defeat the purpose of the reference test; they also tend to silently agree with whatever bugs exist in the custom model.
- If HF modules exist in the installed `transformers`: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when `None`.
- If HF modules are NOT in the installed `transformers`: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation only. Do NOT tweak the functionality of the reference. The same applies to config classes that use `trust_remote_code` (i.e., not available in `transformers`): copy a minimal faithful version into the test file. The modeling file should NOT import the config class; AD loads it at runtime via `AutoConfig.from_pretrained(..., trust_remote_code=True)`. The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
- Weight conversion helpers: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using `load_state_dict` pre-hooks already registered on the custom model.
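The `_get_hf_*_class()` pattern from the first bullet can be sketched generically. The helper name `_get_hf_class` and its two-argument form are assumptions for illustration, and returning `None` for a missing attribute goes slightly beyond the `ImportError` case described above. The demo uses a standard-library module so it runs without `transformers`.

```python
import importlib


def _get_hf_class(module_path, class_name):
    """Return the named class, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    # Assumption: also treat a missing attribute as "not available".
    return getattr(module, class_name, None)


# Stand-ins for an available and an unavailable HF module.
available = _get_hf_class("collections", "OrderedDict")
unavailable = _get_hf_class("no_such_package_for_demo.modeling", "Missing")
print(available, unavailable)

# In a test, the None case would translate into a skip, e.g.:
#     hf_cls = _get_hf_class("transformers.models.deepseek_v3.modeling_deepseek_v3",
#                            "DeepseekV3ForCausalLM")
#     if hf_cls is None:
#         pytest.skip("HF reference class not available")
```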
Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:
```python
from _model_test_utils import assert_rmse_close
```
This computes `rmse(actual - expected) / rmse(expected)`, which is more robust than per-element `torch.testing.assert_close` since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., plain MLP with no custom ops).
Recommended `rmse_ratio_tol` values for bfloat16:
- Identical math (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
- MoE block (fused routing): `0.02`
- Decoder layer / MoE layer / full model: `0.05`
- Attention: `0.10`
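To see why the RMSE ratio tolerates a few outliers, here is the formula in plain Python on lists. This is only a sketch of the arithmetic, not the `assert_rmse_close` implementation (which presumably operates on tensors), and it assumes `expected` is not all zeros.

```python
import math


def rmse(values):
    """Root mean square of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmse_ratio(actual, expected):
    """rmse(actual - expected) / rmse(expected)."""
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)


# 100 elements, one of which is badly off.
expected = [1.0] * 100
actual = [1.0] * 99 + [1.5]

ratio = rmse_ratio(actual, expected)
passes_rmse = ratio <= 0.10  # attention-level tolerance from the list above
passes_elementwise = all(abs(a - e) <= 1e-3 for a, e in zip(actual, expected))
print(ratio, passes_rmse, passes_elementwise)
```

The single outlier fails the per-element check but leaves the ratio near 0.05, inside the looser tolerances listed above.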
Bottom-up levels (each must pass before next):
- Block equivalence: Test MLP, Attention, MoE, Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
- Layer equivalence: Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
- Full model equivalence: End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
- Export test: `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq, verify finite output, test a second shape.
Phase 7 — Independent Review (MANDATORY)
Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:
- Model name
- Path to the model file created
- Path to the test file created
Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.
If the reviewer returns FAIL on any item:
- Read the reviewer's specific failure reasons and file:line references
- Fix each failed item
- Invoke the reviewer again with the same minimal inputs
- Repeat until you get a full PASS
Do NOT proceed to Phase 8 until the reviewer returns PASS.
Phase 8 — Create or Update Model Registry Entries (Including Family)
Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at `examples/auto_deploy/model_registry/`.
For each model (the requested model + any family members identified in Phase 1 Step 2):
- Check `examples/auto_deploy/model_registry/models.yaml` for an existing entry matching the model's HF id.
- If the entry is missing, add it with the appropriate `yaml_extra` list:
  - Always include `dashboard_default.yaml` first.
  - Pick `world_size_N.yaml` based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The `world_size` determines how many GPUs are needed for the run.
  - Add model-specific YAML if the model needs custom settings (e.g., `model_kwargs`, non-default transforms).
- If a model-specific config YAML is needed and doesn't exist, create it under `examples/auto_deploy/model_registry/configs/`. See existing configs for format examples.
- If the entry exists but needs changes (e.g., wrong world_size, missing model-specific config), update it.
Family members that share the same architecture should all use the same modeling code. Different sizes only need different `world_size_N.yaml` entries and maybe different sharding configurations.
See `examples/auto_deploy/model_registry/README.md` for full documentation on the registry format and best practices.
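The size-to-`world_size` rule above can be written as a small lookup. The function names are hypothetical. The stated brackets leave 15-20B unspecified; this sketch assumes that gap rounds up to 4 GPUs, which is a guess rather than a documented rule.

```python
def pick_world_size(params_billions):
    """Map parameter count (in billions) to a world size per the brackets above."""
    if params_billions < 2:
        return 1
    if params_billions <= 15:
        return 2
    if params_billions < 80:
        return 4  # assumption: the unspecified 15-20B gap is treated like 20-80B
    return 8


def world_size_yaml(params_billions):
    """Name of the world_size_N.yaml file to list in yaml_extra."""
    return f"world_size_{pick_world_size(params_billions)}.yaml"


sizes = {b: pick_world_size(b) for b in (0.6, 8, 32, 235)}
print(sizes, world_size_yaml(8))
```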
Phase 9 — AutoDeploy End-to-End Run
⚠️ MANDATORY: You MUST use the standalone config YAML with `--args.yaml-extra` ⚠️
You MUST run the model using the standalone config YAML created in Phase 8. The same YAML will be referenced by the cookbook's `trtllm-serve` command in Phase 11. The command is:
```bash
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
```
The standalone config YAML under `examples/auto_deploy/model_registry/configs/` is self-contained: it includes all settings needed for running the model (compile backend, batch size, seq len, transforms, world_size, etc.). This is the same YAML that `trtllm-serve --extra_llm_api_options` will use in the cookbook, so validating it here ensures the cookbook works out of the box.
If the run FAILS:
- Fix the standalone config YAML: update settings in `examples/auto_deploy/model_registry/configs/<model>.yaml` and re-run.
- The standalone config YAML is the source of truth. If it is wrong, fix it. If it is missing settings, add them. The model MUST work via this YAML before you are done.
Invoke the `ad-run-agent` subagent to run the model through AutoDeploy on GPU, in two steps:
Step 1: Reduced num layers
Run with reduced num layers to test the e2e flow for issues and iterate faster.
The generation will be bad in step 1 because we are not loading all layers.
Step 2: Full layers
Run with full num layers. The generation should be coherent in step 2.
For each run, pass the subagent:
- Model HF ID: the HuggingFace model-id (or local checkpoint path) used throughout onboarding
- Standalone config YAML path: the path to the config YAML under `examples/auto_deploy/model_registry/configs/`
- Description: a short description of the current state, e.g.:
  - "first try after onboarding"
  - "updated yaml with reduced layers"
  - "changed attention backend to torch_mha"
  - "fixed weight loading hooks"
The model is run via:
```bash
CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
```
The `ad-run-agent` will determine the required `world_size` from the config YAML, check GPU availability via `nvidia-smi`, select free GPUs, and wait if not enough are available.
The `ad-run-agent` will build+run the model, check generation quality, archive logs, and update its worklog.
If the run fails or produces bad generation:
- Read the ad-run-agent's worklog and log file to understand the error
- Fix the issue (model code, standalone config YAML, weight hooks, etc.)
- Re-invoke the ad-run-agent with an updated description reflecting the change (e.g., "retry after fixing RoPE scaling in config")
- Always re-run with `--args.yaml-extra`. Fix the standalone config YAML, don't work around it.
- Repeat until the run succeeds with meaningful generation
Do NOT proceed to Phase 10 until the step 2 with full layers reports a successful run with coherent generation.
Important: The successful E2E run outputs (prompts and generated text) will be needed for the cookbook notebook in Phase 11 and the summary report in Phase 12. Save them.
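For scripting the run, the command from this phase can be assembled programmatically. A minimal sketch: `build_run_command` is a hypothetical helper, the model ID and YAML path are placeholders, and only the script path and the `--model` / `--args.yaml-extra` flags are taken from the command shown above.

```python
import shlex


def build_run_command(model_id, config_yaml, gpus):
    """Return (env, argv) for a build_and_run_ad.py invocation."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    argv = [
        "python",
        "examples/auto_deploy/build_and_run_ad.py",
        "--model", model_id,
        "--args.yaml-extra", config_yaml,
    ]
    return env, argv


# Placeholder model ID and config path, for illustration only.
env, argv = build_run_command(
    "org/model",
    "examples/auto_deploy/model_registry/configs/model.yaml",
    gpus=[0, 1],
)
printable = f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} " + shlex.join(argv)
print(printable)
```

With `subprocess.run(argv, env={**os.environ, **env})` the same pair could launch the run, and the printed form is convenient to record next to the saved prompts and outputs.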
Phase 10 — Update Model Support Matrix
After a successful E2E run, update the TensorRT-LLM model support matrix at `docs/source/models/supported-models.md` to include the newly onboarded model.
- Read the current support matrix to understand the format and existing entries.
- Add a row to the "Supported Models" table (the first table in the file) with:
  - `Architecture`: The model's architecture class name (e.g., `MiniMaxM2ForCausalLM`). Use the class name registered in Phase 4.
  - `Model`: The model family/display name (e.g., `MiniMax M2/M2.1/M2.7`).
  - `HuggingFace Example`: A representative HF model ID (e.g., `MiniMaxAI/MiniMax-M2.7`).
- Place the new row alphabetically by architecture class name to keep the table sorted.
- If the model is AutoDeploy-only (i.e., it does NOT have native PyTorch backend support in `tensorrt_llm/_torch/models/`), add a footnote indicating AutoDeploy support with a link to the AD config YAML, following the pattern of existing AD-only models (e.g., `[^N]: Supported via the [AutoDeploy](../features/auto_deploy/auto-deploy.md) backend. See [AD config](../../../examples/auto_deploy/model_registry/configs/<model>.yaml).`).
- If the model warrants an entry in the Model-Feature Support Matrix (second table, typically for key/flagship models), add a row there too. For newly onboarded AD models, most advanced features should be marked `Untested` unless you have verified them. Use existing AD model entries (e.g., `Glm4MoeLiteForCausalLM`) as a reference for which features to mark as supported vs untested.
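Keeping the table sorted is a sorted-insert on the architecture column. A small sketch with `bisect`, assuming case-insensitive ordering (the real table's collation is not stated); `insert_row_sorted` and the two existing rows are made up for illustration.

```python
import bisect


def insert_row_sorted(rows, new_row):
    """Insert new_row so rows stay ordered by their first column (case-insensitive)."""
    keys = [row[0].casefold() for row in rows]
    idx = bisect.bisect_left(keys, new_row[0].casefold())
    return rows[:idx] + [new_row] + rows[idx:]


# (Architecture, Model, HuggingFace Example); existing rows are placeholders.
existing = [
    ("AlphaForCausalLM", "Alpha", "example-org/alpha"),
    ("ZetaForCausalLM", "Zeta", "example-org/zeta"),
]
new_row = ("MiniMaxM2ForCausalLM", "MiniMax M2/M2.1/M2.7", "MiniMaxAI/MiniMax-M2.7")

updated = insert_row_sorted(existing, new_row)
order = [row[0] for row in updated]
print(order)
```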
Phase 11 — Create AutoDeploy Cookbook
Create an AutoDeploy cookbook notebook for the model, following the pattern of existing cookbooks.
- Use `examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb` as the template. Copy its structure exactly.
- Create the new notebook at `examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb`, using a snake_case version of the model name (e.g., `minimax_m2.7_trtllm_cookbook.ipynb`).
- Adapt all model-specific content:
  - Title and description: update the model name, HF model ID, and description.
  - Model Resources: update links to the model's HuggingFace card, blog posts, technical reports, API platform, and community links. Search the web or the model's HF card for relevant URLs.
  - Model Highlights: update architecture details (e.g., MoE params, context length, special features like tool calling, interleaved thinking, etc.) from the model card.
  - Prerequisites: update VRAM requirements based on model size and precision.
  - `trtllm-serve` command: update the model ID and use `--extra_llm_api_options` pointing to the standalone AD config YAML under `examples/auto_deploy/model_registry/configs/` (e.g., `examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml`). This is the same standalone config YAML validated in Phase 9 via `build_and_run_ad.py --args.yaml-extra`. It is self-contained: it includes all the settings `trtllm-serve` needs (compile backend, batch size, seq len, transforms, etc.).
  - OpenAI client `MODEL_ID`: update to the correct HF model ID.
  - Evaluation Parameters: update recommended inference parameters from the model's documentation/model card.
  - Additional Resources: update all links to be model-specific.
- Do NOT include cell outputs in the committed notebook. The notebook should be clean with no pre-run outputs, so users run it themselves. (Exception: if the model was already run and outputs were captured during Phase 9, you may include them for reference, but this is optional.)
- Verify the notebook is valid JSON. Malformed `.ipynb` files will not render on GitHub or in Jupyter.
按照现有cookbook的模式,为该模型创建AutoDeploy cookbook笔记本。
- 以 为模板。完全复制其结构。
examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb - 创建新笔记本,保存到 ,使用模型名称的蛇形命名法(例如
examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb)。minimax_m2.7_trtllm_cookbook.ipynb - 调整所有模型专用内容:
- 标题和描述:更新模型名称、HF模型ID和描述。
- Model Resources:更新模型的HuggingFace卡片、博客文章、技术报告、API平台和社区链接。在网络或模型的HF卡片中查找相关URL。
- Model Highlights:从模型卡片中更新架构细节(例如MoE参数、上下文长度、工具调用、 interleaved thinking等特殊功能)。
- Prerequisites:根据模型大小和精度更新显存要求。
- 命令:更新模型ID,并使用
trtllm-serve参数指向--extra_llm_api_options目录下的独立AD配置YAML文件(例如examples/auto_deploy/model_registry/configs/)。这与阶段9中通过examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml验证的独立配置YAML文件相同。该文件是自包含的 —— 它包含build_and_run_ad.py --args.yaml-extra所需的所有设置(编译后端、batch大小、序列长度、转换等)。trtllm-serve - OpenAI客户端 :更新为正确的HF模型ID。
MODEL_ID - Evaluation Parameters:从模型的文档/模型卡片中更新推荐的推理参数。
- Additional Resources:更新所有链接为模型专用链接。
- 提交的笔记本中不要包含单元格输出 —— 笔记本应保持干净,没有预运行的输出,以便用户自行运行。(例外情况:如果模型已在阶段9中运行并捕获了输出,可以将其包含在笔记本中作为参考,但这不是必需的。)
- 验证笔记本是有效的JSON —— 格式错误的 文件无法在GitHub或Jupyter中渲染。
.ipynb
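The last two checks in the list above (no stored cell outputs, valid JSON) can be automated. The following is a minimal sketch using only the standard library. It assumes the usual nbformat 4 layout, where a top-level `cells` list holds code cells with `outputs` and `execution_count` fields; the helper name `check_notebook` is hypothetical and not part of AutoDeploy.

```python
import json


def check_notebook(text: str) -> list[str]:
    """Return a list of problems found in a notebook's JSON text (empty if clean)."""
    try:
        nb = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(nb, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown/raw cells carry no outputs
        if cell.get("outputs"):
            problems.append(f"cell {i} has stored outputs")
        if cell.get("execution_count") is not None:
            problems.append(f"cell {i} has an execution count")
    return problems


if __name__ == "__main__":
    sample = json.dumps({
        "nbformat": 4, "nbformat_minor": 5, "metadata": {},
        "cells": [{"cell_type": "code", "source": ["print('hi')"],
                   "outputs": [], "execution_count": None, "metadata": {}}],
    })
    print(check_notebook(sample) or "notebook is clean")
```

In practice the text would come from reading the new cookbook file; an empty result means the notebook parses and carries no pre-run outputs.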
Phase 12 — Summary Report
阶段12 — 总结报告
⚠️ MANDATORY: You MUST include ALL raw prompts and generated outputs from the final `build_and_run_ad.py --args.yaml-extra` run ⚠️
⚠️ 必填:必须包含最终 `build_and_run_ad.py --args.yaml-extra` 运行的所有原始提示词和生成输出 ⚠️
Print (not file) after completion:
- Model overview + unique features
- Tricky parts needing human review
- Files created/modified (including any new registry configs)
- Test results table (name | validates | PASS/FAIL)
- Known limitations
- Reviewer result (PASS + how many review iterations it took)
- AD end-to-end run result (success/fail, number of iterations, final generation quality)
- Registry entry added/updated in `models.yaml` and any new config YAMLs created
- ALL raw prompts and their corresponding generated outputs from the final successful `build_and_run_ad.py --args.yaml-extra` run. Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.
- Model support matrix update: confirm the row was added to `docs/source/models/supported-models.md` and which footnote (if any) was used.
- AutoDeploy cookbook created: path to the new notebook file (`examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb`).
完成后打印(不要保存为文件):
- 模型概述 + 独特功能
- 需要人工评审的复杂部分
- 创建/修改的文件(包括任何新的注册表配置)
- 测试结果表格(名称 | 验证内容 | PASS/FAIL)
- 已知限制
- 评审结果(PASS + 经过多少次评审迭代)
- AD端到端运行结果(成功/失败、迭代次数、最终生成质量)
- 在 中添加/更新的注册表条目,以及创建的任何新配置YAML文件
models.yaml - 最终成功运行 的所有原始提示词及其对应的生成输出。从运行日志中逐字复制完整的提示词→输出对。不要总结、截断或改写。用户需要查看模型生成的精确内容以判断质量。
build_and_run_ad.py --args.yaml-extra - 模型支持矩阵更新 —— 确认已在 中添加了该行,以及使用了哪个脚注(如果有)。
docs/source/models/supported-models.md - 创建的AutoDeploy cookbook —— 新笔记本文件的路径()。
examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb
Phase 13 — Prepare a Pull Request
阶段13 — 准备Pull Request
GitHub CLI config: Before running any `gh` command, confirm which `GH_CONFIG_DIR` to use. The default is `~/.config/gh`, but a different directory may be needed when targeting a fork (e.g., `nv-auto-deploy/TensorRT-LLM` vs `NVIDIA/TensorRT-LLM`). Check if the user has specified a custom `GH_CONFIG_DIR` (e.g., in `CLAUDE.local.md` or environment). If not, ask the user before proceeding. Prefix all `gh` commands with: `GH_CONFIG_DIR=<path> gh ...`
Prepare a pull request against `upstream` (https://github.com/NVIDIA/TensorRT-LLM) targeting branch `main`. Then, ask the user to provide feedback on the PR and wait for the user to get back to you when the feedback has been posted. Then continue iterating according to the user's feedback. For any comment or other post, please prepend your message with "[AGENT]" so that it is clear that this was a coding agent posting the comment.
When you post a PR, you MUST include:
- ALL raw prompts and their complete generated outputs from the final successful `build_and_run_ad.py --args.yaml-extra` run. Copy-paste the COMPLETE prompt→output pairs verbatim; do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
- A reproducible command:
bash
python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
- A detailed pytest command for the unit tests you added so they can be run by the reviewer as well. Make sure you have run this pytest command on the latest commit that you are pushing, and include these results in the PR.
GitHub CLI配置: 在运行任何 命令前,确认要使用的 。默认路径是 ,但针对fork仓库(例如 vs )可能需要不同的目录。检查用户是否指定了自定义 (例如在 或环境变量中)。如果没有,在继续操作前询问用户。所有 命令前添加前缀:
ghGH_CONFIG_DIR~/.config/ghnv-auto-deploy/TensorRT-LLMNVIDIA/TensorRT-LLMGH_CONFIG_DIRCLAUDE.local.mdghGH_CONFIG_DIR=<path> gh ...准备针对 (https://github.com/NVIDIA/TensorRT-LLM)的Pull Request,目标分支为 。然后,请求用户提供对PR的反馈,并等待用户反馈发布后再继续。根据用户的反馈进行迭代。在发布任何评论或其他内容时,请在消息前添加 "[AGENT]",以便明确这是由编码代理发布的评论。
发布PR时,必须包含:
upstreammain- 最终成功运行 的所有原始提示词及其完整生成输出。逐字复制完整的提示词→输出对 —— 不要总结、截断或改写。评审者需要查看模型生成的精确内容。
build_and_run_ad.py --args.yaml-extra - 可重现的命令:
bash
python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml- 你添加的单元测试的详细pytest命令,以便评审者也可以运行这些测试。确保你已在要推送的最新提交上运行了该pytest命令,并将结果包含在PR中。
⚠️ MANDATORY: Re-run and re-post logs on EVERY PR update — NO EXCEPTIONS ⚠️
⚠️ 必填:每次PR更新时都必须重新运行并重新发布日志 —— 无例外 ⚠️
Every single time you push changes to the PR — whether it is a new commit, a rebase, an amendment, a fixup, or any other update — you MUST:
- Re-run `build_and_run_ad.py --args.yaml-extra` using the `ad-run-agent` subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
- Re-run the full unit test suite (`pytest <test_file> -v`) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
- Post ALL raw output from both runs as a PR comment:
  - The COMPLETE prompt→output pairs from `build_and_run_ad.py` verbatim; do NOT summarize, truncate, or paraphrase.
  - The COMPLETE pytest output verbatim: every test name, every PASSED/FAILED line, every error traceback if any. Do NOT summarize or cherry-pick.
This is not optional. There are no exceptions. Even if the change seems trivial (a typo fix, a comment edit, a formatting change), both runs must be re-executed and the full raw logs must be posted. The reviewer cannot verify correctness without seeing generation output AND test results from the exact code that is currently on the branch.
Workflow for every PR update cycle:
1. Make the requested code changes
2. Commit the changes
3. Before pushing, always rebase onto the target branch to check for conflicts: `git fetch upstream && git rebase upstream/main`. If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first; the branch must be up-to-date with the target branch.
4. Push (or force-push if rebase rewrote history)
5. Re-invoke the `ad-run-agent` to run `build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml` on the updated code
6. Re-run the unit tests: `pytest <test_file> -v`
7. Wait for both runs to complete
8. Post a reply to every PR comment containing:
   - A brief description of what changed in this update
   - The COMPLETE raw prompts and generated outputs from the `build_and_run_ad.py` run
   - The COMPLETE raw pytest output (full verbatim log)
   - The reproducible commands used for both runs
9. Resume polling for new comments (see below)
每次向PR推送更改时 —— 无论是新提交、变基、修正、修补还是任何其他更新 —— 必须:
- 重新运行 ,使用
build_and_run_ad.py --args.yaml-extra子代理,与阶段9中的操作完全相同。代码已更改,因此之前的运行结果已过时且无效。ad-run-agent - 重新运行完整的单元测试套件(),针对阶段6中创建的模型测试文件。代码更改后,之前的测试结果已过时且无效。
pytest <test_file> -v - 将两次运行的所有原始输出作为PR评论发布:
- 逐字复制 运行的完整提示词→输出对 —— 不要总结、截断或改写。
build_and_run_ad.py - 逐字复制完整的pytest输出 —— 每个测试名称、每个PASSED/FAILED行、任何错误回溯(如果有)。不要总结或挑选内容。
- 逐字复制
这不是可选操作。没有例外。 即使更改看起来微不足道(拼写修正、注释编辑、格式更改),也必须重新运行两次操作并发布完整的原始日志。评审者需要查看当前分支上精确代码的生成输出和测试结果,才能验证正确性。
每次PR更新周期的工作流程:
- 进行请求的代码更改
- 提交更改
- 推送前,始终变基到目标分支以检查冲突:。如果有冲突,解决冲突后再继续。不要在变基前推送 —— 分支必须与目标分支保持同步。
git fetch upstream && git rebase upstream/main - 推送(如果变基重写了历史,则强制推送)
- 重新调用 ,在更新后的代码上运行
ad-run-agentbuild_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml - 重新运行单元测试:
pytest <test_file> -v - 等待两次运行完成
- 回复每个PR评论,包含:
- 此更新中更改内容的简要描述
- 运行的完整原始提示词和生成输出
build_and_run_ad.py - 完整的原始pytest输出(逐字日志)
- 两次运行使用的可重现命令
- 恢复轮询新评论(见下文)
⚠️ MANDATORY: Poll PR for new comments every 5 minutes ⚠️
⚠️ 必填:每5分钟轮询PR的新评论 ⚠️
After opening the PR and after every PR update you post, you MUST set up a polling loop that checks for new PR comments every 5 minutes. Do not simply post and walk away — actively monitor the PR for reviewer feedback.
How to poll:
bash
# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check issue-level comments (top-level PR comments, not inline review comments)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state
**Polling loop behavior:**
1. After posting your PR (or posting an update comment), immediately start polling every 5 minutes.
2. On each poll, check for:
   - **New review comments** (inline or top-level) posted after your last comment's timestamp
   - **PR approval status**: check if the PR has been approved
   - **Termination signals**: any comment clearly indicating the agent's work is done (e.g., "LGTM", "looks good, we're done", "no more changes needed", "agent work complete", or similar)
3. If **new actionable comments are found**: stop polling, process the feedback, and execute the full PR update cycle (steps 1–8 above). After posting the update, resume polling.
4. If the **PR is approved** or a **termination signal** is found: stop polling, report to the user that the PR review cycle is complete, and end.
5. If **no new comments** are found: sleep 5 minutes and poll again.
**Do NOT stop polling prematurely.** The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.
在发布PR后,以及每次发布PR更新后,必须设置轮询循环,每5分钟检查一次PR的新评论。 不要发布后就不管了,主动监控PR以获取评审者反馈。
轮询方式:
bash
# 获取所有PR评论,按最新排序,检查是否有在你上次评论之后发布的新评论
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# 同时检查问题级评论(PR的顶级评论,不是内联评审评论)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# 同时检查PR的评审状态
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state
**轮询循环行为:**
1. 在发布PR(或发布更新评论)后,立即开始每5分钟轮询一次。
2. 每次轮询时,检查:
- **新的评审评论**(内联或顶级),发布时间在你上次评论的时间戳之后
- **PR批准状态** —— 检查PR是否已被批准
- **终止信号** —— 任何明确表明代理工作已完成的评论(例如 "LGTM"、"看起来不错,我们完成了"、"无需更多更改"、"代理工作完成" 或类似内容)
3. 如果**发现新的可操作评论**:停止轮询,处理反馈,并执行完整的PR更新周期(上述步骤1–8)。发布更新后,恢复轮询。
4. 如果**PR已被批准**或收到**终止信号**:停止轮询,向用户汇报PR评审周期已完成,并结束工作。
5. 如果**没有新评论**:等待5分钟后再次轮询。
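The polling behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the agent's actual implementation: the function names are hypothetical, `GH_CONFIG_DIR` is assumed to be set in the environment that `gh` inherits, comment timestamps are assumed to be GitHub's UTC ISO 8601 strings (which compare correctly as text), and the approval check via `gh pr view` and the two-hour check-in with the user are left out for brevity.

```python
import json
import subprocess
import time

# Phrases treated as termination signals (taken from the examples in the text).
TERMINATION_PHRASES = ("lgtm", "no more changes needed", "agent work complete")


def fetch_comments(repo: str, pr_number: int, kind: str = "issues") -> list[dict]:
    """Fetch PR comments via `gh api`; kind is "issues" (top-level) or "pulls" (inline)."""
    out = subprocess.run(
        ["gh", "api", f"repos/{repo}/{kind}/{pr_number}/comments?per_page=100"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def classify(comments: list[dict], last_post_ts: str) -> str:
    """Return "done", "feedback", or "idle" for comments newer than last_post_ts."""
    new = [c for c in comments
           if c["created_at"] > last_post_ts and not c["body"].startswith("[AGENT]")]
    if any(p in c["body"].lower() for c in new for p in TERMINATION_PHRASES):
        return "done"
    return "feedback" if new else "idle"


def poll(fetch, last_post_ts: str, interval_s: float = 300, sleep=time.sleep) -> str:
    """Poll until feedback or a termination signal arrives; return which one."""
    while True:
        status = classify(fetch(), last_post_ts)
        if status != "idle":
            return status
        sleep(interval_s)
```

Passing `fetch` and `sleep` as arguments keeps the loop testable without network access; a real run would pass something like `lambda: fetch_comments("<owner>/<repo>", <PR_NUMBER>)`.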
**不要提前停止轮询。** 循环必须持续到PR被批准或收到明确的终止信号。如果轮询已运行较长时间(例如>2小时)且没有新活动,请告知用户你仍在监控,并询问用户是否希望你继续或停止。
Sharding-aware IR model porting (`modeling_*_ir.py`)
支持分片的IR模型移植(`modeling_*_ir.py`)
Use this when porting an existing AutoDeploy custom model (`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py`) to explicit sharding hint ops in `modeling_*_ir.py` in the same directory (no separate `new_sharding/` tree). The exported FX graph must fully specify how the model should be sharded: the `apply_sharding_hints` transform combines hints with a runtime `DistConfig` for deterministic, node-local sharding.
Argument reference: Do not duplicate operator tables here. Refer to the custom op docstrings in `tensorrt_llm/_torch/auto_deploy/custom_ops/` for the complete argument reference (including sharding hints, `tp_mode`, `layer_type`, and which ops accept hints).
当将现有AutoDeploy自定义模型(`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py`)移植到同一目录下的 `modeling_*_ir.py` 文件中的显式分片提示算子时,使用此流程(无需单独的 `new_sharding/` 目录)。导出的FX图必须完全指定模型的分片方式:`apply_sharding_hints` 转换会将提示与运行时 `DistConfig` 结合,实现确定性的节点本地分片。
参数参考: 不要在此处重复算子表。请参考 `tensorrt_llm/_torch/auto_deploy/custom_ops/` 中的自定义算子文档字符串,获取完整的参数参考(包括分片提示、`tp_mode`、`layer_type`,以及哪些算子接受提示)。
Reference examples (study before porting)
参考示例(移植前请研究)
| Original | IR / sharding-aware | Layer types |
|---|---|---|
| | | Mamba SSM, MHA, SwiGLU MLP, MoE |
| | | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| | | MHA, SwiGLU MLP (simplest) |
| | | MLA, SwiGLU MLP, MoE |
| 原始文件 | IR/支持分片的文件 | 层类型 |
|---|---|---|
| | | Mamba SSM, MHA, SwiGLU MLP, MoE |
| | | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| | | MHA, SwiGLU MLP(最简单) |
| | | MLA, SwiGLU MLP, MoE |
Step-by-step porting procedure
分步移植流程
Step 1: Copy the source file
步骤1:复制源文件
bash
cp tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo.py \
   tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo_ir.py
Step 2: Update the module docstring and add imports
步骤2:更新模块文档字符串并添加导入
At the top of the IR file:
python
import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401 -- register all ops
Do not add global `SHARD_*` flags. Layer-level control uses the `layer_type` hint on each op and `shard_layers` in YAML.
在IR文件顶部添加:
python
import tensorrt_llm._torch.auto_deploy.custom_ops  # noqa: F401 -- 注册所有算子
不要添加全局 `SHARD_*` 标志。层级控制使用每个算子上的 `layer_type` 提示和YAML中的 `shard_layers`。
Step 3: Replace linear projections
步骤3:替换线性投影
For every `self.proj(x)` or `nn.Linear` call, use `torch.ops.auto_deploy.torch_linear_simple` with explicit `tp_mode` and `layer_type`. Always set `tp_mode` unconditionally (no `if _s else "none"`). Rules: opening projections (Q/K/V/gate/up/in_proj) → `"colwise"`; closing (O/down/out_proj) → `"rowwise"`; tiny outputs (e.g. `shared_expert_gate` dim 1) → `"none"`; MLA latent projections (q_a, kv_a) → `"none"`. For fused weights split later, pass `output_sizes=[...]`. For GQA, use `tp_min_local_shape=self.head_dim` on K/V colwise lines.
对于每个 `self.proj(x)` 或 `nn.Linear` 调用,使用 `torch.ops.auto_deploy.torch_linear_simple`,并显式指定 `tp_mode` 和 `layer_type`。始终无条件设置 `tp_mode`(不要使用 `if _s else "none"`)。规则: 起始投影(Q/K/V/gate/up/in_proj)→ `"colwise"`;结束投影(O/down/out_proj)→ `"rowwise"`;小输出(例如 `shared_expert_gate` 的维度1)→ `"none"`;MLA潜在投影(q_a, kv_a)→ `"none"`。对于后续会拆分的融合权重,传递 `output_sizes=[...]`。对于GQA,在K/V的colwise行上使用 `tp_min_local_shape=self.head_dim`。
Step 4: Replace split / chunk after fused colwise projections
步骤4:替换融合colwise投影后的split / chunk
Use `torch.ops.auto_deploy.split_with_sizes` with `shardable` / `layer_type` where sizes scale with TP.
当尺寸随TP缩放时,使用带有 `shardable` / `layer_type` 的 `torch.ops.auto_deploy.split_with_sizes`。
Step 5: Replace view / reshape with concrete head counts
步骤5:用具体的头数替换view / reshape
During `torch.export`, `-1` becomes a concrete value; after TP sharding that baked-in value is wrong and the reshape breaks. Any reshape whose dimension is a head count that scales with TP must use `torch.ops.auto_deploy.view` with `tp_scaled_dim` set appropriately. Safe cases: flat-to-2D, or `[B,S,-1]` when the input is already correctly sharded.
在 `torch.export` 过程中,`-1` 会变为具体值;TP后,错误的值会导致失败。任何维度为随TP缩放的头数的reshape操作,必须使用 `torch.ops.auto_deploy.view`,并适当设置 `tp_scaled_dim`。安全场景:扁平到2D,或输入已正确分片时的 `[B,S,-1]`。
Step 6: Insert `all_reduce`
步骤6:插入 `all_reduce`
After every rowwise projection, add `torch.ops.auto_deploy.all_reduce(..., layer_type=...)`. Parallel branch rule: when branches merge by addition, use a single `all_reduce` after the sum (e.g. MoE routed + shared expert; parallel attention + MLP residual branches).
在每个rowwise投影后,添加 `torch.ops.auto_deploy.all_reduce(..., layer_type=...)`。并行分支规则: 当分支通过加法合并时,在求和后使用单个 `all_reduce`(例如MoE路由专家 + 共享专家;并行注意力 + MLP残差分支)。
Step 7: Special ops (Conv1d, SSM, GatedDeltaNet, gated RMSNorm)
步骤7:特殊算子(Conv1d, SSM, GatedDeltaNet, gated RMSNorm)
Add sharding hints on `torch_causal_conv1d`, `torch_ssm`, `torch_gated_delta_rule`, `torch_rmsnorm_gated` per docstrings, typically `shardable` / `output_sizes` / `tp_mode` as required.
根据文档字符串为 `torch_causal_conv1d`、`torch_ssm`、`torch_gated_delta_rule`、`torch_rmsnorm_gated` 添加分片提示,通常需要设置 `shardable` / `output_sizes` / `tp_mode`。
Step 8: MoE
步骤8:MoE
Pass `layer_type="moe"` into `torch_moe`; `apply_sharding_hints` handles EP/TP.
将 `layer_type="moe"` 传递给 `torch_moe`;`apply_sharding_hints` 会处理EP/TP。
Step 9: Register the IR model
步骤9:注册IR模型
- Bottom of the IR file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)` (same pattern as Phase 4).
- Add a side-effect import in `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` (e.g. `from . import modeling_foo_ir  # noqa: F401`) and extend `__all__` if you export symbols. Without this import, worker processes may not load your class and `apply_sharding_hints` can report 0 nodes processed. Do not use a separate `register_sharded_models.py` indirection.
- 在IR文件末尾添加:`AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`(与阶段4相同的模式)。
- 在 `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` 文件中添加副作用导入(例如 `from . import modeling_foo_ir  # noqa: F401`),如果导出符号则扩展 `__all__`。如果没有此导入,工作进程可能无法加载你的类,`apply_sharding_hints` 可能会报告0个节点被处理。不要使用单独的 `register_sharded_models.py` 间接导入。
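As a compact restatement of the `tp_mode` rules from Step 3 and the `all_reduce` rule from Step 6, the sketch below encodes them as a lookup. It is purely illustrative: the helper names and the projection names (`q_a_proj`, `kv_a_proj`, and so on) are assumptions chosen for the example, not identifiers from AutoDeploy, and real code passes these hints directly to the custom ops.

```python
# Hypothetical lookup tables: projection role -> tp_mode hint (Step 3 rules).
OPENING = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "in_proj", "q_b_proj"}
CLOSING = {"o_proj", "down_proj", "out_proj"}
REPLICATED = {"shared_expert_gate", "q_a_proj", "kv_a_proj"}  # tiny outputs, MLA latents


def tp_mode_for(name: str) -> str:
    """Return the unconditional tp_mode hint for a projection name."""
    if name in OPENING:
        return "colwise"
    if name in CLOSING:
        return "rowwise"
    if name in REPLICATED:
        return "none"
    raise KeyError(f"no sharding rule recorded for {name!r}")


def needs_all_reduce(name: str) -> bool:
    """Rowwise outputs are partial sums, so they need an all_reduce afterwards (Step 6)."""
    return tp_mode_for(name) == "rowwise"


if __name__ == "__main__":
    for proj in ("q_proj", "o_proj", "shared_expert_gate"):
        print(proj, tp_mode_for(proj), needs_all_reduce(proj))
```

Raising on unknown names, rather than defaulting to `"none"`, mirrors the guidance that every hint should be set explicitly.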
Step 10: YAML — composable registry pattern
步骤10:YAML —— 可组合注册表模式
Prefer the model registry (`examples/auto_deploy/model_registry/models.yaml`) and compose shared fragments under `examples/auto_deploy/model_registry/configs/`, same as other models: list `dashboard_default.yaml`, the right `world_size_N.yaml`, then a dedicated fragment (e.g. `enable_sharder_ir.yaml`) that holds IR sharding transforms. That fragment should disable legacy sharding passes and enable hint-driven sharding. Registry fragments are deep-merged in `yaml_extra` order (see `DynamicYamlMixInForSettings` in `tensorrt_llm/_torch/auto_deploy/utils/_config.py`); place transform keys under `transforms:` so they merge with `dashboard_default.yaml`. Standalone experiment YAMLs for `build_and_run_ad` may wrap the same fields under a top-level `args:` block matching `LlmArgs`.
Example transform block:
yaml
# Typical contents for enable_sharder_ir.yaml (registry composable fragment)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2  # often required when expert count is large (>64)
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp']  # optional selective sharding
  gather_logits_before_lm_head:
    enabled: true
Use `world_size: 8` when validating TP head-divisibility. Optional `shard_layers` limits which `layer_type` hints are processed; unset means shard all shardable nodes.
优先使用模型注册表(`examples/auto_deploy/model_registry/models.yaml`),并在 `examples/auto_deploy/model_registry/configs/` 目录下组合共享片段,与其他模型相同:列出 `dashboard_default.yaml`、合适的 `world_size_N.yaml`,然后是专用片段(例如 `enable_sharder_ir.yaml`),该片段包含IR分片转换。该片段应禁用旧版分片传递,并启用基于提示的分片。注册表片段会按 `yaml_extra` 顺序深度合并(见 `tensorrt_llm/_torch/auto_deploy/utils/_config.py` 中的 `DynamicYamlMixInForSettings`);将转换键放在 `transforms:` 下,以便与 `dashboard_default.yaml` 合并。用于 `build_and_run_ad` 的独立实验YAML文件可能会将相同字段包装在与 `LlmArgs` 匹配的顶层 `args:` 块下。
示例转换块:
yaml
# enable_sharder_ir.yaml的典型内容(注册表可组合片段)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2  # 当专家数量较大时(>64)通常需要设置
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp']  # 可选的选择性分片
  gather_logits_before_lm_head:
    enabled: true
验证TP头可分性时使用 `world_size: 8`。可选的 `shard_layers` 限制处理哪些 `layer_type` 提示;未设置时会分片所有可分片节点。
Step 11: Validate
步骤11:验证
Do not report success until a run completes successfully.
- Prefer `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry` after adding/updating the registry entry and composable YAMLs (Phase 8–9 style).
- `apply_sharding_hints` logs should show `N nodes processed` with N > 0.
- If validation fails with infrastructure limits (e.g. head count not divisible by `world_size`), document the assert and compatible sizes; do not "fix" core `sharding.py` / custom op schemas without owner review.
- If blocked by missing infrastructure support, rename artifacts to `broken_modeling_*_ir.py` / broken YAML and file a short error report for humans (do not silently patch core transforms).
Layer type strings (for `layer_type` / `shard_layers`): use `"mha"`, `"mla"`, `"mlp"`, `"moe"`, `"ssm"`, `"delta"`, or `"unknown"` (default; skipped when `shard_layers` is set). Match the conventions used in `apply_sharding_hints` and project enums.
不要报告成功,直到运行完全成功。
- 在添加/更新注册表条目和可组合YAML文件后,优先使用 `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry`(阶段8–9的方式)。
- `apply_sharding_hints` 日志应显示 `N nodes processed`,且N > 0。
- 如果验证因基础设施限制失败(例如头数无法被 `world_size` 整除),记录断言和兼容的尺寸;不要在未经所有者评审的情况下“修复”核心 `sharding.py` / 自定义算子模式。
- 如果因缺少基础设施支持而受阻,将工件重命名为 `broken_modeling_*_ir.py` / broken YAML,并为人类编写简短的错误报告(不要静默修补核心转换)。
层类型字符串(用于 `layer_type` / `shard_layers`):使用 `"mha"`、`"mla"`、`"mlp"`、`"moe"`、`"ssm"`、`"delta"` 或 `"unknown"`(默认;设置 `shard_layers` 时会被跳过)。与 `apply_sharding_hints` 和项目枚举中使用的约定保持一致。
Layer-specific sharding patterns
特定层的分片模式
MHA (standard or gated): `layer_type="mha"`: q/k/v colwise (GQA: `tp_min_local_shape`), `view` with `tp_scaled_dim` for head dim, o rowwise + `all_reduce`. Fused Q+gate interleaved per head: colwise without `output_sizes`; contiguous Q|K|V fused blocks need `output_sizes`.
SwiGLU MLP: `layer_type="mlp"`: gate/up colwise, down rowwise + `all_reduce`.
Mamba / SSM: `layer_type="ssm"`: in_proj colwise + `output_sizes`, splits shardable, conv1d shardable + `output_sizes`, views, `torch_ssm` shardable, norm gated colwise if weight scales, out rowwise + `all_reduce`.
GatedDeltaNet: `layer_type="delta"`: in_proj_qkv with `output_sizes`, other in_projs colwise, conv1d/splits/views as above, `torch_gated_delta_rule` shardable, out rowwise + `all_reduce`.
MoE + shared expert: `layer_type="moe"`: router replicated; one `all_reduce` after `routed + shared`, not two.
MLA (DeepSeek): `layer_type="mla"`: keep `torch_mla` intact with `shardable=True`; do not decompose into separate linears + `torch_attention` (introduces bad `expand` / `view` with concrete head counts). q_a/kv_a latent: `tp_mode="none"`; q_b colwise; `o_proj` rowwise + `all_reduce`.
MHA(标准或门控): `layer_type="mha"`:q/k/v colwise(GQA:`tp_min_local_shape`),头维度使用带有 `tp_scaled_dim` 的 `view`,o rowwise + `all_reduce`。每个头融合的Q+gate交错:仅colwise,无需 `output_sizes`;连续的Q|K|V融合块需要 `output_sizes`。
SwiGLU MLP: `layer_type="mlp"`:gate/up colwise,down rowwise + `all_reduce`。
Mamba / SSM: `layer_type="ssm"`:in_proj colwise + `output_sizes`,splits可分片,conv1d可分片 + `output_sizes`,views,`torch_ssm` 可分片,gated归一化colwise(如果权重缩放),out rowwise + `all_reduce`。
GatedDeltaNet: `layer_type="delta"`:in_proj_qkv 带有 `output_sizes`,其他in_projs colwise,conv1d/splits/views 如上,`torch_gated_delta_rule` 可分片,out rowwise + `all_reduce`。
MoE + 共享专家: `layer_type="moe"`:router复制;在 `routed + shared` 后使用一个 `all_reduce`,不要使用两个。
MLA(DeepSeek): `layer_type="mla"`:保持 `torch_mla` 完整,设置 `shardable=True`;不要分解为单独的线性层 + `torch_attention`(这会引入带有具体头数的不良 `expand` / `view`)。q_a/kv_a 潜在层:`tp_mode="none"`;q_b colwise;`o_proj` rowwise + `all_reduce`。
Common pitfalls (sharding IR)
常见陷阱(分片IR)
- Missing `auto_deploy::view` for head reshapes: concrete shapes from export break after sharding.
- Sharding tiny projections: dim-1 gates need `tp_mode="none"`.
- Double `all_reduce` in MoE: use one merge-point reduction for routed + shared.
- Cross-layer parameter contamination: in `_apply_hint_*` handlers using `get_source_nodes()`, restrict with `allowed_ops` so residual links do not pull weights from other layers.
- Missing `num_moe_experts_for_export` for very large expert counts: export can hang.
- Decomposing ops that absorb weights (e.g. `torch_mla`): use `shardable` + handler instead of splitting into plain linears.
- Interleaved vs contiguous fused weights: interleaved per-head groups are colwise only; contiguous Q|K|V blocks require `output_sizes`.
- Omitting `layer_type` when using `shard_layers`: `"unknown"` nodes are skipped; set hints explicitly on sharding-aware ops.
- `layer_type` on non-hint ops: do not pass `layer_type` to ops that are not designed for sharding hints (e.g. `torch_attention`, `torch_l2norm`, `torch_rope_*`); extra positional args break calls. Confirm in `custom_ops/` docstrings which ops accept hints.
- Conditional hint values: no `if _s else "none"`; use unconditional hints and rely on `shard_layers` / transform config.
- 头reshape操作缺少 `auto_deploy::view`:导出的具体形状在分片后会失效。
- 分片小型投影:维度1的门控应使用 `tp_mode="none"`。
- MoE中使用双重 `all_reduce`:在路由专家 + 共享专家的合并点使用一次归约。
- 跨层参数污染:在使用 `get_source_nodes()` 的 `_apply_hint_*` 处理程序中,使用 `allowed_ops` 限制范围,避免残差链接从其他层获取权重。
- 专家数量很大时缺少 `num_moe_experts_for_export`:导出可能会挂起。
- 分解吸收权重的算子(例如 `torch_mla`):使用 `shardable` + 处理程序,而不是拆分为纯线性层。
- 交错vs连续融合权重:每个头交错的组仅colwise;连续的Q|K|V块需要 `output_sizes`。
- 使用 `shard_layers` 时省略 `layer_type`:`"unknown"` 节点会被跳过;在支持分片的算子上显式设置提示。
- 在非提示算子上设置 `layer_type`:不要将 `layer_type` 传递给不支持分片提示的算子(例如 `torch_attention`、`torch_l2norm`、`torch_rope_*`);额外的位置参数会破坏调用。请在 `custom_ops/` 文档字符串中确认哪些算子接受提示。
- 条件提示值:不要使用 `if _s else "none"`;使用无条件提示,并依赖 `shard_layers` / 转换配置。
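The "double `all_reduce`" pitfall rests on the fact that a sum all-reduce is linear: reducing the sum of two partial outputs gives the same result as reducing each one and then adding, at half the communication. The sketch below simulates this on two ranks with plain Python lists; the `all_reduce` function here is a local stand-in written for the example, not the AutoDeploy op, and the numbers are made up.

```python
def all_reduce(per_rank: list[list[float]]) -> list[float]:
    """Simulated sum all-reduce: element-wise sum of each rank's partial tensor."""
    return [sum(vals) for vals in zip(*per_rank)]


# Partial outputs of two rowwise branches on 2 ranks (illustrative values).
routed = [[1.0, 2.0], [3.0, 4.0]]   # routed-expert partials, one row per rank
shared = [[0.5, 0.5], [1.5, 2.5]]   # shared-expert partials, one row per rank

# Two reductions, then add: correct but communicates twice.
two_reductions = [r + s for r, s in zip(all_reduce(routed), all_reduce(shared))]

# Add locally on each rank, then reduce once at the merge point.
merged = [[r + s for r, s in zip(rr, ss)] for rr, ss in zip(routed, shared)]
one_reduction = all_reduce(merged)

if __name__ == "__main__":
    print(two_reductions, one_reduction)
```

Both paths yield the same values, which is why a single reduction after `routed + shared` is sufficient.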
Sharding IR validation checklist (human review)
分片IR验证清单(人工评审)
- `world_size=1`: unsharded path; hints should not break correctness.
- `world_size=2` and `8`: shape checks and coherent output.
- `apply_sharding_hints` node count vs expectation.
- Optional: `shard_layers: ['moe']` to verify selective sharding.
- `world_size=1`:未分片路径;提示不应破坏正确性。
- `world_size=2` 和 `8`:形状检查和连贯输出。
- `apply_sharding_hints` 节点数量与预期一致。
- 可选:`shard_layers: ['moe']` 验证选择性分片。
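Before spending GPU time on the `world_size=2` and `8` runs in the checklist, a quick arithmetic pre-check of head divisibility can flag configurations that are likely to hit an assert. The helper below is hypothetical and encodes an assumption that should be verified against the actual sharding code: query heads must divide evenly across ranks, while KV heads may either divide evenly or be replicated when there are fewer KV heads than ranks.

```python
def check_tp(num_heads: int, num_kv_heads: int, world_size: int) -> str:
    """Classify a (heads, kv_heads, world_size) combination for TP sharding.

    Assumed policy (not taken from AutoDeploy source): query heads must split
    evenly; KV heads split evenly or are replicated across groups of ranks.
    """
    if num_heads % world_size != 0:
        return "incompatible"
    if num_kv_heads % world_size == 0:
        return "ok"
    if world_size % num_kv_heads == 0:
        return "ok-kv-replicated"
    return "incompatible"


if __name__ == "__main__":
    for ws in (1, 2, 8):
        print(ws, check_tp(num_heads=32, num_kv_heads=4, world_size=ws))
```

A result of `"incompatible"` is a prompt to document compatible sizes, as Step 11 asks, rather than to patch core sharding code.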
Key Gotchas
关键注意事项
- Canonical ops first: Always use `torch.ops.auto_deploy.torch_*` canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
- No `repeat_interleave`: AD attention ops handle GQA natively. Never repeat K/V heads manually.
- Lean code: Every line should serve prefill export. No optional HF features, no dead code paths, no fallback logic.
- Reuse config classes: Import from `transformers` or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
- Assert `position_ids`: Always assert `position_ids is not None`; it is a required input, never optional.
- Self-contained files only: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from HF source.
- RoPE cos/sin: slice ONCE, not per layer. `_ad_` prefix for RoPE buffers. `RotaryEmbedding.forward(x, position_ids)` MUST slice by `position_ids` once and return pre-sliced `(cos, sin)`. Pass those tensors to all layers. NEVER pass `position_ids` through to each layer/attention forward to re-index; that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
- MoE weights: use per-expert `nn.ModuleList` for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused `trtllm` kernels at deployment time.
- Vision towers are typically not exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only `torch_*` prefixed reference ops in AutoDeploy; never `triton_*`, `flashinfer_*`, or `trtllm_*`.
- 优先使用标准算子: 只要存在对应的操作,就必须使用 `torch.ops.auto_deploy.torch_*` 标准算子。这是AD知道如何优化的方式。用纯PyTorch手动实现注意力、MoE、RoPE或归一化,而不使用标准算子,会导致AD转换无法工作。
- 不要使用 `repeat_interleave`: AD注意力算子原生支持GQA。永远不要手动重复K/V头。
- 精简代码: 每一行都应为预填充导出服务。不要包含可选的HF功能、无效代码路径或fallback逻辑。
- 重用配置类: 尽可能从 `transformers` 导入或从checkpoint加载。仅当配置类确实不存在时才打包该类。
- 断言 `position_ids`: 始终断言 `position_ids is not None`,这是必填输入,永远不是可选的。
- 仅使用独立文件: 永远不要从其他AD自定义模型导入。每个 `modeling_{name}.py` 文件都是从HF源码独立转换而来。
- RoPE cos/sin:仅切片一次,不要每层切片。 RoPE缓冲区使用 `_ad_` 前缀。`RotaryEmbedding.forward(x, position_ids)` 必须按 `position_ids` 切片一次,并返回预切片的 `(cos, sin)`。将这些张量传递给所有层。永远不要将 `position_ids` 传递到每个层/注意力前向方法中重新索引,这是冗余计算,会膨胀导出的图。请见阶段2的完整模式。
- MoE权重:使用 `nn.ModuleList` 存储每个专家的权重,以兼容checkpoint。为HF堆叠格式编写测试专用的state_dict转换器。
- `noaux_tc` 路由器(DeepSeek-V3风格):使用纯PyTorch实现(sigmoid + bias + group topk + normalize + scale)。AD转换会在部署阶段将其替换为融合的 `trtllm` 内核。
- 视觉塔通常不导出。将视觉逻辑保留在eager PyTorch中,除非明确要求,否则仅导出文本路径。
- 模型代码和测试必须能在CPU上运行。在AutoDeploy中仅使用 `torch_*` 前缀的参考算子,永远不要使用 `triton_*`、`flashinfer_*` 或 `trtllm_*`。
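The `noaux_tc` routing recipe named above (sigmoid + bias + group topk + normalize + scale) is easiest to follow on a single token. The sketch below is a pure-Python illustration, not the model code the skill asks for (that should be vanilla PyTorch). It assumes the DeepSeek-V3-style convention that each group is scored by the sum of its two best biased scores, that the bias influences selection only, and that the final weights come from the unbiased sigmoid scores; the function and argument names are hypothetical.

```python
import math


def noaux_tc_route(logits, bias, n_group, topk_group, top_k, scale):
    """Route one token; return (expert_ids, weights)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid
    biased = [s + b for s, b in zip(scores, bias)]           # bias used for selection only
    per_group = len(logits) // n_group
    groups = [list(range(g * per_group, (g + 1) * per_group)) for g in range(n_group)]
    # Score each group by the sum of its two best biased scores; keep the best groups.
    group_score = [sum(sorted((biased[i] for i in ids), reverse=True)[:2]) for ids in groups]
    kept = sorted(range(n_group), key=lambda g: group_score[g], reverse=True)[:topk_group]
    # Pick the top_k experts among the surviving groups.
    candidates = [i for g in kept for i in groups[g]]
    chosen = sorted(candidates, key=lambda i: biased[i], reverse=True)[:top_k]
    # Normalize the unbiased scores of the chosen experts, then scale.
    total = sum(scores[i] for i in chosen) + 1e-20
    return chosen, [scores[i] / total * scale for i in chosen]


if __name__ == "__main__":
    logits = [2.0, 1.0, -1.0, -2.0, 0.0, 0.0, 3.0, 0.5]
    ids, weights = noaux_tc_route(logits, [0.0] * 8, n_group=4, topk_group=2, top_k=2, scale=2.5)
    print(ids, weights)
```

Because every step is ordinary arithmetic and sorting, the same logic maps directly onto `sigmoid`, `topk`, and `gather` in PyTorch, which is what lets AD transforms later swap in fused kernels.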