ad-add-fusion-transformation
Autodeploy: Add Fusion Transformation Pass
Where this skill applies
This file lives in the trtllm-agent-toolkit plugin. Paths such as `tensorrt_llm/...`, `examples/auto_deploy/...`, and `tests/...` are relative to a TensorRT-LLM source checkout on the user’s machine, not the plugin tree.
After installing the plugin (see the toolkit `README.md`), skills use the `trtllm-agent-toolkit:` prefix (for example `trtllm-agent-toolkit:ad-add-fusion-transformation`).
Related skills in this plugin
| Skill | Use it for |
|---|---|
| ad-graph-dump | Enabling `AD_DUMP_GRAPHS_DIR` graph dumps; dump capture and file semantics. |
| trtllm-codebase-exploration | Mapping existing transforms, custom ops, and search patterns before writing a pass. |
| trtllm-code-contribution | TensorRT-LLM pre-commit, tests, DCO sign-off, and PR expectations. |
| triton-kernel-writing | Implementing a Triton op only after existing-kernel lookup fails. |
| triton-tileir-optimization | Tuning existing Triton kernels for the TileIR backend when that path applies. |
Use this skill when you already know which subgraph or pattern you are targeting (from graph dumps, logs, or code reading). For dump capture and file semantics, follow ad-graph-dump first.
When to use this skill
- Adding, extending, or reviewing a fusion under AutoDeploy transforms in a TensorRT-LLM tree.
Workflow (concise)
- Confirm the pattern in current graph dumps (see ad-graph-dump).
- Search for an existing kernel or custom-op path before new Triton or CUDA.
- Implement the smallest change that proves correctness and matching; add tests.
- Re-run dumps and tests; if outputs drift, separate matching issues from metadata loss from numeric differences.
Finding fusion candidates (lightweight)
Do this before writing a new pass so you work on real graph structure.
Inputs
- Graph dump directory from a run with `AD_DUMP_GRAPHS_DIR` set (see ad-graph-dump).
- Model id and active AutoDeploy config (`default.yaml`, registry YAML, overlays).
- TensorRT-LLM source tree for kernel and transform lookup.
Outputs
- Ordered list of candidates with: graph evidence, existing-kernel lookup (`found` / `not_found`), recommendation (`use_existing_kernel`, `needs_triton_fallback`, `defer`), and trade-offs (complexity, correctness risk).
Discovery workflow
- Parse dumps for repeated unfused patterns (element-wise chains, norm chains, epilogues, attention-adjacent ops).
- Search the tree for equivalent transforms or custom ops; record file/symbol evidence.
- If nothing fits, mark Triton or other kernel work as a deliberate fallback.
- Prefer candidates with clear recurrence, existing support, and lower numerical risk.
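The first step of the discovery workflow can be approximated by counting repeated runs of op names. The sketch below assumes you have already extracted an ordered list of op names from a dump; the dump file format is not described here, so that extraction step is left out and the op names are made up for illustration. Adjacency in a listing only approximates dataflow, so confirm any hit against the actual graph edges.

```python
from collections import Counter


def count_op_chains(op_names, length=3):
    """Count every run of `length` consecutive op names in an ordered op list."""
    windows = zip(*(op_names[i:] for i in range(length)))
    return Counter(windows)


def repeated_chains(op_names, length=3, min_count=2):
    """Return (chain, count) pairs seen at least `min_count` times, most frequent first."""
    counts = count_op_chains(op_names, length)
    return [(chain, n) for chain, n in counts.most_common() if n >= min_count]


# Hypothetical op sequence; real names must come from your own dumps.
ops = ["linear", "mul", "add", "silu", "linear", "mul", "add", "silu", "linear"]
for chain, n in repeated_chains(ops, length=3):
    print(n, " -> ".join(chain))
```

Chains that recur once per layer are usually the ones worth carrying into the existing-kernel lookup.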
Per-candidate template
```text
Candidate: <short-name>
Affected graph pattern: <pattern>
Existing kernel lookup: <found|not_found>
Evidence: <path/symbol>
Recommendation: <use_existing_kernel|needs_triton_fallback|defer>
Strengths / weaknesses / risks:
- ...
```
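If you keep candidates in a script rather than in free text, the template maps onto a small record type. This is a sketch only: the `Candidate` class, the `RANK` ordering, and the example values are hypothetical, and the ordering simply encodes the stated preference for existing support over new kernel work.

```python
from dataclasses import dataclass, field

# Assumed preference order: reuse first, new kernel second, unresolved last.
RANK = {"use_existing_kernel": 0, "needs_triton_fallback": 1, "defer": 2}


@dataclass
class Candidate:
    name: str
    pattern: str
    kernel_lookup: str   # "found" or "not_found"
    evidence: str        # path/symbol backing the lookup result
    recommendation: str  # one of the RANK keys
    notes: list = field(default_factory=list)

    def render(self):
        """Format the record in the per-candidate template layout."""
        lines = [
            f"Candidate: {self.name}",
            f"Affected graph pattern: {self.pattern}",
            f"Existing kernel lookup: {self.kernel_lookup}",
            f"Evidence: {self.evidence}",
            f"Recommendation: {self.recommendation}",
            "Strengths / weaknesses / risks:",
        ]
        lines += [f"- {note}" for note in self.notes]
        return "\n".join(lines)


def order_candidates(candidates):
    """Sort by recommendation rank; ties keep their input order."""
    return sorted(candidates, key=lambda c: RANK[c.recommendation])


candidates = [
    Candidate("cand-b", "<pattern>", "not_found", "<path/symbol>", "needs_triton_fallback"),
    Candidate("cand-a", "<pattern>", "found", "<path/symbol>", "use_existing_kernel",
              ["reuses an existing op"]),
]
print(order_candidates(candidates)[0].render())
```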
Guardrails
- Do not skip existing-kernel lookup.
- Do not default to Triton when a viable existing op already exists.
- If uncertain, `defer` and narrow the question with one more dump or test.
Inputs (implementation)
- Chosen candidate or concrete subgraph.
- Active model and config files.
- Fresh graph dumps when available.
- Current baseline: match counts from logs, unit test status, any accuracy notes you already maintain.
Outputs (implementation)
- Pass design or patch: registered transform, `default.yaml` entry, optional model-registry YAML.
- Path decision: `existing_kernel_path` vs `triton_fallback_path` (or other kernel stack).
- Validation notes: graph evidence, before/after `[SUMMARY] matches=...` from AutoDeploy logs, test results.
Implementation workflow
- Align the pass with observed graph structure from dumps — not assumed op names from docs alone.
- Search `transform/library/`, `custom_ops/`, `torch.ops.auto_deploy.*`, and related tests for reuse.
- Integrate an existing op when possible; otherwise delegate kernel work to the appropriate skill (triton-kernel-writing, cuda-kernel-writing, etc.).
- Keep one logical change per patch; extend tests in the same change.
- Re-read dumps after the change; if match counts collapse, suspect pattern availability or metadata propagation.
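The last step depends on comparing match counts between runs. A minimal sketch of that comparison follows, assuming only that each relevant log line contains `[SUMMARY] matches=<n>`. How the transform key appears on the line is not specified here, so the caller supplies a regex for it; the `transform=...` layout in the sample lines is made up for illustration and is not a real AutoDeploy log format.

```python
import re

SUMMARY_RE = re.compile(r"\[SUMMARY\] matches=(\d+)")


def summary_counts(lines, key_pattern):
    """Map transform key -> match count for lines carrying `[SUMMARY] matches=<n>`.

    `key_pattern` is a caller-supplied regex with one group that extracts the
    transform key, because the rest of the line layout is an assumption.
    """
    key_re = re.compile(key_pattern)
    counts = {}
    for line in lines:
        summary = SUMMARY_RE.search(line)
        key = key_re.search(line)
        if summary and key:
            counts[key.group(1)] = int(summary.group(1))
    return counts


def collapsed(before, after):
    """Return keys whose count dropped to zero or disappeared after the change."""
    return sorted(k for k, n in before.items() if n > 0 and after.get(k, 0) == 0)


# Made-up line layout for illustration only; adapt key_pattern to your real logs.
before = summary_counts(["transform=fuse_a [SUMMARY] matches=4",
                         "transform=fuse_b [SUMMARY] matches=2"], r"transform=(\w+)")
after = summary_counts(["transform=fuse_a [SUMMARY] matches=4",
                        "transform=fuse_b [SUMMARY] matches=0"], r"transform=(\w+)")
print(collapsed(before, after))
```

Lines reporting `skipped` or `disabled` carry no count and are ignored by this sketch, so a transform that moved from matching to skipped shows up as collapsed.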
Where fusion passes live
- Transforms: `tensorrt_llm/_torch/auto_deploy/transform/library/`
- Registry / base behavior: `tensorrt_llm/_torch/auto_deploy/transform/interface.py`
- Default transform list: `tensorrt_llm/_torch/auto_deploy/config/default.yaml`
- Dump helper: `tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py`
- Graph utilities: `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`, `tensorrt_llm/_torch/auto_deploy/utils/_graph.py`
- Custom ops: `tensorrt_llm/_torch/auto_deploy/custom_ops/`
Tests (typical):
- `tests/unittest/auto_deploy/singlegpu/transformations/library/`
- `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` (when behavior or numerics may change)
How to add a transform
Implement the pass
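Create or update a module under `transform/library/` and register the class:

```python
# Registry and base class live in transform/interface.py (one level up from library/).
from ..interface import BaseTransform, TransformRegistry


@TransformRegistry.register("my_transform_key")
class MyTransform(BaseTransform):
    @classmethod
    def get_config_class(cls):
        # MyTransformConfig is a placeholder for a config class defined in this module.
        return MyTransformConfig
```

Use a dedicated config class only when the pass needs parameters beyond the base transform config.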
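To make the register-by-key idea concrete, here is a self-contained stand-in. `ToyRegistry` and `MyToyTransform` are hypothetical and are not the TensorRT-LLM `TransformRegistry`; the real behavior is defined in `transform/interface.py` and may differ, for example in how duplicate keys are handled.

```python
class ToyRegistry:
    """Hypothetical stand-in for a key -> class registry (not the TensorRT-LLM one)."""

    _classes = {}

    @classmethod
    def register(cls, key):
        """Return a class decorator that stores the decorated class under `key`."""
        def decorator(transform_cls):
            if key in cls._classes:
                # Design choice of this toy: refuse silent overwrites.
                raise ValueError(f"duplicate transform key: {key}")
            cls._classes[key] = transform_cls
            return transform_cls
        return decorator

    @classmethod
    def get(cls, key):
        """Look up the class registered under `key`."""
        return cls._classes[key]


@ToyRegistry.register("my_transform_key")
class MyToyTransform:
    pass


print(ToyRegistry.get("my_transform_key").__name__)
```

The string passed to `register` is the lookup handle, which is presumably why the same key has to appear in the config that lists transforms.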
Register in default.yaml
Add a key under `transforms:` in `tensorrt_llm/_torch/auto_deploy/config/default.yaml`. Copy the field set from the closest existing transform in the same section of the file (required keys depend on the transform config class and on how peers are declared). New experimental passes should stay `enabled: false` until covered by tests and dumps.
Enable for a specific model
For targeted rollout, adjust registry YAMLs under `examples/auto_deploy/model_registry/configs/` rather than turning on unproven passes globally.
Implementation rules
- Prefer existing AutoDeploy / TRT-LLM ops and `torch.ops.auto_deploy` entries.
- Prefer stable, backend-neutral graph contracts; avoid hiding real dataflow in `node.meta` when an edge should carry it.
- Use metadata for observable tensor facts (shape, dtype) and preserve it across rewrites when replacements should remain traceable.
- One hypothesis per patch; do not mix unrelated fusions.
Existing kernel first, Triton second
Before Triton:
- Search `transform/library/` and `custom_ops/`.
- Search `torch.ops.auto_deploy.*` and TRT-LLM custom op definitions.
- Read tests for similar integrations.
Use triton-kernel-writing only when no suitable op exists and you accept owning kernel + integration work.
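The lookup steps above amount to a scoped text search whose hits become the file/symbol evidence. A plain `grep` does the same job; the sketch below is a Python equivalent, with `find_symbol` and its default `subdirs` being illustrative names rather than part of any toolkit. It assumes `root` points at the `auto_deploy` package directory of your checkout.

```python
from pathlib import Path


def find_symbol(root, needle, subdirs=("transform/library", "custom_ops")):
    """Yield (relative_path, line_number, line) for lines containing `needle`."""
    root = Path(root)
    for sub in subdirs:
        base = root / sub
        if not base.is_dir():
            continue  # tolerate checkouts that lack one of the directories
        for path in sorted(base.rglob("*.py")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    yield str(path.relative_to(root)), lineno, line.strip()


# Prints nothing if the path does not exist on this machine.
for hit in find_symbol("tensorrt_llm/_torch/auto_deploy", "torch.ops.auto_deploy."):
    print(hit)
```

An empty result is itself evidence: record it as `not_found` along with the search string and directories you covered.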
Validation order
- Graph dumps — pattern present, rewrite visible (see ad-graph-dump).
- Unit tests for the transform.
- Integration or accuracy checks when numerics or end-to-end behavior may change.
Match counts
AutoDeploy logs `[SUMMARY] matches=<n>` (or `skipped` / `disabled`) per transform. Compare before and after your change; a large drop usually indicates pattern or metadata issues, not “slow runs.”
Testing expectations
Follow trtllm-code-contribution for repo conventions. Cover:
- Happy-path micrograph or exported-graph rewrites.
- Failure modes that must not fuse (multiple consumers, mixed consumers).
- Metadata preservation when an upstream pass feeds your pattern.
Primary unittest location for library transforms:
tests/unittest/auto_deploy/singlegpu/transformations/library/
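The three cases to cover (happy path, must-not-fuse, metadata preservation) can be seen on a toy graph. Everything below is a hypothetical stand-in written in plain Python so it runs without PyTorch or TensorRT-LLM; the `Node` class, the `fuse_mul_add` pass, and the op names are assumptions for illustration, not AutoDeploy code. A real test would build an FX graph and run the registered transform instead.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy graph node: an op name, input nodes, and a metadata dict."""
    op: str
    inputs: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def use_count(graph, node):
    """Count how many input slots across `graph` reference `node`."""
    return sum(1 for n in graph for i in n.inputs if i is node)


def fuse_mul_add(graph):
    """Rewrite add(mul(a, b), c) into fused_mul_add(a, b, c); return the match count.

    A mul whose result is used more than once is left alone, because the
    other uses still need the unfused value.
    """
    matches = 0
    for node in list(graph):
        if node.op != "add" or not node.inputs or node.inputs[0].op != "mul":
            continue
        mul = node.inputs[0]
        if use_count(graph, mul) != 1:
            continue  # must not fuse: intermediate has multiple consumers
        # Carry the replaced node's metadata (shape, dtype) onto the fused node.
        fused = Node("fused_mul_add", mul.inputs + node.inputs[1:], dict(node.meta))
        for other in graph:
            other.inputs = [fused if i is node else i for i in other.inputs]
        graph[graph.index(node)] = fused
        graph.remove(mul)
        matches += 1
    return matches


a, b, c = Node("input"), Node("input"), Node("input")
mul = Node("mul", [a, b])
add = Node("add", [mul, c], meta={"shape": (2, 4), "dtype": "float16"})
graph = [a, b, c, mul, add]
print(fuse_mul_add(graph), [n.op for n in graph])
```

The negative case is the same graph with one extra consumer of `mul`; the pass should then report zero matches and leave the node list unchanged.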
Review checklist
- Target structure appears in current dumps.
- Transform registered and listed in `default.yaml` consistently with peer entries.
- Model-registry toggles are intentional.
- Non-zero `matches` where expected, or `skipped` is explained.
- Before/after dump snippets or diffs saved for the review thread.
- Tests cover both success and intentional non-match cases.
- If outputs change, classify match loss vs metadata loss vs acceptable numeric drift.
Guardrails
- Do not bundle unrelated passes in one change.
- If dumps contradict expectations, document what you observed before chasing unrelated hypotheses.
Iteration note (template)
```text
Candidate: <name>
Path: <existing_kernel_path|triton_fallback_path|other>
Rationale:
- ...
Graph validation: <pass|fail - what files / ops>
Summary logs: <matches before / after>
Tests: <what ran>
Open risks:
- ...
```