ad-add-fusion-transformation

Autodeploy: Add Fusion Transformation Pass

Where this skill applies

This file lives in the trtllm-agent-toolkit plugin. Paths such as `tensorrt_llm/...`, `examples/auto_deploy/...`, and `tests/...` are relative to a TensorRT-LLM source checkout on the user's machine, not the plugin tree.

After installing the plugin (see the toolkit `README.md`), skills use the `trtllm-agent-toolkit:` prefix (for example `trtllm-agent-toolkit:ad-add-fusion-transformation`).

Related skills in this plugin

| Skill | Use it for |
| --- | --- |
| ad-graph-dump | Enabling `AD_DUMP_GRAPHS_DIR`, dump file layout, and how to read SSA graph output. |
| trtllm-codebase-exploration | Mapping existing transforms, custom ops, and search patterns before writing a pass. |
| trtllm-code-contribution | TensorRT-LLM pre-commit, tests, DCO sign-off, and PR expectations. |
| triton-kernel-writing | Implementing a Triton op only after existing-kernel lookup fails. |
| triton-tileir-optimization | Tuning existing Triton kernels for the TileIR backend when that path applies. |

Use this skill when you already know which subgraph or pattern you are targeting (from graph dumps, logs, or code reading). For dump capture and file semantics, follow ad-graph-dump first.

When to use this skill

Adding, extending, or reviewing a fusion under AutoDeploy transforms in a TensorRT-LLM tree.

Workflow (concise)

1. Confirm the pattern in current graph dumps (see ad-graph-dump).
2. Search for an existing kernel or custom-op path before writing new Triton or CUDA.
3. Implement the smallest change that proves correctness and matching; add tests.
4. Re-run dumps and tests; if outputs drift, separate matching issues from metadata loss and from numeric differences.

Finding fusion candidates (lightweight)

Do this before writing a new pass so you work on real graph structure.

Inputs

- Graph dump directory from a run with `AD_DUMP_GRAPHS_DIR` set (see ad-graph-dump).
- Model id and active AutoDeploy config (registry YAML, `default.yaml` overlays).
- TensorRT-LLM source tree for kernel and transform lookup.

Outputs

- Ordered list of candidates with: graph evidence, existing-kernel lookup (`found` / `not_found`), recommendation (`use_existing_kernel`, `needs_triton_fallback`, `defer`), and trade-offs (complexity, correctness risk).

Discovery workflow

1. Parse dumps for repeated unfused patterns (element-wise chains, norm chains, epilogues, attention-adjacent ops).
2. Search the tree for equivalent transforms or custom ops; record file/symbol evidence.
3. If nothing fits, mark Triton or other kernel work as a deliberate fallback.
4. Prefer candidates with clear recurrence, existing support, and lower numerical risk.
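
The first discovery step can be roughed out with a short script. The sketch below is only an illustration: it assumes ops show up in the dump text as dotted names such as `torch.ops.aten.mul.Tensor`, which may not match the real dump layout (see ad-graph-dump), and the helper names `op_sequence` and `repeated_chains` are hypothetical. It counts op windows that recur in textual order, which is a hint about recurrence and not proof of a dataflow chain.

```python
from __future__ import annotations

import re
from collections import Counter

# Assumption: ops are printed as dotted names like "torch.ops.aten.mul.Tensor".
# Adjust the pattern to whatever the real dump files contain.
OP_PATTERN = re.compile(r"torch\.ops\.[A-Za-z0-9_.]+")


def op_sequence(dump_text: str) -> list[str]:
    """Return op names in the order they appear in the dump text."""
    return OP_PATTERN.findall(dump_text)


def repeated_chains(
    ops: list[str], length: int = 2, min_count: int = 2
) -> list[tuple[tuple[str, ...], int]]:
    """Count consecutive op windows and keep those seen at least min_count times."""
    windows = Counter(
        tuple(ops[i : i + length]) for i in range(len(ops) - length + 1)
    )
    return [(chain, n) for chain, n in windows.most_common() if n >= min_count]
```

Treat any window this reports as a prompt to open the dump and confirm that the ops are actually connected by edges before recording it as a candidate.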

Per-candidate template

```text
Candidate: <short-name>
Affected graph pattern: <pattern>
Existing kernel lookup: <found|not_found>
Evidence: <path/symbol>
Recommendation: <use_existing_kernel|needs_triton_fallback|defer>
Strengths / weaknesses / risks:
- ...
```
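
If candidates are tracked in a script, the template maps naturally onto a small record type. The following is a minimal sketch, not part of the toolkit: the `Candidate` class, the `ordered` helper, and the sort order (existing kernel first, then Triton fallback, then deferred) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed priority: lower number sorts first.
RECOMMENDATION_ORDER = {"use_existing_kernel": 0, "needs_triton_fallback": 1, "defer": 2}


@dataclass
class Candidate:
    name: str
    pattern: str
    kernel_lookup: str   # "found" or "not_found"
    evidence: str        # path/symbol
    recommendation: str  # one of RECOMMENDATION_ORDER
    notes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kernel_lookup not in ("found", "not_found"):
            raise ValueError(f"unknown lookup result: {self.kernel_lookup}")
        if self.recommendation not in RECOMMENDATION_ORDER:
            raise ValueError(f"unknown recommendation: {self.recommendation}")
        if self.recommendation == "use_existing_kernel" and self.kernel_lookup != "found":
            raise ValueError("use_existing_kernel requires a found kernel")

    def render(self) -> str:
        """Format the record using the per-candidate template."""
        lines = [
            f"Candidate: {self.name}",
            f"Affected graph pattern: {self.pattern}",
            f"Existing kernel lookup: {self.kernel_lookup}",
            f"Evidence: {self.evidence}",
            f"Recommendation: {self.recommendation}",
            "Strengths / weaknesses / risks:",
        ]
        lines += [f"- {note}" for note in self.notes]
        return "\n".join(lines)


def ordered(candidates: list[Candidate]) -> list[Candidate]:
    """Sort candidates by the assumed recommendation priority."""
    return sorted(candidates, key=lambda c: RECOMMENDATION_ORDER[c.recommendation])
```

The validation in `__post_init__` rejects a record that recommends an existing kernel without lookup evidence, which keeps the list consistent with the lookup-first rule.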

Guardrails

- Do not skip existing-kernel lookup.
- Do not default to Triton when a viable existing op already exists.
- If uncertain, `defer` and narrow the question with one more dump or test.

Inputs (implementation)

- Chosen candidate or concrete subgraph.
- Active model and config files.
- Fresh graph dumps when available.
- Current baseline: match counts from logs, unit test status, any accuracy notes you already maintain.

Outputs (implementation)

- Pass design or patch: registered transform, `default.yaml` entry, optional model-registry YAML.
- Path decision: `existing_kernel_path` vs `triton_fallback_path` (or other kernel stack).
- Validation notes: graph evidence, `[SUMMARY] matches=...` before/after from AutoDeploy logs, test results.

Implementation workflow

1. Align the pass with the graph structure observed in dumps, not with op names assumed from docs alone.
2. Search `transform/library/`, `custom_ops/`, `torch.ops.auto_deploy.*`, and related tests for reuse.
3. Integrate an existing op when possible; otherwise delegate kernel work to the appropriate skill (triton-kernel-writing, cuda-kernel-writing, etc.).
4. Keep one logical change per patch; extend tests in the same change.
5. Re-read dumps after the change; if match counts collapse, suspect pattern availability or metadata propagation.

Where fusion passes live

- Transforms: `tensorrt_llm/_torch/auto_deploy/transform/library/`
- Registry / base behavior: `tensorrt_llm/_torch/auto_deploy/transform/interface.py`
- Default transform list: `tensorrt_llm/_torch/auto_deploy/config/default.yaml`
- Dump helper: `tensorrt_llm/_torch/auto_deploy/utils/graph_writer.py`
- Graph utilities: `tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`, `tensorrt_llm/_torch/auto_deploy/utils/_graph.py`
- Custom ops: `tensorrt_llm/_torch/auto_deploy/custom_ops/`
- Tests (typical): `tests/unittest/auto_deploy/singlegpu/transformations/library/`, `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` (when behavior or numerics may change)

How to add a transform

Implement the pass

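Create or update a module under `transform/library/` and register the class:

```python
# Import path assumed from the registry location (transform/interface.py).
from tensorrt_llm._torch.auto_deploy.transform.interface import BaseTransform, TransformRegistry


@TransformRegistry.register("my_transform_key")
class MyTransform(BaseTransform):
    @classmethod
    def get_config_class(cls):
        # MyTransformConfig is a placeholder for your own config class.
        return MyTransformConfig
```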

Use a dedicated config class only when the pass needs parameters beyond the base transform config.

Register in `default.yaml`

Add a key under `transforms:` in `tensorrt_llm/_torch/auto_deploy/config/default.yaml`. Copy the field set from the closest existing transform in the same section of the file (required keys depend on the transform config class and on how peers are declared). New experimental passes should stay `enabled: false` until covered by tests and dumps.

Enable for a specific model

For targeted rollout, adjust registry YAMLs under `examples/auto_deploy/model_registry/configs/` rather than turning on unproven passes globally.

Implementation rules

- Prefer existing AutoDeploy / TRT-LLM ops and `torch.ops.auto_deploy` entries.
- Prefer stable, backend-neutral graph contracts; avoid hiding real dataflow in `node.meta` when an edge should carry it.
- Use metadata for observable tensor facts (shape, dtype) and preserve it across rewrites when replacements should remain traceable.
- One hypothesis per patch: do not mix unrelated fusions.

Existing kernel first, Triton second

Before Triton:

1. Search `transform/library/` and `custom_ops/`.
2. Search `torch.ops.auto_deploy.*` and TRT-LLM custom op definitions.
3. Read tests for similar integrations.

Use triton-kernel-writing only when no suitable op exists and you accept owning kernel + integration work.

Validation order

1. Graph dumps: pattern present, rewrite visible (see ad-graph-dump).
2. Unit tests for the transform.
3. Integration or accuracy checks when numerics or end-to-end behavior may change.

Match counts

AutoDeploy logs `[SUMMARY] matches=<n>` (or `skipped` / `disabled`) per transform. Compare before and after your change; a large drop usually indicates pattern or metadata issues, not “slow runs.”
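
The before/after comparison can be scripted. The sketch below assumes only the `[SUMMARY] matches=<n>` fragment named above; how the transform name appears on the log line is not assumed, so the caller builds the per-transform dictionaries. The function names and the 50% drop threshold are hypothetical choices for illustration.

```python
from __future__ import annotations

import re

SUMMARY_RE = re.compile(r"\[SUMMARY\] matches=(\d+)")


def parse_matches(line: str) -> int | None:
    """Return the match count from a log line, or None if the line has none."""
    found = SUMMARY_RE.search(line)
    return int(found.group(1)) if found else None


def flag_drops(
    before: dict[str, int], after: dict[str, int], max_drop: float = 0.5
) -> dict[str, tuple[int, int]]:
    """Return transforms whose match count fell by more than max_drop (a fraction)."""
    flagged = {}
    for key, old in before.items():
        new = after.get(key, 0)  # a transform missing afterwards counts as zero
        if old > 0 and new < old * (1 - max_drop):
            flagged[key] = (old, new)
    return flagged
```

A flagged entry is a reason to re-read the dumps for that transform, since a lost pattern or dropped metadata is the more likely cause.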

Testing expectations

Follow trtllm-code-contribution for repo conventions. Cover:

- Happy-path micrograph or exported-graph rewrites.
- Failure modes that must not fuse (multiple consumers, mixed consumers).
- Metadata preservation when an upstream pass feeds your pattern.

Primary unittest location for library transforms: `tests/unittest/auto_deploy/singlegpu/transformations/library/`
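
To make the multiple-consumer case concrete, here is a toy graph model in plain Python. It is not the AutoDeploy or `torch.fx` API: `Node`, `users`, and `fuse_mul_add` are hypothetical names, and the rewrite only handles `add(mul(a, b), c)` with the `mul` as the first input. It shows the guard a test should exercise: the pattern fuses when the `mul` has one consumer and is left alone when another node still reads it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove targets the exact node
class Node:
    name: str
    op: str
    inputs: list[Node] = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def users(graph: list[Node], node: Node) -> list[Node]:
    """Return the nodes that consume `node` as an input."""
    return [n for n in graph if any(i is node for i in n.inputs)]


def fuse_mul_add(graph: list[Node]) -> int:
    """Rewrite add(mul(a, b), c) into one fused node; return the match count."""
    matches = 0
    for add in list(graph):
        if add.op != "add" or not add.inputs:
            continue
        mul = add.inputs[0]
        if mul.op != "mul" or len(users(graph, mul)) != 1:
            continue  # guard: another consumer still needs the mul result
        add.op = "fused_mul_add"  # meta on the rewritten node is left untouched
        add.inputs = mul.inputs + add.inputs[1:]
        graph.remove(mul)
        matches += 1
    return matches
```

A real test would build the equivalent pair of graphs with the project's own utilities and assert on the match count and on the surviving metadata.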

Review checklist

- Target structure appears in current dumps.
- Transform registered and listed in `default.yaml` consistently with peer entries.
- Model-registry toggles are intentional.
- Non-zero `matches` where expected, or `skipped` is explained.
- Before/after dump snippets or diffs saved for the review thread.
- Tests cover both success and intentional non-match cases.
- If outputs change, classify match loss vs metadata loss vs acceptable numeric drift.

Guardrails

- Do not bundle unrelated passes in one change.
- If dumps contradict expectations, document what you observed before chasing unrelated hypotheses.

Iteration note (template)

```text
Candidate: <name>
Path: <existing_kernel_path|triton_fallback_path|other>
Rationale:
- ...
Graph validation: <pass|fail: what files / ops>
Summary logs: <matches before / after>
Tests: <what ran>
Open risks:
- ...
```
