ascendc-operator-performance-optim


Ascend C Operator Performance Optimization (Troubleshoot → Modify → Validate Closed Loop)

This skill not only troubleshoots performance issues, but also modifies the code and verifies the optimization effect. The complete process is as follows:
Phase 1: Troubleshoot — Review code + study design documents to identify optimization points
Phase 2: Baseline — Save current performance test results (custom operator vs benchmark)
Phase 3: Optimization — Modify operator code after learning code-gen knowledge
Phase 4: Precision — Precision verification (ensure correct functionality after optimization)
Phase 5: Performance — Performance comparison with the same cases (post-optimization vs benchmark)
Phase 6: Iteration — Continue optimization if no improvement is achieved, up to 3 rounds

Phase 1: Troubleshoot — Identify Optimization Points

1.1 Study Operator Design Documents

MANDATORY — Must understand operator design before troubleshooting:
  1. Read
    ascend-kernel/csrc/ops/<op_name>/design.md
    (if exists) and extract:
    • Operator type (elementwise / row processing / Cube)
    • Tiling strategy (inter-core splitting / intra-core splitting)
    • UB space allocation scheme
    • Computation logic and data flow
  2. Read the full source of
    op_host/<op_name>.cpp
    and
    op_kernel/<op_name>.cpp

1.2 Troubleshoot Phase by Phase

Review operator code in the following order. For each phase, load the corresponding reference file and check against the code item by item.
- [ ] 1. Tiling    — Splitting strategy of data between multi-cores and L2Cache
- [ ] 2. Data Copy — Bandwidth utilization of DataCopy
- [ ] 3. API Usage — Efficient usage of Ascend C API
- [ ] 4. Memory    — Placement strategy of data in storage hierarchy
- [ ] 5. Pipeline  — Overlapped execution of CopyIn / Compute / CopyOut
Each phase has its own reference file; load only the current phase's file during troubleshooting:
  • Phase 1: references/tiling-prof.md
  • Phase 2: references/data-copy-prof.md
  • Phase 3: references/api-usage-prof.md
  • Phase 4: references/memory-prof.md
  • Phase 5: references/pipeline-prof.md

1. Tiling

Detailed example: references/tiling-prof.md
Troubleshooting items:
  • 1.1 Multi-core Splitting: Is
    blockDim
    set to the number of hardware cores?
    • Coupled architecture:
      GetCoreNumAiv()
      or
      GetCoreNumAic()
    • Separated architecture Vector operators: Number of AIV cores (e.g., 40)
    • Separated architecture Cube operators: Number of AIC cores (e.g., 20)
    • Separated architecture MIX operators: Number of physical core groups (e.g., 20 = 40 AIV / 2), cannot exceed the number of physical cores
  • 1.2 L2Cache Splitting: When
    input + output > L2Cache capacity
    , is the data divided into blocks according to L2Cache size, with all cores processing the same block collaboratively before switching to the next block?
  • 1.3 Inter-core Load Balancing: After L2Cache splitting, are tail blocks alternately allocated among passes to avoid certain cores always lagging behind?
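
Checks 1.2 and 1.3 boil down to host-side tiling arithmetic. A minimal sketch in plain C++ (not Ascend C; the function names and the sizes used below are illustrative assumptions, not toolkit values):

```cpp
#include <cstdint>
#include <vector>

// Check 1.2: split the total working set into passes that each fit in L2Cache,
// so all cores cooperate on one block before moving to the next.
inline uint64_t NumL2Passes(uint64_t totalBytes, uint64_t l2Bytes) {
    return (totalBytes + l2Bytes - 1) / l2Bytes;  // ceiling division
}

// Check 1.3: rotate which cores receive the tail elements on each pass,
// so the same cores do not always lag behind.
std::vector<uint64_t> ElemsPerCore(uint64_t elems, uint32_t blockDim, uint32_t pass) {
    std::vector<uint64_t> out(blockDim, elems / blockDim);
    uint64_t tail = elems % blockDim;
    for (uint64_t i = 0; i < tail; ++i) {
        out[(pass + i) % blockDim] += 1;  // the tail rotates with the pass index
    }
    return out;
}
```

For example, 10 elements on 4 cores give cores {0,1} the extra element in pass 0 and cores {1,2} the extra element in pass 1, instead of always overloading cores {0,1}.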

2. Data Copy

Detailed example: references/data-copy-prof.md
Troubleshooting items:
  • 2.1 Single Copy Size >= 16 KB: Does each
    DataCopy
    transfer at least 16 KB? Bandwidth utilization drops significantly below this value.
  • 2.2 GM Address 512B Alignment: Is the GM start address aligned to 512 bytes? (On Atlas A2 series, 32B alignment can reduce bandwidth by up to 30% compared to 512B alignment.)
  • 2.3 Stride Parameter Instead of for Loop: Is
    DataCopyParams
    (blockCount/blockLen/srcStride/dstStride) issued as a single strided transfer, rather than copying row by row in a for loop?

3. API Usage

Detailed example: references/api-usage-prof.md
Troubleshooting items:
  • 3.1 TPipe Created Outside Kernel Class: Is
    TPipe
    created in the kernel entry function and passed into the class as a pointer? (TPipe inside the class prevents Scalar constant folding, increasing scalar_time by about 17%.)
  • 3.2 TQueBind Used for Pure Copy Operators: Do operators without Vector computation use
    TQueBind<VECIN, VECOUT>
    instead of separate
    TQue<VECIN>
    +
    TQue<VECOUT>
    ? (Eliminates redundant DataCopy between LocalTensors, reducing
    aiv_vec_time
    to approximately 0.)
  • 3.3 Counter Mode (SetMaskCount): Do Vector instructions use Counter mode instead of manually calculating main block/tail block mask in Normal mode?
  • 3.4 Matmul AtomicAdd: When Matmul result C needs to be added to GM matrix D, is
    enAtomic=1
    set in
    IterateAll
    /
    GetTensorC
    to fuse accumulation? (Reduces cycles by about 12%.)
  • 3.5 Reduction Instruction Combination: When reducing a continuous buffer to a scalar, is the combination of
    BlockReduceSum
    +
    WholeReduceSum
    used instead of multiple identical reduction instructions?
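
The two-stage reduction of check 3.5 can be sketched in plain C++: a block-wise partial reduction (the BlockReduceSum role) followed by one final reduction over the partials (the WholeReduceSum role), instead of repeating the same full-width reduction. The block size of 8 in the test is an illustrative stand-in for the hardware's per-block lane count:

```cpp
#include <vector>
#include <cstddef>

// Stage 1 (BlockReduceSum analogue): reduce each fixed-size block to one partial sum.
std::vector<float> BlockSums(const std::vector<float>& x, size_t block) {
    std::vector<float> partial;
    for (size_t i = 0; i < x.size(); i += block) {
        float s = 0.0f;
        for (size_t j = i; j < i + block && j < x.size(); ++j) s += x[j];
        partial.push_back(s);
    }
    return partial;
}

// Stage 2 (WholeReduceSum analogue): collapse the partial sums to a scalar.
float WholeSum(const std::vector<float>& partial) {
    float s = 0.0f;
    for (float v : partial) s += v;
    return s;
}
```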

4. Memory

Detailed example: references/memory-prof.md
Troubleshooting items:
  • 4.1 UB Buffer Fusion: Are intermediate results of consecutive Vector operations (e.g., Exp → Abs) kept in UB rather than round-tripping through GM?
  • 4.2 L0C Accumulation for Matrix Multiplication: In scenarios like
    A1*B1 + A2*B2 + ...
    , are Mmad results accumulated in-place in CO1 (L0C) instead of writing to GM sequentially and then summing in UB?
  • 4.3 Small Matrix Resides in L1: When L1 cannot accommodate both left and right matrices simultaneously, is the smaller matrix loaded once and kept in L1, with only the larger matrix transferred cyclically?
  • 4.4 BT Buffer for Bias (Separated Architecture): Is bias stored in BT Buffer (C2) and fused in one step via
    Mmad
    instead of performing Add separately in UB?
  • 4.5 FP Buffer for Quantization Parameters (Separated Architecture): Are quantization parameters stored in FP Buffer (C2PIPE2GM) and quantized along the way via
    Fixpipe
    instead of being calculated separately in UB?
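
The accumulate-in-place idea of check 4.2 can be sketched with plain C++ matrices: each partial product A_i*B_i is added into one accumulator (playing the CO1/L0C role) instead of being written out and summed afterwards. The types and names are illustrative, not Ascend C APIs:

```cpp
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<float>>;

// Accumulate A*B into C in place, playing the role of Mmad accumulating in L0C.
void MatmulAcc(const Mat& A, const Mat& B, Mat& C) {
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t k = 0; k < B.size(); ++k)
            for (size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][k] * B[k][j];  // += is the in-place accumulation
}
```

With C zero-initialized, calling `MatmulAcc` once per term computes A1*B1 + A2*B2 + ... without the per-term GM write-out and UB summation.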

5. Pipeline

Detailed example: references/pipeline-prof.md
Troubleshooting items:
  • 5.1 CopyIn/Compute/CopyOut Paradigm: Is the operator organized as a three-stage pipeline, with
    TQue
    used for inter-stage synchronization?
  • 5.2 Double Buffer: Is the number of buffers in
    InitBuffer
    set to 2 to enable overlapped execution of CopyIn/CopyOut and Compute? (Prerequisite: Number of loops >= 2, and transfer time is not negligible relative to computation time.)
  • 5.3 Asynchronous Iterate (MIX Mode): In Matmul MIX scenarios, is
    Iterate<false>()
    /
    IterateAll<false>()
    used to avoid AIC/AIV synchronization overhead per iteration?
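
The double buffering of check 5.2 comes down to ping-pong buffer indexing: iteration i works in buffer i % 2, so CopyIn of the next tile can overlap Compute of the current one. A minimal index sketch in plain C++ (function names are illustrative, not Ascend C APIs):

```cpp
#include <cstdint>

// With 2 buffers (double buffer), iteration i works in buffer i % 2: while
// Compute runs on buffer 0, CopyIn can already fill buffer 1, and vice versa.
inline uint32_t PingPongBuf(uint32_t iter) { return iter % 2; }

// Double buffering only pays off with >= 2 loop iterations; with a single
// iteration nothing can overlap and the second buffer just wastes UB.
inline bool DoubleBufferWorthwhile(uint32_t loopCount) { return loopCount >= 2; }
```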

1.3 Output Troubleshooting Report

After troubleshooting all phases, output a summary in the following format:

Optimization Troubleshooting Report

Identified Issues (Sorted by Expected Benefit)

  1. [Phase X.Y] <Issue description> — <Expected benefit>
  2. [Phase X.Y] <Issue description> — <Expected benefit> ...

Confirmed No Issues

  • [Phase X.Y] <Check Item Description> ...

Optimization Plan

Sort by expected benefit from highest to lowest and pick the target items for this optimization round.

---

Phase 2: Baseline — Save Current Performance Test Results

The performance baseline MUST be saved before optimizing, so the results can be compared precisely afterwards.

2.1 Check Existing Performance Verification Results

Check if the following files exist under
csrc/ops/<op_name>/test/
:
  • <op_name>_perf_cases.jsonl
    — Performance test cases
  • <op_name>_torch_npu_profiler_report.md
    — Performance comparison report

2.2 Execute Performance Evaluation if No Results Exist

If the above files do not exist or the results are outdated (e.g., code has been updated but the report has not been regenerated), MUST call the
ascendc-operator-performance-eval
skill to complete the full performance evaluation:
  1. Read the SKILL.md of
    ascendc-operator-performance-eval
  2. Generate performance cases (JSONL), run profiler, and generate comparison report according to its process
  3. Ensure the report contains complete comparison data between the custom operator and the benchmark

2.3 Save Baseline Snapshot

Back up the current performance report as a baseline file named
<op_name>_baseline_report.md
, stored in the same
test/
directory. This file will be used for comparing optimization effects later.
csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl                 ← Performance cases (shared before and after optimization)
├── <op_name>_torch_npu_profiler_report.md     ← Current report (will be overwritten)
└── <op_name>_baseline_report.md               ← Baseline snapshot (performance data before optimization)

Phase 3: Optimization — Modify Code After Learning Knowledge

3.1 Learn Operator Development Knowledge (MANDATORY)

MUST load the reference files of the
ascendc-operator-code-gen
skill before modifying code
to ensure accurate understanding of AscendC API, data transfer, synchronization control, etc.
Load the following references as needed (located in
ascendc-operator-code-gen/references/
):
| Reference file | Purpose |
| --- | --- |
| GUIDE.md | Overview: template selection, code generation process |
| data-copy-api.md | DataCopy/DataCopyPad API in detail |
| vector-compute-api.md | Vector compute API in detail |
| sync-control-api.md | TQue/Pipe synchronization control |
| resource-management-api.md | TPipe/TBuf resource management |
| basic-data-structures-api.md | Basic structures such as LocalTensor/GlobalTensor |
| kernel-constraints.md | Kernel programming constraints and common pitfalls |
Selectively load relevant references based on the optimization points identified in Phase 1. For example:
  • Optimize data copy → Load
    data-copy-api.md
  • Optimize pipeline → Load
    sync-control-api.md
    +
    resource-management-api.md
  • Optimize computation → Load
    vector-compute-api.md

3.2 Formulate Modification Plan

For each optimization point in the Phase 1 troubleshooting report, formulate a specific code modification plan:
Optimization Point [X.Y]: <Issue Description>
├── Modified Files: op_host / op_kernel / both
├── Modification Content: <Specific code change description>
├── Expected Effect: <Quantified expectation (e.g., 30% reduction in transfer time)>
└── Risk Assessment: <Whether precision may be affected / whether tiling needs to be modified>

3.3 Execute Code Modification

Modify the code item by item according to the modification plan. Follow these rules while modifying:
MUST comply with the code-gen anti-pattern list:
  • NEVER let FP16/BF16 directly participate in complex mathematical calculations, must cast to FP32 first
  • NEVER pass r-values in EXEC_KERNEL_CMD
  • NEVER use DataCopy for GM↔UB transfer, must use DataCopyPad
  • NEVER directly reuse source tensor after ReduceSum/ReduceMax
  • NEVER use standard library functions such as
    std::min/max/abs/sqrt/exp
    in kernel
  • NEVER pass repeatTime > 255 to high-dimensional splitting API
  • NEVER modify files under
    cmake/
    or
    csrc/utils/
  • NEVER hardcode core count or UB size

3.4 Compile and Install

Must recompile and install after modification:

```bash
source ${ASCEND_HOME_PATH}/set_env.sh
cd task/ascend-kernel
bash build.sh
pip install output/ascend_kernel*.whl --force-reinstall --no-deps
```

Enter the troubleshooting loop if compilation fails (up to 3 attempts).

Phase 4: Precision Verification — Ensure Correct Functionality After Optimization

MANDATORY — After optimization, precision verification must pass before any performance comparison.

4.1 Call Precision Evaluation Skill

Read and execute the complete process of
ascendc-operator-precision-eval
SKILL.md:
  1. Generate precision test cases (≥30 cases, covering all dtypes)
  2. Run pytest precision tests
  3. Generate precision report (Markdown + JSON)
  4. Display overview, failure summary and key findings in the current conversation

4.2 Precision Determination

| Result | Handling |
| --- | --- |
| All passed | Proceed to Phase 5 performance verification |
| Partially failed | Analyze the failure causes, roll back or fix the code, and re-enter Phase 3 |
| Widespread failures | Roll back all modifications from this round and re-analyze the optimization plan |

Phase 5: Performance Verification — Confirm Optimization Effects

5.1 Run Performance Tests with the Same Cases

Use the same performance cases from Phase 2 (
<op_name>_perf_cases.jsonl
), call the
ascendc-operator-performance-eval
skill to re-execute performance evaluation.
Key requirements:
  • MUST use the exact same perf_cases.jsonl as the baseline (cannot add or remove cases)
  • MUST generate a new
    <op_name>_torch_npu_profiler_report.md
  • MUST display comparison table, summary and conclusion in the current conversation

5.2 Comparative Analysis

Compare the post-optimization performance data with the baseline saved in Phase 2:

Optimization Effect Comparison

| Case | Shape | dtype | Baseline per-step (us) | Post-optimization per-step (us) | Improvement | Benchmark per-step (us) | vs Benchmark |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... | ... | ... |

Summary

  • Average Improvement: X%
  • Maximum Improvement: X% (Case Y)
  • vs Benchmark Average Ratio: Before optimization A → After optimization B

5.3 Performance Determination

| Result | Handling |
| --- | --- |
| Performance improved (most cases faster after optimization) | Optimization successful; output the final report |
| No improvement, or regression | Enter Phase 6 iterative optimization |

Phase 6: Iterative Optimization (Up to 3 Rounds)

If Phase 5 determines no performance improvement, enter iteration:
Current Round: N (N ∈ {1, 2, 3})

├── N < 3: Return to Phase 1, select next priority optimization point or adjust plan
│   ├── Re-troubleshoot, analyze why previous round's modification did not take effect
│   ├── Select new optimization points or adjust previous round's plan
│   └── Repeat Phase 3 → Phase 4 → Phase 5
└── N = 3: Stop iteration, output final report (including records of all rounds)

Iteration Records

Must record each iteration:

Round N Optimization

  • Optimization Target: [Phase X.Y] <Description>
  • Modification Content: <Code change summary>
  • Precision Result: Passed / Failed
  • Performance Result: Improved X% / Not improved / Regressed Y%
  • Decision: Keep current round's modification / Roll back / Proceed to next round

---

Final Output

After completing all rounds (successful improvement or reaching 3-round limit), output the final summary report.

Display in Current Conversation (MANDATORY)

MUST display the following content in the conversation, NEVER only output file paths:
  1. Optimization Troubleshooting Summary: All identified issues and their handling status
  2. Performance Comparison Summary Table: Three-way comparison of baseline → post-optimization → benchmark
  3. Iteration History Summary: Optimization target, result and decision of each round
  4. ≥3 Key Conclusions: Main bottlenecks, optimization benefit distribution, remaining optimization space, etc.
  5. File Paths at the End: Paths of reports and code files

File Artifacts

csrc/ops/<op_name>/test/
├── <op_name>_perf_cases.jsonl                 ← Performance cases
├── <op_name>_baseline_report.md               ← Pre-optimization baseline
├── <op_name>_torch_npu_profiler_report.md     ← Final post-optimization performance report
├── <op_name>_precision_report.md              ← Precision verification report
└── <op_name>_optim_summary.md                 ← Optimization iteration summary report (newly added)

csrc/ops/<op_name>/
├── op_host/<op_name>.cpp                      ← Optimized host code
└── op_kernel/<op_name>.cpp                    ← Optimized kernel code

Optimization Iteration Summary Report Structure

<op_name>_optim_summary.md
must include:

<op_name> Performance Optimization Report

Troubleshooting Findings

(Content of Phase 1 troubleshooting report)

Pre-optimization Baseline

(Summary of Phase 2 performance data)

Iteration History

Round 1

  • Optimization Target: ...
  • Code Modifications: ...
  • Precision Result: ...
  • Performance Result: ...

Round N

...

Final Performance Comparison

(Three-way comparison table of pre-optimization vs post-optimization vs benchmark)

Conclusions

(≥3 key findings)

---

Checklist (Assistant Self-check)

Phase 1: Troubleshoot

  • Have read operator design document (design.md)
  • Have read complete source codes of op_host + op_kernel
  • Have loaded reference phase by phase and checked item by item
  • Have output troubleshooting report, with optimization points sorted by expected benefit

Phase 2: Baseline

  • Have confirmed or generated performance test cases (JSONL)
  • Have confirmed or run performance evaluation (custom vs benchmark)
  • Have saved baseline snapshot (
    _baseline_report.md
    )

Phase 3: Optimization

  • Have loaded code-gen reference (must read before modification)
  • Code modifications comply with anti-pattern list
  • Compilation and installation successful

Phase 4: Precision

  • Have completed precision verification according to
    ascendc-operator-precision-eval
    process
  • Precision verification passed (all or most cases PASS)

Phase 5: Performance

  • Used the same perf_cases.jsonl as baseline
  • Have displayed performance comparison data in conversation
  • Have determined whether performance improved

Phase 6: Iteration

  • Iteration does not exceed 3 rounds
  • Each round has records (target, modification, precision, performance, decision)
  • Have output final summary report (
    _optim_summary.md
    )

Output

  • Have displayed troubleshooting summary, performance comparison, iteration history, ≥3 conclusions in current conversation
  • NEVER only output file paths