ascendc-operator-performance-eval

Performance Evaluation of AscendC Operators with torch_npu.profiler

Reference Files in This Skill Directory

When executing this skill, prioritize materials in this directory:

File / Directory	Purpose
`SKILL.md` (this file)	Process, directory conventions, complete JSONL test case specifications, report structure, fixed schedule
`references/REFERENCE_JSON_CASE_FORMAT.md`	Identical to the "JSONL Specification for Performance Test Cases" section below
`references/REFERENCE_PROFILER_AND_METRICS.md`	`torch_npu.profiler` , `op_statistic.csv` , `*_ascend_pt` path
`examples/sample_perf_cases.jsonl`	Minimal LayerNorm-style JSONL, can be copied and renamed
`examples/layer_norm_profiler_reference/`	Layer Norm reference implementation ( `layer_norm_profiler_common.py` , `benchmark_layer_norm_torch_npu_profiler.py` , test case JSONL, instructions); new operators can copy this directory to `csrc/ops/<op>/test/` and replace the forward logic and filenames

Role

In ascend-kernel, establish a reusable workflow for `csrc/ops/<operator-name>/` that covers profiler performance test cases and a Markdown report comparing the custom operator against a baseline. Data collection must use `torch_npu.profiler`, with `warmup` and `active` both fixed at 5 (see "Fixed Profiler Steps" below). For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.

Core Principles (Two Mandatory Constraints):

- The comparison report must always be presented: whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on the NPU.
- Test case generation must first read `design.md`: before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.

Test Case Source: Load from testcase-gen Test Case Document (MANDATORY)

Before generating or modifying any JSONL test cases, you MUST first read the test case document generated by testcase-gen:

Step 0: Read testcase-gen Test Case Document

READ csrc/ops/<op>/test/<op>-test-cases.md

Extract the following from it:

Extracted Item	Location in Test Case Document	Purpose
SUPPORTED_DTYPES	§Test Configuration	Coverage of dtypes for JSONL test cases
TEST_SHAPES	§Test Configuration	Benchmark for selecting small/medium/large-scale shapes
GENERAL_SHAPES	§Test Configuration	Generalized shapes, can be supplemented for performance scenarios
NPU Calling Method	§Operator Baseline	Forward call for custom operator
CPU Reference Implementation	§Operator Baseline	Reference implementation for baseline path

Conversion Rules from testcase-gen Output to JSONL Test Cases

- Select representative shapes from TEST_SHAPES + GENERAL_SHAPES (covering small, medium, and large scales) and avoid duplicates.
- For each shape, iterate through every dtype in SUPPORTED_DTYPES.
- Fill the `inputs` field of each JSONL case with attribute values from design.md (such as block_size and eps).
- The total number of JSONL test cases must be ≥ 8.
- Use the NPU calling method and the CPU reference implementation from the operator baseline section to build the custom operator path and the baseline path.

If `<op>-test-cases.md` does not exist, fall back to designing test cases entirely from design.md (following the process below), and note in the report: "Test cases are self-designed, not generated by testcase-gen".
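The conversion rules above can be sketched as a small generator. This is a minimal sketch under stated assumptions, not the skill's actual script: `make_cases`, the shape list, and the attribute dictionary are hypothetical, and the input layout simply follows the Layer Norm example given later in this document.

```python
import itertools
import json


def make_cases(shapes, dtypes, attrs):
    """Yield one JSONL-ready case dict per unique (shape, dtype) pair."""
    seen = set()
    for shape, dtype in itertools.product(shapes, dtypes):
        key = (tuple(shape), dtype)
        if key in seen:  # rule: avoid duplicate cases
            continue
        seen.add(key)
        inputs = [{"name": "x", "type": "tensor", "required": True,
                   "dtype": dtype, "shape": list(shape)}]
        for name, (attr_dtype, value) in attrs.items():
            inputs.append({"name": name, "type": "attr", "required": False,
                           "dtype": attr_dtype, "value": value})
        yield {"inputs": inputs}


# Hypothetical values; in practice they come from <op>-test-cases.md and design.md.
shapes = [[2, 128], [64, 4096], [2, 1024, 4096], [1, 2048, 6400]]
dtypes = ["float16", "float32"]
attrs = {"eps": ("float", 1e-5)}

cases = list(make_cases(shapes, dtypes, attrs))
if len(cases) < 8:
    raise ValueError("the skill requires at least 8 JSONL cases")

# One JSON object per line, newline-terminated.
jsonl_text = "\n".join(json.dumps(case) for case in cases) + "\n"
```

Four shapes crossed with two dtypes give exactly eight cases here, which meets the minimum; real operators usually need extra cases for each execution mode.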

Test Case Generation: Must First Read design.md (Mandatory)

Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps:

Step 1: Read Design Document

READ csrc/ops/<op>/design.md

Extract the following from it:

Extracted Item	Location in design.md	Purpose
Supported Data Types	§1 "Supported Data Types"	Coverage of dtypes for test cases
Parameter Constraints and Value Ranges	Constraint column in §1 "Parameter Description"	Valid range for attribute values (e.g., block_size ≤ 128)
Typical Shapes / Input Scales	§2 "Computation Logic" / §3 "Tiling Strategy"	Benchmark for small/medium/large-scale test case shapes
Mode Combinations of Key Attributes	§2 "Pseudocode" / §1 "Parameter Description"	Execution paths that need to be covered individually (e.g., do_transpose=True/False, is_input_split=True/False)
Performance Key Points	§6 "Performance Optimization" / §3 "Tiling Strategy"	Branches that affect performance (e.g., transpose vs non-transpose using different DMA paths)

Step 2: Test Case Design Rules

Rule	Description
Cover all execution modes	When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), there must be at least one test case for each mode
Cover all supported dtypes	At least one set of test cases for each supported data type, using a typical medium-scale shape
One set each for small/medium/large-scale shapes	Must cover small scale (kernel launch overhead dominant), medium scale (typical production scenario), and large scale (memory bandwidth dominant)
Parameter values from constraint ranges	Attribute values (such as block_size) must be selected from the constraints in design.md, cannot be set arbitrarily
Integer/index tensor values must be semantically valid	Specific values of tensors like win_lengths, offsets must comply with operator semantics (e.g., offsets must be valid window start offsets)

Step 3: Validate Test Cases

After generating test cases, check:

All dtypes are covered
All execution modes (defined by design.md) have corresponding test cases
Parameter values (including attribute values and integer tensor values) are within the constraints of design.md
Includes at least one "small shape" test case and one "large shape" test case
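The checks above lend themselves to a small automated pass. The following is a minimal sketch, assuming the case layout from the Layer Norm example; `check_coverage` and its element-count thresholds for "small" and "large" are hypothetical, and constraint-range checks are left out because they depend on each operator's design.md.

```python
import math


def check_coverage(cases, supported_dtypes, mode_attr, modes,
                   small_max=1 << 12, large_min=1 << 20):
    """Return a list of problems; an empty list means the checks pass."""
    dtypes_seen, modes_seen, sizes = set(), set(), []
    for case in cases:
        tensors = [i for i in case["inputs"] if i["type"] == "tensor"]
        dtypes_seen.update(t["dtype"] for t in tensors)
        # Size of a case = element count of its largest tensor.
        sizes.append(max((math.prod(t["shape"]) for t in tensors), default=0))
        for item in case["inputs"]:
            if item["name"] == mode_attr:
                modes_seen.add(item["value"])

    problems = []
    missing_dtypes = set(supported_dtypes) - dtypes_seen
    if missing_dtypes:
        problems.append(f"uncovered dtypes: {sorted(missing_dtypes)}")
    missing_modes = set(modes) - modes_seen
    if missing_modes:
        problems.append(f"uncovered modes for {mode_attr}: {sorted(missing_modes)}")
    if not any(s <= small_max for s in sizes):
        problems.append("no small-shape case")
    if not any(s >= large_min for s in sizes):
        problems.append("no large-shape case")
    return problems


demo_cases = [
    {"inputs": [{"name": "x", "type": "tensor", "dtype": "float16", "shape": [2, 128]},
                {"name": "use_affine", "type": "attr", "value": True}]},
    {"inputs": [{"name": "x", "type": "tensor", "dtype": "float16", "shape": [2, 1024, 4096]},
                {"name": "use_affine", "type": "attr", "value": False}]},
]
problems = check_coverage(demo_cases, ["float16", "float32"], "use_affine", [True, False])
```

In this demo only float32 is missing, so `problems` contains a single entry naming that dtype.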

Reference Path Decision Tree (Mandatory)

Performance evaluation always requires dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:

Does the operator have an equivalent baseline API?
  ├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
  │    └─ Use the baseline API as the baseline path
  └─ No (no equivalent baseline interface)
       └─ Must implement a "small operator combination" baseline path ← Mandatory requirement of this skill
            └─ Implement using PyTorch basic operator combinations from the "Reference Implementation" or "Pseudocode" section of the design document

Requirements for Small Operator Combination Baseline Path

When there is no equivalent baseline interface, you must:

- Read the reference implementation from design.md: the design document usually contains a PyTorch reference implementation (pseudocode or a Python function); use it as the basis for the baseline path.
- Use combinations of basic PyTorch operators: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, and so on. Do not use loops with Python scalar assignment (otherwise the profiler collects CPU operators rather than NPU operators, and a fair comparison is impossible). The entire baseline implementation must consist mainly of tensor operations and be executable on the NPU.
- Label it clearly in the report: the report header must state "No equivalent baseline interface, baseline path is a combination of small operators" and list the basic operators used.
- Always present the comparison table: do not degrade to a single-path report because there is no baseline interface. The three columns "Custom Operator per-step", "Baseline per-step", and "Ratio" must be retained.

NEVER: Output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".

Fixed Profiler Steps (Mandatory)

Parameter	Value	Description
`warmup`	5	Do not modify to other values via script or CLI
`active`	5	Do not modify to other values via script or CLI
`wait`	Default `0`	Can retain in CLI or as a constant, adjust as needed
`repeat`	Default `1`	Fixed to 1 for simple scenarios; if `repeat>1` , must explain CSV selection semantics in the document

`prof.step()` must be called at the end of each step. Total number of loop steps = `repeat * (wait + warmup + active)`.
File Placement (Unified in the Operator `test/` Subdirectory)

All of the following artifacts are placed in `ascend-kernel/csrc/ops/<op>/test/`:

Category	Naming Convention ( `<op>` e.g., `layer_norm` )
Test Cases JSONL only	`<op>_perf_cases.jsonl` (one JSON object per line); do not maintain or generate `<op>_perf_cases.json`
Markdown Report	`<op>_torch_npu_profiler_report.md` (only structured result to be saved; do not generate `<op>_torch_npu_profiler_results.json` )
Profiler Export Root Directory	`test/profiler_trace/` (or override with `--trace-root` )

Performance scripts and common modules are placed in the same `test/` directory as the files above.

Complete JSONL Specification for Performance Test Cases

The following is the complete field and type specification for test case files. Use only `.jsonl` as the test case format.

1. File Format

Format	Description
JSONL	Each line contains one JSON object, with a newline at the end; empty lines are ignored. File extension `.jsonl` .

Do not generate `.json` array files in this process, and do not maintain them in sync with the test cases.

2. Top-level Structure of a Single Test Case

Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (for other operators, replace the `build_inputs` convention, but the structure must still include the `inputs` array):

```json
{
  "inputs": [
    { "name": "x", "type": "tensor", "required": true, "dtype": "float16", "shape": [8, 128] },
    { "name": "normalized_shape", "type": "attr", "required": true, "dtype": "int", "value": [128] },
    { "name": "use_affine", "type": "attr", "required": false, "dtype": "bool", "value": true },
    { "name": "eps", "type": "attr", "required": false, "dtype": "float", "value": 1e-05 }
  ]
}
```

The `name` of each element in `inputs` is unique within a test case.

For other operators, the rules for tensors, `tensor_list`, `attr`, integer-tensor `range`, and so on are in `references/REFERENCE_JSON_CASE_FORMAT.md`.

3. Complete JSONL Example (Two Lines, Layer Norm)

```json
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[128]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":true},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[256]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":false},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
```

For more complete field descriptions (such as `tensor_list` and `int` tensor `range`), see `references/REFERENCE_JSON_CASE_FORMAT.md`.
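The file-format rules in this specification (one JSON object per line, empty lines ignored, an `inputs` array, unique names per case) can be enforced by a short loader. This is a minimal sketch using only the standard library; `load_cases` is a hypothetical helper, not part of the reference implementation.

```python
import json


def load_cases(text):
    """Parse JSONL text into a list of case dicts, validating the top-level structure."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():  # empty lines are ignored
            continue
        case = json.loads(line)
        inputs = case.get("inputs") if isinstance(case, dict) else None
        if not isinstance(inputs, list):
            raise ValueError(f"line {lineno}: each case needs an 'inputs' array")
        names = [item["name"] for item in inputs]
        if len(names) != len(set(names)):
            raise ValueError(f"line {lineno}: input names must be unique within a case")
        cases.append(case)
    return cases


sample = (
    '{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]}]}\n'
    '\n'
    '{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]}]}\n'
)
loaded = load_cases(sample)
```

Raising on the first malformed line keeps a bad case file from silently producing a shorter benchmark run.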

Profiler and Directory Semantics (Summary)

Each `with torch_npu.profiler.profile(...)` block generates an export directory with the suffix `_ascend_pt` under the handler directory. The CSV path is `…/*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv`.

Each test case and each implementation (for example, `custom` / `baseline`) uses its own independent `with` block. The recommended subpath is `{trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/`; clear `case_XXX` before running.

For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.
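The directory convention above can be sketched with `pathlib`. This is a minimal sketch under stated assumptions: `prepare_case_dir` and `find_op_statistic` are hypothetical helpers, and the three-digit zero padding for `case_XXX` is an assumption, since the text does not fix the exact format.

```python
import shutil
from pathlib import Path


def prepare_case_dir(trace_root, op_trace_tag, impl, case_index):
    """Return an empty {trace_root}/{op_trace_tag}/{impl}/case_XXX/ directory."""
    if impl not in ("custom", "baseline"):
        raise ValueError(f"unknown implementation: {impl!r}")
    case_dir = Path(trace_root) / op_trace_tag / impl / f"case_{case_index:03d}"
    shutil.rmtree(case_dir, ignore_errors=True)  # clear stale exports before running
    case_dir.mkdir(parents=True)
    return case_dir


def find_op_statistic(case_dir):
    """Locate the single op_statistic.csv exported under a case directory."""
    pattern = "*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv"
    hits = sorted(Path(case_dir).glob(pattern))
    if len(hits) != 1:
        raise FileNotFoundError(
            f"expected exactly one op_statistic.csv under {case_dir}, found {len(hits)}")
    return hits[0]
```

Clearing the case directory first is what makes the "exactly one CSV" check meaningful with `repeat=1`; with `repeat>1` several exports may appear and the selection rule has to be documented separately.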

Metrics (Summary)

For the CSV produced by a single `with` block, sum `Total Time(us)` over all operator rows.

Divide the sum by `active×repeat` (when `divisor_mode=active_steps`) or by `active` alone (`active_only` mode) to obtain the per-step time. This skill fixes `active=5`, so with `repeat=1` the divisor is 5.
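The metric can be sketched as a short CSV reduction. This is a minimal sketch, not the reference script: `per_step_us` is a hypothetical helper, and the sample rows are made-up values that only illustrate the arithmetic. It also strips a leading BOM and surrounding whitespace from column names, which the "Common Mistakes" section calls out.

```python
import csv
import io


def per_step_us(csv_text, active=5, repeat=1, divisor_mode="active_steps"):
    """Sum Total Time(us) over all operator rows and divide by the step divisor."""
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")))  # drop a UTF-8 BOM
    total = 0.0
    for row in reader:
        # Normalize header names; skip overflow fields that DictReader keys as None.
        clean = {key.strip(): value for key, value in row.items() if key is not None}
        total += float(clean["Total Time(us)"])
    divisor = active * repeat if divisor_mode == "active_steps" else active
    return total / divisor


# Made-up sample data for illustration only.
sample_csv = (
    "\ufeffOP Type,Core Type,Count,Total Time(us)\n"
    "LayerNorm,AI_VECTOR_CORE,5,48.75\n"
    "Cast,AI_VECTOR_CORE,5,1.25\n"
)
step_us = per_step_us(sample_csv)
```

Here the rows sum to 50.0 us over 5 active steps, so `step_us` is 10.0. When reading from disk, opening the file with `encoding="utf-8-sig"` removes the BOM as well.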

Mandatory Structure of Performance Comparison Report (Markdown)

The report format strictly follows `examples/sample_report.md`, with the following structure:

1. Title

```markdown
# Performance Evaluation Results
```

2. Comparison Table (Unified Single Table, Mandatory Dual Path)

All test cases are displayed in a single table with the fixed header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`.

Example:

```markdown
## Performance Comparison

| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |
| --- | --- | --- | --- | --- | --- |
| 0 | [128, 4096] | float16 | 9.75 | 10.10 | 1.036 |
| 1 | [128, 5120] | float16 | 10.52 | 9.39 | 0.893 |
| 2 | [128, 6144] | float16 | 10.99 | 14.36 | 1.307 |
| 3 | [64, 6400] | float16 | 9.13 | 9.49 | 1.040 |
| 4 | [2, 1024, 4096] | float16 | 57.01 | 84.92 | 1.490 |
| 5 | [2, 1024, 6144] | float16 | 73.80 | 139.56 | 1.891 |
| 6 | [1, 2048, 6400] | float16 | 75.60 | 143.09 | 1.893 |
| 7 | [64, 4096] | float32 | 8.45 | 7.14 | 0.846 |
```
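The comparison table can be produced by a small renderer. This is a sketch under one stated assumption: the speedup ratio is taken as baseline time divided by custom-operator time, which is inferred from the sample rows (10.10 / 9.75 ≈ 1.036) and from the rule that a ratio above 1 means the custom operator is faster. `speedup` and `render_table` are hypothetical names.

```python
def speedup(custom_us, baseline_us):
    """Ratio > 1 means the custom operator is faster than the baseline."""
    return baseline_us / custom_us


def render_table(rows):
    """Render (shape, dtype, custom_us, baseline_us) tuples as a Markdown table."""
    lines = [
        "| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for index, (shape, dtype, custom_us, baseline_us) in enumerate(rows):
        ratio = speedup(custom_us, baseline_us)
        lines.append(
            f"| {index} | {shape} | {dtype} | {custom_us:.2f} | {baseline_us:.2f} | {ratio:.3f} |")
    return "\n".join(lines)


table_md = render_table([
    ([128, 4096], "float16", 9.75, 10.10),
    ([128, 5120], "float16", 10.52, 9.39),
])
```

Computing the ratio from the unrounded per-step times, and rounding only when formatting, avoids small last-digit discrepancies in the report.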

3. Full Summary

Use the second-level heading `## Full Summary`, which contains a key-value table:

```markdown
## Full Summary

| Metric | Value |
| --- | --- |
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means the custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below it, use the third-level heading `### Summary by Data Type` to show the summary table grouped by dtype:

```markdown
### Summary by Data Type

| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
| --- | --- | --- | --- | --- |
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```
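Both summary tables reduce to the same aggregation over per-case speedup ratios. The following is a minimal sketch; `summarize` and `summarize_by_dtype` are hypothetical helper names, and the input ratios are the ones from the example comparison table in this document.

```python
from collections import defaultdict


def summarize(ratios):
    """Aggregate speedup ratios into the key-value metrics of the Full Summary."""
    return {
        "cases": len(ratios),
        "avg": sum(ratios) / len(ratios),
        "custom_better": sum(r > 1 for r in ratios),
        "baseline_better": sum(r < 1 for r in ratios),
    }


def summarize_by_dtype(rows):
    """Group (dtype, ratio) pairs by dtype and summarize each group."""
    groups = defaultdict(list)
    for dtype, ratio in rows:
        groups[dtype].append(ratio)
    return {dtype: summarize(ratios) for dtype, ratios in groups.items()}


# Ratios taken from the example comparison table.
rows = [("float16", 1.036), ("float16", 0.893), ("float16", 1.307),
        ("float16", 1.040), ("float16", 1.490), ("float16", 1.891),
        ("float16", 1.893), ("float32", 0.846)]

overall = summarize([ratio for _, ratio in rows])
by_dtype = summarize_by_dtype(rows)
```

For these inputs the float16 group has 7 cases with an average of about 1.364 (6 favoring the custom operator), matching the example dtype table.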

4. Brief Analysis

Use the second-level heading `## Brief Analysis` and list ≥3 short conclusions as an unordered list, covering overall trends, differences across dtypes and shape scales, and memory access and computation characteristics.

```markdown
## Brief Analysis

- The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
- The custom operator's advantage is more pronounced for large shapes, where the vector path is more fully utilized.
- In the float32 small-shape scenario the custom operator is slightly slower than the baseline, possibly because kernel launch overhead accounts for a large share of the time.
```

Other Conventions

Do not write a `*_profiler_results.json` file that duplicates the report; intermediate statistics exist only in script memory and are written to the Markdown report.

Display Results in Conversation (MANDATORY)

After generating `csrc/ops/<op>/test/<op>_torch_npu_profiler_report.md` (or if it already exists and was updated in this run), the assistant MUST do all of the following in its reply in the current conversation. It must not output only "Report generated" and the path without showing data.

Paste the key performance content (so users can read the conclusions without opening the file; the displayed content follows the report structure):
- Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes shown in the same table. If there are many cases, truncate and note "See report for the rest".
- Full summary: summary metrics in key-value form (number of test cases, average ratio, number of cases where the custom operator or the baseline is better) plus the summary table by data type.
- Brief analysis: ≥3 conclusions as an unordered list, covering overall trends, differences across dtypes and shape scales, and memory access and computation characteristics.

NEVER: Reply with only the report path; NEVER substitute "Please open the Markdown file yourself" for showing the core numbers and conclusions in the conversation.

Common Mistakes

- `warmup` / `active` are changed to values other than 5, which violates the skill's convention.
- `torch_npu.profiler` is not used, or the `prof.step()` calls do not match the schedule.
- With `repeat>1`, multiple `*_ascend_pt` exports may be generated; if the CSV is selected by mtime, the selection semantics must be explained.
- The CSV parser must still find `Total Time(us)` when the header carries a BOM or the column names change.
- The custom operator is not registered, so only the baseline path runs; the custom library must be loaded before comparison.
- Designing test cases without reading `<op>-test-cases.md`: testcase-gen has already generated a unified test case document, so shapes and dtypes should be extracted from it first to avoid duplicate design.
- Generating test cases without reading design.md: this leads to shapes that violate constraints and to missing coverage of key execution modes (such as omitting the transpose=True path).
- Outputting a single-path report on the grounds of "no equivalent baseline interface": a small-operator-combination baseline path must be implemented, and a dual-path comparison table must always be output.
- Using Python loops with scalar-by-scalar assignment in the small-operator combination: the profiler then collects CPU logic rather than NPU operators, which distorts the baseline timing; the baseline implementation should consist mainly of tensor operations.
- The assistant outputs only the report path and does not show the key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").

Reference Implementation (`examples/layer_norm_profiler_reference/`)

This directory mirrors the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/` and includes:

- `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
- `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
- `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For a new operator, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, the forward call, `build_inputs`, and the trace subdirectory name. If `layer_norm/test/` in the repository is updated, synchronize the changes back to `examples/layer_norm_profiler_reference/`.

Checklist (Assistant Self-Check)

- Has read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
- Has read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
- Test cases cover all execution modes described in design.md (such as transpose/non-transpose and input_split mode)
- Parameter values in test cases (attribute values and integer tensor values) are within the constraints of design.md
- Has confirmed the baseline path type (baseline API or small operator combination) and clearly labeled it in the report header
- If there is no equivalent baseline interface, has implemented the small-operator-combination baseline path, and the baseline implementation consists mainly of tensor operations (not Python scalar loops)
- Has used `torch_npu.profiler`, and `warmup=5` and `active=5` have not been modified
- Has generated or updated `<op>_torch_npu_profiler_report.md` in a format consistent with `examples/sample_report.md`: unified comparison table with a DType column + full summary + summary by data type + brief analysis
- Has displayed the unified comparison table with a DType column, the full summary, the summary by data type, and ≥3 brief analysis conclusions in the current conversation, not just the path
- Has explained the fixed step convention and how the metric is defined
