ascendc-operator-performance-eval

Performance Evaluation of AscendC Operators with torch_npu.profiler

Reference Files in This Skill Directory

When executing this skill, prioritize the materials in this directory:

| File / Directory | Purpose |
| --- | --- |
| `SKILL.md` (this file) | Process, directory conventions, complete JSONL test case specification, report structure, fixed schedule |
| `references/REFERENCE_JSON_CASE_FORMAT.md` | Identical to the "Complete JSONL Specification for Performance Test Cases" section below |
| `references/REFERENCE_PROFILER_AND_METRICS.md` | `torch_npu.profiler`, `op_statistic.csv`, and `*_ascend_pt` paths |
| `examples/sample_perf_cases.jsonl` | Minimal LayerNorm-style JSONL; can be copied and renamed |
| `examples/layer_norm_profiler_reference/` | Layer Norm reference implementation (`layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`, test case JSONL, instructions); for a new operator, copy this directory to `csrc/ops/<op>/test/`, then replace the forward logic and filenames |

Role

In ascend-kernel, establish a reusable workflow for `csrc/ops/<operator-name>/` that covers profiler performance test cases and a Markdown report comparing the custom operator against a baseline. Data collection must use `torch_npu.profiler`, with `warmup` and `active` both fixed at 5 (see "Fixed Profiler Steps" below). For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.

Core Principles (Two Mandatory Constraints):
  1. A comparison report must always be presented: whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on the NPU.
  2. Test case generation must first read `design.md`: before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract the parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.

Test Case Source: Load from the testcase-gen Test Case Document (MANDATORY)

Before generating or modifying any JSONL test cases, you MUST first read the test case document produced by testcase-gen.

Step 0: Read the testcase-gen Test Case Document

READ csrc/ops/<op>/test/<op>-test-cases.md

Extract the following from it:

| Extracted Item | Location in Test Case Document | Purpose |
| --- | --- | --- |
| SUPPORTED_DTYPES | §Test Configuration | dtype coverage for JSONL test cases |
| TEST_SHAPES | §Test Configuration | Basis for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes; can supplement performance scenarios |
| NPU calling method | §Operator Baseline | Forward call for the custom operator |
| CPU reference implementation | §Operator Baseline | Reference implementation for the baseline path |

Conversion Rules from testcase-gen Output to JSONL Test Cases

  1. Select representative shapes from TEST_SHAPES + GENERAL_SHAPES (covering small, medium, and large scales), avoiding duplicates.
  2. For each shape, iterate over every dtype in SUPPORTED_DTYPES.
  3. Fill the `inputs` field of each JSONL case with attribute values from design.md (such as block_size and eps).
  4. The total number of JSONL test cases must be ≥ 8.
  5. Use the NPU calling method and the CPU reference implementation from the operator baseline section to build the custom operator path and the baseline path.

If `<op>-test-cases.md` does not exist: fall back to designing test cases entirely from design.md (following the process below), and note in the report: "Test cases are self-designed, not generated by testcase-gen".
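The conversion rules can be sketched roughly as follows. This is a minimal illustration, not the skill's actual generator: `build_perf_cases`, `MIN_CASES`, and the example shapes and dtypes are hypothetical names and values, and the LayerNorm-style `inputs` layout is borrowed from the JSONL example later in this document.

```python
import itertools
import json

MIN_CASES = 8  # lower bound on the number of JSONL cases required by rule 4


def build_perf_cases(shapes, dtypes, eps=1e-5):
    """Return one JSONL line per (shape, dtype) pair, LayerNorm-style."""
    # Drop duplicate shapes while keeping their original order (rule 1).
    unique_shapes = list(dict.fromkeys(tuple(s) for s in shapes))
    lines = []
    # Every shape is paired with every supported dtype (rule 2).
    for shape, dtype in itertools.product(unique_shapes, dtypes):
        case = {
            "inputs": [
                {"name": "x", "type": "tensor", "required": True,
                 "dtype": dtype, "shape": list(shape)},
                {"name": "normalized_shape", "type": "attr", "required": True,
                 "dtype": "int", "value": [shape[-1]]},
                # Attribute values such as eps should come from design.md (rule 3).
                {"name": "eps", "type": "attr", "required": False,
                 "dtype": "float", "value": eps},
            ]
        }
        lines.append(json.dumps(case))
    if len(lines) < MIN_CASES:
        raise ValueError(
            f"only {len(lines)} cases generated; at least {MIN_CASES} are required")
    return lines


# Hypothetical stand-ins for TEST_SHAPES + GENERAL_SHAPES and SUPPORTED_DTYPES.
example_shapes = [[8, 128], [64, 4096], [8, 128], [2, 1024, 4096], [128, 6144]]
example_dtypes = ["float16", "float32"]
example_lines = build_perf_cases(example_shapes, example_dtypes)
```

Here the duplicated `[8, 128]` shape is dropped, so four shapes times two dtypes yields exactly eight lines, each of which can be written as one row of `<op>_perf_cases.jsonl`.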

Test Case Generation: Must First Read design.md (Mandatory)

Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps:

Step 1: Read the Design Document

READ csrc/ops/<op>/design.md

Extract the following from it:

| Extracted Item | Location in design.md | Purpose |
| --- | --- | --- |
| Supported data types | §1 "Supported Data Types" | dtype coverage for test cases |
| Parameter constraints and value ranges | Constraint column of §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical shapes / input scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Basis for small/medium/large-scale test case shapes |
| Mode combinations of key attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that each need coverage (e.g., do_transpose=True/False, is_input_split=True/False) |
| Performance-critical points | §6 "Performance Optimization" / §3 "Tiling Strategy" | Branches that affect performance (e.g., transposed vs non-transposed inputs taking different DMA paths) |

Step 2: Test Case Design Rules

| Rule | Description |
| --- | --- |
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), each mode must have at least one test case |
| Cover all supported dtypes | Each supported data type needs at least one set of test cases, using a typical medium-scale shape |
| One set each for small/medium/large-scale shapes | Cover small scale (kernel launch overhead dominates), medium scale (typical production scenario), and large scale (memory bandwidth dominates) |
| Parameter values come from the constraint ranges | Attribute values (such as block_size) must be chosen from the constraints in design.md and must not be set arbitrarily |
| Integer/index tensor values must be semantically valid | Concrete values of tensors such as win_lengths and offsets must satisfy the operator semantics (e.g., offsets must be valid window start offsets) |

Step 3: Validate Test Cases

After generating the test cases, check that:
  • All dtypes are covered
  • Every execution mode defined by design.md has a corresponding test case
  • Parameter values (both attribute values and integer tensor values) are within the constraints of design.md
  • There is at least one "small shape" test case and one "large shape" test case
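Part of this checklist can be automated. The sketch below is one possible shape for such a check, under several assumptions: `check_coverage`, `find_input`, and `make_case` are hypothetical helpers, `do_transpose` is used only as an example mode attribute, and the small/large element-count thresholds are placeholders that would have to come from design.md. It does not check attribute values against constraint ranges.

```python
import math


def find_input(case, name):
    """Return the entry of case["inputs"] with the given name, or None."""
    for item in case["inputs"]:
        if item["name"] == name:
            return item
    return None


def check_coverage(cases, supported_dtypes, mode_attr, mode_values,
                   small_numel, large_numel, tensor_name="x"):
    """Return a list of human-readable coverage problems (empty if none)."""
    seen_dtypes, seen_modes, numels = set(), set(), []
    for case in cases:
        tensor = find_input(case, tensor_name)
        seen_dtypes.add(tensor["dtype"])
        numels.append(math.prod(tensor["shape"]))
        attr = find_input(case, mode_attr)
        if attr is not None:
            seen_modes.add(attr["value"])

    problems = []
    for dtype in supported_dtypes:
        if dtype not in seen_dtypes:
            problems.append(f"dtype not covered: {dtype}")
    for value in mode_values:
        if value not in seen_modes:
            problems.append(f"mode not covered: {mode_attr}={value}")
    if not any(n <= small_numel for n in numels):
        problems.append("no small-shape case")
    if not any(n >= large_numel for n in numels):
        problems.append("no large-shape case")
    return problems


def make_case(shape, dtype, do_transpose):
    # Demo helper; "do_transpose" is a hypothetical mode attribute.
    return {"inputs": [
        {"name": "x", "type": "tensor", "required": True,
         "dtype": dtype, "shape": shape},
        {"name": "do_transpose", "type": "attr", "required": False,
         "dtype": "bool", "value": do_transpose},
    ]}


demo_cases = [make_case([8, 128], "float16", True),
              make_case([2, 1024, 4096], "float16", True)]
demo_problems = check_coverage(
    demo_cases, supported_dtypes=["float16", "float32"],
    mode_attr="do_transpose", mode_values=[True, False],
    small_numel=4096, large_numel=1 << 20,
)
```

In this demo the two cases cover a small and a large shape but miss `float32` and the `do_transpose=False` path, so `demo_problems` lists exactly those two gaps.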

Reference Path Decision Tree (Mandatory)

Performance evaluation always requires a dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:

Does the operator have an equivalent baseline API?
  ├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
  │    └─ Use the baseline API as the baseline path
  └─ No (no equivalent baseline interface)
       └─ Must implement a "small operator combination" baseline path ← Mandatory requirement of this skill
            └─ Implement it with the PyTorch basic operator combination from the "Reference Implementation" or "Pseudocode" section of the design document

Requirements for the Small Operator Combination Baseline Path

When there is no equivalent baseline interface, you must:
  1. Read the reference implementation from design.md: the design document usually contains a PyTorch reference implementation (pseudocode or a Python function); build the baseline path on top of it.
  2. Use combinations of PyTorch basic operators: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, and so on. Do not use loops with Python scalar assignment (otherwise the profiler captures CPU operators instead of NPU operators, and a fair comparison is impossible); the entire baseline implementation must consist mainly of tensor operations that can run on the NPU.
  3. Label it clearly in the report: the report header must state "No equivalent baseline interface, baseline path is a combination of small operators" and list the basic operators used.
  4. Always present the comparison table: do not degrade to a single-path report because there is no baseline interface; keep the three columns "Custom Operator per-step", "Baseline per-step", and "Ratio".

NEVER: output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".

Fixed Profiler Steps (Mandatory)

| Parameter | Value | Description |
| --- | --- | --- |
| `warmup` | 5 | Must not be changed to another value by the script or CLI |
| `active` | 5 | Must not be changed to another value by the script or CLI |
| `wait` | Default 0 | May be kept as a CLI option or a constant, as needed |
| `repeat` | Default 1 | Fixed at 1 for simple scenarios; if `repeat>1`, the documentation must explain how the CSV is selected |

`prof.step()` must be called at the end of each step; total number of loop steps = `repeat * (wait + warmup + active)`.
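The step arithmetic can be illustrated with a small, profiler-free sketch. The names `total_profiler_steps` and `run_profiled_loop` are hypothetical; in a real script the second callback would be `prof.step` on the object returned by `torch_npu.profiler.profile(...)`, which is not exercised here.

```python
WARMUP_STEPS = 5  # fixed by this skill; not configurable
ACTIVE_STEPS = 5  # fixed by this skill; not configurable


def total_profiler_steps(wait=0, repeat=1):
    """Total loop iterations: repeat * (wait + warmup + active)."""
    if wait < 0 or repeat < 1:
        raise ValueError("wait must be >= 0 and repeat must be >= 1")
    return repeat * (wait + WARMUP_STEPS + ACTIVE_STEPS)


def run_profiled_loop(forward_fn, step_fn, wait=0, repeat=1):
    """Run forward_fn once per step and call step_fn at the end of each step."""
    total = total_profiler_steps(wait=wait, repeat=repeat)
    for _ in range(total):
        forward_fn()
        step_fn()  # stands in for prof.step() in a real profiling run
    return total


# Demo with plain callbacks that only record the call order.
calls = []
demo_total = run_profiled_loop(lambda: calls.append("forward"),
                               lambda: calls.append("step"))
```

With the defaults (`wait=0`, `repeat=1`) the loop runs 10 iterations, 5 warmup and 5 active, and every forward call is followed by exactly one step call.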

File Placement (Unified in the Operator's `test/` Subdirectory)

All of the following artifacts are placed in `ascend-kernel/csrc/ops/<op>/test/`:

| Category | Naming Convention (`<op>`, e.g., `layer_norm`) |
| --- | --- |
| Test cases (JSONL only) | `<op>_perf_cases.jsonl` (one JSON object per line); do not maintain or generate `<op>_perf_cases.json` |
| Markdown report | `<op>_torch_npu_profiler_report.md` (the only structured result written to disk; do not generate `<op>_torch_npu_profiler_results.json`) |
| Profiler export root directory | `test/profiler_trace/` (or overridden with `--trace-root`) |

Performance scripts and shared modules live in the same `test/` directory as the files above.

Complete JSONL Specification for Performance Test Cases

The following is the complete field and type description for test case files. Use only `.jsonl` as the test case format.

1. File Format

| Format | Description |
| --- | --- |
| JSONL | One JSON object per line, each line terminated by a newline; empty lines are ignored. File extension: `.jsonl` |

Do not generate `.json` array files in this process, and do not maintain them in sync with the test cases.

2. Top-level Structure of a Single Test Case

Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (for a different operator, replace the `build_inputs` convention; the structure must still include the `inputs` array):

```json
{
  "inputs": [
    { "name": "x", "type": "tensor", "required": true, "dtype": "float16", "shape": [8, 128] },
    { "name": "normalized_shape", "type": "attr", "required": true, "dtype": "int", "value": [128] },
    { "name": "use_affine", "type": "attr", "required": false, "dtype": "bool", "value": true },
    { "name": "eps", "type": "attr", "required": false, "dtype": "float", "value": 1e-05 }
  ]
}
```

  • The `name` of each element in `inputs` is unique within the same test case.
  • For other operators: the rules for tensors, `tensor_list`, `attr`, integer tensor `range`, and so on are in `references/REFERENCE_JSON_CASE_FORMAT.md`.

3. Complete JSONL Example (Two Lines, Layer Norm)

```json
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[2,128]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[128]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":true},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
{"inputs":[{"name":"x","type":"tensor","required":true,"dtype":"float16","shape":[4,256]},{"name":"normalized_shape","type":"attr","required":true,"dtype":"int","value":[256]},{"name":"use_affine","type":"attr","required":false,"dtype":"bool","value":false},{"name":"eps","type":"attr","required":false,"dtype":"float","value":1e-05}]}
```

For more complete field descriptions (such as `tensor_list` and the `int` tensor `range`), see `references/REFERENCE_JSON_CASE_FORMAT.md`.
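A loader that enforces the structural rules stated in this specification (one object per line, empty lines ignored, an `inputs` array, unique `name` values) might look like the sketch below. `load_perf_cases` is a hypothetical helper, not necessarily how the reference scripts parse the file, and it checks nothing beyond those rules.

```python
import json


def load_perf_cases(text):
    """Parse JSONL text into a list of case dicts, validating the basics."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # empty lines are ignored
        case = json.loads(line)
        inputs = case.get("inputs") if isinstance(case, dict) else None
        if not isinstance(inputs, list):
            raise ValueError(f"line {line_no}: 'inputs' must be an array")
        names = [item["name"] for item in inputs]
        if len(names) != len(set(names)):
            raise ValueError(f"line {line_no}: duplicate input names")
        cases.append(case)
    return cases


# Two shortened LayerNorm-style lines separated by an empty line.
sample_jsonl = (
    '{"inputs":[{"name":"x","type":"tensor","required":true,'
    '"dtype":"float16","shape":[2,128]}]}\n'
    "\n"
    '{"inputs":[{"name":"x","type":"tensor","required":true,'
    '"dtype":"float16","shape":[4,256]}]}\n'
)
loaded_cases = load_perf_cases(sample_jsonl)
```

The empty middle line is skipped, so `loaded_cases` holds two cases; a line whose `inputs` repeats a `name` raises `ValueError` with the offending line number.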

Profiler and Directory Semantics (Summary)

  • Each `with torch_npu.profiler.profile(...)` block generates an export directory with the suffix `_ascend_pt` under the handler directory; the CSV path is `…/*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv`.
  • Each test case and each implementation (e.g., `custom` / `baseline`) uses its own independent `with` block; the recommended subpath is `{trace_root}/{op_trace_tag}/{custom|baseline}/case_XXX/`, and `case_XXX` should be cleared before each run.

For details, see `references/REFERENCE_PROFILER_AND_METRICS.md`.
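The directory convention can be sketched with the standard library alone. In the sketch below, `prepare_case_dir` and `find_op_statistic_csv` are hypothetical helpers, the three-digit zero padding for `case_XXX` is an assumption, and the export directory name in the demo is made up; only the `*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv` pattern comes from the text above.

```python
import shutil
import tempfile
from pathlib import Path

CSV_PATTERN = "*_ascend_pt/ASCEND_PROFILER_OUTPUT/op_statistic.csv"


def prepare_case_dir(trace_root, op_trace_tag, impl, case_index):
    """Create an empty {trace_root}/{op_trace_tag}/{impl}/case_XXX directory."""
    if impl not in ("custom", "baseline"):
        raise ValueError(f"unknown implementation: {impl}")
    # Zero-padding to three digits is an assumption about the XXX placeholder.
    case_dir = Path(trace_root) / op_trace_tag / impl / f"case_{case_index:03d}"
    if case_dir.exists():
        shutil.rmtree(case_dir)  # clear stale exports from earlier runs
    case_dir.mkdir(parents=True)
    return case_dir


def find_op_statistic_csv(case_dir):
    """Return the single op_statistic.csv exported under case_dir."""
    matches = sorted(Path(case_dir).glob(CSV_PATTERN))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one op_statistic.csv under {case_dir}, "
            f"found {len(matches)}")
    return matches[0]


# Demo on a temporary directory; "host_123_ascend_pt" is a made-up export name.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = prepare_case_dir(tmp, "layer_norm", "custom", 0)
    fake_output = demo_dir / "host_123_ascend_pt" / "ASCEND_PROFILER_OUTPUT"
    fake_output.mkdir(parents=True)
    (fake_output / "op_statistic.csv").write_text("OP Type,Total Time(us)\n")
    demo_csv_relative = find_op_statistic_csv(demo_dir).relative_to(tmp).as_posix()
```

Because the case directory is cleared before each run, a single matching CSV is expected; the lookup deliberately fails when several exports are present (for example with `repeat>1`), since the selection rule would then need to be stated explicitly.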

Metrics (Summary)

  1. For the CSV produced by a single `with` block: sum Total Time(us) over all operator rows.
  2. Divide by `active×repeat` (when `divisor_mode=active_steps`) or by `active` alone (`active_only`). This skill fixes `active=5`; if `repeat=1`, then `divisor = 5`.
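As a rough illustration of this metric, the sketch below sums the Total Time(us) column and divides by the chosen divisor. `per_step_time_us` is a hypothetical helper; the sample rows and every column other than Total Time(us) are invented for the demo, and the BOM and header-whitespace handling is only one possible way to tolerate header variations.

```python
import csv
import io

TIME_COLUMN = "Total Time(us)"


def per_step_time_us(csv_text, active=5, repeat=1, divisor_mode="active_steps"):
    """Sum Total Time(us) over all rows and divide by the step divisor."""
    # Strip a leading UTF-8 BOM so the first header name is read correctly.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")))
    total = 0.0
    for row in reader:
        # Tolerate stray whitespace around header names.
        normalized = {key.strip(): value
                      for key, value in row.items() if key is not None}
        total += float(normalized[TIME_COLUMN])
    if divisor_mode == "active_steps":
        divisor = active * repeat
    elif divisor_mode == "active_only":
        divisor = active
    else:
        raise ValueError(f"unknown divisor_mode: {divisor_mode}")
    return total / divisor


# Invented sample: two operator rows totalling 61.0 us over 5 active steps.
sample_csv = (
    "\ufeffOP Type,Count,Total Time(us)\n"
    "LayerNormCustom,5,48.75\n"
    "Cast,5,12.25\n"
)
sample_per_step = per_step_time_us(sample_csv)
```

With `active=5` and `repeat=1` the divisor is 5 in both modes, so the sample evaluates to 61.0 / 5 = 12.2 us per step.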

Mandatory Structure of the Performance Comparison Report (Markdown)

The report format strictly follows `examples/sample_report.md`, with the following structure:

1. Title

The report title is "Performance Evaluation Results".

2. Comparison Table (Unified Single Table, Mandatory Dual Path)

All test cases are displayed in a single table with the fixed header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`.

Example, under the heading "Performance Comparison":

| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |
| --- | --- | --- | --- | --- | --- |
| 0 | [128, 4096] | float16 | 9.75 | 10.10 | 1.036 |
| 1 | [128, 5120] | float16 | 10.52 | 9.39 | 0.893 |
| 2 | [128, 6144] | float16 | 10.99 | 14.36 | 1.307 |
| 3 | [64, 6400] | float16 | 9.13 | 9.49 | 1.040 |
| 4 | [2, 1024, 4096] | float16 | 57.01 | 84.92 | 1.490 |
| 5 | [2, 1024, 6144] | float16 | 73.80 | 139.56 | 1.891 |
| 6 | [1, 2048, 6400] | float16 | 75.60 | 143.09 | 1.893 |
| 7 | [64, 4096] | float32 | 8.45 | 7.14 | 0.846 |
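Rendering this table from measured per-step times could look like the sketch below. The example rows are consistent with speedup = baseline time / custom time (for instance 10.10 / 9.75 ≈ 1.036), but the text does not state the formula, so treat that definition as an inference; `speedup` and `render_comparison_table` are hypothetical helpers.

```python
HEADER = "| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |"
SEPARATOR = "| --- | --- | --- | --- | --- | --- |"


def speedup(custom_us, baseline_us):
    """Inferred definition: values above 1 mean the custom operator is faster."""
    return baseline_us / custom_us


def render_comparison_table(rows):
    """rows: iterable of (shape, dtype, custom_us, baseline_us) tuples."""
    lines = [HEADER, SEPARATOR]
    for index, (shape, dtype, custom_us, baseline_us) in enumerate(rows):
        ratio = speedup(custom_us, baseline_us)
        lines.append(
            f"| {index} | {list(shape)} | {dtype} | "
            f"{custom_us:.2f} | {baseline_us:.2f} | {ratio:.3f} |")
    return "\n".join(lines)


# First two rows of the example table above.
demo_table = render_comparison_table([
    ([128, 4096], "float16", 9.75, 10.10),
    ([128, 5120], "float16", 10.52, 9.39),
])
```

Keeping both time columns and the ratio in every row is what makes the table dual-path: a row cannot be rendered without a baseline measurement.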

3. Full Summary

Use the second-level heading `## Full Summary`, which contains a key-value table:

| Metric | Value |
| --- | --- |
| Number of test cases | N |
| Average speedup ratio (>1 means the custom operator is faster) | X.XXX |
| Custom operator better (ratio > 1) | M |
| Baseline better (ratio < 1) | K |

Immediately below, use the third-level heading `### Summary by Data Type` to display the summary table grouped by dtype:

| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
| --- | --- | --- | --- | --- |
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
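The summary figures can be derived from the per-case ratios with a few lines of aggregation. This sketch assumes a hypothetical `summarize` helper that takes `(dtype, ratio)` pairs; a ratio of exactly 1 is counted in neither the "custom better" nor the "baseline better" bucket, which is a choice of this sketch rather than something the text specifies.

```python
from collections import defaultdict


def summarize(ratios):
    """ratios: list of (dtype, speedup_ratio). Returns (overall, per_dtype)."""
    def stats(values):
        return {
            "cases": len(values),
            "average": sum(values) / len(values),
            "custom_better": sum(1 for v in values if v > 1),
            "baseline_better": sum(1 for v in values if v < 1),
        }

    by_dtype = defaultdict(list)
    for dtype, ratio in ratios:
        by_dtype[dtype].append(ratio)
    overall = stats([ratio for _, ratio in ratios])
    per_dtype = {dtype: stats(values) for dtype, values in by_dtype.items()}
    return overall, per_dtype


# Ratios taken from the example comparison table.
example_ratios = [
    ("float16", 1.036), ("float16", 0.893), ("float16", 1.307),
    ("float16", 1.040), ("float16", 1.490), ("float16", 1.891),
    ("float16", 1.893), ("float32", 0.846),
]
overall_summary, dtype_summary = summarize(example_ratios)
```

For the example ratios, the float16 group has 7 cases averaging about 1.364 with 6 wins for the custom operator, matching the sample "Summary by Data Type" table.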

4. Brief Analysis

Use the second-level heading `## Brief Analysis` and list ≥3 short conclusions as an unordered list, covering the overall trend, differences across dtypes and shape scales, memory access and computation characteristics, and so on. For example:

  • The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
  • The custom operator's advantage is more pronounced for large shapes, where the vector path is more fully utilized.
  • For small float32 shapes the custom operator is slightly slower than the baseline, possibly because kernel launch overhead accounts for a large share of the time.

Other Conventions

  • Do not write a `*_profiler_results.json` file that duplicates the report; intermediate statistics exist only in script memory and are written to the Markdown report.

Display Results in Conversation (MANDATORY)

After `csrc/ops/<op>/test/<op>_torch_npu_profiler_report.md` has been generated (or already existed and was updated in this run), the assistant MUST do all of the following in its reply in the current conversation. It must not output only "Report generated" and the path without showing the data:
  1. Paste the key performance content (so that users can read the conclusions without opening the file; the content shown matches the report structure):
    • Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes in the same table. If there are many cases, truncate and note "See report for the rest".
    • Full summary: summary metrics in key-value form (number of test cases, average ratio, number of cases where the custom operator or the baseline is better) plus the summary table by data type.
    • Brief analysis: ≥3 conclusions as an unordered list, covering the overall trend, differences across dtypes and shape scales, memory access and computation characteristics, and so on.

NEVER: reply with only the report path. NEVER substitute "Please open the Markdown file yourself" for showing the core numbers and conclusions in the conversation.

Common Mistakes

  • `warmup` / `active` are changed to values other than 5, which violates the skill's convention.
  • `torch_npu.profiler` is not used, or the `prof.step()` calls do not match the schedule.
  • `repeat>1` may produce multiple `*_ascend_pt` exports; when the CSV is selected by mtime, the selection semantics must be explained.
  • The CSV header may carry a BOM or changed column names; parsing of Total Time(us) must tolerate both.
  • The custom operator is not registered, so only the baseline path runs; load the custom library before comparing.
  • Designing test cases without reading `<op>-test-cases.md`: testcase-gen has already generated a unified test case document, so extract shapes and dtypes from it first to avoid duplicate design work.
  • Generating test cases without reading design.md: shapes end up violating the constraints, and key execution modes go uncovered (for example, the transpose=True path is missed).
  • Outputting a single-path report on the grounds of "no equivalent baseline interface": the small operator combination baseline path must be implemented, and the dual-path comparison table must always be output.
  • Building the small operator combination with Python loops that assign scalar by scalar: the profiler captures CPU logic instead of NPU operators, which distorts the baseline timing; the baseline implementation should consist mainly of tensor operations.
  • The assistant outputs only the report path and does not show the key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").

Reference Implementation (`examples/layer_norm_profiler_reference/`)

This directory mirrors the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/` and includes:
  • `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
  • `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
  • `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For a new operator, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, forward call, `build_inputs`, and trace subdirectory name. If `layer_norm/test/` in the repository is updated, synchronize the changes back to `examples/layer_norm_profiler_reference/`.

Checklist (Assistant Self-Check)

  • Read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
  • Read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
  • Test cases cover all execution modes described in design.md (such as transpose/non-transpose and input_split mode)
  • Parameter values in the test cases (attribute values and integer tensor values) are all within the constraints of design.md
  • Confirmed the baseline path type (baseline API or small operator combination) and labeled it clearly in the report header
  • If there is no equivalent baseline interface, implemented the small operator combination baseline path, consisting mainly of tensor operations (not Python scalar loops)
  • Used `torch_npu.profiler`, with `warmup=5` and `active=5` left unmodified
  • Generated or updated `<op>_torch_npu_profiler_report.md` in a format consistent with `examples/sample_report.md`: unified comparison table with a DType column + full summary + summary by data type + brief analysis
  • Displayed the unified comparison table with a DType column, the full summary, the summary by data type, and ≥3 brief analysis conclusions in the current conversation, not just the path
  • Explained the fixed step convention and how the metric is computed