Maintain JSONL-only profiler performance test cases under csrc/ops/<op>/test in ascend-kernel. Collect data using torch_npu.profiler (with fixed warmup=5 and active=5), aggregate the Total Time(us) from ASCEND_PROFILER_OUTPUT/op_statistic.csv, and output a unified Markdown comparison report (custom operator vs baseline) that includes a DType column. Do not generate perf_cases.json or *_profiler_results.json. Refer to examples/layer_norm_profiler_reference/ for the reference implementation.
# Performance Evaluation of AscendC Operators with torch_npu.profiler
## Reference Files in This Skill Directory

When executing this skill, prioritize the materials in this directory:

| File / Directory | Purpose |
| --- | --- |
| SKILL.md (this file) | Process, directory conventions, complete JSONL test case specification, report structure, fixed schedule |
| references/REFERENCE_JSON_CASE_FORMAT.md | Identical to the "JSONL Specification for Performance Test Cases" section below |
| references/REFERENCE_PROFILER_AND_METRICS.md | Details on `torch_npu.profiler`, `op_statistic.csv`, and the `*_ascend_pt` export path |
| examples/sample_perf_cases.jsonl | Minimal LayerNorm-style JSONL; can be copied and renamed |
| examples/layer_norm_profiler_reference/ | Layer Norm reference implementation (`layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`, test case JSONL, instructions); new operators can copy this directory to `csrc/ops/<op>/test/` and replace the forward logic and filenames |
## Role

In ascend-kernel, establish a reusable process for profiler performance test cases and Markdown reports comparing custom operators against baselines for `csrc/ops/<operator-name>/`. Data collection must use `torch_npu.profiler`, with `warmup` and `active` fixed at 5 (see the next section). For details, refer to `references/REFERENCE_PROFILER_AND_METRICS.md`.
Core Principles (Two Mandatory Constraints):

1. The comparison report must always be presented: regardless of whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on NPU.
2. Test case generation must first read `design.md`: before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.
## Test Case Source: Load from the testcase-gen Test Case Document (MANDATORY)

Before generating or modifying any JSONL test cases, you MUST first read the test case document generated by testcase-gen.

### Step 0: Read the testcase-gen Test Case Document

`READ csrc/ops/<op>/test/<op>-test-cases.md`
Extract the following from it:

| Extracted Item | Location in Test Case Document | Purpose |
| --- | --- | --- |
| SUPPORTED_DTYPES | §Test Configuration | dtype coverage for JSONL test cases |
| TEST_SHAPES | §Test Configuration | Benchmark for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes; can be supplemented for performance scenarios |
| NPU Calling Method | §Operator Baseline | Forward call for the custom operator path |
| CPU Reference Implementation | §Operator Baseline | Reference implementation for the baseline path |
### Conversion Rules from testcase-gen Output to JSONL Test Cases

- Iterate through all dtypes in SUPPORTED_DTYPES for each shape.
- Fill the `inputs` field of each JSONL case with attribute values (such as block_size, eps, etc.) from design.md.
- The total number of JSONL test cases must be ≥ 8.
- The NPU calling method and the CPU reference implementation from the operator baseline are used to build the custom operator path and the baseline path, respectively.

If `<op>-test-cases.md` does not exist: fall back to designing test cases entirely from design.md (following the process below), but note "Test cases are self-designed, not generated by testcase-gen" in the report.
## Test Case Generation: Must First Read design.md (Mandatory)

Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps.

### Step 1: Read the Design Document

`READ csrc/ops/<op>/design.md`
Extract the following from it:

| Extracted Item | Location in design.md | Purpose |
| --- | --- | --- |
| Supported data types | §1 "Supported Data Types" | dtype coverage for test cases |
| Parameter constraints and value ranges | Constraint column in §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical shapes / input scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Benchmark for small/medium/large-scale test case shapes |
| Mode combinations of key attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that must each be covered (e.g., do_transpose=True/False, is_input_split=True/False) and branches that affect performance (e.g., transpose vs non-transpose using different DMA paths) |
### Step 2: Test Case Design Rules

| Rule | Description |
| --- | --- |
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), there must be at least one test case for each mode |
| Cover all supported dtypes | At least one set of test cases per supported data type, using a typical medium-scale shape |
| One set each for small/medium/large shapes | Must cover small scale (kernel launch overhead dominant), medium scale (typical production scenario), and large scale (memory bandwidth dominant) |
| Parameter values from constraint ranges | Attribute values (such as block_size) must be selected from the constraints in design.md and must not be set arbitrarily |
| Integer/index tensor values must be semantically valid | Specific values of tensors such as win_lengths and offsets must comply with operator semantics (e.g., offsets must be valid window start offsets) |
### Step 3: Validate Test Cases

After generating test cases, check that:

- All dtypes are covered.
- All execution modes (defined by design.md) have corresponding test cases.
- Parameter values (including attribute values and integer tensor values) are within the constraints of design.md.
- There is at least one "small shape" test case and one "large shape" test case.
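The Step 3 checks can be sketched as a small self-check helper. This is only an illustration: the per-case field names used here (`dtype`, the mode flag under `mode_key`, and `numel` as a proxy for shape scale) are hypothetical and must be adapted to the operator's actual JSONL schema.

```python
def validate_cases(cases, required_dtypes, required_modes, mode_key):
    """Return a list of problems; an empty list means the cases pass the
    Step 3 checks. Field names ("dtype", mode_key, "numel") are illustrative."""
    covered_dtypes = {c["dtype"] for c in cases}
    covered_modes = {c.get(mode_key) for c in cases}
    problems = []
    missing_dtypes = set(required_dtypes) - covered_dtypes
    if missing_dtypes:
        problems.append(f"uncovered dtypes: {sorted(missing_dtypes)}")
    missing_modes = set(required_modes) - covered_modes
    if missing_modes:
        problems.append(f"uncovered modes: {sorted(missing_modes)}")
    # Approximate the small-shape/large-shape requirement by element count spread.
    sizes = [c["numel"] for c in cases]
    if len(sizes) < 2 or min(sizes) == max(sizes):
        problems.append("need at least one small-shape and one large-shape case")
    return problems
```

Run it once after generating the JSONL file and refuse to proceed to profiling while the returned list is non-empty.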
## Baseline Path Decision Tree (Mandatory)

Performance evaluation always requires a dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:

```
Does the operator have an equivalent baseline API?
├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
│   └─ Use the baseline API as the baseline path
└─ No (no equivalent baseline interface)
    └─ Must implement a "small operator combination" baseline path  ← mandatory requirement of this skill
        └─ Build it from PyTorch basic operator combinations per the "Reference Implementation"
           or "Pseudocode" section of the design document
```
### Requirements for the Small Operator Combination Baseline Path

When there is no equivalent baseline interface, you must:

1. Read the reference implementation from design.md: the design document usually contains a PyTorch reference implementation (pseudocode or a Python function); use it as the basis for the baseline path.
2. Use PyTorch basic operator combinations: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, etc. Do not use loops with Python scalar assignment (otherwise the profiler collects CPU operators instead of NPU operators, making a fair comparison impossible); the entire baseline implementation must be dominated by tensor operations and executable on NPU.
3. Clearly label it in the report: the report header must state "No equivalent baseline interface; baseline path is a combination of small operators" and list the basic operators used.
4. Present the comparison table: do not degrade to a single-path report because there is "no baseline interface"; retain the three columns "Custom Operator per-step", "Baseline per-step", and "Ratio".

NEVER: output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".
## Fixed Profiler Steps (Mandatory)

| Parameter | Value | Description |
| --- | --- | --- |
| warmup | 5 | Do not change to another value via script or CLI |
| active | 5 | Do not change to another value via script or CLI |
| wait | Default 0 | May be kept as a CLI option or a constant; adjust as needed |
| repeat | Default 1 | Fixed to 1 for simple scenarios; if `repeat > 1`, the CSV selection semantics must be explained in the document |

Call `prof.step()` at the end of each step; the total number of loop steps is `repeat * (wait + warmup + active)`.
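The step arithmetic above can be made explicit with a small helper; the surrounding loop skeleton is only a sketch (the `torch_npu.profiler` context is shown in comments because it requires an NPU environment, and `run_forward` is a placeholder for the custom operator or baseline call).

```python
def total_profiler_steps(wait=0, warmup=5, active=5, repeat=1):
    """Number of iterations the measurement loop must run so the profiler
    schedule completes every phase: repeat * (wait + warmup + active)."""
    return repeat * (wait + warmup + active)

# The measurement loop then looks like (profiler context elided, NPU-only):
#
#   with torch_npu.profiler.profile(schedule=..., on_trace_ready=...) as prof:
#       for _ in range(total_profiler_steps()):
#           run_forward()     # custom operator path or baseline path
#           prof.step()       # mandatory: advance the schedule every step
```

With the fixed schedule (wait=0, warmup=5, active=5, repeat=1), the loop runs exactly 10 steps.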
## File Placement (Unified in the Operator test/ Subdirectory)

All of the following artifacts are placed in `ascend-kernel/csrc/ops/<op>/test/`:

| Category | Naming Convention (`<op>` e.g., `layer_norm`) |
| --- | --- |
| Test cases (JSONL only) | `<op>_perf_cases.jsonl` (one JSON object per line); do not maintain or generate `<op>_perf_cases.json` |
| Markdown report | `<op>_torch_npu_profiler_report.md` (the only structured result to be saved; do not generate `<op>_torch_npu_profiler_results.json`) |
| Profiler export root directory | `test/profiler_trace/` (or override with `--trace-root`) |

Performance scripts and common modules are placed in the same `test/` directory as the files above.
## Complete JSONL Specification for Performance Test Cases

The following is the complete field and type description for test case files; use only `.jsonl` as the test case carrier.

### 1. File Format

| Format | Description |
| --- | --- |
| JSONL | One JSON object per line, each terminated by a newline; empty lines are ignored. File extension `.jsonl`. |

Do not generate or maintain `.json` array files in sync with the test cases in this process.

### 2. Top-level Structure of a Single Test Case

Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (different operators replace the `build_inputs` convention, but the structure must still include the `"inputs"` key):
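A minimal sketch of one such case and of the one-object-per-line loading rule follows; the `case`, `dtype`, and `inputs` field layout shown here is illustrative, not the authoritative schema (consult `references/REFERENCE_JSON_CASE_FORMAT.md` for the exact fields).

```python
import json

# One hypothetical layer_norm perf case as a single JSONL line.
JSONL_LINE = (
    '{"case": "ln_medium_fp16", "dtype": "float16", '
    '"inputs": [{"shape": [32, 1024], "dtype": "float16"}, {"eps": 1e-5}]}'
)

def load_jsonl(text):
    """Parse JSONL content: one JSON object per line, empty lines ignored."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

cases = load_jsonl(JSONL_LINE + "\n\n")  # trailing empty lines are skipped
```

Loading with a per-line `json.loads` (rather than reading the whole file as one JSON array) is what makes the `.jsonl`-only convention enforceable.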
### 3. Full Summary

Use the second-level title `## Full Summary` with summary metrics in key-value table form:

```markdown
## Full Summary
| Metric | Value |
| ---- | -- |
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below, use the third-level title `### Summary by Data Type` to display the summary table grouped by dtype:

```markdown
### Summary by Data Type
| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
|-----|------|-------------------|-------------|--------|
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```
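The per-dtype summary rows can be aggregated from per-case timings as sketched below; the `(dtype, custom_us, baseline_us)` tuple layout is an assumption about the script's in-memory records, and the speedup ratio is baseline time divided by custom operator time.

```python
def summarize(results):
    """Aggregate (dtype, custom_us, baseline_us) records into the per-dtype
    summary: case count, average speedup ratio, and better-path counts."""
    by_dtype = {}
    for dtype, custom_us, baseline_us in results:
        by_dtype.setdefault(dtype, []).append(baseline_us / custom_us)
    summary = {}
    for dtype, ratios in by_dtype.items():
        summary[dtype] = {
            "cases": len(ratios),
            "avg_ratio": round(sum(ratios) / len(ratios), 3),
            "custom_better": sum(r > 1 for r in ratios),   # ratio > 1
            "baseline_better": sum(r < 1 for r in ratios),  # ratio < 1
        }
    return summary
```

The top-level Full Summary values are then just the same statistics computed over all records regardless of dtype.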
### 4. Brief Analysis

Use the second-level title `## Brief Analysis` and list ≥3 short conclusions as an unordered list, covering overall trends, differences across dtypes and shape scales, memory access and computation characteristics, etc.

```markdown
## Brief Analysis
- The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
- The custom operator's advantage is more pronounced for large shapes, as the vector path is more fully utilized.
- The custom operator is slightly behind the baseline in the float32 small-shape scenario, likely because kernel launch overhead accounts for a larger share of the time.
```
## Other Conventions

Do not write a `*_profiler_results.json` that duplicates the report; intermediate statistics exist only in script memory and are written to Markdown.

## Display Results in Conversation (Mandatory)

Once the report has been generated (or it already exists and was updated in this run), the assistant MUST also present the following in the current conversation reply; it must not merely output "Report generated" plus the path without showing data:

- Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes displayed in the same table. Truncate and note "see report for the rest" if there are many cases.
- Full summary: summary metrics in key-value form (number of test cases, average ratio, number of cases where the custom operator/baseline is better), plus the summary table by data type.
- Brief analysis: ≥3 unordered-list conclusions covering overall trends, differences across dtypes and shape scales, memory access and computation characteristics, etc.

NEVER reply with only the report path; NEVER replace showing the core numbers and conclusions in the conversation with "please open the Markdown file yourself".
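Emitting the unified comparison table from per-case records can be sketched as follows; the record tuple layout `(case, shape, dtype, custom_us, baseline_us)` is an assumption about the script's in-memory results.

```python
def comparison_table(rows):
    """Render the unified comparison table (all dtypes in one table) as
    Markdown; speedup ratio is baseline time / custom operator time."""
    lines = [
        "| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for case, shape, dtype, custom_us, baseline_us in rows:
        ratio = baseline_us / custom_us
        lines.append(
            f"| {case} | {shape} | {dtype} | {custom_us:.3f} "
            f"| {baseline_us:.3f} | {ratio:.3f} |"
        )
    return "\n".join(lines)
```

The same string is written into the Markdown report and pasted into the conversation reply, which keeps the two views consistent by construction.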
## Common Mistakes

- `warmup`/`active` changed to values other than 5, inconsistent with this skill's conventions.
- `torch_npu.profiler` not used, or `prof.step()` calls inconsistent with the schedule.
- `repeat > 1` can generate multiple `*_ascend_pt` exports; when selecting the CSV by mtime, the selection semantics must be explained.
- CSV parsing not tolerant of a BOM in the header or of renamed columns; the Total Time(us) lookup must handle both.
- Running only the baseline path because the custom operator is not registered; load the custom library before comparing.
- Designing test cases directly without reading `<op>-test-cases.md`: testcase-gen has already produced a unified test case document; extract shapes and dtypes from it first to avoid duplicate design work.
- Generating test cases without reading design.md: results in shapes that violate constraints and missing coverage of key execution modes (such as omitting the transpose=True path).
- Outputting a single-path report on the grounds of "no equivalent baseline interface": a small operator combination baseline path must be implemented, and a dual-path comparison table must always be produced.
- Using Python loops with scalar-by-scalar assignment in the small operator combination: the profiler then collects CPU logic instead of NPU operators, distorting the baseline path latency; the baseline implementation should be dominated by tensor operations.
- The assistant outputs only the report path without showing the key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").
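The BOM pitfall can be handled as sketched below: decode with `utf-8-sig` (which strips a leading BOM) and defensively normalize header names before looking up `Total Time(us)`. The `OP Type`/`Count` column names in the test data are illustrative; only the `Total Time(us)` column is assumed by this skill.

```python
import csv
import io

def total_time_us(csv_text):
    """Sum the Total Time(us) column of op_statistic.csv content, tolerating
    a BOM or stray whitespace in the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map normalized header names back to the raw names DictReader saw.
    fields = {name.strip().lstrip("\ufeff"): name for name in reader.fieldnames}
    col = fields["Total Time(us)"]
    return sum(float(row[col]) for row in reader)

# When reading from disk, prefer: open(path, encoding="utf-8-sig", newline="")
# so the BOM never reaches the parser in the first place.
```

If the column set changes across CANN versions, extend the `fields` lookup with the known aliases rather than indexing columns by position.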
## Reference Implementation (examples/layer_norm_profiler_reference/)

This directory is isomorphic to the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/`, and includes:

- `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
- `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
- `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For new operators, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, forward call, `build_inputs`, and trace subdirectory name. If `layer_norm/test/` is updated in the repository, synchronize the changes back to `examples/layer_norm_profiler_reference/`.
## Checklist (Assistant Self-Check)

- Has read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
- Has read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
- Test cases cover all execution modes described in design.md (such as transpose/non-transpose, input_split mode, etc.)
- Parameter values in test cases (including attribute values and integer tensor values) are within the constraints of design.md
- Has confirmed the baseline path type (baseline API or small operator combination) and labeled it clearly in the report header
- If there is no equivalent baseline interface, has implemented the small operator combination baseline path, and the baseline implementation is dominated by tensor operations (not Python scalar loops)
- Has used `torch_npu.profiler`, with `warmup=5` and `active=5` unmodified
- Has generated or updated `<op>_torch_npu_profiler_report.md`, with a format consistent with `examples/sample_report.md`: unified comparison table with a DType column + full summary + summary by data type + brief analysis
- Has displayed the unified comparison table with DType column, the full summary, the summary by data type, and ≥3 brief analysis conclusions in the current conversation, not just the report path
- Has explained the fixed step convention and the metric definition