Maintain JSONL-only profiler performance test cases under csrc/ops/<op>/test in ascend-kernel. Collect data using torch_npu.profiler (with fixed warmup=5 and active=5), aggregate the Total Time(us) from ASCEND_PROFILER_OUTPUT/op_statistic.csv, and output a unified Markdown comparison report (custom operator vs baseline) that includes a DType column. Do not generate perf_cases.json or *_profiler_results.json. Refer to examples/layer_norm_profiler_reference/ for the reference implementation.
# Performance Evaluation of AscendC Operators with torch_npu.profiler
## Reference Files in This Skill Directory

When executing this skill, prioritize the materials in this directory:

| File / Directory | Purpose |
| --- | --- |
| SKILL.md (this file) | Process, directory conventions, complete JSONL test case specification, report structure, fixed schedule |
| references/REFERENCE_JSON_CASE_FORMAT.md | Identical to the "JSONL Specification for Performance Test Cases" section below |
| references/REFERENCE_PROFILER_AND_METRICS.md | Details on `torch_npu.profiler`, `op_statistic.csv`, and the `*_ascend_pt` export path |
| examples/sample_perf_cases.jsonl | Minimal LayerNorm-style JSONL; can be copied and renamed |
| examples/layer_norm_profiler_reference/ | Layer Norm reference implementation (`layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`, test case JSONL, instructions); new operators can copy this directory to `csrc/ops/<op>/test/` and replace the forward logic and filenames |
## Role

In ascend-kernel, establish a reusable process for profiler performance test cases and Markdown reports comparing custom operators against baselines for `csrc/ops/<operator-name>/`. Data collection must use `torch_npu.profiler`, with `warmup` and `active` fixed at 5 (see the next section). For details, refer to `references/REFERENCE_PROFILER_AND_METRICS.md`.
Core Principles (Two Mandatory Constraints):

1. The comparison report must always be presented: regardless of whether the baseline path is a baseline API or a combination of small operators, the final report must include a dual-path comparison table of custom operator vs baseline. Single-path reports are not allowed, and the baseline must run on NPU.
2. Test case generation must first read `design.md`: before generating any JSONL test cases, read `csrc/ops/<op>/design.md` in the operator directory and extract parameter constraints, typical shapes, supported dtypes, and key attribute values. Test cases must cover all execution modes described in the design document.
## Test Case Source: Load from the testcase-gen Test Case Document (MANDATORY)

Before generating or modifying any JSONL test cases, you MUST first read the test case document generated by testcase-gen.

### Step 0: Read the testcase-gen Test Case Document

`READ csrc/ops/<op>/test/<op>-test-cases.md`
Extract the following from it:

| Extracted Item | Location in Test Case Document | Purpose |
| --- | --- | --- |
| SUPPORTED_DTYPES | §Test Configuration | dtype coverage for JSONL test cases |
| TEST_SHAPES | §Test Configuration | Benchmark for selecting small/medium/large-scale shapes |
| GENERAL_SHAPES | §Test Configuration | Generalized shapes; can be supplemented for performance scenarios |
| NPU Calling Method | §Operator Baseline | Forward call for the custom operator path |
| CPU Reference Implementation | §Operator Baseline | Reference implementation for the baseline path |
### Conversion Rules from testcase-gen Output to JSONL Test Cases

- Iterate through all dtypes in SUPPORTED_DTYPES for each shape.
- Fill the `inputs` field of each JSONL case with attribute values (such as block_size, eps, etc.) from design.md.
- The total number of JSONL test cases must be ≥ 8.
- The NPU calling method and the CPU reference implementation from the operator baseline are used to build the custom operator path and the baseline path, respectively.

If `<op>-test-cases.md` does not exist: fall back to designing test cases entirely from design.md (following the process below), but note "Test cases are self-designed, not generated by testcase-gen" in the report.
## Test Case Generation: Must First Read design.md (Mandatory)

Before generating or modifying any JSONL test cases (whether or not the testcase-gen test case document has been loaded), perform the following steps.

### Step 1: Read the Design Document

`READ csrc/ops/<op>/design.md`
Extract the following from it:

| Extracted Item | Location in design.md | Purpose |
| --- | --- | --- |
| Supported data types | §1 "Supported Data Types" | dtype coverage for test cases |
| Parameter constraints and value ranges | Constraint column in §1 "Parameter Description" | Valid range for attribute values (e.g., block_size ≤ 128) |
| Typical shapes / input scales | §2 "Computation Logic" / §3 "Tiling Strategy" | Benchmark for small/medium/large-scale test case shapes |
| Mode combinations of key attributes | §2 "Pseudocode" / §1 "Parameter Description" | Execution paths that must each be covered (e.g., do_transpose=True/False, is_input_split=True/False) and branches that affect performance (e.g., transpose vs non-transpose using different DMA paths) |
### Step 2: Test Case Design Rules

| Rule | Description |
| --- | --- |
| Cover all execution modes | When design.md describes multiple execution paths (e.g., transpose/non-transpose, input_split mode), there must be at least one test case for each mode |
| Cover all supported dtypes | At least one set of test cases per supported data type, using a typical medium-scale shape |
| One set each for small/medium/large shapes | Must cover small scale (kernel launch overhead dominant), medium scale (typical production scenario), and large scale (memory bandwidth dominant) |
| Parameter values from constraint ranges | Attribute values (such as block_size) must be selected from the constraints in design.md and must not be set arbitrarily |
| Integer/index tensor values must be semantically valid | Specific values of tensors such as win_lengths and offsets must comply with operator semantics (e.g., offsets must be valid window start offsets) |
### Step 3: Validate Test Cases

After generating test cases, check that:

- All dtypes are covered.
- All execution modes (defined by design.md) have corresponding test cases.
- Parameter values (including attribute values and integer tensor values) are within the constraints of design.md.
- There is at least one "small shape" test case and one "large shape" test case.
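The Step 3 checks can be sketched as a small self-check helper. This is only an illustration: the per-case field names used here (`dtype`, the mode flag under `mode_key`, and `numel` as a proxy for shape scale) are hypothetical and must be adapted to the operator's actual JSONL schema.

```python
def validate_cases(cases, required_dtypes, required_modes, mode_key):
    """Return a list of problems; an empty list means the cases pass the
    Step 3 checks. Field names ("dtype", mode_key, "numel") are illustrative."""
    covered_dtypes = {c["dtype"] for c in cases}
    covered_modes = {c.get(mode_key) for c in cases}
    problems = []
    missing_dtypes = set(required_dtypes) - covered_dtypes
    if missing_dtypes:
        problems.append(f"uncovered dtypes: {sorted(missing_dtypes)}")
    missing_modes = set(required_modes) - covered_modes
    if missing_modes:
        problems.append(f"uncovered modes: {sorted(missing_modes)}")
    # Approximate the small-shape/large-shape requirement by element count spread.
    sizes = [c["numel"] for c in cases]
    if len(sizes) < 2 or min(sizes) == max(sizes):
        problems.append("need at least one small-shape and one large-shape case")
    return problems
```

Run it once after generating the JSONL file and refuse to proceed to profiling while the returned list is non-empty.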
## Baseline Path Decision Tree (Mandatory)

Performance evaluation always requires a dual-path comparison (custom operator vs baseline). Determine the baseline path in the following order:

```
Does the operator have an equivalent baseline API?
├─ Yes (e.g., torch.nn.functional.*, built-in torch_npu operators)
│   └─ Use the baseline API as the baseline path
└─ No (no equivalent baseline interface)
    └─ Must implement a "small operator combination" baseline path  ← mandatory requirement of this skill
        └─ Build it from PyTorch basic operator combinations per the "Reference Implementation"
           or "Pseudocode" section of the design document
```
### Requirements for the Small Operator Combination Baseline Path

When there is no equivalent baseline interface, you must:

1. Read the reference implementation from design.md: the design document usually contains a PyTorch reference implementation (pseudocode or a Python function); use it as the basis for the baseline path.
2. Use PyTorch basic operator combinations: `torch.zeros`, slice assignment, `.permute()`, `torch.cat`, etc. Do not use loops with Python scalar assignment (otherwise the profiler collects CPU operators instead of NPU operators, making a fair comparison impossible); the entire baseline implementation must be dominated by tensor operations and executable on NPU.
3. Clearly label it in the report: the report header must state "No equivalent baseline interface; baseline path is a combination of small operators" and list the basic operators used.
4. Present the comparison table: do not degrade to a single-path report because there is "no baseline interface"; retain the three columns "Custom Operator per-step", "Baseline per-step", and "Ratio".

NEVER: output a single-path report or skip the comparison table on the grounds of "no equivalent baseline interface".
## Fixed Profiler Steps (Mandatory)

| Parameter | Value | Description |
| --- | --- | --- |
| warmup | 5 | Do not change to another value via script or CLI |
| active | 5 | Do not change to another value via script or CLI |
| wait | Default 0 | May be kept as a CLI option or a constant; adjust as needed |
| repeat | Default 1 | Fixed to 1 for simple scenarios; if `repeat > 1`, the CSV selection semantics must be explained in the document |

Call `prof.step()` at the end of each step; the total number of loop steps is `repeat * (wait + warmup + active)`.
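The step arithmetic above can be made explicit with a small helper; the surrounding loop skeleton is only a sketch (the `torch_npu.profiler` context is shown in comments because it requires an NPU environment, and `run_forward` is a placeholder for the custom operator or baseline call).

```python
def total_profiler_steps(wait=0, warmup=5, active=5, repeat=1):
    """Number of iterations the measurement loop must run so the profiler
    schedule completes every phase: repeat * (wait + warmup + active)."""
    return repeat * (wait + warmup + active)

# The measurement loop then looks like (profiler context elided, NPU-only):
#
#   with torch_npu.profiler.profile(schedule=..., on_trace_ready=...) as prof:
#       for _ in range(total_profiler_steps()):
#           run_forward()     # custom operator path or baseline path
#           prof.step()       # mandatory: advance the schedule every step
```

With the fixed schedule (wait=0, warmup=5, active=5, repeat=1), the loop runs exactly 10 steps.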
## File Placement (Unified in the Operator test/ Subdirectory)

All of the following artifacts are placed in `ascend-kernel/csrc/ops/<op>/test/`:

| Category | Naming Convention (`<op>` e.g., `layer_norm`) |
| --- | --- |
| Test cases (JSONL only) | `<op>_perf_cases.jsonl` (one JSON object per line); do not maintain or generate `<op>_perf_cases.json` |
| Markdown report | `<op>_torch_npu_profiler_report.md` (the only structured result to be saved; do not generate `<op>_torch_npu_profiler_results.json`) |
| Profiler export root directory | `test/profiler_trace/` (or override with `--trace-root`) |

Performance scripts and common modules are placed in the same `test/` directory as the files above.
## Complete JSONL Specification for Performance Test Cases

The following is the complete field and type description for test case files; use only `.jsonl` as the test case carrier.

### 1. File Format

| Format | Description |
| --- | --- |
| JSONL | One JSON object per line, each terminated by a newline; empty lines are ignored. File extension `.jsonl`. |

Do not generate or maintain `.json` array files in sync with the test cases in this process.

### 2. Top-level Structure of a Single Test Case

Each test case object must contain the key `"inputs"`, whose value is an array.

Layer Norm example (different operators replace the `build_inputs` convention, but the structure must still include the `"inputs"` key):
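A minimal sketch of one such case and of the one-object-per-line loading rule follows; the `case`, `dtype`, and `inputs` field layout shown here is illustrative, not the authoritative schema (consult `references/REFERENCE_JSON_CASE_FORMAT.md` for the exact fields).

```python
import json

# One hypothetical layer_norm perf case as a single JSONL line.
JSONL_LINE = (
    '{"case": "ln_medium_fp16", "dtype": "float16", '
    '"inputs": [{"shape": [32, 1024], "dtype": "float16"}, {"eps": 1e-5}]}'
)

def load_jsonl(text):
    """Parse JSONL content: one JSON object per line, empty lines ignored."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

cases = load_jsonl(JSONL_LINE + "\n\n")  # trailing empty lines are skipped
```

Loading with a per-line `json.loads` (rather than reading the whole file as one JSON array) is what makes the `.jsonl`-only convention enforceable.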
### 3. Full Summary

Use the second-level title `## Full Summary` with summary metrics in key-value table form:

```markdown
## Full Summary
| Metric | Value |
| ---- | -- |
| Number of Test Cases | N |
| Average Speedup Ratio (>1 means custom operator is faster) | X.XXX |
| Custom Operator Better (Ratio>1) | M |
| Baseline Better (Ratio<1) | K |
```

Immediately below, use the third-level title `### Summary by Data Type` to display the summary table grouped by dtype:

```markdown
### Summary by Data Type
| DType | Number of Test Cases | Average Speedup Ratio | Custom Operator Better | Baseline Better |
|-----|------|-------------------|-------------|--------|
| float16 | 7 | 1.364 | 6 | 1 |
| float32 | 1 | 0.846 | 0 | 1 |
```
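The per-dtype summary rows can be aggregated from per-case timings as sketched below; the `(dtype, custom_us, baseline_us)` tuple layout is an assumption about the script's in-memory records, and the speedup ratio is baseline time divided by custom operator time.

```python
def summarize(results):
    """Aggregate (dtype, custom_us, baseline_us) records into the per-dtype
    summary: case count, average speedup ratio, and better-path counts."""
    by_dtype = {}
    for dtype, custom_us, baseline_us in results:
        by_dtype.setdefault(dtype, []).append(baseline_us / custom_us)
    summary = {}
    for dtype, ratios in by_dtype.items():
        summary[dtype] = {
            "cases": len(ratios),
            "avg_ratio": round(sum(ratios) / len(ratios), 3),
            "custom_better": sum(r > 1 for r in ratios),   # ratio > 1
            "baseline_better": sum(r < 1 for r in ratios),  # ratio < 1
        }
    return summary
```

The top-level Full Summary values are then just the same statistics computed over all records regardless of dtype.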
### 4. Brief Analysis

Use the second-level title `## Brief Analysis` and list ≥3 short conclusions as an unordered list, covering overall trends, differences across dtypes and shape scales, memory access and computation characteristics, etc.

```markdown
## Brief Analysis
- The average speedup ratio is greater than 1, so the custom operator has a slight overall advantage.
- The custom operator's advantage is more pronounced for large shapes, as the vector path is more fully utilized.
- The custom operator is slightly behind the baseline in the float32 small-shape scenario, likely because kernel launch overhead accounts for a larger share of the time.
```
## Other Conventions

Do not write a `*_profiler_results.json` that duplicates the report; intermediate statistics exist only in script memory and are written to Markdown.

## Display Results in Conversation (Mandatory)

Once the report has been generated (or it already exists and was updated in this run), the assistant MUST also present the following in the current conversation reply; it must not merely output "Report generated" plus the path without showing data:

- Unified comparison table: header `Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio`, with all dtypes displayed in the same table. Truncate and note "see report for the rest" if there are many cases.
- Full summary: summary metrics in key-value form (number of test cases, average ratio, number of cases where the custom operator/baseline is better), plus the summary table by data type.
- Brief analysis: ≥3 unordered-list conclusions covering overall trends, differences across dtypes and shape scales, memory access and computation characteristics, etc.

NEVER reply with only the report path; NEVER replace showing the core numbers and conclusions in the conversation with "please open the Markdown file yourself".
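Emitting the unified comparison table from per-case records can be sketched as follows; the record tuple layout `(case, shape, dtype, custom_us, baseline_us)` is an assumption about the script's in-memory results.

```python
def comparison_table(rows):
    """Render the unified comparison table (all dtypes in one table) as
    Markdown; speedup ratio is baseline time / custom operator time."""
    lines = [
        "| Case | Shape | DType | Custom Operator(us) | Baseline(us) | Speedup Ratio |",
        "| --- | --- | --- | --- | --- | --- |",
    ]
    for case, shape, dtype, custom_us, baseline_us in rows:
        ratio = baseline_us / custom_us
        lines.append(
            f"| {case} | {shape} | {dtype} | {custom_us:.3f} "
            f"| {baseline_us:.3f} | {ratio:.3f} |"
        )
    return "\n".join(lines)
```

The same string is written into the Markdown report and pasted into the conversation reply, which keeps the two views consistent by construction.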
## Common Mistakes

- `warmup`/`active` changed to values other than 5, inconsistent with this skill's conventions.
- `torch_npu.profiler` not used, or `prof.step()` calls inconsistent with the schedule.
- `repeat > 1` can generate multiple `*_ascend_pt` exports; when selecting the CSV by mtime, the selection semantics must be explained.
- CSV parsing not tolerant of a BOM in the header or of renamed columns; the Total Time(us) lookup must handle both.
- Running only the baseline path because the custom operator is not registered; load the custom library before comparing.
- Designing test cases directly without reading `<op>-test-cases.md`: testcase-gen has already produced a unified test case document; extract shapes and dtypes from it first to avoid duplicate design work.
- Generating test cases without reading design.md: results in shapes that violate constraints and missing coverage of key execution modes (such as omitting the transpose=True path).
- Outputting a single-path report on the grounds of "no equivalent baseline interface": a small operator combination baseline path must be implemented, and a dual-path comparison table must always be produced.
- Using Python loops with scalar-by-scalar assignment in the small operator combination: the profiler then collects CPU logic instead of NPU operators, distorting the baseline path latency; the baseline implementation should be dominated by tensor operations.
- The assistant outputs only the report path without showing the key tables and summary conclusions in the current conversation (violates "Display Results in Conversation").
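The BOM pitfall can be handled as sketched below: decode with `utf-8-sig` (which strips a leading BOM) and defensively normalize header names before looking up `Total Time(us)`. The `OP Type`/`Count` column names in the test data are illustrative; only the `Total Time(us)` column is assumed by this skill.

```python
import csv
import io

def total_time_us(csv_text):
    """Sum the Total Time(us) column of op_statistic.csv content, tolerating
    a BOM or stray whitespace in the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map normalized header names back to the raw names DictReader saw.
    fields = {name.strip().lstrip("\ufeff"): name for name in reader.fieldnames}
    col = fields["Total Time(us)"]
    return sum(float(row[col]) for row in reader)

# When reading from disk, prefer: open(path, encoding="utf-8-sig", newline="")
# so the BOM never reaches the parser in the first place.
```

If the column set changes across CANN versions, extend the `fields` lookup with the known aliases rather than indexing columns by position.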
## Reference Implementation (examples/layer_norm_profiler_reference/)

This directory is isomorphic to the profiler-related files in `ascend-kernel/csrc/ops/layer_norm/test/`, and includes:

- `layer_norm_profiler_common.py`, `benchmark_layer_norm_torch_npu_profiler.py`
- `layer_norm_perf_cases.jsonl` (JSONL only, no `.json`)
- `LAYER_NORM_PROFILER_PERF_GUIDE.md`, `README.md`

For new operators, copy this directory in its entirety to `csrc/ops/<op>/test/`, then replace the operator name, forward call, `build_inputs`, and trace subdirectory name. If `layer_norm/test/` is updated in the repository, synchronize the changes back to `examples/layer_norm_profiler_reference/`.
## Checklist (Assistant Self-Check)

- Has read `csrc/ops/<op>/test/<op>-test-cases.md` (if it exists) and extracted SUPPORTED_DTYPES, TEST_SHAPES, GENERAL_SHAPES, and the operator baseline from it
- Has read `csrc/ops/<op>/design.md` and extracted dtypes, parameter constraints, typical shapes, and execution modes from it
- Test cases cover all execution modes described in design.md (such as transpose/non-transpose, input_split mode, etc.)
- Parameter values in test cases (including attribute values and integer tensor values) are within the constraints of design.md
- Has confirmed the baseline path type (baseline API or small operator combination) and labeled it clearly in the report header
- If there is no equivalent baseline interface, has implemented the small operator combination baseline path, and the baseline implementation is dominated by tensor operations (not Python scalar loops)
- Has used `torch_npu.profiler`, with `warmup=5` and `active=5` unmodified
- Has generated or updated `<op>_torch_npu_profiler_report.md`, with a format consistent with `examples/sample_report.md`: unified comparison table with a DType column + full summary + summary by data type + brief analysis
- Has displayed the unified comparison table with DType column, the full summary, the summary by data type, and ≥3 brief analysis conclusions in the current conversation, not just the report path
- Has explained the fixed step convention and the metric definition