triton-ascend-migration

Migrate GPU/CUDA Triton operators to Triton-Ascend, or rewrite Python/PyTorch operators into Triton-Ascend implementations that can run on Ascend NPU. When clear optimization opportunities are identified, directly output the optimized code, minimal validation script, and troubleshooting instructions. This skill should be prioritized when users mention 昇腾 (Ascend), Ascend, NPU, triton-ascend, Triton operator migration, PyTorch operator rewriting, coreDim, UB overflow, 1D grid, physical core binding, block_ptr, stride, memory access alignment, mask performance, dtype degradation, operator optimization, or directly ask questions like "How to use this skill", "How to run it in the command line", "How to perform migration/validation in a container", even if users do not explicitly say "write a skill" or "perform migration".


NPX Install

npx skill4agent add ascend-ai-coding/awesome-ascend-skills triton-ascend-migration

SKILL.md Content (translated from Chinese)


Triton-Ascend Migration

Quick Start

When handling migration requests, follow the sequence below:
  1. First identify the input method:
    • File path / specified code snippet
    • User directly pastes code
  2. Then identify the input source:
    • GPU/CUDA Triton kernel
    • Python/PyTorch operator implementation
  3. Then identify the operator type:
    • elementwise
    • broadcast / mask
    • reduce
    • Contains `tl.dot`
  4. First create a minimally runnable version:
    • cuda -> npu
    • Add the `torch_npu` import
    • Remove GPU-specific device logic
    • Prioritize 1D grid
    • For simple tutorial examples, default to the "minimal diff migration version"
  5. Perform Ascend-side optimization after the code runs successfully:
    • Physical core binding
    • `BLOCK_SIZE`/`XBLOCK`
    • `BLOCK_SIZE_SUB`/`XBLOCK_SUB`
    • Continuous/aligned memory access
    • Troubleshooting for `coreDim` / UB / dtype / mask
  6. If clear optimization opportunities exist, directly output the optimized implementation instead of just providing suggestions.

How to Use This Skill

If users ask "How to use this skill", do not immediately dive into lengthy migration analysis; first provide a concise usage guide in 3 to 6 lines, then proceed based on the user's input.
Only retain these points in the concise guide:
  • Users can provide Triton/CUDA code, a PyTorch reference implementation, a file path, or error/performance logs.
  • Users are advised to specify the runtime environment: local command line, existing container, CI, or code generation only without execution.
  • Users can also indicate preferences: minimal diff migration, documentation style, get it running first then optimize, or directly provide the optimized version.
  • Corresponding outputs will be provided based on the scenario: a Triton-Ascend implementation, a minimal validation script, execution commands, and optimization instructions.
If users follow up with questions like "How to ask specifically", "What command should I write", or "How to run in a container", refer to `references/usage.md` and provide local commands, container commands, and example questions as needed; do not include the entire long guide in regular responses.
Copy this checklist to track progress:
```text
Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimizations directly
- [ ] Generate and save a minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization instructions
```

Input Identification

First answer these three questions:
  1. Is the user providing a file path or directly pasting code?
  2. Is it a complete script, partial snippet, or single kernel?
  3. Is it GPU Triton migration or Python/PyTorch semantic rewriting?
Details about input methods, default handling when information is missing, and priority when file paths conflict with pasted code can be found in `references/input-modes.md`.

Scenario A: GPU Triton -> Triton-Ascend

Prioritize checking:
  • Whether `device='cuda'` exists
  • Whether there is GPU-specific device acquisition or assertion logic
  • Whether GPU-style multi-dimensional free grid is retained
  • Whether `tl.dot` is used
  • Whether complex `shape/stride/block_ptr/order` exists

Scenario B: Python/PyTorch -> Triton-Ascend

First extract semantics, then write Triton code:
  • Input-output tensor relationship
  • Indexing and broadcasting method
  • Mask / reduce logic
  • dtype and precision requirements
  • Whether the original PyTorch implementation already has naturally continuous memory access
If the original operator is only a reference implementation, first write a semantically equivalent Triton-Ascend version, then proceed with optimization.
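As an illustration of "extract semantics first", here is a hedged sketch using a hypothetical masked elementwise operator, with the extracted semantics recorded as comments. Plain Python stands in for the PyTorch reference; none of the names come from a real codebase:

```python
# Semantics to extract before writing the Triton-Ascend kernel:
#   - inputs: x, y, and boolean mask m, all the same length (pure elementwise)
#   - indexing/broadcast: none; position i maps to position i
#   - mask logic: out[i] = x[i] + y[i] where m[i] is True, else 0.0
#   - dtype/precision: follows the inputs; no reduction involved
def masked_add_reference(x, y, m):
    """Plain-Python stand-in for a PyTorch reference implementation."""
    return [(a + b) if keep else 0.0 for a, b, keep in zip(x, y, m)]
```

Once this list is written down, the first Triton-Ascend version only needs to reproduce these points; layout and performance concerns come later.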

Migration Process

1. Collect Minimal Necessary Information

Prioritize collecting this information; supplement what is missing:
  • Input code or minimal reproduction case
  • Input method: file path / specified code snippet / user directly pastes code
  • shape, dtype, stride
  • Whether there is mask, broadcast, reduce
  • Current error or performance issue
  • Whether exact precision consistency is required
  • Runtime environment: local command line, inside container, CI, or code generation only without execution
If information is incomplete, supplement in this order:
  1. First infer from existing code
  2. Then use minimal reasonable assumptions to complete the validation script
  3. Finally ask users for necessary information
If the "execution location" information is missing, infer in this order:
  1. First check whether the user provided a container name, `docker exec` usage, a container path, or image information
  2. Then check if the user provided local file path, current directory, or terminal command
  3. If still undetermined, ask: "Would you like me to write the validation steps based on local command line or container environment?"

2. First Perform Minimal Migration or Semantic Rewriting

Default to pursuing "semantic alignment and successful execution":
  • GPU Triton: First change `cuda` to `npu`
  • Import `torch_npu`
  • Remove GPU-specific device logic
  • For documentation/tutorial-style simple examples, try to keep the original kernel name, wrapper name, `BLOCK_SIZE`, grid style, and main code structure unchanged
  • Do not actively add `contiguous()`, additional assertions, function renaming, or engineering packaging in the first version, unless the user explicitly requests an "enhanced/production version" or these changes are necessary to fix deterministic issues on NPU
  • Python/PyTorch: First rewrite into the most straightforward Triton kernel according to the original computation semantics
Do not over-rewrite in the first step.
If users explicitly mention these keywords:
  • Official documentation style
  • Strict minimal migration
  • Minimal diff
  • No engineering enhanced version
  • Only refer to official migration examples
Then this "minimal migration mode" should override the subsequent generalized optimization requirements:
  • Only make necessary code modifications
  • The optimization instructions can be 1 to 3 lines, clearly stating "No in-depth optimization is performed for this task"
  • Do not forcefully expand content like `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`, or physical core binding just to complete the template
  • Do not deviate the response style from "documentation diff" to "engineering optimization overview"
  • The validation script should also remain "minimally runnable"; do not write it as an engineering test framework by default
Details about documentation-style minimal migration, single-file example organization, and validation-script naming and saving rules can be found in `references/output-and-validation.md`.
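As a sketch of how small the first version should stay, here is a hypothetical minimal-diff migration of a tutorial vector-add wrapper; the kernel body, names, `BLOCK_SIZE`, and grid computation are left untouched, and only the marked lines change:

```diff
 import torch
+import torch_npu  # registers the 'npu' device with PyTorch
 import triton
 import triton.language as tl

-x = torch.rand(n_elements, device='cuda')
-y = torch.rand(n_elements, device='cuda')
+x = torch.rand(n_elements, device='npu')
+y = torch.rand(n_elements, device='npu')
```

Everything else in the file stays exactly as in the GPU original; optimization, if any, comes as a separate second pass.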

3. Rewrite Parallelism Model

Ascend-side should follow these rules first:
  • Prioritize 1D grid
  • Switch from GPU logical grid thinking to Ascend physical core binding thinking
  • Vector-only operators should be designed with the Vector Core path in mind first
  • Operators containing `tl.dot` should be designed with the AI Core path in mind first
Further judge based on this set of "general convergence rules"; do not mechanically retain all implementation branches from the GPU:
  • If the original implementation has multiple kernels, `autotune`, environment-variable branches, or automatic dispatch across different data paths, first distinguish which are "semantically necessary" and which are only "performance strategies on GPU"
  • For performance branches that are obviously no longer necessary on Ascend, converge to a single kernel or fewer paths; focus on retaining semantics rather than all historical branches
  • If an operator is essentially Vector-only, but the original implementation uses complex `block_ptr`, a 2D/3D grid, additional tiling, or multi-version kernels, prioritize evaluating whether it can be rewritten as a more straightforward 1D-grid, fixed-configuration, single-path implementation
  • If an operator contains `tl.dot`, do not just think about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are only logical chunk / token / tile dimensions, and whether they are better moved into the kernel's internal loop to reduce scheduling dimensions
  • Do not mechanically classify based on "tl.dot appears in the source code"; if `tl.dot` is only used to implement intermediate techniques like prefix-sum, local scan, or triangular-mask aggregation, first judge from the operator's main semantics whether it is closer to a Vector-only reduction/scan or should indeed follow the AI Core path
  • If the operator naturally has structures like chunk, tile, window, prefix-sum, local reduction, do not just follow the original block pointer logic; also evaluate whether "rearrange layout first, then perform vectorized computation" is more suitable for Ascend
  • If an auxiliary tensor (such as a gate, mask, bias, index, or state-gate) is not contiguous in the current access direction, first perform a lightweight `transpose`/`contiguous` or equivalent layout rearrangement on the wrapper side, then access it with a simpler linear pointer or a more regular `block_ptr` inside the kernel
  • If the main loop order is rearranged, such as changing from "K first, then T" to "T first, then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical block tensors at the same time; do not just change the scheduling order while continuing to use the old view and patching it up with `trans` or additional indexing
  • If common capabilities like `get_vectorcore_num()`, device-attribute tools, or common layout helpers already exist in the current project, prioritize reusing the project helpers instead of writing inline replacement versions by default
  • However, if the current output target is an "independent runnable script" or "minimal validation script", continue to check whether these helpers rely on additional initialization; if they rely on project initialization steps, either add the initialization or clearly state the preconditions in the result
  • When you decide to "delete branches / converge implementation", explain the reason in the result: whether the branch only serves GPU autotune, only serves shared memory selection, or has no clear benefit on Ascend
  • If the runtime log of the migrated Triton-Ascend code shows warnings like `Please DO NOT tune args ['num_warps']` or `['num_stages']`, first check whether GPU-style launch/tuning parameters are still mechanically retained; for minimally runnable implementations on Ascend, do not retain these parameters by default unless you can point to a clear compilation requirement or a measured benefit
  • Do not use only one set of generic shapes in the validation script; the test set should be derived from the operator's features, covering at least one non-divisible block case, one case most likely to trigger branch differences, and one case closer to the real working set
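The shape-coverage rule above can be made mechanical. A small sketch for a 1D elementwise kernel, where the block size and working-set size are purely illustrative:

```python
def derive_test_shapes(block_size, realistic_n):
    """Minimal shape set: one non-divisible case for the tail mask,
    one exact multiple for the other mask branch, and one size
    closer to the real working set."""
    return [block_size + 1, block_size, realistic_n]

shapes = derive_test_shapes(1024, 1_000_000)
```

For operators with richer structure (chunked, windowed, reduced), extend the set along the same lines: one shape per branch the kernel can actually take.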
If the user provides a 2D/3D grid, prioritize evaluating whether it can be folded into a 1D grid with the indices restored inside the kernel. Details about `coreDim`, UB, `shape/stride/block_ptr/order`, `care_padding=False`, `TRITON_ALL_BLOCKS_PARALLEL`, and `multibuffer` can be found in `references/reference.md`.
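Folding a 2D grid into 1D is just linearization on the wrapper side plus a div/mod on the kernel side. A plain-Python sketch of the index arithmetic (names are illustrative; inside a real kernel `pid` would come from `tl.program_id(0)`):

```python
def folded_grid(num_blocks_m, num_blocks_n):
    """Wrapper side: an (M, N) logical grid launches as one 1D grid."""
    return num_blocks_m * num_blocks_n

def restore_index(pid, num_blocks_n):
    """Kernel side: recover (pid_m, pid_n) from the 1D program id."""
    return divmod(pid, num_blocks_n)

# Every logical (pid_m, pid_n) pair is visited exactly once.
visited = {restore_index(pid, 4) for pid in range(folded_grid(3, 4))}
```

The same divmod chain extends to 3D grids: peel off the innermost dimension first, then repeat on the quotient.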

Optimization and Troubleshooting

Default Rules for Direct Optimization

Directly provide the optimized implementation if any of the following conditions are met:
  • `coreDim` is obviously exceeded
  • UB usage is obviously too large
  • Memory access is discrete and can be reconstructed into continuous access
  • Mask load/store has a more optimal writing method
  • dtype obviously causes vector operations to degrade to scalar operations
If these conditions are not met, especially for simple examples like vector addition, do not output an enhanced wrapped version by default just to "look more complete". First provide the minimal migration version, then put the enhanced items in "Optional Optimization".
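When `coreDim` is exceeded, a common fix is to cap the launch grid at the physical core count and let each program loop over several blocks. A plain-Python simulation of that strided block-to-core assignment; the core count here is illustrative, and a real wrapper would query the device for it:

```python
def plan_core_binding(n_elements, block_size, num_cores, core_dim_limit=65535):
    """Cap the grid at the physical core count; each program then
    handles blocks pid, pid + grid, pid + 2*grid, ... in a loop."""
    total_blocks = -(-n_elements // block_size)   # ceil division
    grid = min(total_blocks, num_cores)
    assert grid <= core_dim_limit                 # coreDim stays in range
    assignment = {pid: list(range(pid, total_blocks, grid)) for pid in range(grid)}
    return grid, assignment

grid, assignment = plan_core_binding(100_000, block_size=256, num_cores=40)
```

The strided assignment covers every block exactly once regardless of whether the block count divides evenly by the core count.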

Optimization Priority

  1. Adjust grid and number of cores
  2. Adjust main block size
  3. Introduce or reconstruct sub-block loops
  4. Correct `shape/stride/block_ptr/order`
  5. Evaluate `care_padding=False`
  6. Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
  7. Evaluate `multibuffer` and related compilation optimization items
  8. Adjust dtype path without breaking semantics
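Steps 2 and 3 above are usually bounded by UB capacity: every buffer that is live at the same time must fit on chip. A rough sizing helper, assuming for illustration a 192 KB UB budget and 32-byte alignment; check the actual figures for your target SoC:

```python
def max_block_size(ub_bytes, dtype_bytes, live_buffers, align_bytes=32):
    """Largest per-buffer block (in elements) such that all live
    buffers fit in UB, rounded down to an aligned element count."""
    elems = ub_bytes // (dtype_bytes * live_buffers)
    align_elems = max(align_bytes // dtype_bytes, 1)
    return (elems // align_elems) * align_elems

# vector add in fp32: x block, y block, and out block live at once
bs_fp32 = max_block_size(192 * 1024, dtype_bytes=4, live_buffers=3)
bs_fp16 = max_block_size(192 * 1024, dtype_bytes=2, live_buffers=3)
```

If the result is smaller than the desired main block, that is the signal to introduce a sub-block loop (step 3) rather than shrink the whole block.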

Key Points to Cover

The output must cover these points:
  • cuda -> npu
  • torch_npu
  • 1D grid
  • Physical core binding
  • Distinction between Vector-only operators and operators containing `tl.dot`
  • coreDim <= 65535
  • UB limit
  • Continuous / aligned memory access
  • Re-review of
    shape/stride/block_ptr/order
  • TRITON_ALL_BLOCKS_PARALLEL
  • multibuffer
  • care_padding=False
  • Scalar degradation caused by dtype

Fixed Output Template

Always output in this structure:
```markdown
## Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:

## Triton-Ascend Implementation
- Provide the final kernel and calling wrapper code
- For basic migration scenarios, first provide the "minimal diff migration version"
- Only provide an "engineering enhanced/optimized version" additionally when users request it, or when there are clear optimization opportunities
- If clear optimization opportunities exist, directly provide the optimized version
- Explain the save path and naming of the generated file

## Validation Script
- Provide a minimally executable validation script
- Compare with the PyTorch reference
- Include at least `allclose` or a maximum-error output
- Explain the save path of the validation script
- Clearly state whether it has actually been executed, along with execution commands and results

## Optimization Instructions
- Explain the reasons for adjusting grid / number of cores / block / sub-block
- Explain whether `coreDim`, UB, memory access, dtype, and mask performance issues are handled
- Explain whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False` are used

If the current task is "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration is performed first
- Briefly state that optimization items like `coreDim` / UB / `multibuffer` are not expanded in this task
- Do not expand into lengthy optimization analysis just to fit the template

## Risks and Limitations
- List unvalidated boundary conditions
- List information that needs to be supplemented by users
- If the script fails to run, clearly state which step it is stuck on
```
If the user's question is "How to use this skill", add a minimalist "Usage" section before the official template, limited to 3 to 6 lines, explaining:
  • What input the user should provide
  • Whether to handle it according to local or container scenario
  • What outputs will be generated next
Then proceed to the normal migration output.
If users ask about command lines, containers, directory switching, or validation command templates, refer to `references/usage.md`; do not include these details in every migration response by default.
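The core of the minimal validation script is just an elementwise comparison against the reference. A dependency-free sketch of the `allclose`-plus-max-error pattern (in a real script this would be `torch.allclose` and `(out - ref).abs().max()` on NPU tensors):

```python
def max_abs_error(out, ref):
    """Maximum elementwise absolute error between output and reference."""
    return max((abs(a - b) for a, b in zip(out, ref)), default=0.0)

def allclose(out, ref, rtol=1e-5, atol=1e-8):
    """Same tolerance shape as torch.allclose: |a - b| <= atol + rtol * |b|."""
    return len(out) == len(ref) and all(
        abs(a - b) <= atol + rtol * abs(b) for a, b in zip(out, ref)
    )
```

Reporting both the pass/fail flag and the maximum error makes precision regressions visible even when the tolerance check still passes.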

Additional Resources

For detailed rules, refer to:
  • Usage, Local Commands and Container Scenarios
  • Input Methods and Context Completion
  • Output, Naming and Minimal Validation Script
  • Migration and Optimization Reference
  • Typical Examples and Output Samples
  • Manual Review Test Checklist