# triton-ascend-migration
Migrate GPU/CUDA Triton operators to Triton-Ascend, or rewrite Python/PyTorch operators into Triton-Ascend implementations that can run on Ascend NPU. When clear optimization opportunities are identified, directly output the optimized code, minimal validation script, and troubleshooting instructions. This skill should be prioritized when users mention 昇腾 (Ascend), Ascend, NPU, triton-ascend, Triton operator migration, PyTorch operator rewriting, coreDim, UB overflow, 1D grid, physical core binding, block_ptr, stride, memory access alignment, mask performance, dtype degradation, operator optimization, or directly ask questions like "How to use this skill", "How to run it in the command line", "How to perform migration/validation in a container", even if users do not explicitly say "write a skill" or "perform migration".
## NPX Install

```
npx skill4agent add ascend-ai-coding/awesome-ascend-skills triton-ascend-migration
```

# Triton-Ascend Migration

## Quick Start
- First identify the input method:
  - file path / specified code snippet
  - user directly pastes code
- Then identify the input source:
  - GPU/CUDA Triton kernel
  - Python/PyTorch operator implementation
- Then identify the operator type:
  - elementwise
  - broadcast / mask
  - reduce
  - contains `tl.dot`
- First create a minimally runnable version:
  - change `cuda` -> `npu`
  - add `import torch_npu`
  - remove GPU-specific device logic
  - prioritize a 1D grid
  - for simple tutorial examples, default to the "minimal diff migration version"
- Perform Ascend-side optimization after the code runs successfully:
  - physical core binding
  - `BLOCK_SIZE`/`XBLOCK` and `BLOCK_SIZE_SUB`/`XBLOCK_SUB`
  - continuous/aligned memory access
  - troubleshooting for `coreDim` / UB / dtype / mask
- If clear optimization opportunities exist, directly output the optimized implementation instead of just providing suggestions.
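As a sketch of the "minimal diff migration version", a tutorial-style vector-add wrapper typically needs only the changes below (names and the surrounding kernel are illustrative, not taken from a specific project; the kernel body itself stays unchanged):

```diff
+import torch_npu  # registers the NPU backend for torch
 import torch
 import triton
 import triton.language as tl

 def add(x: torch.Tensor, y: torch.Tensor):
-    output = torch.empty_like(x, device='cuda')
+    output = torch.empty_like(x, device='npu')
     n_elements = output.numel()
     # keep the original 1D grid and BLOCK_SIZE unchanged
     grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
     add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
     return output
```

No renaming, extra assertions, or `contiguous()` calls are introduced at this stage; those belong to the optimization pass, if at all.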
## How to Use This Skill
- Users can provide Triton/CUDA code, a PyTorch reference implementation, a file path, or error/performance logs.
- Users are advised to specify the runtime environment: local command line, existing container, CI, or code generation only without execution.
- Users can also indicate preferences: minimal diff migration, documentation style, get it running first then optimize, or directly provide the optimized version.
- Corresponding outputs are provided per scenario: Triton-Ascend implementation, minimal validation script, execution command, and optimization instructions.
See `references/usage.md`.

## Migration Progress
- [ ] Identify input source and operator type
- [ ] First perform minimal migration or semantic rewriting
- [ ] Adjust to Ascend-friendly parallelism and grid
- [ ] Redesign block / tiling
- [ ] Review stride / block_ptr / alignment
- [ ] Handle coreDim / UB / scalar degradation
- [ ] Implement feasible optimization directly
- [ ] Generate and save minimal NPU validation script
- [ ] Actually execute the validation script
- [ ] Output results and optimization instructions

## Input Identification
- Is the user providing a file path or directly pasting code?
- Is it a complete script, partial snippet, or single kernel?
- Is it GPU Triton migration or Python/PyTorch semantic rewriting?
See `references/input-modes.md`.

### Scenario A: GPU Triton -> Triton-Ascend
- Whether `device='cuda'` exists
- Whether there is GPU-specific device acquisition or assertion logic
- Whether a GPU-style multi-dimensional free grid is retained
- Whether `tl.dot` is used
- Whether complex `shape/stride/block_ptr/order` usage exists
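The Scenario A checklist can be pre-screened with simple pattern matching; a rough sketch (the regexes are heuristics, not a parser, and the check names are invented for this example):

```python
import re

# Heuristic patterns for the Scenario A checklist.
CHECKS = {
    "cuda_device": re.compile(r"device\s*=\s*['\"]cuda['\"]|\.cuda\(\)"),
    "uses_tl_dot": re.compile(r"\btl\.dot\s*\("),
    "block_ptr": re.compile(r"\btl\.make_block_ptr\s*\("),
    "multi_dim_grid": re.compile(r"grid\s*=\s*\([^)]*,[^)]+\)"),
}

def scan_kernel_source(src: str) -> dict:
    """Report which migration-relevant patterns appear in the source."""
    return {name: bool(pat.search(src)) for name, pat in CHECKS.items()}
```

A hit only flags a line for manual review; for example, `tl.dot` appearing in the source does not by itself decide the AI Core path (see the parallelism-model section).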
### Scenario B: Python/PyTorch -> Triton-Ascend
- Input-output tensor relationship
- Indexing and broadcasting method
- Mask / reduce logic
- dtype and precision requirements
- Whether the original PyTorch implementation already has naturally continuous memory access
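The "naturally continuous memory access" question can be answered directly from shape and stride; a minimal sketch of the check, assuming element-unit strides in row-major order:

```python
def is_contiguous(shape, strides):
    """True if the strides describe a dense row-major layout:
    the innermost dimension has stride 1 and each outer stride
    equals the product of all inner dimension sizes."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True
```

A transposed view such as shape `(4, 8)` with strides `(1, 4)` fails this check, which is exactly the case where a wrapper-side `contiguous()` or layout rearrangement is worth evaluating.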
## Migration Process

### 1. Collect Minimal Necessary Information
- Input code or minimal reproduction case
- Input method: file path / specified code snippet / user directly pastes code
- shape, dtype, stride
- Whether there is mask, broadcast, reduce
- Current error or performance issue
- Whether exact precision consistency is required
- Runtime environment: local command line, inside container, CI, or code generation only without execution
- First infer from existing code
- Then use minimal reasonable assumptions to complete the validation script
- Finally ask users for necessary information
- First check whether the user provided a container name, a `docker exec` command, a container path, or image information
- Then check whether the user provided a local file path, the current directory, or a terminal command
- If still undetermined, ask: "Would you like me to write the validation steps based on local command line or container environment?"
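One way to make the local-vs-container decision mechanical; the script name and helper are placeholders invented for this sketch, not a fixed convention:

```python
def build_validation_command(script="validate_npu.py", container=None):
    """Return the command line for running the validation script.

    container: name of an existing container to run inside via
    `docker exec`; None means run on the local command line.
    """
    if container is None:
        return f"python {script}"
    return f"docker exec {container} python {script}"
```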
### 2. First Perform Minimal Migration or Semantic Rewriting
- GPU Triton: first change `cuda` to `npu`
- Add `import torch_npu`
- Remove GPU-specific device logic
- For documentation/tutorial-style simple examples, keep the original kernel name, wrapper name, `BLOCK_SIZE`, grid writing style, and main code structure unchanged
- Do not add `contiguous()`, extra assertions, function renaming, or engineering packaging in the first version, unless the user explicitly requests an "enhanced/production version" or these changes are necessary to fix deterministic issues on NPU
- Python/PyTorch: first rewrite into the most straightforward Triton kernel that follows the original computation semantics
- Official documentation style
- Strict minimal migration
- Minimal diff
- No engineering enhanced version
- Only refer to official migration examples
- Only make necessary code modifications
- The optimization instructions can be 1 to 3 lines, clearly stating "No in-depth optimization is performed for this task"
- Do not forcibly expand content like `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`, or physical core binding just to fill out the template
- Do not let the response style drift from "documentation diff" to "engineering optimization overview"
- The validation script should also remain "minimally runnable", do not default to writing it as an engineering test framework
See `references/output-and-validation.md`.

### 3. Rewrite Parallelism Model
- Prioritize 1D grid
- Switch from GPU logical grid thinking to Ascend physical core binding thinking
- Vector-only operators should be designed with the Vector Core path in mind first
- Operators containing `tl.dot` should be designed with the AI Core path in mind first
- If the original implementation has multiple kernels, `autotune`, environment-variable branches, or automatic dispatch across data paths, first distinguish which are "semantically necessary" and which are just "performance strategies on GPU"
- For performance branches that are obviously no longer necessary on Ascend, converge to a single kernel or fewer paths; focus on retaining semantics rather than all historical branches
- If an operator is essentially Vector-only, but the original implementation uses complex `block_ptr`, 2D/3D grids, extra tiling, or multi-version kernels, first evaluate whether it can be rewritten as a more straightforward 1D-grid, fixed-configuration, single-path implementation
- If an operator contains `tl.dot`, do not just think about "compressing the multi-dimensional grid into 1D"; first judge which grid dimensions are only logical chunk/token/tile dimensions and whether they are better moved into the kernel's internal loop to reduce scheduling dimensions
- Do not classify mechanically based on "`tl.dot` appears in the source code"; if `tl.dot` is only used for intermediate techniques such as prefix-sum, local scan, or triangular-mask aggregation, first judge from the operator's main semantics whether it is really a Vector-only reduction/scan, or whether it should indeed follow the AI Core path
- If the operator naturally has structures such as chunk, tile, window, prefix-sum, or local reduction, do not just follow the original block-pointer logic; also evaluate whether "rearrange the layout first, then perform vectorized computation" suits Ascend better
- If an auxiliary tensor (such as a gate, mask, bias, index, or state-gate) is not contiguous in the current access direction, first perform a lightweight `transpose`/`contiguous` or equivalent layout rearrangement on the wrapper side, then access it inside the kernel with a simpler linear pointer or a more regular `block_ptr`
- If the main loop order is rearranged, such as changing from "K first then T" to "T first then K", re-review the `shape/stride/block_ptr/order` of state tensors, cache tensors, and historical block tensors at the same time; do not just change the scheduling order while keeping the old view and patching it with `trans` or extra indexing
- If common capabilities such as `get_vectorcore_num()`, device-attribute tools, or layout helpers already exist in the current project, prefer reusing the project helpers instead of writing inline replacements by default
- However, if the current output target is a standalone runnable script or a minimal validation script, check whether these helpers rely on extra initialization; if they rely on project initialization steps, either add the initialization or clearly state the preconditions in the result
- When you decide to "delete branches / converge implementation", explain the reason in the result: whether the branch only serves GPU autotune, only serves shared memory selection, or has no clear benefit on Ascend
- If the runtime log of the migrated Triton-Ascend kernel shows warnings such as `Please DO NOT tune args ['num_warps']` (or the equivalent for `['num_stages']`), first check whether GPU-style launch/tuning parameters were mechanically retained; for a minimally runnable Ascend implementation, do not keep these parameters by default unless you can state a clear compilation requirement or a measured benefit
- Do not use only one set of generic shapes in the validation script; derive the test set from the operator's features, covering at least one non-divisible block case, one case most likely to trigger branch differences, and one case close to the real working set
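A minimal sketch of the "1D grid bound to physical cores" pattern: launch at most one program per physical core and loop over the remaining blocks inside the kernel, grid-stride style. `num_cores` stands in for whatever the project's `get_vectorcore_num()`-like helper returns; the arithmetic is plain Python so it can be checked directly:

```python
def plan_1d_grid(n_elements, block_size, num_cores):
    """Bind work to physical cores: grid size is min(#blocks, #cores),
    and each program id processes its blocks in a grid-stride loop."""
    num_blocks = -(-n_elements // block_size)  # ceiling division
    grid = min(num_blocks, num_cores)
    # Which block indices each program id would handle.
    assignment = {pid: list(range(pid, num_blocks, grid))
                  for pid in range(grid)}
    return grid, assignment
```

Inside the kernel this corresponds to `for block_id in range(tl.program_id(0), num_blocks, tl.num_programs(0))` rather than launching one program per block.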
Related items: `coreDim`, `shape/stride/block_ptr/order`, `care_padding=False`, `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`. See `references/reference.md`.

## Optimization and Troubleshooting
### Default Rules for Direct Optimization

Optimize directly when any of the following holds:

- `coreDim` is obviously exceeded
- UB usage is obviously too large
- Memory access is discrete and can be reconstructed into continuous access
- Mask load/store has a clearly better formulation
- dtype obviously causes vector operations to degrade to scalar operations
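For the "`coreDim` is obviously exceeded" rule, the standard remedy is to cap the grid at the hardware limit and fold the excess blocks into an in-kernel loop. A sketch of the capping arithmetic, using the `coreDim <= 65535` limit listed under Key Points:

```python
CORE_DIM_LIMIT = 65535  # maximum 1D grid size accepted on Ascend

def cap_grid(num_blocks, limit=CORE_DIM_LIMIT):
    """Clamp the 1D grid to the coreDim limit and report how many
    blocks each program must then process in its internal loop."""
    grid = min(num_blocks, limit)
    blocks_per_program = -(-num_blocks // grid)  # ceiling division
    return grid, blocks_per_program
```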
### Optimization Priority

1. Adjust the grid and number of cores
2. Adjust the main block size
3. Introduce or restructure sub-block loops
4. Correct `shape/stride/block_ptr/order`
5. Evaluate `care_padding=False`
6. Evaluate `TRITON_ALL_BLOCKS_PARALLEL`
7. Evaluate `multibuffer` and related compilation optimization items
8. Adjust the dtype path without breaking semantics
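The "main block + sub-block loop" structure (splitting `BLOCK_SIZE`/`XBLOCK` into `BLOCK_SIZE_SUB`/`XBLOCK_SUB`) keeps each UB-resident tile small; the index arithmetic of one main block can be sketched in plain Python:

```python
def sub_tiles(block_start, block_size, sub_block_size, n_elements):
    """Yield (start, length) sub-tiles of one main block, so each
    load/store touches at most sub_block_size elements of UB.
    The final tile may be shorter when sizes do not divide evenly."""
    block_end = min(block_start + block_size, n_elements)
    for start in range(block_start, block_end, sub_block_size):
        yield start, min(sub_block_size, block_end - start)
```

In the kernel this corresponds to an inner `for` over `BLOCK_SIZE_SUB`-sized ranges with a mask on the tail tile, rather than one `BLOCK_SIZE`-wide load.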
### Key Points to Cover

- `cuda` -> `npu` and `torch_npu`
- 1D grid
- Physical core binding
- Distinction between Vector-only operators and operators containing `tl.dot`
- `coreDim <= 65535`
- UB limit
- Continuous / aligned memory access
- Re-review of `shape/stride/block_ptr/order`
- `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False`
- Scalar degradation caused by dtype
## Fixed Output Template
## Migration Conclusion
- Input Source:
- Operator Type:
- Main Migration Actions:
## Triton-Ascend Implementation
- Provide the final kernel and calling wrapper code
- For basic migration scenarios, first provide the "minimal diff migration version"
- Only provide "engineering enhanced/optimized version" additionally when users request it, or when there are clear optimization opportunities
- If clear optimization opportunities exist, directly provide the optimized version
- Explain the save path and naming of the generated file
## Validation Script
- Provide a minimally executable validation script
- Compare with PyTorch reference
- Include at least `allclose` or maximum error output
- Explain the save path of the validation script
- Clearly state whether it has been actually executed, along with execution commands and results
## Optimization Instructions
- Explain the reasons for adjusting grid / number of cores / block / sub-block
- Explain whether `coreDim`, UB, memory access, dtype, mask performance issues are handled
- Explain whether `TRITON_ALL_BLOCKS_PARALLEL`, `multibuffer`, `care_padding=False` are used
If the current task is "documentation-style minimal migration", this section can be extremely concise:
- Only state that minimal migration is performed first
- Briefly state that optimization items like `coreDim` / UB / `multibuffer` are not expanded in this task
- Do not expand into lengthy optimization analysis just to fit the template
## Risks and Limitations
- List unvalidated boundary conditions
- List information that needs to be supplemented by users
- If the script fails to run, clearly state which step it is stuck on
- What input the user should provide
- Whether to handle it according to the local or container scenario
- What outputs will be generated next
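The comparison core of the validation script can stay framework-agnostic; a sketch of the `allclose`-style check and maximum-error output over flattened results (the tolerances are placeholders to be tuned per dtype, and `torch.allclose` can replace this when tensors are available):

```python
def max_abs_error(ref, out):
    """Maximum elementwise absolute error between two flat sequences."""
    return max(abs(a - b) for a, b in zip(ref, out))

def allclose(ref, out, rtol=1e-5, atol=1e-6):
    """Elementwise |a - b| <= atol + rtol * |b| check, mirroring the
    torch.allclose tolerance formula."""
    return all(abs(a - b) <= atol + rtol * abs(b)
               for a, b in zip(ref, out))
```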
See `references/usage.md`.

## Additional Resources
- Usage, Local Commands and Container Scenarios
- Input Methods and Context Completion
- Output, Naming and Minimal Validation Script
- Migration and Optimization Reference
- Typical Examples and Output Samples
- Manual Review Test Checklist