Found 25 Skills
Provides installation guidance for CANN on Ascend NPU. Call this skill when users need to install CANN, configure the Ascend environment, or resolve installation issues.
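A minimal sketch of the kind of environment sanity check this skill would run after installation. The default toolkit prefix /usr/local/Ascend/ascend-toolkit and the ASCEND_TOOLKIT_HOME variable are assumptions about a standard CANN install; adjust for custom locations or other CANN versions.

```python
import os
import shutil

# Assumed default CANN toolkit install prefix; a custom install may differ.
DEFAULT_SET_ENV = "/usr/local/Ascend/ascend-toolkit/set_env.sh"

def check_cann_env() -> None:
    # set_env.sh must be sourced in the shell before starting Python,
    # otherwise toolkit paths are not exported to this process.
    if not os.path.exists(DEFAULT_SET_ENV):
        print(f"CANN toolkit not found at {DEFAULT_SET_ENV}; is CANN installed?")
    if "ASCEND_TOOLKIT_HOME" not in os.environ:
        print(f"ASCEND_TOOLKIT_HOME is not set; run `source {DEFAULT_SET_ENV}` first.")
    # npu-smi is the Ascend counterpart of nvidia-smi and should be on PATH.
    if shutil.which("npu-smi") is None:
        print("npu-smi not found on PATH; the NPU driver may be missing.")

if __name__ == "__main__":
    check_cann_env()
```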
vLLM Ascend plugin for LLM inference serving on Huawei Ascend NPU. Use for offline batch inference, API server deployment, quantization inference (with msmodelslim quantized models), tensor/pipeline parallelism for distributed serving, and OpenAI-compatible API endpoints. Supports Qwen, DeepSeek, GLM, LLaMA models with Ascend-optimized kernels.
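A short offline-batch-inference sketch using the standard vLLM Python API, which the vLLM Ascend plugin exposes unchanged. The model id, sampling settings, and tensor_parallel_size below are illustrative placeholders, not values prescribed by the skill.

```python
from vllm import LLM, SamplingParams

# Example prompts for offline batch inference on Ascend NPU.
prompts = [
    "What is the capital of France?",
    "Explain tensor parallelism in one sentence.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# Model id is a placeholder; any supported Qwen/DeepSeek/GLM/LLaMA checkpoint works.
# tensor_parallel_size > 1 shards the model across multiple NPUs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```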
Analyze Huawei Ascend NPU profiling data to discover hidden performance anomalies and produce a detailed model architecture report reverse-engineered from profiling. Trigger on Ascend profiling traces, NPU bottlenecks, device idle gaps, host-device issues, kernel_details.csv / trace_view.json / op_summary / communication.json. Also trigger on "profiling", "step time", "device bubble", "underfeed", "host bound", "device bound", "AICPU", "wait anchor", "kernel gap", "Ascend performance", "model architecture", "layer structure", "forward pass", "model structure". Runs anomaly discovery (bubble detection, wait-anchor, AICPU exposure) alongside model architecture analysis (layer classification, per-layer sub-structure, communication pipeline). Outputs a separate Markdown architecture report alongside anomaly analysis.
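A rough sketch of the bubble-detection step on a kernel_details.csv export. The column names ("Start Time(us)", "Duration(us)", "Name") and the top-10 threshold are assumptions; actual headers vary across CANN/msprof and torch_npu profiler versions.

```python
import pandas as pd

# Load the per-kernel timeline; column names below are assumed and may need
# to be adapted to the actual CSV header produced by your profiler version.
df = pd.read_csv("kernel_details.csv").sort_values("Start Time(us)").reset_index(drop=True)

# Idle gap between the end of one kernel and the start of the next on the device.
end = df["Start Time(us)"] + df["Duration(us)"]
gap_us = df["Start Time(us)"].shift(-1) - end

# Report the largest idle gaps as candidate device bubbles.
bubbles = df.assign(gap_us=gap_us).nlargest(10, "gap_us")
print(bubbles[["Name", "Start Time(us)", "Duration(us)", "gap_us"]])
```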
Optimizes the performance of Triton operators on Ascend NPU. Use this guide to tune Triton operator performance on Ascend NPU, resolve UB (Unified Buffer) overflow, improve Cube unit utilization, and design tiling strategies.
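A tiling sketch in the spirit of this skill: instead of loading one huge block per program (which can overflow the UB), each program walks its range in fixed-size chunks. The kernel, BLOCK_SIZE, and ROWS_PER_CORE values are illustrative assumptions, not tuned settings.

```python
import triton
import triton.language as tl

@triton.jit
def scaled_add_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                      BLOCK_SIZE: tl.constexpr, ROWS_PER_CORE: tl.constexpr):
    pid = tl.program_id(axis=0)
    base = pid * ROWS_PER_CORE
    # Process the per-core range in BLOCK_SIZE tiles rather than all at once,
    # keeping the UB-resident working set bounded.
    for i in range(0, ROWS_PER_CORE, BLOCK_SIZE):
        offs = base + i + tl.arange(0, BLOCK_SIZE)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x * 2.0 + y, mask=mask)
```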
Complete toolkit for Huawei Ascend NPU model conversion and end-to-end inference adaptation. Workflow 1 auto-discovers input shapes and parameters from user source code. Workflow 2 exports PyTorch models to ONNX. Workflow 3 converts ONNX to .om via ATC with multi-CANN version support. Workflow 4 adapts the user's full inference pipeline (preprocessing + model + postprocessing) to run end-to-end on NPU. Workflow 5 verifies precision between ONNX and OM outputs. Workflow 6 generates a reproducible README. Supports any standard PyTorch/ONNX model. Use when converting, testing, or deploying models on Ascend AI processors.
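A sketch of Workflow 2 (PyTorch to ONNX export) using torch.onnx.export; the model, input shape, and opset are placeholders. The ATC invocation shown in the trailing comment corresponds to Workflow 3 and is hedged: exact flags and the soc_version value depend on the CANN version and target chip.

```python
import torch
import torchvision

# Placeholder model and input shape; substitute the user's PyTorch model.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=13,
)

# Workflow 3 then converts ONNX to .om with ATC on the host, roughly:
#   atc --model=model.onnx --framework=5 --output=model --soc_version=Ascend310P3
# (--framework=5 selects ONNX; soc_version here is only an example.)
```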
This skill should be used when the user asks about "Ascend NPU", "昇腾", "Huawei NPU", "triton-ascend", "Ascend kernel development", "NPU算子开发", "Atlas", "CANN", or mentions Ascend hardware, AI Core, Cube/Vector/Scalar units. Provides expert guidance on Ascend NPU hardware architecture, triton-ascend kernel development, and GPU to NPU migration. Always use this skill for Ascend-related questions to avoid confusion with GPU documentation and concepts.
Ascend NPU migration skill for the GENERator DNA sequence generation model. Suitable for migrating HuggingFace Transformers-based causal LMs from CUDA to Huawei Ascend NPU, covering environment setup, dependency installation, code adaptation, multi-process handling, and sequence recovery validation.
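A minimal CUDA-to-NPU adaptation sketch for a HuggingFace causal LM: importing torch_npu registers the "npu" device, and the main code change is targeting "npu" instead of "cuda". The checkpoint id and DNA prompt are assumptions for illustration only.

```python
import torch
import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GenerTeam/GENERator-eukaryote-1.2b-base"  # assumed checkpoint id
device = "npu:0"                                      # was "cuda:0" on GPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

inputs = tokenizer("ATGGCCATTGTAATGGGCCGCTG", return_tensors="pt").to(device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0]))
```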
Migrate GPU/CUDA Triton operators to Triton-Ascend, or rewrite Python/PyTorch operators into Triton-Ascend implementations that can run on Ascend NPU. When clear optimization opportunities are identified, directly output the optimized code, minimal validation script, and troubleshooting instructions. This skill should be prioritized when users mention 昇腾 (Ascend), Ascend, NPU, triton-ascend, Triton operator migration, PyTorch operator rewriting, coreDim, UB overflow, 1D grid, physical core binding, block_ptr, stride, memory access alignment, mask performance, dtype degradation, operator optimization, or directly ask questions like "How to use this skill", "How to run it in the command line", "How to perform migration/validation in a container", even if users do not explicitly say "write a skill" or "perform migration".
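A sketch of the "minimal validation script" this skill mentions: run a migrated Triton-Ascend kernel on NPU and compare against a PyTorch reference with allclose. The ReLU kernel is only a stand-in for the user's operator; block size and tolerances are assumptions.

```python
import torch
import torch_npu  # noqa: F401  (enables the "npu" device)
import triton
import triton.language as tl

@triton.jit
def relu_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, tl.maximum(x, 0.0), mask=mask)

def main():
    n = 1 << 20
    x = torch.randn(n, device="npu", dtype=torch.float32)
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 1024),)  # 1D grid, as recommended for Ascend
    relu_kernel[grid](x, out, n, BLOCK_SIZE=1024)

    ref = torch.relu(x)
    assert torch.allclose(out, ref, rtol=1e-4, atol=1e-4), "NPU result mismatches reference"
    print("validation passed, max abs diff:", (out - ref).abs().max().item())

if __name__ == "__main__":
    main()
```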
Ascend NPU profiling collection and performance analysis skill for AI for Science workloads. Uses torch_npu.profiler on Huawei Ascend NPU to collect L0-, L1-, and L2-level performance data, analyze operator latency, call stacks, memory usage, and bottlenecks during training or inference, and guide subsequent tuning.
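A sketch of an L1-level collection run with torch_npu.profiler. The enum and handler names follow the torch_npu documentation but may shift between torch_npu releases; the Linear model is just a stand-in workload.

```python
import torch
import torch_npu

# Assumed torch_npu profiler configuration for L1-level data collection.
experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
)

model = torch.nn.Linear(1024, 1024).npu()  # placeholder workload
x = torch.randn(32, 1024).npu()

with torch_npu.profiler.profile(
    activities=[torch_npu.profiler.ProfilerActivity.CPU,
                torch_npu.profiler.ProfilerActivity.NPU],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=1, active=3, repeat=1),
    with_stack=True,       # needed for call-stack analysis
    profile_memory=True,   # needed for memory analysis
    experimental_config=experimental_config,
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_out"),
) as prof:
    for _ in range(5):
        model(x)
        prof.step()
```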
TensorFlow-to-PyTorch conversion and Ascend NPU migration skill for DeepFRI. Covers TF model analysis, PyTorch rewriting, layer-by-layer weight mapping, NPU inference, and precision validation for protein function prediction, and is especially suited to running the DeepFRI CNN or GCN paths on Ascend.
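An illustrative layer-by-layer weight-mapping helper of the kind this skill produces. The TF variable names, PyTorch parameter names, and the dense-kernel transpose are assumptions about a DeepFRI-style model; the real mapping must be verified against both graphs.

```python
import numpy as np
import torch

def map_tf_weights(tf_weights: dict, pt_model: torch.nn.Module) -> None:
    """Copy TF weights (dict of numpy arrays) into a rewritten PyTorch module."""
    state_dict = pt_model.state_dict()
    # "tf_variable_name": ("pytorch_parameter_name", needs_transpose)
    # Names below are hypothetical placeholders, not the actual DeepFRI layout.
    name_map = {
        "dense_out/kernel": ("classifier.weight", True),   # TF dense kernels are (in, out)
        "dense_out/bias":   ("classifier.bias", False),
    }
    for tf_name, (pt_name, transpose) in name_map.items():
        w = np.asarray(tf_weights[tf_name])
        if transpose:
            w = np.ascontiguousarray(w.T)
        state_dict[pt_name] = torch.from_numpy(w)
    pt_model.load_state_dict(state_dict)
```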
MindSpeed-LLM environment setup guide for Huawei Ascend NPU. Covers CANN environment activation, PyTorch + torch_npu installation, MindSpeed acceleration library installation, Megatron-LM core module integration, MindSpeed-LLM installation, and environment verification. Use when the user needs to set up a MindSpeed-LLM training environment on Ascend NPU.
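A minimal sketch of the final environment-verification step: confirm that torch_npu sees the Ascend devices before launching MindSpeed-LLM training. It checks device visibility and a tiny NPU matmul only; it does not validate MindSpeed or Megatron-LM versions.

```python
import torch
import torch_npu  # registers the "npu" device with PyTorch

print("torch:", torch.__version__)
print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())

# Run a small matmul on the NPU to confirm kernels actually execute.
x = torch.randn(2, 3).npu()
print("matmul on NPU OK:", (x @ x.T).shape)
```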
Migrates simple vector-type Triton operators from GPU to Ascend NPU. Use when the user needs to migrate Triton code to NPU or mentions GPU-to-NPU migration, Triton migration, or Ascend adaptation. Note: operators with existing compilation issues cannot be migrated automatically. A sketch of what such a migration typically touches follows.
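For a simple vector kernel, the @triton.jit body usually carries over unchanged; the migration is mostly moving tensors to the "npu" device and keeping a 1D launch grid. The vector-add kernel and BLOCK_SIZE below are illustrative assumptions.

```python
import torch
import torch_npu  # noqa: F401  (enables the "npu" device)
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Kernel body is identical to the GPU version.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 4096
x = torch.rand(n, device="npu")  # was device="cuda" in the GPU version
y = torch.rand(n, device="npu")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)  # 1D grid
```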