Loading...
Loading...
Found 98 Skills
Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
PyTorch library for audio generation including text-to-music (MusicGen) and text-to-sound (AudioGen). Use when you need to generate music from text descriptions, create sound effects, or perform melody-conditioned music generation.
Image processing, object detection, segmentation, and vision models. Use for image classification, object detection, or visual analysis tasks.
GENERator DNA 序列生成模型的昇腾 NPU 迁移 Skill,适用于将基于 HuggingFace Transformers 的 Causal LM 从 CUDA 迁移到华为 Ascend NPU,覆盖环境搭建、依赖安装、代码适配、多进程处理和 sequence recovery 验证。
Code instrumentation for timing workloads. Two scenarios: (1) Training loop — inject manual timing to report per-iteration latency, throughput (samples/sec), and data load time. (2) Standalone kernel/op — write CUDA event timing code with warmup, per-iteration statistics, and anti-pattern avoidance. Also covers NVTX annotation for labeling profiler timelines. NOT for: running or analyzing profiler tools (nsys, ncu, Nsight Systems, Nsight Compute), writing kernels (Triton, CuTe, CUDA), applying optimizations (CUDA Graphs, gradient checkpointing, fusion), or interpreting roofline/SOL% metrics. Triggers: "measure throughput", "benchmark this function", "time my training loop", "samples per second", "NVTX annotate", "instrument my dataloader", "data load time", "kernel timing", "how do I time".
Graph Neural Networks (PyG). Node/graph classification, link prediction, GCN, GAT, GraphSAGE, heterogeneous graphs, molecular property prediction, for geometric deep learning.
MindSpeed-LLM 环境搭建指南,用于华为昇腾 NPU。覆盖 CANN 环境激活、PyTorch + torch_npu 安装、MindSpeed 加速库安装、Megatron-LM 核心模块集成、MindSpeed-LLM 安装及环境验证。当用户需要在昇腾 NPU 上搭建 MindSpeed-LLM 训练环境时使用。
Systematic approach to exploring the TensorRT-LLM codebase before implementing new features or optimizations. Teaches how to discover existing infrastructure, trace code paths, and avoid reimplementing what already exists. Derived from real mistakes where ~250 lines of code were written and deleted because existing forward methods weren't discovered upfront. Use when starting any new feature, optimization, or code modification in TRT-LLM.
WildWorld large-scale action-conditioned world modeling dataset with 108M+ frames from a photorealistic ARPG game, featuring per-frame annotations, 450+ actions, and explicit state information for generative world modeling research.
Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.
Query CZ CELLxGENE Census (61M+ cells). Filter by cell type/tissue/disease, retrieve expression data, integrate with scanpy/PyTorch, for population-scale single-cell analysis.
Write ML experiment code with iterative improvement. Generate training/evaluation pipelines, debug errors, and optimize results through code reflection. Use when implementing experiments for a research paper.