Loading...
Loading...
Found 6 Skills
Battle-tested PyTorch training recipes for all domains — LLMs, vision, diffusion, medical imaging, protein/drug discovery, spatial omics, genomics. Covers training loops, optimizer selection (AdamW, Muon), LR scheduling, mixed precision, debugging, and systematic experimentation. Use when training or fine-tuning neural networks, debugging loss spikes or OOM, choosing architectures, or optimizing GPU throughput.
NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels. MANDATORY: every optimization MUST start with NCU profiling, followed by multi-dimensional analysis, then targeted code modification, then re-profiling to verify. Supports roofline, memory hierarchy, warp stalls, instruction mix, occupancy, divergence analysis. Provides implementation-specific code modifications: Native CUDA (launch config, memory patterns, async copy, Tensor Core), CUTLASS (ThreadblockShape, stages, epilogue, schedule policy, alignment), Triton (autotune params, compiler hints, tl.* API patterns), CuTe DSL (threads_per_cta, elems_per_thread, tiled_copy, copy atom, shared memory, warp/cta reduce). Use when optimizing any CUDA kernel performance.
Identify and eliminate host-device synchronizations in PyTorch code. Detects sync points (.item(), .cpu(), boolean indexing, torch.tensor on CUDA), classifies false vs true dependencies, provides sync-free alternatives. Triggers: sync-free, synchronization, .item(), .cpu(), host-device sync, eliminate syncs, CPU stall, non_blocking, set_sync_debug_mode, cudaStreamSynchronize, cudaEventSynchronize, remove syncs, async GPU.
Apply CUDA Graphs to PyTorch workloads — API selection (torch.compile, PyTorch make_graphed_callables, TE make_graphed_callables, MCore CudaGraphManager, FullCudaGraphWrapper, manual torch.cuda.graph), code compatibility, capture workflows, dynamic pattern handling, and troubleshooting. Triggers: CUDA graph, torch.cuda.graph, make_graphed_callables, reduce-overhead, graph capture, graph replay, kernel launch overhead, CudaGraphManager, FullCudaGraphWrapper, full-iteration graph, stream capture.
Performance analysis coordination workflow. Guides profiling delegation, bottleneck classification (compute/memory/launch/communication/sync), and structured report generation. Use when the user asks to analyze performance, profile a workload, check MFU/SOL, or diagnose bottlenecks.
Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.