Search Results: gpu-optimization

Found 8 Skills

nemo-mbridge-perf-cuda-graphs

Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.

Uncategorized | pluginagentmarketplace/cu...

particle-systems

Creating visual effects using particle systems, physics simulation, and post-processing for polished, dynamic game graphics.

1 script | Checked

AI & Machine Learning | mindrally/skills

pytorch

PyTorch deep learning development with transformers, diffusion models, and GPU optimization.

Backend Development | pepperu96/hyper-mla

design-cute-dsl-kernel

CuTe Python DSL kernel workflow, CuteKernel runtime wrapper, suitability gate, tiling guidance, and CuTe-specific pitfalls. Use when: (1) planning or implementing a kernel in the CuTe Python DSL, (2) the optimization needs more explicit control than cuTile exposes but should remain in a Python-driven workflow, (3) defining package naming for cute-dsl kernels, (4) documenting CuTe Python DSL design choices, (5) recording language-specific knowledge for CuTe Python DSL.

Frontend Development | martinholovsky/claude-ski...

glsl

GLSL shader programming for JARVIS holographic effects

AI & Machine Learning | slowlyc/agent-gpu-skills

cutlass-skill

Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, blackwell GEMM, hopper GEMM, FP8 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates.

1 script | Attention

AI & Machine Learning | pepperu96/hyper-mla

optimization-catalog

Compatibility router for the shared optimization knowledge base and the language-specific optimization catalog skills. Use when: (1) selecting which optimization catalog skill to load, (2) the implementation language is not fixed yet, (3) a workflow still references the legacy optimization-catalog skill name, (4) deciding whether a finding is shared or language-specific, (5) updating the generalized knowledge-base structure.

AI & Machine Learning | replicate/skills

build-models

Package and build custom AI models with Cog for deployment on Replicate. Use when creating a cog.yaml or predict.py, defining model inputs and outputs, loading model weights at setup time, building Docker images for ML models, serving locally with cog serve or cog predict, or porting a HuggingFace, GitHub, or ComfyUI model to run on Replicate. Trigger on phrases like "build a model", "package a model", "create a Cog model", "wrap a model", "containerize an AI model", "predict.py", "cog.yaml", "BasePredictor", or "Cog container", and when referencing cog.run, github.com/replicate/cog, or github.com/replicate/cog-examples. Covers GPU and CUDA setup, pget for fast weight downloads, async predictors with continuous batching, streaming outputs, and cold-boot optimization for image, video, audio, and LLM models. For pushing built models to Replicate, see publish-models. For running existing models, see run-models.
