Found 8 Skills
PyTorch deep learning development with transformers, diffusion models, and GPU optimization.
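A minimal sketch of the kind of workflow this skill covers, using only core PyTorch APIs; the encoder size, batch shape, and dtype choice are illustrative assumptions, not values defined by the skill:

```python
import torch
import torch.nn as nn

# Illustrative only: a small transformer encoder run on the GPU (if present)
# under autocast mixed precision. All dimensions are arbitrary placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
).to(device)

tokens = torch.randn(2, 128, 256, device=device)  # (batch, seq, d_model)

with torch.no_grad(), torch.autocast(device_type=device, dtype=amp_dtype):
    features = encoder(tokens)

print(features.shape)  # torch.Size([2, 128, 256])
```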
Creating visual effects using particle systems, physics simulation, and post-processing for polished, dynamic game graphics.
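As a rough illustration of the particle-system side of this skill, here is a minimal CPU-side update step in Python; the integrator, constants, and culling rule are assumptions made for the sketch, not anything the skill prescribes:

```python
import numpy as np

# Illustrative only: one frame of a simple particle system using
# semi-implicit Euler integration with gravity and per-particle lifetimes.
rng = np.random.default_rng(0)
n = 1_000
pos = np.zeros((n, 3))
vel = rng.normal(0.0, 2.0, size=(n, 3))   # initial burst velocities
life = rng.uniform(0.5, 2.0, size=n)      # seconds remaining per particle
gravity = np.array([0.0, -9.81, 0.0])
dt = 1.0 / 60.0                           # one 60 fps frame

def step(pos, vel, life):
    vel = vel + gravity * dt              # integrate acceleration first
    pos = pos + vel * dt                  # then position (semi-implicit Euler)
    life = life - dt
    alive = life > 0.0                    # cull expired particles
    return pos[alive], vel[alive], life[alive]

pos, vel, life = step(pos, vel, life)
print(pos.shape)  # particles still alive after one frame
```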
GLSL shader programming for JARVIS holographic effects.
20 years of Weta/Pixar experience in real-time graphics, Metal shaders, and visual effects. Expert in MSL shaders, PBR rendering, tile-based deferred rendering (TBDR), and GPU debugging. Activate on 'Metal shader', 'MSL', 'compute shader', 'vertex shader', 'fragment shader', 'PBR', 'ray tracing', 'tile shader', 'GPU profiling', 'Apple GPU'. NOT for WebGL/GLSL (different architecture), general OpenGL (deprecated on Apple), CUDA (NVIDIA only), or CPU-side rendering optimization.
Compatibility router for the shared optimization knowledge base and the language-specific optimization catalog skills. Use when: (1) selecting which optimization catalog skill to load, (2) the implementation language is not fixed yet, (3) a workflow still references the legacy optimization-catalog skill name, (4) deciding whether a finding is shared or language-specific, (5) updating the generalized knowledge-base structure.
CuTe Python DSL kernel workflow, CuteKernel runtime wrapper, suitability gate, tiling guidance, and CuTe-specific pitfalls. Use when: (1) planning or implementing a kernel in the CuTe Python DSL, (2) the optimization needs more explicit control than cuTile exposes but should remain in a Python-driven workflow, (3) defining package naming for cute-dsl kernels, (4) documenting CuTe Python DSL design choices, (5) recording language-specific knowledge for CuTe Python DSL.
Expert GPU optimization for modern consumer GPUs (8-24GB VRAM). Use this skill when you need to optimize GPU training, speed up CUDA code, reduce OOM errors, tune XGBoost for GPU, migrate NumPy to CuPy, make a model faster, manage GPU memory, optimize VRAM usage, or benchmark PyTorch. Covers mixed precision, gradient checkpointing, XGBoost GPU acceleration, CuPy/cuDF migration, vectorization, torch.compile, and diagnostics. NVIDIA GPUs only. PyTorch, XGBoost, and RAPIDS frameworks.
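A minimal sketch of the mixed-precision training pattern this skill tunes, assuming an NVIDIA GPU and PyTorch 2.x; the model, batch, and hyperparameters are placeholders, not recommendations from the skill:

```python
import torch
import torch.nn as nn

# Illustrative only: fp16 autocast training with loss scaling and an
# optional torch.compile pass. Requires a CUDA device (NVIDIA GPUs only).
device = "cuda"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
model = torch.compile(model)                 # optional graph compilation
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()         # scales loss to avoid fp16 underflow

x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()                # backward on the scaled loss
scaler.step(optimizer)
scaler.update()
```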
Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, Blackwell GEMM, Hopper GEMM, FP8 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates.