Found 12 Skills
Deep learning framework (PyTorch Lightning). Organize PyTorch code into LightningModules, configure Trainers for multi-GPU/TPU, and implement data pipelines, callbacks, logging (W&B, TensorBoard), and distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.
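As a rough illustration of the LightningModule/Trainer split this entry describes, here is a minimal sketch assuming Lightning 2.x (import lightning as L); the linear model and random dataset are placeholders:

    import torch
    import lightning as L
    from torch.utils.data import DataLoader, TensorDataset

    class LitClassifier(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Linear(32, 2)  # placeholder model

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.net(x), y)
            self.log("train_loss", loss)  # routed to TensorBoard/W&B if configured
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # The Trainer owns device placement and distribution (e.g. strategy="ddp").
    data = DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
                      batch_size=16)
    L.Trainer(max_epochs=1, accelerator="auto", devices="auto").fit(LitClassifier(), data)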
Expert guidance for Fully Sharded Data Parallel training with PyTorch FSDP: parameter sharding, mixed precision, CPU offloading, and FSDP2.
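A minimal sketch of the classic FSDP wrapper with mixed precision and the CPU-offload knob, assuming an NCCL process group launched via torchrun; the Sequential model is a placeholder:

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import MixedPrecision, CPUOffload

    dist.init_process_group("nccl")  # assumes torchrun sets rank/world size
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                                torch.nn.Linear(1024, 1024)).cuda()

    # Shard parameters across ranks; compute in bf16, reduce gradients in fp32.
    sharded = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.float32),
        cpu_offload=CPUOffload(offload_params=False),  # True moves params to CPU
    )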
Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.
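A minimal sketch of a hand-written training loop with torch.compile, assuming PyTorch 2.x; the model and synthetic data are placeholders:

    import torch

    model = torch.nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    compiled = torch.compile(model)  # compiles the forward graph on first call

    x, y = torch.randn(128, 16), torch.randn(128, 1)
    for step in range(10):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(compiled(x), y)
        loss.backward()  # autograd runs through the compiled forward
        opt.step()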
Advanced sub-skill for PyTorch focused on deep research and production engineering. Covers custom Autograd functions, module hooks, advanced initialization, Distributed Data Parallel (DDP), and performance profiling.
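One of the techniques this sub-skill covers, sketched minimally: a custom autograd Function with an explicit backward. The clamped ReLU here is illustrative only:

    import torch

    class ClampedReLU(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)          # stash inputs for the backward pass
            return x.clamp(min=0.0, max=6.0)

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            # Gradient passes only where the forward was not clamped.
            return grad_out * ((x > 0) & (x < 6)).to(grad_out.dtype)

    x = torch.randn(4, requires_grad=True)
    ClampedReLU.apply(x).sum().backward()
    print(x.grad)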
Train and deploy neural networks in distributed E2B sandboxes with Flow Nexus
HCCL (Huawei Collective Communication Library) performance testing for Ascend NPU clusters. Use for testing distributed communication bandwidth, verifying HCCL functionality, and benchmarking collective operations such as AllReduce and AllGather. Covers MPI installation, multi-node pre-flight checks (SSH, CANN version, NPU health), and production testing workflows; a pre-flight sketch follows.
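A hedged sketch of the multi-node pre-flight idea: probe each host over SSH for NPU visibility and a working MPI launcher. The hostnames are hypothetical, and the exact checks a real HCCL workflow runs will differ; npu-smi and mpirun are the only commands assumed here:

    import subprocess

    HOSTS = ["node-0", "node-1"]  # hypothetical cluster hostnames

    def ssh_check(host, cmd):
        # BatchMode fails fast if passwordless SSH is not set up.
        r = subprocess.run(["ssh", "-o", "BatchMode=yes", host, cmd],
                           capture_output=True, text=True)
        return r.returncode == 0

    for host in HOSTS:
        for cmd in ("npu-smi info", "mpirun --version"):
            status = "OK" if ssh_check(host, cmd) else "FAILED"
            print(f"{host}: {cmd} -> {status}")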
Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
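A minimal sketch of the fully_shard path, assuming a recent PyTorch (2.6+) where fully_shard and MixedPrecisionPolicy are exposed under torch.distributed.fsdp, launched via torchrun; the model is a placeholder:

    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

    dist.init_process_group("nccl")  # assumes torchrun sets rank/world size
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                                torch.nn.Linear(1024, 1024)).cuda()

    # Shard each layer, then the root; parameters become DTensors on the mesh.
    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
    for layer in model:
        fully_shard(layer, mesh=mesh, mp_policy=mp)
    fully_shard(model, mesh=mesh, mp_policy=mp)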
Manage GPU compute jobs on the Qizhi (启智) platform using qzcli, a kubectl-style CLI tool. Use when the user says "qzcli", "启智平台" (Qizhi platform), "submit job", "stop job", "查计算组" (check compute groups), "avail", "list jobs", "batch submit", or needs to manage distributed training jobs on a Qizhi instance.
Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
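As one concrete instance of the LoRA trigger above, a minimal sketch using Hugging Face PEFT, assuming transformers and peft are installed; GPT-2 and the rank/alpha values are illustrative choices:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # small illustrative model
    config = LoraConfig(r=8, lora_alpha=16,
                        target_modules=["c_attn"],  # GPT-2's fused attention proj
                        task_type="CAUSAL_LM")
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only the low-rank adapters train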
Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library that separates infrastructure from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
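To show the "same code, any scale" claim, a hedged sketch of Trainer configuration in Lightning 2.x; swapping the strategy string is the only change between single-device and sharded runs, and the callback choices are illustrative:

    import lightning as L
    from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

    # The same script runs on a laptop (devices=1) or a multi-GPU node (devices=4).
    trainer = L.Trainer(
        max_epochs=3,
        accelerator="auto",
        devices="auto",
        strategy="ddp",  # or "fsdp" / "deepspeed" without touching model code
        callbacks=[ModelCheckpoint(monitor="val_loss"),
                   EarlyStopping(monitor="val_loss", patience=2)],
    )
    # trainer.fit(model, datamodule=dm)  # model/datamodule as in any Lightning run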
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
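A minimal sketch of the Ray Train entry point, assuming ray[train] and torch are installed; the per-worker loop and scaling numbers are placeholders:

    import torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, prepare_model

    def train_loop_per_worker(config):
        # Ray starts one copy per worker and wires up the process group.
        model = prepare_model(torch.nn.Linear(8, 1))  # wraps in DDP, moves to device
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(config["steps"]):
            x, y = torch.randn(32, 8), torch.randn(32, 1)
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            opt.step()

    TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"steps": 10},
        scaling_config=ScalingConfig(num_workers=2),  # use_gpu=True for GPU workers
    ).fit()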