Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
```
npx skill4agent add eyadsibai/ltk llm-training
```

| Framework | Best For | Multi-GPU | Memory Efficient |
|---|---|---|---|
| Accelerate | Simple distributed | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |
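
For the Accelerate row above, the training loop typically looks like the minimal sketch below. This is a hedged illustration, not part of the skill itself: the `gpt2` model and the toy two-sentence corpus are placeholders to swap for your own model and dataset.

```python
# Minimal Accelerate sketch: run `accelerate config` once, then launch with `accelerate launch train.py`.
# The model name and toy corpus are placeholders.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator(mixed_precision="bf16")

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# prepare() handles device placement and distributed wrapping; DataLoaders are normally prepared too
model, optimizer = accelerator.prepare(model, optimizer)

texts = ["hello world", "accelerate handles device placement"]  # toy corpus
batch = tokenizer(texts, return_tensors="pt", padding=True)
batch["labels"] = batch["input_ids"].clone()
batch = {k: v.to(accelerator.device) for k, v in batch.items()}

model.train()
loss = model(**batch).loss
accelerator.backward(loss)                          # replaces loss.backward(); handles scaling/sync
optimizer.step()
optimizer.zero_grad()
```

Launching the same script with `accelerate launch` scales it across GPUs without further code changes.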
In short: run `accelerate config` once, wrap training objects with `accelerator.prepare()`, and call `accelerator.backward()` in place of `loss.backward()`.

| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Larger effective batch at no extra memory | Slower per effective batch |
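
Several of these techniques compose. The sketch below shows one way to stack them with Hugging Face `TrainingArguments`; the model name, batch size, and accumulation steps are illustrative assumptions, and Flash Attention additionally requires the `flash-attn` package, a supported GPU, and an architecture with FlashAttention-2 support.

```python
# Sketch: stacking memory-saving techniques from the table (illustrative values only).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash Attention; needs flash-attn + compatible GPU
)

args = TrainingArguments(
    output_dir="out",
    bf16=True,                                # mixed precision
    gradient_checkpointing=True,              # ~30-50% activation memory for extra compute
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,           # effective batch of 16 with no extra memory
)
```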
| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
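
For the "Simple finetuning" row (Accelerate + PEFT), a QLoRA-style setup combines 4-bit loading with LoRA adapters. A minimal sketch follows; the model name, LoRA rank, and target modules are assumptions to adjust per architecture.

```python
# Sketch: 4-bit quantized base model + LoRA adapters via PEFT (QLoRA-style).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # ~75% memory saving per the table above
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder 7B model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections; adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trainable
```

From here, training proceeds with a regular Trainer/Accelerate loop (or TRL's `SFTTrainer`), with gradients flowing only through the adapter weights.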