llm-training


LLM Training


Frameworks and techniques for training and finetuning large language models.

Framework Comparison


| Framework | Best For | Multi-GPU | Memory Efficiency |
|---|---|---|---|
| Accelerate | Simple distributed training | Yes | Basic |
| DeepSpeed | Large models, ZeRO | Yes | Excellent |
| PyTorch Lightning | Clean training loops | Yes | Good |
| Ray Train | Scalable, multi-node | Yes | Good |
| TRL | RLHF, reward modeling | Yes | Good |
| Unsloth | Fast LoRA finetuning | Limited | Excellent |


Accelerate (HuggingFace)


Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.
Key concept: wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and use `accelerator.backward()` for the loss.


DeepSpeed (Large Models)


Microsoft's optimization library for training massive models.
ZeRO Stages:
  • Stage 1: Optimizer states partitioned across GPUs
  • Stage 2: + Gradients partitioned
  • Stage 3: + Parameters partitioned (for largest models, 100B+)
Key concept: Configure via JSON, higher stages = more memory savings but more communication overhead.

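A minimal config sketch for ZeRO Stage 2, assuming bf16 hardware support; batch sizes and offload settings are illustrative placeholders:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8
}
```

Switching `stage` to 3 additionally partitions parameters, trading more inter-GPU communication for the memory headroom that 100B+ models require.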

TRL (RLHF/DPO)


HuggingFace library for reinforcement learning from human feedback.
Training types:
  • SFT (Supervised Finetuning): Standard instruction tuning
  • DPO (Direct Preference Optimization): Simpler than RLHF, uses preference pairs
  • PPO: Classic RLHF with reward model
Key concept: DPO is often preferred over PPO: it is simpler, needs no reward model, and trains directly on chosen/rejected response pairs.

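To see why no reward model is needed, here is a toy sketch of the per-pair DPO objective: the loss depends only on log-probabilities of the chosen and rejected responses under the policy and a frozen reference model (the numeric log-probs below are made up for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy widens the chosen-vs-rejected gap relative to the reference,
# the loss falls below the neutral value -log(0.5) ≈ 0.693.
improved = dpo_loss(-5.0, -9.0, -6.0, -8.0)  # policy widened the gap
neutral = dpo_loss(-6.0, -8.0, -6.0, -8.0)   # identical to reference
```

In practice TRL's trainers compute these log-probs from the model; this sketch only shows the shape of the objective.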

Unsloth (Fast LoRA)


Optimized LoRA finetuning - 2x faster, 60% less memory.
Key concept: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.


Memory Optimization Techniques


| Technique | Memory Savings | Trade-off |
|---|---|---|
| Gradient checkpointing | ~30-50% | Slower training |
| Mixed precision (fp16/bf16) | ~50% | Minor precision loss |
| 4-bit quantization (QLoRA) | ~75% | Some quality loss |
| Flash Attention | ~20-40% | Requires compatible GPU |
| Gradient accumulation | Larger effective batch | No memory cost |

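Why gradient accumulation costs no extra memory: averaging per-micro-batch gradients reproduces the full-batch gradient, so only one micro-batch is ever in memory. A pure-Python sketch with a toy one-parameter model:

```python
def grad(w, xs, ys):
    """Mean gradient of (w*x - y)^2 over a batch."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad(w, xs, ys)  # one batch of 4

# Two micro-batches of 2: accumulate, then average
accum = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2
```

The accumulated gradient equals the full-batch one exactly; in real frameworks this is a loop of `backward()` calls with an `optimizer.step()` every N micro-batches.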

Decision Guide


| Scenario | Recommendation |
|---|---|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |

Resources
