# LitGPT - Clean LLM Implementations
## Quick start
LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.
**Installation**:

```bash
pip install 'litgpt[extra]'
```

**Load and use any model**:

```python
from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)
```
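The `temperature` argument rescales the model's output logits before sampling: values below 1 sharpen the distribution toward the top token, values above 1 flatten it. A minimal pure-Python sketch of the idea (illustrative only, not LitGPT's internal sampling code):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # baseline distribution
print(softmax_with_temperature(logits, 0.7))  # sharper: top token more likely
print(softmax_with_temperature(logits, 2.0))  # flatter: closer to uniform
```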
**List available models**:

```bash
litgpt download list
```

## Common workflows
### Workflow 1: Fine-tune on custom dataset
Copy this checklist:

Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning

**Step 1: Download pretrained model**

```bash
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b
```

Models are saved to the `checkpoints/` directory.
**Step 2: Prepare dataset**

LitGPT supports multiple formats:

**Alpaca format** (instruction-response):

```json
[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]
```

Save as `data/my_dataset.json`.
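If you build the dataset programmatically, a short script like the following writes and sanity-checks the Alpaca-format file (a minimal sketch; the path `data/my_dataset.json` matches the example above):

```python
import json
from pathlib import Path

# Alpaca-format records: instruction, optional input, expected output
records = [
    {
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris.",
    },
]

path = Path("data/my_dataset.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(records, ensure_ascii=False, indent=2))

# Sanity-check the round trip: every record has the three required keys
loaded = json.loads(path.read_text())
assert all({"instruction", "input", "output"} <= rec.keys() for rec in loaded)
```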
**Step 3: Configure training**

```bash
# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, fits on a 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4
```
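In the commands above, `--train.micro_batch_size` and `--train.global_batch_size` together imply how many micro-batches are accumulated per optimizer step. A quick sketch of that arithmetic (assuming the standard relationship `micro_batch_size × devices × accumulation_iters = global_batch_size`):

```python
def accumulation_iters(global_batch_size: int, micro_batch_size: int, devices: int = 1) -> int:
    """Micro-batches accumulated per optimizer step so that
    micro_batch_size * devices * iters == global_batch_size."""
    assert global_batch_size % (micro_batch_size * devices) == 0
    return global_batch_size // (micro_batch_size * devices)

print(accumulation_iters(16, 1))  # full fine-tune example above -> 16
print(accumulation_iters(32, 4))  # LoRA example in Workflow 2 -> 8
```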
**Step 4: Run fine-tuning**

Training saves checkpoints to `out/finetune/` automatically.

Monitor training:

```bash
# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
```

### Workflow 2: LoRA fine-tuning on single GPU
Most memory-efficient option.
LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)

**Step 1: Choose base model**

For limited GPU memory (12-16GB):
- Phi-2 (2.7B) - best quality/size tradeoff
- Llama 3 1B - smallest, fastest
- Gemma 2B - good reasoning

**Step 2: Configure LoRA parameters**
```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false
```

Flag notes:
- `--lora_r`: LoRA rank (8-64; higher = more capacity)
- `--lora_alpha`: LoRA scaling factor (typically 2×r)
- `--lora_dropout`: helps prevent overfitting
- `--lora_query` / `--lora_value` / `--lora_projection`: apply LoRA to the query, value, and output projections
- `--lora_key` / `--lora_mlp` / `--lora_head`: usually not needed

LoRA rank guide:
- r=8: lightweight, 2-4MB adapters
- r=16: standard, good quality
- r=32: high capacity, use for complex tasks
- r=64: maximum quality, 4× larger adapters
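Adapter size grows linearly with the rank, since each adapted layer adds an A matrix of shape (r × d_in) and a B matrix of shape (d_out × r). A back-of-envelope sketch with hypothetical, roughly Phi-2-shaped dimensions (hidden size 2560, 32 layers, LoRA on query and value only — the exact numbers depend on which projections you adapt and the save dtype):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return d_in * r + r * d_out

hidden, layers = 2560, 32  # illustrative dimensions only
for r in (8, 16, 32, 64):
    total = layers * 2 * lora_param_count(hidden, hidden, r)  # 2 = query + value
    print(f"r={r:2d}: {total:,} params ~ {total * 2 / 1e6:.1f} MB in bf16")
```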
**Step 3: Train with LoRA**

```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora
```

Memory usage: ~8-12GB for Phi-2 with LoRA.
**Step 4: Merge LoRA weights** (optional)

Merge the LoRA adapters into the base model for deployment:

```bash
litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged
```

Now use the merged model:

```python
from litgpt import LLM
llm = LLM.load("out/phi2-merged")
```
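Conceptually, merging folds the low-rank update into the base weights: W' = W + (α/r)·(B·A), after which the adapter files are no longer needed at inference time. A toy sketch of that arithmetic with made-up 2×2 numbers (not LitGPT's implementation):

```python
def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def merge_lora(W, A, B, alpha, r):
    """W' = W + (alpha / r) * B @ A  -- adapter folded into the base weights."""
    BA = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weight (toy identity)
A = [[0.5, 0.0]]              # r=1: A is (r x d_in)
B = [[1.0], [0.0]]            # B is (d_out x r)
print(merge_lora(W, A, B, alpha=2, r=1))  # -> [[2.0, 0.0], [0.0, 1.0]]
```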
### Workflow 3: Pretrain from scratch
Train a new model on your own domain data.
Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining

**Step 1: Prepare pretraining dataset**

LitGPT expects tokenized data. Use `prepare_dataset.py`:

```bash
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

**Step 2: Configure model architecture**

Edit a config file or use an existing one, e.g. `config/pythia-160m.yaml`:
```yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
```

**Step 3: Set up multi-GPU training**
```bash
# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```

**Step 4: Launch pretraining**

For large-scale pretraining on a cluster:

```bash
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 pretrain_script.sh
```
`pretrain_script.sh` content:

```bash
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000
```

### Workflow 4: Convert and deploy model
Export LitGPT models for production.
Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API

**Step 1: Test inference locally**

```python
from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
```
**Step 2: Quantize model** (optional)

Reduce model size with minimal quality loss:

```bash
# 8-bit quantization (50% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.nf4

# 4-bit quantization (75% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq  # Double quantization
```
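The quoted size reductions follow directly from bytes per parameter: bf16 uses 2 bytes, int8 1 byte, and nf4 roughly 0.5 bytes (plus small quantization-metadata overheads this sketch ignores). A quick sanity check for a hypothetical 2.7B-parameter model:

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def model_size_gb(n_params: int, dtype: str) -> float:
    """Approximate checkpoint size, ignoring quantization metadata."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

n = 2_700_000_000  # e.g. a Phi-2-sized model
for dtype in ("bf16", "int8", "nf4"):
    print(f"{dtype}: {model_size_gb(n, dtype):.2f} GB")
```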
**Step 3: Convert to GGUF** (for llama.cpp)

```bash
python scripts/convert_lit_checkpoint.py \
  --checkpoint_path out/phi2-lora/final \
  --output_path models/phi2.gguf \
  --model_name microsoft/phi-2
```

**Step 4: Deploy with API**

```python
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}
```

Run: `uvicorn api:app --host 0.0.0.0 --port 8000`
## When to use vs alternatives
**Use LitGPT when**:
- You want to understand LLM architectures (clean, readable code)
- You need production-ready training recipes
- Educational or research use
- Prototyping new model ideas
- You already use the Lightning ecosystem

**Use alternatives instead**:
- Axolotl/TRL: more fine-tuning features, YAML configs
- Megatron-Core: maximum performance for >70B models
- HuggingFace Transformers: broadest model support
- vLLM: inference only (no training)
## Common issues
**Issue: Out of memory during fine-tuning**

Use LoRA instead of full fine-tuning:

```bash
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB
```

Or enable gradient accumulation:

```bash
litgpt finetune_lora \
  ... \
  --train.gradient_accumulation_iters 4  # Accumulate gradients
```

**Issue: Training too slow**
Enable Flash Attention (built in; automatic on compatible hardware):

```python
# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed
```

Use a smaller micro-batch and accumulate gradients:

```bash
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32  # Effective batch = 32
```

**Issue: Model not loading**

Check the model name:
```bash
# List all available models
litgpt download list

# Download if not already present
litgpt download meta-llama/Meta-Llama-3-8B
```

Verify the checkpoints directory:

```bash
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
```
**Issue: LoRA adapters too large**

Reduce the LoRA rank:

```bash
--lora_r 8  # Instead of 16 or 32
```

Apply LoRA to fewer layers:

```bash
# Disable the projection and MLP adapters
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
```

## Advanced topics
**Supported architectures**: see references/supported-models.md for the complete list of 20+ model families with sizes and capabilities.

**Training recipes**: see references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.

**FSDP configuration**: see references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.

**Custom architectures**: see references/custom-models.md for implementing new model architectures in LitGPT style.
## Hardware requirements
- GPU: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
- Memory:
  - Inference (Phi-2): 6GB
  - LoRA fine-tuning (7B): 16GB
  - Full fine-tuning (7B): 40GB+
  - Pretraining (1B): 24GB
- Storage: 5-50GB per model (depending on size)
## Resources
- GitHub: https://github.com/Lightning-AI/litgpt
- Docs: https://lightning.ai/docs/litgpt
- Tutorials: https://lightning.ai/docs/litgpt/tutorials
- Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)