LitGPT - Clean LLM Implementations

Quick start

LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.

Installation:
```bash
pip install 'litgpt[extra]'
```

Load and use any model:
```python
from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7,
)
print(result)
```

**List available models**:
```bash
litgpt download list
```

Common workflows

Workflow 1: Fine-tune on custom dataset

Copy this checklist:

Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning

**Step 1: Download pretrained model**

```bash
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b
```

Models are saved to the `checkpoints/` directory.

**Step 2: Prepare dataset**

LitGPT supports multiple formats:

**Alpaca format** (instruction-response):
```json
[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]
```

Save as `data/my_dataset.json`.
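Before launching a run, it can be worth sanity-checking the dataset file. The snippet below is an illustrative stand-alone check, not part of LitGPT; it assumes an Alpaca-format JSON array like the one shown above.

```python
import json

def check_alpaca_file(path):
    """Verify every record carries the Alpaca keys with string values."""
    with open(path) as f:
        records = json.load(f)
    assert isinstance(records, list), "top level must be a JSON array"
    for i, rec in enumerate(records):
        for key in ("instruction", "input", "output"):
            assert key in rec, f"record {i} is missing '{key}'"
            assert isinstance(rec[key], str), f"record {i}: '{key}' is not a string"
    return len(records)
```

`check_alpaca_file("data/my_dataset.json")` returns the record count, or raises on the first malformed entry.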
**Step 3: Configure training**

```bash
# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4
```
**Step 4: Run fine-tuning**

Training saves checkpoints to `out/finetune/` automatically.

Monitor training:

```bash
# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
```

Workflow 2: LoRA fine-tuning on single GPU

Most memory-efficient option.

LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)

**Step 1: Choose base model**

For limited GPU memory (12-16GB):
  • Phi-2 (2.7B) - Best quality/size tradeoff
  • Llama 3 1B - Smallest, fastest
  • Gemma 2B - Good reasoning

**Step 2: Configure LoRA parameters**

```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false
```

Parameter notes:
  • --lora_r 16: LoRA rank (8-64; higher = more capacity)
  • --lora_alpha 32: LoRA scaling (typically 2×r)
  • --lora_dropout 0.05: prevents overfitting
  • --lora_query/--lora_value/--lora_projection true: apply LoRA to the query, value, and output projections
  • --lora_key/--lora_mlp/--lora_head false: usually not needed

LoRA rank guide:
  • r=8: Lightweight, 2-4MB adapters
  • r=16: Standard, good quality
  • r=32: High capacity, use for complex tasks
  • r=64: Maximum quality, 4× larger adapters

**Step 3: Train with LoRA**

```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora
```
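The adapter sizes in the rank guide follow from simple arithmetic: each adapted weight matrix gains two low-rank factors, A (r×d_in) and B (d_out×r), so the parameter count grows linearly with r. A back-of-envelope sketch — the hidden size of 2560 and 32-layer count for Phi-2 are assumptions here, not values read from LitGPT:

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

# Assumed Phi-2-like shape: hidden size 2560, 32 layers,
# LoRA applied to the (square) query and value projections only.
d, n_layer, adapted_per_layer = 2560, 32, 2
for r in (8, 16, 32, 64):
    total = n_layer * adapted_per_layer * lora_extra_params(d, d, r)
    print(f"r={r:2d}: {total:,} extra params (~{total * 2 / 1e6:.0f} MB in bf16)")
```

Doubling r doubles the adapter, consistent with the guide's note that r=64 adapters are 4× the size of r=16 ones.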

Memory usage: ~8-12GB for Phi-2 with LoRA

**Step 4: Merge LoRA weights** (optional)

Merge the LoRA adapters into the base model for deployment:

```bash
litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged
```

Now use the merged model:

```python
from litgpt import LLM

llm = LLM.load("out/phi2-merged")
```

Workflow 3: Pretrain from scratch

Train a new model on your domain data.

Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining

**Step 1: Prepare pretraining dataset**

LitGPT expects tokenized data. Use `prepare_dataset.py`:

```bash
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

**Step 2: Configure model architecture**

Edit a config file or use an existing one:

```yaml
# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
```
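As a sanity check on a config like this, the implied parameter count can be estimated with standard transformer bookkeeping: roughly 12·n_layer·n_embd² for the blocks plus the embedding and unembedding matrices. This is back-of-envelope arithmetic, not LitGPT code:

```python
def estimate_params(n_layer, n_embd, vocab_size, tied_embeddings=False):
    """Rough decoder-only transformer size.

    Per block: ~4*d^2 for attention + ~8*d^2 for a 4x-expansion MLP = 12*d^2.
    Plus vocab*d for the embedding and (if untied) the output head.
    """
    blocks = 12 * n_layer * n_embd ** 2
    embeddings = vocab_size * n_embd * (1 if tied_embeddings else 2)
    return blocks + embeddings

# Values from the pythia-160m config above
n = estimate_params(n_layer=12, n_embd=768, vocab_size=50304)
print(f"~{n / 1e6:.0f}M parameters")
```

The estimate lands near the 160M in the model's name; biases, layer norms, and rotary details shift the exact figure slightly.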

**Step 3: Set up multi-GPU training**

```bash
# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```

**Step 4: Launch pretraining**

For large-scale pretraining on a cluster:

```bash
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 pretrain_script.sh
```

pretrain_script.sh content:

```bash
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000
```
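Flags like these translate directly into a step budget: each optimizer step consumes global_batch_size × block_size tokens. A quick sketch of that arithmetic (plain Python; block_size 2048 is taken from the earlier pythia config):

```python
def pretraining_steps(max_tokens, global_batch_size, block_size):
    """Optimizer steps needed to consume a token budget."""
    tokens_per_step = global_batch_size * block_size
    return max_tokens // tokens_per_step

steps = pretraining_steps(
    max_tokens=300_000_000_000,   # --train.max_tokens
    global_batch_size=512,        # --train.global_batch_size
    block_size=2048,              # sequence length from the config
)
print(f"{steps:,} optimizer steps")
```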

Workflow 4: Convert and deploy model

Export LitGPT models for production.

Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API

**Step 1: Test inference locally**

```python
from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
```

**Step 2: Quantize model** (optional)

Reduce model size with minimal quality loss:

```bash
# 8-bit quantization (50% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.nf4

# 4-bit quantization (75% size reduction)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq  # Double quantization
```
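The quoted size reductions are just bytes-per-parameter arithmetic. An illustrative sketch (weight storage only, ignoring quantization metadata and activations):

```python
def checkpoint_size_gb(n_params, bits_per_param):
    """Approximate on-disk weight size in GB."""
    return n_params * bits_per_param / 8 / 1e9

phi2_params = 2.7e9  # Phi-2
for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit nf4", 4)]:
    print(f"{label}: ~{checkpoint_size_gb(phi2_params, bits):.1f} GB")
```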

**Step 3: Convert to GGUF** (for llama.cpp)

```bash
python scripts/convert_lit_checkpoint.py \
  --checkpoint_path out/phi2-lora/final \
  --output_path models/phi2.gguf \
  --model_name microsoft/phi-2
```

**Step 4: Deploy with API**

```python
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}
```

Run (with the code above saved as `api.py`): `uvicorn api:app --host 0.0.0.0 --port 8000`

When to use vs alternatives

Use LitGPT when:
  • You want to understand LLM architectures (clean, readable code)
  • You need production-ready training recipes
  • Educational purposes or research
  • Prototyping new model ideas
  • You already use the Lightning ecosystem

Use alternatives instead:
  • Axolotl/TRL: more fine-tuning features, YAML configs
  • Megatron-Core: maximum performance for >70B models
  • HuggingFace Transformers: broadest model support
  • vLLM: inference only (no training)

Common issues

**Issue: Out of memory during fine-tuning**

Use LoRA instead of full fine-tuning:

```bash
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB
```

Or accumulate gradients so each forward pass stays small:

```bash
litgpt finetune_lora \
  ... \
  --train.gradient_accumulation_iters 4  # Accumulate gradients
```

**Issue: Training too slow**

Flash Attention is built in and enabled automatically on compatible hardware: it is already on by default for Ampere+ GPUs (A100, RTX 30/40 series), with no configuration needed.

Use a smaller micro-batch and accumulate:

```bash
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32  # Effective batch=32
```
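These flags are tied together by one identity: effective batch = micro_batch_size × gradient_accumulation_iters × number of devices. A minimal sketch of that relation (plain arithmetic, not LitGPT internals):

```python
def effective_batch_size(micro_batch_size, accumulation_iters, devices=1):
    """Samples contributing to each optimizer step."""
    return micro_batch_size * accumulation_iters * devices

# Matches the flags above: micro batch 1, 32 accumulation iters, one GPU
assert effective_batch_size(1, 32) == 32
# On 8 GPUs the same global batch needs only 4 accumulation iters
assert effective_batch_size(1, 4, devices=8) == 32
```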
**Issue: Model not loading**

Check the model name:

```bash
# List all available models
litgpt download list

# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B
```

Verify the checkpoints directory:

```bash
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
```


**Issue: LoRA adapters too large**

Reduce the LoRA rank:

```bash
--lora_r 8  # Instead of 16 or 32
```

Apply LoRA to fewer layers:

```bash
# Disable the projection and MLP adapters to shrink the checkpoint
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
```

Advanced topics

Supported architectures: see references/supported-models.md for the complete list of 20+ model families (including Llama, Gemma, Phi, Qwen, Mistral, Mixtral, and Falcon) with sizes and capabilities.
Training recipes: see references/training-recipes.md for proven hyperparameter configurations for pretraining and fine-tuning.
FSDP configuration: see references/distributed-training.md for multi-GPU training with Fully Sharded Data Parallel.
Custom architectures: see references/custom-models.md for implementing new model architectures in LitGPT style.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
  • Memory:
    • Inference (Phi-2): 6GB
    • LoRA fine-tuning (7B): 16GB
    • Full fine-tuning (7B): 40GB+
    • Pretraining (1B): 24GB
  • Storage: 5-50GB per model (depending on size)

Resources
