moai-ml-llm-fine-tuning


LLM Fine-Tuning Expert


Parameter-Efficient Fine-Tuning (PEFT) for Enterprise LLMs
Focus: LoRA, QLoRA, Domain Adaptation
Models: Llama 3.1, Mistral, Mixtral, Falcon
Stack: PyTorch, Transformers, PEFT, bitsandbytes

Overview

Enterprise-grade fine-tuning strategies for customizing Large Language Models (LLMs) with minimal resource requirements.

Core Capabilities

  • Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, Prefix Tuning
  • Quantization: 4-bit/8-bit training with bitsandbytes
  • Distributed Training: Multi-GPU, DeepSpeed, FSDP
  • Optimization: Flash Attention 2, Gradient Checkpointing
  • Evaluation: Perplexity, BLEU, ROUGE, Domain benchmarks

Technology Stack

  • PEFT 0.13+: Adapter management
  • Transformers 4.45+: Model architecture
  • TRL 0.11+: Supervised Fine-Tuning (SFT), DPO
  • Accelerate 0.34+: Training loop orchestration
  • bitsandbytes 0.45+: Low-precision optimization

Fine-Tuning Strategies

| Method | Params Updated | VRAM (70B) | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | ~420GB | Foundation model creation |
| LoRA | 0.1-1% | ~180GB | Domain adaptation, style transfer |
| QLoRA | 0.1-1% | ~24GB | Consumer GPU training, cost efficiency |

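To see why LoRA touches so few parameters: each adapted weight matrix W of shape (d_out × d_in) gains two low-rank factors, A (r × d_in) and B (d_out × r), so the adapter adds only r·(d_in + d_out) parameters per matrix. A minimal sketch with illustrative Llama-3.1-8B-style dimensions (hidden size 4096, GQA key/value width 1024, FFN width 14336, 32 layers — values assumed here for illustration):

```python
def lora_trainable_params(r, shapes):
    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    return sum(r * d_in + d_out * r for (d_out, d_in) in shapes)

# Illustrative per-layer projection shapes (hidden 4096, GQA k/v width 1024,
# FFN width 14336 -- assumed Llama-3.1-8B-like values, 32 layers):
h, kv, ffn = 4096, 1024, 14336
layer_shapes = [
    (h, h), (kv, h), (kv, h), (h, h),   # q_proj, k_proj, v_proj, o_proj
    (ffn, h), (ffn, h), (h, ffn),       # gate_proj, up_proj, down_proj
]
per_layer = lora_trainable_params(16, layer_shapes)
total = per_layer * 32
print(f"{per_layer:,} trainable params per layer, {total:,} total")
```

Under these assumed shapes, rank 16 gives roughly 42M trainable parameters against ~8B base weights — about 0.5%, consistent with the 0.1-1% range in the table.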

Implementation Patterns

1. QLoRA Configuration (Recommended)

Efficient 4-bit training for large models.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def setup_qlora_model(model_id="meta-llama/Llama-3.1-8B"):
    # 1. Quantization Config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    # 2. Load Model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    # 3. Enable Gradient Checkpointing & K-bit Training
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # 4. LoRA Config
    peft_config = LoraConfig(
        r=16,                    # Rank
        lora_alpha=32,           # Alpha (scaling)
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    # 5. Apply Adapter
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    return model
```
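To make the adapter mechanics concrete: LoRA replaces y = xWᵀ with y = xWᵀ + (α/r)·xAᵀBᵀ, where B is initialized to zero so training starts from the base model's exact behavior. With r=16 and lora_alpha=32 as configured above, the effective scale α/r is 2. A minimal NumPy sketch of this update (shapes and values illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    # Base projection plus the scaled low-rank update (alpha / r is the LoRA scale).
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
x = rng.normal(size=(1, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))   # A: small random init
B = np.zeros((d_out, r))         # B: zero init, so the adapter starts as a no-op
y = lora_forward(x, W, A, B, alpha=32, r=r)
assert np.allclose(y, x @ W.T)   # identical to the base model before training
```

Only A and B are trained; W stays frozen (and, in QLoRA, quantized), which is where the memory savings come from.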

2. Training Loop with TRL

Supervised Fine-Tuning (SFT) using Hugging Face TRL.

```python
from trl import SFTTrainer
from transformers import TrainingArguments

def train_model(model, tokenizer, dataset, output_dir="./results"):
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        fp16=True,             # or bf16=True for Ampere+
        logging_steps=10,
        save_strategy="epoch",
        optim="paged_adamw_8bit",
        report_to="tensorboard"
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text", # Column containing formatted prompt
        max_seq_length=2048,
        tokenizer=tokenizer,
        args=training_args,
        packing=False,
    )

    trainer.train()
    trainer.save_model()
```
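One consequence of the arguments above worth making explicit: the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A quick sanity check (GPU counts illustrative):

```python
def effective_batch_size(per_device, accum_steps, num_gpus=1):
    # Gradients from accum_steps micro-batches on each GPU are accumulated
    # before a single optimizer step, so they all contribute to one update.
    return per_device * accum_steps * num_gpus

print(effective_batch_size(4, 4, num_gpus=1))  # 16 samples per update on one GPU
print(effective_batch_size(4, 4, num_gpus=4))  # 64 samples per update on four GPUs
```

Gradient accumulation is what lets a memory-constrained GPU emulate a larger batch at the cost of more forward/backward passes per update.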

3. Data Preparation

Formatting data for instruction tuning.

```python
from datasets import load_dataset

def format_instruction(sample):
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

def prepare_dataset(path="databricks/databricks-dolly-15k"):
    dataset = load_dataset(path, split="train")
    # Format for SFTTrainer
    dataset = dataset.map(lambda x: {"text": format_instruction(x)})
    return dataset
```
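A self-contained check of the prompt template (the toy record below is invented for illustration):

```python
def format_instruction(sample):
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

toy = {
    "instruction": "Translate to French.",
    "input": "Hello",
    "output": "Bonjour",
}
text = format_instruction(toy)
print(text)
```

The same template must be used verbatim at inference time (minus the response) or the fine-tuned model will see prompts it was never trained on.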

---

Advanced Techniques

Multi-GPU Distributed Training

Using Accelerate and DeepSpeed for scaling.

`config.yaml` for accelerate:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_optimization:
    stage: 2
```

Run command:

```bash
accelerate launch --config_file config.yaml train.py
```

Model Merging

Merging LoRA adapters back into the base model for deployment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_adapter(base_model_id, adapter_path, output_path):
    # Load base model in FP16 (not 4-bit)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Load adapter
    model = PeftModel.from_pretrained(base_model, adapter_path)

    # Merge adapter weights into the base weights
    model = model.merge_and_unload()

    # Save merged model and the tokenizer alongside it
    model.save_pretrained(output_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    tokenizer.save_pretrained(output_path)
```

Validation Checklist

Setup:
  • GPU environment verified (CUDA available)
  • Dependencies installed (peft, trl, bitsandbytes)
  • HuggingFace token configured
Data:
  • Dataset formatted correctly (Instruction/Input/Output)
  • Tokenization length checked (< context window)
  • Train/Val split created
Training:
  • QLoRA config applied (4-bit, nf4)
  • Gradient checkpointing enabled
  • Learning rate scheduled (warmup + decay)
  • Loss monitoring active
Evaluation:
  • Perplexity calculated
  • Generation quality manually verified
  • Adapter merged successfully

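For the perplexity item in the checklist: perplexity is the exponential of the mean per-token negative log-likelihood, i.e. the eval cross-entropy loss exponentiated. A minimal sketch (loss values illustrative; in practice you would exponentiate the eval loss reported by the trainer):

```python
import math

def perplexity_from_loss(mean_nll):
    # Perplexity = exp(mean per-token negative log-likelihood, in nats).
    return math.exp(mean_nll)

print(round(perplexity_from_loss(2.0), 3))  # a mean loss of 2.0 nats
print(round(perplexity_from_loss(1.5), 3))  # lower loss -> lower perplexity
```

Because the mapping is monotonic, any drop in eval loss is a drop in perplexity; the exponentiated form is just easier to compare across runs and datasets.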

Related Skills

相关技能

  • moai-domain-ml: General ML workflows
  • moai-domain-data-science: Data preparation
  • moai-essentials-perf: Inference optimization

Last Updated: 2025-11-20