moai-ml-llm-fine-tuning


LLM Fine-Tuning Expert


Parameter-Efficient Fine-Tuning (PEFT) for Enterprise LLMs
Focus: LoRA, QLoRA, Domain Adaptation
Models: Llama 3.1, Mistral, Mixtral, Falcon
Stack: PyTorch, Transformers, PEFT, bitsandbytes

Overview

Enterprise-grade fine-tuning strategies for customizing Large Language Models (LLMs) with minimal resource requirements.

Core Capabilities

  • Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, Prefix Tuning
  • Quantization: 4-bit/8-bit training with bitsandbytes
  • Distributed Training: Multi-GPU, DeepSpeed, FSDP
  • Optimization: Flash Attention 2, Gradient Checkpointing
  • Evaluation: Perplexity, BLEU, ROUGE, Domain benchmarks

Technology Stack

  • PEFT 0.13+: Adapter management
  • Transformers 4.45+: Model architecture
  • TRL 0.11+: Supervised Fine-Tuning (SFT), DPO
  • Accelerate 0.34+: Training loop orchestration
  • bitsandbytes 0.45+: Low-precision optimization

Fine-Tuning Strategies

| Method | Params Updated | VRAM (70B) | Use Case |
|---|---|---|---|
| Full Fine-Tuning | 100% | ~420GB | Foundation model creation |
| LoRA | 0.1-1% | ~180GB | Domain adaptation, style transfer |
| QLoRA | 0.1-1% | ~24GB | Consumer GPU training, cost efficiency |

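To see why LoRA touches so few parameters: each adapted weight matrix W of shape (d_out × d_in) gains two low-rank factors, A (r × d_in) and B (d_out × r), so the adapter adds only r·(d_in + d_out) parameters per matrix. A minimal sketch with illustrative Llama-3.1-8B-style dimensions (hidden size 4096, GQA key/value width 1024, FFN width 14336, 32 layers — values assumed here for illustration):

```python
def lora_trainable_params(r, shapes):
    # Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r).
    return sum(r * d_in + d_out * r for (d_out, d_in) in shapes)

# Illustrative per-layer projection shapes (hidden 4096, GQA k/v width 1024,
# FFN width 14336 -- assumed Llama-3.1-8B-like values, 32 layers):
h, kv, ffn = 4096, 1024, 14336
layer_shapes = [
    (h, h), (kv, h), (kv, h), (h, h),   # q_proj, k_proj, v_proj, o_proj
    (ffn, h), (ffn, h), (h, ffn),       # gate_proj, up_proj, down_proj
]
per_layer = lora_trainable_params(16, layer_shapes)
total = per_layer * 32
print(f"{per_layer:,} trainable params per layer, {total:,} total")
```

Under these assumed shapes, rank 16 gives roughly 42M trainable parameters against ~8B base weights — about 0.5%, consistent with the 0.1-1% range in the table.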

Implementation Patterns

1. QLoRA Configuration (Recommended)

Efficient 4-bit training for large models.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def setup_qlora_model(model_id="meta-llama/Llama-3.1-8B"):
    # 1. Quantization Config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    # 2. Load Model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

    # 3. Enable Gradient Checkpointing & K-bit Training
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # 4. LoRA Config
    peft_config = LoraConfig(
        r=16,                    # Rank
        lora_alpha=32,           # Alpha (scaling)
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    # 5. Apply Adapter
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    return model
```
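To make the adapter mechanics concrete: LoRA replaces y = xWᵀ with y = xWᵀ + (α/r)·xAᵀBᵀ, where B is initialized to zero so training starts from the base model's exact behavior. With r=16 and lora_alpha=32 as configured above, the effective scale α/r is 2. A minimal NumPy sketch of this update (shapes and values illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    # Base projection plus the scaled low-rank update (alpha / r is the LoRA scale).
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
x = rng.normal(size=(1, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))   # A: small random init
B = np.zeros((d_out, r))         # B: zero init, so the adapter starts as a no-op
y = lora_forward(x, W, A, B, alpha=32, r=r)
assert np.allclose(y, x @ W.T)   # identical to the base model before training
```

Only A and B are trained; W stays frozen (and, in QLoRA, quantized), which is where the memory savings come from.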

2. Training Loop with TRL

Supervised Fine-Tuning (SFT) using Hugging Face TRL.

```python
from trl import SFTTrainer
from transformers import TrainingArguments

def train_model(model, tokenizer, dataset, output_dir="./results"):
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        fp16=True,             # or bf16=True for Ampere+
        logging_steps=10,
        save_strategy="epoch",
        optim="paged_adamw_8bit",
        report_to="tensorboard"
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text", # Column containing formatted prompt
        max_seq_length=2048,
        tokenizer=tokenizer,
        args=training_args,
        packing=False,
    )

    trainer.train()
    trainer.save_model()
```
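One consequence of the arguments above worth making explicit: the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. A quick sanity check (GPU counts illustrative):

```python
def effective_batch_size(per_device, accum_steps, num_gpus=1):
    # Gradients from accum_steps micro-batches on each GPU are accumulated
    # before a single optimizer step, so they all contribute to one update.
    return per_device * accum_steps * num_gpus

print(effective_batch_size(4, 4, num_gpus=1))  # 16 samples per update on one GPU
print(effective_batch_size(4, 4, num_gpus=4))  # 64 samples per update on four GPUs
```

Gradient accumulation is what lets a memory-constrained GPU emulate a larger batch at the cost of more forward/backward passes per update.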

3. Data Preparation

Formatting data for instruction tuning.

```python
from datasets import load_dataset

def format_instruction(sample):
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

def prepare_dataset(path="databricks/databricks-dolly-15k"):
    dataset = load_dataset(path, split="train")
    # Format for SFTTrainer
    dataset = dataset.map(lambda x: {"text": format_instruction(x)})
    return dataset
```
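A self-contained check of the prompt template (the toy record below is invented for illustration):

```python
def format_instruction(sample):
    return f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""

toy = {
    "instruction": "Translate to French.",
    "input": "Hello",
    "output": "Bonjour",
}
text = format_instruction(toy)
print(text)
```

The same template must be used verbatim at inference time (minus the response) or the fine-tuned model will see prompts it was never trained on.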

---

Advanced Techniques

Multi-GPU Distributed Training

Using Accelerate and DeepSpeed for scaling.

`config.yaml` for accelerate:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_optimization:
    stage: 2
```

Run command:

```bash
accelerate launch --config_file config.yaml train.py
```

Model Merging

Merging LoRA adapters back into the base model for deployment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def merge_adapter(base_model_id, adapter_path, output_path):
    # Load base model in FP16 (not 4-bit)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Load adapter
    model = PeftModel.from_pretrained(base_model, adapter_path)

    # Merge adapter weights into the base weights
    model = model.merge_and_unload()

    # Save merged model and the tokenizer alongside it
    model.save_pretrained(output_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    tokenizer.save_pretrained(output_path)
```

Validation Checklist

Setup:
  • GPU environment verified (CUDA available)
  • Dependencies installed (peft, trl, bitsandbytes)
  • HuggingFace token configured
Data:
  • Dataset formatted correctly (Instruction/Input/Output)
  • Tokenization length checked (< context window)
  • Train/Val split created
Training:
  • QLoRA config applied (4-bit, nf4)
  • Gradient checkpointing enabled
  • Learning rate scheduled (warmup + decay)
  • Loss monitoring active
Evaluation:
  • Perplexity calculated
  • Generation quality manually verified
  • Adapter merged successfully

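For the perplexity item in the checklist: perplexity is the exponential of the mean per-token negative log-likelihood, i.e. the eval cross-entropy loss exponentiated. A minimal sketch (loss values illustrative; in practice you would exponentiate the eval loss reported by the trainer):

```python
import math

def perplexity_from_loss(mean_nll):
    # Perplexity = exp(mean per-token negative log-likelihood, in nats).
    return math.exp(mean_nll)

print(round(perplexity_from_loss(2.0), 3))  # a mean loss of 2.0 nats
print(round(perplexity_from_loss(1.5), 3))  # lower loss -> lower perplexity
```

Because the mapping is monotonic, any drop in eval loss is a drop in perplexity; the exponentiated form is just easier to compare across runs and datasets.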

Related Skills

相关技能

  • moai-domain-ml: General ML workflows
  • moai-domain-data-science: Data preparation
  • moai-essentials-perf: Inference optimization

Last Updated: 2025-11-20