LLM Fine-Tuning Guide


Master the art of fine-tuning large language models to create specialized models optimized for your specific use cases, domains, and performance requirements.

Overview


Fine-tuning adapts pre-trained LLMs to specific tasks, domains, or styles by training them on curated datasets. This improves accuracy, reduces hallucinations, and optimizes costs.

When to Fine-Tune


  • Domain Specialization: Legal documents, medical records, financial reports
  • Task-Specific Performance: Better results on specific tasks than base model
  • Cost Optimization: Smaller fine-tuned model replaces expensive large model
  • Style Adaptation: Match specific writing styles or tones
  • Compliance Requirements: Keep sensitive data within your infrastructure
  • Latency Requirements: Smaller models deploy faster

When NOT to Fine-Tune


  • One-off queries (use prompting instead)
  • Rapidly changing information (use RAG instead)
  • Limited training data (< 100 examples typically insufficient)
  • General knowledge questions (base model sufficient)

Quick Start


Full Fine-Tuning:
bash
python examples/full_fine_tuning.py
LoRA (Recommended for most cases):
bash
python examples/lora_fine_tuning.py
QLoRA (Single GPU):
bash
python examples/qlora_fine_tuning.py
Data Preparation:
bash
python scripts/data_preparation.py

Fine-Tuning Approaches


1. Full Fine-Tuning


Update all model parameters during training.
Pros:
  • Maximum performance improvement
  • Can completely rewrite model behavior
  • Best for significant domain shifts
Cons:
  • High computational cost
  • Requires large dataset (1000+ examples)
  • Risk of catastrophic forgetting
  • Long training time
python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

training_args = TrainingArguments(
    output_dir="./fine-tuned-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
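With the settings above, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs (the single-GPU count below is an assumption for illustration):

```python
# Effective batch size implied by the TrainingArguments above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # assumed for illustration

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 16
```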

2. Parameter-Efficient Fine-Tuning (PEFT)


Train only a small fraction of parameters.

LoRA (Low-Rank Adaptation)


Adds trainable low-rank matrices to existing weights.
Pros:
  • 99% fewer trainable parameters
  • Maintains base model knowledge
  • Fast training (10-20x faster)
  • Easy to switch between adapters
Cons:
  • Slightly lower performance than full fine-tuning
  • Requires base model at inference
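The mechanism can be sketched without any framework: the adapted layer computes h = W·x + (α/r)·B(A·x), where A and B are the small trainable matrices. The toy dimensions below are illustrative only:

```python
def matvec(M, v):
    # plain-Python matrix-vector product
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

hidden, r, alpha = 4, 2, 4
W = [[1.0 if i == j else 0.0 for j in range(hidden)] for i in range(hidden)]  # frozen weight (identity here)
A = [[0.1] * hidden for _ in range(r)]   # trainable, shape (r, hidden)
B = [[0.0] * r for _ in range(hidden)]   # trainable, shape (hidden, r), initialized to zero

x = [1.0, 2.0, 3.0, 4.0]
h = [w + (alpha / r) * d for w, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(h)  # [1.0, 2.0, 3.0, 4.0]
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly; training then moves only A and B.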
python
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

base_model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Configure LoRA
lora_config = LoraConfig(
    r=8,                                  # Rank of low-rank matrices
    lora_alpha=16,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06

# Train as normal
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

# Save only LoRA weights
model.save_pretrained("./llama-lora-adapter")
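The trainable-parameter count printed by print_trainable_parameters() can be sanity-checked by hand: for a Llama-2-7B-sized model (32 layers, hidden size 4096), adapting q_proj and v_proj with r=8 adds two rank-8 matrices per projection:

```python
# LoRA parameter count for Llama-2-7B with r=8 on q_proj and v_proj
num_layers = 32
hidden = 4096
r = 8
modules_per_layer = 2  # q_proj and v_proj

# each adapted Linear(hidden, hidden) gains A: (r, hidden) and B: (hidden, r)
params_per_module = r * hidden + hidden * r
total = num_layers * modules_per_layer * params_per_module
print(f"{total:,}")  # 4,194,304 — matches the output above
```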

QLoRA (Quantized LoRA)


Combines LoRA with quantization for extreme efficiency.
python
import torch
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, Trainer, TrainingArguments

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

# Train on a single GPU
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./qlora-output",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-4,
        num_train_epochs=3,
    ),
    train_dataset=train_dataset,
)
trainer.train()

Prefix Tuning


Prepends trainable tokens to input.
python
from peft import get_peft_model, PrefixTuningConfig

config = PrefixTuningConfig(
    num_virtual_tokens=20,
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, config)
Only the virtual-token prefixes are trained; in the usual per-layer key/value formulation that is roughly num_layers × 2 × num_virtual_tokens × hidden_dim parameters.
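Under the common per-layer key/value-prefix formulation, the trainable budget for the 20 virtual tokens above on a Llama-2-7B-sized model (32 layers, hidden size 4096) works out to:

```python
num_layers, hidden = 32, 4096
num_virtual_tokens = 20
kv = 2  # one key prefix and one value prefix per layer

trainable = num_layers * kv * num_virtual_tokens * hidden
print(f"{trainable:,}")  # 5,242,880 — still well under 0.1% of a 7B model
```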

3. Instruction Fine-Tuning


Train model to follow instructions with examples.
python
# Training data format
training_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
    {
        "instruction": "Summarize this text",
        "input": "Long document...",
        "output": "Summary..."
    }
]

# Template for training
template = """Below is an instruction that describes a task, paired with an input that provides further context.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Create formatted dataset
formatted_data = [template.format(**example) for example in training_data]

4. Domain-Specific Fine-Tuning


Tailor models for specific industries or fields.

Legal Domain Example


python
legal_training_data = [
    {
        "prompt": "What are the key clauses in an NDA?",
        "completion": """Key clauses typically include:
1. Definition of Confidential Information
2. Non-Disclosure Obligations
3. Permitted Disclosures
4. Term and Termination
5. Return of Information
6. Remedies"""
    },
    # ... more legal examples
]

# Train on the legal domain
# (fine_tune_on_domain is a placeholder for your training routine)
model = fine_tune_on_domain(
    base_model="gpt-3.5-turbo",
    training_data=legal_training_data,
    epochs=3,
    learning_rate=0.0002,
)

Data Preparation


1. Dataset Quality


python
class DatasetValidator:
    def validate_dataset(self, data):
        issues = {
            "empty_samples": 0,
            "duplicates": 0,
            "outliers": 0,
            "imbalance": {}
        }

        # Check for empty samples
        for sample in data:
            if not sample.get("text"):
                issues["empty_samples"] += 1

        # Check for duplicates
        texts = [s.get("text") for s in data]
        issues["duplicates"] = len(texts) - len(set(texts))

        # Check for length outliers (skip empty entries)
        lengths = [len(t.split()) for t in texts if t]
        mean_length = sum(lengths) / max(len(lengths), 1)
        issues["outliers"] = sum(1 for l in lengths if l > mean_length * 3)

        return issues
# Validate before training
validator = DatasetValidator()
issues = validator.validate_dataset(training_data)
print(f"Dataset Issues: {issues}")

2. Data Augmentation


python
from nlpaug.augmenter.word import SynonymAug, RandomWordAug
import nlpaug.flow as naf

text = "The quick brown fox jumps over the lazy dog"

# Synonym replacement
aug_syn = SynonymAug(aug_p=0.3)
augmented_syn = aug_syn.augment(text)

# Random word deletion
# (RandomWordAug supports substitute/swap/delete/crop, not insert)
aug_delete = RandomWordAug(action="delete", aug_p=0.3)
augmented_delete = aug_delete.augment(text)

# Combine augmentations
flow = naf.Sequential([
    SynonymAug(aug_p=0.2),
    RandomWordAug(action="swap", aug_p=0.2),
])
augmented = flow.augment(text)

3. Train/Validation Split


python
from sklearn.model_selection import train_test_split

# Create splits: 80% train, 10% eval, 10% test
train_data, eval_data = train_test_split(data, test_size=0.2, random_state=42)
eval_data, test_data = train_test_split(eval_data, test_size=0.5, random_state=42)

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}, Test: {len(test_data)}")
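If scikit-learn is unavailable, the same 80/10/10 split can be done with the standard library; the toy integer data below stands in for real samples:

```python
import random

data = list(range(100))   # stand-in for real samples
rng = random.Random(42)   # fixed seed for reproducibility
rng.shuffle(data)

n = len(data)
train_data = data[: int(0.8 * n)]
eval_data = data[int(0.8 * n): int(0.9 * n)]
test_data = data[int(0.9 * n):]
print(len(train_data), len(eval_data), len(test_data))  # 80 10 10
```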

Training Techniques


1. Learning Rate Scheduling


python
from transformers import TrainingArguments, get_linear_schedule_with_warmup

# Linear warmup + linear decay, built manually
def get_scheduler(optimizer, num_steps):
    return get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=500,
        num_training_steps=num_steps,
    )

# Or let the Trainer build it: linear warmup + cosine annealing
# (warmup_steps takes precedence over warmup_ratio if both are set)
training_args = TrainingArguments(
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
)

2. Gradient Accumulation


python
training_args = TrainingArguments(
    gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps
    per_device_train_batch_size=1,   # Effective batch size: 1 * 4 = 4
)

Simulates a larger batch size on limited GPU memory.

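Conceptually, the accumulation loop scales each micro-batch gradient and only steps the optimizer every N micro-batches; a framework-free sketch with toy gradient values:

```python
accumulation_steps = 4
micro_batch_grads = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1]  # toy values

accumulated = 0.0
optimizer_updates = 0
for step, g in enumerate(micro_batch_grads, start=1):
    accumulated += g / accumulation_steps   # scale so the sum matches one large-batch mean
    if step % accumulation_steps == 0:
        optimizer_updates += 1              # optimizer.step()
        accumulated = 0.0                   # optimizer.zero_grad()

print(optimizer_updates)  # 2 optimizer updates for 8 micro-batches
```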

3. Mixed Precision Training


python
training_args = TrainingArguments(
    fp16=True,   # Use 16-bit floats
    bf16=False,  # prefer bf16=True on Ampere or newer GPUs
)

Roughly halves memory usage and speeds up training.


4. Multi-GPU Training


python
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    dataloader_pin_memory=True,
    dataloader_num_workers=4,
)

python
# Automatically uses all available GPUs
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

Popular Models for Fine-Tuning


Open Source Models


Llama 3.2 (Meta)


python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Fine-tune on custom data
# ... training code

**Characteristics**:
- 1B and 3B text models, plus 11B and 90B vision models
- Strong instruction-following
- Excellent for domain adaptation
- Released under the Llama community license


Gemma 3 (Google)


python
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

# Gemma 3 sizes: 1B, 4B, 12B, 27B
# Very efficient, great for fine-tuning

**Characteristics**:
- Small, medium, large sizes
- Efficient architecture
- Good for edge deployment
- Built on cutting-edge research


Mistral 7B


python
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Strong performance, efficient architecture

**Characteristics**:
- Sliding window attention
- Efficient inference
- Strong performance-to-size ratio


Commercial Models


OpenAI Fine-Tuning API


python
from openai import OpenAI

client = OpenAI()

# Prepare training data
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 0.1,
    },
)

# Check status (poll until the job succeeds)
job = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
print(f"Status: {job.status}")

# Use fine-tuned model
response = client.chat.completions.create(
    model=job.fine_tuned_model,
    messages=[{"role": "user", "content": "Hello"}],
)
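The training_data.jsonl file above holds one chat-formatted example per line. A minimal stdlib sketch for producing it from instruction-style records (the field names follow the instruction-tuning example earlier; joining instruction and input with a blank line is a choice of this sketch, not an OpenAI requirement):

```python
import json

instruction_examples = [
    {"instruction": "Translate to French",
     "input": "Hello, how are you?",
     "output": "Bonjour, comment allez-vous?"},
]

def to_chat_jsonl(examples):
    lines = []
    for ex in examples:
        record = {"messages": [
            {"role": "user", "content": f"{ex['instruction']}\n\n{ex['input']}"},
            {"role": "assistant", "content": ex["output"]},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl_text = to_chat_jsonl(instruction_examples)
# each line is a standalone JSON object with a "messages" list
```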

Evaluation and Metrics


1. Perplexity


python
import torch
from math import exp

def calculate_perplexity(model, eval_dataset):
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in eval_dataset:
            outputs = model(**batch)
            # outputs.loss is the mean loss per token, so weight by token count
            n_tokens = batch["input_ids"].numel()
            total_loss += outputs.loss.item() * n_tokens
            total_tokens += n_tokens

    return exp(total_loss / total_tokens)

perplexity = calculate_perplexity(model, eval_dataset)
print(f"Perplexity: {perplexity:.2f}")
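As a numeric sanity check: perplexity is just the exponentiated mean token loss, so uniform per-token losses of 2.0 nats give exp(2.0):

```python
from math import exp, isclose

token_losses = [2.0, 2.0, 2.0, 2.0]  # toy per-token cross-entropy values (nats)
perplexity = exp(sum(token_losses) / len(token_losses))
print(round(perplexity, 2))  # 7.39
```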

2. Task-Specific Metrics


python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_task(predictions, ground_truth):
    return {
        "accuracy": accuracy_score(ground_truth, predictions),
        "precision": precision_score(ground_truth, predictions, average='weighted'),
        "recall": recall_score(ground_truth, predictions, average='weighted'),
        "f1": f1_score(ground_truth, predictions, average='weighted'),
    }
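For intuition, the binary case of these metrics can be computed by hand (toy labels below are illustrative):

```python
predictions = [1, 0, 1, 1]
ground_truth = [1, 0, 0, 1]

tp = sum(p == 1 and g == 1 for p, g in zip(predictions, ground_truth))  # 2
fp = sum(p == 1 and g == 0 for p, g in zip(predictions, ground_truth))  # 1
fn = sum(p == 0 and g == 1 for p, g in zip(predictions, ground_truth))  # 0

precision = tp / (tp + fp)   # 2/3
recall = tp / (tp + fn)      # 1.0
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.8
```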

# Evaluate on the task
predictions = [model.predict(x) for x in test_data]
metrics = evaluate_task(predictions, test_labels)
print(f"Metrics: {metrics}")

3. Human Evaluation


python
class HumanEvaluator:
    def evaluate_response(self, prompt, response):
        criteria = {
            "relevance": self._score_relevance(prompt, response),
            "coherence": self._score_coherence(response),
            "factuality": self._score_factuality(response),
            "helpfulness": self._score_helpfulness(response),
        }
        return sum(criteria.values()) / len(criteria)

    # Each rubric below returns a human-assigned score from 1-5
    def _score_relevance(self, prompt, response):
        ...

    def _score_coherence(self, response):
        ...

    def _score_factuality(self, response):
        ...

    def _score_helpfulness(self, response):
        ...

Common Challenges & Solutions


Challenge: Catastrophic Forgetting


Model forgets pre-trained knowledge while adapting to new domain.
Solutions:
  • Use lower learning rates (2e-5 to 5e-5)
  • Smaller training epochs (1-3)
  • Regularization techniques
  • Continual learning approaches
python
# Conservative training settings
training_args = TrainingArguments(
    learning_rate=2e-5,          # Lower learning rate
    num_train_epochs=2,          # Few epochs
    weight_decay=0.01,           # L2 regularization
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,
)

Challenge: Overfitting


Model performs well on training data but poorly on new data.
Solutions:
  • Use more training data
  • Implement dropout
  • Early stopping
  • Validation monitoring
python
from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,  # saves must line up with evals for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Early stopping is a Trainer callback, not a TrainingArguments option
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

Challenge: Insufficient Training Data


Few examples for fine-tuning.
Solutions:
  • Data augmentation
  • Use PEFT (LoRA) instead of full fine-tuning
  • Few-shot learning with prompting
  • Transfer learning
python
# Use LoRA when data is limited
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

Best Practices


Before Fine-Tuning


  • ✓ Start with a strong base model
  • ✓ Prepare high-quality training data (100+ examples recommended)
  • ✓ Define clear evaluation metrics
  • ✓ Set up proper train/validation splits
  • ✓ Document your objectives

During Fine-Tuning


  • ✓ Monitor training/validation loss
  • ✓ Use appropriate learning rates
  • ✓ Save checkpoints regularly
  • ✓ Validate on held-out data
  • ✓ Watch for overfitting/underfitting

After Fine-Tuning


  • ✓ Evaluate on test set
  • ✓ Compare against baseline
  • ✓ Perform qualitative analysis
  • ✓ Document configuration and results
  • ✓ Version your fine-tuned models

Implementation Checklist


  • Determine fine-tuning approach (full, LoRA, QLoRA, instruction)
  • Prepare and validate training dataset (100+ examples)
  • Choose base model (Llama 3.2, Gemma 3, Mistral, etc.)
  • Set up PEFT if using parameter-efficient methods
  • Configure training arguments and hyperparameters
  • Implement data loading and preprocessing
  • Set up evaluation metrics
  • Train model with monitoring
  • Evaluate on test set
  • Save and version fine-tuned model
  • Test in production environment
  • Document process and results

Resources


Papers


  • "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al.)
  • "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al.)