huggingface_transformers


Hugging Face Transformers Best Practices

Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.

Quick Reference

When to use this skill:
  • Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
  • Fine-tuning models on custom data
  • Implementing NLP tasks (classification, QA, generation, etc.)
  • Optimizing inference (quantization, ONNX, etc.)
  • Debugging tokenization issues
  • Using Hugging Face pipelines
  • Deploying transformers to production
Models covered:
  • Encoders: BERT, RoBERTa, DeBERTa, ALBERT
  • Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
  • Encoder-Decoders: T5, BART, Flan-T5
  • Vision: ViT, CLIP, Stable Diffusion

Part 1: Model Loading Patterns

Pattern 1: Basic Model Loading

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# For specific tasks
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # For 3-class classification
)
```

Pattern 2: Loading with Specific Configuration

```python
from transformers import AutoConfig, AutoModel

# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2  # Custom dropout
config.attention_probs_dropout_prob = 0.2

# Load pretrained weights with the custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

# Or create a randomly initialized model from the config
model = AutoModel.from_config(config)
```

Pattern 3: Loading Quantized Models (Memory Efficient)

```python
from transformers import AutoModel, BitsAndBytesConfig
import torch

# 8-bit quantization (~50% memory reduction)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"  # Automatic device placement
)

# 4-bit quantization (~75% memory reduction)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
```

Pattern 4: Loading from Local Path

```python
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load from the local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
```

---

Part 2: Tokenization Best Practices

Critical Tokenization Patterns

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ CORRECT: all key arguments specified
tokens = tokenizer(
    text,
    padding=True,        # Pad to longest in batch
    truncation=True,     # Truncate to max_length
    max_length=512,      # Maximum sequence length
    return_tensors="pt"  # Return PyTorch tensors
)

# Access components
input_ids = tokens['input_ids']                 # Token IDs
attention_mask = tokens['attention_mask']       # Padding mask
token_type_ids = tokens.get('token_type_ids')   # Segment IDs (BERT)

# ❌ WRONG: missing critical arguments
tokens = tokenizer(text)  # No padding, truncation, or tensor format!
```

Batch Tokenization

```python
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
tokens = tokenizer(
    texts,
    padding=True,        # Pad all to longest in batch
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Result shape: [batch_size, max_len_in_batch]
print(tokens['input_ids'].shape)  # torch.Size([3, max_len_in_batch])
```
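To make the padded shapes concrete, here is a dependency-free sketch of what the tokenizer's padding does under the hood (the token IDs are toy values, not real BERT vocabulary entries):

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-ID lists and build the matching attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The zeros in the mask are exactly the positions the model should ignore, which is why forgetting `attention_mask` silently degrades results on padded batches.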

Special Token Handling

```python
# Add special tokens
tokenizer.add_special_tokens({
    'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})

# Resize model embeddings to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True, return_tensors="pt")

# Decode (tokens['input_ids'][0] is the first sequence in the batch)
decoded = tokenizer.decode(tokens['input_ids'][0], skip_special_tokens=False)
```

Tokenization for Different Tasks

```python
# Text classification (single sequence)
tokens = tokenizer(
    "This movie was great!",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
tokens = tokenizer(
    question,
    context,
    padding="max_length",
    truncation="only_second",  # Only truncate the context
    max_length=384,
    return_tensors="pt"
)

# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for a single generation input
```

---

Part 3: Fine-Tuning Workflows

Pattern 1: Simple Fine-Tuning with Trainer

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("glue", "mrpc")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers versions
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# 5. Define metrics (datasets.load_metric is deprecated; use the evaluate library)
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Save
trainer.save_model("./fine-tuned-model")
```

Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# Load base model in 8-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # LoRA rank
    lora_alpha=32,   # LoRA scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%

# Train with Trainer exactly as before; only the LoRA parameters are updated!
```
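The tiny trainable fraction follows directly from the LoRA construction: each adapted weight matrix gains two low-rank factors, A of shape (r x d_in) and B of shape (d_out x r). A back-of-the-envelope check (the 4096 dimension is illustrative, not an exact Llama-2 figure):

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters added by one LoRA adapter: B (d_out x r) + A (r x d_in)."""
    return d_out * r + r * d_in

# One 4096x4096 projection with rank 8 adds ~65K trainable params
# on top of ~16.8M frozen ones
full = 4096 * 4096
added = lora_params(4096, 4096, 8)
print(added)                    # 65536
print(f"{added / full:.4%}")    # well under 1% of the frozen layer
```

This is why raising `r` trades a modest parameter increase for more adapter capacity.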

Pattern 3: Custom Training Loop

```python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use torch's
from torch.utils.data import DataLoader
from transformers import get_scheduler

# Prepare dataloaders
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Evaluation after each epoch
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        # Compute metrics here
```

---

Part 4: Pipeline Usage (High-Level API)

Text Classification Pipeline

```python
from transformers import pipeline

# Load pipeline (the model is downloaded and cached on first use)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch prediction
results = classifier([
    "Great service!",
    "Terrible experience",
    "Average quality"
])
```

Question Answering Pipeline

```python
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

result = qa_pipeline(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# {'score': 0.98, 'start': 46, 'end': 51, 'answer': 'Paris'}
```
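The `start` and `end` fields are character offsets into the context string, so the answer can always be recovered by slicing. A quick check with the example context (the score is illustrative):

```python
context = "France is a country in Europe. Its capital is Paris, a beautiful city."
result = {"score": 0.98, "start": 46, "end": 51, "answer": "Paris"}

# start/end index characters in the context, end-exclusive
answer = context[result["start"]:result["end"]]
print(answer)  # Paris
```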

Text Generation Pipeline

```python
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Once upon a time",
    max_length=50,
    num_return_sequences=3,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

for output in outputs:
    print(output['generated_text'])
```
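To build intuition for the sampling knobs above, here is a dependency-free sketch of temperature scaling and top-k filtering on a toy logit vector (the numbers are made up, not real model logits):

```python
import math

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k, temperature=1.0):
    """Scale logits by temperature, keep the k largest, zero out the rest."""
    scaled = [x / temperature for x in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    filtered = [x if x >= cutoff else float("-inf") for x in scaled]
    return softmax(filtered)

probs = top_k_probs([2.0, 1.0, 0.1, -1.0], k=2, temperature=0.7)
print([round(p, 3) for p in probs])  # Only the top two tokens keep any mass
```

Lower temperature sharpens the distribution toward the top token; `top_k` and `top_p` simply restrict which tokens are eligible before renormalizing.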

Zero-Shot Classification Pipeline

```python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "This is a course about Python programming.",
    candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
```

---

Part 5: Inference Optimization

Optimization 1: Batch Processing

```python
# ❌ SLOW: process one text at a time
for text in texts:
    output = model(**tokenizer(text, return_tensors="pt"))

# ✅ FAST: process in batches
batch_size = 32
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
```
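The batching loop generalizes to a small helper; the chunking logic is framework-free:

```python
def chunks(items, size):
    """Yield successive fixed-size chunks from a list (last chunk may be smaller)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"text {i}" for i in range(70)]
batches = list(chunks(texts, 32))
print([len(b) for b in batches])  # [32, 32, 6]
```

Each chunk can then be tokenized and passed through the model in one forward pass, as in the fast loop above.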

Optimization 2: Mixed Precision (AMP)

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Optimization 3: ONNX Export

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Export to ONNX via optimum (export=True converts the PyTorch checkpoint)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", export=True
)
ort_model.save_pretrained("./onnx-model")

# Later: load the exported ONNX model directly
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")

# Inference (often 2-3x faster, especially on CPU)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)
```

Optimization 4: Dynamic Quantization

```python
import torch

# Quantize Linear layers to int8 (CPU inference)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Which layer types to quantize
    dtype=torch.qint8
)
# ~4x smaller model, 2-3x faster inference on CPU
```
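The "4x smaller" figure is simply int8 versus float32 storage for the quantized Linear weights; a quick sanity check (768x768 is BERT-base's hidden projection size):

```python
def linear_weight_bytes(d_out, d_in, bytes_per_param):
    """Storage for one Linear layer's weight matrix."""
    return d_out * d_in * bytes_per_param

fp32 = linear_weight_bytes(768, 768, 4)  # float32: 4 bytes per parameter
int8 = linear_weight_bytes(768, 768, 1)  # int8 after dynamic quantization
print(fp32 // int8)  # 4
```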


---


Part 6: Common Issues & Solutions

Issue 1: CUDA Out of Memory

Problem:
RuntimeError: CUDA out of memory
Solutions:
```python
# Solution 1: Reduce the batch size; keep the effective batch via accumulation
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,  # Was 32
    gradient_accumulation_steps=4,  # Effective batch = 8 * 4 = 32
)

# Solution 2: Enable gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()

# Solution 3: Load the model in 8-bit
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)

# Solution 4: Clear the CUDA cache between runs
import torch
torch.cuda.empty_cache()
```
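Gradient accumulation works because the optimizer still sees the same effective batch size; the arithmetic is simply:

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Effective batch size seen by the optimizer under gradient accumulation."""
    return per_device_batch * accumulation_steps * num_devices

# Halving the per-device batch while doubling accumulation leaves training unchanged
print(effective_batch_size(8, 4))   # 32
print(effective_batch_size(16, 2))  # 32
```

Only per-step activation memory shrinks; total gradient signal per optimizer step stays the same.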

Issue 2: Slow Tokenization

Problem: Tokenization is the bottleneck.
Solutions:
```python
# Solution 1: Use a fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Solution 2: Tokenize the dataset once and cache the result
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,                           # Parallel processing
    remove_columns=dataset.column_names,
    load_from_cache_file=True             # Reuse cached results
)

# Solution 3: Tokenize texts as one list instead of one at a time
# (the tokenizer call itself has no batched=/batch_size= arguments;
# those belong to dataset.map -- just pass a list of strings)
tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

Issue 3: Inconsistent Results

Problem: The model produces different outputs for the same input.
Solution:
```python
# Set seeds for reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Disable dropout during inference
model.eval()

# Use deterministic (greedy) decoding...
outputs = model.generate(inputs, do_sample=False)

# ...or sample with a fixed seed (generate() has no seed argument;
# re-seed with set_seed() before each sampling call instead)
set_seed(42)
outputs = model.generate(inputs, do_sample=True, temperature=1.0, top_k=50)
```

Issue 4: Attention Mask Errors

Problem:
IndexError: index out of range in self
Solution:
```python
# ✅ ALWAYS provide the attention mask
tokens = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Explicit (usually the default)
)

# Pass it in the model's forward call
outputs = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']  # Don't forget this!
)

# For custom padding, build the mask yourself
attention_mask = (input_ids != tokenizer.pad_token_id).long()
```

---

Part 7: Model-Specific Patterns

GPT Models (Decoder-Only)

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Generation (temperature/top_p only apply when do_sample=True)
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    no_repeat_ngram_size=2  # Prevent repetition
)
# Alternatively, deterministic beam search:
# outputs = model.generate(**inputs, max_new_tokens=50, num_beams=5, early_stopping=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

T5 Models (Encoder-Decoder)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 expects a task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"
```

BERT Models (Encoder-Only)

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the [MASK] position
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]

# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))
# capital, city, center, heart, ...
```

---

Part 8: Production Deployment

FastAPI Serving Pattern

```python
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
import uvicorn

app = FastAPI()

# Load the model once at startup, not per request
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

class TextInput(BaseModel):
    text: str

@app.post("/classify")
async def classify_text(input: TextInput):
    result = classifier(input.text)[0]
    return {
        "label": result['label'],
        "confidence": result['score']
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Batch Inference Optimization


python
import asyncio
import torch
from typing import List

class BatchPredictor:
    """Collects concurrent requests into batches to improve GPU throughput."""

    def __init__(self, model, tokenizer, max_batch_size=32):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((text, future))

            # Flush when the batch is full. In production, also flush on a
            # timer so partially filled batches don't wait indefinitely.
            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()

        return await future

    async def _process_batch(self):
        if not self.queue:
            return

        texts, futures = zip(*self.queue)
        self.queue = []

        # Tokenize and run the whole batch in a single forward pass
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()

        # Resolve each caller's future with its own prediction
        for future, result in zip(futures, results):
            future.set_result(result)
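The future-based handoff above can be exercised without a model. This standalone sketch (`TinyBatcher` and its uppercase "batch function" are illustrative stand-ins, not part of any library) shows two concurrent callers being resolved by one batched call:

```python
import asyncio

class TinyBatcher:
    """Minimal sketch of the same future-based batching pattern, no model needed."""

    def __init__(self, batch_fn, max_batch_size=2):
        self.batch_fn = batch_fn          # runs once per full batch
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, item):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((item, future))
            if len(self.queue) >= self.max_batch_size:
                items, futures = zip(*self.queue)
                self.queue = []
                # One "forward pass" resolves every waiting caller
                for fut, res in zip(futures, self.batch_fn(list(items))):
                    fut.set_result(res)
        return await future

async def main():
    batcher = TinyBatcher(lambda xs: [x.upper() for x in xs], max_batch_size=2)
    return await asyncio.gather(batcher.predict("a"), batcher.predict("b"))

results = asyncio.run(main())
print(results)  # ['A', 'B']
```

Because each queued item carries its own future, results reach the right caller regardless of which coroutine the event loop schedules first.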

Quick Decision Trees


"Which model should I use?"


*Hugging Face Transformers v1.1 - Enhanced*

🔄 Workflow


Stage 1: Model Selection

  • Task: Choose the architecture that best fits the task (encoder: classification, decoder: generation).
  • License: Check whether the model permits commercial use (Apache 2.0 vs. the Llama Community License).
  • Size: Balance parameter count against performance (7B is often enough).

Stage 2: Optimization Pipeline

  • Quantization: Use 4-bit / 8-bit quantization (BitsAndBytes) for inference.
  • Batching: Process requests in batches rather than one at a time (GPU throughput).
  • Format: Convert to ONNX or TensorRT for production.
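As a sketch of the quantization step, loading a model in 4-bit with bitsandbytes typically looks like the following. This is a configuration sketch, not a tested snippet: the model id is illustrative, and it requires a CUDA GPU with the `bitsandbytes` package installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization (requires CUDA + bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```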

Stage 3: Deployment

  • Cache: Don't bake model weights and the tokenizer into the Docker image; mount them from a volume.
  • Token Limits: Define a strategy (e.g. chunking) for inputs that exceed the context window.
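The chunking strategy for over-long inputs can be sketched as a sliding window over token ids. The function name and parameters here (`chunk_token_ids`, `max_len`, `stride`) are hypothetical; in practice the tokenizer's own `stride` and `return_overflowing_tokens` options often do this for you.

```python
def chunk_token_ids(ids, max_len=512, stride=128):
    """Split a token-id sequence into overlapping windows of at most max_len,
    with `stride` tokens of overlap so context isn't cut mid-thought."""
    if max_len <= stride:
        raise ValueError("max_len must exceed stride")
    if len(ids) <= max_len:
        return [ids]
    chunks = []
    step = max_len - stride
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
    return chunks

windows = chunk_token_ids(list(range(1000)), max_len=512, stride=128)
print([len(w) for w in windows])  # [512, 512, 232]
print(windows[1][0])              # 384: each window overlaps the previous by 128 tokens
```

Predictions over the chunks then need aggregating (e.g. max or mean over chunk logits), which is task-specific.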

Checkpoints

  • Stage 1: Does the model fit in GPU memory (no OOM errors)?
  • Stage 2: Is inference latency below the target?
  • Stage 3: Are the tokenizer and model compatible (same vocabulary)?
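The first checkpoint can be estimated before downloading anything: weight memory is roughly parameter count times bytes per parameter (4 for fp32, 2 for fp16/bf16, ~0.5 for 4-bit), and real usage is higher once activations and the KV cache are added. A back-of-the-envelope helper (`model_memory_gb` is a hypothetical name):

```python
def model_memory_gb(n_params, bytes_per_param=2.0):
    """Rough weight-only memory estimate in GiB.
    bytes_per_param: 4.0 fp32, 2.0 fp16/bf16, 1.0 int8, ~0.5 int4.
    Real usage is higher: activations, KV cache, optimizer state."""
    return n_params * bytes_per_param / 1024**3

# A 7B model in fp16 needs ~13 GiB just for weights -> tight on a 16 GB GPU
print(round(model_memory_gb(7_000_000_000), 1))        # 13.0
print(round(model_memory_gb(7_000_000_000, 0.5), 1))   # 3.3 in 4-bit
```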
Classification → BERT, RoBERTa, DeBERTa
Generation → GPT-2, GPT-Neo, LLaMA
Translation/Summarization → T5, BART, mT5
Question Answering → BERT, DeBERTa, RoBERTa

Performance vs Speed?
  Best performance → Large models (355M+ params)
  Balanced → Base models (110M params)
  Fast inference → Distilled models (66M params)

"How should I fine-tune?"


Have full dataset control?
  YES → Full fine-tuning or LoRA
  NO → Few-shot prompting

Dataset size?
  Large (>10K examples) → Full fine-tuning
  Medium (1K-10K) → LoRA or full fine-tuning
  Small (<1K) → LoRA or prompt engineering

Compute available?
  Limited → LoRA (4-bit quantized)
  Moderate → LoRA (8-bit)
  High → Full fine-tuning
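To see why LoRA dominates the limited-compute branches: each adapted weight matrix is frozen, and a low-rank update BA is trained instead, adding only 2·d·r parameters per matrix. A quick calculation (the layer and matrix counts below are illustrative, roughly matching a LLaMA-7B-style model):

```python
def lora_trainable_params(d_model, rank, n_matrices):
    """Parameters added by LoRA: two low-rank factors A (rank x d) and
    B (d x rank) per adapted weight matrix, i.e. 2 * d * rank each."""
    return 2 * d_model * rank * n_matrices

# e.g. adapting the 4 attention projections in each of 32 layers, d=4096, r=8
added = lora_trainable_params(4096, 8, 4 * 32)
print(added)                 # 8388608
print(f"{added / 7e9:.2%}")  # ~0.12% of a 7B model's parameters
```

Training ~0.1% of the weights is why LoRA fits on a single consumer GPU, especially combined with 4-bit quantization of the frozen base model (QLoRA).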


Resources



Skill version: 1.0.0
Last updated: 2025-10-25
Maintained by: Applied Artificial Intelligence
