huggingface_transformers
Hugging Face Transformers Best Practices
Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.
Quick Reference
When to use this skill:
- Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
- Fine-tuning models on custom data
- Implementing NLP tasks (classification, QA, generation, etc.)
- Optimizing inference (quantization, ONNX, etc.)
- Debugging tokenization issues
- Using Hugging Face pipelines
- Deploying transformers to production
Models covered:
- Encoders: BERT, RoBERTa, DeBERTa, ALBERT
- Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
- Encoder-Decoders: T5, BART, Flan-T5
- Vision: ViT, CLIP, Stable Diffusion
Part 1: Model Loading Patterns
Pattern 1: Basic Model Loading
python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# For specific tasks
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # For 3-class classification
)

Pattern 2: Loading with Specific Configuration
python
from transformers import AutoConfig, AutoModel

# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2  # Custom dropout
config.attention_probs_dropout_prob = 0.2

# Load model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

# Or create a model from scratch with the config (random weights)
model = AutoModel.from_config(config)

Pattern 3: Loading Quantized Models (Memory Efficient)
python
from transformers import AutoModel, BitsAndBytesConfig
import torch

# 8-bit quantization (roughly 50% memory reduction)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"  # Automatic device placement
)

# 4-bit quantization (roughly 75% memory reduction)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

Pattern 4: Loading from Local Path
python
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load from local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

---

Part 2: Tokenization Best Practices
Critical Tokenization Patterns
python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ CORRECT: All required arguments
tokens = tokenizer(
    text,
    padding=True,        # Pad to longest in batch
    truncation=True,     # Truncate to max_length
    max_length=512,      # Maximum sequence length
    return_tensors="pt"  # Return PyTorch tensors
)

# Access components
input_ids = tokens['input_ids']                # Token IDs
attention_mask = tokens['attention_mask']      # Padding mask
token_type_ids = tokens.get('token_type_ids')  # Segment IDs (BERT)

# ❌ WRONG: Missing critical arguments
tokens = tokenizer(text)  # No padding, truncation, or tensor format!

Batch Tokenization
python
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
tokens = tokenizer(
    texts,
    padding=True,  # Pad all to longest in batch
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Result shape: [batch_size, max_len_in_batch]
print(tokens['input_ids'].shape)  # torch.Size([3, max_len_in_batch])
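The padding and attention-mask semantics above can be sketched without the library. The following toy whitespace tokenizer is an illustration only (not the real WordPiece algorithm; the vocabulary and IDs are made up): it truncates, pads the batch to its longest sequence, and builds the matching mask.

```python
# Toy sketch of batch padding semantics (illustration only, not the
# real Hugging Face tokenizer; vocabulary and IDs are made up).
PAD_ID = 0

def toy_tokenize_batch(texts, max_length=128):
    vocab = {}
    batch_ids = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # 0 is reserved for padding
            ids.append(vocab[word])
        batch_ids.append(ids[:max_length])    # truncation=True analogue
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

tokens = toy_tokenize_batch(["First text", "Second text here"])
print(tokens["attention_mask"])  # [[1, 1, 0], [1, 1, 1]]
```

The mask marks real tokens with 1 and padding with 0, which is exactly what the model's attention layers consume.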
Special Token Handling
python
# Add special tokens
tokenizer.add_special_tokens({
    'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})

# Resize model embeddings to match
model.resize_token_embeddings(len(tokenizer))

# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True, return_tensors="pt")

# Decode
decoded = tokenizer.decode(tokens['input_ids'][0], skip_special_tokens=False)

Tokenization for Different Tasks
python
# Text classification (single sequence)
tokens = tokenizer(
    "This movie was great!",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
tokens = tokenizer(
    question,
    context,
    padding="max_length",
    truncation="only_second",  # Only truncate the context
    max_length=384,
    return_tensors="pt"
)

# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for a single generation input

---

Part 3: Fine-Tuning Workflows
Pattern 1: Simple Fine-Tuning with Trainer
python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("glue", "mrpc")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# 5. Define metrics (datasets.load_metric is deprecated; use the evaluate library)
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Save
trainer.save_model("./fine-tuned-model")

Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)
python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,  # 8-bit for memory efficiency
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # LoRA alpha (scaling)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%

# Train with Trainer (same as before); only the LoRA parameters are updated!
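The printed trainable-parameter count can be sanity-checked by hand: each adapted weight matrix gains two low-rank factors, A (r x d_in) and B (d_out x r). Assuming Llama-2-7B's dimensions (32 decoder layers, hidden size 4096, so q_proj and v_proj are 4096x4096):

```python
# Back-of-the-envelope LoRA parameter count (Llama-2-7B dimensions assumed:
# 32 decoder layers, hidden size 4096; q_proj and v_proj are 4096x4096).
r = 8
d_in = d_out = 4096
params_per_module = r * d_in + d_out * r  # A: r x d_in, B: d_out x r
n_layers, n_modules = 32, 2               # q_proj and v_proj in every layer
trainable = n_layers * n_modules * params_per_module
print(f"{trainable / 1e6:.1f}M trainable parameters")  # 4.2M
```

That 4.2M matches the `print_trainable_parameters()` output above, about 0.06% of the 6.7B base parameters.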
Pattern 3: Custom Training Loop
python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated
from torch.utils.data import DataLoader
from transformers import get_scheduler

# Prepare dataloaders
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        # Compute metrics here

---

Part 4: Pipeline Usage (High-Level API)
Text Classification Pipeline
python
from transformers import pipeline

# Load pipeline
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch prediction
results = classifier([
    "Great service!",
    "Terrible experience",
    "Average quality"
])

Question Answering Pipeline
python
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
result = qa_pipeline(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# {'score': 0.98, 'start': 49, 'end': 54, 'answer': 'Paris'}

Text Generation Pipeline
python
generator = pipeline("text-generation", model="gpt2")
outputs = generator(
    "Once upon a time",
    max_length=50,
    num_return_sequences=3,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)
for output in outputs:
    print(output['generated_text'])

Zero-Shot Classification Pipeline
python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "This is a course about Python programming.",
    candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
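Under the hood, the zero-shot pipeline runs one NLI forward pass per candidate label (premise = the input text, hypothesis = "This example is {label}.") and, in the default single-label mode, softmaxes the per-label entailment logits. A pure-Python sketch of that final scoring step (the logit values below are made up for illustration):

```python
import math

def zero_shot_scores(entailment_logits):
    # Single-label mode: softmax over the per-label entailment logits.
    exps = [math.exp(z) for z in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical entailment logits for
# ["education", "technology", "business", "sports"]
scores = zero_shot_scores([3.1, 1.2, -0.5, -1.0])
print([round(s, 3) for s in scores])
# Scores sum to 1 (up to float rounding) and preserve the logit ordering.
```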
---

Part 5: Inference Optimization
Optimization 1: Batch Processing
python
# ❌ SLOW: Process one at a time
for text in texts:
    output = model(**tokenizer(text, return_tensors="pt"))

# ✅ FAST: Process in batches
batch_size = 32
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)

Optimization 2: Mixed Precision (AMP)
python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Optimization 3: ONNX Export
python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Export to ONNX (optimum converts the checkpoint when export=True)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", export=True
)
ort_model.save_pretrained("./onnx-model")

# Load the ONNX model later (faster inference)
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")

# Inference (often 2-3x faster on CPU)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)

Optimization 4: Dynamic Quantization
python
import torch

# Quantize model weights to int8
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize Linear layers
    dtype=torch.qint8
)
# Roughly 4x smaller model, 2-3x faster inference on CPU

---

Part 6: Common Issues & Solutions
Issue 1: CUDA Out of Memory
Problem:
RuntimeError: CUDA out of memory

Solutions:

python
# Solution 1: Reduce batch size and accumulate gradients
training_args = TrainingArguments(
    per_device_train_batch_size=8,  # Was 32
    gradient_accumulation_steps=4,  # Effective batch = 8 * 4 = 32
)

# Solution 2: Use gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 3: Load the model in 8-bit
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)

# Solution 4: Clear the CUDA cache
import torch
torch.cuda.empty_cache()
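Gradient accumulation trades memory for extra forward passes: gradients from several small micro-batches are summed before a single optimizer update, so 8 x 4 behaves like a batch of 32. A framework-free sketch of the control flow (the "gradient" here is just each micro-batch's mean, a stand-in for `loss.backward()`):

```python
# Minimal sketch of gradient accumulation (no real model; the "gradient"
# is each micro-batch's mean, to show when optimizer updates happen).
def train_with_accumulation(micro_batches, accumulation_steps=4):
    accumulated = 0.0
    updates = []                        # records each optimizer "step"
    for i, batch in enumerate(micro_batches, start=1):
        grad = sum(batch) / len(batch)  # stand-in for loss.backward()
        accumulated += grad             # gradients sum across micro-batches
        if i % accumulation_steps == 0:
            updates.append(accumulated / accumulation_steps)
            accumulated = 0.0           # stand-in for optimizer.zero_grad()
    return updates

# 8 micro-batches with accumulation_steps=4 -> 2 optimizer updates
updates = train_with_accumulation([[1.0, 3.0]] * 8, accumulation_steps=4)
print(len(updates))  # 2
```

Note that `Trainer` also divides the loss by `gradient_accumulation_steps` so the accumulated gradient matches what one large batch would produce, mirrored here by the division before appending.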
Issue 2: Slow Tokenization
Problem: Tokenization is the bottleneck.

Solutions:

python
# Solution 1: Use a fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Solution 2: Tokenize the dataset once and cache it
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,                # Parallel processing
    remove_columns=dataset.column_names,
    load_from_cache_file=True  # Cache results
)

# Solution 3: Tokenize many texts in one call instead of one at a time
# (the tokenizer accepts a list directly; batched/batch_size are
# dataset.map arguments, not tokenizer arguments)
tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

Issue 3: Inconsistent Results
Problem: The model outputs different results for the same input.

Solution:

python
# Set seeds for reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Disable dropout during inference
model.eval()

# Use deterministic generation
outputs = model.generate(
    inputs,
    do_sample=False  # Greedy decoding
)

# OR sample reproducibly: generate() has no seed argument, so
# call set_seed() immediately before sampling
set_seed(42)
outputs = model.generate(
    inputs,
    do_sample=True,
    temperature=1.0,
    top_k=50
)

Issue 4: Attention Mask Errors
Problem:
IndexError: index out of range in self

Solution:

python
# ✅ ALWAYS provide the attention mask
tokens = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Explicit (usually the default)
)

# Use it in the model forward pass
outputs = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']  # Don't forget this!
)

# For custom padding
attention_mask = (input_ids != tokenizer.pad_token_id).long()

---

Part 7: Model-Specific Patterns
GPT Models (Decoder-Only)
python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Generation
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,             # Beam search
    early_stopping=True,
    no_repeat_ngram_size=2   # Prevent repetition
    # Note: temperature/top_p only take effect with do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

T5 Models (Encoder-Decoder)
python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 expects a task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"

BERT Models (Encoder-Only)
python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for [MASK]
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]

# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))
# capital, city, center, heart, ...

---

Part 8: Production Deployment
FastAPI Serving Pattern
python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import uvicorn

app = FastAPI()

# Load model once at startup
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

class TextInput(BaseModel):
    text: str

@app.post("/classify")
async def classify_text(input: TextInput):
    result = classifier(input.text)[0]
    return {
        "label": result['label'],
        "confidence": result['score']
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Batch Inference Optimization
python
import asyncio
import torch

class BatchPredictor:
    def __init__(self, model, tokenizer, max_batch_size=32):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((text, future))
            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()
        # Await outside the lock so other requests can keep queuing
        return await future

    async def _process_batch(self):
        if not self.queue:
            return
        texts, futures = zip(*self.queue)
        self.queue = []
        # Process the whole batch in one forward pass
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()
        # Resolve each caller's future with its own result
        # Note: production code also needs a timeout-based flush so partial
        # batches (fewer than max_batch_size requests) don't wait forever.
        for future, result in zip(futures, results):
            future.set_result(result)
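To see the batching mechanics without loading a real model, the same pattern can be exercised with a stub predictor (the "forward pass" below just returns text lengths, and the example submits exactly one full batch; names like `StubBatchPredictor` are illustrative, not part of any library):

```python
import asyncio

class StubBatchPredictor:
    """Same micro-batching pattern as above, with a toy 'model' (text length)."""
    def __init__(self, max_batch_size=4):
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str) -> int:
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((text, future))
            if len(self.queue) >= self.max_batch_size:
                self._flush()
        return await future

    def _flush(self):
        texts, futures = zip(*self.queue)
        self.queue = []
        results = [len(t) for t in texts]  # stand-in for one batched forward pass
        for future, result in zip(futures, results):
            future.set_result(result)

async def main():
    predictor = StubBatchPredictor(max_batch_size=4)
    texts = ["a", "bb", "ccc", "dddd"]
    labels = await asyncio.gather(*(predictor.predict(t) for t in texts))
    print(labels)  # [1, 2, 3, 4]

asyncio.run(main())
```

The first three calls park on their futures; the fourth fills the batch and resolves all four at once, which is exactly why GPU throughput improves.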
Quick Decision Trees
"Which model should I use?"
"我应该使用哪个模型?"
*Hugging Face Transformers v1.1 - Enhanced*

🔄 Workflow
Source: Hugging Face Course & Production Guide
Stage 1: Model Selection
- Task: Pick the architecture that best fits the task (Encoder: classification, Decoder: generation).
- License: Does the model permit commercial use (Apache 2.0 vs the Llama Community license)?
- Size: Balance parameter count against performance (7B is usually enough).
Stage 2: Optimization Pipeline
- Quantization: Use 4-bit / 8-bit quantization (BitsAndBytes) for inference.
- Batching: Process inputs in batches, not one at a time (GPU throughput).
- Format: Convert to ONNX or TensorRT for production.
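As a sketch of the quantization step, a 4-bit NF4 load with BitsAndBytes might look like the following configuration fragment (the model name and dtype are illustrative; this needs a GPU and the `bitsandbytes` package installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit quantization config for inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```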
Stage 3: Deployment
- Cache: Don't bake model weights and the tokenizer into the Docker image; mount them from a volume.
- Token Limits: Define a strategy (e.g. chunking) for inputs that exceed the context window.
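The chunking strategy can be sketched as a sliding window over token ids; the `max_len` and `stride` values below are illustrative and should match the model's context window:

```python
def chunk_token_ids(ids, max_len=512, stride=128):
    """Split a long token-id sequence into windows of max_len tokens,
    overlapping by `stride` tokens so context is not cut mid-thought."""
    if len(ids) <= max_len:
        return [ids]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += max_len - stride
    return chunks

# A 1000-token input becomes 3 overlapping 512-token windows
print(len(chunk_token_ids(list(range(1000)), max_len=512, stride=128)))  # → 3
```

Per-chunk predictions then need an aggregation rule (e.g. max score for classification, concatenation for summarization).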
Checkpoints

| Stage | Check |
|---|---|
| 1 | Does the model fit in GPU memory (no OOM errors)? |
| 2 | Is inference latency below target? |
| 3 | Are the tokenizer and model compatible (same vocab)? |
Classification → BERT, RoBERTa, DeBERTa
Generation → GPT-2, GPT-Neo, LLaMA
Translation/Summarization → T5, BART, mT5
Question Answering → BERT, DeBERTa, RoBERTa

Performance vs Speed?
Best performance → Large models (355M+ params)
Balanced → Base models (110M params)
Fast inference → Distilled models (66M params)
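When weighing model size (and for the GPU-memory checkpoint above), weights-only VRAM can be estimated as n_params × bits / 8 bytes; the helper name below is illustrative:

```python
def approx_weight_gb(n_params: float, bits: int = 16) -> float:
    """Rough VRAM for the weights alone (excludes activations, KV cache,
    and optimizer state, which add substantial overhead in practice)."""
    return n_params * bits / 8 / 1024**3

# A 7B-parameter model in fp16 needs ~13 GB just for weights;
# 4-bit quantization cuts that to ~3.3 GB.
print(round(approx_weight_gb(7e9, bits=16), 1))  # → 13.0
print(round(approx_weight_gb(7e9, bits=4), 1))   # → 3.3
```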
undefined"How should I fine-tune?"
"我应该如何微调模型?"
Have full dataset control?
YES → Full fine-tuning or LoRA
NO → Few-shot prompting
Dataset size?
Large (>10K examples) → Full fine-tuning
Medium (1K-10K) → LoRA or full fine-tuning
Small (<1K) → LoRA or prompt engineering
Compute available?
Limited → LoRA (4-bit quantized)
Moderate → LoRA (8-bit)
High → Full fine-tuning
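For the LoRA branches of this tree, a typical PEFT configuration fragment looks roughly like this (rank, alpha, and target modules are illustrative and architecture-dependent):

```python
from peft import LoraConfig, get_peft_model

# Illustrative values; target_modules depends on the architecture
# (e.g. ["q_proj", "v_proj"] for LLaMA-style attention, ["c_attn"] for GPT-2)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# model = AutoModelForCausalLM.from_pretrained(...)   # base model, loaded as usual
# model = get_peft_model(model, lora_config)          # wraps it with trainable adapters
# model.print_trainable_parameters()                  # typically <1% of total params
```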
Resources
- Hugging Face Docs: https://huggingface.co/docs/transformers/
- Model Hub: https://huggingface.co/models
- PEFT (LoRA): https://huggingface.co/docs/peft/
- Optimum: https://huggingface.co/docs/optimum/
- Datasets: https://huggingface.co/docs/datasets/
Skill version: 1.0.0
Last updated: 2025-10-25
Maintained by: Applied Artificial Intelligence