huggingface_transformers


Hugging Face Transformers Best Practices

Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.

Quick Reference

When to use this skill:
  • Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
  • Fine-tuning models on custom data
  • Implementing NLP tasks (classification, QA, generation, etc.)
  • Optimizing inference (quantization, ONNX, etc.)
  • Debugging tokenization issues
  • Using Hugging Face pipelines
  • Deploying transformers to production
Models covered:
  • Encoders: BERT, RoBERTa, DeBERTa, ALBERT
  • Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
  • Encoder-Decoders: T5, BART, Flan-T5
  • Vision: ViT, CLIP, Stable Diffusion

Part 1: Model Loading Patterns

Pattern 1: Basic Model Loading

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# For specific tasks
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # For 3-class classification
)
```

Pattern 2: Loading with Specific Configuration

```python
from transformers import AutoConfig, AutoModel

# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2  # Custom dropout
config.attention_probs_dropout_prob = 0.2

# Load pretrained weights with the custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

# Or create a randomly initialized model from the config
model = AutoModel.from_config(config)
```

Pattern 3: Loading Quantized Models (Memory Efficient)

```python
from transformers import AutoModel, BitsAndBytesConfig
import torch

# 8-bit quantization (~50% memory reduction)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"  # Automatic device placement
)

# 4-bit quantization (~75% memory reduction)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
```

Pattern 4: Loading from Local Path

```python
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")

# Load from the local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
```

---

Part 2: Tokenization Best Practices

Critical Tokenization Patterns

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ CORRECT: all key arguments specified
tokens = tokenizer(
    text,
    padding=True,        # Pad to longest in batch
    truncation=True,     # Truncate to max_length
    max_length=512,      # Maximum sequence length
    return_tensors="pt"  # Return PyTorch tensors
)

# Access components
input_ids = tokens['input_ids']                 # Token IDs
attention_mask = tokens['attention_mask']       # Padding mask
token_type_ids = tokens.get('token_type_ids')   # Segment IDs (BERT)

# ❌ WRONG: missing critical arguments
tokens = tokenizer(text)  # No padding, truncation, or tensor format!
```

Batch Tokenization

```python
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
tokens = tokenizer(
    texts,
    padding=True,        # Pad all to longest in batch
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Result shape: [batch_size, max_len_in_batch]
print(tokens['input_ids'].shape)  # torch.Size([3, max_len_in_batch])
```
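To make the padded shapes concrete, here is a dependency-free sketch of what the tokenizer's padding does under the hood (the token IDs are toy values, not real BERT vocabulary entries):

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-ID lists and build the matching attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The zeros in the mask are exactly the positions the model should ignore, which is why forgetting `attention_mask` silently degrades results on padded batches.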

Special Token Handling

```python
# Add special tokens
tokenizer.add_special_tokens({
    'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})

# Resize model embeddings to match the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True, return_tensors="pt")

# Decode (tokens['input_ids'][0] is the first sequence in the batch)
decoded = tokenizer.decode(tokens['input_ids'][0], skip_special_tokens=False)
```

Tokenization for Different Tasks

```python
# Text classification (single sequence)
tokens = tokenizer(
    "This movie was great!",
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
tokens = tokenizer(
    question,
    context,
    padding="max_length",
    truncation="only_second",  # Only truncate the context
    max_length=384,
    return_tensors="pt"
)

# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for a single generation input
```

---

Part 3: Fine-Tuning Workflows

Pattern 1: Simple Fine-Tuning with Trainer

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

# 1. Load dataset
dataset = load_dataset("glue", "mrpc")

# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 4. Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers versions
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# 5. Define metrics (datasets.load_metric is deprecated; use the evaluate library)
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Save
trainer.save_model("./fine-tuned-model")
```

Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# Load base model in 8-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # LoRA rank
    lora_alpha=32,   # LoRA scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%

# Train with Trainer exactly as before; only the LoRA parameters are updated!
```
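The tiny trainable fraction follows directly from the LoRA construction: each adapted weight matrix gains two low-rank factors, A of shape (r x d_in) and B of shape (d_out x r). A back-of-the-envelope check (the 4096 dimension is illustrative, not an exact Llama-2 figure):

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters added by one LoRA adapter: B (d_out x r) + A (r x d_in)."""
    return d_out * r + r * d_in

# One 4096x4096 projection with rank 8 adds ~65K trainable params
# on top of ~16.8M frozen ones
full = 4096 * 4096
added = lora_params(4096, 4096, 8)
print(added)                    # 65536
print(f"{added / full:.4%}")    # well under 1% of the frozen layer
```

This is why raising `r` trades a modest parameter increase for more adapter capacity.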

Pattern 3: Custom Training Loop

```python
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use torch's
from torch.utils.data import DataLoader
from transformers import get_scheduler

# Prepare dataloaders
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Evaluation after each epoch
    model.eval()
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        # Compute metrics here
```

---

Part 4: Pipeline Usage (High-Level API)

Text Classification Pipeline

```python
from transformers import pipeline

# Load pipeline (the model is downloaded and cached on first use)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch prediction
results = classifier([
    "Great service!",
    "Terrible experience",
    "Average quality"
])
```

Question Answering Pipeline

```python
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

result = qa_pipeline(
    question="What is the capital of France?",
    context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# {'score': 0.98, 'start': 46, 'end': 51, 'answer': 'Paris'}
```
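The `start` and `end` fields are character offsets into the context string, so the answer can always be recovered by slicing. A quick check with the example context (the score is illustrative):

```python
context = "France is a country in Europe. Its capital is Paris, a beautiful city."
result = {"score": 0.98, "start": 46, "end": 51, "answer": "Paris"}

# start/end index characters in the context, end-exclusive
answer = context[result["start"]:result["end"]]
print(answer)  # Paris
```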

Text Generation Pipeline

```python
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Once upon a time",
    max_length=50,
    num_return_sequences=3,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True
)

for output in outputs:
    print(output['generated_text'])
```
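To build intuition for the sampling knobs above, here is a dependency-free sketch of temperature scaling and top-k filtering on a toy logit vector (the numbers are made up, not real model logits):

```python
import math

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_probs(logits, k, temperature=1.0):
    """Scale logits by temperature, keep the k largest, zero out the rest."""
    scaled = [x / temperature for x in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    filtered = [x if x >= cutoff else float("-inf") for x in scaled]
    return softmax(filtered)

probs = top_k_probs([2.0, 1.0, 0.1, -1.0], k=2, temperature=0.7)
print([round(p, 3) for p in probs])  # Only the top two tokens keep any mass
```

Lower temperature sharpens the distribution toward the top token; `top_k` and `top_p` simply restrict which tokens are eligible before renormalizing.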

Zero-Shot Classification Pipeline

```python
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "This is a course about Python programming.",
    candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
```

---

Part 5: Inference Optimization

Optimization 1: Batch Processing

```python
# ❌ SLOW: process one text at a time
for text in texts:
    output = model(**tokenizer(text, return_tensors="pt"))

# ✅ FAST: process in batches
batch_size = 32
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
```
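The batching loop generalizes to a small helper; the chunking logic is framework-free:

```python
def chunks(items, size):
    """Yield successive fixed-size chunks from a list (last chunk may be smaller)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"text {i}" for i in range(70)]
batches = list(chunks(texts, 32))
print([len(b) for b in batches])  # [32, 32, 6]
```

Each chunk can then be tokenized and passed through the model in one forward pass, as in the fast loop above.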

Optimization 2: Mixed Precision (AMP)

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss

    # Backward pass with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Optimization 3: ONNX Export

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

# Export to ONNX via optimum (export=True converts the PyTorch checkpoint)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", export=True
)
ort_model.save_pretrained("./onnx-model")

# Later: load the exported ONNX model directly
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")

# Inference (often 2-3x faster, especially on CPU)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)
```

Optimization 4: Dynamic Quantization

```python
import torch

# Quantize Linear layers to int8 (CPU inference)
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Which layer types to quantize
    dtype=torch.qint8
)
# ~4x smaller model, 2-3x faster inference on CPU
```
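The "4x smaller" figure is simply int8 versus float32 storage for the quantized Linear weights; a quick sanity check (768x768 is BERT-base's hidden projection size):

```python
def linear_weight_bytes(d_out, d_in, bytes_per_param):
    """Storage for one Linear layer's weight matrix."""
    return d_out * d_in * bytes_per_param

fp32 = linear_weight_bytes(768, 768, 4)  # float32: 4 bytes per parameter
int8 = linear_weight_bytes(768, 768, 1)  # int8 after dynamic quantization
print(fp32 // int8)  # 4
```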


---


Part 6: Common Issues & Solutions

Issue 1: CUDA Out of Memory

Problem:
RuntimeError: CUDA out of memory
Solutions:
```python
# Solution 1: Reduce the batch size; keep the effective batch via accumulation
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,  # Was 32
    gradient_accumulation_steps=4,  # Effective batch = 8 * 4 = 32
)

# Solution 2: Enable gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()

# Solution 3: Load the model in 8-bit
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)

# Solution 4: Clear the CUDA cache between runs
import torch
torch.cuda.empty_cache()
```
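Gradient accumulation works because the optimizer still sees the same effective batch size; the arithmetic is simply:

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Effective batch size seen by the optimizer under gradient accumulation."""
    return per_device_batch * accumulation_steps * num_devices

# Halving the per-device batch while doubling accumulation leaves training unchanged
print(effective_batch_size(8, 4))   # 32
print(effective_batch_size(16, 2))  # 32
```

Only per-step activation memory shrinks; total gradient signal per optimizer step stays the same.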

Issue 2: Slow Tokenization

Problem: Tokenization is the bottleneck.
Solutions:
```python
# Solution 1: Use a fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Solution 2: Tokenize the dataset once and cache the result
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4,                           # Parallel processing
    remove_columns=dataset.column_names,
    load_from_cache_file=True             # Reuse cached results
)

# Solution 3: Tokenize texts as one list instead of one at a time
# (the tokenizer call itself has no batched=/batch_size= arguments;
# those belong to dataset.map -- just pass a list of strings)
tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

Issue 3: Inconsistent Results

Problem: The model produces different outputs for the same input.
Solution:
```python
# Set seeds for reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Disable dropout during inference
model.eval()

# Use deterministic (greedy) decoding...
outputs = model.generate(inputs, do_sample=False)

# ...or sample with a fixed seed (generate() has no seed argument;
# re-seed with set_seed() before each sampling call instead)
set_seed(42)
outputs = model.generate(inputs, do_sample=True, temperature=1.0, top_k=50)
```

Issue 4: Attention Mask Errors

Problem:
IndexError: index out of range in self
Solution:
```python
# ✅ ALWAYS provide the attention mask
tokens = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_tensors="pt",
    return_attention_mask=True  # Explicit (usually the default)
)

# Pass it in the model's forward call
outputs = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']  # Don't forget this!
)

# For custom padding, build the mask yourself
attention_mask = (input_ids != tokenizer.pad_token_id).long()
```

---

Part 7: Model-Specific Patterns

GPT Models (Decoder-Only)

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Generation (temperature/top_p only apply when do_sample=True)
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    no_repeat_ngram_size=2  # Prevent repetition
)
# Alternatively, deterministic beam search:
# outputs = model.generate(**inputs, max_new_tokens=50, num_beams=5, early_stopping=True)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

T5 Models (Encoder-Decoder)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 expects a task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"
```

BERT Models (Encoder-Only)

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions for the [MASK] position
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]

# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))
# capital, city, center, heart, ...
```

---

Part 8: Production Deployment

FastAPI Serving Pattern

```python
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
import uvicorn

app = FastAPI()

# Load the model once at startup, not per request
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

class TextInput(BaseModel):
    text: str

@app.post("/classify")
async def classify_text(input: TextInput):
    result = classifier(input.text)[0]
    return {
        "label": result['label'],
        "confidence": result['score']
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Batch Inference Optimization


python
import asyncio
import torch
from typing import List

class BatchPredictor:
    """Collects concurrent requests into batches to improve GPU throughput."""

    def __init__(self, model, tokenizer, max_batch_size=32):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((text, future))

            # Flush when the batch is full. In production, also flush on a
            # timer so partially filled batches don't wait indefinitely.
            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()

        return await future

    async def _process_batch(self):
        if not self.queue:
            return

        texts, futures = zip(*self.queue)
        self.queue = []

        # Tokenize and run the whole batch in a single forward pass
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()

        # Resolve each caller's future with its own prediction
        for future, result in zip(futures, results):
            future.set_result(result)
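The future-based handoff above can be exercised without a model. This standalone sketch (`TinyBatcher` and its uppercase "batch function" are illustrative stand-ins, not part of any library) shows two concurrent callers being resolved by one batched call:

```python
import asyncio

class TinyBatcher:
    """Minimal sketch of the same future-based batching pattern, no model needed."""

    def __init__(self, batch_fn, max_batch_size=2):
        self.batch_fn = batch_fn          # runs once per full batch
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, item):
        async with self.lock:
            future = asyncio.get_running_loop().create_future()
            self.queue.append((item, future))
            if len(self.queue) >= self.max_batch_size:
                items, futures = zip(*self.queue)
                self.queue = []
                # One "forward pass" resolves every waiting caller
                for fut, res in zip(futures, self.batch_fn(list(items))):
                    fut.set_result(res)
        return await future

async def main():
    batcher = TinyBatcher(lambda xs: [x.upper() for x in xs], max_batch_size=2)
    return await asyncio.gather(batcher.predict("a"), batcher.predict("b"))

results = asyncio.run(main())
print(results)  # ['A', 'B']
```

Because each queued item carries its own future, results reach the right caller regardless of which coroutine the event loop schedules first.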

Quick Decision Trees


"Which model should I use?"


*Hugging Face Transformers v1.1 - Enhanced*

🔄 Workflow


Stage 1: Model Selection

  • Task: Choose the architecture that best fits the task (encoder: classification, decoder: generation).
  • License: Check whether the model permits commercial use (Apache 2.0 vs. the Llama Community License).
  • Size: Balance parameter count against performance (7B is often enough).

Stage 2: Optimization Pipeline

  • Quantization: Use 4-bit / 8-bit quantization (BitsAndBytes) for inference.
  • Batching: Process requests in batches rather than one at a time (GPU throughput).
  • Format: Convert to ONNX or TensorRT for production.
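As a sketch of the quantization step, loading a model in 4-bit with bitsandbytes typically looks like the following. This is a configuration sketch, not a tested snippet: the model id is illustrative, and it requires a CUDA GPU with the `bitsandbytes` package installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization (requires CUDA + bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```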

Stage 3: Deployment

  • Cache: Don't bake model weights and the tokenizer into the Docker image; mount them from a volume.
  • Token Limits: Define a strategy (e.g. chunking) for inputs that exceed the context window.
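The chunking strategy for over-long inputs can be sketched as a sliding window over token ids. The function name and parameters here (`chunk_token_ids`, `max_len`, `stride`) are hypothetical; in practice the tokenizer's own `stride` and `return_overflowing_tokens` options often do this for you.

```python
def chunk_token_ids(ids, max_len=512, stride=128):
    """Split a token-id sequence into overlapping windows of at most max_len,
    with `stride` tokens of overlap so context isn't cut mid-thought."""
    if max_len <= stride:
        raise ValueError("max_len must exceed stride")
    if len(ids) <= max_len:
        return [ids]
    chunks = []
    step = max_len - stride
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
    return chunks

windows = chunk_token_ids(list(range(1000)), max_len=512, stride=128)
print([len(w) for w in windows])  # [512, 512, 232]
print(windows[1][0])              # 384: each window overlaps the previous by 128 tokens
```

Predictions over the chunks then need aggregating (e.g. max or mean over chunk logits), which is task-specific.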

Checkpoints

  • Stage 1: Does the model fit in GPU memory (no OOM errors)?
  • Stage 2: Is inference latency below the target?
  • Stage 3: Are the tokenizer and model compatible (same vocabulary)?
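The first checkpoint can be estimated before downloading anything: weight memory is roughly parameter count times bytes per parameter (4 for fp32, 2 for fp16/bf16, ~0.5 for 4-bit), and real usage is higher once activations and the KV cache are added. A back-of-the-envelope helper (`model_memory_gb` is a hypothetical name):

```python
def model_memory_gb(n_params, bytes_per_param=2.0):
    """Rough weight-only memory estimate in GiB.
    bytes_per_param: 4.0 fp32, 2.0 fp16/bf16, 1.0 int8, ~0.5 int4.
    Real usage is higher: activations, KV cache, optimizer state."""
    return n_params * bytes_per_param / 1024**3

# A 7B model in fp16 needs ~13 GiB just for weights -> tight on a 16 GB GPU
print(round(model_memory_gb(7_000_000_000), 1))        # 13.0
print(round(model_memory_gb(7_000_000_000, 0.5), 1))   # 3.3 in 4-bit
```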
Classification → BERT, RoBERTa, DeBERTa
Generation → GPT-2, GPT-Neo, LLaMA
Translation/Summarization → T5, BART, mT5
Question Answering → BERT, DeBERTa, RoBERTa

Performance vs Speed?
  Best performance → Large models (355M+ params)
  Balanced → Base models (110M params)
  Fast inference → Distilled models (66M params)

"How should I fine-tune?"


Have full dataset control?
  YES → Full fine-tuning or LoRA
  NO → Few-shot prompting

Dataset size?
  Large (>10K examples) → Full fine-tuning
  Medium (1K-10K) → LoRA or full fine-tuning
  Small (<1K) → LoRA or prompt engineering

Compute available?
  Limited → LoRA (4-bit quantized)
  Moderate → LoRA (8-bit)
  High → Full fine-tuning
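To see why LoRA dominates the limited-compute branches: each adapted weight matrix is frozen, and a low-rank update BA is trained instead, adding only 2·d·r parameters per matrix. A quick calculation (the layer and matrix counts below are illustrative, roughly matching a LLaMA-7B-style model):

```python
def lora_trainable_params(d_model, rank, n_matrices):
    """Parameters added by LoRA: two low-rank factors A (rank x d) and
    B (d x rank) per adapted weight matrix, i.e. 2 * d * rank each."""
    return 2 * d_model * rank * n_matrices

# e.g. adapting the 4 attention projections in each of 32 layers, d=4096, r=8
added = lora_trainable_params(4096, 8, 4 * 32)
print(added)                 # 8388608
print(f"{added / 7e9:.2%}")  # ~0.12% of a 7B model's parameters
```

Training ~0.1% of the weights is why LoRA fits on a single consumer GPU, especially combined with 4-bit quantization of the frozen base model (QLoRA).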


Resources



Skill version: 1.0.0
Last updated: 2025-10-25
Maintained by: Applied Artificial Intelligence
