huggingface-tokenizers

HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

When to use HuggingFace Tokenizers

Use HuggingFace Tokenizers when:
  • Need extremely fast tokenization (<20s per GB of text)
  • Training custom tokenizers from scratch
  • Want alignment tracking (token → original text position)
  • Building production NLP pipelines
  • Need to tokenize large corpora efficiently
Performance:
  • Speed: <20 seconds to tokenize 1GB on CPU
  • Implementation: Rust core with Python/Node.js bindings
  • Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
  • SentencePiece: Language-independent, used by T5/ALBERT
  • tiktoken: OpenAI's BPE tokenizer for GPT models
  • transformers AutoTokenizer: Loading pretrained tokenizers only (uses this library internally)
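
The throughput figures above vary with hardware and text; a quick way to sanity-check them on your own corpus (a rough sketch, not an official benchmark — `bert-base-uncased` and the repeated sentence are just stand-ins):

```python
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
texts = ["The quick brown fox jumps over the lazy dog."] * 100_000  # stand-in corpus
start = time.perf_counter()
tokenizer.encode_batch(texts)
print(f"Encoded {len(texts)} sentences in {time.perf_counter() - start:.1f}s")
```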

Quick start

Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

Train custom BPE tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

**Training time**: ~1-2 minutes for a 100 MB corpus, ~10-20 minutes for 1 GB

Batch encoding with padding

```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```

Tokenization algorithms

BPE (Byte-Pair Encoding)

How it works:
  1. Start with character-level vocabulary
  2. Find most frequent character pair
  3. Merge into new token, add to vocabulary
  4. Repeat until vocabulary size reached
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```
Advantages:
  • Handles OOV words well (breaks into subwords)
  • Flexible vocabulary size
  • Good for morphologically rich languages
Trade-offs:
  • Tokenization depends on merge order
  • May split common words unexpectedly
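To make steps 1-4 above concrete, here is a toy sketch of a single BPE merge step on a hypothetical four-word corpus (illustration only, not the library's Rust implementation):

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "low"]
words = [list(w) for w in corpus]          # start from character-level symbols

# Count adjacent symbol pairs across the corpus
pairs = Counter()
for symbols in words:
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += 1

best = pairs.most_common(1)[0][0]          # most frequent pair, e.g. ('l', 'o')

# Merge that pair into a single new symbol everywhere it occurs
merged = []
for symbols in words:
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    merged.append(out)

print(best, merged)  # repeat this loop until the target vocabulary size is reached
```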

WordPiece

How it works:
  1. Start with character vocabulary
  2. Score merge pairs:
    frequency(pair) / (frequency(first) × frequency(second))
  3. Merge highest scoring pair
  4. Repeat until vocabulary size reached
Used by: BERT, DistilBERT, MobileBERT
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
```
Advantages:
  • Prioritizes meaningful merges (high score = semantically related)
  • Used successfully in BERT (state-of-the-art results)
Trade-offs:
  • Unknown words become [UNK] if no subword match
  • Saves vocabulary, not merge rules (larger files)
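The scoring rule in step 2 explains why WordPiece favors merges whose parts rarely occur on their own. A tiny worked example with hypothetical counts (illustration only):

```python
# Hypothetical frequencies, not taken from any real corpus
freq = {"un": 120, "##able": 80, "th": 500, "##e": 900}
pair_freq = {("un", "##able"): 60, ("th", "##e"): 450}

def score(a, b):
    return pair_freq[(a, b)] / (freq[a] * freq[b])

print(score("un", "##able"))  # 60 / (120 * 80)   = 0.00625
print(score("th", "##e"))     # 450 / (500 * 900) = 0.001
# "un" + "##able" wins despite the lower raw count, because its parts are rarer.
```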

Unigram

How it works:
  1. Start with large vocabulary (all substrings)
  2. Compute loss for corpus with current vocabulary
  3. Remove tokens with minimal impact on loss
  4. Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```
Advantages:
  • Probabilistic (finds most likely tokenization)
  • Works well for languages without word boundaries
  • Handles diverse linguistic contexts
Trade-offs:
  • Computationally expensive to train
  • More hyperparameters to tune
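The loss in steps 2-3 is the negative log-likelihood of the corpus under the current token probabilities; a toy sketch with made-up numbers shows how competing segmentations are compared (illustration only, not the library's training code):

```python
import math

# Hypothetical unigram probabilities for a few tokens
token_probs = {"hug": 0.10, "s": 0.03, "hugs": 0.07}

# Two candidate segmentations of the word "hugs"
loss_split = -(math.log(token_probs["hug"]) + math.log(token_probs["s"]))
loss_whole = -math.log(token_probs["hugs"])

print(loss_split, loss_whole)  # the lower-loss segmentation is preferred;
# tokens whose removal barely raises the total corpus loss get pruned first.
```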

Tokenization pipeline

Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
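
The four stages map onto four attributes of a single `Tokenizer` object; a minimal sketch of how they compose (components reused from the subsections below):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tok = Tokenizer(WordPiece(unk_token="[UNK]"))   # 3. Model
tok.normalizer = Lowercase()                    # 1. Normalization
tok.pre_tokenizer = Whitespace()                # 2. Pre-tokenization
tok.post_processor = TemplateProcessing(        # 4. Post-processing
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```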

Normalization

Clean and standardize text:
```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents()   # Remove accents
])
```

Input: "Héllo WORLD"

Input: "Héllo WORLD"

After normalization: "hello world"

After normalization: "hello world"


**Common normalizers**:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement
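
Normalizers can be tested in isolation with `normalize_str`, which applies the chain to a raw string; a small sketch mirroring the example above:

```python
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllo WORLD"))  # "hello world"
```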


Pre-tokenization

Split text into word-like units:
```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])
```

Input: "Hello, world!"

Input: "Hello, world!"

After pre-tokenization: ["Hello", ",", "world", "!"]

After pre-tokenization: ["Hello", ",", "world", "!"]


**Common pre-tokenizers**:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
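
Pre-tokenizers can likewise be inspected on their own via `pre_tokenize_str`, which returns the word-like units together with their character offsets; a small sketch:

```python
from tokenizers.pre_tokenizers import Whitespace

pre_tok = Whitespace()
print(pre_tok.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```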


Post-processing

Add special tokens for model input:
```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

**Common patterns**:
```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```
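
Once a template like the BERT-style one above is attached, `encode` adds the special tokens automatically. A brief sketch (token strings are illustrative and assume the special tokens exist in the vocabulary):

```python
output = tokenizer.encode("Hello world")
print(output.tokens)  # e.g. ['[CLS]', 'hello', 'world', '[SEP]']
```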

Alignment tracking

Track token positions in original text:
```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
```

Output:

hello → [ 0, 5): 'Hello'
, → [ 5, 6): ','
world → [ 7, 12): 'world'
! → [12, 13): '!'

**Use cases**:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
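
Beyond raw offsets, the `Encoding` object also exposes `char_to_token` and `token_to_chars` for mapping between character positions and token indices, which is handy for the span-labeling tasks above. A small sketch (indices assume the simple encoding shown earlier):

```python
output = tokenizer.encode("Hello, world!")
idx = output.char_to_token(7)       # index of the token covering character 7 ('w')
print(idx, output.tokens[idx])      # e.g. 2 'world'
print(output.token_to_chars(idx))   # its (start, end) span in the original text
```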


Integration with transformers

Load with AutoTokenizer

```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

Convert custom tokenizer to transformers

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

Common patterns

Train from iterator (large datasets)

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

**Performance**: Processes 1 GB in ~10-20 minutes

Enable truncation and padding

```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

Multi-processing

```python
from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

corpus = ["Example sentence one.", "Example sentence two."]  # placeholder; your raw texts

# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

**Speedup**: 5-8× with 8 cores

Performance benchmarks

Training speed

| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |

Hardware: 16-core CPU, tested on English Wikipedia

Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |

Test: English text, average sentence length 20 words

Memory usage

| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |

Supported models

Pre-trained tokenizers available via `from_pretrained()`:
BERT family:
  • bert-base-uncased, bert-large-cased
  • distilbert-base-uncased
  • roberta-base, roberta-large
GPT family:
  • gpt2, gpt2-medium, gpt2-large
  • distilgpt2
T5 family:
  • t5-small, t5-base, t5-large
  • google/flan-t5-xxl
Other:
  • facebook/bart-base, facebook/mbart-large-cc25
  • albert-base-v2, albert-xlarge-v2
  • xlm-roberta-base, xlm-roberta-large

References

  • Training Guide - Train custom tokenizers, configure trainers, handle large datasets
  • Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
  • Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
  • Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens

Resources
