huggingface-tokenizers
HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when you:
- Need extremely fast tokenization (<20 s per GB of text)
- Are training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Are building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI's BPE tokenizer for GPT models
- transformers AutoTokenizer: loading pretrained tokenizers only (uses this library internally)
Quick start
Installation
```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```
Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```
Train custom BPE tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

**Training time**: ~1-2 minutes for a 100 MB corpus, ~10-20 minutes for 1 GB

Batch encoding with padding
```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```

Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find most frequent character pair
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
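To make the merge loop concrete, here is a minimal pure-Python sketch of the procedure above (an illustration only, not the library's Rust implementation; `learn_bpe_merges` is a hypothetical helper):

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Toy BPE: learn merge rules from a {word: frequency} dict."""
    # Start from a character-level representation of each word
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

# Example corpus from the BPE paper (Sennrich et al., 2016)
print(learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

Training the real thing with the library: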
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)
tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:
- Handles OOV words well (breaks them into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages

Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
WordPiece
How it works:
- Start with character vocabulary
- Score merge pairs: score = frequency(pair) / (frequency(first) × frequency(second))
- Merge the highest-scoring pair
- Repeat until vocabulary size reached
Used by: BERT, DistilBERT, MobileBERT
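As a worked example of the scoring rule, with made-up counts (the token names are illustrative):

```python
# Hypothetical counts from a small corpus
pair_freq = 100     # occurrences of the pair ("un", "##able")
first_freq = 2000   # occurrences of "un" on its own
second_freq = 500   # occurrences of "##able" on its own

score = pair_freq / (first_freq * second_freq)
print(score)  # 0.0001

# A pair whose parts rarely appear apart scores higher and is merged first
print(40 / (50 * 60))  # ≈ 0.0133
```

Training a WordPiece tokenizer with the library: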
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```

Advantages:
- Prioritizes meaningful merges (high score = semantically related)
- Used successfully in BERT (state-of-the-art results)

Trade-offs:
- Unknown words become [UNK] if no subword matches
- Saves the vocabulary, not merge rules (larger files)
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
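The "most likely tokenization" under a unigram model can be computed with dynamic programming over token probabilities. A small sketch with a made-up vocabulary (`best_segmentation` is a hypothetical helper):

```python
import math

# Toy unigram vocabulary with made-up probabilities
vocab = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.10, "ug": 0.10, "hug": 0.15}

def best_segmentation(text):
    # Viterbi over cut points: best[i] = (log-prob, tokens) for text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][1] is not None:
                log_prob = best[start][0] + math.log(vocab[piece])
                if log_prob > best[end][0]:
                    best[end] = (log_prob, best[start][1] + [piece])
    return best[-1][1]

print(best_segmentation("hug"))  # ['hug'] beats ['hu', 'g'] and ['h', 'ug']
```

Training a Unigram tokenizer with the library: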
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)
tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:
- Probabilistic (finds the most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts

Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
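The four stages map directly onto attributes of a `Tokenizer`. A minimal sketch wiring them together, using only components shown below (the tokenizer here is untrained; the model supplies step 3):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))  # 3. model
tokenizer.normalizer = Lowercase()                   # 1. normalization
tokenizer.pre_tokenizer = Whitespace()               # 2. pre-tokenization
tokenizer.post_processor = TemplateProcessing(       # 4. post-processing
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```

Each stage is described in turn below.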
Normalization
Clean and standardize text:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents()   # Remove accents
])
```

Input: "Héllo WORLD"
After normalization: "hello world"
**Common normalizers**:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement
Pre-tokenization
Split text into word-like units:

```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])
```

Input: "Hello, world!"
After pre-tokenization: ["Hello", ",", "world", "!"]
**Common pre-tokenizers**:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)
Post-processing
Add special tokens for model input:

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

**Common patterns**:

```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```
Alignment tracking
Track token positions in the original text:

```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
```

Output:

```
hello      → [ 0,  5): 'Hello'
,          → [ 5,  6): ','
world      → [ 7, 12): 'world'
!          → [12, 13): '!'
```
**Use cases**:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
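For example, a question-answering model that predicts an answer as a token span can recover the exact source characters from the offsets (a sketch; the predicted indices here are hypothetical):

```python
text = "Tokenizers is written in Rust."
output = tokenizer.encode(text)

# Hypothetical model prediction: the answer spans tokens 4 through 5
start_tok, end_tok = 4, 5
char_start = output.offsets[start_tok][0]
char_end = output.offsets[end_tok][1]
print(text[char_start:char_end])  # the answer, verbatim from the original text
```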
Integration with transformers
Load with AutoTokenizer
```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using the fast tokenizer
print(tokenizer.is_fast)  # True

# Access the underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```
Convert custom tokenizer to transformers
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

Common patterns
Train from iterator (large datasets)
```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

**Performance**: Processes 1 GB in ~10-20 minutes

Enable truncation and padding
```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both enabled
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

Multi-processing
```python
from multiprocessing import Pool
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process a large corpus in parallel
# (`corpus` is assumed to be a list of strings defined elsewhere)
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]
    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

**Speedup**: 5-8× with 8 cores

Performance benchmarks
Training speed
| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| 10 MB | 15 sec | 18 sec | 25 sec |
| 100 MB | 1.5 min | 2 min | 4 min |
| 1 GB | 15 min | 20 min | 40 min |
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
| Implementation | 1 GB corpus | Throughput |
|---|---|---|
| Pure Python | ~20 minutes | ~50 MB/min |
| HF Tokenizers | ~15 seconds | ~4 GB/min |
| Speedup | 80× | 80× |
Test: English text, average sentence length 20 words
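To sanity-check throughput on your own hardware, a rough timing sketch (numbers will vary with CPU, text, and tokenizer):

```python
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
texts = ["The quick brown fox jumps over the lazy dog."] * 100_000

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed = time.perf_counter() - start

megabytes = sum(len(t) for t in texts) / 1e6
print(f"{megabytes / elapsed:.0f} MB/s")
```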
Memory usage
| Task | Memory |
|---|---|
| Load tokenizer | ~10 MB |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences | ~500 MB |
Supported models
Pre-trained tokenizers are available via `from_pretrained()`:

BERT family:
- `bert-base-uncased`, `bert-large-cased`, `distilbert-base-uncased`
- `roberta-base`, `roberta-large`

GPT family:
- `gpt2`, `gpt2-medium`, `gpt2-large`, `distilgpt2`

T5 family:
- `t5-small`, `t5-base`, `t5-large`, `google/flan-t5-xxl`

Other:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`

Browse all: https://huggingface.co/models?library=tokenizers
References
- Training Guide - Train custom tokenizers, configure trainers, handle large datasets
- Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
- Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
- Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens
Resources
- Docs: https://huggingface.co/docs/tokenizers
- GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
- Version: 0.20.0+
- Course: https://huggingface.co/learn/nlp-course/chapter6/1
- Papers: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)