# SentencePiece - Language-Independent Tokenization
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
## When to use SentencePiece
Use SentencePiece when:
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)
Performance (a minimal sketch for reproducing these numbers on your own data follows this list):
- Speed: ~50,000 sentences/sec
- Memory: ~6 MB for a loaded model
- Languages: all (language-independent)

Consider alternatives instead:
- HuggingFace Tokenizers: faster training, more flexibility
- tiktoken: OpenAI models (GPT-3.5/4)
- BERT WordPiece: English-centric tasks
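The throughput figure above is easy to check against your own corpus and hardware. A minimal sketch, assuming a model trained as in the Quick start below (`m.model`) and timing only the encode call:

```python
import time
import sentencepiece as spm

# Load a trained model (see "Quick start" below for training).
sp = spm.SentencePieceProcessor(model_file='m.model')
sentences = ['This is a test sentence.'] * 50_000

# Time a single batched encode over the whole list.
start = time.perf_counter()
sp.encode(sentences, out_type=int)
elapsed = time.perf_counter() - start
print(f'{len(sentences) / elapsed:,.0f} sentences/sec')
```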
## Quick start
### Installation
**Python**

```bash
pip install sentencepiece
```
**C++ (requires CMake)**
```bash
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```
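To confirm the Python package is importable after either route, a quick check (assuming a recent release, which exposes `__version__`):

```python
import sentencepiece as spm

# A clean import plus a version string means the install succeeded.
print(spm.__version__)  # e.g. 0.2.0
```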
### Train model
**Command line (BPE with 8000 vocab)**

```bash
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```
**Python API**
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```
**Training time**: ~1-2 minutes for a 100 MB corpus
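Training writes two files, `m.model` and `m.vocab`. A small sanity check of the result, using the processor's standard `get_piece_size` and `id_to_piece` accessors:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Should match the vocab_size passed at training time.
print(sp.get_piece_size())  # 8000

# Low ids are reserved for special tokens such as <unk>, <s>, </s>.
for i in range(5):
    print(i, sp.id_to_piece(i))
```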
### Encode and decode
```python
import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```
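`encode` and `decode` also accept lists, which is the usual way to process a corpus in batches. A minimal sketch (the ids depend on your trained model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Pass a list of strings; get a list of id lists back.
batch = ['This is a test', 'Hello world']
batch_ids = sp.encode(batch, out_type=int)

# decode accepts the nested list and restores the originals losslessly.
assert sp.decode(batch_ids) == batch
```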
## Language-independent design
### Whitespace as a symbol (▁)
```python
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: Treat text as raw Unicode; whitespace becomes the meta symbol ▁.
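Because ▁ is an ordinary character inside each piece, detokenization is a plain string operation. A one-line illustration of the convention (not a substitute for `decode`):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
pieces = sp.encode('Hello world', out_type=str)

# Join the pieces, then map each meta symbol (U+2581) back to a space.
detok = ''.join(pieces).replace('\u2581', ' ').lstrip()
print(detok)  # "Hello world"
```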
## Tokenization algorithms
### BPE (Byte-Pair Encoding)
```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

Used by: mBART
### Unigram (default)
```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

Used by: T5, ALBERT, XLNet
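With both models trained on the same corpus as above, a side-by-side encode makes the difference concrete: BPE builds words from greedily merged pairs, while Unigram keeps the segmentation that maximizes likelihood under its learned vocabulary. A sketch, assuming `bpe_model.model` and `unigram_model.model` from the snippets above:

```python
import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

# Rare or morphologically complex words often split differently.
word = 'internationalization'
print(bpe.encode(word, out_type=str))
print(uni.encode(word, out_type=str))
```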
## Training configuration
### Essential parameters
```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```
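`user_defined_symbols` are injected into the vocabulary and are never split, no matter where they appear in the input. A quick check against the model trained above (the exact surrounding pieces will vary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# The symbol survives as a single piece inside running text.
print(sp.encode('question [SEP] answer', out_type=str))
# e.g. ['▁question', '▁', '[SEP]', '▁answer']

# It also owns a dedicated id in the vocabulary.
print(sp.piece_to_id('[SEP]'))
```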
### Character coverage
| Language type | Coverage | Rationale |
|---|---|---|
| English | 0.9995 | Covers the most common characters |
| CJK (Chinese) | 1.0 | All characters needed |
| Multilingual | 0.9995 | Balance across scripts |
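For a CJK corpus, full coverage is often paired with the trainer's `byte_fallback` flag so that characters never seen in training back off to bytes instead of `<unk>`. A sketch of that configuration (treat the exact combination as an assumption to validate on your data):

```python
import sentencepiece as spm

# character_coverage=1.0 keeps every training character;
# byte_fallback=True handles anything unseen at inference time.
spm.SentencePieceTrainer.train(
    input='zh_corpus.txt',
    model_prefix='zh',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0,
    byte_fallback=True
)
```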
## Encoding options
### Subword regularization
```python
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)
```
Output (different each time):
```
['▁token', 'ization']
['▁tok', 'en', 'ization']
```
**Use case**: Data augmentation for robustness.
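In practice this means re-sampling the segmentation each time an example is seen, so the model trains on many segmentations of the same text. A sketch of the loop structure (the corpus and training step are placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
corpus = ['This is a test', 'Subword regularization helps robustness']

for epoch in range(3):
    for sentence in corpus:
        # nbest_size=-1 samples from all hypotheses (unigram models);
        # a fresh draw per epoch acts as on-the-fly augmentation.
        ids = sp.encode(sentence, out_type=int, enable_sampling=True,
                        nbest_size=-1, alpha=0.1)
        # feed `ids` to your model's training step here
```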
## Common patterns
### T5-style training
```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0
)
```
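A quick way to confirm the special-token layout matches the trainer arguments, using the processor's standard `pad_id`/`eos_id`/`unk_id` accessors (assumes the `t5.model` produced above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='t5.model')

# Ids should mirror the training arguments: pad=0, eos=1, unk=2.
print(sp.pad_id(), sp.eos_id(), sp.unk_id())  # 0 1 2

# Sentinel tokens for span corruption stay unsplit
# (a leading '▁' piece may appear alongside).
print(sp.encode('<extra_id_0>', out_type=str))
```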
### Integration with transformers
```python
from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```
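Inspecting the tokens makes the SentencePiece layer visible: the ▁ meta symbol appears directly in the output. A small follow-up using the tokenizer's standard `convert_ids_to_tokens` helper:

```python
# ▁-prefixed pieces confirm SentencePiece does the segmentation under the hood.
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
print(tokens)  # e.g. ['▁translate', '▁English', '▁to', '▁French', ':', '▁Hello', '</s>']
```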
## Performance benchmarks
### Training speed
| Corpus size | BPE (16k vocab) | Unigram (8k vocab) |
|---|---|---|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |
### Tokenization speed
- SentencePiece: ~50,000 sentences/sec
- HuggingFace Tokenizers: ~200,000 sentences/sec (roughly 4× faster)
## Supported models
- T5 family: `t5-base`, `t5-large` (32k vocab, Unigram)
- ALBERT: `albert-base-v2` (30k vocab, Unigram)
- XLNet: `xlnet-base-cased` (32k vocab, Unigram)
- mBART: `facebook/mbart-large-50` (250k vocab, BPE)
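The vocab sizes are easy to verify from the published checkpoints (a sketch; each call downloads the tokenizer from the Hugging Face Hub on first run):

```python
from transformers import AutoTokenizer

# Each checkpoint ships its SentencePiece model; vocab_size reflects it.
for name in ['t5-base', 'albert-base-v2', 'xlnet-base-cased']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.vocab_size)
```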
## References
- Training Guide - Detailed options, corpus preparation
- Algorithms - BPE vs Unigram, subword regularization
## Resources
- GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
- Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
- Version: 0.2.0+