sentencepiece

SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when:
  • Building multilingual models (no language-specific rules)
  • Working with CJK languages (Chinese, Japanese, Korean)
  • Needing reproducible tokenization (deterministic vocabulary)
  • Training on raw text (no pre-tokenization needed)
  • Deploying with a light footprint (~6 MB memory, 50k sentences/sec)
Performance:
  • Speed: 50,000 sentences/sec
  • Memory: ~6 MB for a loaded model
  • Languages: all (language-independent)
Use alternatives instead:
  • HuggingFace Tokenizers: faster training, more flexibility
  • tiktoken: OpenAI models (GPT-3.5/4)
  • BERT WordPiece: English-centric tasks

Quick start

Installation

Python

```bash
pip install sentencepiece
```
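
A quick way to confirm the install (a sanity check; the `__version__` attribute is exposed by recent releases):

```python
# Print the installed version to confirm the package imports cleanly
import sentencepiece as spm
print(spm.__version__)
```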

C++ (requires CMake)

```bash
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

Train model

Command-line (BPE with 8000 vocab)

```bash
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

Python API

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

**Training time**: ~1-2 minutes for a 100 MB corpus

Encode and decode

Load model

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
```

Encode to pieces

```python
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']
```

Encode to IDs

```python
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]
```

Decode

```python
text = sp.decode(ids)
print(text)  # "This is a test"
```
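
Round-tripping is lossless, and the loaded processor also exposes its vocabulary. A minimal sketch using the standard accessors (`get_piece_size`, `id_to_piece`, `piece_to_id`):

```python
# decode(encode(text)) reproduces the input exactly (lossless tokenization)
assert sp.decode(sp.encode('This is a test')) == 'This is a test'

# Inspect the vocabulary
print(sp.get_piece_size())      # total vocabulary size, e.g. 8000
print(sp.id_to_piece(ids[0]))   # piece string for the first encoded ID
print(sp.piece_to_id('▁This'))  # ID for a given piece
```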

Language-independent design

Whitespace as symbol (▁)

```python
text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']
```

Decode preserves spaces

```python
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

**Key principle**: Treat text as raw Unicode; whitespace is encoded as the meta symbol ▁.
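
Since whitespace lives inside the pieces themselves, detokenization needs no language-specific logic. A conceptual sketch of what `decode_pieces` does (the library's real implementation handles more edge cases):

```python
# Conceptually: concatenate pieces, then turn the meta symbol back into spaces
pieces = ['▁Hello', '▁world']
detok = ''.join(pieces).replace('▁', ' ').lstrip()
print(detok)  # "Hello world"
```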

Tokenization algorithms

BPE (Byte-Pair Encoding)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

Used by: mBART

Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

Used by: T5, ALBERT, XLNet
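
If both models above were trained on the same corpus, comparing their segmentations of one input takes a line each (a sketch; the exact pieces depend on your training data):

```python
import sentencepiece as spm

# Assumes bpe_model.model and unigram_model.model from the calls above
bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

text = 'internationalization'
print(bpe.encode(text, out_type=str))  # BPE: greedy, merge-based pieces
print(uni.encode(text, out_type=str))  # Unigram: likelihood-based pieces
```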

Training configuration

Essential parameters

```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # use 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```
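
After training, it is worth confirming that the user-defined symbols were registered as single, atomic pieces (a quick check against the m.model produced above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# User-defined symbols map to dedicated IDs and never get split
print(sp.piece_to_id('[SEP]'))               # a real ID, not the unk ID
print(sp.encode('a [SEP] b', out_type=str))  # '[SEP]' survives as one piece
```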

Character coverage

| Language type | Coverage | Rationale |
|---------------|----------|-----------|
| English       | 0.9995   | Most common characters |
| CJK (Chinese) | 1.0      | All characters needed |
| Multilingual  | 0.9995   | Balance |

Encoding options

Subword regularization

Sample different tokenizations

```python
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str,
                       enable_sampling=True, alpha=0.1)
    print(pieces)
```

Output (different each time):

```
['▁token', 'ization']
['▁tok', 'en', 'ization']
```

**Use case**: Data augmentation for robustness.
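
As a sketch of that augmentation loop, each epoch can draw a fresh segmentation of the same sentence; `nbest_size=-1` samples from all tokenization hypotheses (the corpus and loop here are hypothetical):

```python
# Hypothetical augmentation loop: each epoch sees different segmentations
corpus = ['This is a test', 'Subword regularization improves robustness']
for epoch in range(2):
    for sentence in corpus:
        ids = sp.encode(sentence, out_type=int,
                        enable_sampling=True, alpha=0.1, nbest_size=-1)
        # feed `ids` into the model's training step here
        print(epoch, ids)
```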

Common patterns

T5-style training

```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0
)
```
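
A quick sanity check on the result: the reserved IDs should match the layout T5 expects (assuming the t5.model file from the call above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='t5.model')

# IDs were fixed at training time: pad=0, eos=1, unk=2
print(sp.pad_id(), sp.eos_id(), sp.unk_id())  # 0 1 2
print(sp.piece_to_id('<extra_id_0>'))         # sentinel token is in the vocab
```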

Integration with transformers

T5 uses SentencePiece internally

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```
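
To see the underlying SentencePiece segmentation the tokenizer produced, map the IDs back to tokens (`convert_ids_to_tokens` is a standard tokenizer method; the exact pieces shown are illustrative):

```python
# The ▁ meta symbol shows through: these are raw SentencePiece pieces
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
print(tokens)  # e.g. ['▁translate', '▁English', '▁to', '▁French', ':', '▁Hello', '</s>']
```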

Performance benchmarks

Training speed

| Corpus | BPE (16k) | Unigram (8k) |
|--------|-----------|--------------|
| 100 MB | 1-2 min   | 3-4 min      |
| 1 GB   | 10-15 min | 30-40 min    |

Tokenization speed

  • SentencePiece: 50,000 sentences/sec
  • HF Tokenizers: 200,000 sentences/sec (4× faster)

Supported models

  • T5 family: `t5-base`, `t5-large` (32k vocab, Unigram)
  • ALBERT: `albert-base-v2` (30k vocab, Unigram)
  • XLNet: `xlnet-base-cased` (32k vocab, Unigram)
  • mBART: `facebook/mbart-large-50` (250k vocab, BPE)

References

  • Training Guide - Detailed options, corpus preparation
  • Algorithms - BPE vs Unigram, subword regularization

Resources
