
# Piper TTS Voice Training


Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview


Piper produces fast, offline TTS suitable for embedded devices. Training involves:

1. Corpus preparation (text covering the target phonetic range)
2. Audio generation or recording
3. Quality validation via Whisper transcription
4. Fine-tuning from an existing checkpoint (recommended) or training from scratch
5. ONNX export for deployment

Fine-tuning vs. training from scratch:

- Fine-tuning: ~1,300 phrases and ~1,000 epochs (days on a modest GPU)
- From scratch: ~13,000+ phrases and 2,000+ epochs (weeks to months)

## Workflow


### 1. Corpus Preparation


Gather 1,300-1,500+ phrases covering a broad phonetic range:

- Use the piper-recording-studio corpus as a base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

**Critical for non-US English:** Ensure the corpus uses the correct regional spelling. See the Localisation section below.
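True phonetic coverage is best measured by phonemizing the corpus with espeak-ng, but a quick letter-bigram count (a crude, illustrative proxy, not part of the Piper tooling) can flag an obviously narrow corpus before you start recording:

```python
from collections import Counter

def bigram_coverage(phrases):
    """Count letter bigrams across the corpus as a rough stand-in for
    phonetic variety; a real check would phonemize with espeak-ng.
    Bigrams are counted within each phrase, ignoring non-letters."""
    counts = Counter()
    for phrase in phrases:
        letters = [c for c in phrase.lower() if c.isalpha()]
        counts.update(zip(letters, letters[1:]))
    return counts

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Pack my box with five dozen liquor jugs.",
]
coverage = bigram_coverage(corpus)
print(len(coverage), "distinct bigrams")
```

A corpus whose bigram histogram is dominated by a handful of pairs is a hint to diversify before investing in audio.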

### 2. Audio Generation


Generate or record training audio as 22050 Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

- Generate at the source sample rate (often 24 kHz)
- Convert to 22050 Hz:

  ```bash
  sox -v 0.95 input.wav -r 22050 -t wav output.wav
  ```

- The `-v 0.95` flag scales the volume down to prevent clipping during resampling

Recording requirements:

- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not a stilted "reading voice")

### 3. Quality Validation with Whisper


Automate quality checks rather than listening manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()

    # Compare phonemically to tolerate spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")

    return expected_phonemes == transcribed_phonemes
```
Retry failed samples up to 3 times. Target 95%+ dataset coverage.
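The retry policy can be sketched as a small driver. Here `generate` and `validate` are placeholders for your synthesis call and the `validate_sample` check above, not Piper APIs:

```python
def validate_with_retries(sample_ids, generate, validate, max_attempts=3):
    """Regenerate and recheck each failing sample up to max_attempts
    times; return the passing ids and the overall dataset coverage."""
    passed = set()
    for sid in sample_ids:
        for _ in range(max_attempts):
            generate(sid)
            if validate(sid):
                passed.add(sid)
                break
    coverage = len(passed) / len(sample_ids)
    return passed, coverage
```

Drop samples that never pass, and only proceed to training once coverage reaches the 95% target.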

### 4. Dataset Format (LJSpeech)


Structure your dataset:

```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

`metadata.csv` format: `{id}|{text}` (pipe-separated, no header row):

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
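Writing the manifest is mechanical and worth scripting; a minimal sketch (the helper name is illustrative, not a Piper API):

```python
from pathlib import Path

def write_metadata(dataset_dir, samples):
    """samples: iterable of (sample_id, text) pairs.
    Writes an LJSpeech-style metadata.csv and ensures wavs/ exists."""
    root = Path(dataset_dir)
    (root / "wavs").mkdir(parents=True, exist_ok=True)
    lines = []
    for sid, text in samples:
        text = text.replace("|", " ")  # the pipe is the field separator
        lines.append(f"{sid}|{text}")
    (root / "metadata.csv").write_text("\n".join(lines) + "\n", encoding="utf-8")

write_metadata("dataset", [
    ("sample_0001", "The quick brown fox jumps over the lazy dog."),
    ("sample_0002", "Pack my box with five dozen liquor jugs."),
])
```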

### 5. Preprocessing


Convert to PyTorch tensors:

```bash
python3 -m piper_train.preprocess \
    --language en-gb \
    --input-dir dataset/ \
    --output-dir piper_training_dir/ \
    --dataset-format ljspeech
```

Use `en-gb` for Australian, NZ, and UK voices (espeak-ng phoneme set).

### 6. Training


Fine-tuning (recommended):

```bash
python3 -m piper_train \
    --dataset-dir piper_training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --max_epochs 3000 \
    --resume_from_checkpoint ljspeech-2000.ckpt \
    --checkpoint-epochs 100 \
    --quality high \
    --precision 32
```

Key parameters:

- `--batch-size`: Reduce if VRAM is limited (12 works on 8 GB)
- `--resume_from_checkpoint`: Start from the LJSpeech high-quality checkpoint
- `--precision 32`: More stable than mixed precision
- `--validation-split 0.0 --num-test-examples 0`: Skip validation for small datasets

Monitor with TensorBoard: watch `loss_disc_all` for convergence.

### 7. ONNX Export


```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the metadata file `output.onnx.json` from the training `config.json`.
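The copy step can be scripted with a sanity check on the sample rate. This sketch assumes the Piper voice-config layout where the rate lives under `audio.sample_rate`; verify against your own `config.json`:

```python
import json
from pathlib import Path

def write_voice_config(training_config, onnx_model):
    """Place the training config next to the exported model as
    <model>.json, the filename Piper's runtime looks for."""
    cfg = json.loads(Path(training_config).read_text(encoding="utf-8"))
    rate = cfg.get("audio", {}).get("sample_rate")
    if rate != 22050:
        raise ValueError(f"unexpected sample rate: {rate}")
    out = Path(str(onnx_model) + ".json")
    out.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return out
```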

## Localisation for Australian, New Zealand and UK English


Piper uses espeak-ng for phonemisation. American pronunciations in the training data cause accent drift.

Corpus preparation:

- Run `scripts/convert_spelling.py` on the corpus text before training
- Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

Common spelling conversions:

| American | Australian/UK |
|----------|---------------|
| -ize     | -ise          |
| -or      | -our          |
| -er      | -re           |
| -og      | -ogue         |
| -ense    | -ence         |

Phoneme considerations:

- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see references/localisation.md.

**Validation:** Use Whisper with `language="en"` and verify transcriptions match the expected regional forms.
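A whole-word mapping is safer than blind suffix rewriting, which would mangle words like "size" or "prize". The word list below is a tiny illustrative sample; the repo's `scripts/convert_spelling.py` presumably carries the full list from references/localisation.md:

```python
import re

# Tiny illustrative sample of the conversion table above.
US_TO_UK = {
    "color": "colour", "flavor": "flavour",
    "organize": "organise", "recognize": "recognise",
    "center": "centre", "theater": "theatre",
    "catalog": "catalogue", "dialog": "dialogue",
    "defense": "defence", "offense": "offence",
}

def convert_spelling(text):
    """Replace whole words only, preserving initial capitalisation."""
    def repl(match):
        word = match.group(0)
        uk = US_TO_UK.get(word.lower())
        if uk is None:
            return word
        return uk.capitalize() if word[0].isupper() else uk
    return re.sub(r"[A-Za-z]+", repl, text)
```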

## Dependencies


Pin versions to avoid API breakage:

```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements


Minimum (fine-tuning):

- 8 GB VRAM GPU (Pascal or newer)
- 8 GB system RAM
- ~5 days for 1,000 epochs on a Tesla P4

**From scratch:** Multiply the time by roughly 200x.

## Troubleshooting


| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check that the pytorch-lightning version matches the checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check the espeak-ng language code and corpus spelling |