piper-tts-training

Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:

Corpus preparation (text covering phonetic range)
Audio generation or recording
Quality validation via Whisper transcription
Fine-tuning from existing checkpoint (recommended) or training from scratch
ONNX export for deployment

Fine-tuning vs from-scratch:

Fine-tuning: ~1,300 phrases + 1,000 epochs (days on modest GPU)
From scratch: ~13,000+ phrases + 2,000+ epochs (weeks/months)

Workflow

1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:

Use piper-recording-studio corpus as base
Add domain-specific phrases for your use case
Include varied sentence structures and lengths

Critical for non-US English: Ensure corpus uses correct regional spelling. See Localisation.
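
A quick way to check corpus size and sentence-length variety before recording is a short script like the sketch below. It assumes the corpus is already loaded as a list of phrases; `summarise_corpus` and its threshold are illustrative names and values, not part of Piper or piper-recording-studio. It does not measure phonetic coverage, only size, duplicates and length spread.

```python
from collections import Counter


def summarise_corpus(phrases, min_phrases=1300):
    """Report size, duplicates and length spread for a list of phrases."""
    # Drop blank lines and surrounding whitespace
    cleaned = [p.strip() for p in phrases if p.strip()]
    # Count case-insensitively so "Hello." and "hello." are one phrase
    counts = Counter(p.lower() for p in cleaned)
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    word_lengths = [len(p.split()) for p in cleaned]
    return {
        "phrases": len(cleaned),
        "unique": len(counts),
        "duplicates": duplicates,
        "shortest_words": min(word_lengths, default=0),
        "longest_words": max(word_lengths, default=0),
        "enough_for_fine_tuning": len(counts) >= min_phrases,
    }
```

A corpus whose shortest and longest phrases have nearly the same word count probably needs more varied sentence structures.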

收集1300-1500+个覆盖广泛语音范围的短语：

以piper-recording-studio语料库为基础
为你的使用场景添加特定领域的短语
包含多样的句子结构和长度

**非美式英语的关键注意事项：**确保语料库使用正确的地区拼写。详见本地化。

2. Audio Generation

Generate or record training audio at 22050Hz mono WAV.

If using voice cloning (e.g., Chatterbox TTS):

Generate at source sample rate (often 24kHz)

Convert to 22050Hz:

```bash
sox -v 0.95 input.wav -r 22050 -t wav output.wav
```

The `-v 0.95` option scales the input volume to 95%, leaving headroom so that resampling does not clip.

Recording requirements:

Consistent microphone position and room acoustics
Minimal background noise
Natural speaking pace (not reading voice)
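
Before moving on, it may help to confirm that every file really is 22050Hz mono. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; `check_wav_format` is a hypothetical helper name.

```python
import wave


def check_wav_format(path, expected_rate=22050, expected_channels=1):
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}"
            )
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"{wav.getnchannels()} channels, expected {expected_channels}"
            )
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Running this over `wavs/` after conversion catches files that were generated at 24kHz and never resampled.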

3. Quality Validation with Whisper

Automate quality checks rather than manual listening:

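```python
import whisper
from piper_phonemize import phonemize_espeak

# Load the Whisper model once and reuse it for every sample
model = whisper.load_model("base")

def validate_sample(audio_path, expected_text, voice="en-gb"):
    """Return True if the audio transcribes to the same phonemes as expected_text."""
    result = model.transcribe(audio_path, language="en")
    transcribed = result["text"].strip()

    # Compare phonemically to handle spelling/punctuation differences.
    # phonemize_espeak returns one list of phonemes per sentence.
    expected_phonemes = phonemize_espeak(expected_text, voice)
    transcribed_phonemes = phonemize_espeak(transcribed, voice)

    return expected_phonemes == transcribed_phonemes
```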

Retry failed samples up to 3 times, regenerating or re-recording the audio each time. Aim for at least 95% of the corpus to pass validation.
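
The retry policy can be sketched as a small loop. Here `generate` is a hypothetical callable standing in for whatever produces the audio (a voice-cloning call or a re-recording prompt), and `validate` can be `validate_sample` from above; neither name comes from Piper.

```python
def build_validated_dataset(phrases, generate, validate, max_attempts=3):
    """Generate audio for each phrase, retrying until validation passes.

    generate(phrase, attempt) -> audio_path   (hypothetical callable)
    validate(audio_path, phrase) -> bool      (e.g. validate_sample)
    """
    accepted, rejected = {}, []
    for phrase in phrases:
        for attempt in range(1, max_attempts + 1):
            audio_path = generate(phrase, attempt)
            if validate(audio_path, phrase):
                accepted[phrase] = audio_path
                break
        else:
            # No attempt passed validation; leave the phrase out of the dataset
            rejected.append(phrase)
    # Fraction of phrases that ended up with a validated sample
    coverage = len(accepted) / len(phrases) if phrases else 0.0
    return accepted, rejected, coverage
```

If `coverage` falls below 0.95, the `rejected` list shows which phrases to rewrite or record manually.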

4. Dataset Format (LJSpeech)

Structure your dataset:

```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

metadata.csv format: `{id}|{text}` (pipe-separated, no header row)

```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```
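
A malformed row or a missing WAV file is easier to catch before preprocessing than during it. The sketch below checks the layout described above; `check_ljspeech_dataset` is a hypothetical helper, and it assumes the single-speaker, two-field format shown here.

```python
from pathlib import Path


def check_ljspeech_dataset(dataset_dir):
    """Return a list of problems found in metadata.csv and wavs/."""
    dataset_dir = Path(dataset_dir)
    problems = []
    seen_ids = set()
    lines = (dataset_dir / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        # The format described above has exactly two pipe-separated fields
        parts = line.split("|")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            problems.append(f"line {number}: expected '{{id}}|{{text}}'")
            continue
        sample_id = parts[0]
        if sample_id in seen_ids:
            problems.append(f"line {number}: duplicate id {sample_id}")
        seen_ids.add(sample_id)
        if not (dataset_dir / "wavs" / f"{sample_id}.wav").is_file():
            problems.append(f"line {number}: missing wavs/{sample_id}.wav")
    return problems
```

An empty list means every row has an id, some text, and a matching file under `wavs/`.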

5. Preprocessing

Convert to PyTorch tensors:

```bash
python3 -m piper_train.preprocess \
    --language en-gb \
    --input-dir dataset/ \
    --output-dir piper_training_dir/ \
    --dataset-format ljspeech \
    --sample-rate 22050
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).

6. Training

Fine-tuning (recommended):

```bash
python3 -m piper_train \
    --dataset-dir piper_training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --max_epochs 3000 \
    --resume_from_checkpoint ljspeech-2000.ckpt \
    --checkpoint-epochs 100 \
    --quality high \
    --precision 32
```

Key parameters:

`--batch-size`: Reduce if VRAM is limited (12 works on 8GB)
`--max_epochs 3000`: Total epoch count, not additional epochs. Assuming `ljspeech-2000.ckpt` was saved at epoch 2,000, as its name suggests, this trains for 1,000 more epochs
`--resume_from_checkpoint`: Start from the LJSpeech high-quality checkpoint
`--precision 32`: More stable than mixed precision
`--validation-split 0.0 --num-test-examples 0`: Skip validation for small datasets

Monitor with TensorBoard: watch `loss_disc_all` for convergence.
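
A long run with `--checkpoint-epochs 100` leaves several `.ckpt` files behind. The helper below is a minimal sketch for picking the newest one to resume from or export. It searches recursively and assumes nothing about the directory layout beyond the `.ckpt` extension; `latest_checkpoint` is a hypothetical name.

```python
from pathlib import Path


def latest_checkpoint(root_dir):
    """Return the most recently modified .ckpt file under root_dir, or None."""
    checkpoints = list(Path(root_dir).rglob("*.ckpt"))
    if not checkpoints:
        return None
    # Newest by modification time, regardless of file name
    return max(checkpoints, key=lambda path: path.stat().st_mtime)
```

Calling `latest_checkpoint("piper_training_dir/")` returns a `Path` that can be passed to `--resume_from_checkpoint` or to the export step.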

7. ONNX Export

```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the metadata file `output.onnx.json` from the training `config.json`.
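
One way to create the metadata file is a plain copy, sketched below. This assumes that `config.json` sits in the preprocessing output directory (`piper_training_dir/` in this guide) and that no fields need editing for your deployment; `write_voice_metadata` is a hypothetical helper name.

```python
import json
import shutil
from pathlib import Path


def write_voice_metadata(training_dir, onnx_path):
    """Copy training config.json next to the model as <model>.onnx.json."""
    config_path = Path(training_dir) / "config.json"
    # Parse first so a truncated or invalid config fails here, not at load time
    json.loads(config_path.read_text(encoding="utf-8"))
    metadata_path = Path(str(onnx_path) + ".json")
    shutil.copyfile(config_path, metadata_path)
    return metadata_path
```

For example, `write_voice_metadata("piper_training_dir", "output.onnx")` writes `output.onnx.json` beside the model.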

Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

Corpus preparation:

Run `scripts/convert_spelling.py` on corpus text before training
Use the `en-gb` or `en-au` espeak-ng voice for phonemisation
Review generated phonemes for Americanisms

Common spelling conversions:

| American | Australian/UK |
|----------|---------------|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |

Phoneme considerations:

/r/ linking and intrusion patterns differ
Vowel sounds in words like "dance", "bath", "castle"
Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see references/localisation.md.

Validation: Use Whisper with `language="en"` and verify transcriptions match expected regional forms.
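
The spelling conversion can be illustrated with a word-level lookup. This is only a sketch and not the contents of `scripts/convert_spelling.py`, which is not shown here; the word list and function name are hypothetical. It maps whole words rather than suffixes, because blind suffix rules would corrupt words such as "size" or "prize".

```python
import re

# Tiny illustrative word list; a real corpus needs a much fuller mapping
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "catalog": "catalogue",
    "defense": "defence",
    "organize": "organise",
}


def convert_spelling(text, mapping=US_TO_UK):
    """Replace whole words found in mapping, preserving initial capitals."""
    def replace(match):
        word = match.group(0)
        converted = mapping.get(word.lower())
        if converted is None:
            return word
        return converted.capitalize() if word[0].isupper() else converted

    return re.sub(r"[A-Za-z]+", replace, text)
```

Words that are not in the mapping pass through unchanged, so the output is only as complete as the word list.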

Dependencies

Pin versions to avoid API breakage:

pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim

Docker containerisation is recommended for reproducibility.

Hardware Requirements

Minimum (fine-tuning):

8GB VRAM GPU (Pascal or newer)
8GB system RAM
~5 days for 1,000 epochs on Tesla P4

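From scratch: Multiply time by ~20x (about 10x the phrases and 2x the epochs), in line with the weeks-to-months estimate in the Overview.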

Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce `--batch-size` (try 8 or 4) |
| Checkpoint won't load | Check pytorch-lightning version matches checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check espeak-ng language code and corpus spelling |
