omnivoice-tts
# OmniVoice TTS Skill
Skill by ara.so — Daily 2026 Skills collection.
OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation with RTF as low as 0.025.
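RTF (real-time factor) is wall-clock synthesis time divided by the duration of the audio produced, so lower is faster. The helper below is a hypothetical illustration of that arithmetic, not part of the library:

```python
# Real-time factor: synthesis time / audio duration.
# RTF = 0.025 means 20 s of audio is synthesized in 0.5 s.
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(real_time_factor(0.5, 20.0))  # → 0.025
```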
## Installation

### Requirements
- Python 3.9+
- PyTorch 2.8+
- CUDA (recommended) or Apple Silicon (MPS) or CPU
### pip (recommended)
```bash
# Step 1: Install PyTorch for your platform

# NVIDIA GPU (CUDA 12.8)
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

# Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0

# Step 2: Install OmniVoice
pip install omnivoice

# Or from source (latest)
pip install git+https://github.com/k2-fsa/OmniVoice.git

# Or editable dev install
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```
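As a quick sanity check after installing PyTorch, the snippet below (an illustration, not part of the package) picks the best available device using standard PyTorch APIs:

```python
import torch

# Prefer CUDA, then Apple Silicon (MPS), then CPU.
if torch.cuda.is_available():
    device = "cuda:0"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")
```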
### uv

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```

With mirror: `uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"`
### HuggingFace Mirror (if blocked)

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

## Core Concepts
| Mode | What you provide | Use case |
|---|---|---|
| Voice Cloning | `ref_audio` (plus optional `ref_text`) | Clone a speaker from a short audio clip |
| Voice Design | `instruct` | Describe speaker attributes (no audio needed) |
| Auto Voice | nothing extra | Model picks a random voice |
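The three modes differ only in which keyword arguments reach `generate()`. A minimal sketch of that mapping, using the argument names from this document (the helper itself is hypothetical):

```python
from typing import Optional

def build_generate_kwargs(text: str,
                          ref_audio: Optional[str] = None,
                          ref_text: Optional[str] = None,
                          instruct: Optional[str] = None) -> dict:
    """Map the three modes onto generate() keyword arguments."""
    kwargs = {"text": text}
    if ref_audio:                # voice cloning
        kwargs["ref_audio"] = ref_audio
        if ref_text:             # optional manual transcript
            kwargs["ref_text"] = ref_text
    elif instruct:               # voice design
        kwargs["instruct"] = instruct
    return kwargs                # neither given: auto voice

print(build_generate_kwargs("Hi", instruct="female, low pitch"))
# → {'text': 'Hi', 'instruct': 'female, low pitch'}
```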
## Python API

### Load the Model
```python
from omnivoice import OmniVoice
import torch
import torchaudio

# NVIDIA GPU
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16
)

# Apple Silicon
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="mps",
    dtype=torch.float16
)

# CPU (slower)
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cpu",
    dtype=torch.float32
)
```
### Voice Cloning
```python
# With manual reference transcription (faster, more accurate)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Without ref_text, Whisper auto-transcribes ref_audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)

# audio is a list of torch.Tensor, shape (1, T) at 24 kHz
torchaudio.save("out.wav", audio[0], 24000)
```
### Voice Design
```python
# Describe the speaker via comma-separated attributes
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)
```

**Supported attributes:**
- **Gender**: `male`, `female`
- **Age**: `child`, `young`, `middle-aged`, `elderly`
- **Pitch**: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`
- **Style**: `whisper`
- **English accents**: `american accent`, `british accent`, `australian accent`, etc.
- **Chinese dialects**: `四川话`, `陕西话`, etc.

### Auto Voice
```python
audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)
```

### Generation Parameters
```python
audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,   # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,     # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,  # fix output duration in seconds (overrides speed)
)
```

### Non-Verbal Symbols
```python
# Insert expressive non-verbal sounds inline
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
```

**Supported tags:**
`[laughter]`, `[sigh]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`, `[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`
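If you assemble tagged text programmatically, a quick lint against the supported set can catch typos before synthesis. A stdlib-only sketch (illustrative, not a library feature):

```python
import re

SUPPORTED_TAGS = {
    "[laughter]", "[sigh]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]", "[dissatisfaction-hnn]",
}

def unknown_tags(text: str) -> list[str]:
    """Return bracketed lowercase tags that are not in the supported set."""
    return [t for t in re.findall(r"\[[a-z-]+\]", text) if t not in SUPPORTED_TAGS]

print(unknown_tags("[laughter] Great. [giggle] Oops."))  # → ['[giggle]']
```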
### Pronunciation Control
```python
# Chinese: pinyin with tone numbers (inline, uppercase)
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)

# English: CMU dict pronunciation in brackets (uppercase)
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
```
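To audit which syllables in a script carry manual overrides, a simple regex over the uppercase pinyin-plus-tone tokens works; note it also matches bare CMU tokens such as `IH1` in English text. (A hypothetical helper, not part of the library.)

```python
import re

def pronunciation_overrides(text: str) -> list[str]:
    """Find inline uppercase pinyin-with-tone tokens such as ZHE2."""
    return re.findall(r"[A-Z]+[1-5]", text)

print(pronunciation_overrides("这批货物打ZHE2出售后他严重SHE2本了"))  # → ['ZHE2', 'SHE2']
```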
---

## CLI Tools
### Web Demo
```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help  # all options
```

### Single Inference
```bash
# Voice Cloning (ref_text optional; omit for Whisper auto-transcription)
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --ref_audio ref.wav \
  --ref_text "Transcription of the reference audio." \
  --output hello.wav

# Voice Design
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --instruct "male, British accent" \
  --output hello.wav

# Auto Voice
omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --output hello.wav
```
undefinedBatch Inference (Multi-GPU)
批量推理(多GPU)
```bash
omnivoice-infer-batch \
  --model k2-fsa/OmniVoice \
  --test_list test.jsonl \
  --res_dir results/
```

JSONL format (`test.jsonl`):

```jsonl
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
{"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
{"id": "sample_003", "text": "Auto voice example"}
{"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
{"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
{"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}
```

JSONL field reference:

| Field | Required | Description |
|---|---|---|
| `id` | ✅ | Unique identifier |
| `text` | ✅ | Text to synthesize |
| `ref_audio` | ❌ | Path to reference audio (voice cloning) |
| `ref_text` | ❌ | Transcript of the reference audio |
| `instruct` | ❌ | Speaker attributes (voice design) |
| `language_id` | ❌ | Language code, e.g. `en` |
| `language_name` | ❌ | Language name, e.g. `English` |
| `duration` | ❌ | Fixed output duration in seconds |
| `speed` | ❌ | Speaking rate multiplier (ignored if `duration` is set) |
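Before launching a long batch run, it can be worth validating the manifest against the field reference. A stdlib-only sketch (the field set comes from the table above; the helper itself is hypothetical):

```python
import json

REQUIRED = {"id", "text"}
OPTIONAL = {"ref_audio", "ref_text", "instruct", "language_id",
            "language_name", "duration", "speed"}

def validate_record(line: str) -> list[str]:
    """Return a list of problems for one JSONL record (empty means OK)."""
    rec = json.loads(line)
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - rec.keys())]
    problems += [f"unknown field: {f}" for f in sorted(rec.keys() - REQUIRED - OPTIONAL)]
    if "duration" in rec and "speed" in rec:
        problems.append("speed is ignored when duration is set")
    return problems

print(validate_record('{"id": "s1", "text": "Hello"}'))  # → []
print(validate_record('{"text": "x", "voice": "alloy"}'))
# → ['missing field: id', 'unknown field: voice']
```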
## Common Patterns

### Full Voice Cloning Pipeline
```python
from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path


def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        out_path = f"{output_dir}/output_{i:04d}.wav"
        torchaudio.save(out_path, audio[0], 24000)
        print(f"Saved: {out_path}")


clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/"
)
```

### Batch Processing from a List
```python
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice."},
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]
    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)
```

### Voice Design Combinations
```python
designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)
```

### Fast Inference (Lower Diffusion Steps)
```python
# Default: num_step=32 (high quality)
# Fast: num_step=16 (slightly lower quality, ~2x faster)
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)
```

---

## Output Format
- Sample rate: 24,000 Hz
- Type: `list[torch.Tensor]`, each tensor shape `(1, T)`
- Save: use `torchaudio.save(path, audio[0], 24000)`
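Because each generated clip is a `(1, T)` tensor at a fixed 24 kHz, joining multiple clips is a `torch.cat` along the time axis. A sketch (hypothetical helper) that inserts a short silence between clips:

```python
import torch

SAMPLE_RATE = 24000

def join_clips(clips, gap_seconds=0.2):
    """Concatenate (1, T) clips along time, with silence between them."""
    gap = torch.zeros(1, int(gap_seconds * SAMPLE_RATE))
    pieces = []
    for i, clip in enumerate(clips):
        if i > 0:
            pieces.append(gap)
        pieces.append(clip)
    return torch.cat(pieces, dim=1)

a, b = torch.zeros(1, 24000), torch.zeros(1, 12000)  # 1 s and 0.5 s clips
print(join_clips([a, b]).shape)  # → torch.Size([1, 40800])
```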
## Troubleshooting

### HuggingFace download fails
```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

### CUDA out of memory
```python
# Use float16 (not float32)
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)
```

Or reduce the batch size / text length in batch inference.
### Whisper ASR not available for `ref_text` auto-transcription
```bash
pip install openai-whisper
```

### Wrong pronunciation in Chinese
Use inline pinyin with tone numbers directly in the text string:

```python
# Format: PINYIN + tone number, inserted directly into the sentence
text = "这批货物打ZHE2出售"
```
### Audio quality issues
- Increase `num_step` to 32 or 64
- Provide `ref_text` manually instead of relying on auto-transcription
- Use a clean, noise-free reference audio clip (3–15 seconds recommended)
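The 3–15 second recommendation for reference clips is easy to check before cloning. A pure-Python sketch (the thresholds come from the tip above; the helper itself is hypothetical):

```python
def ref_clip_ok(num_frames: int, sample_rate: int,
                min_s: float = 3.0, max_s: float = 15.0) -> bool:
    """True if a reference clip is within the recommended duration window."""
    duration = num_frames / sample_rate
    return min_s <= duration <= max_s

print(ref_clip_ok(10 * 24000, 24000))  # → True (10 s)
print(ref_clip_ok(2 * 24000, 24000))   # → False (2 s, too short)
```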
### Apple Silicon (MPS) issues
```python
# Use the mps device explicitly
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)
```

---

## Model & Resources
| Resource | Link |
|---|---|
| HuggingFace Model | https://huggingface.co/k2-fsa/OmniVoice |
| HuggingFace Space | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv) | https://arxiv.org/abs/2604.00688 |
| Demo Page | https://zhu-han.github.io/omnivoice |
| Supported Languages | see the repository |
| Voice Design Attributes | see the repository |
| Generation Parameters | see the repository |
| Training/Eval Examples | see the repository |