omnivoice-tts


OmniVoice TTS Skill

Skill by ara.so — Daily 2026 Skills collection.
OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation with RTF as low as 0.025.


Installation

Requirements

  • Python 3.9+
  • PyTorch 2.8+
  • CUDA (recommended) or Apple Silicon (MPS) or CPU

pip (recommended)

Step 1: Install PyTorch for your platform

NVIDIA GPU (CUDA 12.8):

```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

Apple Silicon:

```bash
pip install torch==2.8.0 torchaudio==2.8.0
```

Step 2: Install OmniVoice

```bash
pip install omnivoice
```

Or from source (latest, editable dev install):

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```

uv

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```

With mirror: `uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"`

HuggingFace Mirror (if blocked)

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

Core Concepts

| Mode          | What you provide         | Use case                                      |
|---------------|--------------------------|-----------------------------------------------|
| Voice Cloning | `ref_audio` + `ref_text` | Clone a speaker from a short audio clip       |
| Voice Design  | `instruct` string        | Describe speaker attributes (no audio needed) |
| Auto Voice    | nothing extra            | Model picks a random voice                    |

Python API

Load the Model

```python
from omnivoice import OmniVoice
import torch
import torchaudio
```

NVIDIA GPU:

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)
```

Apple Silicon:

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="mps",
    dtype=torch.float16,
)
```

CPU (slower):

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cpu",
    dtype=torch.float32,
)
```
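
If a single script needs to run on whichever backend is present, one option is to pick the device at runtime. This is only a sketch built on the `from_pretrained` call shown above; the availability checks are standard PyTorch, not OmniVoice-specific.

```python
import torch
from omnivoice import OmniVoice

# Choose the backend at runtime, mirroring the per-platform snippets above.
if torch.cuda.is_available():
    device_map, dtype = "cuda:0", torch.float16
elif torch.backends.mps.is_available():
    device_map, dtype = "mps", torch.float16
else:
    device_map, dtype = "cpu", torch.float32  # CPU fallback is slower

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map=device_map, dtype=dtype)
```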

Voice Cloning

With manual reference transcription (faster, more accurate):

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)
```

Without ref_text, Whisper auto-transcribes ref_audio:

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)
```

audio is a list of torch.Tensor, shape (1, T) at 24 kHz:

```python
torchaudio.save("out.wav", audio[0], 24000)
```

Voice Design

Describe speaker via comma-separated attributes:

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)
```

**Supported attributes:**
- **Gender**: `male`, `female`
- **Age**: `child`, `young`, `middle-aged`, `elderly`
- **Pitch**: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`
- **Style**: `whisper`
- **English accents**: `american accent`, `british accent`, `australian accent`, etc.
- **Chinese dialects**: `四川话`, `陕西话`, etc.

Auto Voice

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)
```

Generation Parameters

```python
audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,      # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,        # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,     # fix output duration in seconds (overrides speed)
)
```

Non-Verbal Symbols

Insert expressive non-verbal sounds inline:

```python
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
```

**Supported tags:**
`[laughter]`, `[sigh]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`,
`[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`,
`[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`
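
The tags are plain inline tokens, so, assuming the model accepts more than one tag in a single utterance, they can in principle also be mixed mid-sentence. A hypothetical example:

```python
# Hypothetical combination of a mid-sentence [sigh] and a trailing [laughter];
# tag placement follows the inline-token style shown above.
audio = model.generate(
    text="Well, [sigh] I suppose we can try again tomorrow. [laughter]"
)
torchaudio.save("nonverbal.wav", audio[0], 24000)
```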

Pronunciation Control

Chinese: pinyin with tone numbers (inline, uppercase)

```python
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)
```

English: CMU dict pronunciation in brackets (uppercase)

```python
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
```

---

CLI Tools

Web Demo

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help  # all options
```

Single Inference

Voice Cloning (ref_text optional; omit for Whisper auto-transcription):

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav
```

Voice Design:

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav
```

Auto Voice:

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
```

Batch Inference (Multi-GPU)

```bash
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

JSONL format (test.jsonl):

```jsonl
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
{"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
{"id": "sample_003", "text": "Auto voice example"}
{"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
{"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
{"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}
```

JSONL field reference:

| Field           | Required | Description                                        |
|-----------------|----------|----------------------------------------------------|
| `id`            | ✓        | Unique identifier                                  |
| `text`          | ✓        | Text to synthesize                                 |
| `ref_audio`     |          | Path to reference audio (voice cloning)            |
| `ref_text`      |          | Transcript of ref audio                            |
| `instruct`      |          | Speaker attributes (voice design)                  |
| `language_id`   |          | Language code, e.g. `"en"`                         |
| `language_name` |          | Language name, e.g. `"English"`                    |
| `duration`      |          | Fixed output duration in seconds                   |
| `speed`         |          | Speaking rate multiplier (ignored if duration set) |
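
When the test list is produced programmatically, writing it with the standard json module is enough; the field names below are the ones documented in the table above, and the paths are placeholders.

```python
import json

items = [
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"},
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"},
    {"id": "sample_003", "text": "Auto voice example"},
]

# One JSON object per line, UTF-8, no ASCII escaping (keeps Chinese text readable).
with open("test.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```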


Common Patterns

Full Voice Cloning Pipeline

```python
from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path

def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        out_path = f"{output_dir}/output_{i:04d}.wav"
        torchaudio.save(out_path, audio[0], 24000)
        print(f"Saved: {out_path}")

clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/"
)
```

Batch Processing from a List

```python
import json
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice.", },
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]

    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)
```

Voice Design Combinations

```python
designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)
```

Fast Inference (Lower Diffusion Steps)

Default: num_step=32 (high quality)
Fast: num_step=16 (slightly lower quality, ~2x faster)

```python
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)
```

---

Output Format

  • Sample rate: 24,000 Hz
  • Type: `list[torch.Tensor]`, each tensor shape `(1, T)`
  • Save: use `torchaudio.save(path, audio[0], 24000)`
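
For downstream tools that want raw samples instead of tensors, a small sketch based on the layout above (and assuming every element of the list shares the `(1, T)` shape) converts one output to a NumPy array and concatenates several outputs into a single file:

```python
import torch
import torchaudio

SAMPLE_RATE = 24000

# audio is the list returned by model.generate(); each element has shape (1, T).
samples = audio[0].squeeze(0).cpu().numpy()  # 1-D float array for NumPy/soundfile pipelines

# Join all returned segments along the time axis and save as one file.
combined = torch.cat([a.cpu() for a in audio], dim=1)
torchaudio.save("combined.wav", combined, SAMPLE_RATE)
```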

Troubleshooting

HuggingFace download fails

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

CUDA out of memory

Use float16 (not float32):

```python
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)
```

Or reduce batch size / text length in batch inference.
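
For long inputs, one way to keep per-call memory down is to split the text into sentence-sized chunks and synthesize them one at a time. The splitting heuristic below is only an illustration, not part of the OmniVoice API:

```python
import re
import torch
import torchaudio

long_text = "First sentence. Second sentence. Third sentence."

# Naive sentence split; swap in a proper segmenter for production text.
chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", long_text) if c.strip()]

pieces = []
for chunk in chunks:
    out = model.generate(text=chunk, ref_audio="ref.wav")
    pieces.append(out[0].cpu())
    torch.cuda.empty_cache()  # release cached blocks between chunks

torchaudio.save("long_output.wav", torch.cat(pieces, dim=1), 24000)
```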

Whisper ASR not available for ref_text auto-transcription

```bash
pip install openai-whisper
```

Wrong pronunciation in Chinese

Use inline pinyin with tone numbers directly in the text string (format: PINYIN + tone number, inserted within the sentence):

```python
text = "这批货物打ZHE2出售"
```

Audio quality issues

  • Increase `num_step` to 32 or 64
  • Provide `ref_text` manually instead of relying on auto-transcription
  • Use a clean, noise-free reference audio clip (3–15 seconds recommended)
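
Putting those recommendations together, a quality-oriented call might look like the sketch below; the file names are placeholders and the parameter values are the ones suggested above.

```python
# Higher step count plus a manual transcript of a clean 3-15 second reference clip.
audio = model.generate(
    text="High quality synthesis example.",
    ref_audio="clean_reference.wav",                     # clean, noise-free reference
    ref_text="Exact transcript of the reference clip.",
    num_step=64,                                         # up from the default 32 for higher quality
)
torchaudio.save("hq.wav", audio[0], 24000)
```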

Apple Silicon (MPS) issues

Use mps device explicitly:

```python
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)
```

---

Model & Resources

| Resource                | Link                                           |
|-------------------------|------------------------------------------------|
| HuggingFace Model       | k2-fsa/OmniVoice                               |
| HuggingFace Space       | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv)           | https://arxiv.org/abs/2604.00688               |
| Demo Page               | https://zhu-han.github.io/omnivoice            |
| Supported Languages     | `docs/languages.md` in repo                    |
| Voice Design Attributes | `docs/voice-design.md` in repo                 |
| Generation Parameters   | `docs/generation-parameters.md` in repo        |
| Training/Eval Examples  | `examples/` in repo                            |