omnivoice-tts


OmniVoice TTS Skill

Skill by ara.so — Daily 2026 Skills collection.
OmniVoice is a state-of-the-art zero-shot TTS model supporting 600+ languages, built on a diffusion language model-style architecture. It supports voice cloning (from reference audio), voice design (via text attributes), and auto voice generation with RTF as low as 0.025.


Installation

Requirements

  • Python 3.9+
  • PyTorch 2.8+
  • CUDA (recommended) or Apple Silicon (MPS) or CPU

pip (recommended)

Step 1: Install PyTorch for your platform

NVIDIA GPU (CUDA 12.8):

```bash
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

Apple Silicon:

```bash
pip install torch==2.8.0 torchaudio==2.8.0
```

Step 2: Install OmniVoice

```bash
pip install omnivoice
```

Or from source (latest, editable dev install):

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
```

uv

```bash
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
```

With mirror: `uv sync --default-index "https://mirrors.aliyun.com/pypi/simple"`

HuggingFace Mirror (if blocked)

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

Core Concepts

| Mode          | What you provide         | Use case                                      |
|---------------|--------------------------|-----------------------------------------------|
| Voice Cloning | `ref_audio` + `ref_text` | Clone a speaker from a short audio clip       |
| Voice Design  | `instruct` string        | Describe speaker attributes (no audio needed) |
| Auto Voice    | nothing extra            | Model picks a random voice                    |

Python API

Load the Model

```python
from omnivoice import OmniVoice
import torch
import torchaudio
```

NVIDIA GPU:

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)
```

Apple Silicon:

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="mps",
    dtype=torch.float16,
)
```

CPU (slower):

```python
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cpu",
    dtype=torch.float32,
)
```
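
If a single script needs to run on whichever backend is present, one option is to pick the device at runtime. This is only a sketch built on the `from_pretrained` call shown above; the availability checks are standard PyTorch, not OmniVoice-specific.

```python
import torch
from omnivoice import OmniVoice

# Choose the backend at runtime, mirroring the per-platform snippets above.
if torch.cuda.is_available():
    device_map, dtype = "cuda:0", torch.float16
elif torch.backends.mps.is_available():
    device_map, dtype = "mps", torch.float16
else:
    device_map, dtype = "cpu", torch.float32  # CPU fallback is slower

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map=device_map, dtype=dtype)
```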

Voice Cloning

With manual reference transcription (faster, more accurate):

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)
```

Without ref_text, Whisper auto-transcribes ref_audio:

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
)
```

audio is a list of torch.Tensor, shape (1, T) at 24 kHz:

```python
torchaudio.save("out.wav", audio[0], 24000)
```

Voice Design

Describe speaker via comma-separated attributes:

```python
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("out.wav", audio[0], 24000)
```

**Supported attributes:**
- **Gender**: `male`, `female`
- **Age**: `child`, `young`, `middle-aged`, `elderly`
- **Pitch**: `very low pitch`, `low pitch`, `high pitch`, `very high pitch`
- **Style**: `whisper`
- **English accents**: `american accent`, `british accent`, `australian accent`, etc.
- **Chinese dialects**: `四川话`, `陕西话`, etc.

Auto Voice

```python
audio = model.generate(text="This is a sentence without any voice prompt.")
torchaudio.save("out.wav", audio[0], 24000)
```

Generation Parameters

```python
audio = model.generate(
    text="Hello world.",
    ref_audio="ref.wav",
    ref_text="Reference text.",
    num_step=32,      # diffusion steps; use 16 for faster (slightly lower quality)
    speed=1.2,        # speaking rate multiplier (>1 faster, <1 slower)
    duration=8.0,     # fix output duration in seconds (overrides speed)
)
```

Non-Verbal Symbols

Insert expressive non-verbal sounds inline:

```python
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
```

**Supported tags:**
`[laughter]`, `[sigh]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`,
`[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`,
`[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`
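
The tags are plain inline tokens, so, assuming the model accepts more than one tag in a single utterance, they can in principle also be mixed mid-sentence. A hypothetical example:

```python
# Hypothetical combination of a mid-sentence [sigh] and a trailing [laughter];
# tag placement follows the inline-token style shown above.
audio = model.generate(
    text="Well, [sigh] I suppose we can try again tomorrow. [laughter]"
)
torchaudio.save("nonverbal.wav", audio[0], 24000)
```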

Pronunciation Control

Chinese: pinyin with tone numbers (inline, uppercase)

```python
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)
```

English: CMU dict pronunciation in brackets (uppercase)

```python
audio = model.generate(
    text="You could probably still make [IH1 T] look good."
)
```

---

CLI Tools

Web Demo

```bash
omnivoice-demo --ip 0.0.0.0 --port 8001
omnivoice-demo --help  # all options
```

Single Inference

Voice Cloning (ref_text optional; omit for Whisper auto-transcription):

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav
```

Voice Design:

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, British accent" \
    --output hello.wav
```

Auto Voice:

```bash
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
```

Batch Inference (Multi-GPU)

```bash
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
```

JSONL format (test.jsonl):

```jsonl
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"}
{"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"}
{"id": "sample_003", "text": "Auto voice example"}
{"id": "sample_004", "text": "Speed controlled", "ref_audio": "/path/to/ref.wav", "speed": 1.2}
{"id": "sample_005", "text": "Duration fixed", "ref_audio": "/path/to/ref.wav", "duration": 10.0}
{"id": "sample_006", "text": "With language hint", "ref_audio": "/path/to/ref.wav", "language_id": "en", "language_name": "English"}
```

JSONL field reference:

| Field           | Required | Description                                        |
|-----------------|----------|----------------------------------------------------|
| `id`            | ✓        | Unique identifier                                  |
| `text`          | ✓        | Text to synthesize                                 |
| `ref_audio`     |          | Path to reference audio (voice cloning)            |
| `ref_text`      |          | Transcript of ref audio                            |
| `instruct`      |          | Speaker attributes (voice design)                  |
| `language_id`   |          | Language code, e.g. `"en"`                         |
| `language_name` |          | Language name, e.g. `"English"`                    |
| `duration`      |          | Fixed output duration in seconds                   |
| `speed`         |          | Speaking rate multiplier (ignored if duration set) |
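
When the test list is produced programmatically, writing it with the standard json module is enough; the field names below are the ones documented in the table above, and the paths are placeholders.

```python
import json

items = [
    {"id": "sample_001", "text": "Hello world",
     "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript"},
    {"id": "sample_002", "text": "Voice design example", "instruct": "female, british accent"},
    {"id": "sample_003", "text": "Auto voice example"},
]

# One JSON object per line, UTF-8, no ASCII escaping (keeps Chinese text readable).
with open("test.jsonl", "w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```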


Common Patterns

Full Voice Cloning Pipeline

```python
from omnivoice import OmniVoice
import torch
import torchaudio
from pathlib import Path

def clone_voice(ref_audio_path: str, texts: list[str], output_dir: str):
    model = OmniVoice.from_pretrained(
        "k2-fsa/OmniVoice",
        device_map="cuda:0",
        dtype=torch.float16
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for i, text in enumerate(texts):
        audio = model.generate(
            text=text,
            ref_audio=ref_audio_path,
            # ref_text omitted: Whisper auto-transcribes
            num_step=32,
            speed=1.0,
        )
        out_path = f"{output_dir}/output_{i:04d}.wav"
        torchaudio.save(out_path, audio[0], 24000)
        print(f"Saved: {out_path}")

clone_voice(
    ref_audio_path="speaker.wav",
    texts=["Hello world.", "Second sentence.", "Third sentence."],
    output_dir="outputs/"
)
```

Batch Processing from a List

```python
import json
from omnivoice import OmniVoice
import torch
import torchaudio

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

items = [
    {"id": "s1", "text": "English sentence.", "instruct": "female, american accent"},
    {"id": "s2", "text": "Another sentence.", "ref_audio": "ref.wav"},
    {"id": "s3", "text": "Auto voice.", },
]

for item in items:
    kwargs = {"text": item["text"]}
    if "ref_audio" in item:
        kwargs["ref_audio"] = item["ref_audio"]
    if "ref_text" in item:
        kwargs["ref_text"] = item["ref_text"]
    if "instruct" in item:
        kwargs["instruct"] = item["instruct"]

    audio = model.generate(**kwargs)
    torchaudio.save(f"{item['id']}.wav", audio[0], 24000)
```

Voice Design Combinations

```python
designs = [
    "male, elderly, low pitch",
    "female, child, high pitch",
    "male, whisper",
    "female, british accent, high pitch",
    "male, american accent, middle-aged",
]

for design in designs:
    audio = model.generate(
        text="The quick brown fox jumps over the lazy dog.",
        instruct=design,
    )
    safe_name = design.replace(", ", "_").replace(" ", "-")
    torchaudio.save(f"design_{safe_name}.wav", audio[0], 24000)
```

Fast Inference (Lower Diffusion Steps)

Default: num_step=32 (high quality)
Fast: num_step=16 (slightly lower quality, ~2x faster)

```python
audio = model.generate(
    text="Fast inference example.",
    ref_audio="ref.wav",
    num_step=16,
)
```

---

Output Format

  • Sample rate: 24,000 Hz
  • Type: `list[torch.Tensor]`, each tensor shape `(1, T)`
  • Save: use `torchaudio.save(path, audio[0], 24000)`
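
For downstream tools that want raw samples instead of tensors, a small sketch based on the layout above (and assuming every element of the list shares the `(1, T)` shape) converts one output to a NumPy array and concatenates several outputs into a single file:

```python
import torch
import torchaudio

SAMPLE_RATE = 24000

# audio is the list returned by model.generate(); each element has shape (1, T).
samples = audio[0].squeeze(0).cpu().numpy()  # 1-D float array for NumPy/soundfile pipelines

# Join all returned segments along the time axis and save as one file.
combined = torch.cat([a.cpu() for a in audio], dim=1)
torchaudio.save("combined.wav", combined, SAMPLE_RATE)
```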

Troubleshooting

HuggingFace download fails

```bash
export HF_ENDPOINT="https://hf-mirror.com"
```

CUDA out of memory

Use float16 (not float32):

```python
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)
```

Or reduce batch size / text length in batch inference.
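
For long inputs, one way to keep per-call memory down is to split the text into sentence-sized chunks and synthesize them one at a time. The splitting heuristic below is only an illustration, not part of the OmniVoice API:

```python
import re
import torch
import torchaudio

long_text = "First sentence. Second sentence. Third sentence."

# Naive sentence split; swap in a proper segmenter for production text.
chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", long_text) if c.strip()]

pieces = []
for chunk in chunks:
    out = model.generate(text=chunk, ref_audio="ref.wav")
    pieces.append(out[0].cpu())
    torch.cuda.empty_cache()  # release cached blocks between chunks

torchaudio.save("long_output.wav", torch.cat(pieces, dim=1), 24000)
```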

Whisper ASR not available for ref_text auto-transcription

```bash
pip install openai-whisper
```

Wrong pronunciation in Chinese

Use inline pinyin with tone numbers directly in the text string (format: PINYIN + tone number, inserted within the sentence):

```python
text = "这批货物打ZHE2出售"
```

Audio quality issues

  • Increase `num_step` to 32 or 64
  • Provide `ref_text` manually instead of relying on auto-transcription
  • Use a clean, noise-free reference audio clip (3–15 seconds recommended)
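
Putting those recommendations together, a quality-oriented call might look like the sketch below; the file names are placeholders and the parameter values are the ones suggested above.

```python
# Higher step count plus a manual transcript of a clean 3-15 second reference clip.
audio = model.generate(
    text="High quality synthesis example.",
    ref_audio="clean_reference.wav",                     # clean, noise-free reference
    ref_text="Exact transcript of the reference clip.",
    num_step=64,                                         # up from the default 32 for higher quality
)
torchaudio.save("hq.wav", audio[0], 24000)
```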

Apple Silicon (MPS) issues

Use mps device explicitly:

```python
model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="mps", dtype=torch.float16)
```

---

Model & Resources

| Resource                | Link                                           |
|-------------------------|------------------------------------------------|
| HuggingFace Model       | k2-fsa/OmniVoice                               |
| HuggingFace Space       | https://huggingface.co/spaces/k2-fsa/OmniVoice |
| Paper (arXiv)           | https://arxiv.org/abs/2604.00688               |
| Demo Page               | https://zhu-han.github.io/omnivoice            |
| Supported Languages     | `docs/languages.md` in repo                    |
| Voice Design Attributes | `docs/voice-design.md` in repo                 |
| Generation Parameters   | `docs/generation-parameters.md` in repo        |
| Training/Eval Examples  | `examples/` in repo                            |