moss-tts-nano-speech

MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so, from the Daily 2026 Skills collection.
MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time, supports 20 languages, voice cloning, and streaming inference, and runs on CPU without a GPU.

Installation

Conda (recommended)

bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano

pip install -r requirements.txt
pip install -e .

Fix WeTextProcessing if it fails

bash
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
After `pip install -e .`, the `moss-tts-nano` CLI command is available in the active environment.
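
To confirm the install, a quick check (the `--help` flag is assumed here; `which` simply verifies the entry point is on PATH):

bash
which moss-tts-nano
moss-tts-nano --help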

Model Weights

Models are auto-downloaded from Hugging Face on first run:
  • TTS model:
    OpenMOSS-Team/MOSS-TTS-Nano
  • Audio tokenizer:
    OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano
ModelScope mirrors are available at `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.
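
To pre-fetch the weights instead of waiting for the first run, a minimal sketch using the standard `huggingface_hub` API (the repo itself may wrap this differently):

python
from huggingface_hub import snapshot_download

# Download both repos into the local Hugging Face cache ahead of time
snapshot_download("OpenMOSS-Team/MOSS-TTS-Nano")
snapshot_download("OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano")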

CLI Commands

Generate speech (voice clone mode)

bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
Output defaults to `generated_audio/moss_tts_nano_output.wav`.

Generate from a text file (long-form)

bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav

Launch local web demo

bash
moss-tts-nano serve

or directly:

bash
python app.py

Opens at `http://127.0.0.1:18083`; the model stays loaded in memory, so repeated requests are fast.

Direct Python entrypoint

bash
python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."
Output defaults to `generated_audio/infer_output.wav`.

Python API Usage

Basic voice clone inference

python
import soundfile as sf

from infer import MossTTSNanoInference

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
sf.write("output.wav", audio, samplerate=48000)

English voice clone

python
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("english_output.wav", audio, samplerate=48000)

Streaming inference (low latency)

python
from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()

chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)

Long-text synthesis with chunked voice cloning

python
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""

audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)

import soundfile as sf
sf.write("long_form_output.wav", audio, samplerate=48000)

FastAPI HTTP endpoint usage

When the server is running (`moss-tts-nano serve` or `python app.py`):
python
import base64
import io

import requests
import soundfile as sf

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好,这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)

# Decode the returned audio and save it to disk
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])
audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)

Streaming HTTP response (real-time web playback)

python
import base64

import requests

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例,适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)

Supported Languages

| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio; no explicit language code parameter is required for basic usage.
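
As an illustration, switching languages requires only different input text; no language flag is passed. A sketch reusing the `infer` API from above (the Japanese text and the reuse of the English reference clip are illustrative):

python
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

# No language code parameter; the language is inferred from the text
audio = tts.infer(
    text="こんにちは、これはMOSS-TTS-Nanoのテストです。",
    prompt_audio_path="assets/audio/en_sample.wav",
)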

Architecture Overview

  • Pipeline: Audio Tokenizer + LLM (pure autoregressive)
  • Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
  • Output: 48 kHz, 2-channel (stereo)
  • Token rate: 12.5 Hz token stream
  • Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps; see the bitrate check below)
  • LLM: ~0.1B parameters total
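
These figures are mutually consistent if each codebook holds 1024 entries (10 bits per codebook per token), which the numbers imply but the repo does not state outright. A quick check:

python
# Bitrate check for the RVQ token stream.
# Assumption: 1024-entry codebooks, i.e. 10 bits per codebook per token.
token_rate_hz = 12.5
bits_per_codebook = 10

for n_codebooks in (1, 16):
    kbps = token_rate_hz * n_codebooks * bits_per_codebook / 1000
    print(f"{n_codebooks:2d} codebook(s) -> {kbps:.3f} kbps")
# 1 codebook  -> 0.125 kbps; 16 codebooks -> 2.000 kbps,
# matching the quoted 0.125 kbps - 2 kbps range.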

Key CLI Flags

| Flag | Description |
|------|-------------|
| `--prompt-audio-path` | Path to the reference WAV for voice cloning (`infer.py`) |
| `--prompt-speech` | Same purpose in the `moss-tts-nano generate` CLI |
| `--text` | Input text string |
| `--text-file` | Path to a plain text file for long-form synthesis |
| `--output` | Output WAV file path (default varies by entrypoint) |

Common Patterns

Pattern: Batch synthesis with one reference voice

python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

sentences = [
    "第一句话,用于批量合成测试。",
    "第二句话,保持相同的音色。",
    "第三句话,输出独立的音频文件。",
]

for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")

Pattern: Real-time playback with sounddevice

python
import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

buffer = []
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()
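
The pattern above buffers the whole utterance before playing it. For genuinely incremental playback, a sketch that writes each chunk to an open output stream as it arrives (this assumes chunks are float arrays shaped (num_samples, 2) at 48 kHz):

python
import sounddevice as sd
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

# Play each chunk as soon as it is generated instead of buffering
with sd.OutputStream(samplerate=48000, channels=2, dtype="float32") as stream:
    for chunk in tts.infer_stream(
        text="Chunks are played as soon as they are generated.",
        prompt_audio_path="assets/audio/en_sample.wav",
    ):
        stream.write(chunk.astype("float32"))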

Pattern: Gradio integration

python
import gradio as gr

from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)

demo.launch()

Troubleshooting

WeTextProcessing install fails

bash
# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git

Model download is slow or fails

Set `HF_ENDPOINT` to a mirror if Hugging Face is unreachable:
bash
export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"
Or use ModelScope:
bash
pip install modelscope
Then point the model paths to `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.

Out of memory on CPU

  • Use streaming inference (`infer_stream`) to reduce peak memory.
  • Reduce chunk size for long text inputs; the model handles chunked voice cloning automatically, but you can also split the text yourself (see the sketch below).
  • Close other applications; the model needs roughly 1–2 GB of RAM.
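
A minimal sketch of manual chunking, assuming the input file holds one paragraph per line (the file name is illustrative; the `infer` call is the documented API):

python
import numpy as np
import soundfile as sf
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"

# Synthesize one small piece at a time so peak memory stays low
with open("my_script.txt") as f:
    paragraphs = [line.strip() for line in f if line.strip()]

pieces = [tts.infer(text=p, prompt_audio_path=ref) for p in paragraphs]
sf.write("low_memory_output.wav", np.concatenate(pieces), samplerate=48000)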

Audio output is silent or corrupt

  • Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
  • Minimum reference audio duration: roughly 3–5 seconds for reliable voice cloning (a quick check script follows).
  • Avoid reference audio with heavy background noise.
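
A quick sanity check on a candidate reference clip, a sketch using `soundfile` (the 3-second threshold mirrors the guidance above):

python
import soundfile as sf

info = sf.info("assets/audio/zh_1.wav")
duration_s = info.frames / info.samplerate

print(f"{info.channels} channel(s), {info.samplerate} Hz, "
      f"{info.subtype}, {duration_s:.1f} s")

if duration_s < 3:
    print("Warning: clip is shorter than ~3 s; cloning may be unreliable")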

`moss-tts-nano` command not found

bash
# Re-run editable install inside the active conda env
pip install -e .
which moss-tts-nano  # should resolve now

Port conflict for web demo

bash
# Default port is 18083; check what occupies it
lsof -i :18083

# Kill if needed, then relaunch
moss-tts-nano serve

Output Defaults

| Entrypoint | Default output path |
|------------|---------------------|
| `python infer.py` | `generated_audio/infer_output.wav` |
| `moss-tts-nano generate` | `generated_audio/moss_tts_nano_output.wav` |
| `python app.py` / `moss-tts-nano serve` | returned via HTTP response |

The `generated_audio/` directory is created automatically if it does not exist.