# MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so — Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time, supports 20 languages, voice cloning, and streaming inference, and runs on CPU without a GPU.
## Installation

### Conda (recommended)

```bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano
pip install -r requirements.txt
pip install -e .
```

### Fix WeTextProcessing if it fails

```bash
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```

After `pip install -e .`, the `moss-tts-nano` CLI command is available in the active environment.

## Model Weights
Models are auto-downloaded from Hugging Face on first run:

- TTS model: `OpenMOSS-Team/MOSS-TTS-Nano`
- Audio tokenizer: `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano`

ModelScope mirrors are available at `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.
## CLI Commands
### Generate speech (voice clone mode)

```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
```

Output defaults to `generated_audio/moss_tts_nano_output.wav`.

### Generate from a text file (long-form)
```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav
```

### Launch local web demo
```bash
moss-tts-nano serve
```

or directly:

```bash
python app.py
```

Opens at `http://127.0.0.1:18083`; the model stays loaded in memory for fast repeated requests.

### Direct Python entrypoint
```bash
python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."
```

Output: `generated_audio/infer_output.wav`

## Python API Usage
### Basic voice clone inference

```python
from infer import MossTTSNanoInference
import soundfile as sf

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
sf.write("output.wav", audio, samplerate=48000)
```

### English voice clone
```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)
sf.write("english_output.wav", audio, samplerate=48000)
```

### Streaming inference (low latency)
```python
from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()
chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)
```

### Long-text synthesis with chunked voice cloning
```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""
audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)
sf.write("long_form_output.wav", audio, samplerate=48000)
```

### FastAPI HTTP endpoint usage
When the server is running (`moss-tts-nano serve` or `python app.py`):

```python
import requests
import base64
import soundfile as sf
import io

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好,这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])
audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)
```

### Streaming HTTP response (real-time web playback)
```python
import requests
import base64

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例,适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)
```

## Supported Languages
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.
## Architecture Overview

- Pipeline: Audio Tokenizer + LLM (pure autoregressive)
- Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
- Output: 48 kHz, 2-channel (stereo)
- Token rate: 12.5 Hz token stream
- Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
- LLM: ~0.1B parameters total
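The codebook and token-rate figures above can be cross-checked with a little arithmetic. Assuming 1024-entry codebooks (10 bits per token; the codebook size is not stated in the source, but this value reproduces the quoted range), the 0.125–2 kbps span falls out of the 12.5 Hz token rate and 16 RVQ codebooks:

```python
# Cross-check the RVQ bitrate range from the token rate and codebook count.
# Assumption: each codebook has 1024 entries (10 bits/token) -- not stated
# in the source, but consistent with the quoted 0.125-2 kbps range.
TOKEN_RATE_HZ = 12.5        # tokens per second
BITS_PER_CODEBOOK = 10      # log2(1024), assumed
NUM_CODEBOOKS = 16

per_codebook_bps = TOKEN_RATE_HZ * BITS_PER_CODEBOOK      # 125 bps
min_kbps = per_codebook_bps / 1000                        # 1 codebook used
max_kbps = per_codebook_bps * NUM_CODEBOOKS / 1000        # all 16 codebooks

print(f"{min_kbps} kbps to {max_kbps} kbps")  # → 0.125 kbps to 2.0 kbps
```

The lower bound corresponds to decoding only the first residual codebook; each additional codebook adds 125 bps of detail.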
## Key CLI Flags

| Flag | Alias | Description |
|---|---|---|
| `--prompt-speech` | — | Path to reference WAV for voice cloning (`moss-tts-nano generate`) |
| `--prompt-audio-path` | — | Same purpose in `infer.py` |
| `--text` | — | Input text string |
| `--text-file` | — | Path to plain text file for long-form synthesis |
| `--output` | — | Output WAV file path (default varies by entrypoint) |
## Common Patterns

### Pattern: Batch synthesis with one reference voice

```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"
sentences = [
    "第一句话,用于批量合成测试。",
    "第二句话,保持相同的音色。",
    "第三句话,输出独立的音频文件。",
]
for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")
```

### Pattern: Real-time playback with sounddevice
```python
import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()
buffer = []
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()
```

### Pattern: Gradio integration
```python
import gradio as gr
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for the Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)
demo.launch()
```

## Troubleshooting
### WeTextProcessing install fails

```bash
# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```
### Model download is slow or fails

Set `HF_ENDPOINT` to a mirror if Hugging Face is unreachable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"
```

Or use ModelScope:

```bash
pip install modelscope
```

Then point model paths to `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.

### Out of memory on CPU
- Use streaming inference (`infer_stream`) to reduce peak memory.
- Reduce chunk size for long text inputs; the model handles chunked voice cloning automatically.
- Close other applications; the model needs ~1–2 GB RAM.
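The chunk-size advice above can also be applied before calling the model: pre-split long input at sentence boundaries and synthesize each piece separately, so peak memory scales with the longest chunk rather than the whole document. A minimal sketch (`chunk_text` is a hypothetical helper, not part of the MOSS-TTS-Nano API; the model's built-in chunker may split differently):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence-ending punctuation so chunks stay near max_chars."""
    # Split after Chinese or Western sentence terminators, dropping whitespace.
    sentences = re.split(r"(?<=[。.!?!?])\s*", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence. Third sentence.", max_chars=20))
# → ['First sentence.', 'Second sentence.', 'Third sentence.']
```

Each chunk can then be passed to `tts.infer` in turn with the same `prompt_audio_path` to keep the speaker identity consistent.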
### Audio output is silent or corrupt

- Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
- Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
- Avoid reference audio with heavy background noise.
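These checks can be automated before synthesis. A stdlib sketch (`check_reference` is a hypothetical helper, not part of the project) that catches the two most common problems, a too-short or digitally silent reference WAV:

```python
import wave

def check_reference(path: str, min_seconds: float = 3.0) -> list[str]:
    """Return a list of problems found in a voice-clone reference WAV."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        frames = w.readframes(w.getnframes())
    issues = []
    if duration < min_seconds:
        issues.append(f"too short: {duration:.1f}s (< {min_seconds}s recommended)")
    if max(frames, default=0) == 0:
        # Every byte is zero: 16-bit PCM digital silence.
        issues.append("all samples are zero (digital silence)")
    return issues
```

Note that the stdlib `wave` module only reads integer-PCM files; for float32 WAVs, use `soundfile.read` for the same checks.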
### `moss-tts-nano` command not found

```bash
# Re-run the editable install inside the active conda env
pip install -e .
which moss-tts-nano  # should resolve now
```
### Port conflict for web demo

```bash
# Default port is 18083; check what occupies it
lsof -i :18083
# Kill the process if needed, then relaunch
moss-tts-nano serve
```
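`lsof` is not available everywhere (e.g. on Windows), so here is a cross-platform check using the stdlib `socket` module instead:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. port taken.
        return s.connect_ex((host, port)) == 0

if port_in_use(18083):
    print("Port 18083 is taken; stop the other process or change the port.")
else:
    print("Port 18083 is free; `moss-tts-nano serve` can bind it.")
```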
## Output Defaults

| Entrypoint | Default output path |
|---|---|
| `moss-tts-nano generate` | `generated_audio/moss_tts_nano_output.wav` |
| `python infer.py` | `generated_audio/infer_output.wav` |
| HTTP server (`app.py`) | Returned via HTTP response |

The `generated_audio/` directory is created automatically if it does not exist.