# MOSS-TTS-Nano Speech Generation Skill

Skill by ara.so — Daily 2026 Skills collection.

MOSS-TTS-Nano is an open-source multilingual tiny TTS model (0.1B parameters) from MOSI.AI and the OpenMOSS team. It uses an Audio Tokenizer + LLM autoregressive pipeline to generate 48 kHz stereo speech in real time, supports 20 languages, voice cloning, and streaming inference, and runs on CPU without a GPU.
## Installation

### Conda (recommended)

```bash
conda create -n moss-tts-nano python=3.12 -y
conda activate moss-tts-nano
git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git
cd MOSS-TTS-Nano
pip install -r requirements.txt
pip install -e .
```

### Fix WeTextProcessing if it fails

```bash
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```

After `pip install -e .`, the `moss-tts-nano` CLI command is available in the active environment.

## Model Weights
Models are auto-downloaded from Hugging Face on first run:

- TTS model: `OpenMOSS-Team/MOSS-TTS-Nano`
- Audio tokenizer: `OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano`

ModelScope mirrors are available at `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.
## CLI Commands
### Generate speech (voice clone mode)

```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text "欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。"
```

Output defaults to `generated_audio/moss_tts_nano_output.wav`.

### Generate from a text file (long-form)
```bash
moss-tts-nano generate \
  --prompt-speech assets/audio/zh_1.wav \
  --text-file my_script.txt \
  --output output.wav
```

### Launch local web demo
```bash
moss-tts-nano serve
```

or directly:

```bash
python app.py
```

Opens at `http://127.0.0.1:18083`; the model stays loaded in memory for fast repeated requests.

### Direct Python entrypoint
```bash
python infer.py \
  --prompt-audio-path assets/audio/zh_1.wav \
  --text "Hello, this is a test of MOSS-TTS-Nano."
```

Output: `generated_audio/infer_output.wav`

## Python API Usage
### Basic voice clone inference

```python
from infer import MossTTSNanoInference
import soundfile as sf

# Initialize once (downloads weights on first run)
tts = MossTTSNanoInference()

# Voice clone: synthesize text in the style of the reference audio
audio = tts.infer(
    text="欢迎使用MOSS语音合成系统。",
    prompt_audio_path="assets/audio/zh_1.wav",
)

# Save output
sf.write("output.wav", audio, samplerate=48000)
```

### English voice clone
```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
audio = tts.infer(
    text="Welcome to MOSS TTS Nano, a tiny but capable text to speech model.",
    prompt_audio_path="assets/audio/en_sample.wav",
)
sf.write("english_output.wav", audio, samplerate=48000)
```

### Streaming inference (low latency)
```python
from infer import MossTTSNanoInference
import soundfile as sf
import numpy as np

tts = MossTTSNanoInference()
chunks = []
for audio_chunk in tts.infer_stream(
    text="This sentence is generated chunk by chunk for low latency playback.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    chunks.append(audio_chunk)
    # process or play chunk in real time here

full_audio = np.concatenate(chunks)
sf.write("streamed_output.wav", full_audio, samplerate=48000)
```

### Long-text synthesis with chunked voice cloning
```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
long_text = """
MOSS-TTS-Nano supports long-form synthesis through automatic chunking.
Each chunk uses the same reference voice, producing consistent speaker identity
across the entire output even for multi-paragraph documents.
"""
audio = tts.infer(
    text=long_text,
    prompt_audio_path="assets/audio/en_sample.wav",
)
sf.write("long_form_output.wav", audio, samplerate=48000)
```

### FastAPI HTTP endpoint usage
When the server is running (`moss-tts-nano serve` or `python app.py`):

```python
import requests
import base64
import soundfile as sf
import io

# Read reference audio as base64
with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://127.0.0.1:18083/generate",
    json={
        "text": "你好,这是一个语音合成测试。",
        "prompt_audio_base64": ref_audio_b64,
    },
)
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])
audio_array, sr = sf.read(io.BytesIO(audio_bytes))
sf.write("api_output.wav", audio_array, samplerate=sr)
```

### Streaming HTTP response (real-time web playback)
```python
import requests
import base64

with open("assets/audio/zh_1.wav", "rb") as f:
    ref_audio_b64 = base64.b64encode(f.read()).decode()

with requests.post(
    "http://127.0.0.1:18083/generate_stream",
    json={
        "text": "流式语音合成示例,适合实时播放场景。",
        "prompt_audio_base64": ref_audio_b64,
    },
    stream=True,
) as resp:
    with open("stream_output.wav", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)
```

## Supported Languages
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| zh | Chinese | en | English | de | German |
| es | Spanish | fr | French | ja | Japanese |
| it | Italian | hu | Hungarian | ko | Korean |
| ru | Russian | fa | Persian | ar | Arabic |
| pl | Polish | pt | Portuguese | cs | Czech |
| da | Danish | sv | Swedish | el | Greek |
| tr | Turkish | | | | |

The language is inferred automatically from the input text and the reference audio. No explicit language code parameter is required for basic usage.
## Architecture Overview

- Pipeline: Audio Tokenizer + LLM (pure autoregressive)
- Audio Tokenizer: MOSS-Audio-Tokenizer-Nano (~20M params), CNN-free causal Transformer (Cat architecture)
- Output: 48 kHz, 2-channel (stereo)
- Token rate: 12.5 Hz token stream
- Codebooks: RVQ with 16 codebooks (0.125 kbps – 2 kbps)
- LLM: ~0.1B parameters total
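The codebook and token-rate figures above can be cross-checked with a little arithmetic. Assuming 1024-entry codebooks (10 bits per token; the codebook size is not stated in the source, but this value reproduces the quoted range), the 0.125–2 kbps span falls out of the 12.5 Hz token rate and 16 RVQ codebooks:

```python
# Cross-check the RVQ bitrate range from the token rate and codebook count.
# Assumption: each codebook has 1024 entries (10 bits/token) -- not stated
# in the source, but consistent with the quoted 0.125-2 kbps range.
TOKEN_RATE_HZ = 12.5        # tokens per second
BITS_PER_CODEBOOK = 10      # log2(1024), assumed
NUM_CODEBOOKS = 16

per_codebook_bps = TOKEN_RATE_HZ * BITS_PER_CODEBOOK      # 125 bps
min_kbps = per_codebook_bps / 1000                        # 1 codebook used
max_kbps = per_codebook_bps * NUM_CODEBOOKS / 1000        # all 16 codebooks

print(f"{min_kbps} kbps to {max_kbps} kbps")  # → 0.125 kbps to 2.0 kbps
```

The lower bound corresponds to decoding only the first residual codebook; each additional codebook adds 125 bps of detail.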
## Key CLI Flags

| Flag | Alias | Description |
|---|---|---|
| `--prompt-speech` | — | Path to reference WAV for voice cloning (`moss-tts-nano generate`) |
| `--prompt-audio-path` | — | Same purpose in `infer.py` |
| `--text` | — | Input text string |
| `--text-file` | — | Path to plain text file for long-form synthesis |
| `--output` | — | Output WAV file path (default varies by entrypoint) |
## Common Patterns

### Pattern: Batch synthesis with one reference voice

```python
from infer import MossTTSNanoInference
import soundfile as sf

tts = MossTTSNanoInference()
ref = "assets/audio/zh_1.wav"
sentences = [
    "第一句话,用于批量合成测试。",
    "第二句话,保持相同的音色。",
    "第三句话,输出独立的音频文件。",
]
for i, sentence in enumerate(sentences):
    audio = tts.infer(text=sentence, prompt_audio_path=ref)
    sf.write(f"output_{i:02d}.wav", audio, samplerate=48000)
    print(f"Saved output_{i:02d}.wav")
```

### Pattern: Real-time playback with sounddevice
```python
import sounddevice as sd
import numpy as np
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()
buffer = []
for chunk in tts.infer_stream(
    text="Real-time playback example using sounddevice.",
    prompt_audio_path="assets/audio/en_sample.wav",
):
    buffer.append(chunk)

audio = np.concatenate(buffer)
sd.play(audio, samplerate=48000)
sd.wait()
```

### Pattern: Gradio integration
```python
import gradio as gr
from infer import MossTTSNanoInference

tts = MossTTSNanoInference()

def synthesize(reference_audio_path: str, text: str):
    audio = tts.infer(text=text, prompt_audio_path=reference_audio_path)
    # Return as (sample_rate, numpy_array) tuple for the Gradio Audio component
    return (48000, audio)

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Audio(type="filepath", label="Reference Voice"),
        gr.Textbox(label="Text to synthesize"),
    ],
    outputs=gr.Audio(label="Generated Speech"),
    title="MOSS-TTS-Nano Voice Clone",
)
demo.launch()
```

## Troubleshooting
### WeTextProcessing install fails

```bash
# Use conda to get pynini, then install from source
conda install -c conda-forge pynini=2.1.6.post1 -y
pip install git+https://github.com/WhizZest/WeTextProcessing.git
```
### Model download is slow or fails

Set `HF_ENDPOINT` to a mirror if Hugging Face is unreachable:

```bash
export HF_ENDPOINT=https://hf-mirror.com
python infer.py --prompt-audio-path assets/audio/zh_1.wav --text "测试"
```

Or use ModelScope:

```bash
pip install modelscope
```

Then point model paths to `openmoss/MOSS-TTS-Nano` and `openmoss/MOSS-Audio-Tokenizer-Nano`.

### Out of memory on CPU
- Use streaming inference (`infer_stream`) to reduce peak memory.
- Reduce chunk size for long text inputs; the model handles chunked voice cloning automatically.
- Close other applications; the model needs ~1–2 GB RAM.
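The chunk-size advice above can also be applied before calling the model: pre-split long input at sentence boundaries and synthesize each piece separately, so peak memory scales with the longest chunk rather than the whole document. A minimal sketch (`chunk_text` is a hypothetical helper, not part of the MOSS-TTS-Nano API; the model's built-in chunker may split differently):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence-ending punctuation so chunks stay near max_chars."""
    # Split after Chinese or Western sentence terminators, dropping whitespace.
    sentences = re.split(r"(?<=[。.!?!?])\s*", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("First sentence. Second sentence. Third sentence.", max_chars=20))
# → ['First sentence.', 'Second sentence.', 'Third sentence.']
```

Each chunk can then be passed to `tts.infer` in turn with the same `prompt_audio_path` to keep the speaker identity consistent.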
### Audio output is silent or corrupt

- Ensure the reference WAV is a clean mono or stereo file, 16-bit or float32, at any sample rate (it will be resampled).
- Minimum reference audio duration: ~3–5 seconds for reliable voice cloning.
- Avoid reference audio with heavy background noise.
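These checks can be automated before synthesis. A stdlib sketch (`check_reference` is a hypothetical helper, not part of the project) that catches the two most common problems, a too-short or digitally silent reference WAV:

```python
import wave

def check_reference(path: str, min_seconds: float = 3.0) -> list[str]:
    """Return a list of problems found in a voice-clone reference WAV."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        frames = w.readframes(w.getnframes())
    issues = []
    if duration < min_seconds:
        issues.append(f"too short: {duration:.1f}s (< {min_seconds}s recommended)")
    if max(frames, default=0) == 0:
        # Every byte is zero: 16-bit PCM digital silence.
        issues.append("all samples are zero (digital silence)")
    return issues
```

Note that the stdlib `wave` module only reads integer-PCM files; for float32 WAVs, use `soundfile.read` for the same checks.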
### `moss-tts-nano` command not found

```bash
# Re-run the editable install inside the active conda env
pip install -e .
which moss-tts-nano  # should resolve now
```
### Port conflict for web demo

```bash
# Default port is 18083; check what occupies it
lsof -i :18083
# Kill the process if needed, then relaunch
moss-tts-nano serve
```
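`lsof` is not available everywhere (e.g. on Windows), so here is a cross-platform check using the stdlib `socket` module instead:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. port taken.
        return s.connect_ex((host, port)) == 0

if port_in_use(18083):
    print("Port 18083 is taken; stop the other process or change the port.")
else:
    print("Port 18083 is free; `moss-tts-nano serve` can bind it.")
```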
## Output Defaults

| Entrypoint | Default output path |
|---|---|
| `moss-tts-nano generate` | `generated_audio/moss_tts_nano_output.wav` |
| `python infer.py` | `generated_audio/infer_output.wav` |
| HTTP server (`app.py`) | Returned via HTTP response |

The `generated_audio/` directory is created automatically if it does not exist.