voice-changer

ElevenLabs Voice Changer

Transform the voice in an audio recording into a different target voice. Voice Changer (previously called Speech-to-Speech; the API endpoint and SDK methods still use the `speech_to_speech` / `speechToSpeech` name) keeps the original performance, including emotion, pacing, intonation, breaths, whispers, laughs, and cries, and only swaps who is speaking.

Setup: See Installation Guide. For JavaScript, use `@elevenlabs/*` packages only.

Key Facts

Maximum input length: 5 minutes per request — split longer recordings into chunks and stitch the outputs.
Maximum file size: 50 MB per request — compress to MP3 if your source is larger.
Pricing: 1,000 characters per minute of audio processed (duration-based, not text-based).
Recommended model: `eleven_multilingual_sts_v2`, which often outperforms `eleven_english_sts_v2` even for English-only content.
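
The limits and pricing above lend themselves to a quick pre-flight check before uploading. The sketch below is an illustration only, not part of the SDK: the helper names are hypothetical, 50 MB is assumed to mean 50 × 1,000,000 bytes, and billing is assumed to scale linearly with duration without rounding.

```python
import math

MAX_SECONDS = 5 * 60          # 5-minute input limit per request
MAX_BYTES = 50 * 1_000_000    # 50 MB limit; assumes decimal megabytes
CHARS_PER_MINUTE = 1_000      # duration-based pricing


def estimate_characters(duration_seconds: float) -> float:
    """Estimated characters billed, assuming linear, unrounded pricing."""
    return duration_seconds / 60 * CHARS_PER_MINUTE


def requests_needed(duration_seconds: float) -> int:
    """Minimum number of requests when splitting into chunks of at most 5 minutes."""
    return max(1, math.ceil(duration_seconds / MAX_SECONDS))


def fits_single_request(duration_seconds: float, size_bytes: int) -> bool:
    """True if the recording is within both the duration and the size limit."""
    return duration_seconds <= MAX_SECONDS and size_bytes <= MAX_BYTES


print(estimate_characters(90))               # 1500.0
print(requests_needed(12 * 60))              # 3
print(fits_single_request(240, 8_000_000))   # True
```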

Quick Start

Python

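```python
from elevenlabs import ElevenLabs

client = ElevenLabs()  # reads ELEVENLABS_API_KEY from the environment

# Consume the stream while the source file is still open: depending on the
# SDK version, the upload may not start until the first chunk is requested.
with open("source.mp3", "rb") as audio_file, open("converted.mp3", "wb") as f:
    audio_stream = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",  # George
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )
    for chunk in audio_stream:
        f.write(chunk)
```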

JavaScript

```javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream, createWriteStream } from "fs";

const client = new ElevenLabsClient();

const audioStream = await client.speechToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  audio: createReadStream("source.mp3"),
  modelId: "eleven_multilingual_sts_v2",
  outputFormat: "mp3_44100_128",
});

audioStream.pipe(createWriteStream("converted.mp3"));
```

cURL

```bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_128" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "audio=@source.mp3" \
  -F "model_id=eleven_multilingual_sts_v2" \
  --output converted.mp3
```

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `voice_id` | string (required) | - | Target voice to speak in. Use a pre-made voice ID, a cloned voice, or a voice from the library |
| `audio` | file (required) | - | Source audio whose performance (emotion, timing, delivery) will be preserved |
| `model_id` | string | `eleven_english_sts_v2` | `eleven_multilingual_sts_v2` for 29 languages, `eleven_english_sts_v2` for English-only |
| `output_format` | string | `mp3_44100_128` | See output formats table below |
| `voice_settings` | JSON string | - | Override stored voice settings for this request only |
| `seed` | integer | - | Best-effort deterministic sampling (0–4294967295) |
| `remove_background_noise` | boolean | `false` | Run the isolation model on the input before conversion |
| `file_format` | string | `other` | `other` for any encoded audio, or `pcm_s16le_16` for 16-bit PCM mono @ 16kHz little-endian (lower latency) |
| `optimize_streaming_latency` | int (query) | - | 0–4. Trade quality for latency. `4` is fastest but disables the text normalizer |
| `enable_logging` | boolean (query) | `true` | Set to `false` for zero-retention mode (enterprise only; disables history/stitching) |

Models

| Model ID | Languages | Best For |
| --- | --- | --- |
| `eleven_multilingual_sts_v2` | 29 | Recommended for everything; often outperforms the English model even on English audio |
| `eleven_english_sts_v2` | English | API default; English-only fallback |

Only models whose `can_do_voice_conversion` property is true can be used here. Voice Changer does not currently have a low-latency "flash/turbo" tier. If you need lower latency, use `pcm_s16le_16` input, choose an `opus_*` or low-bitrate `mp3_*` output, and raise `optimize_streaming_latency`.
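
One way to check which models qualify is to filter a model list on that flag. The sketch below assumes each model record exposes `model_id` and `can_do_voice_conversion` attributes, as the paragraph above implies. The helper name is hypothetical, and the stand-in records exist only so the example runs offline; with the SDK, the records would come from the client's model-listing call, whose exact name may vary by SDK version.

```python
from types import SimpleNamespace
from typing import Iterable, List


def voice_conversion_model_ids(models: Iterable) -> List[str]:
    """Return the IDs of models flagged as usable for voice conversion."""
    return [
        m.model_id
        for m in models
        if getattr(m, "can_do_voice_conversion", False)
    ]


# Stand-in records for illustration only; real ones come from the API.
sample_models = [
    SimpleNamespace(model_id="eleven_multilingual_sts_v2", can_do_voice_conversion=True),
    SimpleNamespace(model_id="eleven_english_sts_v2", can_do_voice_conversion=True),
    SimpleNamespace(model_id="some_tts_only_model", can_do_voice_conversion=False),  # hypothetical ID
]

print(voice_conversion_model_ids(sample_models))
```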

Languages (`eleven_multilingual_sts_v2`)

English (US, UK, AU, CA), Japanese, Chinese, German, Hindi, French (FR, CA), Korean, Portuguese (BR, PT), Italian, Spanish (ES, MX), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (SA, AE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, Russian.

Target Voices

Use any voice ID from pre-made voices, your cloned voices, or the voice library.

Popular voices:

`JBFqnCBsd6RMkjVDRZzb`: George (male, narrative)
`EXAVITQu4vr4xnSDxMaL`: Sarah (female, soft)
`onwK4e9ZLuTAKqWW03F9`: Daniel (male, authoritative)
`XB0fDUnXU5powFXDhCwa`: Charlotte (female, conversational)

```python
voices = client.voices.get_all()
for voice in voices.voices:
    print(f"{voice.voice_id}: {voice.name}")
```

Converting from a URL

```python
import requests
from io import BytesIO
from elevenlabs import ElevenLabs

client = ElevenLabs()

audio_url = "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3"
response = requests.get(audio_url, timeout=30)
response.raise_for_status()  # fail early on a bad download instead of uploading an error page
audio_data = BytesIO(response.content)  # in-memory file-like object for the upload

audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_data,
    model_id="eleven_multilingual_sts_v2",
    output_format="mp3_44100_128",
)

with open("converted.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)
```

Voice Settings Override

Fine-tune the target voice for a single request without changing its stored defaults:

```python
from elevenlabs import VoiceSettings

audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.75,
        style=0.0,
        use_speaker_boost=True,
    ),
)
```

Stability: lower = more emotional range (follows the source more freely), higher = steadier delivery.
Similarity boost: higher = closer to the target voice's timbre, may amplify source artifacts.
Style: exaggerates the target voice's unique characteristics (v2+ models).
Speaker boost: post-processing to sharpen clarity of the target voice.
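
The parameter table lists `voice_settings` as a JSON string at the API level. If a given SDK version or a raw HTTP request expects that string rather than a `VoiceSettings` object, a small helper can build it. This is a sketch under assumptions: the helper name is hypothetical, and the 0.0 to 1.0 range check on the numeric settings is an assumption rather than something this document states.

```python
import json


def build_voice_settings(
    stability: float = 0.5,
    similarity_boost: float = 0.75,
    style: float = 0.0,
    use_speaker_boost: bool = True,
) -> str:
    """Serialize per-request voice settings to a JSON string."""
    numeric = {
        "stability": stability,
        "similarity_boost": similarity_boost,
        "style": style,
    }
    for name, value in numeric.items():
        # Assumed valid range; adjust if the API documents different bounds.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return json.dumps({**numeric, "use_speaker_boost": use_speaker_boost})


settings_json = build_voice_settings(stability=0.3)
print(settings_json)
```

The resulting string could then be sent as the `voice_settings` form field alongside `audio` and `model_id`.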

Cleaning Up Noisy Source Audio

If the input recording is noisy, either pre-process with the voice-isolator skill or pass `remove_background_noise=True` to do it in a single call:

```python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    remove_background_noise=True,
)
```

Cleaner input almost always produces better conversion — the model is trying to match phonemes and prosody, and background noise gets in the way.

Low-Latency PCM Input

If you already have raw 16-bit PCM mono @ 16kHz, passing `file_format="pcm_s16le_16"` skips decoding and reduces latency:

```python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=pcm_bytes,
    model_id="eleven_multilingual_sts_v2",
    file_format="pcm_s16le_16",
)
```

Pair this with `optimize_streaming_latency` (0–4) as a query parameter for further latency reductions at some quality cost.
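
If the source is a WAV file, the raw samples can be extracted with Python's standard `wave` module before upload. The sketch below is an illustration, not SDK code: the helper name is hypothetical, it only accepts files that are already 16-bit mono at 16 kHz (it does not resample), and it writes a short silent WAV first so that it runs without external files.

```python
import os
import tempfile
import wave


def wav_to_pcm_s16le_16(path: str) -> bytes:
    """Return raw little-endian 16-bit mono 16 kHz samples from a WAV file."""
    with wave.open(path, "rb") as wav:
        params = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if params != (1, 2, 16000):
            raise ValueError(f"expected mono/16-bit/16000 Hz, got {params}")
        # WAV stores PCM samples little-endian, so the frames are already s16le.
        return wav.readframes(wav.getnframes())


# Demo: write one second of silence, then extract it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)       # 2 bytes = 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

pcm_bytes = wav_to_pcm_s16le_16(demo_path)
print(len(pcm_bytes))  # 32000
```

The resulting `pcm_bytes` corresponds to the variable passed as `audio` in the call above.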

Output Formats

| Format | Description |
| --- | --- |
| `mp3_44100_128` | MP3 44.1kHz 128kbps (default); good for web/apps |
| `mp3_44100_192` | MP3 44.1kHz 192kbps (Creator+); higher quality |
| `mp3_44100_64` | MP3 44.1kHz 64kbps; smaller files |
| `mp3_22050_32` | MP3 22.05kHz 32kbps; smallest MP3 |
| `pcm_16000` | Raw PCM 16kHz; real-time pipelines |
| `pcm_24000` | Raw PCM 24kHz; good streaming balance |
| `pcm_44100` | Raw PCM 44.1kHz (Pro+); CD quality |
| `pcm_48000` | Raw PCM 48kHz (Pro+); highest quality |
| `ulaw_8000` | μ-law 8kHz; Twilio / telephony |
| `alaw_8000` | A-law 8kHz; telephony |
| `opus_48000_64` | Opus 48kHz 64kbps; efficient streaming |
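
The format identifiers in the table appear to follow a `codec_samplerate[_bitrate]` pattern, which makes them easy to split when choosing a file extension or configuring a player. A minimal sketch, assuming that pattern holds for every identifier listed above; the function name and the returned tuple layout are hypothetical.

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None for formats without a bitrate suffix


def parse_output_format(value: str) -> OutputFormat:
    """Split an identifier such as 'mp3_44100_128' into its parts."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output format: {value!r}")
    codec = parts[0]
    sample_rate = int(parts[1])
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate, bitrate)


print(parse_output_format("mp3_44100_128"))
print(parse_output_format("pcm_16000"))
```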

Deterministic Output

Pass a `seed` to make repeated conversions of the same input return (best-effort) identical audio, which is useful for testing and A/B comparisons.

```python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    seed=12345,
)
```
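
Because determinism is best-effort, it can be worth verifying that two runs actually match. One simple approach is to hash the returned chunks and compare digests. The sketch below uses only the standard library; the helper name is hypothetical, and the byte strings stand in for the chunk iterators returned by two conversions.

```python
import hashlib
from typing import Iterable


def audio_digest(chunks: Iterable[bytes]) -> str:
    """SHA-256 over the concatenated chunks, independent of chunk boundaries."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


# Placeholder data standing in for two seeded conversion results.
run_a = [b"ID3", b"\x00\x01\x02"]
run_b = [b"ID3\x00", b"\x01\x02"]   # same bytes, chunked differently
run_c = [b"ID3", b"\x00\x01\x03"]   # differs in the last byte

print(audio_digest(run_a) == audio_digest(run_b))  # True
print(audio_digest(run_a) == audio_digest(run_c))  # False
```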

Input Audio Best Practices

The conversion quality is bounded by the input recording — the model can only swap the timbre, not rescue a bad source. A few practical rules:

Be expressive. Whisper, shout, laugh, cry — the model preserves all of it. Flat input gives you flat output.
Watch microphone gain. Too quiet and the model under-detects phonemes; too loud and clipping bleeds into the conversion. Aim for healthy peaks, no clipping.
Accent and cadence transfer from the source, not the target. If you read in an American accent and target the British "George" voice, you get George's timbre with an American accent. To dub into a different accent or language, record someone speaking in that target accent/language and convert into a cloned/library voice.
Clean up noise first. Either pass `remove_background_noise=True` or run the source through the voice-isolator skill before conversion. Noise hurts more here than in TTS.
Split long recordings. Anything over 5 minutes must be chunked. Cut at natural pauses, convert each piece, and concatenate the resulting audio.
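
The chunking rule above can be sketched as a small planning step: given the total duration and a list of pause timestamps, pick the latest pause that keeps each chunk within the limit. This is an illustrative greedy approach, not an official algorithm. The function name is hypothetical, the pause timestamps are assumed to come from a separate silence-detection step that is out of scope here, and cutting the audio itself is left to an audio tool of your choice.

```python
from typing import List, Sequence, Tuple


def plan_chunks(
    duration: float,
    pauses: Sequence[float],
    max_len: float = 300.0,  # 5-minute limit, in seconds
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs in seconds, each at most max_len long."""
    chunks: List[Tuple[float, float]] = []
    sorted_pauses = sorted(pauses)
    start = 0.0
    while duration - start > max_len:
        limit = start + max_len
        # Latest pause that still keeps this chunk within the limit.
        candidates = [p for p in sorted_pauses if start < p <= limit]
        # Fall back to a hard cut when no pause is available.
        end = candidates[-1] if candidates else limit
        chunks.append((start, end))
        start = end
    chunks.append((start, duration))
    return chunks


print(plan_chunks(720.0, [95.0, 280.0, 310.0, 560.0, 600.0]))
# [(0.0, 280.0), (280.0, 560.0), (560.0, 720.0)]
```

Each planned span would then be cut from the source, converted separately, and the outputs concatenated in order.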

Common Workflows

Re-voice a narration: keep the performance of a scratch recording, swap in a different narrator voice.
Localize / dub: convert a voice-over into the same speaker's cloned voice in another language (using `eleven_multilingual_sts_v2`).
Create character voices: act out a line yourself, convert into a distinctive character voice for games or animation.
Anonymize a speaker: replace a recognizable voice with a neutral pre-made voice while preserving what was said and how.
Pair with voice-isolator: isolate the source voice first (or set `remove_background_noise=True`) for noisy field recordings before conversion.
Pair with voice cloning: clone a target voice from a short sample, then use its `voice_id` here as the conversion target.

Error Handling

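```python
try:
    audio_stream = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
    )
    # Read the stream inside the try block: depending on the SDK version,
    # request errors may only surface once the audio is consumed.
    audio_bytes = b"".join(audio_stream)
except Exception as e:
    print(f"Voice changer failed: {e}")
```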

Common errors:

401: Invalid API key
422: Invalid parameters (check `voice_id`, `model_id`, or `file_format` against the supplied audio)
429: Rate limit exceeded
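
Of the errors above, only 429 is usually worth retrying; 401 and 422 indicate a problem with the request itself. The sketch below shows one possible retry wrapper with exponential backoff. It is not part of the SDK: the function name is hypothetical, and it assumes the raised exception exposes the HTTP status as a `status_code` attribute. The fake error class exists only so the example runs offline.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_on_rate_limit(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff when it fails with status 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Assumption: the error carries the HTTP status as `status_code`.
            status = getattr(exc, "status_code", None)
            if status != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("retry loop exited unexpectedly")


class FakeRateLimitError(Exception):
    """Stand-in for an API error, used only in this demo."""
    status_code = 429


attempts: List[int] = []
delays: List[float] = []


def flaky_conversion() -> bytes:
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeRateLimitError("rate limited")
    return b"audio"


result = retry_on_rate_limit(flaky_conversion, sleep=delays.append)
print(result, delays)  # b'audio' [1.0, 2.0]
```

In real use, `call` would wrap the conversion request and the consumption of its stream, so that errors raised while reading the audio are also covered.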

References

Installation Guide
