voice-changer

ElevenLabs Voice Changer

Transform the voice in an audio recording into a different target voice. Voice Changer (previously called Speech-to-Speech; the API endpoint and SDK methods still use the speech_to_speech / speechToSpeech name) keeps the original performance, including emotion, pacing, intonation, breaths, whispers, laughs, and cries, and only swaps who is speaking.
Setup: See Installation Guide. For JavaScript, use @elevenlabs/* packages only.

Key Facts

  • Maximum input length: 5 minutes per request — split longer recordings into chunks and stitch the outputs.
  • Maximum file size: 50 MB per request — compress to MP3 if your source is larger.
  • Pricing: 1,000 characters per minute of audio processed (duration-based, not text-based).
  • Recommended model: eleven_multilingual_sts_v2, which often outperforms eleven_english_sts_v2 even for English-only content.
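The limits above reduce to simple arithmetic. The following is a minimal sketch, assuming you already know the recording's duration in seconds. The helper names plan_chunks and estimate_characters are hypothetical and not part of the ElevenLabs SDK, and the cost helper assumes billing scales linearly with duration with no per-request rounding, which the list above does not specify.

```python
import math

MAX_CHUNK_SECONDS = 5 * 60      # 5-minute request limit from the list above
CHARACTERS_PER_MINUTE = 1000    # duration-based pricing from the list above


def plan_chunks(duration_seconds: float, max_chunk: float = MAX_CHUNK_SECONDS):
    """Return (start, end) offsets in seconds, each span no longer than max_chunk.

    Chunks are equal-length so that no tiny trailing piece is left over.
    Real recordings should be cut at natural pauses near these offsets.
    """
    if duration_seconds <= 0:
        return []
    count = math.ceil(duration_seconds / max_chunk)
    length = duration_seconds / count
    return [(i * length, min((i + 1) * length, duration_seconds)) for i in range(count)]


def estimate_characters(duration_seconds: float) -> float:
    """Estimate billed characters, assuming cost scales linearly with duration."""
    return duration_seconds / 60 * CHARACTERS_PER_MINUTE


if __name__ == "__main__":
    for start, end in plan_chunks(12 * 60):   # a 12-minute recording
        print(f"{start:7.1f}s -> {end:7.1f}s")
    print(estimate_characters(12 * 60))
```

The offsets are only a starting point: each chunk still has to be cut from the source with an audio tool of your choice and converted in its own request.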

Quick Start

Python

python
from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("source.mp3", "rb") as audio_file:
    audio_stream = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",  # George
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_128",
    )

with open("converted.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

JavaScript

javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream, createWriteStream } from "fs";

const client = new ElevenLabsClient();

const audioStream = await client.speechToSpeech.convert("JBFqnCBsd6RMkjVDRZzb", {
  audio: createReadStream("source.mp3"),
  modelId: "eleven_multilingual_sts_v2",
  outputFormat: "mp3_44100_128",
});

audioStream.pipe(createWriteStream("converted.mp3"));

cURL

bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_128" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "audio=@source.mp3" \
  -F "model_id=eleven_multilingual_sts_v2" \
  --output converted.mp3

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| voice_id | string (required) | | Target voice to speak in. Use a pre-made voice ID, a cloned voice, or a voice from the library |
| audio | file (required) | | Source audio whose performance (emotion, timing, delivery) will be preserved |
| model_id | string | eleven_english_sts_v2 | eleven_multilingual_sts_v2 for 29 languages, eleven_english_sts_v2 for English-only |
| output_format | string | mp3_44100_128 | See output formats table below |
| voice_settings | JSON string | | Override stored voice settings for this request only |
| seed | integer | | Best-effort deterministic sampling (0 – 4294967295) |
| remove_background_noise | boolean | false | Run the isolation model on the input before conversion |
| file_format | string | other | other for any encoded audio, or pcm_s16le_16 for 16-bit PCM mono @ 16kHz little-endian (lower latency) |
| optimize_streaming_latency | int (query) | | 0–4. Trade quality for latency. 4 is fastest but disables the text normalizer |
| enable_logging | boolean (query) | true | Set to false for zero-retention mode (enterprise only; disables history/stitching) |
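Several of the parameters above have narrow value ranges, so it can be worth checking them locally before spending a request. This is a minimal sketch: validate_options is a hypothetical helper, not an SDK function, and it only encodes the ranges listed in the parameter table.

```python
KNOWN_FILE_FORMATS = {"other", "pcm_s16le_16"}   # values listed in the table above
MAX_SEED = 4294967295                            # upper bound from the table above


def validate_options(seed=None, file_format="other", optimize_streaming_latency=None):
    """Raise ValueError for values outside the ranges documented above.

    This is a local pre-flight check only; the API remains the source of truth.
    """
    if seed is not None and not (isinstance(seed, int) and 0 <= seed <= MAX_SEED):
        raise ValueError(f"seed must be an integer between 0 and {MAX_SEED}")
    if file_format not in KNOWN_FILE_FORMATS:
        raise ValueError(f"file_format must be one of {sorted(KNOWN_FILE_FORMATS)}")
    latency = optimize_streaming_latency
    if latency is not None and not (isinstance(latency, int) and 0 <= latency <= 4):
        raise ValueError("optimize_streaming_latency must be an integer from 0 to 4")
    # Return only the options that were actually set, keyed by parameter name.
    options = {"file_format": file_format}
    if seed is not None:
        options["seed"] = seed
    if latency is not None:
        options["optimize_streaming_latency"] = latency
    return options
```

Assuming the SDK accepts these options under the same names as the table, the returned dict could be unpacked into the convert call as keyword arguments.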

Models

| Model ID | Languages | Best For |
| --- | --- | --- |
| eleven_multilingual_sts_v2 | 29 | Recommended for everything; often outperforms the English model even on English audio |
| eleven_english_sts_v2 | English | API default; English-only fallback |

Only models whose can_do_voice_conversion property is true can be used here. Voice Changer does not currently have a low-latency "flash/turbo" tier. If you need lower latency, use pcm_s16le_16 input, an opus_* or low-bitrate mp3_* output, and raise optimize_streaming_latency.
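To pick out usable models programmatically, you can filter on that property. The sketch below assumes model metadata is available as a list of dicts carrying model_id and can_do_voice_conversion keys; the sample list is illustrative data written for this example, not a real API response, and voice_conversion_models is a hypothetical helper name.

```python
def voice_conversion_models(models):
    """Return the IDs of models whose can_do_voice_conversion flag is true."""
    return [m["model_id"] for m in models if m.get("can_do_voice_conversion")]


# Illustrative metadata only; the third entry is a made-up text-only model.
sample_models = [
    {"model_id": "eleven_multilingual_sts_v2", "can_do_voice_conversion": True},
    {"model_id": "eleven_english_sts_v2", "can_do_voice_conversion": True},
    {"model_id": "example_text_only_model", "can_do_voice_conversion": False},
]

print(voice_conversion_models(sample_models))
```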

Languages (eleven_multilingual_sts_v2)

English (US, UK, AU, CA), Japanese, Chinese, German, Hindi, French (FR, CA), Korean, Portuguese (BR, PT), Italian, Spanish (ES, MX), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (SA, AE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, Russian.

Target Voices

Use any voice ID from pre-made voices, your cloned voices, or the voice library.
Popular voices:
  • JBFqnCBsd6RMkjVDRZzb: George (male, narrative)
  • EXAVITQu4vr4xnSDxMaL: Sarah (female, soft)
  • onwK4e9ZLuTAKqWW03F9: Daniel (male, authoritative)
  • XB0fDUnXU5powFXDhCwa: Charlotte (female, conversational)
python
voices = client.voices.get_all()
for voice in voices.voices:
    print(f"{voice.voice_id}: {voice.name}")

Converting from a URL

python
import requests
from io import BytesIO
from elevenlabs import ElevenLabs

client = ElevenLabs()

audio_url = "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3"
response = requests.get(audio_url, timeout=30)
response.raise_for_status()  # fail early instead of uploading an error page as audio
audio_data = BytesIO(response.content)

audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_data,
    model_id="eleven_multilingual_sts_v2",
    output_format="mp3_44100_128",
)

with open("converted.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)

Voice Settings Override

Fine-tune the target voice for a single request without changing its stored defaults:
python
from elevenlabs import VoiceSettings

audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.75,
        style=0.0,
        use_speaker_boost=True,
    ),
)
  • Stability: lower = more emotional range (follows the source more freely), higher = steadier delivery.
  • Similarity boost: higher = closer to the target voice's timbre, but may amplify source artifacts.
  • Style: exaggerates the target voice's unique characteristics (v2+ models).
  • Speaker boost: post-processing to sharpen clarity of the target voice.
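The parameter table lists voice_settings as a JSON string, which matters when you call the HTTP endpoint directly rather than through the SDK. A minimal sketch of building that string follows, assuming the JSON keys match the VoiceSettings field names used above.

```python
import json

# Same values as the SDK example above.
settings = {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "use_speaker_boost": True,
}

# Compact JSON string intended for a multipart form field named voice_settings.
voice_settings_field = json.dumps(settings, separators=(",", ":"))
print(voice_settings_field)
```

With cURL, the resulting string would go in an additional -F "voice_settings=..." field alongside audio and model_id.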

Cleaning Up Noisy Source Audio

If the input recording is noisy, either pre-process with the voice-isolator skill or pass remove_background_noise=True to do it in a single call:
python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    remove_background_noise=True,
)
Cleaner input almost always produces better conversion — the model is trying to match phonemes and prosody, and background noise gets in the way.

Low-Latency PCM Input

If you already have raw 16-bit PCM mono @ 16kHz, passing file_format="pcm_s16le_16" skips decoding and reduces latency:
python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=pcm_bytes,
    model_id="eleven_multilingual_sts_v2",
    file_format="pcm_s16le_16",
)
Pair this with optimize_streaming_latency (0–4) as a query param for further latency reductions at some quality cost.
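If your audio is in a WAV file rather than raw bytes, the standard library can confirm it matches the pcm_s16le_16 layout and strip the header. This is a sketch under the assumption that the file is uncompressed PCM; wav_to_pcm_s16le_16 is a hypothetical helper name, and it rejects mismatched files instead of resampling them.

```python
import io
import wave


def wav_to_pcm_s16le_16(wav_bytes: bytes) -> bytes:
    """Return the raw sample data of a WAV file after checking it is 16 kHz, mono, 16-bit.

    WAV stores PCM samples little-endian, so the frames can be passed through unchanged.
    This sketch does not resample or downmix; it rejects anything that does not match.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        return wav.readframes(wav.getnframes())
```

The returned bytes can then play the role of pcm_bytes in the conversion call above.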

Output Formats

| Format | Description |
| --- | --- |
| mp3_44100_128 | MP3 44.1kHz 128kbps (default): good for web/apps |
| mp3_44100_192 | MP3 44.1kHz 192kbps (Creator+): higher quality |
| mp3_44100_64 | MP3 44.1kHz 64kbps: smaller files |
| mp3_22050_32 | MP3 22.05kHz 32kbps: smallest MP3 |
| pcm_16000 | Raw PCM 16kHz: real-time pipelines |
| pcm_24000 | Raw PCM 24kHz: good streaming balance |
| pcm_44100 | Raw PCM 44.1kHz (Pro+): CD quality |
| pcm_48000 | Raw PCM 48kHz (Pro+): highest quality |
| ulaw_8000 | μ-law 8kHz: Twilio / telephony |
| alaw_8000 | A-law 8kHz: telephony |
| opus_48000_64 | Opus 48kHz 64kbps: efficient streaming |
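Raw PCM output has no header, so most players cannot open it directly. The sketch below wraps it in a WAV container using the sample rate encoded in the format name. It assumes the pcm_* formats carry 16-bit signed little-endian mono samples, which the table above does not state, and pcm_to_wav is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, output_format: str) -> bytes:
    """Wrap raw PCM audio in a WAV container, reading the sample rate from the format name.

    Assumes 16-bit signed little-endian mono samples; that layout is an assumption
    of this sketch rather than something the format table confirms.
    """
    codec, _, rate = output_format.partition("_")
    if codec != "pcm" or not rate.isdigit():
        raise ValueError(f"not a raw PCM output format: {output_format!r}")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono (assumed)
        wav.setsampwidth(2)          # 16-bit samples (assumed)
        wav.setframerate(int(rate))  # e.g. 24000 for pcm_24000
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()
```

To use it, join the streamed chunks into a single bytes object first, then write the result of pcm_to_wav to a .wav file.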

Deterministic Output

Pass a seed to make repeated conversions of the same input return (best-effort) identical audio, which is useful for testing and A/B comparisons.
python
audio_stream = client.speech_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    audio=audio_file,
    model_id="eleven_multilingual_sts_v2",
    seed=12345,
)
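One way to check whether two seeded runs really matched is to hash the returned bytes. This is a minimal sketch; audio_fingerprint is a hypothetical helper name, and it works on any iterable of byte chunks.

```python
import hashlib


def audio_fingerprint(chunks) -> str:
    """Return the SHA-256 hex digest of an iterable of audio byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Iterating consumes the stream, so collect the chunks into a list first if you also want to save the audio. Because seeding is best-effort, a mismatch between two fingerprints is a prompt to listen and compare rather than proof of a bug.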

Input Audio Best Practices

The conversion quality is bounded by the input recording — the model can only swap the timbre, not rescue a bad source. A few practical rules:
  • Be expressive. Whisper, shout, laugh, cry — the model preserves all of it. Flat input gives you flat output.
  • Watch microphone gain. Too quiet and the model under-detects phonemes; too loud and clipping bleeds into the conversion. Aim for healthy peaks, no clipping.
  • Accent and cadence transfer from the source, not the target. If you read in an American accent and target the British "George" voice, you get George's timbre with an American accent. To dub into a different accent or language, record someone speaking in that target accent/language and convert into a cloned/library voice.
  • Clean up noise first. Either pass remove_background_noise=True or run the source through the voice-isolator skill before conversion. Noise hurts more here than in TTS.
  • Split long recordings. Anything over 5 minutes must be chunked. Cut at natural pauses, convert each piece, and concatenate the resulting audio.
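The gain advice above can be checked mechanically before uploading. The following sketch measures the peak level of a 16-bit WAV with the standard library. The helper names and both thresholds (0.1 and 0.99 of full scale) are illustrative assumptions, not official guidance.

```python
import array
import io
import sys
import wave


def peak_level(wav_bytes: bytes) -> float:
    """Return the peak sample level of a 16-bit WAV as a fraction of full scale (0.0 to 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit WAV files")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768


def describe_gain(peak: float, quiet_below: float = 0.1, clip_at: float = 0.99) -> str:
    """Classify a peak level; both thresholds are illustrative, not official guidance."""
    if peak >= clip_at:
        return "likely clipping"
    if peak < quiet_below:
        return "too quiet"
    return "ok"
```

A peak check is a coarse signal: it catches hard clipping and very quiet takes, but it says nothing about background noise or overall loudness.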

Common Workflows

  • Re-voice a narration: keep the performance of a scratch recording, swap in a different narrator voice.
  • Localize / dub: record the voice-over in the target language, then convert it into the original speaker's cloned voice (using eleven_multilingual_sts_v2).
  • Create character voices: act out a line yourself, convert into a distinctive character voice for games or animation.
  • Anonymize a speaker: replace a recognizable voice with a neutral pre-made voice while preserving what was said and how.
  • Pair with voice-isolator: isolate the source voice first (or set remove_background_noise=True) for noisy field recordings before conversion.
  • Pair with voice cloning: clone a target voice from a short sample, then use its voice_id here as the conversion target.

Error Handling

python
try:
    audio_stream = client.speech_to_speech.convert(
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        audio=audio_file,
        model_id="eleven_multilingual_sts_v2",
    )
except Exception as e:
    print(f"Voice changer failed: {e}")
Common errors:
  • 401: Invalid API key
  • 422: Invalid parameters (check voice_id, model_id, or file_format vs the supplied audio)
  • 429: Rate limit exceeded
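Of the errors above, only 429 is worth retrying; 401 and 422 will fail the same way every time. The sketch below retries with exponential backoff. It assumes the raised exception exposes the HTTP status as a status_code attribute, and call_with_retry is a hypothetical helper, not part of the SDK.

```python
import time


def call_with_retry(make_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying with exponential backoff on HTTP 429.

    Assumes the raised exception carries the HTTP status in a status_code attribute;
    any other error, including 401 and 422, is re-raised immediately.
    """
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as error:
            status = getattr(error, "status_code", None)
            if status != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the default base_delay
```

Depending on the SDK version, the request may only be sent once the returned stream is iterated, so make_request should consume the audio itself, for example by returning b"".join(...) over the convert call. If the source is an open file, rewind it with seek(0) at the start of each attempt.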

References

  • Installation Guide