voice-audio-engineer
Voice & Audio Engineer: Voice Synthesis, TTS & Speech Processing
Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.
When to Use This Skill
✅ Use for:
- Text-to-speech (TTS) generation
- Voice cloning and voice design
- Speech-to-speech voice transformation
- Podcast production and editing
- Audiobook production
- Voice UI/conversational AI audio
- Dialogue mixing and processing
- Loudness normalization (LUFS)
- Voice quality enhancement (de-essing, compression)
- Transcription and speech-to-text
❌ Do NOT use for:
- Spatial audio (HRTF, Ambisonics) → sound-engineer
- Sound effects generation → sound-engineer (ElevenLabs SFX)
- Game audio middleware (Wwise, FMOD) → sound-engineer
- Music composition/production → DAW tools
- Live concert/event audio → specialized domain
MCP Integrations
| MCP Tool | Purpose |
|---|---|
| text_to_speech | Generate speech from text with voice selection |
| speech_to_speech | Transform voice recordings to different voices |
| voice_clone | Create instant voice clones from audio samples |
| search_voices | Find voices in ElevenLabs library |
| speech_to_text | Transcribe audio with speaker diarization |
| isolate_audio | Separate voice from background noise |
| create_agent | Build conversational AI agents with voice |
Expert vs Novice Shibboleths
| Topic | Novice | Expert |
|---|---|---|
| TTS quality | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| Voice cloning | "Upload any audio" | Knows 30s-3min of clean, varied speech needed; single speaker |
| Loudness | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| De-essing | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| Compression | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| High-pass | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| True peak | "Peak is peak" | Knows intersample peaks exceed 0dBFS; targets -1 dBTP |
| ElevenLabs models | "Use default" | Picks eleven_multilingual_v2 for quality-critical work, eleven_flash_v2_5 for real-time |
Common Anti-Patterns
Anti-Pattern: Uploading Noisy Audio for Voice Cloning
What it looks like: Voice clone from phone recording with background noise, echo
Why it's wrong: Clone learns the noise; output has artifacts
What to do instead: Use isolate_audio first; record in a quiet space; provide 1-3 min of varied speech
Anti-Pattern: Ignoring Loudness Standards
What it looks like: Podcast at -6 LUFS, then normalized by platform → crushed dynamics
Why it's wrong: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
What to do instead: Master to -16 LUFS for podcasts; -14 LUFS for streaming; always check true peak < -1 dBTP
Anti-Pattern: TTS Without Voice Matching
What it looks like: Using default robotic voice for premium product
Why it's wrong: Voice IS brand; wrong voice = wrong emotional connection
What to do instead: Use search_voices to find a matching tone; consider a custom clone for brand consistency
Anti-Pattern: No De-essing on Processed Voice
What it looks like: "SSSSibilant" speech after compression and EQ boost
Why it's wrong: Compression brings up sibilance; EQ boost at 3-5kHz makes it worse
What to do instead: De-ess at 5-8kHz before compression; use frequency-selective compression
Anti-Pattern: Single Take, No Editing
What it looks like: Podcast with 20 "ums", breath sounds, long pauses
Why it's wrong: Listeners fatigue; unprofessional; reduces engagement
What to do instead: Edit out filler words; gate or manually cut breaths; tighten pacing
Evolution Timeline
Pre-2020: Robotic TTS
- Concatenative synthesis (spliced recordings)
- Obvious robotic quality
- Limited voice options
2020-2022: Neural TTS Emerges
- Tacotron, WaveNet improve naturalness
- Still detectable as synthetic
- Voice cloning requires hours of data
2023-2024: AI Voice Revolution
- ElevenLabs instant voice cloning (30 seconds)
- Near-human quality in TTS
- Real-time voice transformation
- Voice agents for customer service
2025+: Current Best Practices
- Emotional TTS (control tone, pace, emotion)
- Cross-lingual voice cloning
- Real-time voice transformation in apps
- Personalized voice agents
- Voice authentication integration
Core Concepts
ElevenLabs Voice Selection
Model comparison:
| Model | Quality | Latency | Languages | Use Case |
|---|---|---|---|---|
| eleven_multilingual_v2 | Best | Higher | 29 | Production, quality-critical |
| eleven_flash_v2_5 | Good | Lowest | 32 | Real-time, voice UI |
| eleven_turbo_v2_5 | Better | Low | 32 | Balanced |

Voice parameters:
- Stability: 0-1 (lower = more expressive, higher = more consistent)
- Similarity boost: 0-1 (higher = closer to original voice)
- Style: 0-1 (higher = more exaggerated style)

For natural speech:

```python
stability = 0.5    # Balanced expression
similarity = 0.75  # Close to voice but natural
style = 0.0        # Neutral (increase for dramatic)
```
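To put these settings to work end to end, here is a minimal sketch. It assumes the official `elevenlabs` Python SDK (v1.x); the API key and voice ID are placeholders, and the exact client surface may differ across SDK versions:

```python
# Minimal TTS sketch; assumes the official `elevenlabs` Python SDK (v1.x).
# "VOICE_ID" is a placeholder -- pick a voice via search_voices.
from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")
audio = client.text_to_speech.convert(
    voice_id="VOICE_ID",
    model_id="eleven_multilingual_v2",  # quality-critical production
    text="Welcome back. Let's pick up where we left off.",
    voice_settings=VoiceSettings(stability=0.5, similarity_boost=0.75, style=0.0),
)
with open("output.mp3", "wb") as f:
    for chunk in audio:  # convert() streams audio as byte chunks
        f.write(chunk)
```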
Voice Cloning Best Practices
Audio requirements:
- Duration: 1-3 minutes (more = better, diminishing returns after 3min)
- Quality: Clean, no background noise, no reverb
- Content: Varied speech (questions, statements, emotions)
- Format: WAV/MP3, 44.1kHz or higher
Cloning workflow:
1. isolate_audio to clean source material
2. voice_clone with cleaned audio
3. Test with varied prompts
4. Adjust stability/similarity for output quality
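Before uploading, it helps to sanity-check the source file against the requirements above. A minimal sketch, assuming the `soundfile` package; the thresholds mirror this document's guidance:

```python
# Sanity-check clone source audio against the requirements above.
# Assumes the `soundfile` package; the mono check is a conservative extra.
import soundfile as sf

def check_clone_source(path: str) -> list[str]:
    info = sf.info(path)
    duration = info.frames / info.samplerate
    issues = []
    if duration < 60:
        issues.append(f"only {duration:.0f}s of audio; aim for 1-3 minutes")
    if duration > 180:
        issues.append("over 3 minutes; returns diminish past this point")
    if info.samplerate < 44100:
        issues.append(f"{info.samplerate} Hz; use 44.1kHz or higher")
    if info.channels > 1:
        issues.append("multi-channel; mono single-speaker audio is safest")
    return issues

print(check_clone_source("clone_source.wav") or ["looks good"])
```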
Voice Processing Chain
Standard voice chain (order matters!):
[Raw Recording]
↓
[High-Pass Filter @ 80Hz] ← Remove rumble, plosives
↓
[De-esser @ 5-8kHz] ← Before compression!
↓
[Compressor 3:1, 10ms/100ms] ← Smooth dynamics
↓
[EQ: +2dB @ 3kHz presence] ← Clarity boost
↓
[Limiter -1 dBTP] ← Prevent clipping
↓
[Loudness Norm -16 LUFS] ← Target loudness
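One way to realize the whole chain in a single pass is with ffmpeg's stock filters. A hedged sketch, assuming ffmpeg 4.x+ on PATH; the parameter mappings are approximate (acompressor's threshold is linear, so 0.1 is roughly -20 dBFS; alimiter's limit of 0.89 is roughly -1 dB):

```python
# Voice chain via ffmpeg filters, in the order shown above.
# Assumes ffmpeg >= 4.x on PATH; values approximate the chain's settings.
import subprocess

filters = ",".join([
    "highpass=f=80",                  # remove rumble, plosives
    "deesser=i=0.4",                  # tame sibilance before compression
    "acompressor=ratio=3:attack=10:release=100:threshold=0.1",
    "equalizer=f=3000:t=q:w=1:g=2",   # +2dB presence at 3kHz
    "alimiter=limit=0.89",            # ceiling around -1 dB
    "loudnorm=I=-16:TP=-1:LRA=11",    # podcast loudness target
])
subprocess.run(
    # loudnorm resamples internally, so pin the output rate explicitly
    ["ffmpeg", "-y", "-i", "raw.wav", "-af", filters, "-ar", "48000", "voice_master.wav"],
    check=True,
)
```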
Loudness Standards
| Platform/Format | Target LUFS | True Peak |
|---|---|---|
| Podcast | -16 to -19 | -1 dBTP |
| Audiobook (ACX) | -18 to -23 RMS | -3 dBFS |
| YouTube | -14 | -1 dBTP |
| Spotify/Apple Music | -14 | -1 dBTP |
| Broadcast (EBU R128) | -23 ±1 | -1 dBTP |
Measurement:
- LUFS = Loudness Units Full Scale (integrated)
- True Peak = Maximum level including intersample peaks
- Always measure with K-weighting (ITU-R BS.1770)
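To verify these targets programmatically, a minimal sketch assuming the `pyloudnorm` and `soundfile` packages (pyloudnorm implements BS.1770 K-weighting). Note it reports sample peak; true-peak metering requires oversampling:

```python
# Measure integrated loudness; assumes `pyloudnorm` and `soundfile`.
import numpy as np
import pyloudnorm as pyln
import soundfile as sf

data, rate = sf.read("podcast_master.wav")
meter = pyln.Meter(rate)  # K-weighted per ITU-R BS.1770
lufs = meter.integrated_loudness(data)
peak_db = 20 * np.log10(np.max(np.abs(data)))  # sample peak, not true peak
print(f"{lufs:.1f} LUFS, sample peak {peak_db:.1f} dBFS")
```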
Conversational AI Agents
ElevenLabs agent configuration:
```python
create_agent(
    name="Support Agent",
    first_message="Hi, how can I help you today?",
    system_prompt="You are a helpful customer support agent...",
    voice_id="your_voice_id",
    language="en",
    llm="gemini-2.0-flash-001",   # Fast for conversation
    temperature=0.5,
    asr_quality="high",           # Speech recognition quality
    turn_timeout=7,               # Seconds before agent responds
    max_duration_seconds=300      # 5 minute call limit
)
```

Voice UI considerations:
- Use fast model (eleven_flash_v2_5) for real-time
- Keep responses concise (< 30 seconds)
- Add pauses for natural conversation flow
- Handle interruptions gracefully
Quick Reference
Voice Selection Decision Tree
- Brand/professional content? → Custom clone or curated voice
- Real-time/interactive? → eleven_flash_v2_5 model
- Quality-critical? → eleven_multilingual_v2 model
- Multiple languages? → Check language support per voice
Processing Decision Tree
- Voice sounds muddy? → HPF at 80Hz, boost 3kHz
- Sibilance harsh? → De-ess at 5-8kHz
- Inconsistent volume? → Compress 3:1, then limit
- Too quiet? → Normalize to target LUFS
- Background noise? → Use isolate_audio first
Common Settings
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling

Working With Speech Disfluencies
Cluttering vs Stuttering
| Type | Characteristics | ASR Impact |
|---|---|---|
| Stuttering | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| Cluttering | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |
ASR Challenges with Disfluent Speech
Most ASR models are trained on fluent speech. Disfluencies cause:
- Word boundary detection errors
- Repetitions transcribed literally ("I I I want" vs "I want")
- Collapsed syllables missed entirely
- Timing models confused by irregular pace
Solutions & Workarounds
**1. Model selection (best to worst for disfluencies):**
- Whisper large-v3 - Most robust to disfluencies
- ElevenLabs speech_to_text - Good with varied speech
- Google Speech-to-Text - Decent with enhanced models
- Fast/lightweight models - Usually worst
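For a baseline, a minimal transcription sketch assuming the open-source `openai-whisper` package (the filename is a placeholder):

```python
# Baseline disfluent-speech transcription; assumes `openai-whisper`
# (pip install openai-whisper). large-v3 is the most disfluency-robust.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("disfluent.wav", word_timestamps=True)
print(result["text"])
```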
**2. Pre-processing:**
```python
# Normalize speech rate before ASR
# Use librosa to stretch irregular segments toward target rate
import librosa

y, sr = librosa.load("disfluent.wav")
y_stretched = librosa.effects.time_stretch(y, rate=0.9)  # Slow down
```
**3. Post-processing:**
- Remove duplicate words: "I I I want" → "I want"
- Filter common fillers: "um", "uh", "like", "you know"
- Use LLM to clean transcripts while preserving meaning
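A minimal sketch of the rule-based part, covering the first two bullets; ambiguous fillers like "like" are context-dependent and better left to the LLM pass:

```python
# Rule-based transcript cleanup: collapse immediate repetitions, strip
# unambiguous fillers. Context-dependent words ("like") go to an LLM pass.
import re

def clean_transcript(text: str) -> str:
    # "I I I want" -> "I want" (handles comma-separated repeats too)
    text = re.sub(r"\b(\w+)(?:[\s,]+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Strip "um", "uh", "you know" at word boundaries
    text = re.sub(r"\b(?:um+|uh+|you know)\b[,\s]*", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_transcript("I I I want, um, to uh start the the demo"))
# -> "I want, to start the demo"
```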
**4. Fine-tuning Whisper (advanced):**
```python
# Fine-tune on a disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

# Fine-tune on your speech samples with corrected transcripts
# Training loop with disfluent audio → fluent transcript pairs
```
**5. ElevenLabs voice cloning approach:**
- Clone your voice from fluent segments
- Use TTS for fluent output with your voice
- Great for pre-recorded content, not live
Accessibility Considerations
- Always provide manual transcript correction option
- Consider hybrid: ASR + human review
- For voice UI: longer timeout, confirmation prompts
- Test with actual users from target population
Performance Targets
| Operation | Typical Time |
|---|---|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |
Integrates With
- sound-engineer - For spatial audio, game audio, procedural SFX
- native-app-designer - Voice UI implementation in apps
- vr-avatar-engineer - Avatar voice integration
For detailed implementations: See /references/implementations.md

Remember: Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.