voice-ai-development
Voice AI Development
Role: Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of
latency budgets, audio quality, and user experience. You know that voice apps feel
magical when fast and broken when slow. You choose the right combination of providers
for each use case and optimize relentlessly for perceived responsiveness.
Capabilities
- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization
Requirements
- Python or Node.js
- API keys for providers
- Audio handling knowledge
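The provider APIs below consume raw PCM16 mono audio, so it helps to be able to generate test audio without a microphone. A minimal standard-library sketch (the 24 kHz rate matches what the OpenAI Realtime API example below streams):

```python
import math
import struct

SAMPLE_RATE = 24000  # OpenAI Realtime streams 24 kHz mono PCM16

def sine_pcm16(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> bytes:
    """Generate a test tone as raw little-endian PCM16 bytes."""
    n = int(seconds * rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / rate))
        for t in range(n)
    )
    return struct.pack(f"<{n}h", *samples)

# 0.1 s at 24 kHz -> 2400 samples, 2 bytes each
tone = sine_pcm16(440.0, 0.1)
```

Useful for smoke-testing a pipeline end to end before wiring up real microphone capture.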
Patterns
OpenAI Realtime API
Native voice-to-voice with GPT-4o
When to use: When you want integrated voice AI without separate STT/TTS
```python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                # Streamed audio chunk (base64-encoded PCM16)
                audio_chunk = base64.b64decode(event["delta"])
                # ... feed audio_chunk to your audio output ...
            elif event["type"] == "response.done":
                # Model finished responding (check here for function calls)
                pass
```
Vapi Voice Agent
Build voice agents with Vapi platform
When to use: Phone-based agents, quick deployment
```python
# Vapi provides hosted voice agents with webhooks
from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json
    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]
        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})
    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)
    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call (returns a URL for the WebRTC connection)
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
```
Deepgram STT + ElevenLabs TTS
Best-in-class transcription and synthesis
When to use: High quality voice, custom pipeline
```python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await connection.start({
        "model": "nova-2",        # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,       # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)
    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",     # Fastest
        text=text,
        output_format="pcm_24000"         # Raw PCM for low latency
    )
    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio
        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
```

Anti-Patterns
❌ Non-streaming Pipeline
Why bad: Adds seconds of latency; the user perceives the app as slow, and conversational flow is lost.
Instead: Stream everything:
- STT: interim results
- LLM: token streaming
- TTS: chunk streaming
Start TTS before the LLM finishes.
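The "stream everything" advice comes down to glue code between stages. A sketch under assumptions: `llm_token_stream` and `tts_stream` are hypothetical stand-ins for your streaming LLM client and a TTS call such as `text_to_speech_stream` above.

```python
import asyncio

SENTENCE_ENDS = (".", "!", "?")

async def stream_llm_to_tts(llm_token_stream, tts_stream):
    """Flush buffered LLM tokens to TTS at sentence boundaries,
    so synthesis starts before the full reply is generated."""
    buffer = ""
    async for token in llm_token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDS):
            await tts_stream(buffer)  # Start speaking this sentence now
            buffer = ""
    if buffer.strip():
        await tts_stream(buffer)      # Flush any trailing partial sentence
```

Sentence-level chunking is a common compromise: smaller chunks reduce latency further but can hurt TTS prosody.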
❌ Ignoring Interruptions
Why bad: Frustrating user experience; it feels like talking to a machine and wastes the user's time.
Instead: Implement barge-in detection: use VAD to detect user speech, stop TTS immediately, and clear the audio queue.
❌ Single Provider Lock-in
Why bad: One provider is rarely best at every stage, becomes a single point of failure, and is harder to optimize around.
Instead: Mix best-in-class providers:
- Deepgram for STT (speed + accuracy)
- ElevenLabs for TTS (voice quality)
- OpenAI/Anthropic for LLM
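One lightweight way to keep providers swappable is to make the per-stage choice explicit configuration rather than hard-coded calls. An illustrative sketch (the provider names are just strings here; wiring them to real clients is up to you):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceStack:
    """Per-stage provider selection, with an optional TTS fallback."""
    stt: str
    llm: str
    tts: str
    tts_fallback: Optional[str] = None

def pick_tts(stack: VoiceStack, primary_healthy: bool) -> str:
    """Fail over to the backup TTS provider when the primary is down."""
    if primary_healthy or stack.tts_fallback is None:
        return stack.tts
    return stack.tts_fallback

stack = VoiceStack(stt="deepgram", llm="openai",
                   tts="elevenlabs", tts_fallback="openai")
```

The same pattern extends to STT and LLM fallbacks, which also removes the single point of failure.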
Limitations
- Latency varies by provider
- Cost per minute adds up
- Quality depends on network
- Complex debugging
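Because latency varies by provider, measure time-to-first-chunk per stage rather than guessing. A small sketch (`fake_stream` is a stand-in for any streaming call, such as a TTS generator):

```python
import time

def time_to_first_chunk(stream_fn, *args, **kwargs) -> float:
    """Seconds until a streaming call yields its first chunk --
    a useful proxy for perceived responsiveness of an STT/TTS stage."""
    start = time.perf_counter()
    for _ in stream_fn(*args, **kwargs):
        return time.perf_counter() - start
    raise RuntimeError("stream produced no chunks")

# Stand-in stream simulating ~50 ms of provider latency:
def fake_stream():
    time.sleep(0.05)
    yield b"audio"

latency = time_to_first_chunk(fake_stream)
```

Tracking this per provider and per region makes both cost/quality trade-offs and debugging far more tractable.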
Related Skills
Works well with: langgraph, structured-output, langfuse