voice-ai-development


Voice AI Development


Role: Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
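A latency budget can be made concrete with simple arithmetic. The component numbers below are illustrative assumptions, not measured provider benchmarks:

```python
# Illustrative latency budget for one conversational turn.
# All numbers are rough assumptions, not measured provider benchmarks.
BUDGET_MS = 800  # rough "feels responsive" target for turn-around

components_ms = {
    "vad_endpointing": 300,   # silence wait before declaring end of turn
    "stt_final": 150,         # final transcript after the endpoint fires
    "llm_first_token": 250,   # time to first streamed token
    "tts_first_audio": 150,   # time to first synthesized audio chunk
}

total = sum(components_ms.values())
print(f"total {total} ms vs budget {BUDGET_MS} ms (over by {total - BUDGET_MS} ms)")
```

When the total blows the budget, the VAD silence window is usually the first lever, since it dominates turn-around time.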

Capabilities


  • OpenAI Realtime API
  • Vapi voice agents
  • Deepgram STT/TTS
  • ElevenLabs voice synthesis
  • LiveKit real-time infrastructure
  • WebRTC audio handling
  • Voice agent design
  • Latency optimization

Requirements


  • Python or Node.js
  • API keys for providers
  • Audio handling knowledge

Patterns


OpenAI Realtime API


Native voice-to-voice with GPT-4o
When to use: When you want integrated voice AI without separate STT/TTS
python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Base64-encoded PCM16 audio chunk from the model
                audio_chunk = base64.b64decode(event["delta"])
                play_audio(audio_chunk)  # your audio output sink
            elif event["type"] == "response.done":
                print("Response complete")
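The session config above declares a `get_weather` tool, but the receive loop never answers tool calls. A minimal sketch of building the reply message, assuming the Realtime API's `response.function_call_arguments.done` event shape (`call_id`, `name`, and `arguments` as a JSON string); the `handlers` mapping is a hypothetical helper:

```python
import json

def build_tool_result(event: dict, handlers: dict) -> dict:
    """Build the conversation.item.create message answering a tool call."""
    args = json.loads(event["arguments"])      # arguments arrive as a JSON string
    output = handlers[event["name"]](**args)   # run the matching handler
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(output),
        },
    }

# Hypothetical handler for the get_weather tool declared in the session
handlers = {"get_weather": lambda location: {"location": location, "temp_c": 21}}

msg = build_tool_result(
    {"name": "get_weather", "call_id": "call_1",
     "arguments": '{"location": "Paris"}'},
    handlers,
)
# In the session: await ws.send(json.dumps(msg)), then send
# {"type": "response.create"} so the model speaks the result.
```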

Vapi Voice Agent


Build voice agents with the Vapi platform
When to use: Phone-based agents, quick deployment
python
# Vapi provides hosted voice agents with webhooks
from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={"number": "+1234567890"},
    phoneNumber={"twilioPhoneNumber": "+0987654321"}
)

# Or create a web call - returns a URL for the WebRTC connection
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)

Deepgram STT + ElevenLabs TTS


Best-in-class transcription and synthesis
When to use: High quality voice, custom pipeline
python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )
    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio
        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
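"Optimize relentlessly" starts with measuring. A sketch of a per-turn latency tracker; the stage names are illustrative, not part of any provider SDK:

```python
import time

class TurnLatency:
    """Wall-clock marks for one conversational turn."""
    def __init__(self):
        self.marks = {}

    def mark(self, stage, t=None):
        self.marks[stage] = time.monotonic() if t is None else t

    def report(self):
        """Milliseconds from end of user speech to each later stage."""
        t0 = self.marks["user_speech_end"]
        return {s: round((t - t0) * 1000)
                for s, t in self.marks.items() if s != "user_speech_end"}

# Fake timestamps for illustration (seconds)
turn = TurnLatency()
turn.mark("user_speech_end", 10.00)
turn.mark("stt_final", 10.12)        # e.g. Deepgram is_final transcript
turn.mark("llm_first_token", 10.38)
turn.mark("tts_first_audio", 10.52)  # what the user actually perceives
print(turn.report())
# {'stt_final': 120, 'llm_first_token': 380, 'tts_first_audio': 520}
```

Tracking time-to-first-audio per turn shows which stage to optimize first, and whether a provider swap actually helped.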

Anti-Patterns


❌ Non-streaming Pipeline


Why bad: Adds seconds of latency. User perceives as slow. Loses conversation flow.
Instead: Stream everything:
  • STT: interim results
  • LLM: token streaming
  • TTS: chunk streaming
Start TTS before the LLM finishes.
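One way to start TTS before the LLM finishes is to cut the token stream at sentence boundaries and hand each completed sentence to TTS immediately. A minimal sketch; the splitting rule is deliberately naive (real pipelines also handle abbreviations, numbers, and ellipses):

```python
def sentence_chunks(tokens):
    """Yield complete sentences from an LLM token stream as they close."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush every finished sentence so TTS can start right away
        while any(p in buf for p in ".!?"):
            cut = min(i for i in (buf.find(p) for p in ".!?") if i != -1) + 1
            yield buf[:cut].strip()
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()  # trailing partial sentence

tokens = ["Hel", "lo the", "re! I can ", "help. Wha", "t do you need"]
print(list(sentence_chunks(tokens)))
# ['Hello there!', 'I can help.', 'What do you need']
```

Each yielded sentence can go straight into a streaming TTS call, so synthesis of the first sentence overlaps generation of the rest.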

❌ Ignoring Interruptions


Why bad: Frustrating user experience. Feels like talking to a machine. Wastes time.
Instead: Implement barge-in detection. Use VAD to detect user speech. Stop TTS immediately. Clear audio queue.
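The barge-in rule can be sketched as a small controller; `on_vad` stands in for whatever per-frame VAD decision your stack provides (a hypothetical hook, not a specific API):

```python
from collections import deque

class BargeInController:
    """Stop agent playback the moment the user starts speaking."""
    def __init__(self):
        self.playback_queue = deque()  # pending TTS audio chunks
        self.agent_speaking = False

    def enqueue_tts(self, chunk):
        self.playback_queue.append(chunk)
        self.agent_speaking = True

    def on_vad(self, vad_is_speech):
        """Call once per audio frame with that frame's VAD decision."""
        if vad_is_speech and self.agent_speaking:
            # Barge-in: stop TTS output and drop everything queued
            self.playback_queue.clear()
            self.agent_speaking = False

ctl = BargeInController()
ctl.enqueue_tts(b"chunk-1")
ctl.enqueue_tts(b"chunk-2")
ctl.on_vad(vad_is_speech=True)   # user interrupts mid-sentence
print(len(ctl.playback_queue), ctl.agent_speaking)  # 0 False
```

Clearing the queue matters as much as stopping playback: otherwise buffered audio keeps talking over the user after the interrupt.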

❌ Single Provider Lock-in


Why bad: May not be best quality. Single point of failure. Harder to optimize.
Instead: Mix best providers:
  • Deepgram for STT (speed + accuracy)
  • ElevenLabs for TTS (voice quality)
  • OpenAI/Anthropic for LLM

Limitations


  • Latency varies by provider
  • Cost per minute adds up
  • Quality depends on network
  • Complex debugging

Related Skills


Works well with: langgraph, structured-output, langfuse