voice-ai-development


Voice AI Development


Role: Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
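A latency budget can be made concrete with simple arithmetic. The component numbers below are illustrative assumptions, not measured provider benchmarks:

```python
# Illustrative latency budget for one conversational turn.
# All numbers are rough assumptions, not measured provider benchmarks.
BUDGET_MS = 800  # rough "feels responsive" target for turn-around

components_ms = {
    "vad_endpointing": 300,   # silence wait before declaring end of turn
    "stt_final": 150,         # final transcript after the endpoint fires
    "llm_first_token": 250,   # time to first streamed token
    "tts_first_audio": 150,   # time to first synthesized audio chunk
}

total = sum(components_ms.values())
print(f"total {total} ms vs budget {BUDGET_MS} ms (over by {total - BUDGET_MS} ms)")
```

When the total blows the budget, the VAD silence window is usually the first lever, since it dominates turn-around time.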

Capabilities


  • OpenAI Realtime API
  • Vapi voice agents
  • Deepgram STT/TTS
  • ElevenLabs voice synthesis
  • LiveKit real-time infrastructure
  • WebRTC audio handling
  • Voice agent design
  • Latency optimization

Requirements


  • Python or Node.js
  • API keys for providers
  • Audio handling knowledge

Patterns


OpenAI Realtime API


Native voice-to-voice with GPT-4o
When to use: When you want integrated voice AI without separate STT/TTS
python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Base64-encoded PCM16 audio chunk from the model
                audio_chunk = base64.b64decode(event["delta"])
                play_audio(audio_chunk)  # your audio output sink
            elif event["type"] == "response.done":
                print("Response complete")
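The session config above declares a `get_weather` tool, but the receive loop never answers tool calls. A minimal sketch of building the reply message, assuming the Realtime API's `response.function_call_arguments.done` event shape (`call_id`, `name`, and `arguments` as a JSON string); the `handlers` mapping is a hypothetical helper:

```python
import json

def build_tool_result(event: dict, handlers: dict) -> dict:
    """Build the conversation.item.create message answering a tool call."""
    args = json.loads(event["arguments"])      # arguments arrive as a JSON string
    output = handlers[event["name"]](**args)   # run the matching handler
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(output),
        },
    }

# Hypothetical handler for the get_weather tool declared in the session
handlers = {"get_weather": lambda location: {"location": location, "temp_c": 21}}

msg = build_tool_result(
    {"name": "get_weather", "call_id": "call_1",
     "arguments": '{"location": "Paris"}'},
    handlers,
)
# In the session: await ws.send(json.dumps(msg)), then send
# {"type": "response.create"} so the model speaks the result.
```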

Vapi Voice Agent


Build voice agents with the Vapi platform
When to use: Phone-based agents, quick deployment
python
# Vapi provides hosted voice agents with webhooks
from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={"number": "+1234567890"},
    phoneNumber={"twilioPhoneNumber": "+0987654321"}
)

# Or create a web call - returns a URL for the WebRTC connection
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)

Deepgram STT + ElevenLabs TTS


Best-in-class transcription and synthesis
When to use: High quality voice, custom pipeline
python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )
    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio
        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
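"Optimize relentlessly" starts with measuring. A sketch of a per-turn latency tracker; the stage names are illustrative, not part of any provider SDK:

```python
import time

class TurnLatency:
    """Wall-clock marks for one conversational turn."""
    def __init__(self):
        self.marks = {}

    def mark(self, stage, t=None):
        self.marks[stage] = time.monotonic() if t is None else t

    def report(self):
        """Milliseconds from end of user speech to each later stage."""
        t0 = self.marks["user_speech_end"]
        return {s: round((t - t0) * 1000)
                for s, t in self.marks.items() if s != "user_speech_end"}

# Fake timestamps for illustration (seconds)
turn = TurnLatency()
turn.mark("user_speech_end", 10.00)
turn.mark("stt_final", 10.12)        # e.g. Deepgram is_final transcript
turn.mark("llm_first_token", 10.38)
turn.mark("tts_first_audio", 10.52)  # what the user actually perceives
print(turn.report())
# {'stt_final': 120, 'llm_first_token': 380, 'tts_first_audio': 520}
```

Tracking time-to-first-audio per turn shows which stage to optimize first, and whether a provider swap actually helped.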

Anti-Patterns


❌ Non-streaming Pipeline


Why bad: Adds seconds of latency. User perceives as slow. Loses conversation flow.
Instead: Stream everything:
  • STT: interim results
  • LLM: token streaming
  • TTS: chunk streaming
Start TTS before the LLM finishes.
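One way to start TTS before the LLM finishes is to cut the token stream at sentence boundaries and hand each completed sentence to TTS immediately. A minimal sketch; the splitting rule is deliberately naive (real pipelines also handle abbreviations, numbers, and ellipses):

```python
def sentence_chunks(tokens):
    """Yield complete sentences from an LLM token stream as they close."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush every finished sentence so TTS can start right away
        while any(p in buf for p in ".!?"):
            cut = min(i for i in (buf.find(p) for p in ".!?") if i != -1) + 1
            yield buf[:cut].strip()
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()  # trailing partial sentence

tokens = ["Hel", "lo the", "re! I can ", "help. Wha", "t do you need"]
print(list(sentence_chunks(tokens)))
# ['Hello there!', 'I can help.', 'What do you need']
```

Each yielded sentence can go straight into a streaming TTS call, so synthesis of the first sentence overlaps generation of the rest.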

❌ Ignoring Interruptions


Why bad: Frustrating user experience. Feels like talking to a machine. Wastes time.
Instead: Implement barge-in detection. Use VAD to detect user speech. Stop TTS immediately. Clear audio queue.
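The barge-in rule can be sketched as a small controller; `on_vad` stands in for whatever per-frame VAD decision your stack provides (a hypothetical hook, not a specific API):

```python
from collections import deque

class BargeInController:
    """Stop agent playback the moment the user starts speaking."""
    def __init__(self):
        self.playback_queue = deque()  # pending TTS audio chunks
        self.agent_speaking = False

    def enqueue_tts(self, chunk):
        self.playback_queue.append(chunk)
        self.agent_speaking = True

    def on_vad(self, vad_is_speech):
        """Call once per audio frame with that frame's VAD decision."""
        if vad_is_speech and self.agent_speaking:
            # Barge-in: stop TTS output and drop everything queued
            self.playback_queue.clear()
            self.agent_speaking = False

ctl = BargeInController()
ctl.enqueue_tts(b"chunk-1")
ctl.enqueue_tts(b"chunk-2")
ctl.on_vad(vad_is_speech=True)   # user interrupts mid-sentence
print(len(ctl.playback_queue), ctl.agent_speaking)  # 0 False
```

Clearing the queue matters as much as stopping playback: otherwise buffered audio keeps talking over the user after the interrupt.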

❌ Single Provider Lock-in


Why bad: May not be best quality. Single point of failure. Harder to optimize.
Instead: Mix best providers:
  • Deepgram for STT (speed + accuracy)
  • ElevenLabs for TTS (voice quality)
  • OpenAI/Anthropic for LLM

Limitations


  • Latency varies by provider
  • Cost per minute adds up
  • Quality depends on network
  • Complex debugging

Related Skills


Works well with: langgraph, structured-output, langfuse