voice-ai

<objective> Build production voice AI agents with sub-500ms latency:
  1. STT - Deepgram Nova-3 streaming transcription (~150ms)
  2. LLM - Groq llama-3.1-8b-instant for fastest inference (~220ms)
  3. TTS - Cartesia Sonic for ultra-realistic voice (~90ms)
  4. Telephony - Twilio Media Streams for real-time bidirectional audio
CRITICAL: NO OPENAI - never use `from openai import OpenAI`
Key deliverables:
  • Streaming STT with voice activity detection
  • Low-latency LLM responses optimized for voice
  • Expressive TTS with emotion controls
  • Twilio Media Streams WebSocket handler </objective>
<quick_start> Minimal Voice Pipeline (~50 lines, <500ms):

```python
import os
import asyncio
from groq import AsyncGroq
from deepgram import AsyncDeepgramClient
from cartesia import AsyncCartesia

# NEVER: from openai import OpenAI

async def voice_pipeline(user_audio: bytes) -> bytes:
    """Process audio input, return audio response."""
    # 1. STT: Deepgram Nova-3 (~150ms)
    dg = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))
    result = await dg.listen.rest.v1.transcribe(
        {"buffer": user_audio, "mimetype": "audio/wav"},
        {"model": "nova-3", "language": "en-US"}
    )
    user_text = result.results.channels[0].alternatives[0].transcript

    # 2. LLM: Groq (~220ms) - NOT OpenAI
    groq = AsyncGroq(api_key=os.getenv("GROQ_API_KEY"))
    response = await groq.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": "Keep responses under 2 sentences."},
            {"role": "user", "content": user_text}
        ],
        max_tokens=150
    )
    response_text = response.choices[0].message.content

    # 3. TTS: Cartesia Sonic-2 (~90ms)
    cartesia = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))
    audio_chunks = []
    async for chunk in cartesia.tts.sse(  # async client yields an async iterator
        model_id="sonic-2",
        transcript=response_text,
        voice={"id": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94"},
        output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 8000}
    ):
        if chunk.audio:
            audio_chunks.append(chunk.audio)

    return b"".join(audio_chunks)  # Total: ~460ms
```
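A quick way to verify the ~460ms budget is to time each stage of the pipeline above. A minimal sketch with a hypothetical `timed` helper (not part of any SDK):

```python
import time
from typing import Awaitable, TypeVar

T = TypeVar("T")

async def timed(label: str, coro: Awaitable[T], budget: dict) -> T:
    """Await one pipeline stage and record its wall-clock latency in ms."""
    start = time.perf_counter()
    result = await coro
    budget[label] = (time.perf_counter() - start) * 1000
    return result
```

Each stage would then be wrapped, e.g. `user_text = await timed("stt", stt_call, budget)`, and `sum(budget.values())` checked against the 500ms target.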
</quick_start>

<success_criteria>
A voice AI agent is successful when:
- Total latency is under 500ms (STT + LLM + TTS)
- STT correctly transcribes with utterance end detection
- TTS sounds natural and conversational
- Barge-in (interruption) works smoothly (Enterprise tier)
- Bilingual support handles language switching
</success_criteria>

<optimal_stack>


VozLux-Tested Stack


| Component | Provider | Model | Latency | Notes |
|---|---|---|---|---|
| STT | Deepgram | Nova-3 | ~150ms | Streaming, VAD, utterance detection |
| LLM | Groq | llama-3.1-8b-instant | ~220ms | LPU hardware, fastest inference |
| TTS | Cartesia | Sonic-2 | ~90ms | Streaming, emotions, bilingual |
| TOTAL | - | - | ~460ms | Sub-500ms target achieved |

LLM Priority (Never OpenAI)


```python
LLM_PRIORITY = [
    ("groq", "GROQ_API_KEY", "~220ms"),      # Primary
    ("cerebras", "CEREBRAS_API_KEY", "~200ms"),  # Fallback
    ("anthropic", "ANTHROPIC_API_KEY", "~500ms"),  # Quality fallback
]
```
python
LLM_PRIORITY = [
    ("groq", "GROQ_API_KEY", "~220ms"),      # Primary
    ("cerebras", "CEREBRAS_API_KEY", "~200ms"),  # Fallback
    ("anthropic", "ANTHROPIC_API_KEY", "~500ms"),  # Quality fallback
]

NEVER: from openai import OpenAI
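The priority list can drive provider selection at startup. A minimal sketch, with the hypothetical `pick_llm_provider` helper returning the first provider whose API key is present in the environment:

```python
import os

LLM_PRIORITY = [
    ("groq", "GROQ_API_KEY", "~220ms"),      # Primary
    ("cerebras", "CEREBRAS_API_KEY", "~200ms"),  # Fallback
    ("anthropic", "ANTHROPIC_API_KEY", "~500ms"),  # Quality fallback
]

def pick_llm_provider(priority=LLM_PRIORITY) -> str:
    """Return the first provider whose API key is set in the environment."""
    for name, env_var, _latency in priority:
        if os.getenv(env_var):
            return name
    raise RuntimeError("No LLM provider configured (OpenAI is not an option)")
```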


Tier Architecture


| Tier | Latency | STT | LLM | TTS | Features |
|---|---|---|---|---|---|
| Free | 3000ms | TwiML Gather | Groq | Polly | Basic IVR |
| Pro | 600ms | Deepgram Nova | Groq | Cartesia | Media Streams |
| Enterprise | 400ms | Deepgram + VAD | Groq | Cartesia | Barge-in |
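The tier table can be mirrored as a config lookup so the pipeline branches on one setting. An illustrative sketch (keys and structure are assumptions, not an official schema):

```python
# Illustrative tier settings derived from the table above
TIERS = {
    "free":       {"latency_ms": 3000, "stt": "twiml_gather",  "tts": "polly",    "barge_in": False},
    "pro":        {"latency_ms": 600,  "stt": "deepgram",      "tts": "cartesia", "barge_in": False},
    "enterprise": {"latency_ms": 400,  "stt": "deepgram_vad",  "tts": "cartesia", "barge_in": True},
}

def tier_config(tier: str) -> dict:
    """Look up a tier's settings; unknown tiers fall back to free."""
    return TIERS.get(tier.lower(), TIERS["free"])
```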
</optimal_stack>
<deepgram_stt>

Deepgram STT (v5 SDK)


Streaming WebSocket Pattern


```python
import os

from deepgram import AsyncDeepgramClient
from deepgram.core.events import EventType
from deepgram.extensions.types.sockets import (
    ListenV1SocketClientResponse,
    ListenV1MediaMessage,
    ListenV1ControlMessage
)

async def streaming_stt(audio_bytes: bytes):
    client = AsyncDeepgramClient(api_key=os.getenv("DEEPGRAM_API_KEY"))

    async with client.listen.v1.connect(model="nova-3") as connection:
        def on_message(message: ListenV1SocketClientResponse):
            msg_type = getattr(message, "type", None)

            if msg_type == "Results":
                channel = getattr(message, "channel", None)
                if channel and channel.alternatives:
                    text = channel.alternatives[0].transcript
                    is_final = getattr(message, "is_final", False)
                    if text:
                        print(f"{'[FINAL]' if is_final else '[INTERIM]'} {text}")

            elif msg_type == "UtteranceEnd":
                print("[USER FINISHED SPEAKING]")

            elif msg_type == "SpeechStarted":
                print("[USER STARTED SPEAKING - barge-in trigger]")

        connection.on(EventType.MESSAGE, on_message)
        await connection.start_listening()

        # Send audio chunks
        await connection.send_media(ListenV1MediaMessage(data=audio_bytes))

        # Keep alive for long sessions
        await connection.send_control(ListenV1ControlMessage(type="KeepAlive"))
```
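Interim results, finals, and `UtteranceEnd` are easier to consume through a small accumulator that emits one string per user turn. A sketch with a hypothetical `UtteranceBuffer` (not part of the Deepgram SDK):

```python
class UtteranceBuffer:
    """Collect is_final transcript segments until an UtteranceEnd event."""

    def __init__(self):
        self._finals: list = []

    def on_results(self, transcript: str, is_final: bool) -> None:
        # Interim results are ignored; finals are buffered until the utterance ends.
        if is_final and transcript:
            self._finals.append(transcript)

    def on_utterance_end(self) -> str:
        """Return the full utterance and reset for the next turn."""
        utterance = " ".join(self._finals)
        self._finals = []
        return utterance
```

The `on_message` handler above would call `on_results(...)` for `Results` messages and hand `on_utterance_end()`'s return value to the LLM.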

Connection Options


```python
options = {
    "model": "nova-3",
    "language": "en-US",
    "encoding": "mulaw",       # Twilio format
    "sample_rate": 8000,       # Telephony standard
    "interim_results": True,   # Get partial transcripts
    "utterance_end_ms": 1000,  # Silence to end utterance
    "vad_events": True,        # Voice activity detection
}
```
See `reference/deepgram-setup.md` for full streaming setup. </deepgram_stt>
<groq_llm>

Groq LLM (Fastest Inference)


Voice-Optimized Pattern


```python
from groq import AsyncGroq

class GroqVoiceLLM:
    def __init__(self, model: str = "llama-3.1-8b-instant"):
        self.client = AsyncGroq()
        self.model = model
        self.system_prompt = (
            "You are a helpful voice assistant. "
            "Keep responses to 2-3 sentences max. "
            "Speak naturally as if on a phone call."
        )

    async def generate_stream(self, user_input: str):
        """Streaming for lowest TTFB."""
        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_input}
            ],
            max_tokens=150,
            temperature=0.7,
            stream=True,
        )

        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield content  # Pipe to TTS immediately
```
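To start TTS before the LLM finishes, the token stream above can be regrouped into complete sentences so synthesis begins at the first sentence boundary. A sketch with a hypothetical `sentences` helper:

```python
import re

async def sentences(token_stream):
    """Regroup an async stream of LLM tokens into complete sentences for TTS."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        # Flush whenever sentence-ending punctuation is followed by whitespace
        while (match := re.search(r"[.!?]\s", buffer)):
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()  # Trailing fragment without final punctuation
```

Each yielded sentence would be handed to Cartesia immediately, overlapping LLM and TTS latency.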

Model Selection


| Model | Speed | Quality | Use Case |
|---|---|---|---|
| llama-3.1-8b-instant | ~220ms | Good | Primary voice |
| llama-3.3-70b-versatile | ~500ms | Best | Complex queries |
| mixtral-8x7b-32768 | ~300ms | Good | Long context |
See `reference/groq-voice-llm.md` for context management. </groq_llm>
<cartesia_tts>

Cartesia TTS (Sonic-2)


Streaming Pattern


```python
from cartesia import AsyncCartesia

class CartesiaTTS:
    VOICES = {
        "en": "f9836c6e-a0bd-460e-9d3c-f7299fa60f94",  # Warm female
        "es": "5c5ad5e7-1020-476b-8b91-fdcbe9cc313c",  # Mexican Spanish
    }

    EMOTIONS = {
        "greeting": "excited",
        "confirmation": "grateful",
        "info": "calm",
        "complaint": "sympathetic",
        "apology": "apologetic",
    }

    def __init__(self, api_key: str):
        self.client = AsyncCartesia(api_key=api_key)

    async def synthesize_stream(
        self,
        text: str,
        language: str = "en",
        emotion: str = "neutral"
    ):
        voice_id = self.VOICES.get(language, self.VOICES["en"])

        response = self.client.tts.sse(
            model_id="sonic-2",
            transcript=text,
            voice={
                "id": voice_id,
                "experimental_controls": {
                    "speed": "normal",
                    "emotion": [emotion] if emotion != "neutral" else []
                }
            },
            language=language,
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 8000,  # Telephony
            },
        )

        async for chunk in response:  # async client yields an async iterator
            if chunk.audio:
                yield chunk.audio
```
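The `voice` payload with `experimental_controls` can be factored into a builder so the neutral-emotion special case lives in one place. A sketch with a hypothetical `build_voice` helper:

```python
def build_voice(voice_id: str, emotion: str = "neutral", speed: str = "normal") -> dict:
    """Assemble the Cartesia voice payload; 'neutral' maps to an empty emotion list."""
    return {
        "id": voice_id,
        "experimental_controls": {
            "speed": speed,
            "emotion": [emotion] if emotion != "neutral" else [],
        },
    }
```

`synthesize_stream` would then pass `voice=build_voice(voice_id, emotion)`.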

With Timestamps


```python
import os

from cartesia import Cartesia

# Sync client for this example
client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

response = client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, world!",
    voice={"id": voice_id},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    add_timestamps=True,
)

for chunk in response:
    if chunk.word_timestamps:
        for word, start, end in zip(
            chunk.word_timestamps.words,
            chunk.word_timestamps.start,
            chunk.word_timestamps.end
        ):
            print(f"'{word}': {start:.2f}s - {end:.2f}s")
```
See `reference/cartesia-tts.md` for all 57 emotions. </cartesia_tts>
<twilio_media_streams>

Twilio Media Streams


WebSocket Handler (FastAPI)


```python
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect, Stream
import json, base64, audioop  # audioop was removed in Python 3.13; use the audioop-lts backport there

app = FastAPI()

@app.post("/voice/incoming")
async def incoming_call(request: Request):
    """Route to Media Streams WebSocket."""
    form = await request.form()
    caller = form.get("From", "")
    lang = "es" if caller.startswith("+52") else "en"

    response = VoiceResponse()
    connect = Connect()
    connect.append(Stream(url=f"wss://your-app.com/voice/stream?lang={lang}"))
    response.append(connect)

    return Response(content=str(response), media_type="application/xml")

@app.websocket("/voice/stream")
async def media_stream(websocket: WebSocket, lang: str = "en"):
    await websocket.accept()
    stream_sid = None

    while True:
        message = await websocket.receive_text()
        data = json.loads(message)

        if data["event"] == "start":
            stream_sid = data["start"]["streamSid"]
            # Initialize STT, send greeting

        elif data["event"] == "media":
            audio = base64.b64decode(data["media"]["payload"])
            # Send to Deepgram STT

        elif data["event"] == "stop":
            break

async def send_audio(websocket, stream_sid: str, pcm_audio: bytes):
    """Convert PCM to mu-law and send to Twilio."""
    mulaw = audioop.lin2ulaw(pcm_audio, 2)
    await websocket.send_text(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw).decode()}
    }))
```
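For barge-in, Twilio Media Streams also accepts a `clear` event that tells Twilio to drop any buffered outbound audio. A sketch of the message builder (function name is illustrative):

```python
import json

def clear_message(stream_sid: str) -> str:
    """Build the Media Streams 'clear' event, which flushes queued outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On a Deepgram `SpeechStarted` event, sending `await websocket.send_text(clear_message(stream_sid))` cuts the bot off mid-sentence so the caller can interrupt.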
See `reference/twilio-webhooks.md` for the complete handler. </twilio_media_streams>
<bilingual_support>

Bilingual Support (EN/ES)


Auto-Detection


```python
def detect_language(caller_number: str) -> str:
    if caller_number.startswith("+52"):
        return "es"  # Mexico
    elif caller_number.startswith("+1"):
        return "en"  # US/Canada
    return "es"  # Default Spanish
```

Voice Prompts


```python
GREETINGS = {
    "en": "Hello! How can I help you today?",
    "es": "¡Hola! ¿En qué puedo ayudarle hoy?",  # Use "usted" for respect
}
```
</bilingual_support>
<voice_prompts>

Voice Prompt Engineering


```python
VOICE_PROMPT = """
Role

You are a bilingual voice assistant for {business_name}.

Tone

  • 2-3 sentences max for phone clarity
  • NEVER use bullet points, lists, or markdown
  • Spell out emails: "john at company dot com"
  • Phone numbers with pauses: "five one two... eight seven seven..."
  • Spanish: Use "usted" for formal respect

Guardrails

  • Never make up information
  • Transfer to human after 3 failed attempts
  • Match caller's language

Error Recovery

English: "I want to make sure I got that right. Did you say [repeat]?"
Spanish: "Quiero asegurarme de entender bien. ¿Dijo [repetir]?"
"""
```
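The "phone numbers with pauses" rule can be implemented as a pre-TTS formatter. A sketch with a hypothetical `speak_phone_number` helper:

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def speak_phone_number(number: str, group_size: int = 3) -> str:
    """Render digits as spoken words in groups separated by pauses ('...')."""
    digits = [DIGIT_WORDS[ch] for ch in number if ch.isdigit()]
    groups = [
        " ".join(digits[i:i + group_size])
        for i in range(0, len(digits), group_size)
    ]
    return "... ".join(groups)
```

Running the response text through this before synthesis keeps the TTS cadence natural for dictated numbers.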

> See `reference/voice-prompts.md` for full template.
</voice_prompts>

<file_locations>

Reference Files


  • `reference/deepgram-setup.md` - Full streaming STT setup
  • `reference/groq-voice-llm.md` - Groq patterns for voice
  • `reference/cartesia-tts.md` - All 57 emotions, voice cloning
  • `reference/twilio-webhooks.md` - Complete Media Streams handler
  • `reference/latency-optimization.md` - Sub-500ms techniques
  • `reference/voice-prompts.md` - Voice-optimized prompts </file_locations>
<routing>

Request Routing


User wants voice agent:
  → Provide full stack (Deepgram + Groq + Cartesia + Twilio)
  → Start with quick_start pipeline

User wants STT only:
  → Provide Deepgram streaming pattern
  → Reference: `reference/deepgram-setup.md`

User wants TTS only:
  → Provide Cartesia pattern with emotions
  → Reference: `reference/cartesia-tts.md`

User wants latency optimization:
  → Audit current stack, identify bottlenecks
  → Reference: `reference/latency-optimization.md`

User mentions OpenAI:
  → REDIRECT to Groq immediately
  → Explain: "NO OPENAI - Use Groq for lowest latency" </routing>
<env_setup>

Environment Variables



Required (NEVER OpenAI)


```bash
DEEPGRAM_API_KEY=your_key
GROQ_API_KEY=gsk_xxxx
CARTESIA_API_KEY=your_key
```

Twilio


```bash
TWILIO_ACCOUNT_SID=ACxxxx
TWILIO_AUTH_TOKEN=your_token
TWILIO_PHONE_NUMBER=+15551234567
```

Fallbacks


```bash
ANTHROPIC_API_KEY=sk-ant-xxxx
CEREBRAS_API_KEY=csk_xxxx
```

Install dependencies:

```bash
pip install deepgram-sdk groq cartesia twilio fastapi
```
</env_setup>
<quick_reference>

Quick Reference Card


STACK:
  STT: Deepgram Nova-3 (~150ms)
  LLM: Groq llama-3.1-8b-instant (~220ms) - NOT OPENAI
  TTS: Cartesia Sonic-2 (~90ms)

LATENCY TARGETS:
  Pro: 600ms (Media Streams)
  Enterprise: 400ms (Full streaming + barge-in)

BILINGUAL:
  +52 -> Spanish (es)
  +1 -> English (en)
  Default -> Spanish

EMOTIONS (Cartesia):
  greeting -> excited
  confirmation -> grateful
  complaint -> sympathetic
</quick_reference>