speech-engine

ElevenLabs Speech Engine

Add a real-time voice interface to a custom LLM-backed agent. ElevenLabs handles microphone audio, speech-to-text, turn-taking, text-to-speech, and browser playback; your server exposes a Speech Engine WebSocket endpoint and streams the LLM response text back.

Setup: See Installation Guide. For JavaScript, use `@elevenlabs/*` packages only. For deeper SDK details, read JavaScript SDK Reference or Python SDK Reference.

When to Use

Use Speech Engine when the user wants to:
  • Add voice to an existing chat app or custom LLM pipeline
  • Add voice to OpenClaw, Hermes, or a similar agent runtime while keeping agent logic on the developer-owned server
  • Build a developer-hosted WebSocket server for ElevenLabs voice conversations
  • Stream OpenAI, Anthropic, Gemini, or custom LLM responses back as spoken audio
  • Handle user interruptions while an LLM response is still streaming
  • Build a browser client with `@elevenlabs/react` or `@elevenlabs/client` using a server-issued conversation token

Use the `agents` skill instead when the user is creating or configuring a hosted ElevenLabs Conversational AI agent with platform-managed prompts, tools, workflows, phone numbers, or widgets.

How It Works

Each Speech Engine WebSocket connection represents one conversation.
  1. The browser sends user audio to ElevenLabs.
  2. ElevenLabs transcribes speech and sends the full transcript to your server.
  3. Your server calls the LLM with that conversation history.
  4. Your server streams text back through the SDK.
  5. ElevenLabs converts the response to speech and plays it in the browser.
The SDK manages WebSocket routing, request verification, session lifecycle, ping/pong, turn-taking, and interruption handling.
`sendResponse()` / `send_response()` accepts a string, an async iterable, or provider streams from OpenAI, Anthropic, or Google Gemini.
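
Because `send_response()` is described as accepting an async iterable, a custom pipeline could feed it from an async generator. The sketch below only illustrates the shape of such an iterable: `sentence_chunks` and `collect` are hypothetical helpers, not SDK functions, and splitting on sentence boundaries is an arbitrary choice made for the example.

```python
import asyncio
import re
from typing import AsyncIterator


async def sentence_chunks(text: str) -> AsyncIterator[str]:
    """Yield a reply one sentence at a time (hypothetical helper, not an SDK API)."""
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            # Yield control to the event loop between chunks, as a real stream would.
            await asyncio.sleep(0)
            yield sentence + " "


async def collect(text: str) -> list[str]:
    """Drain the generator; shown only to make the emitted chunks visible."""
    return [chunk async for chunk in sentence_chunks(text)]
```

Inside a transcript handler, such a generator would presumably be passed as `await session.send_response(sentence_chunks(reply))`, where `reply` is text your own pipeline produced.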

Implementation Flow

  1. Install server dependencies and configure `ELEVENLABS_API_KEY` plus the LLM provider key.
  2. Expose your Speech Engine server through a public HTTPS URL for local development, for example with `ngrok http 3001`.
  3. Create a Speech Engine resource with `ws_url` / `wsUrl` pointing at the public URL plus `/ws`.
  4. Store the returned Speech Engine ID, for example in `ELEVENLABS_SPEECH_ENGINE_ID`.
  5. Start a Speech Engine server with `engine.serve(...)` in Python or `speechEngine.attach(...)` in TypeScript.
  6. Issue browser conversation tokens from a server endpoint. Never put `ELEVENLABS_API_KEY` in browser code.
  7. Start the client session with `conversationToken`; optionally set `overrides.agent.firstMessage` if the agent should greet first.
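
Steps 1 and 4 leave several values in the environment, and a missing one typically surfaces only once a conversation starts. A minimal sketch of a startup check follows; `REQUIRED_VARS`, `missing_settings`, and `require_settings` are hypothetical names, and `OPENAI_API_KEY` is assumed here as the LLM provider key.

```python
import os
from typing import Mapping

# Names taken from the steps above; OPENAI_API_KEY stands in for "the LLM provider key".
REQUIRED_VARS = ("ELEVENLABS_API_KEY", "ELEVENLABS_SPEECH_ENGINE_ID", "OPENAI_API_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_settings(env: Mapping[str, str] = os.environ) -> None:
    """Raise early with a readable message instead of failing mid-conversation."""
    missing = missing_settings(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_settings()` before starting the server (step 5) is one option. The script that creates the engine (step 3) would need a shorter list, since the engine ID does not exist yet at that point.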

Create a Speech Engine

Python

```python
import asyncio
import os

from dotenv import load_dotenv
from elevenlabs import AsyncElevenLabs

load_dotenv()

elevenlabs = AsyncElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

async def main():
    engine = await elevenlabs.speech_engine.create(
        name="My Speech Engine",
        speech_engine={"ws_url": os.environ["PUBLIC_WS_URL"]},
    )
    print(engine.engine_id)

asyncio.run(main())
```

TypeScript

```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import "dotenv/config";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

const engine = await elevenlabs.speechEngine.create({
  name: "My Speech Engine",
  speechEngine: { wsUrl: process.env.PUBLIC_WS_URL! },
});

console.log(engine.engineId);
```

`PUBLIC_WS_URL` should look like `https://example.ngrok.app/ws` locally or your production WebSocket route in deployment.

The create request can also configure `tts`, `asr`, `turn`, `speech_engine.request_headers` / `speechEngine.requestHeaders`, and `privacy` for custom voices, transcription keywords, turn-taking, server auth headers, and recording behavior. See the SDK reference files for expanded examples.
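
A common slip when setting `PUBLIC_WS_URL` is dropping the `/ws` route or doubling a slash when copying the tunnel URL. The helper below is a small sketch for composing the value; `build_ws_url` is a hypothetical name, and accepting `wss://` in addition to the `https://` form shown above is an assumption of this example, not something the SDK documentation here states.

```python
from urllib.parse import urlsplit, urlunsplit


def build_ws_url(public_base: str, path: str = "/ws") -> str:
    """Join a public base URL and the WebSocket route without doubling slashes."""
    parts = urlsplit(public_base)
    # Assumption: only public TLS URLs are useful here, so plain http/ws is rejected.
    if parts.scheme not in ("https", "wss") or not parts.netloc:
        raise ValueError(f"expected a public https:// or wss:// URL, got {public_base!r}")
    base_path = parts.path.rstrip("/")
    route = "/" + path.strip("/")
    return urlunsplit((parts.scheme, parts.netloc, base_path + route, "", ""))
```

For example, `build_ws_url("https://example.ngrok.app/")` returns `https://example.ngrok.app/ws`, matching the local form described above.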

Server Examples

Python

```python
import asyncio
import os

from dotenv import load_dotenv
from elevenlabs import AsyncElevenLabs
from openai import AsyncOpenAI

load_dotenv()

elevenlabs = AsyncElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
openai = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def on_transcript(transcript, session):
    stream = await openai.responses.create(
        model=os.environ["OPENAI_MODEL"],
        instructions="You are a concise, conversational voice assistant.",
        # Transcript roles are "user" or "agent"; OpenAI expects "assistant".
        input=[
            {
                "role": "assistant" if message.role == "agent" else message.role,
                "content": message.content,
            }
            for message in transcript
        ],
        stream=True,
    )
    # Hand the provider stream to the SDK, which turns it into speech.
    await session.send_response(stream)

async def main():
    engine = await elevenlabs.speech_engine.get(os.environ["ELEVENLABS_SPEECH_ENGINE_ID"])
    await engine.serve(
        port=3001,
        path="/ws",
        debug=True,
        on_transcript=on_transcript,
    )

asyncio.run(main())
```

TypeScript

```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createServer } from "node:http";
import OpenAI from "openai";
import "dotenv/config";

const elevenlabs = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
});
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const httpServer = createServer();

await elevenlabs.speechEngine.attach(
  process.env.ELEVENLABS_SPEECH_ENGINE_ID!,
  httpServer,
  "/ws",
  {
    debug: true,
    async onTranscript(transcript, signal, session) {
      const response = await openai.responses.create(
        {
          model: process.env.OPENAI_MODEL!,
          instructions: "You are a concise, conversational voice assistant.",
          // Transcript roles are "user" or "agent"; OpenAI expects "assistant".
          input: transcript.map((message) => ({
            role: message.role === "agent" ? "assistant" : message.role,
            content: message.content,
          })),
          stream: true,
        },
        // Forward the AbortSignal so an interruption cancels the in-flight request.
        { signal },
      );

      session.sendResponse(response);
    },
  },
);

httpServer.listen(3001);
```

In TypeScript, pass the `AbortSignal` from `onTranscript` to the LLM request so user interruptions cancel the in-flight response. In Python, the SDK cancels the previous transcript handler when a newer transcript arrives.
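
To make the Python behavior concrete, the sketch below reproduces the "newest transcript wins" pattern with plain `asyncio`. It is not the SDK's implementation: `LatestOnly` and `demo` are hypothetical names, and the `asyncio.sleep(10)` call stands in for a long LLM stream. It shows what a handler can expect, namely an `asyncio.CancelledError` raised at its current `await`, which it can catch for cleanup and should then re-raise.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


class LatestOnly:
    """Run one handler at a time; submitting a new one cancels the previous one."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    async def submit(self, handler: Callable[[], Coroutine[Any, Any, None]]) -> asyncio.Task:
        previous = self._task
        if previous is not None and not previous.done():
            previous.cancel()
            # Wait for the cancelled handler to unwind so its cleanup runs first.
            await asyncio.gather(previous, return_exceptions=True)
        self._task = asyncio.create_task(handler())
        return self._task


async def demo() -> list[str]:
    log: list[str] = []

    async def slow_reply() -> None:
        try:
            await asyncio.sleep(10)  # stands in for a long LLM stream
            log.append("first finished")
        except asyncio.CancelledError:
            log.append("first cancelled")  # release resources here
            raise  # re-raise so the task is marked as cancelled

    async def quick_reply() -> None:
        log.append("second finished")

    runner = LatestOnly()
    await runner.submit(slow_reply)
    await asyncio.sleep(0)  # let the first handler start running
    second = await runner.submit(quick_reply)
    await second
    return log
```

The practical consequence for an `on_transcript` handler is to avoid swallowing `CancelledError` and to keep any cleanup in the `except` or `finally` branch short.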

Browser Client

Create a server-side token endpoint and have the browser request a token before starting the microphone session. Keep the Speech Engine ID and API key on the server.
```typescript
import express from "express";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import "dotenv/config";

const app = express();
const elevenlabs = new ElevenLabsClient();

app.get("/api/token", async (_req, res) => {
  const response = await elevenlabs.conversationalAi.conversations.getWebrtcToken({
    agentId: process.env.ELEVENLABS_SPEECH_ENGINE_ID!,
  });
  res.json({ token: response.token });
});
```

React clients can use `@elevenlabs/react`:

```tsx
import { useConversation } from "@elevenlabs/react";

export function VoiceControls() {
  const conversation = useConversation({
    onConnect: () => console.log("connected"),
    onDisconnect: () => console.log("disconnected"),
    onError: (error) => console.error(error),
  });

  async function startConversation() {
    await navigator.mediaDevices.getUserMedia({ audio: true });
    const { token } = await fetch("/api/token").then((res) => res.json());

    await conversation.startSession({
      conversationToken: token,
      overrides: {
        agent: { firstMessage: "Hello! How can I help you today?" },
      },
    });
  }

  return <button onClick={startConversation}>Start conversation</button>;
}
```

If a WebRTC browser session stalls or logs `/rtc/v1` 404s, `v1 RTC path not found`, or `could not establish pc connection`, pin `livekit-client` to `2.16.1` in the app's `package.json` until the upstream LiveKit compatibility issue is resolved:

```json
{
  "overrides": {
    "livekit-client": "2.16.1"
  }
}
```

Callbacks and Events

| Event | TypeScript callback | Python callback | Notes |
|---|---|---|---|
| `user_transcript` | `onTranscript(transcript, signal, session)` | `on_transcript(transcript, session)` | Full conversation history for the current turn |
| `init` | `onInit(conversationId, session)` | `on_init(conversation_id, session)` | Conversation ID becomes available |
| `close` | `onClose(session)` | `on_close(session)` | Clean disconnect |
| `disconnected` | `onDisconnect(session)` | `on_disconnect(session)` | Unexpected WebSocket drop |
| `error` | `onError(error, session)` | `on_error(error, session)` | Protocol or WebSocket error |

Transcript messages use role `"user"` or `"agent"`. Map `"agent"` to `"assistant"` when passing history to LLM APIs that expect OpenAI-style roles.
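
The role mapping is easy to factor into a helper so every provider call shares it. In this sketch, `Message` is a stand-in for the SDK's transcript message type (only the `role` and `content` attributes used in the server examples are assumed), and `to_llm_messages` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Message:
    """Stand-in for an SDK transcript message; only role and content are assumed."""
    role: str
    content: str


def to_llm_messages(transcript: Iterable[Message]) -> list[dict[str, str]]:
    """Convert transcript messages to OpenAI-style role/content dictionaries."""
    return [
        # "agent" becomes "assistant"; every other role passes through unchanged.
        {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
        for m in transcript
    ]
```

With this in place, a handler could pass `to_llm_messages(transcript)` as the `input` argument instead of building the list inline.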
References

  • Installation Guide
  • JavaScript SDK Reference
  • Python SDK Reference