speech-to-text

ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2, which supports 90+ languages, speaker diarization, and word-level timestamps.
Setup: See Installation Guide. For JavaScript, use @elevenlabs/* packages only.

Quick Start

Python

python
from elevenlabs.client import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

print(result.text)

JavaScript

javascript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();
const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});
console.log(result.text);

cURL

bash
curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" -F "file=@audio.mp3" -F "model_id=scribe_v2"

Models

| Model ID | Description | Best For |
| --- | --- | --- |
| scribe_v2 | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| scribe_v2_realtime | Low latency (~150ms) | Live transcription, voice agents |

Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:
python
result = client.speech_to_text.convert(
    file=audio_file, model_id="scribe_v2", timestamps_granularity="word"
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")
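
A common use of word-level timestamps is building subtitle cues. The sketch below shows one way to do that, assuming each word is a plain dict with the `text`, `start`, `end`, and `type` fields described under Response Format. The helper names `format_srt_time` and `words_to_cues` and the `max_chars` limit are illustrative and not part of the SDK.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_cues(words, max_chars=40):
    """Group word dicts into (start, end, text) cues of roughly max_chars each."""
    cues = []
    text, start, end = "", None, None
    for w in words:
        if w["type"] == "audio_event":
            continue  # skip non-speech events such as laughter
        is_word = w["type"] == "word"
        # Start a new cue when adding this word would exceed the limit.
        if is_word and text.strip() and len(text) + len(w["text"]) > max_chars:
            cues.append((start, end, text.strip()))
            text, start = "", None
        if is_word:
            if start is None:
                start = w["start"]
            end = w["end"]
        text += w["text"]
    if text.strip():
        cues.append((start, end, text.strip()))
    return cues
```

Each cue can then be written out as `f"{format_srt_time(start)} --> {format_srt_time(end)}"` followed by its text. With SDK response objects instead of dicts, attribute access (`w.text`, `w.start`) would replace the key lookups.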

Speaker Diarization

Identify WHO said WHAT - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:
python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
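
Per-word speaker labels are usually easier to read once merged into turns. The following is a minimal sketch, assuming the words are plain dicts with `text` and `speaker_id` keys as in the JSON response; `speaker_turns` is a hypothetical helper name.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words from the same speaker into (speaker_id, text) turns."""
    turns = []
    # groupby only merges adjacent items, so each speaker change starts a new turn.
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        text = "".join(w["text"] for w in group).strip()
        if text:
            turns.append((speaker, text))
    return turns
```

Printing each turn as `f"[{speaker}] {text}"` gives a transcript in dialogue form rather than one line per word.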

Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):
python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"]
)
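
When key terms come from user input or a glossary file, it can help to clean the list before sending it. This is a small sketch of one approach; `prepare_keyterms` is a hypothetical helper, and truncating silently at the limit is a design assumption (raising an error instead may suit some applications better).

```python
def prepare_keyterms(terms, limit=100):
    """Strip whitespace, drop empties and exact duplicates, keep at most `limit` terms."""
    # dict.fromkeys preserves first-seen order while removing duplicates.
    cleaned = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return cleaned[:limit]
```

Because later terms are the ones dropped, it makes sense to list the most important terms first.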

Language Detection

Automatic detection with optional language hint:
python
result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng"  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")

Supported Formats

Audio: MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus
Video: MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP
Limits: Up to 3GB file size, 10 hours duration
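
A local pre-flight check can catch obviously unsuitable files before an upload starts. The sketch below is illustrative only: the mapping from format names to file extensions (for example 3GPP to `.3gp`) and the reading of 3GB as 3 × 10^9 bytes are assumptions, `check_upload` is a hypothetical helper, and the 10-hour duration limit is not checked because that would require decoding the media.

```python
from pathlib import Path

# Assumed extension mapping for the formats listed above.
SUPPORTED_EXTENSIONS = {
    # Audio
    ".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".aac", ".aiff", ".opus",
    # Video
    ".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv", ".mpeg", ".3gp",
}
MAX_BYTES = 3 * 1000**3  # 3GB, taken as decimal gigabytes (assumption)


def check_upload(path, size_bytes=None):
    """Return a list of problems; an empty list means the file passed these checks."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    # size_bytes lets callers pass a known size; otherwise read it from disk.
    size = p.stat().st_size if size_bytes is None else size_bytes
    if size > MAX_BYTES:
        problems.append(f"file too large: {size} bytes")
    return problems
```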

Response Format

json
{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {"text": "The", "start": 0.0, "end": 0.15, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing", "speaker_id": "speaker_0"}
  ]
}
Word types:
  • word - An actual spoken word
  • spacing - Whitespace between words (useful for precise timing)
  • audio_event - Non-speech sounds the model detected (laughter, applause, music, etc.)
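
To illustrate how the three word types can be told apart in practice, here is a small sketch that summarizes a raw JSON response. It assumes the response shape shown above; `summarize_transcript` is a hypothetical helper, and the exact text the model emits for an audio event is not specified here.

```python
import json


def summarize_transcript(response_json: str) -> dict:
    """Count spoken words, collect audio events, and total the spoken time."""
    data = json.loads(response_json)
    spoken = [w for w in data["words"] if w["type"] == "word"]
    events = [w["text"] for w in data["words"] if w["type"] == "audio_event"]
    # Sum only the durations of real words; spacing and events are excluded.
    spoken_seconds = sum(w["end"] - w["start"] for w in spoken)
    return {
        "word_count": len(spoken),
        "audio_events": events,
        "spoken_seconds": round(spoken_seconds, 3),
    }
```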

Error Handling

python
try:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
except Exception as e:
    print(f"Transcription failed: {e}")
Common errors:
  • 401: Invalid API key
  • 422: Invalid parameters
  • 429: Rate limit exceeded
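
Of the errors above, only 429 is normally worth retrying; 401 and 422 will fail again until the request itself is fixed. The sketch below shows a generic exponential backoff wrapper. It assumes the raised exception exposes the HTTP status as a `status_code` attribute, which should be verified against the SDK version in use; `with_retries` and its parameters are illustrative names.

```python
import random
import time

RETRYABLE_STATUS = {429}


def with_retries(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retryable status, wait with exponential backoff and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Assumption: the SDK error carries the HTTP status as `status_code`.
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            # Delays of about 1s, 2s, 4s, ... plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

When wrapping an upload, the callable should reopen the file or call `audio_file.seek(0)` before each attempt, since a failed attempt may already have consumed the stream.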

Tracking Costs

Monitor usage via the request-id response header:
python
response = client.speech_to_text.convert.with_raw_response(file=audio_file, model_id="scribe_v2")
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")

Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. The real-time API produces two types of transcripts:
  • Partial transcripts: Interim results that update frequently as audio is processed - use these for live feedback (e.g., showing text as the user speaks)
  • Committed transcripts: Final, stable results after you "commit" - use these as the source of truth for your application
A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.
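
The difference between partial and committed transcripts can be sketched as a small buffer: each partial replaces the previous one, while each committed transcript is appended permanently. This is an illustrative model of client-side handling, not SDK code; `TranscriptBuffer` is a hypothetical class, and only the event type names are taken from this document.

```python
class TranscriptBuffer:
    """Keeps committed text stable and overwrites the live partial on each update."""

    def __init__(self):
        self.committed = []  # finalized segments, in order
        self.partial = ""    # latest interim text for the current segment

    def handle(self, event_type: str, text: str) -> None:
        if event_type == "partial_transcript":
            self.partial = text  # replaces the previous partial, never appends
        elif event_type == "committed_transcript":
            self.committed.append(text.strip())
            self.partial = ""  # the segment is final, so clear the preview

    def display(self) -> str:
        """Committed text followed by the current partial, for live UI feedback."""
        parts = self.committed + [self.partial.strip()]
        return " ".join(p for p in parts if p)
```

Application logic that needs a reliable transcript would read only `committed`, while a live caption view would render `display()`.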

Python (Server-Side)

python
import asyncio
from elevenlabs.client import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")

        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())

JavaScript (Client-Side with React)

typescript
import { useState } from "react";
import { useScribe } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());

    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}

Commit Strategies

| Strategy | Description |
| --- | --- |
| Manual | You call commit() when ready - use for file processing or when you control the audio segments |
| VAD | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |
javascript
// VAD configuration
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});

Event Types

| Event | Description |
| --- | --- |
| partial_transcript | Live interim results |
| committed_transcript | Final results after commit |
| committed_transcript_with_timestamps | Final results with word timing |
| error | Error occurred |
See real-time references for complete documentation.

References

  • Installation Guide
  • Transcription Options
  • Real-Time Client-Side Streaming
  • Real-Time Server-Side Streaming
  • Commit Strategies
  • Real-Time Event Reference