# Speech-to-Text — Saaras

> [!IMPORTANT]
> Auth: `api-subscription-key` header — NOT `Authorization: Bearer`. Base URL: `https://api.sarvam.ai/v1`
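If you are not going through an SDK, the same auth rule applies to raw HTTP. Below is a minimal standard-library sketch that only *builds* a request, so the auth header is easy to inspect; the helper name is ours, the endpoint path you pass is whatever the API reference gives you, and nothing here touches the network:

```python
import json
import urllib.request

BASE_URL = "https://api.sarvam.ai/v1"  # base URL from the callout above

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Attach the key as a custom header — NOT `Authorization: Bearer`."""
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "api-subscription-key": "YOUR_SARVAM_API_KEY",  # placeholder key
            "Content-Type": "application/json",
        },
        method="POST",
    )
```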

## Model

`saaras:v3` — 23 languages, 5 output modes (`transcribe`, `translate`, `verbatim`, `translit`, `codemix`), auto language detection.

## Quick Start (Python)

```python
from sarvamai import SarvamAI

client = SarvamAI()

# Use a context manager so the file handle is closed after the request.
with open("audio.wav", "rb") as f:
    response = client.speech_to_text.transcribe(
        file=f,
        model="saaras:v3",
        mode="transcribe",
    )
print(response.transcript)
```

## Quick Start (JavaScript/TypeScript)

```typescript
import { SarvamAIClient } from "sarvamai";
import * as fs from "fs";

const client = new SarvamAIClient({ apiSubscriptionKey: "YOUR_SARVAM_API_KEY" });

const response = await client.speechToText.transcribe({
    file: fs.createReadStream("audio.wav"),
    model: "saaras:v3",
    mode: "transcribe"
});
console.log(response.transcript);
```

## Batch API (Long Audio + Diarization)

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="transcribe",
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2
)
job.upload_files(file_paths=["meeting.mp3"])
job.start()
job.wait_until_complete()
job.download_outputs(output_dir="./output")
```

Supports audio up to 1 hour, up to 8 speakers, and all 5 output modes.
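Since the sync REST endpoint caps audio at 30 seconds while batch handles up to an hour, a duration check before picking an endpoint can save a failed request. A minimal sketch for WAV input; the helper name and routing policy are ours, not the SDK's:

```python
import wave

REST_LIMIT_S = 30        # sync REST endpoint rejects audio longer than 30 s
BATCH_LIMIT_S = 60 * 60  # batch API supports audio up to 1 hour

def choose_api(path: str) -> str:
    """Route a WAV file by duration: 'rest' for short clips, 'batch' for long ones.
    (For MP3 or other containers you would need a different duration probe.)"""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if duration <= REST_LIMIT_S:
        return "rest"
    if duration <= BATCH_LIMIT_S:
        return "batch"
    raise ValueError("audio exceeds the 1-hour batch limit; split it first")
```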

## WebSocket Streaming

```python
import asyncio, base64
from sarvamai import AsyncSarvamAI

async def stream_audio():
    client = AsyncSarvamAI()
    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        high_vad_sensitivity=True,
        flush_signal=True
    ) as ws:
        with open("audio.wav", "rb") as f:
            audio_base64 = base64.b64encode(f.read()).decode("utf-8")
        await ws.transcribe(audio=audio_base64, encoding="audio/wav", sample_rate=16000)
        await ws.flush()
        response = await ws.recv()
        print(response)

asyncio.run(stream_audio())
```

Supports sessions up to 8 hours. Use `sample_rate=8000` for telephony audio.
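For an hours-long session you would normally send audio incrementally rather than base64-encoding one file up front as above. A sketch of splitting raw PCM into base64 frames ready to pass as the `audio` argument; the 100 ms chunk size is an assumption, not a documented requirement:

```python
import base64

CHUNK_BYTES = 3200  # assumption: ~100 ms of 16 kHz, 16-bit mono PCM

def pcm_to_base64_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split raw PCM into fixed-size chunks and base64-encode each one,
    matching the base64 framing the streaming endpoint expects."""
    for i in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[i : i + chunk_bytes]).decode("utf-8")
```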

## Gotchas

| Gotcha | Detail |
|---|---|
| REST: 30s limit | Audio >30s fails. Use Batch API or WebSocket for longer files. |
| JS method name | `client.speechToText.transcribe({...})` — camelCase, NOT `speech_to_text`. File via `fs.createReadStream()`. |
| WebSocket codecs | Only `wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`. MP3/AAC/OGG NOT supported for streaming. |
| WebSocket audio | Must be base64-encoded. Use `sample_rate=8000` for telephony audio. |
| Flush signal | `flush_signal=True` + `await ws.flush()` forces an immediate transcription boundary. |
| Short audio detection | Set `language_code` explicitly for audio <3 seconds — auto-detection needs more signal. |
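The streaming-codec rule can be enforced client-side before opening a socket. A small illustrative guard; the helper and the `audio/` prefix convention (taken from the `encoding="audio/wav"` argument in the streaming example) are ours:

```python
# Codec names from the gotchas table above.
STREAMING_CODECS = {"wav", "pcm_s16le", "pcm_l16", "pcm_raw"}

def streaming_encoding(codec: str) -> str:
    """Fail fast on codecs the WebSocket endpoint rejects (mp3, aac, ogg)."""
    if codec not in STREAMING_CODECS:
        raise ValueError(
            f"{codec!r} is not streamable; transcode to wav or raw PCM first"
        )
    return f"audio/{codec}"
```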

## Full Docs

Fetch streaming protocol, batch API SDK examples, and codec details from: