audio-transcribe

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

bash

agent-media audio transcribe --in <path> [options]

Inputs

| Option | Required | Description |
| --- | --- | --- |
| `--in` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language` | No | Language code (auto-detected if not provided) |
| `--speakers` | No | Number of speakers hint for diarization |
| `--out` | No | Output path, filename or directory (default: ./) |
| `--provider` | No | Provider to use (local, fal, replicate, runpod) |
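
As a minimal sketch of how the options above combine, the Python helper below assembles an argument list suitable for `subprocess`. The function name `build_transcribe_command` and its parameter names are illustrative assumptions, not part of agent-media; only the flags themselves come from the table.

```python
from typing import List, Optional


def build_transcribe_command(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `agent-media audio transcribe` (hypothetical helper)."""
    # --in is the only required option.
    cmd = ["agent-media", "audio", "transcribe", "--in", in_path]
    if diarize:
        cmd.append("--diarize")  # boolean flag, takes no value
    if language is not None:
        cmd += ["--language", language]
    if speakers is not None:
        cmd += ["--speakers", str(speakers)]
    if out is not None:
        cmd += ["--out", out]
    if provider is not None:
        cmd += ["--provider", provider]
    return cmd
```

Passing the result to `subprocess.run(..., check=True)` avoids shell quoting problems with paths that contain spaces.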

Output

Returns a JSON object with transcription data:

json

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
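
To illustrate how the segment-level timing might be consumed, here is a minimal Python sketch that turns the JSON above into readable transcript lines. The helper names `format_timestamp` and `format_segments` are hypothetical, and the fallback for a missing `speaker` field assumes that segments may omit it when diarization is off, which the output above does not confirm.

```python
import json
from typing import List

# Sample payload copied from the output shown above.
SAMPLE = """
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
"""


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as MM:SS.s."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:04.1f}"


def format_segments(result: dict) -> List[str]:
    """Return one '[start - end] speaker: text' line per segment."""
    lines = []
    for seg in result["transcription"]["segments"]:
        # Assumption: "speaker" may be absent without --diarize.
        speaker = seg.get("speaker", "UNKNOWN")
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {speaker}: {seg['text']}")
    return lines


if __name__ == "__main__":
    for line in format_segments(json.loads(SAMPLE)):
        print(line)
```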

Examples

Basic transcription (auto-detect language):

bash

agent-media audio transcribe --in interview.mp3

Transcription with speaker identification:

bash

agent-media audio transcribe --in meeting.wav --diarize

Transcription with specific language and speaker count:

bash

agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use a specific provider:

bash

agent-media audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

bash

# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
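
The two steps can also be chained from a script. The sketch below is one possible way to do it in Python, under a loud assumption: it supposes that `agent-media audio extract` prints a JSON object on stdout containing an `output_path` field, as the transcribe output does. This page does not confirm that, so check the actual extract output before relying on it. The names `run_cli` and `transcribe_video` are hypothetical.

```python
import json
import subprocess
from typing import Callable, List


def run_cli(args: List[str]) -> str:
    """Run a command and return its stdout as text."""
    done = subprocess.run(args, check=True, capture_output=True, text=True)
    return done.stdout


def transcribe_video(
    video_path: str, run: Callable[[List[str]], str] = run_cli
) -> dict:
    """Extract audio from a video, then transcribe the extracted file."""
    # Step 1: extract the audio track as mp3.
    # ASSUMPTION: the command prints JSON with an "output_path" field.
    extracted = json.loads(
        run(["agent-media", "audio", "extract", "--in", video_path, "--format", "mp3"])
    )
    audio_path = extracted["output_path"]

    # Step 2: transcribe the extracted audio and return the parsed result.
    return json.loads(
        run(["agent-media", "audio", "transcribe", "--in", audio_path])
    )
```

The `run` parameter is injectable so the flow can be exercised with a stub instead of the real CLI.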

Providers

local

Runs locally on CPU using Transformers.js; no API key required.

Uses the Moonshine model (5x faster than Whisper)
Models are downloaded on first use (~100MB)
Does NOT support diarization; use fal or replicate for speaker identification
You may see a `mutex lock failed` error. Ignore it: the output is correct if it contains `"ok": true`

bash

agent-media audio transcribe --in audio.mp3 --provider local
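
Since success is signalled by `"ok": true` rather than by the absence of error text, a wrapper script can check that field directly. The following Python sketch shows one way to do so. It assumes the `mutex lock failed` message might end up mixed into the captured text, which is not stated here, so it scans for the first parseable JSON object instead of parsing the whole string. The function names are hypothetical.

```python
import json


def parse_cli_result(output_text: str) -> dict:
    """Return the first JSON object found in the text, skipping other lines."""
    decoder = json.JSONDecoder()
    idx = output_text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(output_text, idx)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        idx = output_text.find("{", idx + 1)
    raise ValueError("no JSON object found in CLI output")


def transcription_succeeded(output_text: str) -> bool:
    """True only if the result object carries "ok": true."""
    return parse_cli_result(output_text).get("ok") is True
```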

fal

Requires `FAL_API_KEY`
Uses the `wizper` model for fast transcription (2x faster) when diarization is disabled
Uses the `whisper` model when diarization is enabled (native support)

replicate

Requires `REPLICATE_API_TOKEN`
Uses the `whisper-diarization` model with Whisper Large V3 Turbo
Native diarization support with word-level timestamps

runpod

Requires `RUNPOD_API_KEY`
Uses the `pruna/whisper-v3-large` model (Whisper Large V3)
Does NOT support diarization (speaker identification); use fal or replicate for diarization

bash

agent-media audio transcribe --in audio.mp3 --provider runpod
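
The provider notes above boil down to two questions: is the required key set, and is diarization needed? The Python sketch below encodes those constraints as a small lookup. The environment variable names and diarization support come from this page; the function `pick_provider` and its preference order (hosted providers first, `local` as fallback) are assumptions made for illustration, not agent-media's own selection logic.

```python
import os
from typing import Mapping, Optional

# Key requirements and diarization support as described for each provider.
PROVIDERS = {
    "local": {"env": None, "diarization": False},
    "fal": {"env": "FAL_API_KEY", "diarization": True},
    "replicate": {"env": "REPLICATE_API_TOKEN", "diarization": True},
    "runpod": {"env": "RUNPOD_API_KEY", "diarization": False},
}

# Assumed preference order; adjust to taste.
PREFERENCE = ("fal", "replicate", "runpod", "local")


def pick_provider(diarize: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the first usable provider for the requested feature set."""
    env = os.environ if env is None else env
    for name in PREFERENCE:
        info = PROVIDERS[name]
        if diarize and not info["diarization"]:
            continue  # provider cannot identify speakers
        if info["env"] is not None and not env.get(info["env"]):
            continue  # required key is missing or empty
        return name
    raise RuntimeError(
        "diarization needs FAL_API_KEY or REPLICATE_API_TOKEN to be set"
    )
```

The returned name can be passed straight to `--provider`.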
