audio-transcribe

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

bash
agent-media audio transcribe --in <path> [options]

Inputs

| Option | Required | Description |
|--------|----------|-------------|
| --in | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| --diarize | No | Enable speaker identification |
| --language | No | Language code (auto-detected if not provided) |
| --speakers | No | Number of speakers hint for diarization |
| --out | No | Output path, filename or directory (default: ./) |
| --provider | No | Provider to use (local, fal, replicate, runpod) |
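
As an illustration of the --in constraint listed above, a caller could check the file extension before invoking the tool. This is a minimal sketch, not part of agent-media: the helper name `has_supported_extension` is hypothetical, and whether the CLI itself decides by extension or by file content is not stated in this document.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats listed for --in in the options table.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def has_supported_extension(path_or_url: str) -> bool:
    """Return True if the path or URL ends in one of the listed audio extensions.

    Hypothetical helper: it only inspects the name, not the file content.
    """
    parsed = urlparse(path_or_url)
    # For http(s) URLs, look only at the path component so query strings are ignored.
    target = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    return PurePosixPath(target).suffix.lower() in SUPPORTED_EXTENSIONS
```

A video file such as video.mp4 fails this check, which matches the document's advice to extract the audio first.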
选项是否必填描述
--in
输入音频文件路径或URL(支持mp3、wav、m4a、ogg格式)
--diarize
启用说话人识别功能
--language
语言代码(未提供时自动检测)
--speakers
语音分离的说话人数量提示
--out
输出路径、文件名或目录(默认值:./)
--provider
使用的服务提供商(local、fal、replicate、runpod)

Output

Returns a JSON object with transcription data:
json
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
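
To show how the segment-level timing in this JSON might be consumed, here is a small sketch that turns the sample response into one readable line per segment. The function names are hypothetical, and the fallback label for a missing "speaker" field is an assumption: this document does not say what segments look like when diarization is off.

```python
import json

# Sample response copied from the documented output.
SAMPLE = """
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
"""


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.s with one decimal place."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:04.1f}"


def render_segments(result: dict) -> list[str]:
    """Return one formatted line per segment of a successful result."""
    if not result.get("ok"):
        raise ValueError("transcription did not succeed")
    lines = []
    for seg in result["transcription"]["segments"]:
        # Assumption: "speaker" may be absent without --diarize, so fall back to a label.
        speaker = seg.get("speaker", "UNKNOWN")
        span = f"{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}"
        lines.append(f"[{span}] {speaker}: {seg['text']}")
    return lines


if __name__ == "__main__":
    for line in render_segments(json.loads(SAMPLE)):
        print(line)
```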

Examples

Basic transcription (auto-detect language):
bash
agent-media audio transcribe --in interview.mp3
Transcription with speaker identification:
bash
agent-media audio transcribe --in meeting.wav --diarize
Transcription with specific language and speaker count:
bash
agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
Use specific provider:
bash
agent-media audio transcribe --in audio.wav --provider replicate
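
When these commands are driven from a script, the same invocations can be assembled as an argument list. The sketch below uses only the flags documented in this section; the function name `build_transcribe_command` is hypothetical and not part of agent-media.

```python
from typing import Optional


def build_transcribe_command(
    input_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list for `agent-media audio transcribe`."""
    cmd = ["agent-media", "audio", "transcribe", "--in", input_path]
    if diarize:
        cmd.append("--diarize")
    if language:
        cmd += ["--language", language]
    if speakers is not None:
        cmd += ["--speakers", str(speakers)]
    if out:
        cmd += ["--out", out]
    if provider:
        cmd += ["--provider", provider]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.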

Extracting Audio from Video

To transcribe a video file, first extract the audio:
bash
# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3

Providers

local

Runs locally on CPU using Transformers.js, no API key required.
  • Uses Moonshine model (5x faster than Whisper)
  • Models downloaded on first use (~100MB)
  • Does NOT support diarization — use fal or replicate for speaker identification
  • You may see a "mutex lock failed" error; it can be ignored, and the output is correct if the response contains "ok": true
bash
agent-media audio transcribe --in audio.mp3 --provider local
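
Because the local provider may print a "mutex lock failed" message even on success, a script can judge the run by the "ok" field instead of by the presence of error text. The following is a minimal sketch under one assumption that this document does not confirm: that the JSON result arrives on standard output as a single document. The function name is hypothetical.

```python
import json


def transcription_succeeded(stdout_text: str) -> bool:
    """Return True only if the captured output parses as JSON with "ok": true.

    Assumption: the JSON result is the whole of stdout; stray messages such as
    "mutex lock failed" are assumed to be written elsewhere (e.g. stderr).
    """
    try:
        payload = json.loads(stdout_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("ok") is True
```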

fal

  • Requires FAL_API_KEY
  • Uses the wizper model for fast transcription (2x faster) when diarization is disabled
  • Uses the whisper model when diarization is enabled (native support)

replicate

  • Requires REPLICATE_API_TOKEN
  • Uses the whisper-diarization model with Whisper Large V3 Turbo
  • Native diarization support with word-level timestamps

runpod

  • Requires RUNPOD_API_KEY
  • Uses the pruna/whisper-v3-large model (Whisper Large V3)
  • Does NOT support diarization (speaker identification) - use fal or replicate for diarization
bash
agent-media audio transcribe --in audio.mp3 --provider runpod
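
The provider notes above can be condensed into a small selection routine. This is only a sketch: the capability table restates what this section says, but the preference order (local first, then fal, replicate, runpod) and the function name `pick_provider` are assumptions made for illustration, not behavior of agent-media itself.

```python
import os
from typing import Mapping, Optional

# Capabilities as described in the provider sections:
# (name, required environment variable or None, supports diarization)
PROVIDERS = [
    ("local", None, False),
    ("fal", "FAL_API_KEY", True),
    ("replicate", "REPLICATE_API_TOKEN", True),
    ("runpod", "RUNPOD_API_KEY", False),
]


def pick_provider(
    need_diarization: bool,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first usable provider name, or None if none qualifies.

    The ordering of PROVIDERS is an assumed preference, not a documented one.
    """
    env = os.environ if env is None else env
    for name, key_var, diarizes in PROVIDERS:
        if need_diarization and not diarizes:
            continue  # provider cannot identify speakers
        if key_var is not None and not env.get(key_var):
            continue  # required credential is not set
        return name
    return None
```

For example, with only REPLICATE_API_TOKEN set, `pick_provider(True)` returns "replicate", and the result can be passed to --provider.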