audio-transcribe
Audio Transcribe
Transcribes audio files to text with timestamps. Supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.
Command
```bash
agent-media audio transcribe --in <path> [options]
```
Inputs
| Option | Required | Description |
|---|---|---|
| `--in` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language` | No | Language code (auto-detected if not provided) |
| `--speakers` | No | Number of speakers hint for diarization |
| | No | Output path, filename or directory (default: ./) |
| `--provider` | No | Provider to use (local, fal, replicate, runpod) |
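When the command is driven from a script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of agent-media itself: the helper name `build_transcribe_args` and its parameters are hypothetical, and it only uses flags that appear in the examples in this document.

```python
from typing import List, Optional


def build_transcribe_args(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    args = ["agent-media", "audio", "transcribe", "--in", in_path]
    if diarize:
        args.append("--diarize")  # boolean flag, takes no value
    if language is not None:
        args += ["--language", language]  # omitted means auto-detect
    if speakers is not None:
        args += ["--speakers", str(speakers)]  # hint used by diarization
    if provider is not None:
        args += ["--provider", provider]
    return args


if __name__ == "__main__":
    print(build_transcribe_args("podcast.mp3", diarize=True, language="en", speakers=3))
```

The resulting list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with paths that contain spaces.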
Output
Returns a JSON object with transcription data:
```json
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
```
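To show how the segment-level timing might be consumed, here is a small sketch that turns the JSON shape above into readable transcript lines. The function name `format_segments` is hypothetical, and the fallback label for a missing `speaker` field is an assumption (this document does not show the output shape when diarization is off).

```python
import json
from typing import List

# Sample copied from the output shape documented above.
SAMPLE = """
{
  "ok": true,
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
"""


def format_segments(result_json: str) -> List[str]:
    """Render each segment as '[start-end] SPEAKER: text' (hypothetical helper)."""
    result = json.loads(result_json)
    if not result.get("ok"):
        raise ValueError("transcription did not succeed")
    lines = []
    for seg in result["transcription"]["segments"]:
        # Assumption: 'speaker' may be absent when diarization is not enabled.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {seg['text']}")
    return lines


if __name__ == "__main__":
    for line in format_segments(SAMPLE):
        print(line)
```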
Examples
Basic transcription (auto-detect language):
```bash
agent-media audio transcribe --in interview.mp3
```
Transcription with speaker identification:
```bash
agent-media audio transcribe --in meeting.wav --diarize
```
Transcription with specific language and speaker count:
```bash
agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
```
Use a specific provider:
```bash
agent-media audio transcribe --in audio.wav --provider replicate
```
Extracting Audio from Video
To transcribe a video file, first extract the audio:
```bash
# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3
```
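The two steps can be chained in a script so the extracted filename does not have to be copied by hand. The sketch below rests on an assumption this document does not confirm: that the extract step reports a JSON object with `ok` and `output_path` fields, like the transcribe output does. The helper name `transcribe_command_from_extract` is hypothetical.

```python
import json
from typing import List


def transcribe_command_from_extract(extract_json: str) -> List[str]:
    """Build the step 2 command from the step 1 result (hypothetical helper).

    Assumption: `agent-media audio extract` reports JSON containing
    'ok' and 'output_path', mirroring the transcribe output shape.
    """
    result = json.loads(extract_json)
    if not result.get("ok"):
        raise ValueError("audio extraction did not succeed")
    return ["agent-media", "audio", "transcribe", "--in", result["output_path"]]


if __name__ == "__main__":
    # Illustrative input only; the real extract output may differ.
    fake_extract = '{"ok": true, "output_path": "extracted_xxx.mp3"}'
    print(transcribe_command_from_extract(fake_extract))
```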
Providers
local
Runs locally on CPU using Transformers.js; no API key required.
- Uses the Moonshine model (5x faster than Whisper)
- Models are downloaded on first use (~100MB)
- Does NOT support diarization; use fal or replicate for speaker identification
- You may see a `mutex lock failed` error; ignore it, the output is correct if `"ok": true`
```bash
agent-media audio transcribe --in audio.mp3 --provider local
```
fal
- Requires `FAL_API_KEY`
- Uses the `wizper` model for fast transcription (2x faster) when diarization is disabled
- Uses the `whisper` model when diarization is enabled (native support)
replicate
- Requires `REPLICATE_API_TOKEN`
- Uses the `whisper-diarization` model with Whisper Large V3 Turbo
- Native diarization support with word-level timestamps
runpod
- Requires `RUNPOD_API_KEY`
- Uses the `pruna/whisper-v3-large` model (Whisper Large V3)
- Does NOT support diarization (speaker identification); use fal or replicate for diarization
```bash
agent-media audio transcribe --in audio.mp3 --provider runpod
```
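The provider notes above boil down to two facts per provider: which environment variable it needs and whether it supports diarization. As a rough sketch, a script could pick a provider from those facts before calling the CLI. The helper name `pick_provider` and the preference order (hosted providers first, local as fallback) are assumptions for illustration, not agent-media behavior.

```python
from typing import Mapping, Optional

# (required environment variable, supports diarization), per the notes above.
# Dict order doubles as the preference order, which is an assumption.
PROVIDERS = {
    "fal": ("FAL_API_KEY", True),
    "replicate": ("REPLICATE_API_TOKEN", True),
    "runpod": ("RUNPOD_API_KEY", False),
    "local": (None, False),  # no API key required
}


def pick_provider(diarize: bool, env: Mapping[str, str]) -> Optional[str]:
    """Return the first usable provider name, or None if none qualifies."""
    for name, (key, supports_diarization) in PROVIDERS.items():
        if diarize and not supports_diarization:
            continue  # local and runpod cannot identify speakers
        if key is None or env.get(key):
            return name
    return None


if __name__ == "__main__":
    import os

    print(pick_provider(diarize=True, env=os.environ))
```

Passing the environment as a parameter keeps the function easy to test; in a real script `os.environ` would be the usual argument.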