audio-transcribe


Audio Transcriber

Speech recognition using WhisperX with multi-language support and word-level timestamp alignment.

Prerequisites

Requires Python 3.12 (uv manages this automatically).

Usage

When the user wants to transcribe audio/video: $ARGUMENTS

Instructions

Step 1: Get input file

If the user has not provided an input file path, ask them to provide one.
Supported formats:
  • Audio: MP3, WAV, FLAC, M4A, OGG, etc.
  • Video: MP4, MKV, MOV, AVI, etc. (audio is extracted automatically)
Verify the file exists:
```bash
ls -la "$INPUT_FILE"
```
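Beyond checking existence, the format screen above can be sketched in Python. This is a hedged illustration only: `classify_input` and the extension sets are assumptions, not part of the skill, and the real pipeline may instead rely on ffmpeg's own format detection.

```python
from pathlib import Path

# Hypothetical helper (not part of transcribe.py): screen inputs by extension.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}

def classify_input(path: str) -> str:
    """Return 'audio', 'video', or 'unsupported' based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return "unsupported"

print(classify_input("talk.MP4"))  # video
```

For video inputs the audio track is extracted automatically, so both classes proceed to the same transcription path.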

Step 2: Ask user for configuration


Warning: You MUST use AskUserQuestion to collect user preferences. Do not skip this step.
Use AskUserQuestion to collect the following:
  1. Model size: Choose the recognition model
    • Options:
      • "base - Balanced speed and accuracy (Recommended)"
      • "tiny - Fastest, lower accuracy"
      • "small - Faster, moderate accuracy"
      • "medium - Slower, higher accuracy"
      • "large-v2 - Slowest, highest accuracy"
  2. Language: What language is the audio?
    • Options:
      • "Auto-detect (Recommended)"
      • "Chinese (zh)"
      • "English (en)"
      • "Japanese (ja)"
      • "Other"
  3. Word-level alignment: Do you need word-level timestamps?
    • Options:
      • "Yes - Precise timing for each word (Recommended)"
      • "No - Sentence-level timing only (faster)"
  4. Output format: Which format should the transcript use?
    • Options:
      • "TXT - Plain text with timestamps (Recommended)"
      • "SRT - Subtitle format"
      • "VTT - Web subtitle format"
      • "JSON - Structured data (with word-level info)"
  5. Output path: Where to save?
    • Default: same directory as input file, named
      <original_name>.txt
      (or matching format)
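The default-output rule in option 5 (same directory, original name, extension matching the chosen format) can be sketched as follows; `default_output` is a hypothetical helper for illustration, not part of transcribe.py:

```python
from pathlib import Path

# Derive the default output path: same directory and stem as the input,
# with the suffix swapped for the chosen format (txt/srt/vtt/json).
def default_output(input_file: str, fmt: str = "txt") -> str:
    return str(Path(input_file).with_suffix(f".{fmt}"))

print(default_output("/media/interview.mp4", "srt"))  # /media/interview.srt
```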

Step 3: Run transcription script


Use the transcribe.py script in the skill directory:

```bash
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
```

Parameters:
  • --model, -m: Model size (tiny/base/small/medium/large-v2)
  • --language, -l: Language code (en/zh/ja/...); auto-detects if not specified
  • --no-align: Skip word-level alignment
  • --no-vad: Disable VAD filtering (use if the transcription has time jumps or missing segments)
  • --output, -o: Output file path
  • --format, -f: Output format (srt/vtt/txt/json)

Examples:

Basic transcription (auto-detect language)

```bash
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"
```

Chinese transcription, output SRT subtitles

```bash
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"
```

Fast transcription, skip word alignment

```bash
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"
```

Use a larger model, output JSON (with word-level timestamps)

```bash
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"
```

Disable VAD filtering (fix time jumps / missing segments)

```bash
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
```
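The examples above all follow the same pattern, so the command can be assembled from the configuration collected in step 2. A minimal sketch, assuming hypothetical names (`build_command` and the cfg key names are not part of the skill):

```python
# Assemble the transcribe.py invocation from collected answers.
# cfg keys ("model", "language", "align", "format", "output") are assumptions.
def build_command(input_file: str, cfg: dict) -> list[str]:
    cmd = ["uv", "run", "skills/audio-transcribe/transcribe.py", input_file]
    if cfg.get("model"):
        cmd += ["-m", cfg["model"]]
    if cfg.get("language"):          # omit to let WhisperX auto-detect
        cmd += ["-l", cfg["language"]]
    if not cfg.get("align", True):   # word-level alignment is on by default
        cmd.append("--no-align")
    cmd += ["-f", cfg.get("format", "txt"), "-o", cfg["output"]]
    return cmd

print(" ".join(build_command(
    "audio.mp3",
    {"language": "zh", "format": "srt", "output": "subtitles.srt"},
)))
```

Passing a list (rather than a shell string) to a subprocess runner also sidesteps quoting issues in file names with spaces.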

Step 4: Present results


After transcription completes:
  1. Show the full output file path
  2. Display a preview of the transcription content
  3. Report total duration and segment count

Output format reference

TXT format

[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
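The bracketed timestamp layout can be reproduced with a few lines of Python. This is an illustrative sketch of the formatting only, not the skill's own code:

```python
# Format seconds as HH:MM:SS.mmm, as in the TXT sample above.
def fmt_ts(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def txt_line(start: float, end: float, text: str) -> str:
    return f"[{fmt_ts(start)} - {fmt_ts(end)}] {text}"

print(txt_line(0.0, 3.5, "This is the first sentence"))
# [00:00:00.000 - 00:00:03.500] This is the first sentence
```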

SRT format

1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence
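Note that SRT differs from the TXT layout in two ways: cues are numbered, and the millisecond separator is a comma rather than a dot. A sketch of one cue (illustrative only, not the skill's code):

```python
# Format seconds as HH:MM:SS,mmm — SRT uses a comma before the milliseconds.
def srt_ts(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue: index line, timing line, text, trailing newline."""
    return f"{index}\n{srt_ts(start)} --> {srt_ts(end)}\n{text}\n"

print(srt_cue(1, 0.0, 3.5, "This is the first sentence"))
```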

JSON format (with word-level timestamps)

```json
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
```
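The segment and duration figures reported in step 4 can be derived directly from this JSON shape. A hedged sketch, assuming output matching the sample schema above:

```python
import json

# Inline sample mirroring the schema above (the real data comes from -f json).
sample = """[
  {"start": 0.0, "end": 3.5, "text": "This is the first sentence",
   "words": [{"word": "This", "start": 0.0, "end": 0.5, "score": 0.95}]},
  {"start": 3.5, "end": 7.2, "text": "This is the second sentence", "words": []}
]"""

segments = json.loads(sample)
# Total duration = end time of the last-ending segment.
duration = max(seg["end"] for seg in segments) if segments else 0.0
print(f"{len(segments)} segments, {duration:.1f}s total")  # 2 segments, 7.2s total
```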

Troubleshooting


Slow on first run:
  • WhisperX needs to download model files; first run will be slower
  • Subsequent runs use the cached model
Out of memory:
  • Use a smaller model (tiny or base)
  • Ensure the system has enough memory
Low recognition accuracy:
  • Try a larger model (medium or large-v2)
  • Explicitly specify the language instead of auto-detect