audio-transcribe


Audio Transcriber

Speech recognition using WhisperX with multi-language support and word-level timestamp alignment.

Prerequisites

Requires Python 3.12 (uv manages this automatically).

Usage

When the user wants to transcribe audio/video: $ARGUMENTS

Instructions

Step 1: Get input file

If the user has not provided an input file path, ask them to provide one.
Supported formats:
  • Audio: MP3, WAV, FLAC, M4A, OGG, etc.
  • Video: MP4, MKV, MOV, AVI, etc. (audio is extracted automatically)
Verify the file exists:
```bash
ls -la "$INPUT_FILE"
```
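Beyond checking existence, the format screen above can be sketched in Python. This is a hedged illustration only: `classify_input` and the extension sets are assumptions, not part of the skill, and the real pipeline may instead rely on ffmpeg's own format detection.

```python
from pathlib import Path

# Hypothetical helper (not part of transcribe.py): screen inputs by extension.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi"}

def classify_input(path: str) -> str:
    """Return 'audio', 'video', or 'unsupported' based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return "unsupported"

print(classify_input("talk.MP4"))  # video
```

For video inputs the audio track is extracted automatically, so both classes proceed to the same transcription path.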

Step 2: Ask user for configuration


Warning: You MUST use AskUserQuestion to collect user preferences. Do not skip this step.
Use AskUserQuestion to collect the following:
  1. Model size: Choose the recognition model
    • Options:
      • "base - Balanced speed and accuracy (Recommended)"
      • "tiny - Fastest, lower accuracy"
      • "small - Faster, moderate accuracy"
      • "medium - Slower, higher accuracy"
      • "large-v2 - Slowest, highest accuracy"
  2. Language: What language is the audio?
    • Options:
      • "Auto-detect (Recommended)"
      • "Chinese (zh)"
      • "English (en)"
      • "Japanese (ja)"
      • "Other"
  3. Word-level alignment: Do you need word-level timestamps?
    • Options:
      • "Yes - Precise timing for each word (Recommended)"
      • "No - Sentence-level timing only (faster)"
  4. Output format: Which format should the transcript use?
    • Options:
      • "TXT - Plain text with timestamps (Recommended)"
      • "SRT - Subtitle format"
      • "VTT - Web subtitle format"
      • "JSON - Structured data (with word-level info)"
  5. Output path: Where to save?
    • Default: same directory as input file, named
      <original_name>.txt
      (or matching format)
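The default-output rule in option 5 (same directory, original name, extension matching the chosen format) can be sketched as follows; `default_output` is a hypothetical helper for illustration, not part of transcribe.py:

```python
from pathlib import Path

# Derive the default output path: same directory and stem as the input,
# with the suffix swapped for the chosen format (txt/srt/vtt/json).
def default_output(input_file: str, fmt: str = "txt") -> str:
    return str(Path(input_file).with_suffix(f".{fmt}"))

print(default_output("/media/interview.mp4", "srt"))  # /media/interview.srt
```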

Step 3: Run transcription script


Use the transcribe.py script in the skill directory:

```bash
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
```

Parameters:
  • --model, -m: Model size (tiny/base/small/medium/large-v2)
  • --language, -l: Language code (en/zh/ja/...); auto-detects if not specified
  • --no-align: Skip word-level alignment
  • --no-vad: Disable VAD filtering (use if the transcription has time jumps or missing segments)
  • --output, -o: Output file path
  • --format, -f: Output format (srt/vtt/txt/json)

Examples:

Basic transcription (auto-detect language)

```bash
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"
```

Chinese transcription, output SRT subtitles

```bash
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"
```

Fast transcription, skip word alignment

```bash
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"
```

Use a larger model, output JSON (with word-level timestamps)

```bash
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"
```

Disable VAD filtering (fix time jumps / missing segments)

```bash
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
```
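The examples above all follow the same pattern, so the command can be assembled from the configuration collected in step 2. A minimal sketch, assuming hypothetical names (`build_command` and the cfg key names are not part of the skill):

```python
# Assemble the transcribe.py invocation from collected answers.
# cfg keys ("model", "language", "align", "format", "output") are assumptions.
def build_command(input_file: str, cfg: dict) -> list[str]:
    cmd = ["uv", "run", "skills/audio-transcribe/transcribe.py", input_file]
    if cfg.get("model"):
        cmd += ["-m", cfg["model"]]
    if cfg.get("language"):          # omit to let WhisperX auto-detect
        cmd += ["-l", cfg["language"]]
    if not cfg.get("align", True):   # word-level alignment is on by default
        cmd.append("--no-align")
    cmd += ["-f", cfg.get("format", "txt"), "-o", cfg["output"]]
    return cmd

print(" ".join(build_command(
    "audio.mp3",
    {"language": "zh", "format": "srt", "output": "subtitles.srt"},
)))
```

Passing a list (rather than a shell string) to a subprocess runner also sidesteps quoting issues in file names with spaces.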

Step 4: Present results


After transcription completes:
  1. Show the full output file path
  2. Display a preview of the transcription content
  3. Report total duration and segment count

Output format reference

TXT format

[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
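The bracketed timestamp layout can be reproduced with a few lines of Python. This is an illustrative sketch of the formatting only, not the skill's own code:

```python
# Format seconds as HH:MM:SS.mmm, as in the TXT sample above.
def fmt_ts(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def txt_line(start: float, end: float, text: str) -> str:
    return f"[{fmt_ts(start)} - {fmt_ts(end)}] {text}"

print(txt_line(0.0, 3.5, "This is the first sentence"))
# [00:00:00.000 - 00:00:03.500] This is the first sentence
```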

SRT format

1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence
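Note that SRT differs from the TXT layout in two ways: cues are numbered, and the millisecond separator is a comma rather than a dot. A sketch of one cue (illustrative only, not the skill's code):

```python
# Format seconds as HH:MM:SS,mmm — SRT uses a comma before the milliseconds.
def srt_ts(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue: index line, timing line, text, trailing newline."""
    return f"{index}\n{srt_ts(start)} --> {srt_ts(end)}\n{text}\n"

print(srt_cue(1, 0.0, 3.5, "This is the first sentence"))
```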

JSON format (with word-level timestamps)

```json
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
```
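The segment and duration figures reported in step 4 can be derived directly from this JSON shape. A hedged sketch, assuming output matching the sample schema above:

```python
import json

# Inline sample mirroring the schema above (the real data comes from -f json).
sample = """[
  {"start": 0.0, "end": 3.5, "text": "This is the first sentence",
   "words": [{"word": "This", "start": 0.0, "end": 0.5, "score": 0.95}]},
  {"start": 3.5, "end": 7.2, "text": "This is the second sentence", "words": []}
]"""

segments = json.loads(sample)
# Total duration = end time of the last-ending segment.
duration = max(seg["end"] for seg in segments) if segments else 0.0
print(f"{len(segments)} segments, {duration:.1f}s total")  # 2 segments, 7.2s total
```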

Troubleshooting


Slow on first run:
  • WhisperX needs to download model files; first run will be slower
  • Subsequent runs use the cached model
Out of memory:
  • Use a smaller model (tiny or base)
  • Ensure the system has enough memory
Low recognition accuracy:
  • Try a larger model (medium or large-v2)
  • Explicitly specify the language instead of auto-detect