# Audio Transcriber
Perform speech recognition using WhisperX, supporting multiple languages and word-level timestamp alignment.
## Prerequisites
Python 3.12 is required (uv will manage this automatically).
## Usage
When the user wants to transcribe audio/video: $ARGUMENTS
## Instructions
You are a speech-to-text assistant, using WhisperX to help users convert audio to text. Please follow these steps:
### Step 1: Obtain Input File
If the user hasn't provided an input file path, ask them to provide one.
Supported formats:
- Audio: MP3, WAV, FLAC, M4A, OGG, etc.
- Video: MP4, MKV, MOV, AVI, etc. (audio will be extracted automatically)
Verify that the file exists before running the transcription.
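A minimal sketch of the existence check. `check_input` and the `SUPPORTED` set are hypothetical helpers for illustration, not part of `transcribe.py`:

```python
from pathlib import Path

# Extensions listed above; other containers may still work if ffmpeg can decode them.
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a", ".ogg",
             ".mp4", ".mkv", ".mov", ".avi"}

def check_input(path_str: str) -> Path:
    """Return the resolved input path, or raise if the file is missing."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.suffix.lower() not in SUPPORTED:
        print(f"Note: unrecognized extension '{path.suffix}'; will try anyway.")
    return path
```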
### Step 2: Ask User for Configuration
⚠️ Required: Use the AskUserQuestion tool to collect user preferences. Do not skip this step.
Use the AskUserQuestion tool to collect the following information:
- Model Size: Select the recognition model
  - Options:
    - "base - Balanced speed and accuracy (Recommended)"
    - "tiny - Fastest, lower accuracy"
    - "small - Faster, moderate accuracy"
    - "medium - Slower, higher accuracy"
    - "large-v2 - Slowest, highest accuracy"
- Language: What language is the audio in?
  - Options:
    - "Auto-detect (Recommended)"
    - "Chinese (zh)"
    - "English (en)"
    - "Japanese (ja)"
    - "Other languages"
- Word-level Alignment: Do you need word-level timestamps?
  - Options:
    - "Yes - Precise to each word's timing (Recommended)"
    - "No - Only sentence-level timestamps (faster)"
- Output Format: What output format do you need?
  - Options:
    - "TXT - Plain text with timestamps (Recommended)"
    - "SRT - Subtitle format"
    - "VTT - Web subtitle format"
    - "JSON - Structured data (includes word-level information)"
- Output Path: Where to save the output?
  - Recommended default: same directory as the input file, keeping the input filename and swapping the extension to match the chosen format
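The default output path described above can be sketched as follows. `default_output` is a hypothetical helper, not a function of `transcribe.py`:

```python
from pathlib import Path

def default_output(input_path: str, fmt: str) -> Path:
    """Same directory and filename as the input, with the chosen format's extension."""
    return Path(input_path).with_suffix(f".{fmt}")

# default_output("/media/talk.mp4", "srt") -> Path("/media/talk.srt")
```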
### Step 3: Execute Transcription Script

Use the `transcribe.py` script in the skill directory:

```bash
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
```
Parameter explanations:
- `-m`, `--model`: Model size (tiny/base/small/medium/large-v2)
- `-l`, `--language`: Language code (en/zh/ja/...); auto-detect if not specified
- `--no-align`: Skip word-level alignment
- `--no-vad`: Disable VAD filtering (use this option if the transcription has time jumps or omissions)
- `-o`, `--output`: Output file path
- `-f`, `--format`: Output format (srt/vtt/txt/json)
Examples:
```bash
# Basic transcription (auto-detect language)
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"

# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"

# Fast transcription without word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"

# Use a larger model, output JSON (includes word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"

# Disable VAD filtering (resolves time jumps/omissions)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
```
### Step 4: Display Results
After transcription is complete:
- Tell the user the full path of the output file
- Show a preview of part of the transcribed content
- Report the total duration and number of paragraphs
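The reporting step above could look like this sketch, assuming the segment structure from the JSON format described later in this document; `summarize` is a hypothetical helper:

```python
def summarize(segments: list[dict], preview: int = 3) -> None:
    """Print total duration and segment count, plus the first few segments."""
    duration = max(s["end"] for s in segments) if segments else 0.0
    print(f"Total duration: {duration:.1f}s, segments: {len(segments)}")
    for seg in segments[:preview]:
        print(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text']}")
```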
## Output Format Explanations

### TXT Format

```
[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
```
### SRT Format

```
1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence
```
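Note that SRT uses a comma before the milliseconds, unlike the dot in the TXT format. Converting a start/end time in seconds to an SRT timestamp can be sketched as (hypothetical helper):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# srt_timestamp(3.5) -> "00:00:03,500"
```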
### JSON Format (with word-level info)

```json
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
```
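Consuming the JSON output downstream is straightforward. A sketch, assuming the list-of-segments structure shown above (`load_segments` and `all_words` are hypothetical helper names):

```python
import json
from pathlib import Path

def load_segments(json_path: str) -> list[dict]:
    """Read the JSON output: a list of segment dicts as shown above."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))

def all_words(segments: list[dict]) -> list[dict]:
    """Flatten word-level entries; a segment may lack "words" if alignment was skipped."""
    return [w for seg in segments for w in seg.get("words", [])]
```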
## Common Issue Handling
Slow first run:
- WhisperX needs to download model files, so the first run will be slow
- Subsequent runs will use cached models
Insufficient memory:
- Use a smaller model (tiny or base)
- Ensure the system has enough memory
Low recognition accuracy:
- Try using a larger model (medium or large-v2)
- Specify the language explicitly instead of using auto-detect
## Example Interaction
User: Help me convert this video to text
Assistant:
- Check uv ✓
- Ask for the video file path
- Use AskUserQuestion to inquire about model, language, format, etc.
- Execute transcription
- Show result preview and save path
## Interaction Style
- Use simple and friendly language
- Explain the differences between different model sizes
- Provide clear solutions if errors occur
- Give positive feedback after successful transcription