Convert audio/video to text using Whisper, with support for word-level timestamps. Use this skill when users need speech-to-text conversion, audio or video transcription, subtitle generation, or general speech recognition.
## Installation

```bash
npx skill4agent add infquest/vibe-ops-plugin audio-transcribe
```

## Usage

Verify the input file exists before transcribing:

```bash
ls -la "$INPUT_FILE"
```

Then run the transcription script:

```bash
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
```

If no output path is given, the transcript is named after the input file (e.g. `original-filename.txt`).

Options:

- `--model`, `-m`: Whisper model size (e.g. `medium`)
- `--language`, `-l`: language code (e.g. `zh`); auto-detected if omitted
- `--no-align`: skip word-level alignment for faster transcription
- `--no-vad`: disable VAD filtering (use if timestamps jump or segments are omitted)
- `--output`, `-o`: output file path
- `--format`, `-f`: output format (`txt`, `srt`, or `json`)

## Examples

```bash
# Basic transcription (auto-detect language)
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"

# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"

# Fast transcription without word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"

# Use a larger model, output JSON (includes word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"

# Disable VAD filtering (resolves time-jump/omission issues)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
```

## Output formats

TXT (timestamped plain text):

```
[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
```
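Downstream steps often need these TXT lines as structured data. A minimal parsing sketch, assuming the `[HH:MM:SS.mmm - HH:MM:SS.mmm] text` layout shown above (the `parse_txt_line` helper and its regex are illustrative, not part of the skill):

```python
import re

# Matches "[HH:MM:SS.mmm - HH:MM:SS.mmm] text" lines from the TXT output.
LINE_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3}) - (\d{2}):(\d{2}):(\d{2})\.(\d{3})\] (.*)"
)

def parse_txt_line(line):
    """Return (start_seconds, end_seconds, text), or None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2, text = m.groups()
    start = int(h1) * 3600 + int(m1) * 60 + int(s1) + int(ms1) / 1000
    end = int(h2) * 3600 + int(m2) * 60 + int(s2) + int(ms2) / 1000
    return start, end, text

print(parse_txt_line("[00:00:00.000 - 00:00:03.500] This is the first sentence"))
# (0.0, 3.5, 'This is the first sentence')
```

Lines that don't match the expected layout return `None`, so malformed output can simply be skipped.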
SRT (subtitles):

```
1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence
```

JSON (includes word-level timestamps):

```json
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
```
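Since the JSON segments carry the same `start`/`end`/`text` data as the SRT output, subtitles can be derived from a JSON result in a few lines. A minimal sketch under that assumption (`to_srt_time` and `segments_to_srt` are illustrative helpers, not part of the skill):

```python
import json

def to_srt_time(seconds):
    # SRT timestamps use a comma as the decimal separator: HH:MM:SS,mmm
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # to_srt_time(3.5) -> "00:00:03,500"

def segments_to_srt(segments):
    # Number blocks from 1, as in the SRT example above.
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

# Sample data matching the JSON format above (normally: json.load(open("result.json")))
segments = json.loads("""[
  {"start": 0.0, "end": 3.5, "text": "This is the first sentence"},
  {"start": 3.5, "end": 7.2, "text": "This is the second sentence"}
]""")
print(segments_to_srt(segments))
```

The word-level `words` entries are ignored here; they become useful for finer-grained outputs such as karaoke-style per-word subtitles.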