Convert audio/video to text using Whisper, with support for word-level timestamps. Use this skill when users need speech-to-text conversion, audio or video transcription, subtitle generation, or general speech recognition.
## Installation

```bash
npx skill4agent add infquest/vibe-ops-plugin audio-transcribe
```

## Usage

Verify the input file exists before transcribing:

```bash
ls -la "$INPUT_FILE"
```

Then run the transcription script:

```bash
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]
```

If no output path is given, the transcript is named after the input file (e.g. `original-filename.txt`).

Options:

- `--model`, `-m`: Whisper model size (e.g. `medium`)
- `--language`, `-l`: language code (e.g. `zh`); auto-detected if omitted
- `--no-align`: skip word-level alignment for faster transcription
- `--no-vad`: disable VAD filtering (use if timestamps jump or segments are omitted)
- `--output`, `-o`: output file path
- `--format`, `-f`: output format (`txt`, `srt`, or `json`)

## Examples

```bash
# Basic transcription (auto-detect language)
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"

# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"

# Fast transcription without word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"

# Use a larger model, output JSON (includes word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"

# Disable VAD filtering (resolves time-jump/omission issues)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"
```

## Output formats

TXT (timestamped plain text):

```
[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence
```
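Downstream steps often need these TXT lines as structured data. A minimal parsing sketch, assuming the `[HH:MM:SS.mmm - HH:MM:SS.mmm] text` layout shown above (the `parse_txt_line` helper and its regex are illustrative, not part of the skill):

```python
import re

# Matches "[HH:MM:SS.mmm - HH:MM:SS.mmm] text" lines from the TXT output.
LINE_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3}) - (\d{2}):(\d{2}):(\d{2})\.(\d{3})\] (.*)"
)

def parse_txt_line(line):
    """Return (start_seconds, end_seconds, text), or None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2, text = m.groups()
    start = int(h1) * 3600 + int(m1) * 60 + int(s1) + int(ms1) / 1000
    end = int(h2) * 3600 + int(m2) * 60 + int(s2) + int(ms2) / 1000
    return start, end, text

print(parse_txt_line("[00:00:00.000 - 00:00:03.500] This is the first sentence"))
# (0.0, 3.5, 'This is the first sentence')
```

Lines that don't match the expected layout return `None`, so malformed output can simply be skipped.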
SRT (subtitles):

```
1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence
```

JSON (includes word-level timestamps):

```json
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
```
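Since the JSON segments carry the same `start`/`end`/`text` data as the SRT output, subtitles can be derived from a JSON result in a few lines. A minimal sketch under that assumption (`to_srt_time` and `segments_to_srt` are illustrative helpers, not part of the skill):

```python
import json

def to_srt_time(seconds):
    # SRT timestamps use a comma as the decimal separator: HH:MM:SS,mmm
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # to_srt_time(3.5) -> "00:00:03,500"

def segments_to_srt(segments):
    # Number blocks from 1, as in the SRT example above.
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks)

# Sample data matching the JSON format above (normally: json.load(open("result.json")))
segments = json.loads("""[
  {"start": 0.0, "end": 3.5, "text": "This is the first sentence"},
  {"start": 3.5, "end": 7.2, "text": "This is the second sentence"}
]""")
print(segments_to_srt(segments))
```

The word-level `words` entries are ignored here; they become useful for finer-grained outputs such as karaoke-style per-word subtitles.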