Speech-to-text transcription using Whisper with word-level timestamps. Use when users ask to transcribe audio or video to text, generate subtitles, or recognize speech.
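Install:
npx skill4agent add maxgent-ai/maxgent-plugin audio-transcribe

ls -la "$INPUT_FILE"
<original_name>.txt
transcribe.py

Usage:
uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]

Options (descriptions taken from the examples that follow):
--model, -m      Whisper model to use (e.g. medium)
--language, -l   Language code (e.g. zh); auto-detected when omitted
--no-align       Skip word alignment (faster)
--no-vad         Disable VAD filtering
--output, -o     Output file path
--format, -f     Output format (e.g. srt, json)

Examples:
# Basic transcription (auto-detect language)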
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"
# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"
# Fast transcription, skip word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"
# Use a larger model, output JSON (with word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"
# Disable VAD filtering (fix time jumps / missing segments)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"

Plain-text output:
[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence

SRT output:
1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence

JSON output:
[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
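
The JSON output carries the same segment boundaries as the text and SRT outputs, plus per-word timing. As a rough sketch of how a downstream script might consume it, the Python below turns a list of segments into SRT text and picks out words whose score falls below a threshold. The helper names (srt_timestamp, segments_to_srt, low_score_words) and the SAMPLE data are hypothetical and are not part of transcribe.py. Reading "score" as a per-word confidence value is an assumption, and only the fields shown in the sample above are relied on.

```python
import json

# Illustrative sample mirroring the JSON shape shown above. The second segment
# reuses the timings from the text/SRT examples; its "words" list is left empty.
SAMPLE = json.loads("""
[
  {"start": 0.0, "end": 3.5, "text": "This is the first sentence",
   "words": [{"word": "This", "start": 0.0, "end": 0.5, "score": 0.95}]},
  {"start": 3.5, "end": 7.2, "text": "This is the second sentence", "words": []}
]
""")


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (SRT style)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"


def low_score_words(segments: list, threshold: float = 0.5) -> list:
    """Collect words whose score is below threshold (assumed to be a confidence)."""
    return [
        word["word"]
        for seg in segments
        for word in seg.get("words", [])
        if word.get("score", 1.0) < threshold
    ]


print(segments_to_srt(SAMPLE))
```

To run this against a real file, SAMPLE could be replaced with json.load() on the path passed to -o (for example "result.json"), assuming that file has the layout shown above.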