# Video Transcript Extraction Expert
Input video link or local path → Auto detection → Background download → Slicing → Compression → Doubao transcription → Strict verbatim transcript (stdout + local save)
## Phase 0 · Locate Skill Root Directory (First Priority)
When triggered, first locate the skill root by running:
```bash
VT_HOME="$(
  for d in "$HOME/.claude/skills/video-transcript" \
           "$(pwd)/.claude/skills/video-transcript" \
           "$(pwd)/skills/video-transcript" \
           "$HOME/.claude/plugins/video-transcript/video-transcript"; do
    [ -f "$d/SKILL.md" ] && echo "$d" && break
  done
)"
export VT_HOME
echo "VT_HOME=$VT_HOME"
```
If the output is empty, the skill is installed in a non-standard location. Ask the user to provide the path, then set `VT_HOME` manually. After that, all commands must use `"$VT_HOME/scripts/transcript.py"`; do not hardcode relative or absolute paths.
## Phase 1 · Dependency Check (First Run / When Suspected)
On the first run, or whenever errors occur, run the check first:
```bash
python3 "$VT_HOME/scripts/transcript.py" --doctor
```
If any item shows ✗, inform the user:

- Missing ffmpeg / playwright / chromium / API Key, etc. → run the one-click installation wizard:

  ```bash
  bash "$VT_HOME/install.sh"
  ```

- Only proceed to Phase 2 when all checks are ✓.
## Phase 2 · Trigger Methods and Execution
Run directly when the user provides a video link (URL or local path):
```bash
python3 "$VT_HOME/scripts/transcript.py" "<URL or local path>"
```
Supported inputs:

- Bilibili: `https://www.bilibili.com/video/BVxxx` or short links
- Douyin: `https://www.douyin.com/video/xxx`, `douyin.com/jingxuan?modal_id=xxx`
- Xiaohongshu: `xiaohongshu.com/discovery/item/xxx`, `xiaohongshu.com/explore/xxx`
- YouTube
- Local video file path
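The URL families above can be told apart with a small classifier. The sketch below is purely illustrative; `detect_platform` and its patterns are assumptions for this example, not part of transcript.py (the real script does its own detection via a headless browser):

```python
import re

# Hypothetical URL classifier matching the input patterns listed above.
PATTERNS = {
    "bilibili": re.compile(r"bilibili\.com/video/BV\w+"),
    "douyin": re.compile(r"douyin\.com/(video/\w+|jingxuan\?modal_id=\w+)"),
    "xiaohongshu": re.compile(r"xiaohongshu\.com/(discovery/item|explore)/\w+"),
}

def detect_platform(url: str) -> str:
    """Return the platform name for a known URL, else treat it as a local path."""
    for name, pattern in PATTERNS.items():
        if pattern.search(url):
            return name
    return "local_or_other"
```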
The script automatically performs:

0. Detection — launch a headless browser, retrieve the title, duration, and direct video/audio links; print the 📊 evaluation table plus an estimated time
1. Download — reuse the direct links obtained during detection (no browser restart); Bilibili uses DASH streams (video and audio downloaded separately, then merged with ffmpeg); other platforms download directly as mp4
2. Long video slicing — videos longer than 8 min are automatically sliced into 6 min segments to avoid being summarized by the model
3. Compression — ffmpeg, target size 30 MB; resolution is selected automatically based on duration, in a single pass
4. Transcription — Doubao Video Understanding API with a strict verbatim prompt (retains colloquial words, internet memes, pauses)
5. Merge — segments of long videos are merged, with each segment's paragraph timestamps automatically shifted by its offset
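The slicing and merge behavior described above (slice anything over 8 min into 6 min segments, then shift each segment's paragraph timestamps by its start offset) can be sketched as follows. This is a simplified reconstruction for illustration, not the actual transcript.py implementation:

```python
# Simplified reconstruction of the slicing/offset rules described above.
SLICE_THRESHOLD = 8 * 60   # videos longer than 8 min are sliced
SEGMENT_LENGTH = 6 * 60    # each slice is at most 6 min

def plan_segments(duration_s: int) -> list[tuple[int, int]]:
    """Return (start, end) boundaries in seconds for each segment."""
    if duration_s <= SLICE_THRESHOLD:
        return [(0, duration_s)]
    return [(start, min(start + SEGMENT_LENGTH, duration_s))
            for start in range(0, duration_s, SEGMENT_LENGTH)]

def shift_timestamp(ts: str, offset_s: int) -> str:
    """Shift an 'MM:SS' paragraph timestamp by a segment's start offset."""
    minutes, seconds = map(int, ts.split(":"))
    total = minutes * 60 + seconds + offset_s
    return f"{total // 60:02d}:{total % 60:02d}"
```

For the 17 minute 12 second example used later in this document, `plan_segments(1032)` yields three segments, consistent with the "3 segments" shown in the evaluation table.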
### ⚠️ Two Mandatory Tasks for You (Agent)
#### (1) Repeat the Evaluation Table — Set User Expectations
As soon as the script starts, it prints the 📊 Video Detection Evaluation Table to stderr. You must repeat it to the user immediately, informing them of:
- Video title, duration, number of segments
- Estimated time (most important: it gives the user an expectation of the wait)
Do not wait until transcription is complete to respond: repeat the evaluation table first, then wait for the transcription. If the user cannot see the duration and estimated time, they may think the program has frozen.
Example:

```
Video detection completed:
- Platform: Bilibili / Title: 《xxx》
- Duration: 17 minutes 12 seconds, will be sliced into 3 segments for independent transcription
- Estimated time: 3 minutes 20 seconds ~ 5 minutes 25 seconds; running now, please wait...
```
#### (2) Output the Full Verbatim Transcript After Completion
At the end, the script prints the complete Markdown transcript to stdout. You must display the entire transcript verbatim in the conversation; do not merely say "saved to xxx.md", which leaves the user seeing nothing.
Correct approach:
- Capture the full Markdown content from stdout
- Repeat it fully to the user (title, duration, all paragraphs)
- Add a line at the end indicating the local save path, to help the user open the .md file later
Incorrect approaches:
- ❌ "Transcript saved to outputs/xxx.md, please check the file"
- ❌ Only show the first few paragraphs and omit the rest
- ❌ Summarize/simplify content (transcripts must be verbatim)
## Phase 3 · Output Destinations
The script will output to two destinations simultaneously:
- Full Markdown transcript directly to stdout — For the agent to repeat to the user (see mandatory requirement 2 in Phase 2)
- Saved locally to `$VT_HOME/outputs/<title>_transcript.md` — managed by the skill, so all outputs are stored in one place
Disable local save:
Change save path:
## Evaluation Table Example
```
═══════════════════════════════════════════════════════
📊 Video Detection
───────────────────────────────────────────────────────
Platform: Bilibili
Title: A city of 100,000 people disappeared between Zhejiang and Anhui
Duration: 17 minutes 12 seconds
Segments: 3 segments (each ≤ 6 minutes)
Estimated Time: 3 minutes 20 seconds ~ 5 minutes 25 seconds
═══════════════════════════════════════════════════════
```
## Output Format Example
```markdown
# Video Title
> Duration 5:32 | Source: <URL or filename>

## 1. Introduce Topic [00:00 - 00:42]
Hello everyone, today we're going to talk about...

## 2. Core Viewpoint [00:42 - 02:15]
Well, my opinion is this, first of all...
```
Features:

- Semantic segmentation: automatically divided into 3-8 segments by topic, each with a short subtitle
- Paragraph-level timestamps: each segment starts with a `[mm:ss - mm:ss]` marker for easy positioning
- Verbatim transcription: retains colloquial words ("uh", "well", "ah", "you know"), internet memes, and pauses; no summarizing or rewriting
- Silent segments: marked as `_(No voice here, XX seconds)_`
## Phase 4 · Exception Handling
| Scenario | Handling |
|---|---|
| Missing dependencies reported | Run `bash "$VT_HOME/install.sh"` |
| API Key not configured | Check the key configuration; if missing, ask the user to run install.sh to guide Key setup |
| API 401 / "Model not authorized" | Volcano Ark Console → Model Plaza → click "Activate" for Doubao-Seed-2.0-pro |
| Douyin graphic note (note link) | Inform the user that graphic content is not supported; only videos are supported |
| Crawling failure due to platform frontend revision | Check for a manual workaround |
| Long video is summarized instead of verbatim | Already handled automatically (>8 min sliced); report if the issue persists |
| Compressed video still > 50 MB | The script automatically iterates to reduce bitrate/resolution (max 4 rounds) |
## Command Line Options
| Parameter | Description |
|---|---|
| | Video URL or local file path (required, optional for `--doctor`) |
| | Video title (uses the detected title by default) |
| | Compression target size in MB, default 30 |
| | Do not save the .md file (saved to `$VT_HOME/outputs/` by default) |
| | Change the save path |
| `--doctor` | Check mode: verify dependencies and configuration |
## Notes
- Native Doubao video understanding: the model processes the video and its audio directly; no separate ASR step is required
- Timestamp precision is paragraph-level (not word/sentence-level), used for chapter positioning
- Full Markdown transcript is output directly to stdout by default (for upper-layer agent display); simultaneously saved locally
- Estimated time model (based on actual measurements): headless startup 10s + download (0.4s per video second) + 12s compression per segment + 30s API processing per segment, with ±20% range
- The API Key configuration is stored in a local file (permission 600, gitignored) and is not distributed with the repository
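The estimated-time model in the notes above can be sketched as follows. The constants come from that bullet; the actual estimator in transcript.py may weigh them differently, and parallelism or network speed will move real-world numbers:

```python
# Sketch of the estimated-time model from the notes above:
#   headless startup: 10 s
# + download: 0.4 s per second of video
# + per segment: 12 s compression + 30 s API processing
# with a ±20% range. Constants are copied from the bullet above;
# the real script's estimator may differ.

def estimate_seconds(duration_s: int, segments: int) -> tuple[float, float]:
    """Return a (low, high) wall-clock estimate in seconds."""
    base = 10 + 0.4 * duration_s + segments * (12 + 30)
    return base * 0.8, base * 1.2
```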