video-transcript


Video Transcript Extraction Expert (based on the Doubao video understanding model). Supports links from Bilibili, Douyin, Xiaohongshu, and YouTube, as well as local video files. Runs entirely headless in the background on the user's machine: no pop-up windows, and no need to log in to the video platforms. Outputs strict verbatim transcripts with semantic segmentation and paragraph-level timestamps (colloquialisms, internet memes, and pauses are retained). Long videos are automatically segmented so the model does not summarize them. Trigger scenarios:
  • The user says "generate transcript", "extract transcript", "convert to text", or "video to text"
  • The user says "dictate video", "extract video copy", or "video subtitles"
  • The user uses the /video-transcript command
  • The user pastes a video link (Bilibili/Douyin/Xiaohongshu/YouTube) intending to get a text version


NPX Install

npx skill4agent add backtthefuture/huangshu video-transcript


SKILL.md Content (translated from Chinese)


Video Transcript Extraction Expert

Input video link or local path → Auto detection → Background download → Slicing → Compression → Doubao transcription → Strict verbatim transcript (stdout + local save)

Phase 0 · Locate Skill Root Directory (First Priority)

When triggered, you must first locate yourself by running:

```bash
VT_HOME="$(
  for d in "$HOME/.claude/skills/video-transcript" \
           "$(pwd)/.claude/skills/video-transcript" \
           "$(pwd)/skills/video-transcript" \
           "$HOME/.claude/plugins/video-transcript/video-transcript"; do
    [ -f "$d/SKILL.md" ] && echo "$d" && break
  done
)"
export VT_HOME
echo "VT_HOME=$VT_HOME"
```
If the output is empty, the skill is installed in a non-standard location. Ask the user for the path, then run `export VT_HOME=<path>` manually. From then on, every command must use `"$VT_HOME/scripts/transcript.py"`; do not hardcode relative or absolute paths.

Phase 1 · Dependency Check (First Run / When Suspected)

Run the check on the first run, or whenever errors occur:

```bash
python3 "$VT_HOME/scripts/transcript.py" --doctor
```

If any item shows ✗, inform the user:
  • Missing ffmpeg / playwright / chromium / API Key, etc. → run the one-click installation wizard:

```bash
bash "$VT_HOME/install.sh"
```

  • Only proceed to Phase 2 once every check is ✓.
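The spirit of the check can be sketched in plain shell. This is an assumption-level sketch, not the actual `--doctor` logic inside transcript.py, which also validates the API Key and browser install:

```shell
# Hypothetical minimal dependency probe (a sketch, not transcript.py's real --doctor code)
report=$(
  for cmd in python3 ffmpeg; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "✓ $cmd"    # found on PATH
    else
      echo "✗ $cmd"    # missing -> run install.sh
    fi
  done
)
printf '%s\n' "$report"
```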

Phase 2 · Trigger Methods and Execution

When the user provides a video link (URL or local path), run directly:

```bash
python3 "$VT_HOME/scripts/transcript.py" "<URL or local path>"
```
Supported inputs:
  • Bilibili: `https://www.bilibili.com/video/BVxxx` or the `b23.tv/xxx` short link
  • Douyin: `https://www.douyin.com/video/xxx`, `v.douyin.com/xxx`, `douyin.com/jingxuan?modal_id=xxx`
  • Xiaohongshu: `xiaohongshu.com/discovery/item/xxx`, `xiaohongshu.com/explore/xxx`, `xhslink.com/xxx`
  • YouTube: `youtube.com/watch?v=xxx`, `youtu.be/xxx`
  • Local video file path
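As a rough illustration, the link routing could be sketched as below. The patterns are assumptions distilled from the list above, not the script's actual matching code:

```shell
# Hypothetical platform detection by URL pattern (illustrative only)
detect() {
  case "$1" in
    *bilibili.com/video/*|*b23.tv/*)     echo bilibili ;;
    *douyin.com/*)                       echo douyin ;;
    *xiaohongshu.com/*|*xhslink.com/*)   echo xiaohongshu ;;
    *youtube.com/watch*|*youtu.be/*)     echo youtube ;;
    *) if [ -f "$1" ]; then echo local; else echo unknown; fi ;;
  esac
}
detect "https://www.bilibili.com/video/BV1xx411c7mD"   # -> bilibili
detect "https://youtu.be/dQw4w9WgXcQ"                  # -> youtube
```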
The script automatically performs:
  0. Detection — launch a headless browser; retrieve the title, duration, and direct video/audio links; print the 📊 evaluation table and estimated time
  1. Download — reuse the direct links from detection (no browser restart); Bilibili uses the DASH stream (video and audio downloaded separately, then merged with ffmpeg); other platforms download directly as mp4
  2. Long video slicing — videos longer than 8 min are automatically sliced into 6-min segments so the model does not summarize them
  3. Compression — ffmpeg with a 30 MB target size; resolution is selected automatically from the duration, in one pass
  4. Transcription — Doubao video understanding API with a strict verbatim prompt (colloquialisms, internet memes, and pauses retained)
  5. Merging — for long videos, merge the segments and automatically shift each segment's paragraph timestamps by that segment's offset
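The timestamp adjustment in the final merge step can be illustrated as follows. This is a minimal sketch assuming each segment's paragraph timestamps are segment-local `MM:SS` values shifted by the segment's start offset; the `shift_ts` helper is hypothetical, not part of transcript.py:

```shell
# Hypothetical sketch of the merge-step offset adjustment (not transcript.py's code)
offset=360   # e.g. the second 6-minute segment starts at 06:00 of the full video
shift_ts() {
  # segment-local "MM:SS" -> global "MM:SS"
  printf '%s' "$1" | awk -F: -v off="$offset" \
    '{ t = $1 * 60 + $2 + off; printf "%02d:%02d", int(t / 60), t % 60 }'
}
echo "[$(shift_ts 00:42) - $(shift_ts 02:15)]"   # -> [06:42 - 08:15]
```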

⚠️ Two Mandatory Tasks for You (Agent)

(1) Repeat Evaluation Table — Set User Expectations
After the script starts, the first thing printed to stderr is the 📊 video detection evaluation table. You must repeat it to the user immediately, telling them:
  • the video title, duration, and number of segments
  • the estimated time (most important: it gives the user a waiting expectation)
Do not wait until transcription completes to speak: repeat the evaluation table first, then wait for the transcription. If the user cannot see the duration and estimated time, they may think the program has frozen.
Example:
Video detection completed:
  • Platform: Bilibili / Title: 《xxx》
  • Duration: 17 minutes 12 seconds, will be sliced into 3 segments for independent transcription
  • Estimated time: 3 minutes 20 seconds ~ 5 minutes 25 seconds, running now, please wait...
(2) Output Full Verbatim Transcript After Completion
The script prints the complete Markdown transcript to stdout at the end. You must display the entire transcript, exactly as printed, in the conversation. Do not merely say "saved to xxx.md"; that leaves the user seeing nothing.
Correct approach:
  1. Capture all Markdown content from the `# <Title>` line onward in stdout
  2. Repeat it to the user in full (title, duration, every paragraph)
  3. Append a line with the local save path so the user can open the .md file later
Incorrect approaches:
  • ❌ "Transcript saved to outputs/xxx.md, please check the file"
  • ❌ Only show the first few paragraphs and omit the rest
  • ❌ Summarize/simplify content (transcripts must be verbatim)
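Capturing everything from the `# <Title>` line onward can be sketched with a sed address range. The sample stdout below is fabricated for illustration:

```shell
# Hypothetical sketch: pull the Markdown transcript out of mixed stdout
stdout_text='progress: transcription done
# My Video
> Duration 5:32 | Source: demo.mp4
## 1. Intro [00:00 - 00:42]
Hello everyone...'
# keep everything from the first "# " heading to end of input
transcript=$(printf '%s\n' "$stdout_text" | sed -n '/^# /,$p')
printf '%s\n' "$transcript"
```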

Phase 3 · Output Destinations

The script writes to two destinations simultaneously:
  1. The full Markdown transcript goes directly to stdout — for the agent to repeat to the user (see mandatory task 2 in Phase 2)
  2. It is also saved locally to `$VT_HOME/outputs/<title>_transcript.md` — managed by the skill, so all outputs live in one place
Disable the local save with `--no-save`; change the save path with `--output-dir <path>`.

Evaluation Table Example

═══════════════════════════════════════════════════════
  📊 Video Detection
───────────────────────────────────────────────────────
  Platform:        Bilibili
  Title:           A city of 100,000 people disappeared between Zhejiang and Anhui
  Duration:        17 minutes 12 seconds
  Segments:        3 segments (each ≤ 6 minutes)
  Estimated Time:  3 minutes 20 seconds ~ 5 minutes 25 seconds
═══════════════════════════════════════════════════════

Output Format Example

```markdown
# Video Title

> Duration 5:32 | Source: <URL or filename>

## 1. Introduce Topic [00:00 - 00:42]
Hello everyone, today we're going to talk about...

## 2. Core Viewpoint [00:42 - 02:15]
Well, my opinion is this, first of all...
```

Features:
  • Semantic segmentation: automatically divided into 3-8 segments by topic, each with a short subtitle
  • Paragraph-level timestamps: each segment starts with `[MM:SS - MM:SS]` for easy positioning
  • Verbatim transcription: colloquial words ("uh", "well", "ah", "you know"), internet memes, and pauses are retained; no summarizing or rewriting
  • Silent segments: marked with `_(No voice here, XX seconds)_`

Phase 4 · Exception Handling

| Scenario | Handling |
| --- | --- |
| `--doctor` reports missing dependencies | Run `bash "$VT_HOME/install.sh"` |
| Doubao API Key not found | Check `$VT_HOME/.env`; if not configured, have the user run install.sh, which guides Key setup |
| API 401 / "Model not authorized" | Volcano Ark Console → Model Plaza → click "Activate" for Doubao-Seed-2.0-pro |
| Douyin graphic note (note link) | Inform the user that graphic posts are not supported, only videos |
| Crawling fails after a platform frontend revision | Check `$VT_HOME/FALLBACK.md` for the manual workaround |
| Long video is summarized instead of verbatim | Already handled automatically (>8 min is sliced); report if the issue persists |
| Compressed video still > 50 MB | The script automatically iterates, reducing bitrate/resolution (max 4 rounds) |
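The size-targeting compression can be approximated with a bitrate budget. This is a sketch under the assumption that total bitrate ≈ target bytes × 8 / duration; the script's actual iteration logic may differ, and the ffmpeg invocation below is only echoed, not run:

```shell
# Hypothetical bitrate budget for a target file size (not the script's exact logic)
target_mb=30; dur=360                    # 30 MB target, 6-minute segment
kbps=$(( target_mb * 8 * 1024 / dur ))   # total budget in kbit/s
video_kbps=$(( kbps - 128 ))             # reserve 128 kbit/s for audio
echo "ffmpeg -i seg.mp4 -b:v ${video_kbps}k -b:a 128k -y seg_small.mp4"
```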

Command Line Options

| Parameter | Description |
| --- | --- |
| `input` | Video URL or local file path (required; optional with `--doctor`) |
| `--title` | Video title (the detected title is used by default) |
| `--target-size` | Compression target size in MB, default 30 |
| `--no-save` | Do not save the .md file (saved to `$VT_HOME/outputs/` by default) |
| `--output-dir` | Change the save path |
| `--doctor` | Check mode: verify dependencies and configuration |

Notes

  • Native Doubao video understanding, the model directly processes video audio, no independent ASR required
  • Timestamp precision is paragraph-level (not word/sentence-level), used for chapter positioning
  • Full Markdown transcript is output directly to stdout by default (for upper-layer agent display); simultaneously saved locally
  • Estimated-time model (based on actual measurements): 10 s headless startup + download at 0.4 s per video-second + 12 s compression per segment + 30 s API processing per segment, with a ±20% range
  • The API Key configuration is stored in `$VT_HOME/.env` (permissions 600, gitignored) and is not distributed with the repository
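As a worked example of the estimated-time model above, plugging in a 5-minute video (which fits in a single segment) with integer arithmetic rounded down — the ceil-of-6-minutes segment formula here is my simplification of the slicing rule:

```shell
# Worked example of the estimated-time model (numbers from the note above)
dur=300                                  # 5-minute video, in seconds
segs=$(( (dur + 359) / 360 ))            # ceil(dur / 6 min) -> 1 segment
base=$(( 10 + dur * 4 / 10 + segs * (12 + 30) ))   # startup + download + per-segment work
lo=$(( base * 80 / 100 )); hi=$(( base * 120 / 100 ))   # ±20% range
echo "segments=$segs estimate=${lo}s-${hi}s"       # -> segments=1 estimate=137s-206s
```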