video-transcript

Video Transcript Extraction Expert

Input a video link or local path → auto-detection → background download → slicing → compression → Doubao transcription → strict verbatim transcript (printed to stdout and saved to disk)

Phase 0 · Locate the Skill Root Directory (Do This First)

When triggered, first locate the skill's own directory by running:

```bash
VT_HOME="$(
  for d in "$HOME/.claude/skills/video-transcript" \
           "$(pwd)/.claude/skills/video-transcript" \
           "$(pwd)/skills/video-transcript" \
           "$HOME/.claude/plugins/video-transcript/video-transcript"; do
    [ -f "$d/SKILL.md" ] && echo "$d" && break
  done
)"
export VT_HOME
echo "VT_HOME=$VT_HOME"
```

If `VT_HOME` prints as empty, the skill is installed in a non-standard location. Ask the user for the path, then set it manually with `export VT_HOME=<path>`. After that, run every command through `"$VT_HOME/scripts/transcript.py"`; do not hardcode relative or absolute paths.
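
For tooling that needs the same lookup outside a shell, the search above can be sketched in Python. This is a minimal sketch, not part of the skill: the function names `find_skill_root` and `default_candidates` are hypothetical, and only the four candidate directories and the `SKILL.md` marker come from the shell snippet.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def default_candidates() -> List[Path]:
    """The four install locations probed by the shell snippet, in the same order."""
    home, cwd = Path.home(), Path.cwd()
    return [
        home / ".claude/skills/video-transcript",
        cwd / ".claude/skills/video-transcript",
        cwd / "skills/video-transcript",
        home / ".claude/plugins/video-transcript/video-transcript",
    ]


def find_skill_root(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first directory that contains SKILL.md, or None if none does."""
    for directory in candidates:
        if (directory / "SKILL.md").is_file():
            return directory
    return None


if __name__ == "__main__":
    root = find_skill_root(default_candidates())
    # An empty value mirrors the shell behaviour: ask the user for the path.
    print(f"VT_HOME={root or ''}")
```

Returning `None` instead of raising keeps the "ask the user for the path" fallback in the caller's hands.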

Phase 1 · Dependency Check (First Run or After an Error)

On the first run, or whenever an error occurs, run the health check first:

```bash
python3 "$VT_HOME/scripts/transcript.py" --doctor
```

If any item is marked ✗ (missing ffmpeg, playwright, chromium, API Key, and so on), tell the user to run the one-click installation wizard:

```bash
bash "$VT_HOME/install.sh"
```

Proceed to Phase 2 only when every check shows ✓.
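
The "all ✓ before Phase 2" gate can be checked programmatically. The sketch below assumes that `--doctor` prints one check per line and marks failures with ✗; the exact report layout is not specified in this document, and the helper names `failed_checks` and `run_doctor` are hypothetical.

```python
import subprocess
import sys
from typing import List


def failed_checks(report: str) -> List[str]:
    """Return every report line that carries a ✗ marker (assumed format)."""
    return [line.strip() for line in report.splitlines() if "✗" in line]


def run_doctor(vt_home: str) -> List[str]:
    """Run the documented --doctor command and collect the failing lines."""
    result = subprocess.run(
        [sys.executable, f"{vt_home}/scripts/transcript.py", "--doctor"],
        capture_output=True,
        text=True,
    )
    # The report may go to stdout or stderr, so scan both streams.
    return failed_checks(result.stdout + result.stderr)
```

An empty list from `run_doctor` would mean it is safe to move on; anything else should be relayed to the user together with the `install.sh` hint.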

Phase 2 · Trigger Methods and Execution

When the user provides a video link or a local path, run the script directly:

```bash
python3 "$VT_HOME/scripts/transcript.py" "<URL or local path>"
```

Supported inputs:

- Bilibili: `https://www.bilibili.com/video/BVxxx` or a `b23.tv/xxx` short link
- Douyin: `https://www.douyin.com/video/xxx`, `v.douyin.com/xxx`, `douyin.com/jingxuan?modal_id=xxx`
- Xiaohongshu: `xiaohongshu.com/discovery/item/xxx`, `xiaohongshu.com/explore/xxx`, `xhslink.com/xxx`
- YouTube: `youtube.com/watch?v=xxx`, `youtu.be/xxx`
- Local video file path
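
To illustrate how the supported inputs above might be told apart, here is a minimal sketch that classifies an input by its host name. The mapping is taken from the list above; the function name `detect_platform`, the returned labels, and the rule that anything else counts as a local path are assumptions, not the script's actual logic.

```python
from urllib.parse import urlparse

# Host suffix -> platform label, taken from the supported-inputs list.
PLATFORM_HOSTS = {
    "bilibili.com": "bilibili",
    "b23.tv": "bilibili",
    "douyin.com": "douyin",
    "xiaohongshu.com": "xiaohongshu",
    "xhslink.com": "xiaohongshu",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
}


def detect_platform(source: str) -> str:
    """Classify an input string as one of the platforms, or as a local path."""
    # Short links are often pasted without a scheme; add one so that
    # urlparse fills in the host name instead of treating it as a path.
    url = source if "://" in source else "https://" + source
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        if host == domain or host.endswith("." + domain):
            return platform
    return "local"
```

For example, `detect_platform("v.douyin.com/xxx")` yields `"douyin"`, while a path such as `/tmp/talk.mp4` falls through to `"local"`.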

The script then runs these steps automatically:

0. Detection: launch a headless browser; retrieve the title, duration, and direct video/audio links; print the 📊 evaluation table with the estimated time.
1. Download: reuse the direct links obtained during detection (the browser is not restarted). Bilibili uses DASH streams (video and audio are downloaded separately, then merged with ffmpeg); other platforms download a single mp4.
2. Long-video slicing: videos longer than 8 minutes are cut into 6-minute segments so that the model transcribes them instead of summarizing them.
3. Compression: ffmpeg, target size 30 MB; the resolution is chosen from the duration so that one pass is enough.
4. Transcription: Doubao Video Understanding API with a strict verbatim prompt (keeps filler words, internet memes, and pauses).
5. Merge: for long videos, join the per-segment outputs and shift the timestamps in each segment's headers by that segment's offset.
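
The slicing rule can be made concrete with a small sketch. It assumes fixed 6-minute cuts with a shorter final segment; the document only states the 8-minute threshold and the 6-minute segment length, so the exact boundary handling and the name `plan_segments` are assumptions.

```python
import math
from typing import List, Tuple

SLICE_THRESHOLD_S = 8 * 60  # videos longer than this are sliced
SEGMENT_LENGTH_S = 6 * 60   # length of each slice


def plan_segments(duration_s: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs in seconds covering the whole video."""
    if duration_s <= SLICE_THRESHOLD_S:
        return [(0, duration_s)]  # short video: a single segment
    count = math.ceil(duration_s / SEGMENT_LENGTH_S)
    return [
        (i * SEGMENT_LENGTH_S, min((i + 1) * SEGMENT_LENGTH_S, duration_s))
        for i in range(count)
    ]
```

Under this reading, a 17 min 12 s video (1032 seconds) yields three segments, the last one running from 720 s to 1032 s.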

⚠️ Two Things You (the Agent) Must Do

(1) Relay the evaluation table to set the user's expectations

As soon as the script starts, the first thing it prints to stderr is the 📊 video detection evaluation table. Relay it to the user immediately, including:

- Video title, duration, and number of segments
- Estimated time (the most important item, because it tells the user how long to wait)

Do not wait until transcription finishes: relay the evaluation table first, then keep waiting for the transcript. A user who never sees the duration and the time estimate will assume the program has frozen.

Example:

Video detection complete:

Platform: Bilibili / Title: 《xxx》

Duration: 17 minutes 12 seconds; it will be sliced into 3 segments that are transcribed independently

Estimated time: 3 minutes 20 seconds ~ 5 minutes 25 seconds. Running now, please wait...

(2) Output the full verbatim transcript once transcription completes

At the end, the script prints the complete Markdown transcript to stdout. Show the entire transcript in the conversation exactly as printed. Saying only "saved to xxx.md" leaves the user with nothing to read.

Correct approach:

- Capture everything in stdout from the `# <Title>` line onward
- Relay it to the user in full (title, duration, every paragraph)
- Add one final line with the save path so the user can open the .md file later

Incorrect approaches:

- ❌ "Transcript saved to outputs/xxx.md, please check the file"
- ❌ Showing the first few paragraphs and omitting the rest
- ❌ Summarizing or condensing the content (a verbatim transcript must be shown verbatim)
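
Capturing the transcript from stdout can be sketched as follows. The sketch assumes the transcript starts at the first line beginning with `# ` and runs to the end of stdout, which is how the "correct approach" above reads; the function name `extract_transcript` is hypothetical.

```python
def extract_transcript(stdout_text: str) -> str:
    """Return the Markdown from the first '# <Title>' line to the end of stdout."""
    lines = stdout_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("# "):
            # Keep the title line itself: the user should see title,
            # duration, and every paragraph.
            return "\n".join(lines[index:]).strip()
    return ""  # no transcript found; do not pretend there was one
```

An empty result is a signal to check stderr for an error rather than to report success.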

Phase 3 · Output Destinations

The script writes to two destinations at once:

- stdout: the full Markdown transcript, for the agent to relay to the user (see mandatory task (2) in Phase 2)
- Disk: `$VT_HOME/outputs/<title>_transcript.md`; the skill manages this directory itself, so all outputs end up in one place

To skip saving to disk, pass `--no-save`. To change the save location, pass `--output-dir <path>`.
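
How the two flags interact with the default location can be sketched like this. It assumes that `--output-dir` keeps the same `<title>_transcript.md` file name and that the title is used unmodified; neither detail is confirmed by this document, and `resolve_save_path` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_save_path(
    title: str,
    output_dir: Optional[str] = None,
    no_save: bool = False,
    vt_home: Optional[str] = None,
) -> Optional[Path]:
    """Return where the transcript would be saved, or None when saving is off."""
    if no_save:
        return None  # --no-save: stdout only
    if output_dir is not None:
        base = Path(output_dir)  # --output-dir overrides the default
    else:
        base = Path(vt_home or os.environ["VT_HOME"]) / "outputs"
    # Assumption: the file name pattern is the same in both cases.
    return base / f"{title}_transcript.md"
```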

Evaluation Table Example

═══════════════════════════════════════════════════════
  📊 Video Detection
───────────────────────────────────────────────────────
  Platform:        Bilibili
  Title:           A city of 100,000 people disappeared between Zhejiang and Anhui
  Duration:        17 minutes 12 seconds
  Segments:        3 segments (each ≤ 6 minutes)
  Estimated time:  3 minutes 20 seconds ~ 5 minutes 25 seconds
═══════════════════════════════════════════════════════

Output Format Example

```markdown
# Video Title

Duration 5:32 | Source: <URL or filename>

1. Introducing the Topic [00:00 - 00:42]

Hello everyone, today we're going to talk about...

2. Core Viewpoint [00:42 - 02:15]

Well, my opinion is this, first of all...
```

Features:
- **Semantic segmentation**: the transcript is split by topic into 3-8 segments, each with a short heading
- **Paragraph-level timestamps**: each segment starts with `[MM:SS - MM:SS]`, which makes it easy to locate
- **Verbatim transcription**: keeps filler words ("uh", "well", "ah", "you know"), internet memes, and pauses; nothing is summarized or rewritten
- **Segments without speech**: marked with `_(No voice here, XX seconds)_`
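
Because long videos are transcribed per segment, the `[MM:SS - MM:SS]` markers of later segments have to be shifted by the segment's start offset when the outputs are merged. The following is a minimal sketch of that shift, assuming the markers appear exactly in the format shown above; `shift_timestamps` is a hypothetical name, not the script's function.

```python
import re

# Matches a segment header range such as [00:42 - 02:15].
HEADER_RANGE = re.compile(r"\[(\d+):(\d{2}) - (\d+):(\d{2})\]")


def _format_mmss(total_seconds: int) -> str:
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


def shift_timestamps(markdown: str, offset_s: int) -> str:
    """Add offset_s seconds to every [MM:SS - MM:SS] range in the text."""
    def shift(match: "re.Match[str]") -> str:
        start = int(match[1]) * 60 + int(match[2]) + offset_s
        end = int(match[3]) * 60 + int(match[4]) + offset_s
        return f"[{_format_mmss(start)} - {_format_mmss(end)}]"

    return HEADER_RANGE.sub(shift, markdown)
```

For the second 6-minute segment of a video, `shift_timestamps("1. Intro [00:00 - 00:42]", 360)` gives `"1. Intro [06:00 - 06:42]"`.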

Phase 4 · Exception Handling

| Scenario | Handling |
| --- | --- |
| `--doctor` reports missing dependencies | Run `bash "$VT_HOME/install.sh"` |
| `没找到豆包 API Key` (Doubao API Key not found) | Check `$VT_HOME/.env`; if it was never configured, have the user run install.sh, which guides them through entering the Key |
| API 401 / "模型未授权" (model not authorized) | Volcano Ark (火山方舟) console → Model Plaza (模型广场) → click "Activate" (开通) for Doubao-Seed-2.0-pro |
| Douyin image-and-text note (note link) | Tell the user that image-and-text notes are not supported, only videos |
| Scraping fails after a platform front-end redesign | See `$VT_HOME/FALLBACK.md` and follow the manual fallback |
| Long video is summarized instead of transcribed verbatim | Already handled automatically (videos > 8 min are sliced); ask the user to report it if it still happens |
| Compressed video is still > 50 MB | The script automatically iterates, lowering bitrate/resolution (at most 4 rounds) |

Command Line Options

| Parameter | Description |
| --- | --- |
| `input` | Video URL or local file path (required; may be omitted with `--doctor`) |
| `--title` | Video title (defaults to the detected title) |
| `--target-size` | Compression target size in MB, default 30 |
| `--no-save` | Do not save the .md file (by default it is saved to `$VT_HOME/outputs/`) |
| `--output-dir` | Change the save location |
| `--doctor` | Health-check mode: verify dependencies and configuration |
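
The options table maps naturally onto an `argparse` definition. The sketch below is an illustration of that mapping, not the actual parser in `transcript.py`: the option names and the default of 30 come from the table, while the types, help strings, and the names `build_parser` and `parse_cli` are assumptions.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="transcript.py")
    # Positional but optional, because --doctor needs no input.
    parser.add_argument("input", nargs="?", help="video URL or local file path")
    parser.add_argument("--title", help="video title (default: detected title)")
    parser.add_argument("--target-size", type=float, default=30,
                        help="compression target size in MB")
    parser.add_argument("--no-save", action="store_true",
                        help="do not save the .md file")
    parser.add_argument("--output-dir", help="change the save location")
    parser.add_argument("--doctor", action="store_true",
                        help="check dependencies and configuration")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.doctor and args.input is None:
        parser.error("input is required unless --doctor is given")
    return args
```

Declaring `input` with `nargs="?"` and validating it afterwards is one way to express "required, except with `--doctor`".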

Notes

- Doubao's native video understanding is used: the model listens to the video's audio directly, so no separate ASR step is needed
- Timestamp precision is paragraph-level (not word-level or sentence-level) and is meant for locating chapters
- By default the full Markdown transcript is printed to stdout (for the calling agent to display) and saved to disk at the same time
- Time-estimate model (based on measurements): headless startup 10 s + download (0.4 s per second of video) + 12 s compression per segment + 30 s API time per segment, reported as a ±20% range
- The API Key is stored in `$VT_HOME/.env` (permission 600, listed in .gitignore), so it is not distributed with the repository
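
The time-estimate note can be written out as arithmetic. This is a literal, sequential reading of the four terms listed above; the real script may combine them differently (for example by overlapping steps), so its figures need not match the sample evaluation table. The name `estimate_seconds` is hypothetical.

```python
from typing import Tuple

HEADLESS_STARTUP_S = 10.0      # fixed browser startup cost
DOWNLOAD_S_PER_VIDEO_S = 0.4   # download time per second of video
COMPRESS_S_PER_SEGMENT = 12.0  # compression time per segment
API_S_PER_SEGMENT = 30.0       # transcription API time per segment
MARGIN = 0.20                  # the estimate is reported as a ±20% range


def estimate_seconds(duration_s: float, segments: int) -> Tuple[float, float]:
    """Return (low, high) bounds in seconds under a simple additive model."""
    base = (
        HEADLESS_STARTUP_S
        + DOWNLOAD_S_PER_VIDEO_S * duration_s
        + (COMPRESS_S_PER_SEGMENT + API_S_PER_SEGMENT) * segments
    )
    return base * (1 - MARGIN), base * (1 + MARGIN)
```

For a 100-second, single-segment video the additive base is 10 + 40 + 12 + 30 = 92 seconds, giving a range of roughly 74 to 110 seconds.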