li-transcript

Video Transcript Extraction + Benchmark Archiving

Extracts the transcript from a video link, proofreads it with AI, and archives it under the matching benchmark blogger directory.

Workflow

Step 0: Environment Preparation Before First Use (Only Required Once)

If you have just installed this skill from an open-source repository, prepare the runtime environment first:

bash

1. System dependencies (macOS)

brew install yt-dlp ffmpeg

2. Python virtual environment + Tencent Cloud ASR SDK

python3 -m venv .claude/skills/li-transcript/scripts/.venv
.claude/skills/li-transcript/scripts/.venv/bin/pip install tencentcloud-sdk-python-asr

3. Create a .env file in the project root and fill in your Tencent Cloud keys (request them in the console at https://console.cloud.tencent.com/cam/capi)

Format:

TENCENT_SECRET_ID=your_id
TENCENT_SECRET_KEY=your_key


Once the environment is ready, skip ahead to Step 1.

Step 1: Transcribe the Video


Run the transcription script (the script and its venv both live in the skill directory, so the skill is self-contained):

bash

.claude/skills/li-transcript/scripts/.venv/bin/python3 .claude/skills/li-transcript/scripts/video2text.py "video link"

The script outputs:

- stderr: progress information plus a `[标题] xxx` line carrying the original video title (`标题` means "title")
- stdout: the plain-text transcript with timestamps removed
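
A caller that wants the title and the transcript separately can capture the two streams independently. The sketch below is a minimal Python example under stated assumptions: the helper names `extract_title` and `transcribe` are hypothetical, the script is assumed to exit with a non-zero status on failure, and the title line is assumed to look exactly like `[标题] xxx` as described above.

```python
from __future__ import annotations

import re
import subprocess

# Paths taken from the command above, relative to the project root.
PYTHON = ".claude/skills/li-transcript/scripts/.venv/bin/python3"
SCRIPT = ".claude/skills/li-transcript/scripts/video2text.py"

# Matches a line such as "[标题] some title" and captures the title text.
TITLE_LINE = re.compile(r"^\s*\[标题\]\s*(.+?)\s*$")


def extract_title(stderr_text: str) -> str | None:
    """Return the text after the first '[标题]' marker, or None if absent."""
    for line in stderr_text.splitlines():
        match = TITLE_LINE.match(line)
        if match:
            return match.group(1)
    return None


def transcribe(url: str) -> tuple[str | None, str]:
    """Run video2text.py and return (title, transcript).

    Assumption: a failed run exits non-zero, so check=True raises
    subprocess.CalledProcessError instead of returning partial output.
    """
    result = subprocess.run(
        [PYTHON, SCRIPT, url],
        capture_output=True,  # keep stdout and stderr apart
        text=True,
        encoding="utf-8",
        check=True,
    )
    return extract_title(result.stderr), result.stdout.strip()
```

Keeping the streams apart means progress messages never leak into the saved transcript, and a missing title simply comes back as `None`.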

If the script fails, check in this order:

1. Does the `.env` file in the project root contain `TENCENT_SECRET_ID` and `TENCENT_SECRET_KEY`? (The script searches upward from the current directory, up to 6 levels.)
2. Is `.claude/skills/li-transcript/scripts/.venv/` intact? To rebuild it: `python3 -m venv .claude/skills/li-transcript/scripts/.venv && .claude/skills/li-transcript/scripts/.venv/bin/pip install tencentcloud-sdk-python-asr`
3. Is the video link supported by yt-dlp?
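
The first check can be automated. The following is a rough sketch of an upward `.env` lookup, not the script's actual code: the function names are hypothetical, and it assumes that "up to 6 levels" means the current directory plus at most six parent directories, with plain `KEY=value` lines and no quoting.

```python
from __future__ import annotations

from pathlib import Path

REQUIRED_KEYS = ("TENCENT_SECRET_ID", "TENCENT_SECRET_KEY")


def find_env_file(start: Path, max_levels: int = 6) -> Path | None:
    """Look for .env in `start`, then in up to `max_levels` parent directories."""
    current = start.resolve()
    for _ in range(max_levels + 1):  # the start directory plus its parents
        candidate = current / ".env"
        if candidate.is_file():
            return candidate
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None


def parse_env(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(start: Path) -> list[str]:
    """Return the required keys that are absent or empty."""
    env_file = find_env_file(start)
    found = parse_env(env_file) if env_file else {}
    return [key for key in REQUIRED_KEYS if not found.get(key)]
```

Calling `missing_keys(Path.cwd())` before running the script gives an empty list when both keys are present.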

Step 2: AI Proofreading

Proofread the raw transcript, correcting wrong characters only and leaving the content unchanged:

- Homophone and near-homophone errors: infer the intended word from context (e.g., 「艺人公司」→「一人公司」, 「四动会」→「私董会」, 「体校」→「提效」)
- Proper nouns: technical terms, product names, personal names, English words
- Obvious ASR garble: replace with a reasonable inference

Keep spoken fillers (「啊」, 「嗯」, 「就是说」), do not restructure sentences, and do not polish the style.

Apply the corrections directly without listing each difference; a user who wants to check can compare against the raw transcript.
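
For a user who does want to check the corrections, a character-level diff between the raw and proofread text is enough. A minimal sketch using Python's standard `difflib` follows; `changed_spans` is a hypothetical helper name, and the sample strings are made up from the homophone examples above.

```python
from __future__ import annotations

import difflib


def changed_spans(raw: str, proofread: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for every span that differs."""
    # autojunk=False keeps frequent characters from being ignored in long texts.
    matcher = difflib.SequenceMatcher(None, raw, proofread, autojunk=False)
    return [
        (raw[i1:i2], proofread[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    raw_text = "今天聊聊艺人公司怎么体校"
    fixed_text = "今天聊聊一人公司怎么提效"
    for before, after in changed_spans(raw_text, fixed_text):
        print(f"{before} -> {after}")
```

Because proofreading is meant to fix wrong characters only, the reported spans should be short; a long span suggests a sentence was rewritten rather than corrected.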

Step 3: Identify the Author & Archive

Identify the author, in priority order:

1. The user has already named the author → use that name
2. The transcript contains a self-introduction (「大家好我是XXX」, 「我是XXX」) → extract the name
3. Neither → ask the user

Match the benchmark: read the directory listing of `05-选题研究/对标博主/短视频/`:

- An existing blogger matches → archive to that directory
- No match → ask the user "[Name] is not in the benchmark list yet. Create it?" and create the directory if they agree

The author question and the create-benchmark question can be combined into a single prompt to reduce back-and-forth.

Determine the title: prefer the original title obtained by yt-dlp (the `[标题]` line in stderr); if it is unavailable, summarize one from the transcript content.
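
The author and benchmark lookups above can be sketched as follows. This is an illustrative heuristic, not the skill's actual logic: the function names and the regular expression are assumptions, the name is taken as the run of characters after 「我是」 up to the next whitespace or punctuation mark (so unpunctuated ASR output may over-capture), and a benchmark matches only when a directory name equals the author name exactly.

```python
from __future__ import annotations

import re
from pathlib import Path

# Hypothetical pattern: capture up to 12 characters following "我是".
SELF_INTRO = re.compile(r"我是([^\s，。,.!！?？、]{1,12})")


def resolve_author(user_supplied: str | None, transcript: str) -> str | None:
    """Apply the priority order: user input first, then a self-introduction."""
    if user_supplied and user_supplied.strip():
        return user_supplied.strip()
    match = SELF_INTRO.search(transcript)
    return match.group(1) if match else None  # None means: ask the user


def match_benchmark(author: str, base_dir: Path) -> Path | None:
    """Return the existing blogger directory named exactly `author`, if any."""
    if not base_dir.is_dir():
        return None
    for entry in sorted(base_dir.iterdir()):
        if entry.is_dir() and entry.name == author:
            return entry
    return None  # None means: ask whether to create the directory
```

A `None` from either function is the cue to ask the user, and both questions can be folded into one prompt as described above.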

Step 4: Save the File

Path:

05-选题研究/对标博主/短视频/[博主名]/[YYYY年M月]/[视频标题].md

Here [博主名] is the blogger's name, [YYYY年M月] is the year and month, and [视频标题] is the video title. Create the time directory if it does not exist.

Format (replace the [平台] placeholder in the first YAML key with the video's source platform; the metric keys are likes 点赞数, favorites 收藏数, comments 评论数, shares 分享数, and views 观看数):

markdown

---
[平台]数据:
  点赞数: 
  收藏数: 
  评论数: 
  分享数: 
  观看数: 
tags: 
---

标题

[视频标题]

封面花字

视频脚本

[校对后的完整逐字稿]

The body sections are 标题 (title), 封面花字 (cover caption text), and 视频脚本 (video script); [校对后的完整逐字稿] stands for the full proofread transcript.

Platform rules:
- URL contains `xiaohongshu.com` → `小红书数据`
- URL contains `bilibili.com` → `B站数据`
- URL contains `douyin.com` → `抖音数据`
- URL contains `youtube.com` or `youtu.be` → `YouTube数据`
- Anything else → `平台数据`

After saving, tell the user the path in one sentence.
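
Putting the path layout and platform rules together, a saving routine might look like the sketch below. All function names are hypothetical. Which date feeds the `[YYYY年M月]` directory is not stated above, so the caller passes it in. Replacing `/` in the title is an added assumption to keep the filename valid, and the body labels are written as plain lines because the template above shows no heading markup.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path

BASE_DIR = Path("05-选题研究/对标博主/短视频")

# (URL substrings, YAML field name), checked in the order listed above.
PLATFORM_RULES = (
    (("xiaohongshu.com",), "小红书数据"),
    (("bilibili.com",), "B站数据"),
    (("douyin.com",), "抖音数据"),
    (("youtube.com", "youtu.be"), "YouTube数据"),
)
METRICS = ("点赞数", "收藏数", "评论数", "分享数", "观看数")


def platform_field(url: str) -> str:
    """Map a video URL to its YAML field name, with 平台数据 as the fallback."""
    for needles, field in PLATFORM_RULES:
        if any(needle in url for needle in needles):
            return field
    return "平台数据"


def archive_path(blogger: str, title: str, when: date, base: Path = BASE_DIR) -> Path:
    """Build [base]/[blogger]/[YYYY年M月]/[title].md with an unpadded month."""
    safe_title = title.replace("/", "_")  # assumption: keep the filename valid
    return base / blogger / f"{when.year}年{when.month}月" / f"{safe_title}.md"


def render_note(url: str, title: str, transcript: str) -> str:
    """Fill the template: empty metrics in the front matter, then the body."""
    lines = ["---", f"{platform_field(url)}:"]
    lines += [f"  {metric}: " for metric in METRICS]
    lines += ["tags: ", "---", "", "标题", "", title, "",
              "封面花字", "", "视频脚本", "", transcript, ""]
    return "\n".join(lines)


def save_note(path: Path, content: str) -> Path:
    """Create missing directories (including the time directory) and write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

`save_note` returns the written path, which is the one fact to report back to the user after saving.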