youtube-transcript

YouTube Transcript

Overview

Extract YouTube video transcripts, metadata, and chapters using yt-dlp. The output is formatted as Markdown with YAML frontmatter and saved to ~/Brains/brain/ (an Obsidian vault).

Quick Start

To extract a transcript from a YouTube video:

```bash
python scripts/extract_transcript.py <youtube_url>
```

Optionally, specify a custom output filename:

```bash
python scripts/extract_transcript.py <youtube_url> custom_filename.md
```

Output Format

YAML Frontmatter

The generated Markdown includes comprehensive metadata:

- `title` - Video title
- `channel` - Channel name
- `url` - YouTube URL
- `upload_date` - Upload date (YYYY-MM-DD)
- `duration` - Video duration (HH:MM:SS)
- `description` - Video description (truncated to 500 chars)
- `tags` - Array of video tags
- `view_count` - View count
- `like_count` - Like count
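
As an illustration of how these fields could be assembled, here is a minimal sketch that builds the frontmatter from a yt-dlp style info dictionary. The helper names (`format_duration`, `format_upload_date`, `build_frontmatter`) are hypothetical and are not taken from `extract_transcript.py`. The sketch assumes yt-dlp's usual info keys (`webpage_url`, `upload_date` as YYYYMMDD, `duration` in seconds) and uses `json.dumps` to quote values, which relies on JSON scalars and arrays being accepted as YAML flow values; the real script may serialize differently.

```python
import json


def format_duration(seconds: int) -> str:
    """Render a duration given in seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_upload_date(raw: str) -> str:
    """Convert a YYYYMMDD string to YYYY-MM-DD; return other input unchanged."""
    if len(raw) != 8 or not raw.isdigit():
        return raw
    return f"{raw[:4]}-{raw[4:6]}-{raw[6:8]}"


def build_frontmatter(info: dict) -> str:
    """Build a YAML frontmatter block (hypothetical helper, not the script's API)."""
    fields = {
        "title": info.get("title") or "",
        "channel": info.get("channel") or "",
        "url": info.get("webpage_url") or "",
        "upload_date": format_upload_date(info.get("upload_date") or ""),
        "duration": format_duration(info.get("duration") or 0),
        # The description is cut to 500 characters, as stated above.
        "description": (info.get("description") or "")[:500],
        "tags": info.get("tags") or [],
        "view_count": info.get("view_count") or 0,
        "like_count": info.get("like_count") or 0,
    }
    lines = ["---"]
    for key, value in fields.items():
        # Assumption: JSON-encoded values are used as YAML flow values.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines)
```

Quoting every string this way avoids a YAML library dependency and keeps titles containing colons or quotes parseable, at the cost of slightly noisier frontmatter.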

Body Structure

The transcript is organized by video chapters (if available):

```markdown
## Chapter Title

00:05:23 Transcript text for this segment.

00:05:45 Next segment text.
```

If no chapters exist, all content appears under a "## Transcript" heading.

Timestamps are formatted as HH:MM:SS for consistency.
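
The body layout described above can be sketched as follows. This is an illustrative sketch, not the script's actual code: `format_timestamp` and `build_body` are hypothetical names, segments are assumed to be `(start_seconds, text)` pairs, and chapters are assumed to be dictionaries with `start_time` and `title` keys, as in yt-dlp's `chapters` field. Assigning segments that start before the first chapter to that first chapter is also an assumption.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start time in seconds as HH:MM:SS (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_section(title: str, segments: list) -> list:
    """Return the Markdown lines for one heading and its segments."""
    lines = [f"## {title}", ""]
    for start, text in segments:
        lines += [f"{format_timestamp(start)} {text}", ""]
    return lines


def build_body(segments: list, chapters: list) -> str:
    """Group (start_seconds, text) segments under chapter headings."""
    if not chapters:
        # No chapters: everything goes under a single "## Transcript" heading.
        return "\n".join(render_section("Transcript", segments)).rstrip() + "\n"

    ordered = sorted(chapters, key=lambda chapter: chapter["start_time"])
    groups = [[] for _ in ordered]
    for start, text in segments:
        index = 0  # assumption: early segments fall into the first chapter
        for i, chapter in enumerate(ordered):
            if chapter["start_time"] <= start:
                index = i
        groups[index].append((start, text))

    lines = []
    for chapter, group in zip(ordered, groups):
        lines += render_section(chapter["title"], group)
    return "\n".join(lines).rstrip() + "\n"
```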

Workflow

1. Extract metadata and subtitles using yt-dlp
2. Parse VTT subtitle format to extract timestamps and text
3. Group transcript segments by video chapters (if present)
4. Format as Markdown with YAML frontmatter
5. Save to ~/Brains/brain/ with sanitized filename based on video title
6. Clean up temporary subtitle files
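
The VTT parsing step can be illustrated with a small sketch. It is not the parser used by `extract_transcript.py`; `parse_vtt` is a hypothetical name, and the approach (split the file into blank-line-separated blocks, find the cue timing line, strip inline tags such as `<c>`) is one simple way to read WebVTT cues that ignores cue settings and styling blocks.

```python
import re

# Matches the start time of a cue timing line such as
# "00:05:23.000 --> 00:05:45.000 align:start"; the hours part is optional.
CUE_TIMING = re.compile(r"^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})\s+-->")
INLINE_TAG = re.compile(r"<[^>]+>")


def parse_vtt(vtt_text: str) -> list:
    """Return a list of (start_seconds, text) pairs from WebVTT content."""
    segments = []
    normalized = vtt_text.replace("\r\n", "\n")
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.strip().split("\n")
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if not match:
                continue  # header lines or a cue identifier
            hours, minutes, seconds, millis = match.groups()
            start = (
                int(hours or 0) * 3600
                + int(minutes) * 60
                + int(seconds)
                + int(millis) / 1000
            )
            # Everything after the timing line is cue text; drop inline tags
            # and collapse whitespace so multi-line cues become one line.
            raw = " ".join(INLINE_TAG.sub("", text) for text in lines[i + 1:])
            text = " ".join(raw.split())
            if text:
                segments.append((start, text))
            break
    return segments
```

Blocks without a timing line, such as the `WEBVTT` header, are skipped, so the function can be fed a whole subtitle file at once.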

Deduplication

To remove duplicates from existing transcript files:

```bash
python scripts/deduplicate_transcript.py <markdown_file>
```

This removes transcript entries that are prefixes of subsequent entries (common in VTT files where subtitles accumulate).
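
The prefix rule can be shown with a short sketch. The function name `deduplicate` and the `(timestamp, text)` entry shape are assumptions made for illustration; the actual script works on Markdown files and may differ in detail, for example in which timestamp it keeps.

```python
def deduplicate(entries: list) -> list:
    """Drop entries whose text is a prefix of the next entry's text.

    entries is a list of (timestamp, text) pairs in transcript order.
    """
    result = []
    for index, (timestamp, text) in enumerate(entries):
        has_next = index + 1 < len(entries)
        if has_next and entries[index + 1][1].startswith(text):
            # The next entry repeats this text and extends it, so skip this one.
            continue
        result.append((timestamp, text))
    return result


entries = [
    ("00:00:01", "hello"),
    ("00:00:02", "hello world"),
    ("00:00:04", "hello world again"),
    ("00:00:06", "new line"),
]
cleaned = deduplicate(entries)
```

In this sketch the surviving entry carries the later timestamp, because the earlier, shorter entries are the ones removed.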

Requirements

Ensure yt-dlp is installed:

```bash
pip install yt-dlp
```

Limitations

- Extracts English subtitles first and falls back to Russian if English is unavailable
- Requires the video to have subtitles (auto-generated or manual)
- Does not download video or audio files
- Description is truncated to 500 characters in the frontmatter
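
The English-then-Russian fallback could look roughly like the sketch below. It assumes a yt-dlp style info dictionary whose `subtitles` and `automatic_captions` entries map language codes to track lists. The function name `pick_subtitle_language`, the preference for manual over auto-generated tracks, and the exact-match language codes are illustrative assumptions, not documented behavior of the script.

```python
PREFERRED_LANGUAGES = ("en", "ru")  # English first, then Russian


def pick_subtitle_language(info: dict):
    """Return (language, kind) for the first available preferred language.

    kind is "manual" or "auto"; returns None when no preferred language exists.
    """
    manual = info.get("subtitles") or {}
    automatic = info.get("automatic_captions") or {}
    for language in PREFERRED_LANGUAGES:
        # Assumption: manual subtitles win over auto-generated ones.
        if language in manual:
            return language, "manual"
        if language in automatic:
            return language, "auto"
    return None
```

A `None` result corresponds to the second limitation: without any usable subtitle track there is nothing to extract.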
