youtube-transcript

YouTube Transcript

Overview

Extract YouTube video transcripts, metadata, and chapters using yt-dlp. Output is formatted as Markdown with YAML frontmatter and saved to ~/Brains/brain/ (an Obsidian vault).

Quick Start

To extract a transcript from a YouTube video:
```bash
python scripts/extract_transcript.py <youtube_url>
```
Optionally, specify a custom output filename:
```bash
python scripts/extract_transcript.py <youtube_url> custom_filename.md
```
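How the script chooses between the custom filename and a title-based one is not shown here. The following is a minimal sketch of one way it could work; `sanitize_filename` and `resolve_output_path` are hypothetical helper names, and the set of characters replaced is an assumption. Only the vault location comes from the overview.

```python
import re
from pathlib import Path
from typing import Optional

# Vault location taken from the overview; everything else is illustrative.
VAULT_DIR = Path("~/Brains/brain").expanduser()


def sanitize_filename(title: str) -> str:
    """Replace characters that are unsafe in filenames and collapse whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|]', " ", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or "untitled"


def resolve_output_path(title: str, custom_name: Optional[str] = None) -> Path:
    """Use the custom filename when given, otherwise derive one from the title."""
    name = custom_name if custom_name else sanitize_filename(title) + ".md"
    return VAULT_DIR / name
```

With this sketch, `resolve_output_path("My Video")` points at `My Video.md` inside the vault, while passing `custom_filename.md` as the second argument overrides the derived name.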

Output Format

YAML Frontmatter

The generated Markdown includes comprehensive metadata:
  • title: Video title
  • channel: Channel name
  • url: YouTube URL
  • upload_date: Upload date (YYYY-MM-DD)
  • duration: Video duration (HH:MM:SS)
  • description: Video description (truncated to 500 chars)
  • tags: Array of video tags
  • view_count: View count
  • like_count: Like count
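As a rough illustration of how these fields could be assembled, the sketch below builds the frontmatter from a yt-dlp style info dictionary. The function names are hypothetical, and the choice of input keys (`webpage_url`, `upload_date` as a `YYYYMMDD` string, `duration` in seconds) is an assumption about what the script reads. Strings are double-quoted so that colons or quotes in titles cannot break the YAML.

```python
from typing import Any, Dict


def format_duration(seconds: int) -> str:
    """Convert a duration in seconds to HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_upload_date(raw: str) -> str:
    """Convert a YYYYMMDD string to YYYY-MM-DD."""
    return f"{raw[0:4]}-{raw[4:6]}-{raw[6:8]}"


def yaml_quote(value: str) -> str:
    """Double-quote a string, escaping backslashes, quotes, and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def build_frontmatter(info: Dict[str, Any]) -> str:
    """Render the metadata fields listed above as a YAML frontmatter block."""
    description = (info.get("description") or "")[:500]  # truncated to 500 chars
    tags = ", ".join(yaml_quote(tag) for tag in (info.get("tags") or []))
    lines = [
        "---",
        f"title: {yaml_quote(info.get('title') or '')}",
        f"channel: {yaml_quote(info.get('channel') or '')}",
        f"url: {yaml_quote(info.get('webpage_url') or '')}",
        f"upload_date: {yaml_quote(format_upload_date(info.get('upload_date') or ''))}",
        f"duration: {yaml_quote(format_duration(info.get('duration') or 0))}",
        f"description: {yaml_quote(description)}",
        f"tags: [{tags}]",
        f"view_count: {info.get('view_count') or 0}",
        f"like_count: {info.get('like_count') or 0}",
        "---",
    ]
    return "\n".join(lines)
```

Quoting `duration` and `upload_date` keeps them as plain strings; some YAML parsers would otherwise interpret them as numbers or dates.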

Body Structure

The transcript is organized by video chapters (if available):
```markdown
## Chapter Title

00:05:23 Transcript text for this segment.
00:05:45 Next segment text.
```
If no chapters exist, all content appears under a "## Transcript" heading.

Timestamps are formatted as HH:MM:SS for consistency.
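The grouping described above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `render_body` and `format_timestamp` are hypothetical names, segments are assumed to be `(start_seconds, text)` pairs, and chapters are assumed to be dictionaries with `title` and `start_time` keys, which is the shape yt-dlp uses for chapter metadata.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, str]  # (start time in seconds, text)


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render_body(segments: List[Segment], chapters: List[Dict]) -> str:
    """Group segments under chapter headings, or under '## Transcript' if none exist."""
    if not chapters:
        chapters = [{"title": "Transcript", "start_time": 0.0}]
    chapters = sorted(chapters, key=lambda c: c["start_time"])
    blocks = []
    for i, chapter in enumerate(chapters):
        # The first chapter also absorbs anything before its nominal start.
        start = chapter["start_time"] if i > 0 else 0.0
        end = chapters[i + 1]["start_time"] if i + 1 < len(chapters) else float("inf")
        lines = [f"{format_timestamp(t)} {text}" for t, text in segments if start <= t < end]
        blocks.append(f"## {chapter['title']}\n\n" + "\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

A segment is assigned to the chapter whose time range contains its start time, so a cue that straddles a chapter boundary stays with the earlier chapter.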

Workflow

  1. Extract metadata and subtitles using yt-dlp
  2. Parse VTT subtitle format to extract timestamps and text
  3. Group transcript segments by video chapters (if present)
  4. Format as Markdown with YAML frontmatter
  5. Save to ~/Brains/brain/ with sanitized filename based on video title
  6. Clean up temporary subtitle files
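Step 2 of the workflow above is the least obvious, so here is a minimal sketch of what parsing VTT into timestamped text might look like. `parse_vtt` is a hypothetical name, and the sketch makes simplifying assumptions: it keeps only each cue's start time, strips inline tags such as `<c>`, and ignores cue settings, `NOTE` blocks, and styling.

```python
import re
from typing import List, Tuple

# Matches a cue timing line such as "00:05:23.000 --> 00:05:45.000";
# the hours component is optional in WebVTT.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:\d+:)?\d{2}:\d{2}\.\d{3}"
)
TAG = re.compile(r"<[^>]+>")


def parse_vtt(vtt_text: str) -> List[Tuple[float, str]]:
    """Return (start_seconds, text) for each cue that has visible text."""
    segments = []
    for block in re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n")):
        lines = block.strip().split("\n")
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if not match:
                continue  # header, NOTE, or cue identifier line
            hours, minutes, seconds, millis = match.groups()
            start = (
                int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
            )
            raw = " ".join(TAG.sub("", payload) for payload in lines[i + 1:])
            text = " ".join(raw.split())  # collapse runs of whitespace
            if text:
                segments.append((start, text))
            break
    return segments
```

The returned pairs have the same shape that a chapter-grouping step would consume, so the two stages can be chained directly.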

Deduplication

To remove duplicates from existing transcript files:
```bash
python scripts/deduplicate_transcript.py <markdown_file>
```
This removes transcript entries that are prefixes of subsequent entries (common in VTT files where subtitles accumulate).
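The prefix rule can be sketched in a few lines. This is an illustration of the rule as described, not the contents of `deduplicate_transcript.py`; `drop_prefix_entries` is a hypothetical name, and entries are assumed to be `(timestamp, text)` pairs already read from the Markdown file.

```python
from typing import List, Tuple


def drop_prefix_entries(entries: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop each entry whose text is a prefix of the entry that follows it."""
    kept = []
    for i, (timestamp, text) in enumerate(entries):
        has_next = i + 1 < len(entries)
        if has_next and entries[i + 1][1].startswith(text):
            continue  # the next entry already contains this text
        kept.append((timestamp, text))
    return kept
```

For example, the accumulating entries "so today", "so today we will", and "so today we will look at yt-dlp" collapse to the last one. Note that this sketch keeps the timestamp of the longest entry; carrying forward the earliest timestamp instead would be an equally reasonable design.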

Requirements

Ensure yt-dlp is installed:
```bash
pip install yt-dlp
```

Limitations

  • Tries English subtitles first and falls back to Russian if English is unavailable
  • Requires the video to have subtitles (auto-generated or manual)
  • Does not download video or audio files
  • Description truncated to 500 characters in frontmatter
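The English-then-Russian fallback could look roughly like the sketch below. `pick_subtitle_language` is a hypothetical helper, and preferring manual subtitles over auto-generated ones within the same language is an assumption; the source only states the language order. The two inputs correspond to the language codes found under the `subtitles` and `automatic_captions` keys of a yt-dlp info dictionary.

```python
from typing import Iterable, Optional, Tuple

PREFERRED_LANGS = ("en", "ru")  # English first, then Russian


def pick_subtitle_language(
    manual: Iterable[str],
    automatic: Iterable[str],
    preferred: Tuple[str, ...] = PREFERRED_LANGS,
) -> Optional[Tuple[str, str]]:
    """Return (language, kind) for the first preferred language available.

    kind is "manual" or "auto"; None means the video has no usable subtitles.
    """
    manual_set, auto_set = set(manual), set(automatic)
    for lang in preferred:
        if lang in manual_set:
            return lang, "manual"
        if lang in auto_set:
            return lang, "auto"
    return None
```

A `None` result corresponds to the second limitation above: without subtitles in a supported language, there is nothing to extract.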