youtube-transcript
YouTube Transcript
Overview
Extract YouTube video transcripts, metadata, and chapters using yt-dlp. Output formatted as Markdown with YAML frontmatter, saved to ~/Brains/brain/ (Obsidian vault).
Quick Start
To extract a transcript from a YouTube video:
```bash
python scripts/extract_transcript.py <youtube_url>
```
Optional: specify a custom output filename:
```bash
python scripts/extract_transcript.py <youtube_url> custom_filename.md
```
Output Format
YAML Frontmatter
The generated Markdown includes comprehensive metadata:
- `title` - Video title
- `channel` - Channel name
- `url` - YouTube URL
- `upload_date` - Upload date (YYYY-MM-DD)
- `duration` - Video duration (HH:MM:SS)
- `description` - Video description (truncated to 500 chars)
- `tags` - Array of video tags
- `view_count` - View count
- `like_count` - Like count
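As a rough illustration of how these fields might be assembled, the sketch below builds a frontmatter block from a metadata dictionary. It assumes the dictionary uses yt-dlp's info field names (`webpage_url`, `upload_date` as YYYYMMDD, `duration` in seconds). The helper names `format_duration`, `format_upload_date`, and `build_frontmatter` are hypothetical and not taken from `extract_transcript.py`. Values are written as JSON literals, which YAML parsers generally accept.

```python
import json


def format_duration(seconds):
    """Convert a duration in seconds to HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_upload_date(raw):
    """Convert YYYYMMDD (assumed input form) to YYYY-MM-DD."""
    if len(raw) != 8:
        return raw  # leave unexpected values untouched
    return f"{raw[:4]}-{raw[4:6]}-{raw[6:8]}"


def build_frontmatter(info):
    """Render the metadata fields as a YAML frontmatter block."""
    fields = {
        "title": info.get("title", ""),
        "channel": info.get("channel", ""),
        "url": info.get("webpage_url", ""),
        "upload_date": format_upload_date(info.get("upload_date") or ""),
        "duration": format_duration(info.get("duration") or 0),
        # Description is cut to 500 characters, as described above.
        "description": (info.get("description") or "")[:500],
        "tags": info.get("tags") or [],
        "view_count": info.get("view_count") or 0,
        "like_count": info.get("like_count") or 0,
    }
    lines = ["---"]
    for key, value in fields.items():
        # JSON quoting keeps colons, quotes, and newlines from breaking the YAML.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines)
```

Quoting every string is a deliberately conservative choice: video titles and descriptions often contain colons or quotes that would otherwise need case-by-case escaping.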
Body Structure
The transcript is organized by video chapters (if available):
```markdown
## Chapter Title

00:05:23 Transcript text for this segment.

00:05:45 Next segment text.
```
If no chapters exist, all content appears under a "## Transcript" heading.
Timestamps are formatted as HH:MM:SS for consistency.
Workflow
- Extract metadata and subtitles using yt-dlp
- Parse VTT subtitle format to extract timestamps and text
- Group transcript segments by video chapters (if present)
- Format as Markdown with YAML frontmatter
- Save to ~/Brains/brain/ with sanitized filename based on video title
- Clean up temporary subtitle files
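The parsing, grouping, and formatting steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `parse_vtt`, `group_by_chapters`, and `render_body` are hypothetical names, chapters are assumed to be dictionaries with `title`, `start_time`, and `end_time` keys (the shape yt-dlp reports), and the parser covers only simple cue blocks.

```python
import re

# Matches the start of a cue timing line such as "00:05:23.000 --> 00:05:45.000".
# The hours part is optional in WebVTT.
CUE_RE = re.compile(r"^(?:(\d+):)?(\d{2}):(\d{2})\.\d{3}\s+-->")
# Inline markup such as <c> or <00:05:24.100> found in auto-generated captions.
TAG_RE = re.compile(r"<[^>]+>")


def parse_vtt(text):
    """Return a list of (start_seconds, text) tuples from WebVTT content."""
    segments = []
    start, buffer = None, []

    def flush():
        if start is not None and buffer:
            segments.append((start, " ".join(buffer)))

    for raw in text.splitlines():
        line = raw.strip()
        match = CUE_RE.match(line)
        if match:
            flush()
            hours, minutes, seconds = match.groups()
            start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            buffer = []
        elif not line:
            # A blank line ends the current cue.
            flush()
            start, buffer = None, []
        elif start is not None:
            cleaned = TAG_RE.sub("", line).strip()
            if cleaned:
                buffer.append(cleaned)
    flush()
    return segments


def group_by_chapters(segments, chapters):
    """Return (title, segments) pairs; one "Transcript" group if no chapters."""
    if not chapters:
        return [("Transcript", list(segments))]
    grouped = [(chapter["title"], []) for chapter in chapters]
    for start, text in segments:
        for chapter, (_title, bucket) in zip(chapters, grouped):
            if chapter["start_time"] <= start < chapter["end_time"]:
                bucket.append((start, text))
                break  # segments outside every chapter are dropped in this sketch
    return grouped


def render_body(grouped):
    """Format grouped segments as Markdown with HH:MM:SS timestamps."""
    lines = []
    for title, items in grouped:
        lines.append(f"## {title}")
        for start, text in items:
            hours, remainder = divmod(start, 3600)
            minutes, seconds = divmod(remainder, 60)
            lines.append(f"{hours:02d}:{minutes:02d}:{seconds:02d} {text}")
    return "\n".join(lines)
```

Assigning each segment by its start time keeps the grouping simple; a segment that straddles a chapter boundary stays with the chapter in which it begins.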
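For the save step, one possible way to derive a safe filename from a video title is sketched below. The exact rules used by the script are not documented here, so the character set, the 100-character limit, the `untitled` fallback, and the names `sanitize_filename` and `output_path` are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Target directory named in this document.
VAULT_DIR = Path.home() / "Brains" / "brain"


def sanitize_filename(title, max_length=100):
    """Replace characters that are unsafe in filenames and trim the result."""
    # Swap reserved characters and control characters for spaces.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', " ", title)
    # Collapse runs of whitespace and drop leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    cleaned = cleaned[:max_length].rstrip(" .")
    return (cleaned or "untitled") + ".md"


def output_path(title):
    """Return the full path of the note for a given video title."""
    return VAULT_DIR / sanitize_filename(title)
```

For example, `sanitize_filename('What is "AI"? Part 1/2')` yields `What is AI Part 1 2.md`.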
Deduplication
To remove duplicates from existing transcript files:
```bash
python scripts/deduplicate_transcript.py <markdown_file>
```
This removes transcript entries that are prefixes of subsequent entries (common in VTT files where subtitles accumulate).
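The prefix rule can be illustrated with a short sketch. It assumes entries are `(timestamp, text)` pairs and that "subsequent" means the immediately following entry; the function name `drop_prefix_entries` is hypothetical, and the real script may parse and compare entries differently.

```python
def drop_prefix_entries(entries):
    """Remove entries whose text is a prefix of the next entry's text."""
    kept = []
    for index, (timestamp, text) in enumerate(entries):
        if index + 1 < len(entries):
            next_text = entries[index + 1][1]
            # An accumulating subtitle repeats the earlier text and extends it,
            # so the shorter entry carries no extra information.
            if next_text.startswith(text):
                continue
        kept.append((timestamp, text))
    return kept


entries = [
    ("00:00:01", "Hello"),
    ("00:00:02", "Hello world"),
    ("00:00:04", "Hello world again"),
    ("00:00:06", "New line"),
]
deduplicated = drop_prefix_entries(entries)
```

Here `deduplicated` keeps only the `00:00:04` and `00:00:06` entries. Note that under this rule the surviving entry carries the later timestamp, and an exact repeat is treated as a prefix of its successor.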
Requirements
Ensure yt-dlp is installed:
```bash
pip install yt-dlp
```
Limitations
- Tries English subtitles first and falls back to Russian if English is unavailable
- Requires video to have subtitles (auto-generated or manual)
- Does not download video or audio files
- Description truncated to 500 characters in frontmatter
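The language fallback and the no-download behaviour might look roughly like the sketch below when using yt-dlp's Python API. How `extract_transcript.py` actually does this is not shown in this document, so `pick_subtitle_language`, `build_ydl_options`, and `fetch_info` are hypothetical helpers. The option keys and the `subtitles` / `automatic_captions` fields are standard yt-dlp names. Language codes are matched exactly, which is an assumption (regional codes such as `en-US` would not match).

```python
def pick_subtitle_language(info, preferred=("en", "ru")):
    """Return the first preferred language that has manual or automatic subtitles."""
    available = set(info.get("subtitles") or {}) | set(info.get("automatic_captions") or {})
    for lang in preferred:
        if lang in available:
            return lang
    return None  # the video has no usable subtitles


def build_ydl_options(lang, output_template):
    """Build yt-dlp options that fetch VTT subtitles without any media."""
    return {
        "skip_download": True,       # no video or audio is downloaded
        "writesubtitles": True,      # manually created subtitles
        "writeautomaticsub": True,   # auto-generated subtitles
        "subtitleslangs": [lang],
        "subtitlesformat": "vtt",
        "outtmpl": output_template,
    }


def fetch_info(url):
    """Fetch video metadata only; requires yt-dlp to be installed."""
    import yt_dlp  # imported lazily so the helpers above work without it

    with yt_dlp.YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        return ydl.extract_info(url, download=False)
```

A caller would typically run `fetch_info` first, pass the result to `pick_subtitle_language`, and stop with a clear error when it returns `None`.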