youtube-to-markdown


YouTube to Markdown


Multiple videos: Process one video at a time, sequentially. Do not run parallel extractions. Execute all steps sequentially without asking for user approval. Use TodoWrite to track progress.

Step 0: Check extracted before


```bash
python3 ./check_existing.py "<YOUTUBE_URL>" "<output_directory>"
```

Integrity check:
  • If `summary_valid: false`: Show the issues to the user and ask "File incomplete: [issues]. Reprocess?" If yes, continue to Step 1.
  • If `transcript_valid: false`: Ask the user; if yes, re-run Steps 2-3 and 7-9.
  • If `comments_valid: false`: Ask the user; if yes, re-run comment analysis.
If the script returns `exists: true` AND all valid fields are true: Read and follow UPDATE_MODE.md for the update workflow.
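The output format of check_existing.py is not documented in this section. Assuming it emits JSON with the fields referenced above (`exists`, `summary_valid`, `transcript_valid`, `comments_valid`), the gating logic can be sketched as:

```python
# Hypothetical sketch of the Step 0 gating logic. The field names mirror
# those referenced above; check_existing.py's real output may differ.
import json

def plan_reruns(check_output: str) -> list[str]:
    """Return which parts of the pipeline to revisit after the check."""
    result = json.loads(check_output)
    if not result.get("exists"):
        return ["full-run"]                  # never processed: run Steps 1-10
    actions = []
    if not result.get("summary_valid", True):
        actions.append("reprocess")          # ask user, then continue to Step 1
    if not result.get("transcript_valid", True):
        actions.append("steps-2-3-7-9")      # ask user, then re-run 2-3 and 7-9
    if not result.get("comments_valid", True):
        actions.append("comment-analysis")   # ask user, then re-run Step 10
    if not actions:
        actions.append("update-mode")        # all valid: follow UPDATE_MODE.md
    return actions

print(plan_reruns('{"exists": true, "transcript_valid": false}'))
# → ['steps-2-3-7-9']
```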

Step 1: Extract data (metadata, description, chapters)


```bash
python3 extract_data.py "<YOUTUBE_URL>" "<output_directory>"
```

Creates: youtube_{VIDEO_ID}_metadata.md, youtube_{VIDEO_ID}_description.md, youtube_{VIDEO_ID}_chapters.json
IMPORTANT: If you asked the user which transcript language to extract, do not translate that language to English, and instruct subagents not to translate either. Translate only if the user requests a language other than the original.

Step 2: Extract transcript


Primary method (if transcript available)


If the video language is `en`, proceed directly. If non-English, ask the user which language to download.

```bash
python3 extract_transcript.py "<YOUTUBE_URL>" "<output_directory>" "<LANG_CODE>"
```

Creates: youtube_{VIDEO_ID}_transcript.vtt
IMPORTANT: All file output must be in the same language as discovered in Step 2. If the language is not English, explicitly instruct all subagents to preserve the original language.
The download may fail if a video is private, age-restricted, or geo-blocked.

Fallback (only if transcript unavailable)


Ask user: "No transcript available. Proceed with Whisper transcription?
  • Mac/Apple Silicon: Uses MLX Whisper if installed (faster, see SETUP_MLX_WHISPER.md)
  • All platforms: Falls back to OpenAI Whisper (requires: brew install openai-whisper OR pip3 install openai-whisper)"
```bash
python3 extract_transcript_whisper.py "<YOUTUBE_URL>" "<output_directory>"
```
Script auto-detects MLX Whisper on Mac and uses it if available, otherwise uses OpenAI Whisper.

Step 3: Deduplicate transcript


Set BASE_NAME from the Step 1 output (youtube_{VIDEO_ID}).

```bash
python3 ./deduplicate_vtt.py "<output_directory>/${BASE_NAME}_transcript.vtt" "<output_directory>/${BASE_NAME}_transcript_dedup.md" "<output_directory>/${BASE_NAME}_transcript_no_timestamps.txt"
```
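The implementation of deduplicate_vtt.py is not shown here. The core idea it presumably relies on is that auto-generated VTT captions "roll": consecutive cues repeat the previous line plus one new line. A minimal sketch of collapsing those repeats:

```python
# Illustrative sketch only; the real deduplicate_vtt.py also parses VTT cue
# timing and produces the timestamped and timestamp-free variants.
def dedup_lines(lines: list[str]) -> list[str]:
    """Drop blank lines and consecutive duplicate caption lines."""
    out: list[str] = []
    for line in lines:
        line = line.strip()
        if line and (not out or line != out[-1]):
            out.append(line)
    return out

cues = ["so today we look at", "so today we look at", "markdown pipelines"]
print(dedup_lines(cues))  # → ['so today we look at', 'markdown pipelines']
```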

Step 4: Add natural paragraph breaks


Parallel with Step 5.
task_tool:
  • subagent_type: "general-purpose"
  • model: "sonnet"
  • prompt:
INPUT: <output_directory>/${BASE_NAME}_transcript_no_timestamps.txt
CHAPTERS: <output_directory>/${BASE_NAME}_chapters.json
OUTPUT: <output_directory>/${BASE_NAME}_transcript_paragraphs.txt

Analyze INPUT and identify natural paragraph break line numbers.

Read CHAPTERS. If it contains chapters, use chapter timestamps as primary break points.

Target ~500 chars per paragraph. Find natural break points at topic shifts or sentence endings.

Write to OUTPUT in format:
15,42,78,103,...
```bash
python3 ./apply_paragraph_breaks.py "<output_directory>/${BASE_NAME}_transcript_dedup.md" "<output_directory>/${BASE_NAME}_transcript_paragraphs.txt" "<output_directory>/${BASE_NAME}_transcript_paragraphs.md"
```
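A minimal sketch of what apply_paragraph_breaks.py presumably does with the comma-separated break file; the real script's behavior may differ (for instance, it works from the dedup file, which still carries timestamps):

```python
# Hypothetical sketch: the subagent writes 1-based line numbers as
# "15,42,78,...", and a paragraph break is a blank line in Markdown.
def apply_breaks(text_lines: list[str], breaks_csv: str) -> str:
    break_set = {int(n) for n in breaks_csv.split(",") if n.strip()}
    out: list[str] = []
    for i, line in enumerate(text_lines, start=1):
        out.append(line)
        if i in break_set:
            out.append("")  # blank line after each break point
    return "\n".join(out)

print(repr(apply_breaks(["a", "b", "c", "d"], "2")))  # → 'a\nb\n\nc\nd'
```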

Step 5: Summarize transcript


Parallel with Step 4.
task_tool:
  • subagent_type: "general-purpose"
  • model: "sonnet"
  • prompt:
INPUT: <output_directory>/${BASE_NAME}_transcript_no_timestamps.txt
OUTPUT: <output_directory>/${BASE_NAME}_summary.md
FORMATS: ./summary_formats.md

1. Classify content type:
   - TIPS: gear reviews, rankings, "X ways to...", practical advice lists
   - INTERVIEW: podcasts, conversations, Q&A, multiple perspectives
   - EDUCATIONAL: concept explanations, analysis, "how X works"
   - TUTORIAL: step-by-step instructions, coding, recipes

2. Analyze content structure:
   - Identify meaningful content units (topic shifts, argument structure, narrative breaks)
   - If single continuous topic, omit content unit headers
   - Skip ads, sponsors, self-promotion ("like and subscribe", merch, etc.)
   - Merge content spanning ad breaks if thematically connected

3. Read FORMATS file and use format for detected content type. Target <10% of transcript bytes.


ACTION REQUIRED: Use the Write tool NOW to save output to OUTPUT file. Do not ask for confirmation.
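The "<10% of transcript bytes" budget above is easy to check mechanically. A small illustrative helper, not part of the pipeline itself:

```python
# Byte budget check for the summary. Uses UTF-8 byte length, matching the
# "transcript bytes" wording above; note CJK text costs ~3 bytes per character.
def within_budget(transcript: str, summary: str, ratio: float = 0.10) -> bool:
    summary_bytes = len(summary.encode("utf-8"))
    transcript_bytes = len(transcript.encode("utf-8"))
    return summary_bytes < ratio * transcript_bytes

print(within_budget("x" * 1000, "y" * 50))  # → True (50 < 100 bytes)
```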

Step 6: Review and tighten summary


task_tool:
  • subagent_type: "general-purpose"
  • model: "sonnet"
  • prompt:
INPUT: <output_directory>/${BASE_NAME}_summary.md
OUTPUT: <output_directory>/${BASE_NAME}_summary_tight.md
FORMATS: ./summary_formats.md

You are an adversarial copy editor. Cut fluff, enforce quality.

Rules:
- Read FORMATS; the format has been selected based on the content type. Preserve the format and do not treat the byte budget as a reason to strip it.
- Byte budget: <10% of transcript bytes
- Hidden Gems: Remove if duplicates main content
- Tightness: Cut filler words, compress verbose explanations, prefer lists over prose

Preserve original language - do not translate.

ACTION REQUIRED: Use the Write tool NOW to save output to OUTPUT file. Do not ask for confirmation.

Step 7: Clean speech artifacts


task_tool:
  • subagent_type: "general-purpose"
  • model: "haiku"
  • prompt:
Read <output_directory>/${BASE_NAME}_transcript_paragraphs.md and clean speech artifacts.

Tasks:
- Remove fillers (um, uh, like, you know)
- Fix transcription errors
- Add proper punctuation
- Add or remove implied words to improve flow
- Preserve natural voice and tone
- Keep timestamps at end of paragraphs

ACTION REQUIRED: Use the Write tool NOW to save output to <output_directory>/${BASE_NAME}_transcript_cleaned.md. Do not ask for confirmation.
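For contrast with the subagent's job, the purely mechanical part of filler removal can be sketched as follows. This is illustrative only: the prompt delegates to an LLM precisely because context-sensitive fillers ("like", "you know" as a verb phrase) and transcription errors need judgment a pattern match lacks.

```python
import re

# Naive filler stripper for English transcripts. Deliberately omits "like",
# which a regex cannot distinguish from legitimate uses.
FILLERS = re.compile(r"\b(um+|uh+|you know)\b[,]?\s*", flags=re.IGNORECASE)

def strip_fillers(text: str) -> str:
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_fillers("Um, so this, uh, works, you know, fine."))
# → so this, works, fine.
```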

Step 8: Add topic headings


task_tool:
  • subagent_type: "general-purpose"
  • model: "sonnet"
  • prompt:
INPUT: <output_directory>/${BASE_NAME}_transcript_cleaned.md
OUTPUT: <output_directory>/${BASE_NAME}_transcript.md

Read the INPUT file. Add markdown headings.

Read <output_directory>/${BASE_NAME}_chapters.json:
- If contains chapters: Use chapter names as ### headings at chapter timestamps, add #### headings for subtopics
- If empty: Add ### headings where major topics change

ACTION REQUIRED: Use the Write tool NOW to save output to OUTPUT file. Do not ask for confirmation.
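The chapters.json schema is not documented in this section, so the `title` and `start_time` fields below are assumptions for illustration. A sketch of turning chapters into ### headings:

```python
# Hypothetical sketch: assumes chapters.json is a list of objects with
# "title" and "start_time" (seconds). The real schema may differ.
import json

def chapter_headings(chapters_json: str) -> list[str]:
    chapters = json.loads(chapters_json)
    headings = []
    for ch in chapters:
        m, s = divmod(int(ch["start_time"]), 60)
        headings.append(f"### {ch['title']} [{m:02d}:{s:02d}]")
    return headings

print(chapter_headings('[{"title": "Intro", "start_time": 0}]'))
# → ['### Intro [00:00]']
```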

Step 9: Create output files


```bash
python3 finalize.py "${BASE_NAME}" "<output_directory>"
```

The script uses templates to create two final files: a summary file with metadata and summary, and a transcript file with description and transcript. It removes intermediate work files.
Outputs:
  • youtube - {title} ({video_id}).md
    - Main summary
  • youtube - {title} - transcript ({video_id}).md
    - Description and transcript
Use the `--debug` flag to keep intermediate work files for inspection.
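The two output filename templates above can be sketched directly (the real finalize.py also fills the Markdown templates and deletes intermediate files):

```python
# Filename templating only, following the patterns listed above.
def output_names(title: str, video_id: str) -> tuple[str, str]:
    summary = f"youtube - {title} ({video_id}).md"
    transcript = f"youtube - {title} - transcript ({video_id}).md"
    return summary, transcript

print(output_names("Demo", "abc123"))
# → ('youtube - Demo (abc123).md', 'youtube - Demo - transcript (abc123).md')
```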

Step 10: Comment analysis


If youtube-comment-analysis skill is available, run it with the same YouTube URL and output directory.