# Transcribe Refiner - Caption Cleanup Engine
Transform raw auto-generated captions into clean, readable transcripts with zero content loss.
## Core Purpose
Auto-generated captions (Zoom, YouTube, Teams, etc.) are messy: fragmented sentences, timestamps everywhere, speaker tags on every line, filler words, transcription errors. This skill reconstructs them into coherent, flowing text that can be consumed by humans or downstream skills (like lecture-alchemist).
## Critical Rules
### Zero Content Loss
Every substantive statement, technical term, concept, question, and answer from the raw captions MUST appear in the output. Only noise is removed, never content.
**Remove:** Timestamps, redundant speaker tags, filler words (um, uh, basically, right?, you know), technical interruptions ("can you hear me?", "let me share my screen"), duplicate sentences from reconnection.

**Preserve:** Every teaching point, code reference, question asked, answer given, tangent with value, name, URL, command, or technical term.
### Smart Error Correction
Auto-captions make predictable errors. Fix them using domain context:
| Common Error | Likely Correct | Domain Clue |
|---|---|---|
| "lowest function" | "loss function" | AI/ML context |
| "wait" | "weight" | neural network context |
| "epic" | "epoch" | training context |
| "by Torch" | "PyTorch" | ML framework |
| "relaunch bowl" | "relaunch poll" | Zoom context |
| "solidity" vs "Solidity" | capitalize if Web3 | Web3 context |
| "know JS" | "Node.js" | WebDev context |
| "react" vs "React" | capitalize if framework | WebDev context |
When uncertain about a correction, keep the original and flag it: `[unclear: "original text"]`
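A minimal sketch of how such a correction pass could be applied in code. The misrecognitions and domain labels come from the table above; the `CORRECTIONS` dictionary and function names are illustrative, not part of the skill's defined interface:

```python
import re

# Per-domain correction maps built from the table above (illustrative subset).
CORRECTIONS = {
    "ai-ml": {"lowest function": "loss function", "epic": "epoch", "by torch": "PyTorch"},
    "webdev": {"know js": "Node.js"},
}

def apply_corrections(text: str, domain: str) -> str:
    """Replace known misrecognitions for the detected domain, case-insensitively."""
    for wrong, right in CORRECTIONS.get(domain, {}).items():
        # Word boundaries prevent rewriting substrings inside larger words.
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text, flags=re.IGNORECASE)
    return text

def flag_unclear(phrase: str) -> str:
    """Wrap a phrase we are not confident about instead of guessing."""
    return f'[unclear: "{phrase}"]'
```

Anything not covered by the map falls through to `flag_unclear`, matching the rule above: keep the original rather than invent a correction.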
### Speaker Handling
- Identify unique speakers from tags
- Normalize names (e.g., `[rishabh]` → **Rishabh:**)
- Only include speaker attribution at natural conversation changes
- For single-speaker lectures, omit speaker tags entirely after initial identification
- For Q&A, clearly mark **Student:** and **Instructor:**
## Input Formats
| Format | Characteristics | Handling |
|---|---|---|
| Zoom captions (.txt) | Timestamped lines with speaker tags | Strip timestamps, merge fragments |
| YouTube (.vtt/.srt) | Numbered blocks with timecodes | Strip timecodes and sequence numbers |
| Otter.ai | Speaker-labeled paragraphs | Normalize speaker labels |
| Teams | Timestamped speaker blocks | Strip timestamps, merge |
| Raw paste | Mixed format | Auto-detect and clean |
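For the .vtt/.srt row, the handling could look like the sketch below: drop the header, sequence numbers, and `00:00:01.000 --> 00:00:03.000` timecode lines, keeping only caption text. The regex is illustrative; real VTT files may also carry cue settings after the timecode, which this still skips because the match is anchored at line start:

```python
import re

# A cue timing line: "HH:MM:SS.mmm --> HH:MM:SS.mmm" (SRT uses commas instead of dots).
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2}[.,]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[.,]\d{3}")

def strip_vtt(raw: str) -> str:
    """Drop WEBVTT header, sequence numbers, and timecode lines; keep caption text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line == "WEBVTT" or line.isdigit() or TIMECODE.match(line):
            continue
        kept.append(line)
    return " ".join(kept)
```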
## Processing Steps
- **Strip noise** - Remove timestamps, sequence numbers, formatting artifacts
- **Merge fragments** - Join broken sentences across caption blocks
- **Remove filler** - Strip "um", "uh", "basically", "right?", "you know" (but keep them when they carry meaning, like "right?" as a genuine question)
- **Fix transcription errors** - Use domain context to correct obvious misrecognitions
- **Remove technical interruptions** - "Can you hear me?", "Let me share my screen", "Is my screen visible?", connection issues
- **Form paragraphs** - Group related sentences into natural paragraphs by topic
- **Identify sections** - Insert `---` breaks at major topic transitions
- **Normalize Q&A** - Clearly separate questions from instruction
- **Add metadata header** - Speaker(s), estimated duration, domain detected
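The filler-removal and fragment-merging steps above can be sketched as follows. The word list comes from the steps; dropping only comma-delimited "you know"/"basically" asides is a rough heuristic for "keep fillers that carry meaning", not the full judgment the skill describes:

```python
import re

# Standalone "um"/"uh", or ", you know,"-style asides set off by commas.
FILLERS = re.compile(r"\b(um|uh)\b[,.]?\s*|,\s*(you know|basically)\s*,", flags=re.IGNORECASE)

def remove_filler(text: str) -> str:
    """Strip filler tokens, then collapse any doubled spaces left behind."""
    cleaned = FILLERS.sub(lambda m: ", " if m.group(2) else "", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

def merge_fragments(blocks: list[str]) -> str:
    """Join caption blocks, continuing a sentence unless the previous block ended one."""
    out = ""
    for block in blocks:
        sep = " " if not out or out[-1] not in ".!?" else "\n"
        out = (out + sep + block.strip()).strip()
    return out
```

A sentence-final "right?" survives this pass because it is not in the comma-delimited pattern, which approximates the "genuine question" exception.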
## Output Format
```markdown
# Transcript: [Topic/Title if identifiable]

**Speaker(s):** [Name(s)]
**Estimated Duration:** [from timestamp range]
**Domain:** [Auto-detected: WebDev / AI-ML / Web3 / DSA / General]
**Cleaning Notes:** [e.g., "Fixed 12 transcription errors, removed ~45 filler instances"]

[Clean, flowing paragraphs organized by topic]

[Natural paragraph breaks at topic changes]

---

[Next topic section]

## Q&A Segments

**Student:** [Question]
**Instructor:** [Answer]
```
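Rendering the metadata header of that template from values collected during cleaning might look like this (the function name and parameter names are illustrative):

```python
def metadata_header(speakers: list[str], duration: str, domain: str, notes: str) -> str:
    """Render the metadata block of the output template above."""
    return (
        f"**Speaker(s):** {', '.join(speakers)}\n"
        f"**Estimated Duration:** {duration}\n"
        f"**Domain:** {domain}\n"
        f"**Cleaning Notes:** {notes}\n"
    )
```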
## Topic Inventory (Anti-Loss System)
This is the critical mechanism that prevents data loss across the pipeline. After cleaning, generate a Topic Inventory at the end of the output: a manifest of every substantive item found in the transcript.
```markdown
## Topic Inventory

### Concepts Mentioned
- [Concept] - paragraph [N]
- [Concept] - paragraph [N]
...

### Technical Terms Introduced
- [term]: first mentioned in paragraph [N]
...

### Code/Commands Referenced
- [code snippet or command] - paragraph [N]
...

### Questions Asked (Q&A)
- Q: [question summary] - paragraph [N]
...

### Names/Resources Mentioned
- [name, URL, tool, book, etc.]
...

### Corrections Applied
| Original Caption | Corrected To | Confidence |
|---|---|---|
| "lowest function" | "loss function" | High |
| "epic" | "epoch" | High |
| [unclear text] | [kept as-is] | Low |

### Stats
- Raw caption blocks: [N]
- Substantive paragraphs produced: [N]
- Filler instances removed: [N]
- Transcription errors corrected: [N]
- Uncertain corrections flagged: [N]
```

This inventory travels to the next stage (lecture-alchemist) for cross-verification. Every item in this inventory MUST appear in the final notes.
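The cross-verification step could be sketched as a simple containment check (the function name is illustrative; a real check might match on normalized terms rather than raw substrings):

```python
def verify_inventory(inventory_items: list[str], final_notes: str) -> list[str]:
    """Return inventory items missing from the final notes (case-insensitive containment)."""
    notes = final_notes.lower()
    return [item for item in inventory_items if item.lower() not in notes]
```

A downstream skill like lecture-alchemist would treat any non-empty return value as a content-loss failure and restore the missing items.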
## Timestamp Anchors
Preserve approximate timestamps as hidden anchors for key topic transitions. Format:
```markdown
<!-- T:20:36:30 --> Neural network architecture introduction
<!-- T:20:45:12 --> Activation functions
<!-- T:21:03:45 --> Training loop
```

These allow the reader to jump back to the recording at specific points.
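Generating such anchors from (timestamp, topic) pairs collected at topic transitions might look like this (the function name is illustrative; the comment format is the one shown above):

```python
def anchor_line(timestamp: str, topic: str) -> str:
    """Render a hidden HTML-comment anchor for a topic transition."""
    return f"<!-- T:{timestamp} --> {topic}"
```

Because the anchor is an HTML comment, it stays invisible in rendered markdown while remaining searchable in the source.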
## Quality Checklist
Before output, verify:
- Every teaching point from raw input is in the output
- Topic Inventory is complete and accurate
- Transcription errors corrected using domain context
- Uncertain corrections flagged with `[unclear: ...]`
- Filler words removed without losing meaning
- Sentences properly merged (no mid-word breaks)
- Q&A segments clearly separated
- Technical interruptions removed
- Timestamp anchors placed at topic transitions
- Output reads as natural, flowing text
## Pipeline Position
This skill is Stage 1 in the lecture processing pipeline:
1. **transcribe-refiner** (this) → clean transcript + Topic Inventory
2. **lecture-alchemist** → structured study notes (verifies against inventory)
3. **concept-cartographer** → visual diagrams (verifies against inventory)
4. **obsidian-markdown** → Obsidian vault formatting