# Podcast Content Editing

Generate transcript → Generate review draft (topic-level outline + sentence-level deletion suggestions) → User review → Execute editing
## Quick Start

- User: Help me cut the nonsense in the podcast
- User: content edit
- User: Generate a transcript and mark the content to be deleted
## Input

- Audio/video files
- (Optional) List of speaker names

## Output

- Review draft: a single file containing the content outline + full transcript + deletion marks
- After confirmation: editing is executed

Note: No separate transcript file is output; the review draft already includes the full transcript.
## Two Levels for Collaborative Use

| Level | Position | Granularity | Applicable Scenario |
|---|---|---|---|
| Topic-level | Part I of the review draft (content outline) | 5-30 minutes per block | Quick rough cut: delete entire off-topic/chit-chat segments |
| Sentence-level | Part IV of the review draft (main text) | Inline sentence marking | Fine adjustment: view context |
## Workflow
1. Transcribe audio (FunASR + sentence-level timestamps + speaker diarization)
↓
2. Silence detection (FFmpeg silencedetect, identify long blank segments)
↓
3. Generate transcript (with speaker labels)
↓
4. AI analysis: Identify topic structure + mark suggested deletions
↓
5. Output review draft (content outline + silence segments + deletion suggestions)
↓
【User directly modifies deletion marks on the review draft】
↓
6. Execute editing → /podcastcut-edit (parse deletion marks from review draft)
## Technical Notes

| Feature | Implementation |
|---|---|
| Transcription | FunASR (must use the full model path; see code below) |
| Timestamps | Sentence-level (returned automatically by FunASR) |
| Speaker diarization | FunASR's built-in CAM++ model |
| Silence detection | FFmpeg (threshold -40dB, minimum duration 3s) |
### ⚠️ Must Use the Script for Transcription

Do not write your own code; call the existing scripts directly:

```bash
# Transcribe (outputs podcast_transcript.json)
python ~/.claude/skills/podcastcut-content/scripts/transcribe.py <audio file> <output directory>

# Generate transcript (outputs podcast_逐字稿.md)
python ~/.claude/skills/podcastcut-content/scripts/generate_transcript.py \
    <transcript.json> <output.md> '{"0":"响歌歌","1":"麦雅","2":"安安"}'
```

Why can't you write your own code? FunASR needs the full model path plus all four models (ASR + VAD + Punc + Speaker) to return sentence-level timestamps and speaker IDs; simplified model names will cause transcription to fail.
For performance references and common issues, see `tips/转录最佳实践.md`.
Why sentence-level instead of character-level? Character-level timestamps combined with speaker diarization are unstable for long audio (OOM, alignment errors after segmentation). This Skill only deletes entire sentences; finer-grained deletions (half sentences, filler words) are left to /podcastcut-transcribe.
## Transcript Format

```markdown
**Maia** 00:05
Let's start.

**响歌歌** 00:06
Really?

**Maia** 00:08
Great, OK. Hello everyone, welcome to today's 5:15 Podcast, I'm host Maia. Today we're going to talk about something fun.

**响歌歌** 00:20
I'm host 响歌歌. Alright, let's get started.
```
## Format Rules

| Element | Format |
|---|---|
| Speaker | Name in bold |
| Timestamp | MM:SS or H:MM:SS (start time of the speaker's turn) |
| Content | Sentences from the same speaker are concatenated into one paragraph; no line break per sentence |
| Speaker change | Add a blank line |
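The rules above can be sketched as a small formatter. This is only a sketch: `fmt_ts` and `render_transcript` are hypothetical helpers, and the sentence objects are assumed to follow the sentence-level JSON format shown later in this document.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS beyond one hour."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"


def render_transcript(sentences, names):
    """Render sentences as bold speaker + start time, with all sentences
    from the same speaker concatenated into one paragraph."""
    blocks = []  # each block: [name, start_time, [sentence texts...]]
    for s in sentences:
        name = names.get(str(s["spk"]), f"spk{s['spk']}")
        if blocks and blocks[-1][0] == name:
            blocks[-1][2].append(s["text"])  # same speaker: concatenate
        else:
            blocks.append([name, s["start"], [s["text"]]])
    # "".join (no separator) suits Chinese text; blank line between speakers
    return "\n\n".join(f"**{n}** {fmt_ts(t)}\n" + "".join(parts)
                       for n, t, parts in blocks)
```

The blank line between blocks implements the "speaker change" rule, and the single start timestamp per block implements "start time of the speaker's turn".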
## Review Draft Format

The review draft integrates the topic-level outline and sentence-level deletion suggestions, so users can complete the review in one file.

```markdown
# Podcast Review Draft

**File**: podcast.mp3
**Total Duration**: 2:08:07

---

## I. Content Outline (Topic-level)

| # | Topic | Time Range | Duration | AI Suggestion | Notes |
|---|------|------|------|--------|------|
| 1 | Opening small talk + technical debugging | 00:00 - 04:45 | 04:45 | 🗑️ Delete | Recording preparation, technical issues |
| 2 | Official opening + guest introduction | 04:45 - 07:00 | 02:15 | ✅ Keep | Official start of the podcast |
| 3 | Chit-chat: Guest background | 05:48 - 07:01 | 01:13 | 🗑️ Delete | Irrelevant to the topic |
| 4 | Topic discussion | 07:01 - 40:00 | 32:59 | ✅ Keep | Core content |
| 5 | Recording discussion (middle) | 49:59 - 51:13 | 01:14 | 🗑️ Delete | Discussion of editing matters |

**Statistics**: Suggested keep 2:03:21 | Suggested delete 08:12
**Operation**: `Delete topics 1, 3, 5` or `Keep only topics 2, 4`

---

## II. Silence Segments

| # | Time Range | Duration | Location |
|---|------|------|----------|
| 1 | 12:34 - 12:48 | 00:14 | Between Topic 2 and Topic 3 |
| 2 | 35:20 - 35:58 | 00:38 | Guest's thinking pause |
| 3 | 1:02:15 - 1:02:45 | 00:30 | Mid-recording disconnection/silence |

**Statistics**: 3 silence segments, total duration 01:22
**Operation**: `Delete all silence` or `Delete silence 1, 3` (keep 2, which may be an intentional pause)

---

## III. Statistics

- Total sentences: 3390
- Suggested deletions: 377 instances
- Silence segments: 3 instances (01:22)

### By Type

- Opening small talk: 31 instances
- Chit-chat - personal background: 23 instances
- Technical debugging: 15 instances
- Recording discussion: 6 instances
- Privacy - company name: 5 instances
- Privacy - location: 4 instances
- Privacy - school name: 3 instances

---

## IV. Main Text (Transcript + Deletion Marks)

**⚠️ Must include the full transcript, from the first sentence to the last, no content can be omitted!**

Incorrect practice: `(The following content is topic discussion, keep...)` ← Not allowed!
Correct practice: Output every sentence, whether or not it is marked for deletion.

The full transcript follows. Content suggested for deletion by AI is marked with ~~strikethrough~~ and the reason is noted. Sentences from the same speaker are concatenated; no line break per sentence.

**响歌歌** 00:00
~~Alright, right, you shouldn't hear any noise. Because I remember last time we used this there was noise, just like that psychological counseling session, um, right, this time it should be better.~~ `[Delete: Opening small talk]`

**麦雅** 00:23
~~I'll turn on this dog, okay~~ `[Delete: Opening small talk]`

**安安** 00:27
~~I'll also turn on my self-introduction.~~ `[Delete: Opening small talk]`

...

**麦雅** 04:50
Hello everyone, welcome to today's 5:15 Podcast, I'm host 麦雅.

**响歌歌** 04:58
I'm host 响歌歌.

**麦雅** 05:02
Today we have a special guest, 安安.

**安安** 05:08
Hello everyone, I'm 安安.

...

**安安** 15:32
~~When I worked at Google before~~ `[Delete: Privacy - company name]` When I worked before, I encountered a similar situation.

...

**麦雅** 49:59
~~Should we cut this segment?~~ `[Delete: Recording discussion]`

**响歌歌** 50:02
~~Hmm, let's check later.~~ `[Delete: Recording discussion]`

**安安** 50:05
~~I think we can keep it.~~ `[Delete: Recording discussion]`

...

(Full transcript continues...)
```
## Structure Description

| Section | Content | Purpose |
|---|---|---|
| I. Content Outline | Topic-level table | Quickly understand structure; delete entire blocks |
| II. Silence Segments | List of long blank segments | Delete silent stretches |
| III. Statistics | Deletion counts summarized by type | See the scale of deletions at a glance |
| IV. Main Text | Full transcript + inline deletion marks | View context; review sentence by sentence |
## Topic Identification Rules

| Topic Type | Identification Method |
|---|---|
| Opening small talk | Content before the official opening ("Hello everyone") |
| Official opening | Paragraph starting with "Hello/Hello everyone" |
| Chit-chat | Discussion of personal background irrelevant to the topic |
| Topic discussion | Core content around the podcast topic |
| Recording discussion | Paragraphs discussing editing or content selection |
| Closing | Concluding remarks like "Alright, that's it for today" |
## Deletion Types

### ⚠️ Division of Labor Principle

| Skill | Focus | Processed Content | Timestamp Granularity |
|---|---|---|---|
| /podcastcut-content (this Skill) | Content semantics | Opening chit-chat, off-topic, privacy, redundancy, long silence | Sentence-level |
| /podcastcut-transcribe | Verbal-error technicalities | Filler words, verbal errors, short pauses, half-sentence deletion | Character-level |

This Skill focuses on the content level: what to delete and what to keep is a semantic judgment, deleting entire sentences.
Verbal-error identification is the technical level: it requires finer-grained rules (repeated characters, pause patterns) and character-level timestamps.

Why this division of labor?
- Sentence-level transcription + speaker diarization = accurate speaker identification
- Character-level transcription + speaker diarization = prone to speaker misalignment (OOM on long audio, alignment difficulties after segmentation)
- First delete large segments (sentence-level), then process the remaining content (character-level)
### Content Deletion Types (Processed by This Skill)

| Type | Mark | Example |
|---|---|---|
| Opening small talk | `[Delete: Opening small talk]` | "Shall we start?" "Can you hear me?" |
| Closing chit-chat | `[Delete: Closing chit-chat]` | "Alright, that's it" "Bye" |
| Recording-related | `[Delete: Recording-related]` | "Re-record this segment" "Cut this later" |
| Off-topic content | `[Delete: Off-topic/chit-chat]` | Discussion irrelevant to the topic |
| Redundant repetition | `[Delete: Redundant repetition]` | Large segments repeating the same point |
| Privacy - company | `[Delete: Privacy - company name]` | "I work at Google" |
| Privacy - personal name | `[Delete: Privacy - personal name]` | "My colleague Zhang San said" |
| Privacy - location | `[Delete: Privacy - location]` | "I live in xxx" |
| Long silence | Listed separately in Section II of the review draft | Silence over 3 seconds |
### Verbal Error Deletion Types (Processed by /podcastcut-transcribe)

| Type | Description |
|---|---|
| Fillers/modal particles | "Um", "I mean", "Then", "Right right right" |
| Verbal errors | Misspoken words and their corrections |
| Short pauses | Small pauses within sentences (< 3 seconds) |

Note: Long silence (≥3 seconds) is processed by this Skill; short pauses are processed by /podcastcut-transcribe.

Why not process fillers here? Identifying fillers requires finer-grained rules (continuous repetition, pause patterns), which is a technical rather than content-semantic task.
## AI Analysis Method

### ⚠️ Must Use Claude for Semantic Analysis
Keyword matching is not sufficient! Rule-based methods cannot identify:
- Semantic off-topic/chit-chat (no obvious keywords)
- Chit-chat after guest introduction (where they live, graduation year, school conditions)
- Hidden recording discussions (no keywords like "cut")
Must use Claude to analyze the transcript in segments, prioritize quality.
### Analysis Workflow
1. Split the transcript into 15-minute segments
2. Send each segment to Claude for analysis, identify content suggested for deletion
3. Claude returns: sentence index + deletion type + reason
4. Merge results from all segments, generate review draft
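Step 1 above can be sketched as bucketing sentences by start time. This is only a sketch: `split_segments` is a hypothetical helper, and it assumes segment boundaries may fall at any sentence start.

```python
def split_segments(sentences, window=15 * 60):
    """Bucket sentences into consecutive ~15-minute windows by start time."""
    buckets = []
    for s in sentences:
        idx = int(s["start"] // window)  # which 15-minute window this sentence starts in
        while len(buckets) <= idx:
            buckets.append([])
        buckets[idx].append(s)
    return [b for b in buckets if b]  # drop empty windows (e.g. long silence)
```

Each bucket is then rendered as text and sent to Claude with the analysis prompt; per-segment results are merged by sentence index.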
### Claude Analysis Prompt

For each transcript segment, use the following prompt:

```
You are a podcast content review assistant. Analyze the following transcript and identify sentences suggested for deletion.

## Deletion Types

1. **Opening small talk**: Chit-chat and technical debugging before the official opening ("Hello everyone")
2. **Recording discussion**: Discussion of editing, recording status, technical issues, "Should we cut this segment"
3. **Privacy - company name**: Mention of specific company names (Google, Meta, ByteDance, etc.)
4. **Privacy - school name**: Mention of specific school names (Stanford, Tsinghua, etc.)
5. **Privacy - location**: Mention of specific locations (Palo Alto, Silicon Valley, etc.)
6. **Privacy - personal name**: Mention of specific personal names (non-public figures)
7. **Off-topic/chit-chat**: Discussion irrelevant to the podcast topic (personal background chit-chat, geographic discussion, etc.)
8. **Redundant repetition**: Repeating the same point multiple times, large segments of repetition

## Output Format

For each sentence suggested for deletion, output:
- Sentence timestamp
- Deletion type
- Reason (brief description)

Only mark sentences that need deletion; skip those that don't.

## Transcript

{transcript_segment}
```
### Detailed Deletion Type Explanations

| Type | Identification Key Points |
|---|---|
| Opening small talk | All content before the official opening, including technical debugging and chit-chat |
| Recording discussion | "Cut", "Did we record that?", "This is too sensitive", "Cut later" |
| Privacy information | Company names, school names, locations, personal names |
| Off-topic/chit-chat | Irrelevant to the topic: where they live, when they arrived, how the school is |
| Redundant repetition | The same meaning repeated more than 3 times |
### Chit-chat Detection Focus
Chit-chat after guest introduction is particularly easy to miss, watch for these signals:
- Sudden appearance of place names, school names, years
- "Which area do you live in" "When did you arrive" "How is it over there"
- Multiple consecutive sentences discussing non-topic content (geography, school, city comparison)
### Recording Discussion Detection Focus
Not just at the opening! May appear throughout the podcast:
- Technical issues: "Can you hear me", "Disconnected", "Headphone battery dead"
- Content concerns: "Too low-key", "Don't want to share", "Don't mention details"
- Editing discussion: "Cut later", "Should we keep this"
## Silence Detection Method

Use FFmpeg's `silencedetect` filter to detect long blank segments.

### Detection Command

```bash
ffmpeg -i video.mp4 -af "silencedetect=noise=-40dB:d=3" -f null - 2>&1 | grep silencedetect
```
| Parameter | Description | Recommended Value |
|---|---|---|
| `noise` | Silence threshold (volume below this counts as silence) | -40dB |
| `d` | Minimum silence duration (seconds) | 3 (content editing focuses on large blanks) |
### Output Parsing

```
[silencedetect @ 0x...] silence_start: 752.341
[silencedetect @ 0x...] silence_end: 766.512 | silence_duration: 14.171
```

Parse `silence_start` and `silence_end` to generate the list of silence segments.
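A minimal parser for this log might look like the following. This is a sketch: `parse_silence` is a hypothetical helper, and the regex assumes the exact log format shown above.

```python
import re


def parse_silence(log: str):
    """Pair silence_start/silence_end values from ffmpeg silencedetect output."""
    starts = [float(v) for v in re.findall(r"silence_start: ([\d.]+)", log)]
    ends = [float(v) for v in re.findall(r"silence_end: ([\d.]+)", log)]
    # ffmpeg emits start/end lines in order, so zipping pairs them up
    return [{"start": s, "end": e, "duration": round(e - s, 3)}
            for s, e in zip(starts, ends)]
```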
### Threshold Selection

| Scenario | noise | d (minimum duration) |
|---|---|---|
| Content editing (this Skill) | -40dB | 3 seconds |
| Verbal error identification (fine-grained) | -50dB | 0.5 seconds |

Why 3 seconds? Pauses shorter than 3 seconds may be natural thinking gaps and are not recommended for deletion.
## Output Files

```
podcast_transcript.json  # Sentence-level timestamps + speakers (used for editing)
podcast_审查稿.md         # Review draft (includes full transcript + deletion marks)
```

⚠️ Only the review draft is delivered; no separate transcript file. Section IV ("Main Text") of the review draft is the full transcript, so there is no need to output it separately.
## Sentence-level JSON Format

```json
{
  "file": "podcast.mp3",
  "duration": 3600.5,
  "sentences": [
    {"text": "Hello everyone,", "start": 0.50, "end": 1.20, "spk": 0},
    {"text": "Welcome to today's podcast.", "start": 1.20, "end": 2.80, "spk": 0},
    {"text": "I'm host Xiao Ming.", "start": 2.80, "end": 3.90, "spk": 1},
    ...
  ]
}
```
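Counting sentences per speaker ID in this file is a quick sanity check for the diarization problems described under Speaker Diarization (the same person split across several IDs). `spk_counts` is a hypothetical helper, not part of the existing scripts.

```python
import json
from collections import Counter


def spk_counts(path: str) -> Counter:
    """Count sentences per speaker ID in podcast_transcript.json."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return Counter(s["spk"] for s in data["sentences"])
```

A 3-person conversation that shows 4 IDs, one of them with very few sentences, is the "same person split into multiple IDs" symptom.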
## ⚠️ Review Draft and Deletion List Must Be Synchronized

Users may directly modify deletion marks in the review draft (adding or removing strikethrough), making any previously generated deletion list outdated.

Rules:
- The review draft is the authoritative source for user review
- Before executing editing, re-parse the deletion marks from the review draft
- Do not rely on a potentially outdated deletion list

Parsing method: Scan the text marked with `~~strikethrough~~` in the review draft and match it against the timestamps in transcript.json.
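A sketch of that parsing step, assuming deletion marks follow the ``~~text~~ `[Delete: reason]` `` format used in the review draft. Matching struck text back to transcript.json by exact sentence text is an assumption here; a real implementation might match more tolerantly.

```python
import re

# One strikethrough span followed by its reason tag
MARK = re.compile(r"~~(.+?)~~\s*`\[Delete: ([^\]]+)\]`")


def parse_marks(review_md: str):
    """Extract (struck text, reason) pairs from the review draft."""
    return MARK.findall(review_md)


def to_timestamps(marks, sentences):
    """Match struck text back to sentence timestamps in transcript.json."""
    cuts = []
    for text, reason in marks:
        for s in sentences:
            if s["text"] in text:  # a struck span may cover several sentences
                cuts.append({"start": s["start"], "end": s["end"], "reason": reason})
    return cuts
```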
## Relationship with Other Skills

```
/podcastcut-content    → Content editing (semantic level) ← This Skill
/podcastcut-edit       → Execute editing
/podcastcut-transcribe → Verbal error identification (technical level, optional)
/podcastcut-subtitle   → Generate subtitles
```
Recommended Workflow:
Original video
↓
/podcastcut-content ← Mark large segments (small talk, off-topic, redundancy, privacy)
↓
/podcastcut-edit ← Execute deletion, output v2
↓
【Optional】Need to process verbal errors?
↓ Yes
/podcastcut-transcribe ← Identify verbal errors, filler words, silence
↓
/podcastcut-edit ← Execute deletion, output v3
↓
Completed
Why delete content first then process verbal errors?
- After deleting large segments, the video becomes shorter
- Verbal error identification transcription is faster, review scope is smaller
- No need to process verbal errors in deleted large segments
## Speaker Diarization

FunASR has built-in speaker diarization (the CAM++ model) and automatically outputs speaker IDs.
### Workflow
FunASR transcription (enable spk_model)
↓
Output sentences with speaker IDs (speaker 0, speaker 1...)
↓
Search self-introduction phrases to confirm the real name corresponding to the ID
↓
Replace with real names when generating the review draft
### ⚠️ Speaker Mapping Confirmation Method

Do not directly use the order provided by the user! Search the transcription results for self-introduction phrases to confirm the mapping:

```python
# Search key phrases to determine the speaker mapping
key_phrases = ["I'm the host", "I'm xxx", "Hello everyone, I'm"]
for s in sentences:
    for phrase in key_phrases:
        if phrase in s['text']:
            print(f"spk{s['spk']}: {s['text']}")  # confirm which person each spk ID corresponds to
```
### Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Same person split into multiple IDs | FunASR identification instability | Map multiple IDs to the same name |
| More IDs than actual speakers | As above | Merge redundant IDs based on self-introductions |
| Speaker IDs misaligned after segmented transcription | Each segment's IDs reset independently | Prefer full transcription; avoid segmentation |
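Once the mapping is confirmed, merging redundant IDs is a many-to-one relabeling. The mapping below is hypothetical (it assumes spk1 and spk3 were confirmed to be the same person via self-introductions), and `relabel` is a hypothetical helper.

```python
# Hypothetical mapping: spk1 and spk3 were confirmed (via self-introduction
# phrases) to be the same person, so both map to one name.
ID_TO_NAME = {0: "麦雅", 1: "响歌歌", 2: "安安", 3: "响歌歌"}


def relabel(sentences, id_to_name):
    """Replace numeric speaker IDs with confirmed names, merging duplicates."""
    return [{**s, "spk": id_to_name.get(s["spk"], f"spk{s['spk']}")}
            for s in sentences]
```

Unmapped IDs keep a `spkN` placeholder so that stray IDs remain visible during review.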
### Limitations

| Condition | Effect |
|---|---|
| 2-10 person conversation | Good |
| Audio < 30s | Degraded |
| More than 10 speakers | Degraded |
| Segmented transcription | Speaker IDs may be inconsistent |
## Progress TodoList
Create at startup:
- [ ] Transcribe audio (FunASR, sentence-level + speaker diarization)
- [ ] Silence detection (FFmpeg silencedetect)
- [ ] Generate transcript
- [ ] AI analysis: Identify topic structure + mark suggested deletions
- [ ] Output review draft (including silence segments)
- [ ] Wait for user confirmation
## Example Dialogue
User: Help me cut the nonsense in the podcast, speakers are Maia and 响歌歌
AI: Alright, I'll process this podcast.
1. Transcribing audio...
2. Detecting silence...
3. Generating transcript...
4. Analyzing content...
Review draft generated: podcast_审查稿.md
=== Content Outline (Topic-level) ===
| # | Topic | Duration | AI Suggestion |
|---|------|------|--------|
| 1 | Opening small talk | 04:45 | 🗑️ Delete |
| 2 | Official opening | 02:15 | ✅ Keep |
| 3 | Chit-chat: Guest background | 01:13 | 🗑️ Delete |
| 4 | Topic discussion | 32:59 | ✅ Keep |
=== Sentence-level Statistics (By Type) ===
- Opening small talk: 12 instances
- Recording discussion: 8 instances
- Privacy information: 5 instances
- Off-topic chit-chat: 3 instances
Please check the deletion marks in the review draft, and tell me to execute editing after adjustment.
User: [Added/removed some deletion marks in the review draft] Alright, cut according to the review draft
AI: Alright, parsing deletion marks from the review draft...
- Found 25 deletion marks
- Total deletion duration: 06:32
Executing editing...
## Feedback Records

### 2026-01-31 (Late Night)
- Speaker ID misalignment caused by segmented transcription: Speaker IDs reset independently for each segment during segmented transcription, resulting in different IDs for the same person after merging
- Cause: 2-hour audio split into 13 segments to avoid OOM, each segment's speaker ID starts from 0
- Solution: Prioritize full transcription (2-hour audio takes about 16 minutes, no OOM)
- Updated: Added "Segmented vs Full Transcription" section to tips/转录最佳实践.md
- FunASR may identify the same person as multiple IDs: 3-person conversation identified as 4 speaker IDs
- Performance: 响歌歌 was split into spk1 (60 sentences) and spk3 (560 sentences)
- Solution: Search self-introduction phrases ("I'm host xxx") to confirm mapping, merge multiple IDs
- Updated: Added confirmation method and common issues to the "Speaker Diarization" section of SKILL.md
- Updated performance data: 2-hour podcast tested to take 16 minutes, 3390 sentences (previous estimate of 12 minutes, 800 sentences was low)
- Updated: Performance reference table in tips/转录最佳实践.md
### 2026-01-31 (Evening)
- Incomplete review draft content: AI took a shortcut and wrote "(The following content is topic discussion, keep...)" instead of the full transcript
- Updated: Clearly marked in the "Section IV" of the review draft format that full content must be output, no omissions allowed
- Redundant transcript file output: Users only need one review draft (which already includes the full transcript)
- Updated: Removed from the output files section, clearly stated to only output the review draft
### 2026-01-31
- Deleted "User Confirmation Method" section: Users actually operate by directly modifying deletion marks in the review draft, no need for command-based operations
- Old workflow: Users input commands like "Delete topics 1, 3" "Delete all silence"
- Actual workflow: Users add/remove in the review draft, then say "Cut according to the review draft"
- Updated: Flowchart, example dialogue, removed command-based operation instructions
- FunASR call parameter error caused transcription failure: simplified model names cannot return the required results
- Incorrect: simplified model names
- Correct: must use the full model path + VAD + Punc + Speaker (four models)
- Cause: SKILL.md only listed simplified parameters; actual execution improvised and used an incorrect API
- Updated: Included the complete call code in SKILL.md, with incorrect and correct usage compared
- Lesson: Executable code must be written out in full in SKILL.md; do not write only parameter names and let AI assemble the call
### 2026-01-25
- Reverted to sentence-level timestamps: Character-level + speaker diarization is unstable for long audio
- Issue: After character-level transcription and merging speaker information, speaker alignment errors occurred ("I'm host 麦雅" was attributed to 响歌歌)
- Cause:
- Speaker diarization OOM for long audio (2-hour audio → 234MB WAV)
- Segmented speaker diarization returned 0 sentences (API format issue)
- Character-level transcription has no punctuation, sentence boundaries are unnatural
- Updated: Reverted to sentence-level timestamps, this Skill only deletes entire sentences
- Half-sentence deletion is left to /podcastcut-transcribe (character-level)
### 2026-01-24 (Evening)
- Attempted to upgrade to character-level timestamps: Solved the problem that sentence-level cannot accurately delete parts of sentences
- Issue: Deleting "Um, I can talk about why" would also delete the second half "This episode's guest was actually invited by 响歌歌"
- Attempt: Use 30s segmentation + `timestamp_granularity="character"` to obtain character-level timestamps
- Result: Character-level transcription succeeded, but speaker diarization failed, leading to speaker alignment errors
- Final decision: Revert to sentence-level (see 2026-01-25 feedback)
### 2026-01-24
- Added silence detection function: Use FFmpeg silencedetect to identify long blank segments (≥3 seconds), listed separately in the review draft for user confirmation to delete
- Review draft marked for deletion but not actually cut: entire sentences were marked for deletion in the review draft, but only part of the content appeared in the deletion list
- Case: the review draft struck out ~~Um, this is why specifically...~~, but the deletion list contained only part of that sentence
- Updated: Emphasized that the review draft and deletion list must stay synchronized; re-parse from the review draft before executing editing
### 2026-01-18 (Afternoon)
- Adjusted transcript/review draft format: Content from the same speaker is concatenated, no line breaks per sentence
- Original format: One line per sentence
- New format: All sentences from the same speaker are in one paragraph
- Advantage: More compact, better reading experience
### 2026-01-18
- Must use Claude for semantic analysis: Rule-based keyword matching is not sufficient in quality
- Issue: Cannot identify semantic off-topic/chit-chat, hidden recording discussions
- Updated: Added "AI Analysis Method" section, clearly stated that Claude must be used to analyze the transcript in segments
- Included: Analysis workflow, Claude prompt template, detailed deletion type explanations
### 2026-01-17 (Evening)

- Adjusted the format of Section II of the review draft: changed to full transcript + inline deletion marks
- Original format: grouped by topic → deletion suggestions listed under each topic (table form)
- New format: full transcript as the main text, deleted content marked inline with ~~strikethrough~~ and the reason
- Advantage: retains full context; users can see the surrounding text before deciding whether to delete
### 2026-01-17 (Afternoon)
- Large segment deletion not clean: Consecutive sentences marked for deletion, but sentence-by-sentence deletion retained blank spaces between sentences
- Cause: Each sentence processed independently, no merging of consecutive deletions with the same reason
- Updated: Clarified the division of labor; this Skill works at sentence-level, while filler words and verbal tics are processed by /podcastcut-transcribe
- Editing rules have been synchronized to /podcastcut-edit
- Do not mark filler words at sentence-level: Sentence-level timestamps are not precise enough, deleting filler words is easy to cause accidental deletion
- Updated: Divided deletion types into "sentence-level" and "character-level", clarified division of labor
### 2026-01-17 (Morning)
- Recording discussions and technical debugging may appear anywhere in the podcast, not just at the opening
- Updated: Recording-related detection changed to full-process detection, added technical issue keywords and continuous paragraph detection
- Chit-chat after guest introduction (where they live, when they arrived, how the school is) is easy to miss
- Updated: Added signal patterns (place names, school names, year-related conversations) to off-topic/chit-chat detection
- Sentence-by-sentence review is inefficient, users want to see the global structure and delete entire blocks
- Added: Review draft integrates topic-level outline and sentence-level deletion suggestions, complete review in one file