YouTube Shorts Generator
End-to-end pipeline: Long Video → Transcript → Ranked Highlights → Vertical Clips.
Turns one long video into N viral-ready vertical mp4s. Each clip ships with a viral score (0–100), an opening hook line, and a one-sentence reason it should perform.
Agent Execution Protocol
Step 1 — Collect Inputs
Ask once, then proceed:
| Input | Default | Notes |
|---|
| — | YouTube URL, or hosted mp4 URL, or local file path |
| | How many shorts to render |
| | for TikTok/Reels/Shorts, square, portrait |
| | / / / / |
| auto | Whisper language code (e.g. ) |
| — | Optional path; if set, dump full result there |
If the user gave only a URL, use defaults and don't block on questions.
Step 2 — Verify Prerequisites
- installed and authed ()
- on PATH (Whisper needs it for audio decoding)
- Python 3.10+ with installed (only if running the local transcribe stage)
If
is missing, stop and ask the user. Never invent a key.
Step 3 — Run the Pipeline
The standard path is the orchestrator script — it handles all eight stages in order:
bash
bash library/social/youtube-shorts/scripts/run-youtube-shorts.sh \
--source "<YOUTUBE_URL>" \
--num-clips 5 \
--aspect-ratio 9:16 \
--whisper-model base \
--output-json result.json \
--view
The eight stages:
- Download — pull the source video at the requested resolution (///, default ). For local files, skip.
- Transcribe — local Whisper produces timestamped segments. Audio stays on the machine.
- Classify content type — LLM tags the video (podcast / interview / tutorial / vlog / lecture / monologue) and density. Tunes the highlight prompt per type.
- Chunk if long — videos > (1800s default) are split into (1200s default) windows with (60s default) overlap so cross-boundary highlights aren't missed.
- Rank highlights — LLM scans each chunk through :
- Hook moments — strong opening line that stops the scroll
- Emotional peaks — laughter, anger, vulnerability, awe
- Opinion bombs — spicy, contrarian, debate-bait takes
- Revelation moments — "wait, what?" reframes
- Conflict — disagreement, tension, callouts
- Quotable lines — tight, screenshot-worthy phrasing
- Story peaks — climax of a narrative arc
- Practical value — actionable insight a viewer will save
Each candidate gets , , 0–100, , , . Aim for 30–75s clips unless content dictates otherwise.
- Dedupe — collapse overlaps. Rule: if two candidates overlap > 50%, keep the higher score, drop the other.
- Top-N selection — sort surviving candidates by score, take .
- Vertical auto-crop — render each highlight at via . Auto-handles face tracking and screen recordings.
Quick Invocation Patterns
Single video, defaults:
bash
bash scripts/run-youtube-shorts.sh --source "https://youtube.com/watch?v=VIDEO_ID"
Tuned for high-density podcast (more clips, larger Whisper model):
bash
bash scripts/run-youtube-shorts.sh \
--source "<URL>" --num-clips 8 --whisper-model medium --view
Square clips for Instagram feed:
bash
bash scripts/run-youtube-shorts.sh \
--source "<URL>" --aspect-ratio 1:1 --num-clips 3
Batch — with one URL per line:
bash
xargs -a urls.txt -I{} bash scripts/run-youtube-shorts.sh --source "{}"
Async submit (returns request_id, poll later):
bash
REQUEST_ID=$(bash scripts/run-youtube-shorts.sh \
--source "<URL>" --async --output-json - --jq '.request_id' | tr -d '"')
muapi predict wait "$REQUEST_ID" --download ./outputs
Platform Specs
| Platform | Aspect | Sweet-spot duration | Notes |
|---|
| YouTube Shorts | 9:16 | 30–60s | Hook in first 1s, max quality |
| TikTok | 9:16 | 30–75s | High energy; longer is fine if hook lands |
| Instagram Reels | 9:16 | 30–60s | Hook in first 1s |
| Instagram Feed | 1:1 | 15–45s | Static-feel works well |
| LinkedIn | 16:9 or 1:1 | 30–60s | Professional tone |
| Twitter/X | 16:9 | 15–60s | Punchy, direct |
Output Schema
json
{
"source_video_url": "...",
"transcript": { "duration": 1873.4, "segments": [...] },
"highlights": [ /* every candidate, before top-N cut */ ],
"shorts": [
{
"title": "The one mistake that cost me $50K",
"start_time": 124.3,
"end_time": 187.6,
"score": 92,
"hook_sentence": "Nobody talks about this, but it killed my first startup...",
"virality_reason": "Opens with a number + regret, peaks on a contrarian lesson",
"clip_url": "https://.../short_1.mp4"
}
]
}
When reporting back to the user, surface for each clip: rank, score, time range, title, hook, and clip URL. Skip the raw transcript unless asked.
Tunable Knobs
Edit defaults inside the orchestrator or pass via flags:
| Knob | Default | Purpose |
|---|
| | Chunk length for long videos |
| | Videos longer than this get chunked |
| | Overlap between chunks |
| | Seconds between job-status polls |
| | Give up after this long |
| | Min IoU to collapse overlapping candidates |
Whisper Model Selection
- / — fast, English-leaning, fine for clean studio audio
- / — better for accents and music beds
- — highest accuracy, much slower; only worth it on a GPU
Pick
unless transcript quality is poor, then bump to
.
Common Mistakes to Avoid
- Skipping the dedupe step — without it, you ship near-duplicate clips that all came from the same hot moment.
- Generic virality prompt — the highlight ranker must score against the eight signals above, not "interestingness."
- Wrong aspect ratio for the platform — YouTube Shorts and TikTok are ; LinkedIn often . Default to only if the platform isn't specified.
- Crop without face tracking — vertical crops on talking-head content must follow the speaker's face; static center-crop loses the subject.
- Padding to hit — if dedupe leaves fewer survivors than requested, return what you have. Don't ship low-score filler.
- Re-running the full pipeline on a 404'd clip URL — re-run only the crop stage for that highlight.
Failure Modes
- — stop and tell the user to install ( / ).
- Whisper produced no segments — likely no detectable speech or a hard language. Retry with
--whisper-model medium --language <code>
before declaring failure.
- API key missing or rejected — surface the exact error; don't fabricate a key.
- Job timed out — bump and retry; don't silently truncate.
- Highlight ranker returned < — return what survived dedupe with a note.
Done Criteria
The skill is done when:
- has up to entries, each with a working .
- The user has been shown the ranked list (score, time range, title, hook, URL).
- If was set, the file exists and parses.