music-to-video — one music-grounded, beat-synced video workflow
Use this skill to turn a music track into a beat-synced HyperFrames video. You analyze the track once, lay out the frames, fill in a per-frame plan, and build each frame as a composition. The input is a music track plus optional user images or videos — there is no narration and no website capture. Typography and templates are the floor (a complete video needs zero assets); any media the user supplies is cut in on the same beat grid.
You are the orchestrator. Work in the project directory. Run the steps in order and pass each Gate before moving on. Two steps need the user: Step 3 (plan approval) and Step 6 (render approval). Do every step yourself except Step 4, where you dispatch one sub-agent per frame. Keep design and motion rules out of this file; they live in the references and the frame-worker sub-agent.
`<SKILL_DIR>` = this skill directory. `$PROJECT_DIR` = `videos/<project>`.
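Workflow: Step 0 setup → project + `assets/bgm.mp3`; Step 1 analyze → `audiomap.json`; Step 2 skeleton → `STORYBOARD.md` (frames, groups unfilled); Step 3 plan → complete `STORYBOARD.md` + `frame.md`; Step 4 build → `compositions/frames/NN-*.html`; Step 5 assemble → the playable index; Step 6 render → `renders/video.mp4`.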
Two ideas that shape everything
- One analyzer, and you trust it. `analyze-beatgrid.py` is the only beat analyzer: never re-measure beats with another tool or by ear. Its energy / density / rolls / onsets / silences are always reliable. Its beat-grid fields are reliable only when the music is genuinely rhythmic; on calm music the grid is a metronome the tracker imposed, so pace by phrases and energy instead and never hard-cut to it. Deciding which case you're in is each frame's `pacing` (Step 2).
- One frame = one file; groups live inside. Step 2 cuts the track into frames, and each frame becomes one composition file `compositions/frames/NN-<frame_id>.html`, built by one frame-worker. A frame can subdivide into groups (each a template or a motion-primitives combo). Extra density goes inside a group, so frame count tracks distinct treatments, not beats: a fast track does not blow up the number of sub-agents.
Step 0: Setup, BGM, and inputs
Goal: Establish the music source, create the HyperFrames project, and note any user-supplied media.
The music is the spine: establish one track before anything else. This skill is tuned for fast, high-energy BGM, where a strong beat grid drives the cuts (calm tracks work, but pace by phrase rather than beat). If the user gave you audio (a music file, or a video to pull the audio from), use it. If not, generate one: choose the mood from the user's description (e.g. "driving synthwave", "trap beat", "upbeat corporate") and produce a track with a music generator (HeyGen retrieval when credentialed, else local Lyria / MusicGen; ElevenLabs or another generator also works). Either way the track lands at `$PROJECT_DIR/assets/bgm.mp3`. Stage any user-supplied images or videos so frames can weave them in on the beat grid; otherwise typography carries the whole video.
Initialize only if the project directory is missing. Name `<project>` from the brief in kebab-case, never a timestamp.
```bash
# PROJECT_DIR is used by every later command; set it to the project path first
PROJECT_DIR="videos/<project>"
npx hyperframes init "videos/<project>" --non-interactive --skip-skills --example=blank
mkdir -p "$PROJECT_DIR/assets" "$PROJECT_DIR/renders"
cp "<user-music>" "$PROJECT_DIR/assets/bgm.mp3" # extract from a video first if needed
# only if the user gave you images/videos:
node <SKILL_DIR>/scripts/stage-assets.mjs --from <dir> --hyperframes "$PROJECT_DIR" --into public
```
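The naming rule above (kebab-case from the brief, never a timestamp) can be sketched in a few lines. This is only an illustration: `project_slug` and its `max_words` cap are hypothetical helpers, not part of the skill's scripts.

```python
import re


def project_slug(brief: str, max_words: int = 5) -> str:
    """Build a kebab-case name from the first few words of a brief."""
    # Keep lowercase letters and digits only; punctuation acts as a separator.
    words = re.findall(r"[a-z0-9]+", brief.lower())
    if not words:
        raise ValueError("brief contains no usable words")
    return "-".join(words[:max_words])


print(project_slug("Neon City Launch Teaser!"))
```

Capping the word count keeps directory names short; the cap value itself is arbitrary.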
The brand (font + palette) is chosen at Step 3, not here. Don't pick a genre or a track type up front — assets are just an optional ingredient, and the genre emerges from the per-frame choices.
Gate: the project and `assets/bgm.mp3` exist; aspect / length / fps and (if any) the asset inventory are noted.
Step 1: Analyze the music
Goal: Produce the one canonical timing analysis the whole video is built on.
`analyze-beatgrid.py` is the only beat analyzer: never re-measure beats with another tool or by ear. It reads the track once and writes `audiomap.json`: energy phases (level / density / feel), onsets, rolls, silences, phrases, tempo / grid, and related fields. It's deterministic: the same file always gives the same map. Most fields are reliable on any music; the beat-grid fields are reliable only when the music is genuinely rhythmic, and judging that is the call you make at Step 2.
Prerequisites: Python 3 with `librosa`, `numpy`, and `soundfile` available. If an import fails, install them into the active Python environment before running the analyzer:
```bash
python3 -m pip install librosa numpy soundfile
```
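Before running the analyzer, it can help to check which prerequisites are importable. A minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, not something the skill ships.

```python
from importlib.util import find_spec

# The three packages named in the prerequisites above.
REQUIRED = ("librosa", "numpy", "soundfile")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("install first:", " ".join(missing))
    else:
        print("all analyzer prerequisites are importable")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.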
```bash
python3 <SKILL_DIR>/scripts/analyze-beatgrid.py "$PROJECT_DIR/assets/bgm.mp3" \
  -o "$PROJECT_DIR/audiomap.json" --print
```
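Later gates compare durations against `audiomap.audio.duration_sec`. Assuming the map is plain JSON with an `audio` object holding `duration_sec` (the only field path this document names), reading it might look like the sketch below; `duration_from_map` and `load_duration` are hypothetical names.

```python
import json
from pathlib import Path


def duration_from_map(audiomap: dict) -> float:
    """Read audio.duration_sec from an already-parsed audiomap."""
    try:
        return float(audiomap["audio"]["duration_sec"])
    except (KeyError, TypeError) as exc:
        raise ValueError("audiomap has no audio.duration_sec") from exc


def load_duration(path) -> float:
    """Parse the audiomap file at `path` and return the track duration in seconds."""
    return duration_from_map(json.loads(Path(path).read_text(encoding="utf-8")))
```

Keeping the dictionary lookup separate from file loading makes the lookup easy to test without a real track.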
Step 2: Frame skeleton
Goal: Read the music and lay out the frames: the skeleton of `STORYBOARD.md`.
Read `references/frame-skeleton.md`. Turn `audiomap.json` into the skeleton of `STORYBOARD.md` yourself; there is no intermediate JSON. Cut the track into frames at real musical changes (SURGE / DROP, the edges of a roll, a stretch with no onsets, a big energy jump), snapping every boundary to an audiomap anchor. For each frame set its skeleton fields, including the span, `pacing` (the verdict on whether Step 1's grid can be trusted: `beat_cut` when the grid is real, `phrase_flow` when it's a metronome imposed on calm music), and a one-line description of the plain music situation, which Step 3 matches a template against. Only classify and lay out here: leave every frame's groups unfilled and the brand frontmatter blank, with no templates, copy, color, or fonts. Expect ~1–6 frames.
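Snapping a rough cut point to the nearest audiomap anchor can be sketched as follows. The anchor list here is a plain sorted list of times in seconds, an assumed simplification: how anchors are actually stored in `audiomap.json` is not shown in this document, and `snap_to_anchor` is a hypothetical helper.

```python
from bisect import bisect_left


def snap_to_anchor(t: float, anchors: list) -> float:
    """Return the anchor time closest to t; `anchors` must be sorted ascending."""
    if not anchors:
        raise ValueError("no anchors to snap to")
    i = bisect_left(anchors, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = anchors[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda a: abs(a - t))


anchors = [0.0, 1.5, 3.0, 4.5]  # made-up anchor times
print(snap_to_anchor(2.0, anchors))
```

On an exact tie the earlier anchor wins, which is a choice of this sketch and not a rule from the skill.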
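Gate: frames tile the track (first at 0, last at `audiomap.audio.duration_sec`); each carries its span, `pacing`, and the other skeleton fields; every frame's groups are still unfilled; no content anywhere.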
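The tiling part of this gate is easy to check mechanically. A sketch, assuming each frame's span is a `(start, end)` pair in track seconds; `tiling_errors` and its tolerance are hypothetical, and `validate-plan.mjs` remains the real check at Step 3.

```python
def tiling_errors(spans, duration_sec, tol=1e-3):
    """List the ways `spans` fail to cover [0, duration_sec] without gaps or overlaps."""
    if not spans:
        return ["no frames"]
    errors = []
    if abs(spans[0][0]) > tol:
        errors.append(f"first frame starts at {spans[0][0]}, not 0")
    # Each frame must begin exactly where the previous one ended.
    for (_, prev_end), (start, _) in zip(spans, spans[1:]):
        if abs(start - prev_end) > tol:
            errors.append(f"gap or overlap between {prev_end} and {start}")
    if abs(spans[-1][1] - duration_sec) > tol:
        errors.append(f"last frame ends at {spans[-1][1]}, not {duration_sec}")
    return errors


print(tiling_errors([(0.0, 4.0), (4.0, 10.0)], 10.0))
print(tiling_errors([(0.0, 4.0), (5.0, 10.0)], 10.0))
```

An empty list means the frames tile the track.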
Step 3: Plan (user-gated)
Goal: Turn the skeleton into an approved, complete `STORYBOARD.md`.
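Read `references/planning.md`, `references/storyboard-format.md`, `references/template-catalog.md`, `references/motion-primitive-catalog.md`, and `references/montage.md` (only if the user supplied assets). Editing the same file in place, do two things: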
- Pick the brand. Choose one preset from `../hyperframes-creative/frame-presets/` using the table in `../hyperframes-creative/references/design-spec.md` (match the track's mood; only its fonts and colors matter, because templates own composition). Copy it into `frame.md` unmodified and fill the frontmatter (font + a ≤4–6 swatch palette) from it.
- Fill every frame. Decide its groups and give each a treatment: a matched template from the catalog (with bound params and real audiomap anchors), a free-compose from the primitive catalog, or an asset treatment that obeys `references/montage.md`. Write the copy. You own WHAT (template / primitives + content + anchors); the frame-worker owns HOW, so never write millisecond tweens into the storyboard.
```bash
node <SKILL_DIR>/scripts/validate-plan.mjs --storyboard "$PROJECT_DIR/STORYBOARD.md" \
  --audiomap "$PROJECT_DIR/audiomap.json" --templates <SKILL_DIR>/references/templates
```
Fix every error it reports (hard errors: duration mismatch, frames not tiling the track, a missing required element); warnings are best-effort. Then show the user a frame-by-frame summary and iterate until they approve.
Gate: `frame.md` is a verbatim preset copy; `validate-plan.mjs` exits 0; the user approved the plan.
Step 4: Build frames
Goal: Build every frame as a self-contained composition file.
Create `compositions/frames/`. Read `sub-agents/frame-worker.md` and `../hyperframes-core/references/subagent-dispatch.md`. Dispatch one frame-worker per frame, in parallel where possible (otherwise in waves). Each worker gets exactly one frame and this context:
```text
PROJECT_DIR: <abs path>
frame_id: <NN-frame_id> # = the frame file stem, e.g. 02-f2; the composition id
Your block: the `## Frame N — <frame_id>` block in PROJECT_DIR/STORYBOARD.md
audiomap: PROJECT_DIR/audiomap.json
frame.md: PROJECT_DIR/frame.md
Materials: for each group, <SKILL_DIR>/references/templates/<id>/index.html (templates) and
<SKILL_DIR>/references/motion-primitives/<id>/ (free); staged assets/ (asset groups)
Contracts: ../hyperframes-core/references/sub-compositions.md + determinism-rules.md
Canvas: <w>×<h> Pacing: <beat_cut|phrase_flow>
Write to: PROJECT_DIR/compositions/frames/<frame_id>.html
```
The worker forks the cited materials, converts every anchor to frame-local seconds (`local_t = track_t − span_sec[0]`), gates its groups with 0ms cuts, and writes one seek-safe frame file.
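The conversion to frame-local seconds is a single subtraction. A sketch, assuming `span_sec` is a `[start, end]` pair in track seconds and that anchors outside the frame are simply dropped (an assumption of this example, not a documented worker rule); `to_local` is a hypothetical name.

```python
def to_local(track_times, span_sec):
    """Convert track-time anchors to frame-local seconds: local_t = track_t - span_sec[0]."""
    start, end = span_sec
    # Keep only anchors that fall inside this frame's half-open span [start, end).
    return [round(t - start, 3) for t in track_times if start <= t < end]


# Made-up anchors for a frame spanning 8 s to 16 s of the track.
print(to_local([8.0, 9.5, 12.0, 20.0], (8.0, 16.0)))
```

Rounding to milliseconds keeps floating-point noise out of the values written into the frame file.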
The worker never runs the CLI — those commands operate on the assembled project, which doesn't exist yet, so they'd report on the wrong files. The worker just writes to the contract and stops; you verify after assembly (Step 6). As each worker returns, you can confirm its file landed on disk.
Gate: every frame has its `compositions/frames/NN-*.html` on disk.
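One way to confirm this gate is to list frames whose file is missing or empty. A sketch, assuming frame ids such as `02-f2` map to `compositions/frames/<frame_id>.html` as in the worker context above; `missing_frame_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_frame_files(project_dir, frame_ids):
    """Return the frame ids whose composition file is absent or zero bytes."""
    frames_dir = Path(project_dir) / "compositions" / "frames"
    missing = []
    for frame_id in frame_ids:
        path = frames_dir / f"{frame_id}.html"
        # A zero-byte file usually means the worker stopped before writing anything.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(frame_id)
    return missing
```

Any id it returns is a candidate for re-dispatching that one frame-worker.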
Step 5: Assemble
Goal: Wire the built frames + BGM into the playable index.
`assemble-index.mjs` is deterministic: no subagent, no judgment. It references each frame file at its cumulative start, mounts the BGM on track 11, and hard-cuts frame → frame (frames tile the track with no gaps, so there is no transition injector).
```bash
node <SKILL_DIR>/scripts/assemble-index.mjs --storyboard "$PROJECT_DIR/STORYBOARD.md" \
  --hyperframes "$PROJECT_DIR" --audiomap "$PROJECT_DIR/audiomap.json"
```
Fix any error it reports. A missing or blank frame file means that worker wrote a partial file; re-dispatch it (Step 4) and re-assemble.
Gate: the assembled index exists; total duration == `audiomap.audio.duration_sec`.
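The cumulative placement and the duration gate amount to a running sum. This sketch is not the logic of `assemble-index.mjs`, which is not shown here; it assumes per-frame durations in seconds and uses hypothetical helper names.

```python
from itertools import accumulate


def cumulative_starts(durations):
    """Start time of each frame when frames are laid end to end from 0."""
    if not durations:
        return []
    # The running total after frame i is the start of frame i + 1.
    return [0.0] + list(accumulate(durations))[:-1]


def total_matches(durations, duration_sec, tol=1e-3):
    """True when the frames add up to the track duration within `tol` seconds."""
    return abs(sum(durations) - duration_sec) <= tol


durations = [4.0, 6.0, 2.5]  # made-up frame lengths
print(cumulative_starts(durations))
print(total_matches(durations, 12.5))
```

Because frames hard-cut with no gaps, each start is fully determined by the durations before it.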
Step 6: Verify and render
Goal: Verify the assembled video, get user approval, and render the final MP4.
Run the CLI on the assembled project; that's the correct unit (the per-frame workers couldn't run it). `lint` checks structure, `validate` runs headless Chrome (catching JS errors and missing assets), and `inspect` snapshots frames.
```bash
( cd "$PROJECT_DIR" && npx hyperframes lint . && npx hyperframes validate . && npx hyperframes inspect . )
```
Inspect at the start of the video, each frame start, the strongest DROP / SURGE, and the final frame. On failure, make the cheapest safe fix: edit the offending `compositions/frames/NN-*.html` for a local issue; re-dispatch that one frame-worker only when a whole frame must be rebuilt; go back to Step 3 only if the plan is creatively wrong. Never change duration or audio timing to hide a sync issue. Once the gates pass, pause for user review, then render only on approval:
```bash
( cd "$PROJECT_DIR" && npx hyperframes render . -q draft -o renders/video.mp4 --fps 30 )
```
Gate: `lint` / `validate` / `inspect` passed; the user approved; `renders/video.mp4` exists with audio, duration == `audiomap.audio.duration_sec`. The final reply states the MP4 path and duration.
Resume table
| You have | Continue from |
|---|---|
| `assets/bgm.mp3` only | Step 1 |
| `audiomap.json` | Step 2 |
| `STORYBOARD.md` (skeleton) | Step 3 |
| `STORYBOARD.md` (complete) | Step 4 |
| all frame files | Step 5 |
| the assembled index | Step 6 |
Quick Reference
Formats: landscape by default; portrait and square are also supported. Set the canvas once in the storyboard frontmatter.
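Scripts under `<SKILL_DIR>/scripts/`: `analyze-beatgrid.py` (the one analyzer), `validate-plan.mjs` (plan check), `assemble-index.mjs` (index assembly), `stage-assets.mjs` (stage user media), `lib/storyboard.mjs` (vendored parser). Everything else is the `hyperframes` CLI.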
| Read | When |
|---|---|
| `references/frame-skeleton.md` | Step 2: read the music, lay out the frames, set pacing |
| `references/planning.md` · `references/storyboard-format.md` | Step 3: pick the brand, fill each frame, write the plan |
| `references/template-catalog.md` | Step 3: pick a template per group |
| `references/motion-primitive-catalog.md` | Step 3/4: L0 recipes for free-compose |
| `references/montage.md` | Step 3/4: asset treatments (beat-cut / ken-burns) |
| `sub-agents/frame-worker.md` | Step 4: dispatch + build one frame |
| `../hyperframes-core/references/subagent-dispatch.md` | Step 4: dispatch sub-agents safely |
| `../hyperframes-creative/references/design-spec.md` | Step 3: pick the preset (the brand) |
Directory layout
music-to-video/
SKILL.md
references/ frame-skeleton.md · planning.md · storyboard-format.md
template-catalog.md · motion-primitive-catalog.md · montage.md
templates/<id>/ { index.html (+ assets/ · program.json) } ← L1 catalog impls
motion-primitives/<id>/ { index.html } (+ ../assets/gsap.min.js shared by recipes) ← L0 catalog impls
scripts/ analyze-beatgrid.py · assemble-index.mjs · validate-plan.mjs · stage-assets.mjs · lib/storyboard.mjs
sub-agents/ frame-worker.md ← the one subagent (one per frame)