embedded-captions

Embedded Captions

One catalog, picked up front (CATALOG.md: 17 identities; the three engines behind it are backend detail). Standard (default) builds a clean verbatim rail (a lower-third subtitle carrying most of the text) plus an embed climax composited into the scene behind the subject at the peak. Cinematic is pure embed: no rail, every caption composited behind the subject (hero typography, accumulation, occlusion as the effect). Theme is a complete themed constitution (body paradigm × hero setpiece × front fx × plate reaction) composed from registries (themes/README.md): `ordnance`, `terminal`, `neonsign`, `stardust`, `stomp`. Most explainer / voiceover work is Standard; embed is the scarce, earned peak, and embedding every word is the common mistake. Theme is for VFX-grade asks ("炸" (explosive), "特效" (special effects), "像 AE 做的" (looks like it was made in After Effects)).

Operational flow (TL;DR)

The craft prose below is long; the pipeline itself is short, and everything deterministic is computed or compiled, never hand-written:
  1. Decision gate (refuse bad clips) → pick ONE identity from CATALOG.md (17 identities; engine/compiler derived by lookup; never surface a mode/category question)
  2. `hyperframes init` (skip it if the project dir already exists with the video inside; `matte.cjs` / `transcribe.cjs` adopt any video in the dir as source.mp4) → `bash scripts/prepare.sh <project>` (matte ∥ transcribe ∥ audio-envelope in parallel, then safe-zones v2 with scene palette/optics/lighting: one command, nothing forgotten)
  3. Author a small JSON of creative choices (read `safe-zones.json` first): Cinematic → `plan.json` → `fill-timings.cjs` → `fit-fonts.cjs` → `make-composition.cjs`; Theme → `theme.json` → `make-theme.cjs` (rail/panel/poem/takeover paradigms; `anchor` is the quiet rail default)
  4. Visual QA: `node scripts/preview-frames.cjs <project>` → faithful composite previews in ~2s/frame (no render). Check § Visual QA before paying for a render.
  5. `render-and-composite.sh` → gates (timing / occlusion+hero / overflow / hand-off) → `final.mp4`
Load-bearing rules people miss:
  • rail (default) + embed (promotion). `drop` (filler, not shown) / `rail` (verbatim lower-third subtitle, in front, carries most text) / `embed` (a peak word composited behind the subject). Standard mode does both, embedding only the peak(s). See § Caption model.
  • The video is delivered UNTOUCHED: captions are the only thing added, and the matte just lets the subject occlude the embed track. Never grade/recolor/scanline the footage. This holds for Standard and Cinematic; Theme mode's PLATE budget is the one sanctioned exception: register-gated reaction beats (charge-dim, punch, shake, grain) defined per theme DNA and applied AFTER the matte composite so subject+text+plate move as one frame.
  • Two rulebooks: rail → references/rail.md (thin), embed craft → references/composition-craft.md (rich, embed-only). Skim by need.

Caption model — rail + embed

Every spoken phrase is one of three things:

| | What | How it's shown |
|---|---|---|
| `drop` | filler: um/uh, stutters, self-corrections | not shown |
| `rail` | the default: ordinary spoken content (verbatim) | clean lower-third subtitle, in front, readable. A punch word can get an inline `emphasis` highlight (accent colour / active-word pop); it stays on the rail. |
| `embed` | a promoted peak: the headline beat | one big word composited behind the subject (matte occlusion), designed entrance + exit |

The rail carries most of the text; embed is the scarce, earned peak. Scarcity is per beat/block, not per clip: ≤1 hero per block (thought), never two co-visible, ≥ a beat of air between hero windows (the compiler warns under 0.6s). A short clip → usually 1–2; a long explainer → ~one per section. Among multiple heroes, the largest authored one is the APEX (it alone gets the full lockup embed + width-fit raise); smaller ones are MINOR peaks that ride their column as oversized emphasis lines (fg, damped motion) — not every beat needs the matte showcase, which is exactly what keeps the apex an event. Embedding every word is still the common mistake.
Rail-surface identities build exactly this (rail = `rail.html`, embed = the climax in `index.html`). Column-flow identities drop the rail and make everything embed-style; recommend them only for mood-over-verbatim asks, never for explainer / voiceover where the words must read (CATALOG.md encodes this per identity).
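
The scarcity rules above (never two heroes co-visible, at least a beat of air between hero windows, with a warning under 0.6s) can be sketched as a small check. This is an illustration only, not the compiler's code: the `heroes` array and its `text` / `start` / `end` fields (in seconds) are assumed shapes.

```javascript
// Hypothetical input shape: [{ text, start, end }] with times in seconds.
// Only the 0.6s threshold comes from the text above; the rest is a sketch.
const MIN_AIR_SECONDS = 0.6;

function checkHeroSpacing(heroes) {
  const sorted = [...heroes].sort((a, b) => a.start - b.start);
  const problems = [];
  let last = null; // the hero whose window ends latest among those seen so far

  for (const hero of sorted) {
    if (last) {
      const air = hero.start - last.end;
      if (air < 0) {
        problems.push(`"${last.text}" and "${hero.text}" are co-visible`);
      } else if (air < MIN_AIR_SECONDS) {
        problems.push(
          `only ${air.toFixed(2)}s of air between "${last.text}" and "${hero.text}"`
        );
      }
    }
    if (!last || hero.end > last.end) last = hero;
  }
  return problems;
}

// Usage: an empty array means the hero windows are spaced acceptably.
console.log(checkHeroSpacing([
  { text: "FUTURE", start: 1.0, end: 2.0 },
  { text: "NOW", start: 2.3, end: 3.0 },
]));
```

Tracking the latest end time, instead of comparing neighbours only, also catches a long hero window that spans several later ones.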

Step 0 — pick ONE identity from the CATALOG

One front-end, three engines behind. The user picks an IDENTITY from CATALOG.md (17 entries: 12 classic + 5 themed); the engine, compiler and authoring file are derived by lookup from the catalog row. Never surface "Standard vs Cinematic vs Theme" as a question — those are backend names (a product has one UX even with several engines). The catalog encodes everything routing needs: reading surface, voice, recommend-for, scene needs, adjacency notes for the genuinely-close pairs (loud↔ordnance, neon↔neonsign, cream↔stardust).
Procedure: probe the clip → shortlist 2–3 identities from the catalog → recommend ONE with a one-line why → the user picks → author that identity's file. Identities are engine-locked (no cross combos; opening one is a validation event — see dna/README.md).
Always present your recommendation and let the user pick before you author. Don't silently default.
(The full identity table lives in CATALOG.md — single source of truth for routing. The engine docs below describe each backend's authoring contract.)
Recommendation heuristic: use the "Shortlisting heuristics" in CATALOG.md. They are identity-level (e.g. "炸" shortlists ordnance/stomp/terminal/loud and picks by WHAT should explode), never category-level. Unsure → `anchor`.
  • Cinematic → write `plan.json` for a locked template, compiled by `make-composition.cjs`.
  • Theme → read themes/README.md, author `theme.json`, run `scripts/render-theme.sh` (compiles + renders + plate reaction → final_fx.mp4).

Decision gate — RUN FIRST

Probe the video and classify the scene before choosing any mode or identity.
```bash
ffprobe <video.mp4>                                   # specs
ffmpeg -ss <t> -i <video.mp4> -vframes 1 sample.png   # repeat with <t> at 20/50/80% of the duration
```
Read the samples. Refuse if:
  • Multiple speakers / hard cuts (split & render each shot, or refuse)
  • No human subject (this skill is for talking-head)
  • Under 3 seconds, no speech, or face never clearly visible. `transcribe.cjs` warns when audio is near-silent (Whisper hallucinates words like "Thank you." over silence); heed it and refuse rather than caption fabricated words
  • Source already has burned-in captions / subtitles / heavy text graphics. Adding a second caption system conflicts, and the footage ships untouched (no covering/inpainting). Burned text often appears only mid-clip: sample a 1fps contact sheet (`ffmpeg -i in.mp4 -vf "fps=1,scale=160:-1,tile=10x5" sheet.png`), don't trust 3 spot frames.
  • Transcript is garbage: non-native/heavy-accent speech can transcribe into confident gibberish. Sanity-read `transcript.json` before authoring; if it doesn't parse as language, try `WHISPER_MODEL=medium` once, else refuse (a verbatim rail of fabricated words is worse than no captions).
  • Busy handheld with fast motion (matte flickers)
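
As a rough aid for the sampling above, the helper below computes the three spot-frame timestamps and a contact-sheet grid large enough for a clip of a given length. It is a sketch under assumptions: the function names are made up for illustration, and `durationSeconds` is expected to come from your own `ffprobe` reading. Note that the `10x5` tile in the command above holds 50 cells, so at 1fps a clip longer than 50 seconds would need more rows.

```javascript
// Illustrative helpers (not part of the skill's scripts).

// Timestamps (seconds) for the 20% / 50% / 80% spot frames.
function spotFrameTimes(durationSeconds, fractions = [0.2, 0.5, 0.8]) {
  return fractions.map((f) => Number((durationSeconds * f).toFixed(2)));
}

// Grid for a 1fps contact sheet: one cell per second of footage.
function contactSheetGrid(durationSeconds, columns = 10) {
  const frames = Math.max(1, Math.ceil(durationSeconds));
  const rows = Math.ceil(frames / columns);
  return { frames, columns, rows, tile: `${columns}x${rows}` };
}

// Usage: values to plug into -ss <t> and into the tile= filter argument.
console.log(spotFrameTimes(40));
console.log(contactSheetGrid(90).tile);
```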

Pre-flight probes (cost nothing, prevent the worst failures)

  1. Shot-cut probe. Sample frames at 20%, 50%, 80%. If a different subject/scene appears, trim the clip before the cut.
  2. Letterbox / pillarbox probe. Black bars on the first frame? Compute the safe content rect and constrain caption placement inside it.
  3. Luminance probe. Sample the caption region's average luminance: `under 60` → light text reads as-is, `60-180` → add the glyph scrim, `180+` → opaque text + scrim (never bare light text). Cinematic templates are cream + `screen` and LOCKED, so use this probe to pick a fitting identity (bright scenes → `ink`, or the opaque-rail `anchor` theme), never to recolour one.
  4. Identity recommendation by tone (you recommend; the user picks; see Step 0 + CATALOG.md). explainer / interview / must-read words → rail/panel-surface identities; poetic / social / "cinematic" → column-flow identities by register; "炸 / 特效 / VFX" / named worlds → themed identities. When unsure → `anchor` (words read, scene safe), but present a shortlist and let the user choose.
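
The luminance probe's thresholds can be written down as a small lookup. This is a sketch, not the skill's implementation: the treatment labels are made-up names, the Rec. 709 luma weights are an assumption about how "average luminance" is measured (the text does not say), and the boundary value 180 is assigned to the `180+` tier.

```javascript
// Average luma (0-255) of a list of [r, g, b] pixels.
// Assumption: Rec. 709 weights; the skill may measure luminance differently.
function averageLuma(pixels) {
  if (pixels.length === 0) return 0;
  const total = pixels.reduce(
    (sum, [r, g, b]) => sum + 0.2126 * r + 0.7152 * g + 0.0722 * b,
    0
  );
  return total / pixels.length;
}

// Map the probe result to the treatment tiers named in step 3.
function captionTreatment(avgLuma) {
  if (avgLuma < 60) return "light-text";        // under 60: light text reads as-is
  if (avgLuma < 180) return "light-text+scrim"; // 60-180: add the glyph scrim
  return "opaque-text+scrim";                   // 180+: never bare light text
}

// Usage: sample the caption region, then look up the treatment.
const region = [[240, 240, 235], [250, 248, 245]];
console.log(captionTreatment(averageLuma(region)));
```

For Cinematic identities the result should steer the identity choice (for example toward `ink` on a bright scene) instead of recolouring a locked template.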

Pipeline — 5 steps

```
1. hyperframes init <project> --non-interactive --video <video.mp4> --skip-skills
2. bash scripts/prepare.sh <project>       # matte ∥ transcribe (parallel) → safe-zones. One command.
                                           #   → frames_fg/ transcript.json safe-zones.json
3. [AGENT STEP: the only creative step] author a small JSON; see below by mode
   Cinematic: author plan.json → node scripts/fill-timings.cjs → fit-fonts.cjs → make-composition.cjs
   Theme:     author theme.json → bash scripts/render-theme.sh <project>   (compiles + renders + plate fx)
4. node scripts/preview-frames.cjs <project>   # ~2s/frame composite previews → § Visual QA (BEFORE the render)
5. bash scripts/render-and-composite.sh <project>  # gates → final.mp4 + history/ snapshot
   (Theme mode: SKIP steps 3b/5; render-theme.sh already runs compile + render-and-composite
    + _postfx.sh; the deliverable is final_fx.mp4, final.mp4 is pre-plate-reaction)
```
Step 3 differs by mode:

Step 3 — Cinematic mode (pure embed)

  1. Read `safe-zones.json` first. Narration planes go in `zones.hugLeft` / `hugRight`: clean strips ABUTTING the silhouette (text far from the body reads as floating, not embedded; far corners are the fallback, not the default). The hero defaults to `heroAnchor` / `heroBands.best` (centered ON the subject, ~30–55% occluded). `recommendation:"fg"` moves NARRATION in front for legibility; the hero stays embedded whenever `heroBands.feasible`; hero-fg is the last resort.
  2. The DNA is the identity you picked in Step 0 (CATALOG.md); do not re-open the choice here. Sanity-check it against the scene (bright hero band luma > 150 wants `ink`; full pick guidance lives in the catalog, covering all ten incl. neon / glitch / chrome / velocity). State your pick + why; the user decides. The DNA locks type/palette/blend/motion + hero three-act; safe-zones v2 (`palette` / `optics` / `lighting`) parameterizes it to THIS scene automatically.
  3. Author `<project>/cinematic.json`: `"dna": "<name>"` + thought-BLOCKS, not raw groups. Each block = lines of words (grouped 2–5 at clause boundaries) + the plane it stacks in + per-line `css` (size/weight/style only, no positions) + at most ONE line marked `"hero": true` (the promoted word; `"text"` for display form). Schema: `scripts/make-cinematic.cjs` header.
  4. Compile: `node scripts/make-cinematic.cjs <project>` lowers blocks → plan.json → index.html. Generated for you: transcript-sequenced timings, accumulate-within-block, page-flip-between-blocks, the hero LOCKUP (a hero block's pre-context, HERO and post-context stack as ONE bonded composition centered on the subject; reading order top→bottom = spoken order by construction; context floats in FRONT while the hero embeds BEHIND = the depth sandwich; a mass rule keeps the hero dominating its context), apex/minor hero split, reading order by construction, fg fallback per safe-zones. Then the gates run as usual. (Hand-authoring plan.json directly remains possible for designs blocks can't express; then run `fill-timings.cjs` + `fit-fonts.cjs` + `make-composition.cjs` yourself.)
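
Before compiling, it can help to sanity-check the authored blocks against the one rule stated in step 3 that is easy to break by hand: at most ONE hero line per block. The sketch below is not the real schema, which lives in the `scripts/make-cinematic.cjs` header. The `blocks` and `lines` field names are assumptions made for illustration; only `dna`, `hero` and `text` are named in the text above.

```javascript
// Hypothetical document shape (check the make-cinematic.cjs header for the real one):
// { dna: "<name>", blocks: [ { lines: [ { text, hero? } ] } ] }
function checkCinematicBlocks(doc) {
  const problems = [];
  if (typeof doc.dna !== "string" || doc.dna.length === 0) {
    problems.push('missing "dna"');
  }
  (doc.blocks || []).forEach((block, i) => {
    const heroLines = (block.lines || []).filter((line) => line.hero === true);
    if (heroLines.length > 1) {
      problems.push(`block ${i}: ${heroLines.length} hero lines (at most one allowed)`);
    }
  });
  return problems;
}

// Usage: an empty result means the draft passes this one check.
const draft = {
  dna: "cream",
  blocks: [
    { lines: [{ text: "this is" }, { text: "EVERYTHING", hero: true }] },
    { lines: [{ text: "and then" }, { text: "some more" }] },
  ],
};
console.log(checkCinematicBlocks(draft));
```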

Step 3 — Theme mode (themed constitution)

Read themes/README.md FIRST: paradigm/setpiece registries, linkages, hard rules, and the exact `theme.json` schema.
  1. Pick a theme DNA by content register (each `themes/<name>.json` has `voice` + `when`). State your pick + why; the user decides.
  2. Author `<project>/theme.json`: `dna`, `lines` (verbatim, transcript order; 1–5 words each; for `takeover` each line is one CARD), `minors` (emphasis words), `hero:{match}` (the climax word/phrase; leave it OUT of `lines` for embed setpieces, keep it IN for inline setpieces and panel+redact).
  3. Render: `bash scripts/render-theme.sh <project>` compiles (verbatim-completeness gate at compile time), renders both layers, composites, applies the plate reaction → `final_fx.mp4`. Use `preview-frames.cjs` between compile and render for Visual QA.
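
The authoring rules in step 2 can be pre-checked with a few lines of code before paying for a compile. This is a sketch under assumptions, not the compiler's gate: it assumes `lines` is an array of plain strings, counts words by whitespace (so it suits space-delimited languages only), and takes a hypothetical `heroInLines` flag that you set from the setpiece kind (false for embed setpieces, true for inline setpieces and panel+redact). The exact schema is in themes/README.md.

```javascript
// Assumed shape: { dna, lines: string[], minors: string[], hero: { match } }
function checkThemeLines(theme, { heroInLines }) {
  const problems = [];

  theme.lines.forEach((line, i) => {
    const words = line.trim().split(/\s+/).filter(Boolean);
    if (words.length < 1 || words.length > 5) {
      problems.push(`line ${i}: ${words.length} words (expected 1-5)`);
    }
  });

  const match = theme.hero && theme.hero.match;
  if (match) {
    const present = theme.lines.some((line) =>
      line.toLowerCase().includes(match.toLowerCase())
    );
    if (present && !heroInLines) {
      problems.push(`hero "${match}" should be left OUT of lines for this setpiece`);
    }
    if (!present && heroInLines) {
      problems.push(`hero "${match}" should be kept IN lines for this setpiece`);
    }
  }
  return problems;
}

// Usage: the dna value here is a placeholder, not a recommendation.
const theme = {
  dna: "<name>",
  lines: ["this changes", "everything we know"],
  minors: ["changes"],
  hero: { match: "everything" },
};
console.log(checkThemeLines(theme, { heroInLines: true }));
```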

Visual QA — preview BEFORE you render

`node scripts/preview-frames.cjs <project> [t…]` composites faithful preview frames in ~2s each (caption layers screenshotted at seek-time + real video frame + matte occlusion + rail overlay = what the final composite will look like at that moment). Default samples = each group/climax window. A full render costs minutes; never use it to discover layout problems.
Check the previews (`<project>/preview/sheet.png`) against this list. These are the failures the geometric gates cannot catch:
  1. Washout: light text over a bright region (window/sign/sky) is unreadable → move the plane or change DNA/mode (bright scene → `ink`).
  2. Text-on-text — captions over the scene's own text/graphics, or two caption groups colliding.
  3. Reading order — on-screen vertical order must match spoken order; the hero must not sit below later words.
  4. Hero presence — the climax should be BIG and visibly behind the subject (~30–55% occluded), not a floating label in a margin.
  5. Balance — one coherent column/band, not scattered fragments; margins breathing; nothing clipped.
Then the 5 positive checks in references/reference-bar.md (poster test · timid test · one-glance hierarchy · scene handshake · dead-air audit) — the failure list keeps a render from being broken; the positive list is what makes it designed. Ship when both pass.
Fresh-eyes review (recommended for anything user-facing): you have confirmation bias about your own layout. If you can spawn a subagent, give it ONLY the preview sheet + this checklist and ask for PASS/FIX verdicts per frame ("review these caption previews against the 5-point checklist; answer PASS or the specific fix per frame"). Apply fixes in plan.json / theme.json, recompile, re-preview — each loop costs seconds. Render once, when the previews pass.

The DNA registry — ten visual languages (replaces the template catalog)

Both modes draw from dna/: ten art-directed visual languages that parameterize per scene (accent sampled from the footage, contact shadow along the measured light direction, depth-match blur, RMS-coupled hero amplitude):

| DNA | Register | Scene fit | Voice |
|---|---|---|---|
| cream | premium-warm | dark/mid warm scenes | Inter + warm cream + screen; glowing emergence hero (successor of cinematic-cream) |
| ink | premium | bright scenes (luma > 150) | near-black multiply: type printed ON the wall; the bright-scene answer |
| editorial | editorial-luxe | introspective / fashion / poetic | Bodoni Moda, lowercase-italic hero: magazine elegance |
| keynote | tech-premium | product / launch | opaque white Inter 800, dead-center stillness |
| documentary | formal | interview / serious | burn-in reveals, no hero: gravitas IS the style |
| loud | loud | hype / sport / social | Anton + scene-sampled accent, single-unit slam + ripple; body ANNOUNCES in front (`bodyLayer: fg`) |
| neon | loud-cyber | cyberpunk / nightlife / tech-noir (dark scenes) | electric-cyan signage, ignition flicker, the hero powers ON like a sign |
| glitch | loud-cyber | digital / hacker / AI | RGB-split echoes snap together on landing; machine-percussive timing |
| chrome | loud-luxe | Y2K / fashion-tech / music | liquid-metal gradient hero + one sheen sweep during the hold |
| velocity | loud-sport | sport / auto / fitness | every word arrives along its motion vector (streak+skew), hero passes with speed trails |

Pick by `safe-zones.json` (`heroAnchor.bandLuma`, `palette.temperature`) × content register; dna/README.md has the decision rule. Authoring: `cinematic.json` takes `"dna": "<name>"`.
The engine generates the hero three-act from the DNA (no authoring needed): co-visible captions dim (setup) → per-letter entrance with amplitude ∝ spoken loudness (impact) → breathe + glow until exit (afterglow).
(Legacy: `plan.template:"cinematic-cream"` maps to `dna:"cream"` automatically. The retired 54-template library lives outside the skill at `~/Downloads/embedded-captions-archive/standard-templates-54/`; `_motion.md` remains in-skill as the motion-verb reference catalog.)

Aesthetic decision — tone × shot × platform (input to the catalog shortlist, NOT a second router)

Classify the clip on 3 axes and feed the result into CATALOG.md's shortlisting — this section never picks a mode/engine by itself:
Tone (what feel does the content have?)
  • documentary | conversational | energetic | poetic | keynote | investigative | music-video
Shot (what's the framing?)
  • close-up (head + shoulders) | mid-shot (torso+) | wide (full body+) | cut-montage (mixed shots)
Platform (where will it play?)
  • 9:16 portrait (TikTok/IG/Shorts) | 16:9 landscape (YouTube/web) | 1:1 square | broadcast export
Cross-reference in references/direction-catalog.md § Classification matrix for direction language — then return to CATALOG.md to shortlist identities (this matrix informs the shortlist; the catalog is the only routing surface).

Composition craft (embed track) — read before embedding

The full embed-track playbook lives in references/composition-craft.md: transcript role-annotation, phrase grouping, planes & clean-zone anchoring, zone coherence, climax pop & readability, edge-breathing, the occlusion 3-step judgement, and accumulation/persistence. It governs how a promoted phrase sits INTO the scene; read it before authoring any embed (Cinematic `plan.json` or Standard `index.html`). The default rail track has its own, much simpler spec → references/rail.md.

Shared knowledge

| Doc | What |
|---|---|
| references/rail.md | The rail track: standard lower-third subtitle spec (the default; carries most text). |
| references/composition-craft.md | The embed-track playbook: grouping, planes, climax pop, occlusion judgement, accumulation/persistence. Read before embedding. |
| dna/README.md | The DNA registry: ten scene-parameterized visual languages; how to pick. |
| references/reference-bar.md | The taste bar: per-register world-class references + the 5 positive checks. |
| references/aesthetic-principles.md | The 18 rules. Beat Veed AI on taste. Read first. |
| references/motion-vocabulary.md | 10 named motion primitives + tone→timing lookup |
| references/direction-catalog.md | 10 ship-ready aesthetics + tone×shot×platform matrix |
| references/anti-patterns.md | Bugs already locked out (CoreML, letter-spacing reflow, etc.) |
| references/scene-types.md | When a wall surface is usable (4 conditions) |
| references/layout-heuristics.md | Plane positioning, clean-zone selection, crown 3 conditions, pillarbox math |
| references/typography-presets.md | Font-size × column-width matrix (starting points) |
| references/caption-grouping.md | Word → group rules (pauses, sentence boundaries) |
| references/failure-modes.md | Long tail of dev gotchas |
| references/bespoke-vs-presets.md | Why presets fail sometimes; clone-and-tweak pattern |

Read the aesthetic principles and direction catalog FIRST. Everything else is implementation detail.

Non-negotiables

  • Face must never be 100%-covered continuously — every 0.3s window, face bbox ≥30% uncovered.
  • WCAG contrast — final render lints; fix palette if it fails.
  • Deterministic: no `Math.random()`, no `Date.now()`, no `repeat:-1`.
  • Never grade/recolor the video. The footage ships untouched — captions are the only addition. No full-frame scanlines / duotone / darken / vignette over the a-roll. Cyberpunk/CRT texture belongs inside a caption element, not over the whole frame.
  • Rail-first for talking-head / explainer. Don't embed the whole transcript — most text is the rail; embed only peaks. Embedding everything is the default mistake.
  • Embed is scarce + spaced. ≤1 embed per sentence/beat, never two adjacent or co-visible, ≥ a beat apart, at most one `apex`. climax = per-beat peak, not "the single payoff of the entire clip."
  • Matte = the PERSON (hyperframes `remove-background`, u2net_human_seg, Apache-2.0). Human segmentation by intent, but not surgically: thin offset furniture (mic boom arms) is usually excluded (captions render over it, behind the person), while large salient objects NEAR the subject (a telescope, a desk rig) can still leak into the matte and occlude captions. Objects HELD by the subject (products, phones) may drop out intermittently, letting captions pass in front. NEVER assume: sample `frames_fg/` at 2-3 timestamps before placing the hero, and prefer hero positions clear of any leaked furniture (`heroAnchor` can be skewed by leaks; cross-check against frames_bg).
  • safe-zones is PROP-BLIND: eyeball every band you use. Zones/heroBands score subject occlusion + luma only: a mic, telescope, or screen sitting inside a "clean" zone is invisible to them (and a prop leaking INTO the matte skews `heroAnchor.centerXPct` off the person). Before authoring, extract ONE frame of each band you intend to use; if a prop lives there, measure its bbox and move/shrink the plane. Two real cases shipped clean only because the agent did exactly this. (Auto prop-saliency is a known gap; zones' `peakLuma` only catches moving bright objects.)
  • Captions stay on-frame. Cinematic mode hard-gates frame-overflow; Standard mode runs `check-overflow.cjs` as a WARNING (intentional bleed is the only exception; read the warning).
  • Each caption ≥ 0.5s on screen — shorter = unreadable.
  • Word timings must match transcript.json within 80ms; a caption firing 500ms off-beat destroys the scene illusion. Cinematic runs `check-timing.cjs --strict` before rendering (via render-and-composite.sh); THEME mode enforces the same timings at compile time instead (make-theme's sequential transcript matcher + verbatim completeness gate; drift is a compile error). Never pack multiple transcript words into one entry (e.g. `"FUTURE OF"`, or an `IT` + line-break + `ALL` stack with one start/end): the second word inherits the first's timestamp and fires early. Split them into separate word entries with their own timings, even if you want them on the same visual line (use CSS `white-space` / natural wrap instead of `<br>`). Creative substitutions where caption text ≠ transcript (e.g. `"15%"` replacing `"fifteen percent"`) are supported: register them in `CREATIVE_SUBS` inside `check-timing.cjs`.
  • Group windows must envelop their words: `group.in ≤ min(word.start)` and `group.out ≥ max(word.end)` for every group. If `group.in` is later than a word's start, the word is silently delayed until the container mounts (we've shipped 800ms lag bugs from this). The validator enforces this.
  • No two caption groups may overlap in both time AND screen region; captions that overlap in time within the same region create text-on-text pileups. Options: (a) spatial separation: place each group in a non-overlapping vertical band so they can coexist (memory-wall cascade style); (b) handoff: set the earlier group's `out` ≤ the next group's `in` so only one is on screen; (c) deliberate layered typography: add `"allow_overlap": true` on one of the groups to silence the validator. The validator estimates each group's vertical bbox from its CSS and flags collisions. Pick (a) by default; it's what makes cinematic-cream feel like a poem accumulating, not a subtitle track replacing itself.
  • Screen-blend fails on bright backgrounds (>180 luminance). Cinematic templates are cream + `screen`, and that DNA is locked (the plan can't recolour them), so on a bright backdrop they wash out. Pick `ink` (letterpress built FOR bright surfaces) or the `anchor` theme (opaque rail surface) rather than overriding a look.
  • Don't animate `letter-spacing` or `filter:blur` on word entrance: inline-block reflow causes line-jumps.
  • CoreML is banned for matting: the onnxruntime CoreML EP's mixed-precision partitioning corrupted face alpha (observed with the previous RVM engine; don't re-try it). Matting is CPU-only (~2 fps @1080p ≈ 2-3 min per 10s clip; budget for it on long clips).
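The timing, envelope, and overlap rules above can be checked mechanically. The sketch below is a minimal illustration only, not the skill's actual validator or `check-timing.cjs`. The plan shape (`id`, `in`, `out`, `band`, `words`), the `validatePlan` name, the one-to-one sequential matching against the transcript, and the choice to treat a group's window as its on-screen time are all assumptions made for the example.

```javascript
// Hypothetical plan shape, for illustration only (not the skill's real schema):
//   group      = { id, in, out, band: { top, bottom }, allow_overlap?, words: [{ text, start, end }] }
//   transcript = [{ word, start, end }] in spoken order, one entry per caption word.
const MAX_DRIFT_S = 0.08;    // word timings must sit within 80ms of the transcript
const MIN_ON_SCREEN_S = 0.5; // anything shorter is unreadable
const EPS = 1e-9;            // tolerance for floating-point noise at the thresholds

function validatePlan(groups, transcript) {
  const errors = [];
  let next = 0; // index of the next transcript word to match

  for (const g of groups) {
    // Rule: each caption stays on screen for at least 0.5s.
    if (g.out - g.in < MIN_ON_SCREEN_S - EPS) {
      errors.push(`${g.id}: on screen for less than ${MIN_ON_SCREEN_S}s`);
    }

    // Rule: the group window must envelop every word it contains.
    const firstStart = Math.min(...g.words.map((w) => w.start));
    const lastEnd = Math.max(...g.words.map((w) => w.end));
    if (g.in > firstStart) errors.push(`${g.id}: group.in is later than its first word`);
    if (g.out < lastEnd) errors.push(`${g.id}: group.out is earlier than its last word`);

    // Rule: every word entry carries its own timing, within 80ms of the transcript.
    for (const w of g.words) {
      const ref = transcript[next++];
      if (!ref) {
        errors.push(`${g.id}: "${w.text}" has no matching transcript word`);
        continue;
      }
      const drift = Math.max(Math.abs(w.start - ref.start), Math.abs(w.end - ref.end));
      if (drift > MAX_DRIFT_S + EPS) {
        errors.push(`${g.id}: "${w.text}" drifts ${Math.round(drift * 1000)}ms from the transcript`);
      }
    }
  }

  // Rule: no two groups may overlap in both time AND vertical band,
  // unless one of them opts in with allow_overlap.
  for (let i = 0; i < groups.length; i++) {
    for (let j = i + 1; j < groups.length; j++) {
      const a = groups[i];
      const b = groups[j];
      const timeOverlap = a.in < b.out && b.in < a.out; // a.out === b.in is a clean handoff
      const bandOverlap = a.band.top < b.band.bottom && b.band.top < a.band.bottom;
      if (timeOverlap && bandOverlap && !a.allow_overlap && !b.allow_overlap) {
        errors.push(`${a.id}/${b.id}: overlap in both time and screen region`);
      }
    }
  }
  return errors;
}
```

An empty result means the plan passes these four checks. Two words packed into one entry (for example `FUTURE` and `OF` sharing a single start/end) surface as a drift error on the second word, which is the early-fire failure described above.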

Dependencies

  • hyperframes, built (`packages/cli/dist/cli.js`). Scripts auto-resolve the checkout: `HYPERFRAMES_ROOT` env → repo root if this skill ships inside hyperframes → `~/Downloads/hyperframes`. Build with `bun install && bun run build`.
  • Node-first; two Python touchpoints via `uvx` (no manual installs): transcription runs WhisperX through `uvx` (word-level timings; falls back per SKILL §transcription), and Theme's `drawon` setpiece shells out to `python3 scripts/gen-stroke-path.py` at compile time. Everything else runs on the toolchain hyperframes already ships: matting via the hyperframes CLI's `remove-background` (u2net_human_seg; weights auto-download once, ~168 MB, to `~/.cache/hyperframes/`), image/alpha math via `sharp`, layout/occlusion/overflow via `puppeteer`, plus `ffmpeg`. The scripts auto-resolve these from the hyperframes checkout, so there is nothing extra to install.
  • Transcription = WhisperX via `uvx` (word-level timings + alignment; no manual install, since `transcribe.cjs` drives `uvx whisperx`). Falls back to an existing word-level `transcript.json` if present.
  • Source video: `matte.cjs` / `transcribe.cjs` auto-resolve `source.mp4` (or glob the clip / read `hyperframes.json`), so `hyperframes init --video X.mp4` needs no manual rename.
  • fps: `matte.cjs` extracts at the source's native rate and records `matte.fps`; `render-and-composite.sh` uses that so the matte stays frame-aligned.
  • Matting weights are NOT bundled: `matte.cjs` shells out to the hyperframes CLI's `remove-background`, which downloads u2net_human_seg (~168 MB, Apache-2.0) once to `~/.cache/hyperframes/background-removal/models/`. The first prepare on a fresh machine needs network access for that one download.
If a hard dependency is missing, STOP and ask the user — don't silently skip steps.
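
The checkout lookup order described above (`HYPERFRAMES_ROOT` env, then the enclosing repo, then `~/Downloads/hyperframes`) might be sketched as follows. This is an assumption-laden illustration, not the scripts' real resolver: the `resolveHyperframesRoot` name, its injectable parameters, walking up from the skill directory to find the repo root, and falling through to the next candidate when one is unbuilt are all hypothetical.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// A checkout counts as usable only if the built CLI entry point exists.
const CLI_REL = path.join("packages", "cli", "dist", "cli.js");

// Candidate roots in priority order. Walking up from the skill directory is an
// assumed way to detect "this skill ships inside hyperframes".
function candidateRoots(env, skillDir, home) {
  const roots = [];
  if (env.HYPERFRAMES_ROOT) roots.push(env.HYPERFRAMES_ROOT);
  let dir = path.resolve(skillDir);
  for (;;) {
    roots.push(dir);
    const parent = path.dirname(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }
  roots.push(path.join(home, "Downloads", "hyperframes"));
  return roots;
}

// Returns the first root that contains a built CLI. Throws instead of guessing,
// mirroring the rule to stop and ask the user when a hard dependency is missing.
function resolveHyperframesRoot({
  env = process.env,
  skillDir = __dirname,
  home = os.homedir(),
  exists = fs.existsSync, // injectable so the lookup can be exercised without a real checkout
} = {}) {
  for (const root of candidateRoots(env, skillDir, home)) {
    if (exists(path.join(root, CLI_REL))) return root;
  }
  throw new Error(
    "hyperframes checkout not found or not built (run `bun install && bun run build`); stop and ask the user"
  );
}
```

Calling `resolveHyperframesRoot()` with no arguments uses the real environment and filesystem. Passing a fake `exists` makes the priority order easy to check in isolation.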