hyperframes-captions

Captions

Analyze the spoken content to determine caption style. If the user specifies a style, use that. Otherwise, detect tone from the transcript.

Transcript Source

The project's `transcript.json` contains word-level timestamps from whisper.cpp (`--output-json-full` with `--dtw`):

{
  "transcription": [
    {
      "offsets": { "from": 0, "to": 5000 },
      "text": " Hello world.",
      "tokens": [
        { "text": " Hello", "offsets": { "from": 0, "to": 1000 }, "p": 0.98 },
        { "text": " world", "offsets": { "from": 1000, "to": 2000 }, "p": 0.95 }
      ]
    }
  ]
}

Normalize tokens into a word array before grouping:

const words = [];
for (const segment of transcript.transcription) {
  for (const token of segment.tokens || []) {
    const text = token.text.trim();
    if (!text) continue;
    words.push({
      text,
      start: token.offsets.from / 1000,
      end: token.offsets.to / 1000,
    });
  }
}

If no `transcript.json` exists, check for `.srt` or `.vtt` files. If no transcript is available, ask the user to provide one or run `hyperframes transcribe` (when available).
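
The `.srt` fallback could be handled along these lines. This is a minimal sketch, not the project's own loader: `parseSrt` and `srtTimeToSeconds` are hypothetical helper names, and the result is cue-level (one entry per subtitle cue), so word-level timing would still have to be approximated separately.

```javascript
// Hypothetical helper: convert an SRT timestamp such as "00:01:02,500" to seconds.
function srtTimeToSeconds(stamp) {
  const match = /^(\d+):(\d{2}):(\d{2})[,.](\d{1,3})$/.exec(stamp.trim());
  if (!match) throw new Error(`Bad SRT timestamp: ${stamp}`);
  const [, h, m, s, ms] = match;
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms.padEnd(3, "0")) / 1000;
}

// Hypothetical helper: turn SRT text into { text, start, end } entries,
// the same shape as the normalized words array, but one entry per cue.
function parseSrt(srtText) {
  const cues = [];
  const blocks = srtText.replace(/\r/g, "").trim().split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // skip blocks without a timing line
    const [from, to] = lines[timingIndex].split("-->");
    // Keep only the first token after "-->" so trailing position hints are ignored.
    const end = srtTimeToSeconds(to.trim().split(/\s+/)[0]);
    const start = srtTimeToSeconds(from.trim().split(/\s+/)[0]);
    const text = lines.slice(timingIndex + 1).join(" ").trim();
    if (!text) continue;
    cues.push({ text, start, end });
  }
  return cues;
}
```

Each cue can then be treated as a pre-formed caption group, or split into words with evenly distributed timing if per-word animation is needed.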

Style Detection (Default — When No Style Is Specified)

Read the full transcript before choosing a style. The style comes from the content, not a template.

Four Dimensions

1. Visual feel — the overall aesthetic personality:

Corporate/professional scripts → clean, minimal, restrained
Energetic/marketing scripts → bold, punchy, high-impact
Storytelling/narrative scripts → elegant, warm, cinematic
Technical/educational scripts → precise, high-contrast, structured
Social media/casual scripts → playful, dynamic, friendly

2. Color palette — driven by the content's mood:

Dark backgrounds with bright accents for high energy
Muted/neutral tones for professional or calm content
High contrast (white on black, black on white) for clarity
One accent color for emphasis — not multiple

3. Font mood — typography character, not specific font names:

Heavy/condensed for impact and energy
Clean sans-serif for modern and professional
Rounded for friendly and approachable
Serif for elegance and storytelling

4. Animation character — how words enter and exit:

Scale-pop/slam for punchy energy
Gentle fade/slide for calm or professional
Word-by-word reveal for emphasis
Typewriter for technical or narrative pacing

Per-Word Styling

Scan the script for words that deserve distinct visual treatment. Not every word is equal — some carry the message.

What to Detect

Brand names / product names — larger size, unique color, distinct entrance
ALL CAPS words — the author emphasized them intentionally. Scale boost, flash, or accent color.
Numbers / statistics — bold weight, accent color. Numbers are the payload in data-driven content.
Emotional keywords — "incredible", "insane", "amazing", "revolutionary" → exaggerated animation (overshoot, bounce)
Proper nouns — names of people, places, events → distinct accent or italic
Call-to-action phrases — "sign up", "get started", "try it now" → highlight, underline, or color pop

How to Apply

For each detected word, specify:

Font size multiplier (e.g., 1.3x for emphasis, 1.5x for hero moments)
Color override (specific hex value)
Weight/style change (bolder, italic)
Animation variant (overshoot entrance, glow pulse, scale pop)
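
A rule-based first pass over individual words might look like the following. It is only a sketch under stated assumptions: `detectWordStyle`, the `kind` labels, the animation names, and the default accent color `#FFD400` are all hypothetical. It covers only the categories that can be recognized mechanically (numbers, ALL CAPS, emotional keywords); brand names, proper nouns, and call-to-action phrases need project-specific lists or a read of the script.

```javascript
// Hypothetical keyword list taken from the examples above; extend per project.
const EMOTIONAL_KEYWORDS = new Set(["incredible", "insane", "amazing", "revolutionary"]);

// Returns a style override for one word, or null when the word needs no special treatment.
function detectWordStyle(text, accentColor = "#FFD400") {
  // Strip leading/trailing punctuation so "LAUNCH!" and "amazing," still match.
  const bare = text.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
  if (!bare) return null;

  // Numbers and statistics: "42", "3x", "90%", "$1,200".
  if (/\d/.test(bare)) {
    return { kind: "number", fontWeight: 800, color: accentColor };
  }
  // ALL CAPS words of two or more characters (skips "I" and "A").
  if (bare.length > 1 && /\p{L}/u.test(bare) && bare === bare.toUpperCase()) {
    return { kind: "allCaps", sizeMultiplier: 1.3, color: accentColor, animation: "scale-pop" };
  }
  // Emotional keywords get an exaggerated entrance.
  if (EMOTIONAL_KEYWORDS.has(bare.toLowerCase())) {
    return { kind: "emotional", sizeMultiplier: 1.3, animation: "overshoot" };
  }
  return null;
}
```

For example, `detectWordStyle("90%")` yields a `number` override, while `detectWordStyle("the")` returns `null` and the word keeps the base style.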

Script-to-Style Mapping

| Script tone | Font mood | Animation | Color | Size |
| --- | --- | --- | --- | --- |
| Hype/launch | Heavy condensed, 800-900 weight | Scale-pop, back.out(1.7), fast 0.1-0.2s | Bright accent on dark (cyan, yellow, lime) | Large 72-96px |
| Corporate/pitch | Clean sans-serif, 600-700 weight | Fade + slide-up, power3.out, 0.3s | White/neutral on dark, single muted accent | Medium 56-72px |
| Tutorial/educational | Mono or clean sans, 500-600 weight | Typewriter or gentle fade, 0.4-0.5s | High contrast, minimal color | Medium 48-64px |
| Storytelling/brand | Serif or elegant sans, 400-500 weight | Slow fade, power2.out, 0.5-0.6s | Warm muted tones, low opacity (0.85-0.9) | Smaller 44-56px |
| Social/casual | Rounded sans, 700-800 weight | Bounce, elastic.out, word-by-word | Playful colors, colored backgrounds on pills | Medium-large 56-80px |
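
The mapping can also be held as data so that a detected tone resolves to concrete values. The sketch below is one possible encoding, not a required format: the key names, `presetForTone`, `midpoint`, and the fallback to the corporate preset for unknown tones are all assumptions. Ranges are copied from the mapping as `[min, max]`, and `null` marks a value the mapping does not give.

```javascript
// Ranges are [min, max] copied from the script-to-style mapping; null = not specified there.
const STYLE_PRESETS = {
  hype: { fontMood: "heavy condensed", fontWeight: [800, 900], animation: "scale-pop",
          ease: "back.out(1.7)", duration: [0.1, 0.2], fontSize: [72, 96] },
  corporate: { fontMood: "clean sans-serif", fontWeight: [600, 700], animation: "fade + slide-up",
               ease: "power3.out", duration: [0.3, 0.3], fontSize: [56, 72] },
  tutorial: { fontMood: "mono or clean sans", fontWeight: [500, 600], animation: "typewriter or gentle fade",
              ease: null, duration: [0.4, 0.5], fontSize: [48, 64] },
  storytelling: { fontMood: "serif or elegant sans", fontWeight: [400, 500], animation: "slow fade",
                  ease: "power2.out", duration: [0.5, 0.6], fontSize: [44, 56] },
  social: { fontMood: "rounded sans", fontWeight: [700, 800], animation: "bounce, word-by-word",
            ease: "elastic.out", duration: null, fontSize: [56, 80] },
};

// Pick the middle of a [min, max] range as a starting value.
function midpoint([min, max]) {
  return (min + max) / 2;
}

// Assumption: unknown tones fall back to the corporate preset as the most restrained option.
function presetForTone(tone) {
  const known = Object.prototype.hasOwnProperty.call(STYLE_PRESETS, tone);
  return known ? STYLE_PRESETS[tone] : STYLE_PRESETS.corporate;
}
```

Starting from the midpoint of each range and adjusting by eye keeps the choice inside the recommended bounds, for example `midpoint(presetForTone("storytelling").fontSize)` for a 50px base size.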

Word Grouping by Tone

Group size affects pacing. Fast content needs fast caption turnover.

High energy: 2-3 words per group. Quick turnover matches rapid delivery.
Conversational: 3-5 words per group. Natural phrase length.
Measured/calm: 4-6 words per group. Longer groups match slower pace.

Break groups on sentence boundaries, pauses (>150ms gap), or max word count, whichever comes first.
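
The three break rules can be sketched as a single pass over the normalized word array. This is an illustrative sketch rather than the project's implementation: `groupWords` is a hypothetical name, sentence boundaries are assumed to mean a word ending in `.`, `!`, or `?`, and punctuation is assumed to be attached to the word text rather than emitted as a separate token.

```javascript
// Group { text, start, end } words (times in seconds) into caption groups.
// maxWords: 3 for high energy, 5 for conversational, 6 for calm content.
function groupWords(words, maxWords = 4, pauseGap = 0.15) {
  const groups = [];
  let current = [];

  // Close the current group, if it has any words, and start a new one.
  const flush = () => {
    if (current.length === 0) return;
    groups.push({
      words: current,
      text: current.map((w) => w.text).join(" "),
      start: current[0].start,
      end: current[current.length - 1].end,
    });
    current = [];
  };

  for (let i = 0; i < words.length; i++) {
    const word = words[i];
    const next = words[i + 1];
    current.push(word);

    // Assumption: a sentence ends on ., ! or ?, optionally followed by a closing quote or bracket.
    const endsSentence = /[.!?]["')\]]*$/.test(word.text);
    const pauseFollows = next !== undefined && next.start - word.end > pauseGap;

    if (endsSentence || pauseFollows || current.length >= maxWords) flush();
  }
  flush(); // emit any trailing words
  return groups;
}
```

Calling `groupWords(words, 3)` suits high-energy scripts, while `groupWords(words, 6)` matches a measured delivery.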

Positioning

Landscape (1920x1080): Bottom 80-120px, centered
Portrait (1080x1920): Lower middle ~600-700px from bottom, centered
Never cover the subject's face
Use `position: absolute`, never relative (causes overflow)
One caption group visible at a time
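
As a rough illustration, the positioning rules can be expressed as a function that returns inline styles for the caption container. `captionPosition` is a hypothetical helper, the offsets are simply the midpoints of the ranges above, and the full-width container with centered text is one assumed way of centering that leaves CSS transforms free for animation.

```javascript
// Return inline styles for the caption container of a given composition size.
function captionPosition(width, height) {
  const portrait = height > width;
  // Midpoints of the recommended ranges: 600-700px (portrait), 80-120px (landscape).
  const bottom = portrait ? 650 : 100;
  return {
    position: "absolute", // never "relative": it causes overflow
    left: "0",
    width: "100%",
    bottom: `${bottom}px`,
    textAlign: "center",
  };
}
```

The offsets are starting points only. If the subject's face sits low in the frame, move the captions rather than cover it.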

Constraints

Deterministic. No `Math.random()`, no `Date.now()`.
Sync to transcript timestamps. Words appear when spoken.
One group visible at a time. No overlapping caption groups.
Check project root for font files before defaulting to Google Fonts.
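
One way to satisfy the determinism and no-overlap constraints is to derive visibility purely from the playback time. The sketch below assumes caption groups shaped as `{ text, start, end }` with times in seconds; `clampGroups` and `activeGroupAt` are hypothetical names, not part of any Hyperframes API.

```javascript
// Trim each group's end so it never runs past the start of the next group.
function clampGroups(groups) {
  return groups.map((group, i) => {
    const next = groups[i + 1];
    const end = next ? Math.min(group.end, next.start) : group.end;
    return { ...group, end };
  });
}

// Pure function of time: the same t always returns the same group, or null in a gap.
function activeGroupAt(groups, t) {
  for (const group of groups) {
    if (t >= group.start && t < group.end) return group;
  }
  return null;
}
```

Because neither function reads a clock or a random source, rendering the same frame twice gives the same caption, which is what frame-by-frame export needs.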
