seedance-podcast-visual

Podcast Visual — Audio-to-Video Transformation Prompts

Transform podcast audio into cinematic visual content using Seedance 2.0 on Higgsfield. This skill produces video prompts that replace static audiograms with storytelling-driven visual experiences built entirely from constructed imagery.

Input Specifications

Primary inputs:
  • Up to 3 audio files (podcast clips, interview excerpts, sound bites, episode highlights)
  • Transcript or key quote text from the audio
  • Speaker name(s) and brief context (topic, show name, tone)
  • Desired visual style (abstract, cinematic, interview reconstruction, kinetic)
  • Target platform (Instagram Reels, YouTube Shorts, LinkedIn, TikTok)
  • Aspect ratio: 9:16 (vertical/mobile-first), 16:9 (widescreen), or 1:1 (square)
Audio file handling:
  • File 1: Primary clip — the main sound bite or key quote being visualized
  • File 2 (optional): Intro or context clip — sets up the narrative before the hook
  • File 3 (optional): Reaction or follow-up clip — speaker response, co-host moment, audience reaction
  • Duration guidance: each clip should be 15–90 seconds; total sequence up to 3 minutes
What you extract from audio before writing prompts:
  • The single most quotable sentence (becomes the visual anchor)
  • The emotional register: contemplative, fired-up, vulnerable, instructive, funny
  • Pacing: fast and punchy vs. slow and deliberate delivery
  • Natural pauses: where silence lives (these become visual breath moments)
  • Speaker energy level: seated calm, animated gesturing, emotional peak
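The duration guidance above can be checked mechanically before any prompt writing starts. The following is a minimal Python sketch, not part of Seedance 2.0 or Higgsfield; the `Clip` class, the role labels, and `check_clips` are hypothetical names introduced for illustration. Note that three clips at the 90-second maximum would exceed the 3-minute total, so both limits are checked.

```python
from dataclasses import dataclass

MIN_CLIP_S = 15     # per-clip lower bound from the duration guidance
MAX_CLIP_S = 90     # per-clip upper bound from the duration guidance
MAX_TOTAL_S = 180   # total sequence cap (3 minutes)
MAX_CLIPS = 3       # "up to 3 audio files"


@dataclass
class Clip:
    role: str          # illustrative labels: "primary", "context", "reaction"
    duration_s: float  # clip length in seconds


def check_clips(clips: list[Clip]) -> list[str]:
    """Return human-readable problems; an empty list means the set fits the guidance."""
    problems = []
    if not 1 <= len(clips) <= MAX_CLIPS:
        problems.append(f"expected 1-{MAX_CLIPS} clips, got {len(clips)}")
    for clip in clips:
        if not MIN_CLIP_S <= clip.duration_s <= MAX_CLIP_S:
            problems.append(
                f"{clip.role}: {clip.duration_s}s is outside {MIN_CLIP_S}-{MAX_CLIP_S}s"
            )
    total = sum(clip.duration_s for clip in clips)
    if total > MAX_TOTAL_S:
        problems.append(f"total {total}s exceeds {MAX_TOTAL_S}s")
    return problems
```

A single 45-second primary clip passes with no problems, while three 90-second clips are flagged only for the total length.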

Philosophy

Old model (audiogram) | New model (podcast visual)
Show the waveform | Show what the words feel like
Static background image | Constructed cinematic environment
Speaker photo as thumbnail | Speaker reconstructed in scene
Generic brand colors | Lighting and atmosphere matched to tone
Passive viewing | Active emotional engagement
Optimized for "audio on" | Compelling even on mute

2-Second Hook Patterns

The hook is the opening frame that stops the scroll. It must communicate emotion, intrigue, or tension before a single word is heard. Four proven structures:

The Quote Impact

Display the most provocative line from the clip as large kinetic text before audio begins. The text arrives with weight — not a gentle fade, but a hard cut or a push-in. The visual behind it is blurred or dark, forcing the text into full focus.
When to use: clips with a single devastating sentence, contrarian takes, counterintuitive statistics, direct challenges to conventional wisdom.
Visual execution in prompt: specify "bold white sans-serif typography slams onto dark background, camera holds for 1.5 seconds, then cuts to speaker close-up, shallow depth of field, background softly bokeh'd."

The Reaction Shot

Open on the speaker's face at the moment of peak emotional expression — surprise, laughter, conviction, vulnerability — before any context is given. This creates a curiosity gap: the viewer needs to hear what caused that expression.
When to use: interview moments where a genuine reaction occurs, storytelling clips where the speaker relives something visceral, moments of realization or revelation.
Visual execution in prompt: specify "extreme close-up on speaker's face, caught mid-expression, eyes slightly wide, ambient room sound implied by environment, camera slowly eases back over 3 seconds to reveal setting."

The Visual Metaphor

Instead of showing the speaker at all, open with an environmental or abstract image that represents the core concept of the clip. A podcast about burnout opens on dying embers. A clip about compounding returns opens on a single drop rippling outward. The metaphor does expository work so the audio can focus on depth.
When to use: concept-heavy clips, philosophical discussions, any clip where the idea is more powerful than the person delivering it.
Visual execution in prompt: specify the metaphor object explicitly, its lighting, its motion quality, and a precise camera behavior (slow push, orbital, static hold with foreground element drifting through).

The Sound Wave Art

Not a functional audiogram waveform — instead, an artistic rendering of sound as visual sculpture. Particles forming and dissolving in rhythm with imagined speech cadence. Light bending through air as if vibrated by voice. Sound made beautiful, not informational.
When to use: music-adjacent podcasts, high-production brand content, moments where you want to foreground the craft of the medium itself.
Visual execution in prompt: specify particle behavior, color palette tied to the emotional register of the clip, and whether motion is rhythmic/predictable or fluid/organic. Avoid the word "waveform" — describe it as "acoustic particle field" or "resonant light diffusion."

Visual Formats

Abstract Visualization

The audio inspires a visual world that does not contain the speaker at all. Instead, abstract imagery — light, texture, particle systems, color gradients, fluid dynamics — evolves in response to the imagined emotional arc of the audio.
Core parameters:
  • Color temperature must match emotional tone (cool/blue for analytical, warm/amber for intimate, high-contrast for confrontational)
  • Motion should breathe with speech rhythm — slowing during pauses, accelerating during emphasis
  • Avoid literal representation; the visual is interpretive, not illustrative
  • Works best at 9:16 for mobile, full-bleed composition
Prompt elements to always include: dominant color palette, motion behavior (fluid, particle, crystalline, liquid, smoke), camera behavior (static, slow push, orbital), and whether the environment is finite (a room implied by light edges) or infinite (void space).
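One way to make sure none of the four required elements is forgotten is to assemble the prompt from named parts. This is a minimal sketch under stated assumptions: `abstract_prompt` and its keyword names are hypothetical, the output is plain text for pasting into a prompt field, and nothing here calls a Seedance 2.0 or Higgsfield API.

```python
# The four elements listed above; the key names are illustrative choices.
REQUIRED = ("palette", "motion", "camera", "environment")


def abstract_prompt(**elements: str) -> str:
    """Assemble an abstract-visualization prompt; refuse if an element is missing."""
    missing = [name for name in REQUIRED if not elements.get(name)]
    if missing:
        raise ValueError(f"missing prompt elements: {', '.join(missing)}")
    return (
        "Abstract visualization, 9:16 full-bleed. "
        f"Palette: {elements['palette']}. "
        f"Motion: {elements['motion']}. "
        f"Camera: {elements['camera']}. "
        f"Environment: {elements['environment']}."
    )


prompt = abstract_prompt(
    palette="cool blue with a single amber accent",
    motion="fluid, slowing during pauses and accelerating on emphasis",
    camera="slow push",
    environment="infinite void space",
)
```

Leaving out any of the four keywords raises a `ValueError` naming the missing element, which is the point of the sketch: the omission is caught before the prompt is submitted.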

Cinematic B-Roll Narrative

Construct a series of visuals that would, in a traditional documentary, accompany the audio as b-roll. Except here every frame is generated — no stock footage, no compromises. The b-roll tells the story of the words.
Core parameters:
  • Each visual beat corresponds to a sentence or phrase in the clip
  • Environments are specific: not "a city" but "a rain-slicked street at 11 PM, single sodium-vapor streetlight, no pedestrians"
  • Objects carry symbolic weight: a speaker discussing scarcity shows empty shelves; one discussing abundance shows an overflowing market
  • Camera movement is motivated — zoom-in when tension builds, cut to wide when perspective expands
Prompt elements to always include: specific environment (time of day, weather, geography implied), one or two key objects in frame, camera move, lighting source, color grade direction (film noir, golden hour, overcast flat light, neon-saturated).

Split-Screen Interview Reconstruction

Reconstruct the podcast conversation as if it were a filmed interview, split-screen between two constructed environments. Each speaker occupies a distinct visual space — differentiated by lighting color temperature, depth of field, and environmental detail — while remaining in visual dialogue with each other.
Core parameters:
  • Left panel and right panel are visually asymmetric by design, not just mirrored
  • Lighting on each speaker communicates their role: warmer for the guest/storyteller, cooler-neutral for the host/interrogator
  • Camera behavior between panels should differ: one speaker gets a slow push-in, the other a static hold
  • Invisible edit: both panels feel like they belong to the same moment even though they are compositionally separate
Prompt elements to always include: panel ratio (50/50, 60/40, or dynamic shift), description of each environment, lighting scheme for each, camera behavior for each, whether there is any visual bleed or hard line between panels.

Kinetic Typography

The words themselves become the visual. The transcript animates — letters forming, words scaling, phrases colliding, key terms expanding to fill frame. The background is subordinate to type.
Core parameters:
  • Font weight must be heavy — no light or thin weights for podcast visuals; impact requires mass
  • Motion choreography follows the rhythm of speech, not just the meaning
  • Color contrast must survive both light and dark mode viewing
  • Key words get visual differentiation: scale, color pop, brief hold, or shake-on-impact
  • Secondary text (speaker name, show name, timestamp) should not compete with primary quote text
Prompt elements to always include: primary font style (condensed sans, geometric sans, slab serif, display), animation behavior per word type (verbs get thrust/motion, nouns get scale, punctuation gets pause), background treatment (solid, gradient, subtle texture, barely-there environmental photo), and transition behavior between lines.

Audio-Visual Sync Techniques

Seedance 2.0 generates video from text, not from audio input directly. Syncing visual rhythm to audio content is a craft challenge solved at the prompting stage.

Beat-Matching Visuals to Speech Rhythm

Map the cadence of the clip before writing the prompt. Fast-talking speakers with high information density need visuals with shorter hold times and more cuts implied. Slower, deliberate speakers — the kind who pause for effect — need visuals that can breathe with them.
Practical approach:
  • Count syllables per second in the clip (rough guide: slow = under 3/sec, normal = 3-5/sec, fast = over 5/sec)
  • For slow speakers: specify long camera holds, environmental drift, single sustained composition
  • For fast speakers: specify implied cut rhythm, multiple visual planes in frame, foreground elements that create natural layering
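The syllable-rate thresholds in the practical approach translate directly into a small classifier. This is a rough sketch: the function names are hypothetical, the syllable count is assumed to come from a manual count or a transcript tool, and treating exactly 5 syllables per second as "normal" is an assumption made here to resolve the boundary.

```python
from typing import Optional

# Camera guidance taken from the bullets above; the text gives none for "normal".
CAMERA_GUIDANCE = {
    "slow": "long camera holds, environmental drift, single sustained composition",
    "fast": "implied cut rhythm, multiple visual planes in frame, layered foreground elements",
}


def classify_pace(syllables: int, duration_s: float) -> str:
    """Classify delivery speed from a rough syllable count and the clip length."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    rate = syllables / duration_s
    if rate < 3:
        return "slow"
    if rate <= 5:  # assumption: exactly 5/sec still counts as normal
        return "normal"
    return "fast"


def camera_guidance(syllables: int, duration_s: float) -> Optional[str]:
    """Return the matching camera guidance, or None for normal pacing."""
    return CAMERA_GUIDANCE.get(classify_pace(syllables, duration_s))
```

For a 45-second clip, 100 syllables classifies as slow and 270 as fast; a clip in the normal band returns no guidance because the text leaves that case to judgment.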

Emphasis Moments

Identify the 2-3 words or phrases in the clip that carry the most weight. These become visual events — a push-in, a light shift, a particle burst, a text scale event. The prompt should describe these precisely.
Example language for prompts:
  • "At the word 'everything,' camera completes its slow push-in to extreme close-up"
  • "The phrase 'no one told me' appears in oversized type, centered, holds for 1.5 seconds before dissolving"
  • "Warm backlight intensifies for 0.8 seconds aligned with the speaker's raised hand gesture"
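Emphasis lines like these can be generated from a short list of marked moments so that the 2-3 event limit and the time ordering are enforced consistently. The sketch below is illustrative only: `Emphasis` and `emphasis_lines` are hypothetical names, and the timestamps in the usage lines are invented for the example rather than taken from any clip.

```python
from dataclasses import dataclass


@dataclass
class Emphasis:
    at_s: float   # when the word or phrase lands, in seconds from clip start
    phrase: str   # the word or phrase carrying the weight
    event: str    # the visual event, written in prompt language


def emphasis_lines(events: list[Emphasis], max_events: int = 3) -> list[str]:
    """Format emphasis moments as time-anchored prompt lines, earliest first."""
    if len(events) > max_events:
        raise ValueError(f"keep emphasis to {max_events} moments or fewer")
    ordered = sorted(events, key=lambda e: e.at_s)
    return [f"At second {e.at_s:g}, on '{e.phrase}': {e.event}." for e in ordered]


lines = emphasis_lines([
    Emphasis(12.5, "everything", "camera completes its slow push-in to extreme close-up"),
    Emphasis(8, "no one told me", "phrase appears in oversized type, centered, holds for 1.5 seconds"),
])
```

The resulting lines come out in time order regardless of input order, so they can be pasted into a prompt as written.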

Pause Visualization

Silence in a podcast clip is intentional. Great speakers use pauses like punctuation. These moments are visual opportunities: a held frame, a slow-motion exhale, a breath of empty space in the composition. Do not cut away during a pause — hold, and let the tension live.
Prompt language for pauses: "camera holds completely static," "particle motion slows to near-still," "background light dims 15% and recovers," "single dust mote drifts through foreground."
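If word-level timestamps are available, pauses can be located automatically and turned into hold instructions. This is a sketch under assumptions: the `(text, start_s, end_s)` tuples are presumed to come from whatever transcription tool is in use, the 0.6-second threshold is an arbitrary starting point rather than a value from this guide, and both function names are hypothetical.

```python
def find_pauses(
    words: list[tuple[str, float, float]], min_gap_s: float = 0.6
) -> list[tuple[float, float]]:
    """Return (pause_start_s, pause_length_s) for each gap of at least min_gap_s."""
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        gap = next_start - prev_end
        if gap >= min_gap_s:
            pauses.append((prev_end, round(gap, 2)))
    return pauses


def pause_directions(pauses: list[tuple[float, float]]) -> list[str]:
    """Turn each pause into a static-hold instruction in prompt language."""
    return [
        f"At second {start:g}: camera holds completely static for {length:g} seconds."
        for start, length in pauses
    ]


# Invented timings for illustration only.
words = [("I", 0.0, 0.2), ("wasn't", 0.25, 0.6), ("jumping.", 0.65, 1.2), ("I", 2.4, 2.5)]
directions = pause_directions(find_pauses(words))
```

Only the long gap after "jumping." crosses the threshold here, so a single hold instruction is produced; the short gaps between ordinary words are ignored.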

Camera and Composition for Non-Live Content

Everything in a Seedance 2.0 podcast visual is constructed, not captured. This is not a limitation — it is the entire point. Camera behavior must be specified deliberately.
Core camera vocabulary for podcast visuals:
  • Slow push-in (dolly in): builds intimacy and urgency over time; use for building conviction in the speaker's words
  • Static hold: signals stability and authority; the speaker's words are certain enough that the camera need not move
  • Slow rack focus: transitions attention from background to foreground or vice versa; use during pivots in the narrative
  • Orbital (slow 360-degree arc): creates dimensionality around a subject or object; use for abstract or metaphor sequences
  • Floating handheld: simulates presence and documentary intimacy; specify "subtle organic camera drift, not mechanical" to distinguish from tripod hold
  • Extreme close-up (ECU): a detail shot — eye, hand, mouth, object — that signals emotional intensity without requiring full frame context
Composition principles:
  • For vertical (9:16): place the subject at eye-height in the upper third; leave space in the lower third for typography overlay
  • For widescreen (16:9): observe the rule of thirds strictly; avoid center-frame static compositions for interview reconstruction
  • For square (1:1): treat it as a poster — every frame should be self-contained and graphically bold
  • Always specify negative space intentionality: "empty space to the right of the speaker implies room for text overlay"

Lighting

Lighting is the single most powerful emotional signal in a constructed podcast visual. Specify it precisely.

Studio Interview Recreation

A clean, professional environment that reads as controlled but not sterile. Two-point or three-point lighting setup. The subject is clearly lit; the background is separated but not overlit.
Prompt language: "soft key light from camera-left, warm 4800K, catchlight visible in eyes; subtle fill from camera-right at 1/3 power; dark grey background with faint rim separation from practical studio light; no hard shadows on face."

Abstract Void

The subject or object exists in pure darkness, lit by a single motivated source. Extreme contrast. The background gives nothing — all attention is on the subject. This style works for any clip where the words are the only thing that matters.
Prompt language: "single motivated light source from above-left, 3200K tungsten warmth, subject is lit in isolation against black void, no ambient fill, deep shadow on right side of face, high contrast ratio."

Warm Conversation

Intimate, inviting, lived-in. Suggests a real conversation happening in a real space — a kitchen, a living room, a café booth. Light sources are practical: a window, a lamp, candles. Color temperature is warm (2700K–3200K). This is the lighting of trust.
Prompt language: "warm practical lamp at frame-right, 2700K, soft pool of light; secondary fill from large window implied off-frame left, cool daylight, low intensity; environment reads as domestic interior, bookshelves soft in background, overall low-contrast warmth."

Dramatic Single Source

One hard light. Everything else is darkness or very deep shadow. This is the lighting of revelation, of stakes, of honesty forced into the open. Use for clips where the speaker says something they'd normally not say publicly.
Prompt language: "single hard light source from above, 5500K, 45-degree angle to subject, no fill, deep shadows across lower face and neck, light falls off sharply, background completely unlit, high-drama chiaroscuro treatment."

Typography

When text appears in the video — quotes, speaker IDs, timestamps — it must be designed with the same care as the visual.

Quote Display Techniques

Full-frame typographic hold: the quote appears as the primary visual. Large, bold, white or cream text on dark background. The text itself is the frame. Used for the most powerful single sentence.
Lower-third persistent text: the quote runs as a lower-third during the speaker visual. Not subtitles — the quote is selected for impact, not transcription accuracy. Use a font weight heavier than body text. Maximum 10-12 words.
Word-by-word reveal: each word appears individually, timed to the rhythm of delivery. Prompt must specify animation behavior: "words appear with a hard cut at each syllable boundary, no fades, duration of each word matches speech pacing."
Floating quote with depth: the text appears to exist in three-dimensional space within the scene, as if the words are physically present. Prompt: "quote text appears to float in mid-air within the environment, slight parallax offset from background movement, letters cast soft shadow onto the space behind them."
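For the word-by-word reveal, per-word hold times can be estimated when no word-level timestamps exist. The sketch below spreads a known quote duration across words in proportion to letter count, which is only a crude proxy for speech pacing; real timings from a transcript are preferable. `reveal_schedule` is a hypothetical helper, not part of any tool named in this guide.

```python
def reveal_schedule(
    quote: str, start_s: float, total_s: float
) -> list[tuple[str, float, float]]:
    """Return (word, appear_at_s, hold_s), weighting each word by its letter count."""
    words = quote.split()
    if not words:
        raise ValueError("quote must contain at least one word")
    # Count letters only, so trailing punctuation does not lengthen a word's hold.
    weights = [max(1, sum(ch.isalpha() for ch in word)) for word in words]
    scale = total_s / sum(weights)
    schedule, cursor = [], start_s
    for word, weight in zip(words, weights):
        hold = weight * scale
        schedule.append((word, round(cursor, 2), round(hold, 2)))
        cursor += hold
    return schedule


schedule = reveal_schedule("I was finally standing still.", start_s=0.0, total_s=5.0)
```

Each tuple can then be written into the prompt as a hard-cut appearance at the given second, with longer words holding longer than short ones.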

Speaker Identification

Speaker name and show/platform title appear within the first 5 seconds and again at the end. Keep it subordinate to the quote unless brand-building is the primary goal.
Placement options:
  • Lower-left corner, smaller type, contrasting color to background
  • Upper-right corner with show logo treatment
  • End frame only — full-screen speaker name, episode title, call-to-action
Typography rule: the speaker ID font should be from the same family as the quote text, at a lower weight (regular vs. bold). Never use two competing display fonts.

Running Captions

For podcast clips on TikTok/Reels, burned-in subtitles are expected:
  • Position: center-bottom, above platform UI (20% from bottom edge)
  • Style: bold sans-serif, white with black outline or dark background bar
  • Timing: word-by-word or phrase-by-phrase sync to audio
  • Prompt phrasing: "Burned-in subtitle text appears word-by-word synced to speech rhythm. Bold white sans-serif, center-bottom positioning above platform UI. Dark semi-transparent bar behind text for legibility."
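If captions are added or checked in post-production, the "20% from bottom edge" position converts to a pixel row as follows. This is a small arithmetic sketch; the 1080x1920 frame size is an assumption for a typical 9:16 export, and `caption_bottom_y` is a hypothetical helper name.

```python
def caption_bottom_y(frame_height: int, offset_from_bottom: float = 0.20) -> int:
    """Pixel row, measured from the top, where the caption block's bottom edge sits."""
    if not 0 <= offset_from_bottom < 1:
        raise ValueError("offset_from_bottom must be in [0, 1)")
    return round(frame_height * (1 - offset_from_bottom))


# Assumed 9:16 export of 1080x1920 pixels.
y = caption_bottom_y(1920)
```

On a 1920-pixel-tall frame this places the caption's bottom edge at row 1536, leaving the lowest 384 pixels clear for platform UI.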

Timestamp Markers

For longer clips with multiple audio files, timestamp markers orient the viewer within the episode. These are subtle — a small badge in the upper-right, episode time in small sans-serif. They add credibility (this is a real episode worth finding) without competing with the primary visual.

Complete Example Prompts

Example 1: The Conviction Statement (Abstract Void, 9:16)

Context: A founder describing the moment she decided to quit her corporate job. Single audio clip, 45 seconds. Quote: "I wasn't jumping. I was finally standing still."

Seedance 2.0 video prompt, 9:16 vertical, 45 seconds:
Opening frame: pure black void, single tungsten-warm spotlight descends from directly above, illuminating only the empty center of frame. Dust particles drift slowly through the beam. Camera is completely static. Hold 2 seconds.
At second 2: bold white condensed sans-serif text appears with a hard cut — "I wasn't jumping." — centered at mid-frame, massive scale, filling 60% of horizontal width. Text holds for 1.8 seconds with absolute stillness. No animation, no fade. The words have weight.
Hard cut to black. 0.5 second black.
Text reappears: "I was finally standing still." Same font, same scale, same position. Camera begins an extremely slow push-in toward the text, barely perceptible over 3 seconds. Text dissolves.
At second 8: transition to extreme close-up on a woman's hands resting palms-down on a dark wooden desk, ring light catchlight visible on skin, warm 3200K key light from camera-left. Hands are still, completely still. Camera holds.
Slow pull-back over 12 seconds reveals the desk, a single sheet of paper face-down on the surface, pen beside it. The environment implies a home office — dark bookshelf edge visible, one small practical lamp providing ambient fill. The scene is intimate and decisive.
At second 28: the hands turn the paper over. Slow motion, 50% speed. We do not see what is written. Camera slowly rises from hands to the window behind the desk, soft late-afternoon light, city skyline slightly out of focus.
Final 8 seconds: frame holds on the window with golden-hour light. Speaker name appears lower-left in clean white regular-weight sans-serif: "— [Speaker Name]." below it in smaller type: "@showname | Episode 47." Fade to black.
Color grade: high contrast, warm shadows, teal-pulled highlights. Cinematic aspect, no vignette.
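The time anchors in a prompt like this one can be sanity-checked before submission. The sketch below resolves each beat to a start and end time and verifies that nothing overlaps or runs past the stated length. `check_timeline` is a hypothetical helper; the beat list restates the anchors and durations from Example 1, with two assumptions flagged in comments.

```python
def check_timeline(beats, total_s, tol=1e-9):
    """Resolve beats to (label, start_s, end_s) and verify order and total length.

    Each beat is (label, anchor_s or None, duration_s or None). A beat without
    an anchor starts where the previous one ended; a beat without a duration
    is treated as an instant.
    """
    resolved, cursor = [], 0.0
    for label, anchor, duration in beats:
        start = cursor if anchor is None else float(anchor)
        if start + tol < cursor:
            raise ValueError(f"'{label}' starts at {start}s, before the previous beat ends at {cursor}s")
        end = start + (duration or 0.0)
        resolved.append((label, start, end))
        cursor = end
    if cursor > total_s + tol:
        raise ValueError(f"timeline ends at {cursor}s, past the {total_s}s total")
    return resolved


EXAMPLE_1_BEATS = [
    ("spotlight hold", 0, 2),
    ("quote line 1", 2, 1.8),
    ("black", None, 0.5),
    ("quote line 2", None, 3),
    ("hands close-up", 8, None),
    ("pull-back", None, 12),              # assumption: starts as soon as the close-up is established
    ("paper turn", 28, None),
    ("window hold and credits", 37, 8),   # "final 8 seconds" of a 45-second piece
]
resolved = check_timeline(EXAMPLE_1_BEATS, total_s=45)
```

Under those assumptions the example resolves cleanly: the second quote line ends before the second-8 close-up, and the credits end exactly at 45 seconds.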

Example 2: The Expert Take (Split-Screen Interview Reconstruction, 16:9)

Context: Two-person podcast. Host and guest discussing why most startup pivots fail. Two audio clips — host question and guest response. Combined 75 seconds.

Seedance 2.0 video prompt, 16:9 widescreen, 75 seconds:
Opening: split-screen, hard vertical line at 50% horizontal. Left panel: host environment — modern home office, cool-toned ambient light, bookshelves visible in background out of focus, camera at eye level, static hold, subject framing leaves slight negative space on their right for visual breathing room. Right panel: guest environment — warmer, slightly softer ambience, practical desk lamp at frame-right, 2800K warmth, camera position slightly lower than eye-level suggesting the guest is more relaxed.
Both panels are simultaneously visible. The subjects are constructed, not real footage — described as follows:
Left panel subject: man, early 40s, lean posture, dark shirt, slightly leaning forward toward camera, expression attentive, one hand resting on desk, neutral background with soft depth separation. Light: key light from camera-left 4500K, minimal fill, clean catchlight.
Right panel subject: woman, mid-30s, natural posture, light-colored top, relaxed but engaged expression, hands occasionally gesture just below frame. Light: warm key from camera-right 3000K, softer shadow wrap, environment feels more inhabited.
At second 8: left panel pushes to 60% of screen width as host speaks. Right panel reduces to 40%, subject slightly defocused. No title card yet.
At second 22: guest begins speaking. Panels return to 50/50. Left panel camera begins barely perceptible slow zoom-in on host's listening expression. Right panel camera holds static on guest.
At second 35: key quote from guest appears as lower-third overlay on the right panel, white bold condensed sans-serif on semi-transparent dark strip: "Most pivots fail because the founders pivot away from pain instead of toward insight." Text persists for 6 seconds, then dissolves.
Final 15 seconds: both panels hold in conversation mode, cameras hold static. Full-screen end frame fades in from black: episode title centered in clean display type, both names and handles below in smaller weight, podcast logo upper-right corner. Dark background, minimal, professional.
Color grade: both panels intentionally slightly different — left cooler, right warmer — reinforcing their distinct perspectives. No filter effect, cinematic naturalism.

Example 3: The Vulnerable Moment (Cinematic B-Roll Narrative, 9:16)

Context: Solo podcast host reflecting on burnout. Single audio clip, 60 seconds. Emotional, introspective delivery. Key line: "I kept shipping things I didn't believe in anymore."

Seedance 2.0 video prompt, 9:16 vertical, 60 seconds:
Opening 3 seconds: extreme close-up on a laptop keyboard, single finger resting on the Enter key, not pressing. Shallow depth of field, bokeh'd background implies a dim room. Camera holds static. The finger does not move. No other motion in frame.
Cut at second 3: interior of a home office at night. Blue-grey ambient light from a monitor out of frame to the left. Desk covered in post-it notes, cups, cables: the organized chaos of overwork. Camera positioned low, looking up at the desk from a below-counter angle. Slow dolly-left movement, 5-second duration, revealing more of the desk's surface and a wall of notes behind. One note reads "SHIP IT" in handwriting that is visible but slightly out of focus.
Cut at second 14: back to the keyboard close-up, now slightly wider — we see both hands, neither moving. Camera begins an extremely slow push-in toward the hands over 8 seconds.
Cut at second 22: new environment. A coffee cup, half-full, cold. Steam: none. Morning light implied from window light on the cup's surface — but the light is pale, low-energy overcast. Camera static. Cup is centered in frame. 4 second hold.
Cut at second 26: corridor of a generic office building, late at night, fluorescent lights, empty. Camera at end of corridor looking down its length. Camera does not move for 5 seconds, then begins an extremely slow push forward down the corridor.
At second 31: kinetic text overlaid on the corridor shot: words appear one at a time, white heavy-weight sans-serif, centered: "things" — "I" — "didn't" — "believe" — "in" — "anymore." Each word appears at the rhythm of the imagined speech delivery (slow, deliberate). Final word "anymore." holds for 2 seconds, then the entire corridor shot and text fades to a held black.
From second 45: return to laptop close-up, tightest framing yet — just the trackpad, one thumb resting on it. Stillness. Over 8 seconds, a single small light — implied notification — pulses twice on the screen reflection in the trackpad surface. The thumb does not move.
Final 7 seconds: fade to deep navy. Speaker handle appears in small white regular-weight type, center-frame. Below it: episode name. Below that: "full episode linked in bio." All text fades in slowly, no animation.
Color grade: desaturated, cool blue-green shadows, low contrast in highlights. The visual language of exhaustion. No warmth until there is something to be warm about — and in this clip, there isn't.

Prompt Rules for Podcast Visualization

  1. Every visual element must serve the emotional content of the audio — no decorative shots without emotional purpose.
  2. Never use the word "waveform" in a prompt. A waveform is a technical representation of audio data. A Seedance 2.0 podcast visual is a cinematic translation of meaning. Use language like "acoustic particle field," "resonant light behavior," "sound-reactive atmospheric texture" — or, better, describe what the emotion looks like rather than what the sound looks like.
  3. Specify timing in seconds. Seedance 2.0 prompts that include "at second X" guidance produce more coherent visual sequences than prompts that describe only static states. Map every visual event to a time anchor.
  4. One dominant visual per scene. Avoid prompts that describe three competing visual elements in a single shot. The viewer can hold one thing at a time. Complexity comes from sequencing, not from stuffing each frame.
  5. Write the pause. Every description of motion must be balanced by a description of stillness. "Camera holds completely static for 4 seconds" is as important as any camera movement. Stillness communicates confidence.
  6. Match color temperature to emotional register. This is not a suggestion; it is a requirement. Cool temperatures (5000K+) for analytical, distanced, or revelatory content. Warm temperatures (3200K and under) for intimate, vulnerable, nostalgic content. Mixed temperatures for conflict or unresolved tension.
  7. Typography is visual, not verbal. When including quote text in prompts, describe its physical appearance (scale, weight, position, animation behavior) as precisely as you describe a person's expression. The font is a costume; the animation is a performance.
  8. Audio files inform the prompt; they do not dictate it. The visual narrative should be coherent on mute. If the video requires the audio to make sense, the prompt has not done its job. Test by asking: "would a viewer who cannot hear this clip understand its emotional content within 5 seconds?"
  9. End on identity. The last 5-10 seconds of every podcast visual should establish the creator's identity (name, handle, show, episode). This is not vanity — it is the conversion moment. The viewer has just felt something; tell them where to find more.
  10. Multi-clip sequencing. When using 2 or 3 audio files, each clip gets its own visual arc (opening hook, development, resolution), but all arcs share a unified color grade and typographic system. Consistency across clips signals craft; variation within clips signals dynamism.
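Rule 6 is concrete enough to check in code. The sketch below maps the emotional registers named in the rule to a temperature band and tests whether a set of light sources fits. All names are hypothetical. Two assumptions are made: 3200K itself counts as warm, to agree with the lighting presets earlier in this guide, and the unnamed range between warm and cool is labelled "neutral".

```python
WARM_MAX_K = 3200   # assumption: 3200K itself counts as warm
COOL_MIN_K = 5000   # rule 6: cool is 5000K and above

REGISTER_TO_BAND = {
    "analytical": "cool", "distanced": "cool", "revelatory": "cool",
    "intimate": "warm", "vulnerable": "warm", "nostalgic": "warm",
    "conflict": "mixed", "unresolved tension": "mixed",
}


def temperature_band(kelvin: int) -> str:
    """Classify a single light source by color temperature."""
    if kelvin <= WARM_MAX_K:
        return "warm"
    if kelvin >= COOL_MIN_K:
        return "cool"
    return "neutral"  # the rules do not name this band; the label is an assumption


def lighting_matches(register: str, kelvins: list[int]) -> bool:
    """True if the light sources fit the band that rule 6 assigns to the register."""
    expected = REGISTER_TO_BAND[register]
    bands = {temperature_band(k) for k in kelvins}
    if expected == "mixed":
        return {"warm", "cool"} <= bands  # needs at least one of each
    return bands == {expected}
```

A single 2700K lamp fits an intimate clip but not an analytical one, and a conflict clip needs both a warm and a cool source to pass.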