Talking Head Production

Create talking head videos with AI avatars and lipsync using the inference.sh CLI.

Quick Start

curl -fsSL https://cli.inference.sh | sh && infsh login

Generate dialogue audio

infsh app run falai/dia-tts --input '{ "prompt": "[S1] Welcome to our product tour. Today I will show you three features that will save you hours every week." }'

Create talking head video with OmniHuman

infsh app run bytedance/omnihuman-1-5 --input '{ "image": "path/to/portrait.png", "audio": "path/to/dialogue.mp3" }'

Portrait Requirements

The source portrait is critical: a poor portrait produces poor video output.

Must Have

Requirement	Why	Spec
Center-framed	Avatar needs face in predictable position	Face centered in frame
Head and shoulders	Body visible for natural gestures	Crop below chest
Eyes to camera	Creates connection with viewer	Direct frontal gaze
Neutral expression	Starting point for animation	Slight smile OK, not laughing/frowning
Clear face	Model needs to detect features	No sunglasses, heavy shadows, or obstructions
High resolution	Detail preservation	Min 512x512 face region, ideally 1024x1024+
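
As a rough pre-flight check, the resolution row above can be turned into a small shell helper. This is only a sketch under assumptions: `classify_portrait_size` is a hypothetical name, and it looks at the whole image rather than the face region, so passing it is necessary but not sufficient.

```shell
# Hypothetical helper: classify a portrait by its pixel dimensions.
# Thresholds follow the table above (512 minimum, 1024+ ideal), applied
# to the whole image as a coarse proxy for the face region.
classify_portrait_size() {
  width=$1
  height=$2
  # The shorter side bounds how large the face region can possibly be.
  if [ "$width" -lt "$height" ]; then short=$width; else short=$height; fi
  if [ "$short" -lt 512 ]; then
    echo "too-small"
  elif [ "$short" -lt 1024 ]; then
    echo "usable"
  else
    echo "ideal"
  fi
}

classify_portrait_size 800 1200   # prints: usable
```

If ImageMagick is installed, the dimensions could be supplied with `classify_portrait_size $(identify -format '%w %h' portrait.png)`.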

Background

Type	When to Use
Solid color	Professional, clean, easy to composite
Soft bokeh	Natural, lifestyle feel
Office/studio	Business context
Transparent (via bg removal)	Compositing into other scenes

Generate a professional portrait background

infsh app run falai/flux-dev-lora --input '{ "prompt": "professional headshot photograph of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, high quality portrait photography" }'

Or remove background from existing portrait

infsh app run <bg-removal-app> --input '{ "image": "path/to/portrait-with-background.png" }'

Audio Quality

Audio quality directly affects lipsync accuracy: clean audio produces accurate lip movement.

Requirements

Parameter	Target	Why
Background noise	None/minimal	Noise confuses lipsync timing
Volume	Consistent throughout	Prevents sync drift
Sample rate	44.1kHz or 48kHz	Standard quality
Format	MP3 128kbps+ or WAV	Compatible with all tools
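
The sample-rate and format rows above are easy to check before uploading. The helpers below are a minimal sketch with hypothetical names (`is_target_sample_rate`, `is_target_format`). They judge the format by file extension only and do not inspect MP3 bitrate or background noise.

```shell
# Hypothetical helper: accept only the sample rates (in Hz) listed above.
is_target_sample_rate() {
  case "$1" in
    44100|48000) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: accept only the listed formats, judged by extension.
is_target_format() {
  case "$1" in
    *.mp3|*.MP3|*.wav|*.WAV) return 0 ;;
    *) return 1 ;;
  esac
}

if is_target_sample_rate 48000 && is_target_format narration.wav; then
  echo "audio settings look fine"
fi
```

Assuming FFmpeg is installed, the actual sample rate of a file can be read with `ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of default=noprint_wrappers=1:nokey=1 narration.wav` and passed to the first helper.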

Generating Audio

Simple narration

infsh app run falai/dia-tts --input '{ "prompt": "[S1] Hi there! I am excited to share something with you today. We have been working on a feature that our users have been requesting for months... and it is finally here." }'

With emotion and pacing

infsh app run falai/dia-tts --input '{ "prompt": "[S1] You know what is frustrating? Spending hours on tasks that should take minutes. (sighs) We have all been there. But what if I told you... there is a better way?" }'

Model Selection

Model	App ID	Best For	Max Duration
OmniHuman 1.5	`bytedance/omnihuman-1-5`	Multi-character, gestures, high quality	~30s per clip
OmniHuman 1.0	`bytedance/omnihuman-1-0`	Single character, simpler	~30s per clip
PixVerse Lipsync	`falai/pixverse-lipsync`	Quick lipsync on existing video	Short clips
Fabric	`falai/fabric-1-0`	Cloth/fabric animation on portraits	Short clips

Production Workflows

Basic: Portrait + Audio -> Video

1. Generate or prepare audio

infsh app run falai/dia-tts --input '{ "prompt": "[S1] Your narration script here." }'

2. Generate talking head

infsh app run bytedance/omnihuman-1-5 --input '{ "image": "portrait.png", "audio": "narration.mp3" }'

With Captions

1-2. Same as above

3. Add captions to the talking head video

infsh app run infsh/caption-videos --input '{ "video": "talking-head.mp4", "caption_file": "captions.srt" }'
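
The caption step above takes a `captions.srt` file. If no caption file exists yet, one can be written by hand in the standard SubRip layout (cue number, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line). The sketch below uses hypothetical helper names and made-up cue timings. Real timings would have to come from the narration itself.

```shell
# Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm).
srt_time_ms() {
  ms=$1
  printf '%02d:%02d:%02d,%03d\n' \
    $((ms / 3600000)) $((ms / 60000 % 60)) $((ms / 1000 % 60)) $((ms % 1000))
}

# Print one SRT cue: index, time range, text, then a blank separator line.
# Arguments: index, start in ms, end in ms, caption text.
srt_cue() {
  printf '%s\n%s --> %s\n%s\n\n' \
    "$1" "$(srt_time_ms "$2")" "$(srt_time_ms "$3")" "$4"
}

# Illustrative cues only; the timings are placeholders.
srt_cue 1 0 2500 "Welcome to our product tour."
srt_cue 2 2500 6000 "Today I will show you three features."
```

Redirecting the output of the `srt_cue` calls to `captions.srt` produces a file that can be passed as `caption_file`.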

Long-Form (Stitched Clips)

For content longer than 30 seconds, split into segments:

Generate audio segments

infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment one script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment two script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment three script."}' --no-wait

Generate talking head for each segment (same portrait for consistency)

infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment1.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment2.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment3.mp3"}' --no-wait

Merge all segments

infsh app run infsh/media-merger --input '{ "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"] }'
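
For longer scripts, the segment count and the merger input can be derived rather than typed out. The following is a sketch with hypothetical helper names. It treats the roughly 30-second limit as a hard 30 seconds and assumes file names contain no double quotes or backslashes.

```shell
# Number of clips needed for a narration of the given length in seconds,
# assuming a 30 s per-clip limit (integer division, rounded up).
segments_needed() {
  echo $(( ($1 + 29) / 30 ))
}

# Build the media-merger input JSON from a list of file names.
# Assumes names contain no double quotes or backslashes.
merger_input() {
  list=""
  for f in "$@"; do
    if [ -n "$list" ]; then list="$list, "; fi
    list="$list\"$f\""
  done
  printf '{ "media": [%s] }\n' "$list"
}

segments_needed 75   # prints: 3
merger_input segment1.mp4 segment2.mp4 segment3.mp4
```

The generated JSON can then be passed straight to the merge step, for example `infsh app run infsh/media-merger --input "$(merger_input segment1.mp4 segment2.mp4 segment3.mp4)"`.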

Multi-Character Conversation

OmniHuman 1.5 supports up to 2 characters:

1. Generate dialogue with two speakers

infsh app run falai/dia-tts --input '{ "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch." }'

2. Create video with two characters

infsh app run bytedance/omnihuman-1-5 --input '{ "image": "two-person-portrait.png", "audio": "dialogue.mp3" }'
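
Writing the `[S1]`/`[S2]` tags by hand gets error-prone for longer exchanges. The helper below is a minimal sketch (the names `json_escape` and `dia_dialogue` are hypothetical) that assumes the two speakers strictly alternate and builds the `prompt` JSON used in step 1. It escapes double quotes and backslashes but not newlines or other control characters.

```shell
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Join alternating lines into one prompt, tagged [S1], [S2], [S1], ...
dia_dialogue() {
  prompt=""
  speaker=1
  for line in "$@"; do
    if [ -n "$prompt" ]; then prompt="${prompt} "; fi
    prompt="${prompt}[S${speaker}] ${line}"
    speaker=$((3 - speaker))   # toggles between 1 and 2
  done
  printf '{ "prompt": "%s" }\n' "$(json_escape "$prompt")"
}

dia_dialogue "So tell me about the new feature." "Sure! We built a dashboard."
```

The result can be passed to the TTS step with `infsh app run falai/dia-tts --input "$(dia_dialogue "First line." "Second line.")"`.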

Framing Guidelines

┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│       Crop below chest          │
└─────────────────────────────────┘

Common Mistakes

Mistake	Problem	Fix
Low-res portrait	Blurry face, poor lipsync	Use 1024x1024+ face region
Profile/side angle	Lipsync can't track mouth well	Use frontal or near-frontal
Noisy audio	Lipsync drifts, looks unnatural	Record clean or use TTS
Too-long clips	Quality degrades after 30s	Split into segments, stitch
Sunglasses/obstruction	Face features hidden	Clear face required
Inconsistent lighting	Uncanny when animated	Even, soft lighting
No captions	Loses silent/mobile viewers	Always add captions
