Talking Head Production

Create talking head videos with AI avatars and lipsync using the inference.sh CLI.

Quick Start

```bash
curl -fsSL https://cli.inference.sh | sh && infsh login

# Generate dialogue audio
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Welcome to our product tour. Today I will show you three features that will save you hours every week."
}'

# Create talking head video with OmniHuman
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "path/to/portrait.png",
  "audio": "path/to/dialogue.mp3"
}'
```

Portrait Requirements

The source portrait is critical: a poor portrait produces poor video output.

Must Have

| Requirement | Why | Spec |
|---|---|---|
| Center-framed | Avatar needs face in predictable position | Face centered in frame |
| Head and shoulders | Body visible for natural gestures | Crop below chest |
| Eyes to camera | Creates connection with viewer | Direct frontal gaze |
| Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning |
| Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions |
| High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |
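
The resolution spec can be turned into a quick pre-flight check. The sketch below is only an illustration: `check_face_region` is a hypothetical helper, it assumes you already know the pixel size of the face region (measuring it is left to whatever image tool you use), and the 512 and 1024 thresholds are taken from the "High resolution" requirement.

```bash
# Hypothetical helper: classify a face-region size (in pixels) against the
# "High resolution" requirement: minimum 512x512, ideally 1024x1024 or more.
check_face_region() {
  face_w=$1
  face_h=$2
  # Judge by the shorter side, so a 2000x400 region still counts as too small.
  shortest=$(( face_w < face_h ? face_w : face_h ))
  if [ "$shortest" -lt 512 ]; then
    echo "too-small"
  elif [ "$shortest" -lt 1024 ]; then
    echo "acceptable"
  else
    echo "ideal"
  fi
}

check_face_region 800 900   # prints "acceptable"
```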

Background

| Type | When to Use |
|---|---|
| Solid color | Professional, clean, easy to composite |
| Soft bokeh | Natural, lifestyle feel |
| Office/studio | Business context |
| Transparent (via bg removal) | Compositing into other scenes |

```bash
# Generate a professional portrait background
infsh app run falai/flux-dev-lora --input '{
  "prompt": "professional headshot photograph of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, high quality portrait photography"
}'

# Or remove background from existing portrait
infsh app run <bg-removal-app> --input '{
  "image": "path/to/portrait-with-background.png"
}'
```

Audio Quality

Audio quality directly affects lipsync accuracy: clean audio yields accurate lip movement.

Requirements

| Parameter | Target | Why |
|---|---|---|
| Background noise | None/minimal | Noise confuses lipsync timing |
| Volume | Consistent throughout | Prevents sync drift |
| Sample rate | 44.1 kHz or 48 kHz | Standard quality |
| Format | MP3 (128 kbps+) or WAV | Compatible with all tools |
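
If you bring your own recording, the sample-rate target is easy to check before uploading. This is a minimal sketch, not part of the inference.sh tooling: `is_supported_sample_rate` is a hypothetical helper, and it assumes you obtain the rate in Hz from a probing tool such as `ffprobe` (any tool that reports the rate works).

```bash
# Hypothetical helper: accept only the sample rates listed in the table
# (44.1 kHz or 48 kHz), given as an integer number of Hz.
is_supported_sample_rate() {
  case "$1" in
    44100|48000) return 0 ;;
    *)           return 1 ;;
  esac
}

# Assumption: with ffprobe installed, the rate could be read like this:
#   rate=$(ffprobe -v error -select_streams a:0 \
#     -show_entries stream=sample_rate \
#     -of default=noprint_wrappers=1:nokey=1 narration.mp3)
rate=22050   # example value standing in for a probed file

if is_supported_sample_rate "$rate"; then
  echo "ok: ${rate} Hz"
else
  echo "resample needed: ${rate} Hz"
fi
```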

Generating Audio

```bash
# Simple narration
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Hi there! I am excited to share something with you today. We have been working on a feature that our users have been requesting for months... and it is finally here."
}'

# With emotion and pacing
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] You know what is frustrating? Spending hours on tasks that should take minutes. (sighs) We have all been there. But what if I told you... there is a better way?"
}'
```

Model Selection

| Model | App ID | Best For | Max Duration |
|---|---|---|---|
| OmniHuman 1.5 | `bytedance/omnihuman-1-5` | Multi-character, gestures, high quality | ~30s per clip |
| OmniHuman 1.0 | `bytedance/omnihuman-1-0` | Single character, simpler | ~30s per clip |
| PixVerse Lipsync | `falai/pixverse-lipsync` | Quick lipsync on existing video | Short clips |
| Fabric | `falai/fabric-1-0` | Cloth/fabric animation on portraits | Short clips |

Production Workflows

Basic: Portrait + Audio -> Video

```bash
# 1. Generate or prepare audio
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Generate talking head
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "portrait.png",
  "audio": "narration.mp3"
}'
```
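
When the portrait and audio paths come from shell variables, it can help to build the `--input` JSON in one place. The following is a rough sketch under stated assumptions: `omnihuman_input` is a hypothetical helper, it uses only the `image` and `audio` fields shown in this workflow, and it does no escaping, so it assumes the paths contain no double quotes or backslashes.

```bash
# Hypothetical helper: build the --input JSON for the talking head step
# from two file paths. No JSON escaping is performed.
omnihuman_input() {
  printf '{"image": "%s", "audio": "%s"}\n' "$1" "$2"
}

input=$(omnihuman_input "portrait.png" "narration.mp3")
echo "$input"

# The result could then be passed to the same command used in step 2:
#   infsh app run bytedance/omnihuman-1-5 --input "$input"
```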

With Captions

```bash
# 1-2. Same as above

# 3. Add captions to the talking head video
infsh app run infsh/caption-videos --input '{
  "video": "talking-head.mp4",
  "caption_file": "captions.srt"
}'
```
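
The caption step expects a `captions.srt` file. As an illustration of what that file might contain, the sketch below writes two cues in the SubRip layout (cue index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line). The helpers `srt_time` and `srt_cue` are hypothetical, and the timings are made up: real timings have to match the generated audio.

```bash
# Convert a duration in milliseconds to an SRT timestamp (HH:MM:SS,mmm).
srt_time() {
  ms_total=$1
  printf '%02d:%02d:%02d,%03d' \
    $(( ms_total / 3600000 )) \
    $(( ms_total % 3600000 / 60000 )) \
    $(( ms_total % 60000 / 1000 )) \
    $(( ms_total % 1000 ))
}

# Print one SRT cue: index, time range, text, then a blank separator line.
# Arguments: index, start in ms, end in ms, caption text.
srt_cue() {
  printf '%d\n%s --> %s\n%s\n\n' "$1" "$(srt_time "$2")" "$(srt_time "$3")" "$4"
}

# Hypothetical timings for the first two sentences of a narration.
{
  srt_cue 1 0    2500 "Welcome to our product tour."
  srt_cue 2 2500 6000 "Today I will show you three features."
} > captions.srt
```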

Long-Form (Stitched Clips)

For content longer than 30 seconds, split the script into segments, generate a clip for each segment, and merge the results:

```bash
# Generate audio segments
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment one script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment two script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment three script."}' --no-wait

# Generate talking head for each segment (same portrait for consistency)
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment1.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment2.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment3.mp3"}' --no-wait

# Merge all segments
infsh app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
```
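
To plan how many segments a script needs, divide its expected spoken length by the per-clip limit and round up. A minimal sketch, assuming you have an estimate of the total duration in whole seconds; `segments_needed` is a hypothetical helper and the 30-second default mirrors the guidance in this section.

```bash
# Number of clips needed for a narration of a given length, using ceiling
# division. Arguments: total seconds, optional max seconds per clip (default 30).
segments_needed() {
  total_seconds=$1
  max_seconds=${2:-30}
  echo $(( (total_seconds + max_seconds - 1) / max_seconds ))
}

segments_needed 95   # prints 4
```

A 95-second narration therefore needs four clips: three of up to 30 seconds and one short remainder.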

Multi-Character Conversation

OmniHuman 1.5 supports up to two characters:

```bash
# 1. Generate dialogue with two speakers
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "two-person-portrait.png",
  "audio": "dialogue.mp3"
}'
```
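
The two-speaker prompt alternates `[S1]` and `[S2]` tags, which is easy to get wrong when typing a long script by hand. The sketch below assembles such a prompt from a list of lines; `build_dialogue` is a hypothetical helper that assumes strict alternation starting with speaker 1, and it performs no JSON escaping.

```bash
# Hypothetical helper: join dialogue lines into one prompt string, tagging
# them alternately with [S1] and [S2], starting with [S1].
build_dialogue() {
  out=""
  speaker=1
  for line in "$@"; do
    # Prepend the text built so far plus a space, but only once it is non-empty.
    out="${out:+$out }[S${speaker}] ${line}"
    speaker=$(( 3 - speaker ))   # toggles between 1 and 2
  done
  printf '%s\n' "$out"
}

build_dialogue "So tell me about the new feature." "Sure! We built a dashboard."
# prints: [S1] So tell me about the new feature. [S2] Sure! We built a dashboard.
```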

Framing Guidelines

┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│       Crop below chest          │
└─────────────────────────────────┘

Common Mistakes

| Mistake | Problem | Fix |
|---|---|---|
| Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region |
| Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal |
| Noisy audio | Lipsync drifts, looks unnatural | Record clean or use TTS |
| Too-long clips | Quality degrades after 30s | Split into segments, stitch |
| Sunglasses/obstruction | Face features hidden | Clear face required |
| Inconsistent lighting | Uncanny when animated | Even, soft lighting |
| No captions | Loses silent/mobile viewers | Always add captions |

Related Skills

```bash
npx skills add inference-sh/skills@ai-avatar-video
npx skills add inference-sh/skills@ai-video-generation
npx skills add inference-sh/skills@text-to-speech
```

Browse all apps: `infsh app list`