talking-head-production
Talking Head Production
Create talking head videos with AI avatars and lipsync via inference.sh CLI.
Quick Start
```bash
curl -fsSL https://cli.inference.sh | sh && infsh login

# Generate dialogue audio
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Welcome to our product tour. Today I will show you three features that will save you hours every week."
}'

# Create talking head video with OmniHuman
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "path/to/portrait.png",
  "audio": "path/to/dialogue.mp3"
}'
```

Portrait Requirements
The source portrait image is critical. Poor portraits = poor video output.
Must Have
| Requirement | Why | Spec |
|---|---|---|
| Center-framed | Avatar needs face in predictable position | Face centered in frame |
| Head and shoulders | Body visible for natural gestures | Crop below chest |
| Eyes to camera | Creates connection with viewer | Direct frontal gaze |
| Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning |
| Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions |
| High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |
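
The resolution row can be checked before spending a generation on a weak portrait. The sketch below is a hypothetical helper (`check_face_region` is not part of the inference.sh CLI). It takes the face region's width and height in pixels, however you measured them, and compares the smaller side against the 512 and 1024 thresholds from the table. Using the smaller side is an assumption, since the table only gives square sizes.

```bash
# Classify a face region against the "High resolution" row above.
# Usage: check_face_region WIDTH HEIGHT   (both in pixels)
check_face_region() {
  width=$1
  height=$2

  # Assumption: the smaller side decides, because the spec is given as a square.
  if [ "$width" -lt "$height" ]; then
    min_side=$width
  else
    min_side=$height
  fi

  if [ "$min_side" -ge 1024 ]; then
    echo "ideal"
  elif [ "$min_side" -ge 512 ]; then
    echo "acceptable"
  else
    echo "too-small"
  fi
}

check_face_region 1200 1100   # both sides at or above 1024
check_face_region 640 512     # meets the minimum only
check_face_region 400 900     # one side under 512
```

A "too-small" result suggests re-cropping from a larger source image or regenerating the portrait at a higher resolution.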
Background
| Type | When to Use |
|---|---|
| Solid color | Professional, clean, easy to composite |
| Soft bokeh | Natural, lifestyle feel |
| Office/studio | Business context |
| Transparent (via bg removal) | Compositing into other scenes |
```bash
# Generate a professional portrait background
infsh app run falai/flux-dev-lora --input '{
  "prompt": "professional headshot photograph of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, high quality portrait photography"
}'

# Or remove background from existing portrait
infsh app run <bg-removal-app> --input '{
  "image": "path/to/portrait-with-background.png"
}'
```

Audio Quality
Audio quality directly impacts lipsync accuracy. Clean audio = accurate lip movement.
Requirements
| Parameter | Target | Why |
|---|---|---|
| Background noise | None/minimal | Noise confuses lipsync timing |
| Volume | Consistent throughout | Prevents sync drift |
| Sample rate | 44.1kHz or 48kHz | Standard quality |
| Format | MP3 128kbps+ or WAV | Compatible with all tools |
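
The sample rate and format rows are easy to check mechanically once you know a file's properties. The following is a minimal sketch, assuming you already have the sample rate, container format, and (for MP3) bitrate from whatever audio tool you use. `audio_meets_spec` is a hypothetical name, and the check covers only those two rows, not noise or volume.

```bash
# Check audio properties against the sample rate and format rows above.
# Usage: audio_meets_spec SAMPLE_RATE_HZ FORMAT [BITRATE_KBPS]
# FORMAT is expected in lowercase ("wav" or "mp3").
audio_meets_spec() {
  rate_hz=$1
  format=$2
  bitrate_kbps=${3:-0}

  case "$rate_hz" in
    44100|48000) ;;
    *) echo "fail: sample rate $rate_hz Hz"; return 1 ;;
  esac

  case "$format" in
    wav)
      echo "ok"
      ;;
    mp3)
      if [ "$bitrate_kbps" -ge 128 ]; then
        echo "ok"
      else
        echo "fail: mp3 bitrate $bitrate_kbps kbps"
        return 1
      fi
      ;;
    *)
      echo "fail: format $format"
      return 1
      ;;
  esac
}

audio_meets_spec 48000 wav
audio_meets_spec 44100 mp3 96 || true   # below the 128 kbps floor
```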
Generating Audio
```bash
# Simple narration
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Hi there! I am excited to share something with you today. We have been working on a feature that our users have been requesting for months... and it is finally here."
}'

# With emotion and pacing
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] You know what is frustrating? Spending hours on tasks that should take minutes. (sighs) We have all been there. But what if I told you... there is a better way?"
}'
```
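
The prompts above avoid double quotes and apostrophes, which keeps the hand-written JSON valid. For scripts that do contain quotes, one option is to build the payload programmatically. The sketch below is an illustration only: `json_escape` and `tts_payload` are hypothetical helpers, they handle backslashes and double quotes but not newlines or control characters, and the only field used is the `prompt` key shown in the examples above.

```bash
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Hypothetical helper; newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the {"prompt": "..."} payload used by the dia-tts examples above.
tts_payload() {
  printf '{"prompt": "%s"}\n' "$(json_escape "$1")"
}

tts_payload '[S1] She said "hello" and smiled.'
```

The result can be passed in place of the literal JSON, for example `infsh app run falai/dia-tts --input "$(tts_payload "$script")"`, where `$script` is a variable you define.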
Model Selection
| Model | App ID | Best For | Max Duration |
|---|---|---|---|
| OmniHuman 1.5 | bytedance/omnihuman-1-5 | Multi-character, gestures, high quality | ~30s per clip |
| OmniHuman 1.0 | | Single character, simpler | ~30s per clip |
| PixVerse Lipsync | | Quick lipsync on existing video | Short clips |
| Fabric | | Cloth/fabric animation on portraits | Short clips |
Production Workflows
Basic: Portrait + Audio -> Video
```bash
# 1. Generate or prepare audio
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Generate talking head
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "portrait.png",
  "audio": "narration.mp3"
}'
```

With Captions
```bash
# 1-2. Same as above

# 3. Add captions to the talking head video
infsh app run infsh/caption-videos --input '{
  "video": "talking-head.mp4",
  "caption_file": "captions.srt"
}'
```
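
The caption step expects a `captions.srt` file. SRT is a plain-text format in which each cue has an index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the caption text, and a blank line. The sketch below shows one way to write cues by hand. `srt_time` and `srt_cue` are hypothetical helpers, and the timings are placeholders that you would replace with the real timing of your audio.

```bash
# Convert milliseconds to an SRT timestamp (HH:MM:SS,mmm).
srt_time() {
  ms=$1
  printf '%02d:%02d:%02d,%03d' \
    $((ms / 3600000)) $((ms / 60000 % 60)) $((ms / 1000 % 60)) $((ms % 1000))
}

# Print one cue. Usage: srt_cue INDEX START_MS END_MS "caption text"
srt_cue() {
  printf '%d\n%s --> %s\n%s\n\n' "$1" "$(srt_time "$2")" "$(srt_time "$3")" "$4"
}

# Placeholder timings for the first two sentences of the product tour script.
srt_cue 1 0 2500 "Welcome to our product tour."
srt_cue 2 2500 6000 "Today I will show you three features."
```

Redirecting the output to a file (`> captions.srt`) produces the input for the caption step above.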
Long-Form (Stitched Clips)
For content longer than 30 seconds, split into segments:
```bash
# Generate audio segments
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment one script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment two script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment three script."}' --no-wait

# Generate talking head for each segment (same portrait for consistency)
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment1.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment2.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment3.mp3"}' --no-wait

# Merge all segments
infsh app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
```
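
Deciding where to cut a long script is the manual part of this workflow. A rough approach is to estimate spoken length from word count and pack whole sentences into segments that stay under the 30 second limit. The sketch below assumes a speaking rate of 150 words per minute, which is a guess rather than a figure from any of the models, and `split_script` is a hypothetical helper. It reads one sentence per line and prints one `[S1]` segment per line.

```bash
# Assumed speaking rate; not a figure from the models above. Tune it by ear.
WORDS_PER_MINUTE=150
MAX_SECONDS=30

# Pack sentences (one per line on stdin) into segments whose estimated
# spoken length stays within MAX_SECONDS. Prints one segment per line.
# A single sentence longer than the budget is emitted on its own.
split_script() {
  max_words=$((WORDS_PER_MINUTE * MAX_SECONDS / 60))
  segment=""
  count=0
  while IFS= read -r sentence; do
    [ -n "$sentence" ] || continue
    words=$(( $(printf '%s\n' "$sentence" | wc -w) ))
    # Start a new segment when adding this sentence would exceed the budget.
    if [ "$count" -gt 0 ] && [ $((count + words)) -gt "$max_words" ]; then
      printf '[S1] %s\n' "$segment"
      segment=""
      count=0
    fi
    segment="${segment:+$segment }$sentence"
    count=$((count + words))
  done
  [ -z "$segment" ] || printf '[S1] %s\n' "$segment"
}

split_script <<'EOF'
Welcome to our product tour.
Today I will show you three features that will save you hours every week.
EOF
```

Each printed line can then be used as the `prompt` for one dia-tts call, as in the commands above. Because the estimate is approximate, it is worth checking the duration of the generated audio before producing video from it.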
Multi-Character Conversation
OmniHuman 1.5 supports up to 2 characters:
```bash
# 1. Generate dialogue with two speakers
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
infsh app run bytedance/omnihuman-1-5 --input '{
  "image": "two-person-portrait.png",
  "audio": "dialogue.mp3"
}'
```
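
Writing the `[S1]` and `[S2]` tags by hand gets error-prone as a dialogue grows. The sketch below is one way to generate the tagged prompt from a plain list of turns, assuming the speakers strictly alternate and the first line belongs to `[S1]`. `dialogue_prompt` is a hypothetical helper, not part of the CLI.

```bash
# Turn alternating lines of dialogue (stdin, one turn per line) into a single
# prompt with [S1]/[S2] tags, starting with [S1]. Blank lines are skipped.
dialogue_prompt() {
  speaker=1
  prompt=""
  while IFS= read -r turn; do
    [ -n "$turn" ] || continue
    prompt="${prompt:+$prompt }[S$speaker] $turn"
    speaker=$((3 - speaker))   # toggle between 1 and 2
  done
  printf '%s\n' "$prompt"
}

dialogue_prompt <<'EOF'
So tell me about the new feature.
Sure! We built a dashboard that shows real-time analytics.
That sounds great. How long did it take?
About two weeks from concept to launch.
EOF
```

With this input the function reproduces the prompt string used in step 1 above.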
Framing Guidelines
```
┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │      ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │     /|\                   │  │
│  │      |   Head & shoulders │  │
│  │     / \  visible          │  │
│  │                           │  │
│  └───────────────────────────┘  │
│        Crop below chest         │
└─────────────────────────────────┘
```
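
The top-third rule in the diagram reduces to simple arithmetic: the eye line should sit at roughly one third of the frame height, measured from the top. The sketch below is a hypothetical helper, `eye_line_offset`, that reports how many pixels the eyes are from that line, given the frame height and the eye position you measured in an image editor.

```bash
# Report how far the eyes sit from the top-third line of the frame.
# Usage: eye_line_offset FRAME_HEIGHT EYE_Y   (pixels, y measured from the top)
# Negative: eyes are above the line. Positive: eyes are below it.
eye_line_offset() {
  height=$1
  eye_y=$2
  target=$((height / 3))
  echo $((eye_y - target))
}

eye_line_offset 1536 512   # eyes exactly on the top-third line
eye_line_offset 1536 700   # eyes below the line, so crop or shift the frame
```

How much offset is acceptable is a judgment call; the source gives no tolerance.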
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region |
| Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal |
| Noisy audio | Lipsync drifts, looks unnatural | Record clean or use TTS |
| Too-long clips | Quality degrades after 30s | Split into segments, stitch |
| Sunglasses/obstruction | Face features hidden | Clear face required |
| Inconsistent lighting | Uncanny when animated | Even, soft lighting |
| No captions | Loses silent/mobile viewers | Always add captions |
Related Skills
```bash
npx skills add inference-sh/skills@ai-avatar-video
npx skills add inference-sh/skills@ai-video-generation
npx skills add inference-sh/skills@text-to-speech
```

Browse all apps: `infsh app list`