ai-podcast
AI Podcast Generator
Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.
Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.
Pipeline Overview
Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)
Process
Step 1: Character Creation
Choose the right tool per character type:
| Character Type | Tool | Notes |
|---|---|---|
| Real human (new) | `pruna/p-image` | 16:9 |
| Real human (consistent ID) | `phota/generate` with `[[profile_id]]` | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | Gemini 3 Pro | Pass logo + character sheet as reference images |
| Illustrated / stylized | Gemini 3 Pro | Pass style reference as input image |
Training a Phota identity (optional but recommended for humans):
If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:
```bash
infsh app run phota/train --input '{
  "images": ["url1.jpg", "url2.jpg", ...],
  "wait": true
}' --save profile.json
```
- Requires 30-50 face images of the subject
- Training takes a few minutes with `wait: true`
- Returns a `profile_id` you then use in `phota/generate` as `[[profile_id]]` in prompts
- The profile is reusable indefinitely: train once, generate unlimited shots
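As a rough sketch of preparing the training call, the helper below assembles the `--input` JSON from a list of image URLs and refuses to proceed outside the 30-50 image range stated above. The function name `phota_train_input` is hypothetical, and the sketch assumes the URLs contain no characters that need JSON escaping.

```bash
# Hypothetical helper: build the phota/train input JSON from image URLs.
# Assumes URLs contain no double quotes or backslashes (no JSON escaping is done).
phota_train_input() {
  local url sep=''
  if [ "$#" -lt 30 ] || [ "$#" -gt 50 ]; then
    echo "expected 30-50 face images, got $#" >&2
    return 1
  fi
  printf '{\n  "images": ['
  for url in "$@"; do
    printf '%s"%s"' "$sep" "$url"
    sep=', '
  done
  printf '],\n  "wait": true\n}\n'
}
```

The output can be passed straight to the training command, for example `infsh app run phota/train --input "$(phota_train_input "$@")" --save profile.json` from a script whose arguments are the image URLs.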
If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), `pruna/p-image` is simpler and cheaper.
Character sheets first, podcast frames second:
- Generate a character sheet (plain white background, multiple angles) for each character
- Then place characters into the podcast studio setting using the sheet as reference
For branded characters (logo on clothing):
- Generate the character with a plain version of the garment
- Use `phota/edit` with the logo as a second reference image to add the logo
- Always pass the logo image alongside character references when generating new angles
Step 2: Alternate Angles
Generate at least 2 angles per character for visual variety:
| Angle | When to use |
|---|---|
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |
For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).
Identity consistency rules:
- For real humans with a Phota profile: use `phota/generate` or `phota/edit` for new angles. Gemini does not preserve facial identity and will produce a different person.
- For real humans without a Phota profile: try to generate all needed angles in one go with `pruna/p-image`, or consider training a Phota profile if you need many shots
- For mascots/illustrations: Gemini 3 Pro is fine, pass the established frame as reference
Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.
Step 3: QA Frames
Before proceeding, visually inspect all frames for:
- Extra people in the background
- Multiple microphones (should be single mic per shot)
- Wrong or distorted logos
- Inconsistent character identity across angles
- Weird artifacts (extra limbs, merged objects)
Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.
Step 4: Write the Script
Rules for natural conversation:
- Write it like a real conversation, NOT like people reading ad copy in turns
- Include reactions ("wait, hold on", "that is wild"), interruptions, and follow-up questions
- Vary turn length — short reactions (1 sentence) mixed with longer explanations (2-3 sentences)
- The host should ask real questions, not set up obvious talking points
- Keep total duration target in mind: ~2.5 words/second for natural speech at 1.05x rate
Duration guide:
| Target | Words |
|---|---|
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
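The duration guide follows directly from the ~2.5 words/second figure. A small sketch of that arithmetic, using hypothetical helper names `word_budget` and `estimate_seconds`, with integer math rounded half up:

```bash
# Words to write for a target duration, at ~2.5 words/second.
# seconds * 5 / 2, rounded half up.
word_budget() {
  echo $(( ($1 * 5 + 1) / 2 ))
}

# Approximate spoken duration in whole seconds for a script file,
# under the same ~2.5 words/second assumption (words * 2 / 5, rounded half up).
estimate_seconds() {
  local words
  words=$(wc -w < "$1")
  echo $(( ($words * 4 + 5) / 10 ))
}

word_budget 30   # prints 75
```

Running `estimate_seconds` on the draft before generating audio gives a quick check that the script fits the target length. Actual clip lengths will vary with pauses and per-turn speaking rates.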
Step 5: Generate TTS Audio
Use `inworld/text-to-speech-2` for each turn.
```bash
infsh app run inworld/text-to-speech-2 --input '{
  "text": "...",
  "voice_id": "...",
  "speaking_rate": 1.05,
  "audio_encoding": "MP3"
}' --save output.json
```
Voice selection:
- Generate samples with the same line across candidate voices BEFORE committing
- Let the user listen and approve voices
- Good podcast voices: Tyler, Nate, Lauren, Kelsey, Naomi, Anjali (EN_US)
- Use `inworld/text-to-speech-2:voices` to list all available voices
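One way to produce the audition samples is a loop that renders the same line once per candidate voice. This is a sketch only: the function name `audition_voices` is hypothetical, and it assumes the voice names listed above can be passed directly as `voice_id`, which the source does not confirm.

```bash
# Render one sample line with each candidate voice so the user can compare.
# Usage: audition_voices "sample line" Tyler Nate Lauren ...
# Assumption: voice names are valid voice_id values.
# Assumption: the line has no double quotes or backslashes (no JSON escaping is done).
audition_voices() {
  local line="$1" voice
  shift
  for voice in "$@"; do
    infsh app run inworld/text-to-speech-2 --input "{
      \"text\": \"$line\",
      \"voice_id\": \"$voice\",
      \"speaking_rate\": 1.05,
      \"audio_encoding\": \"MP3\"
    }" --save "sample_${voice}.json"
  done
}
```

Each result lands in its own `sample_<voice>.json`, so the approved voice can be matched back to its name.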
Speaking rate:
- Default to 1.05 for natural podcast pacing
- Use 1.1 for short snappy reactions
- NEVER go below 1.0 — sounds slow and disengaging
- Keep rate consistent per character across all their turns
All TTS turns can run in parallel (cheap, fast ~2-8s each).
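Because TTS turns are independent, they can be launched as background jobs and collected with `wait`. The sketch below assumes a hypothetical tab-separated script file (turn ID, voice ID, rate, text) and a hypothetical function name `tts_all_turns`; it also skips any turn whose rate falls below 1.0, in line with the speaking-rate rule.

```bash
# Sketch: synthesize every turn concurrently, then block until all finish.
# Input is a hypothetical tab-separated file: turn_id, voice_id, rate, text.
# Assumes the text has no double quotes or backslashes (no JSON escaping is done).
tts_all_turns() {
  local turn voice rate text
  while IFS="$(printf '\t')" read -r turn voice rate text; do
    # Enforce the rule above: never synthesize below a 1.0 speaking rate.
    if ! awk -v r="$rate" 'BEGIN { exit !(r >= 1.0) }'; then
      echo "turn ${turn}: speaking_rate ${rate} is below 1.0, skipping" >&2
      continue
    fi
    infsh app run inworld/text-to-speech-2 --input "{
      \"text\": \"$text\",
      \"voice_id\": \"$voice\",
      \"speaking_rate\": $rate,
      \"audio_encoding\": \"MP3\"
    }" --save "tts_${turn}.json" < /dev/null &
  done < "$1"
  wait  # returns once every background job has exited
}
```

The `< /dev/null` redirect keeps a background job from consuming the loop's input file.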
Step 6: Generate Video Clips
Use `pruna/p-video-avatar` for each turn.
```bash
infsh app run pruna/p-video-avatar --input '{
  "image": "<character_frame_url>",
  "audio": "<tts_audio_url>",
  "resolution": "720p",
  "video_prompt": "..."
}' --save output.json
```
Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.
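A plain loop with no background jobs is enough to keep the renders sequential. The following sketch assumes a hypothetical tab-separated file (turn ID, frame URL, audio URL) and a hypothetical function name `render_clips_sequentially`; it stops at the first failed clip so a problem is noticed before the rest of the queue is spent.

```bash
# Render avatar clips strictly one after another (no '&'), stopping on the
# first failure so a bad clip is not silently skipped.
# Input is a hypothetical tab-separated file: turn_id, frame_url, audio_url.
# The video_prompt is left as the "..." placeholder from the command above.
render_clips_sequentially() {
  local turn image audio
  while IFS="$(printf '\t')" read -r turn image audio; do
    echo "rendering clip ${turn}" >&2
    infsh app run pruna/p-video-avatar --input "{
      \"image\": \"$image\",
      \"audio\": \"$audio\",
      \"resolution\": \"720p\",
      \"video_prompt\": \"...\"
    }" --save "clip_${turn}.json" < /dev/null || return 1
  done < "$1"
}
```

Each iteration blocks until its clip finishes, so only one render occupies the GPU at a time.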
Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:
T1: Speaker A - front
T2: Speaker B - front
T3: Speaker C - front (or close-up)
T4: Speaker A - close-up
T5: Speaker B - close-up
T6: Speaker A - front
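One simple rule that reproduces the six-turn plan is to alternate per speaker: a speaker's first turn is front, the next close-up, and so on. This is an interpretation of the example rather than a rule the source states, and the function name `assign_angles` is hypothetical.

```bash
# Assign an angle to each turn by alternating per speaker:
# first appearance front, second close-up, third front, ...
# Usage: assign_angles A B C A B A   (speakers in turn order)
assign_angles() {
  printf '%s\n' "$@" | awk '{
    n = seen[$0]++                                # earlier turns by this speaker
    angle = (n % 2 == 0) ? "front" : "close-up"
    printf "T%d: Speaker %s - %s\n", NR, $0, angle
  }'
}
```

Calling `assign_angles A B C A B A` yields the same front and close-up sequence as the plan above. Treat the output as a starting point and override individual turns where a reaction or punchy line calls for a close-up.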
Step 7: Merge
Use `infsh/media-merger` to stitch all clips into the final video.
Build the input JSON and save it as `merger_input.json`:
```json
{
  "media_files": [
    {"file": "<clip1_url>"},
    {"file": "<clip2_url>"},
    ...
  ],
  "fps": 24,
  "output_format": "mp4"
}
```
```bash
infsh app run infsh/media-merger --input merger_input.json --save final.json
```
Merger is free and takes 2-6 minutes depending on total duration.
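Writing the merger input by hand gets error-prone past a few clips. As a minimal sketch, the hypothetical helper `build_merger_input` below writes `merger_input.json` from clip URLs given in playback order. It assumes the URLs need no JSON escaping. How the clip URLs are read out of each saved result file depends on the output schema, which the source does not show.

```bash
# Write merger_input.json from clip URLs given in playback order.
# Assumes URLs contain no double quotes or backslashes (no JSON escaping is done).
build_merger_input() {
  local url first=1
  {
    printf '{\n  "media_files": [\n'
    for url in "$@"; do
      [ "$first" -eq 1 ] || printf ',\n'   # comma between entries, none after the last
      printf '    {"file": "%s"}' "$url"
      first=0
    done
    printf '\n  ],\n  "fps": 24,\n  "output_format": "mp4"\n}\n'
  } > merger_input.json
}
```

Argument order determines clip order in the final video, so pass the URLs in turn order (T1, T2, ...).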
Rules
- Gemini does not preserve human facial identity: generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained `profile_id`, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
- NEVER run `p-video-avatar` clips in parallel: they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
- NEVER set `speaking_rate` below 1.0: it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that a 0.9 rate "felt weird and disengaging."
- ALWAYS QA frames before generating video: video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
- ALWAYS use tight framing for individual speaker shots: wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
- ALWAYS pass the logo as a reference image when generating branded characters: describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
- ALWAYS get voice approval before full production: generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
- Script should read like a conversation, not an ad: people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.
App Reference
| App | Purpose |
|---|---|
| `pruna/p-image` | Generate portraits from text |
| `phota/train` | Train identity profile from 30-50 face images |
| `phota/generate` | Generate images with trained identity via `[[profile_id]]` |
| `phota/edit` | Edit images preserving identity of known subjects |
| Gemini 3 Pro | Image gen/edit, mascots, style transfer |
| `inworld/text-to-speech-2` | Text to speech, 100+ languages, voice steering |
| `pruna/p-video-avatar` | Portrait + audio → talking head video |
| `infsh/media-merger` | Concatenate video clips into one video |
Use `belt task cost <task-id>` to check the cost of any individual task.