ai-podcast


AI Podcast Generator


Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.
Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.

Pipeline Overview


Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)

Process


Step 1: Character Creation


Choose the right tool per character type:
| Character Type | Tool | Notes |
| --- | --- | --- |
| Real human (new) | pruna/p-image | 16:9, prompt_upsampling: true. Quick, no training needed, but identity won't be consistent across multiple generations. |
| Real human (consistent ID) | phota/generate with [[profile_id]] | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | google/gemini-3-pro-image-preview | Pass logo + character sheet as reference images |
| Illustrated / stylized | google/gemini-3-pro-image-preview | Pass style reference as input image |
Training a Phota identity (optional but recommended for humans):
If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:
bash
infsh app run phota/train --input '{
  "images": ["url1.jpg", "url2.jpg", ...],
  "wait": true
}' --save profile.json
  • Requires 30-50 face images of the subject
  • Training takes a few minutes with wait: true
  • Returns a profile_id, which you then reference as [[profile_id]] in phota/generate prompts
  • The profile is reusable forever: train once, generate unlimited shots
If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), pruna/p-image is simpler and cheaper.
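The 30-50 image requirement can be checked before a training run is started. The sketch below is one possible way to do that: `build_train_input` and the one-URL-per-line file are hypothetical helpers (not part of infsh), and it assumes the URLs contain no double quotes or backslashes. It only assembles the same `images` / `wait` payload used in the training command above.

```bash
# Build the phota/train input from a file of image URLs (one per line).
# Hypothetical helper: refuses sets outside the 30-50 image range.
# Assumption: URLs contain no double quotes or backslashes.
build_train_input() {
  local url_file=$1 count=0 sep="" images="" url
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue            # skip blank lines
    images="$images$sep\"$url\""
    sep=", "
    count=$((count + 1))
  done < "$url_file"
  if [ "$count" -lt 30 ] || [ "$count" -gt 50 ]; then
    echo "need 30-50 face images, got $count" >&2
    return 1
  fi
  printf '{ "images": [%s], "wait": true }\n' "$images"
}
```

A possible use, given a `urls.txt` you have prepared: `infsh app run phota/train --input "$(build_train_input urls.txt)" --save profile.json`.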
Character sheets first, podcast frames second:
  1. Generate a character sheet (plain white background, multiple angles) for each character
  2. Then place characters into the podcast studio setting using the sheet as reference
For branded characters (logo on clothing):
  1. Generate the character with a plain version of the garment
  2. Use phota/edit with the logo as a second reference image to add the logo
  3. Always pass the logo image alongside character references when generating new angles

Step 2: Alternate Angles


Generate at least 2 angles per character for visual variety:
| Angle | When to use |
| --- | --- |
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |
For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).
Identity consistency rules:
  • For real humans with a Phota profile: use phota/generate or phota/edit for new angles. Gemini does not preserve facial identity and will produce a different person
  • For real humans without a Phota profile: try to generate all needed angles in one go with pruna/p-image, or consider training a Phota profile if you need many shots
  • For mascots/illustrations: Gemini 3 Pro is fine, pass the established frame as reference
Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.

Step 3: QA Frames


Before proceeding, visually inspect all frames for:
  • Extra people in the background
  • Multiple microphones (should be single mic per shot)
  • Wrong or distorted logos
  • Inconsistent character identity across angles
  • Weird artifacts (extra limbs, merged objects)
Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.

Step 4: Write the Script


Rules for natural conversation:
  • Write it like a real conversation, NOT like people reading ad copy in turns
  • Include reactions ("wait, hold on", "that is wild"), interruptions, and follow-up questions
  • Vary turn length — short reactions (1 sentence) mixed with longer explanations (2-3 sentences)
  • The host should ask real questions, not set up obvious talking points
  • Keep total duration target in mind: ~2.5 words/second for natural speech at 1.05x rate
Duration guide:
| Target | Words |
| --- | --- |
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
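The duration guide follows directly from the ~2.5 words/second figure. A small sketch of the arithmetic, using two illustrative helper names (`word_budget` and `script_seconds`) that are not part of any tool mentioned here:

```bash
# Words that fit in a target duration at ~2.5 words/second.
# Integer math: seconds * 25 / 10, rounded half up.
word_budget() {
  local seconds=$1
  echo $(( (seconds * 25 + 5) / 10 ))
}

# Rough spoken length, in seconds, of a plain-text script file.
# Integer math: words * 10 / 25, rounded to the nearest second.
script_seconds() {
  local words
  words=$(wc -w < "$1")
  echo $(( (words * 10 + 12) / 25 ))
}
```

For example, `word_budget 30` prints 75, and running `script_seconds` on a draft before generating audio gives a quick check against the target length.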

Step 5: Generate TTS Audio


Use inworld/text-to-speech-2 for each turn.
bash
infsh app run inworld/text-to-speech-2 --input '{
  "text": "...",
  "voice_id": "...",
  "speaking_rate": 1.05,
  "audio_encoding": "MP3"
}' --save output.json
Voice selection:
  • Generate samples with the same line across candidate voices BEFORE committing
  • Let the user listen and approve voices
  • Good podcast voices: Tyler, Nate, Lauren, Kelsey, Naomi, Anjali (EN_US)
  • Use inworld/text-to-speech-2:voices to list all available voices
Speaking rate:
  • Default to 1.05 for natural podcast pacing
  • Use 1.1 for short snappy reactions
  • NEVER go below 1.0 — sounds slow and disengaging
  • Keep rate consistent per character across all their turns
All TTS turns can run in parallel (cheap, fast ~2-8s each).
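Running the turns in parallel can be done with ordinary shell background jobs. The sketch below is one possible approach, not a prescribed one: the tab-separated script file (`<voice_id><TAB><text>` per line), the `tts_<n>.json` output names, and both helper functions are assumptions made for illustration. The payload fields are the ones shown in the command above.

```bash
# Escape backslashes and double quotes so dialogue can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Synthesize every turn concurrently, then wait for all jobs to finish.
# Assumed script format: one turn per line, "<voice_id><TAB><text>".
tts_all_turns() {
  local script_file=$1 turn=0 failed=0 voice text pid
  local pids=()
  while IFS=$'\t' read -r voice text || [ -n "$voice" ]; do
    [ -n "$voice" ] || continue          # skip blank lines
    turn=$((turn + 1))
    # stdin is redirected so the job cannot consume the script file
    infsh app run inworld/text-to-speech-2 --input "{
      \"text\": \"$(json_escape "$text")\",
      \"voice_id\": \"$voice\",
      \"speaking_rate\": 1.05,
      \"audio_encoding\": \"MP3\"
    }" --save "tts_$turn.json" < /dev/null &
    pids+=("$!")
  done < "$script_file"
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=1              # remember any failed turn
  done
  return "$failed"
}
```

Calling `tts_all_turns script.tsv` returns non-zero if any turn failed, so a failed turn can be re-run before moving on to video.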

Step 6: Generate Video Clips


Use pruna/p-video-avatar for each turn.
bash
infsh app run pruna/p-video-avatar --input '{
  "image": "<character_frame_url>",
  "audio": "<tts_audio_url>",
  "resolution": "720p",
  "video_prompt": "..."
}' --save output.json
Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.
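A plain foreground loop is enough to keep the clips sequential. The following is a minimal sketch under stated assumptions: the manifest file (one `<image_url> <audio_url>` pair per line), the `clip_<n>.json` output names, and the `render_clips` helper are all hypothetical, and the URLs and prompt are assumed to contain no double quotes or backslashes.

```bash
# Render avatar clips strictly one after another (no background jobs).
# Assumed manifest format: one turn per line, "<image_url> <audio_url>".
# Usage: render_clips manifest.txt "video prompt text"
render_clips() {
  local manifest=$1 video_prompt=$2 turn=0 image audio
  while read -r image audio || [ -n "$image" ]; do
    [ -n "$image" ] || continue          # skip blank lines
    turn=$((turn + 1))
    # Each call finishes before the next starts; stdin is redirected so
    # the command cannot consume the rest of the manifest.
    if ! infsh app run pruna/p-video-avatar --input "{
      \"image\": \"$image\",
      \"audio\": \"$audio\",
      \"resolution\": \"720p\",
      \"video_prompt\": \"$video_prompt\"
    }" --save "clip_$turn.json" < /dev/null; then
      echo "turn $turn failed; stopping" >&2
      return 1
    fi
  done < "$manifest"
}
```

Stopping at the first failure is a design choice of this sketch: it avoids spending further render time until the failing turn has been looked at.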
Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:
T1: Speaker A — front
T2: Speaker B — front
T3: Speaker C — front (or close-up)
T4: Speaker A — close-up
T5: Speaker B — close-up
T6: Speaker A — front
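One rule that reproduces the example plan is to alternate per speaker: a speaker's first turn is a front shot, their next a close-up, and so on. This is an interpretation of the example rather than a rule the pipeline enforces, and `assign_angles` is a hypothetical helper written for illustration.

```bash
# Print an angle plan for a list of speakers given in turn order.
# Each speaker alternates front / close-up, starting with front.
assign_angles() {
  local i j count
  for ((i = 1; i <= $#; i++)); do
    count=0
    # Count how many earlier turns belong to the same speaker.
    for ((j = 1; j < i; j++)); do
      if [ "${!j}" = "${!i}" ]; then
        count=$((count + 1))
      fi
    done
    if ((count % 2 == 0)); then
      echo "T$i: Speaker ${!i} - front"
    else
      echo "T$i: Speaker ${!i} - close-up"
    fi
  done
}
```

Running `assign_angles A B C A B A` yields the same front and close-up sequence as the six-turn example, with T3 shown as a front shot.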

Step 7: Merge


Use infsh/media-merger to stitch all clips into the final video.

Build the input JSON and save it as merger_input.json:
json
{
  "media_files": [
    {"file": "<clip1_url>"},
    {"file": "<clip2_url>"},
    ...
  ],
  "fps": 24,
  "output_format": "mp4"
}
Then run the merger:
bash
infsh app run infsh/media-merger --input merger_input.json --save final.json

Merger is free and takes 2-6 minutes depending on total duration.
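With many turns, the merger input is easier to generate than to type. A minimal sketch, assuming the clip URLs are already known, are passed in playback order, and contain no double quotes or backslashes; `build_merger_input` is a hypothetical helper name.

```bash
# Print the media-merger input JSON for clip URLs given in playback order.
# Assumption: URLs contain no double quotes or backslashes.
build_merger_input() {
  local sep="" url
  printf '{ "media_files": ['
  for url in "$@"; do
    printf '%s{"file": "%s"}' "$sep" "$url"
    sep=", "
  done
  printf '], "fps": 24, "output_format": "mp4" }\n'
}
```

For example, `build_merger_input "$clip1_url" "$clip2_url" > merger_input.json` writes a file that can be passed to the merger command with `--input merger_input.json`.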

Rules


  1. Gemini does not preserve human facial identity — generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained profile_id, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
  2. NEVER run p-video-avatar clips in parallel — they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
  3. NEVER set speaking_rate below 1.0 — it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that 0.9 rate "felt weird and disengaging."
  4. ALWAYS QA frames before generating video — video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
  5. ALWAYS use tight framing for individual speaker shots — wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
  6. ALWAYS pass the logo as a reference image when generating branded characters — describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
  7. ALWAYS get voice approval before full production — generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
  8. Script should read like a conversation, not an ad — people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.

App Reference


| App | Purpose |
| --- | --- |
| pruna/p-image | Generate portraits from text |
| phota/train | Train identity profile from 30-50 face images |
| phota/generate | Generate images with trained identity via [[profile_id]] |
| phota/edit | Edit images preserving identity of known subjects |
| google/gemini-3-pro-image-preview | Image gen/edit, mascots, style transfer |
| inworld/text-to-speech-2 | Text to speech, 100+ languages, voice steering |
| pruna/p-video-avatar | Portrait + audio → talking head video |
| infsh/media-merger | Concatenate video clips into one video |
Use belt task cost <task-id> to check the cost of any individual task.