ai-podcast

AI Podcast Generator

Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.

Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.

Pipeline Overview

Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)
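
One way to track what each stage adds is a small record per turn. The sketch below is illustrative only: the `Turn` class, its field names, and `next_stage` are hypothetical and not part of inference.sh.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One speaker turn and the artifacts each pipeline stage adds to it."""
    speaker: str
    text: str
    frame_url: Optional[str] = None   # character image (Characters stage)
    audio_url: Optional[str] = None   # TTS output (TTS stage)
    clip_url: Optional[str] = None    # talking head clip (Avatar stage)


def next_stage(turn: Turn) -> str:
    """Return the first stage whose artifact is still missing for this turn."""
    if turn.frame_url is None:
        return "characters"
    if turn.audio_url is None:
        return "tts"
    if turn.clip_url is None:
        return "avatar"
    return "merge"
```

A turn is ready for the merge only once all three artifacts are present, which mirrors the left-to-right order of the pipeline.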

Process

Step 1: Character Creation

Choose the right tool per character type:

| Character Type | Tool | Notes |
| --- | --- | --- |
| Real human (new) | `pruna/p-image` | 16:9, `prompt_upsampling: true`. Quick, no training needed, but identity won't be consistent across multiple generations. |
| Real human (consistent ID) | `phota/generate` with `[[profile_id]]` | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | `google/gemini-3-pro-image-preview` | Pass logo + character sheet as reference images |
| Illustrated / stylized | `google/gemini-3-pro-image-preview` | Pass style reference as input image |

Training a Phota identity (optional but recommended for humans):

If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:

```bash
infsh app run phota/train --input '{
  "images": ["url1.jpg", "url2.jpg", ...],
  "wait": true
}' --save profile.json
```

Requires 30-50 face images of the subject
Training takes a few minutes with `wait: true`
Returns a `profile_id` you then use in `phota/generate` as `[[profile_id]]` in prompts
The profile is reusable forever — train once, generate unlimited shots
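
The training call can also be assembled programmatically. The sketch below only builds the argument list for the `infsh` command shown above and does not run it. `build_train_command` is a hypothetical helper name, and treating the 30-50 image guidance as a hard limit is an assumption.

```python
import json
from typing import List


def build_train_command(image_urls: List[str]) -> List[str]:
    """Build the argv for the phota/train call shown above (does not run it)."""
    # The guide asks for 30-50 face images; enforcing it strictly is an assumption.
    if not 30 <= len(image_urls) <= 50:
        raise ValueError(f"expected 30-50 face images, got {len(image_urls)}")
    payload = {"images": image_urls, "wait": True}
    return [
        "infsh", "app", "run", "phota/train",
        "--input", json.dumps(payload),
        "--save", "profile.json",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `infsh` CLI is installed.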

If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), `pruna/p-image` is simpler and cheaper.

Character sheets first, podcast frames second:

Generate a character sheet (plain white background, multiple angles) for each character
Then place characters into the podcast studio setting using the sheet as reference

For branded characters (logo on clothing):

Generate the character with a plain version of the garment
Use `phota/edit` with the logo as a second reference image to add the logo
Always pass the logo image alongside character references when generating new angles

Step 2: Alternate Angles

Generate at least 2 angles per character for visual variety:

| Angle | When to use |
| --- | --- |
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |

For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).

Identity consistency rules:

For real humans with a Phota profile: use `phota/generate` or `phota/edit` for new angles; Gemini does not preserve facial identity and will produce a different person
For real humans without a Phota profile: try to generate all needed angles in one go with `pruna/p-image`, or consider training a Phota profile if you need many shots
For mascots/illustrations: Gemini 3 Pro is fine, pass the established frame as reference
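
The identity rules above amount to a small decision function. This is a minimal sketch: the `character_type` labels (`"real_human"`, `"mascot"`, `"illustrated"`) and the function name are hypothetical. It returns `phota/generate` for profiled humans, although the rules allow `phota/edit` as well.

```python
def angle_tool(character_type: str, has_phota_profile: bool = False) -> str:
    """Pick the app for a new angle, following the identity rules above."""
    if character_type == "real_human":
        # Gemini does not preserve facial identity, so humans never go through it.
        return "phota/generate" if has_phota_profile else "pruna/p-image"
    if character_type in ("mascot", "illustrated"):
        return "google/gemini-3-pro-image-preview"
    raise ValueError(f"unknown character type: {character_type!r}")
```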

Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.

Step 3: QA Frames

Before proceeding, visually inspect all frames for:

Extra people in the background
Multiple microphones (should be single mic per shot)
Wrong or distorted logos
Inconsistent character identity across angles
Weird artifacts (extra limbs, merged objects)

Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.
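
The inspection itself is visual and manual, but the reviewer's verdicts can be recorded and used as a gate before any video is rendered. A minimal sketch, assuming hypothetical check labels that mirror the list above:

```python
from typing import Dict, List

# Checklist items taken from the list above; the keys are hypothetical labels.
QA_CHECKS = [
    "no_extra_people",
    "single_microphone",
    "logo_correct",
    "identity_consistent",
    "no_artifacts",
]


def frames_failing_qa(results: Dict[str, Dict[str, bool]]) -> Dict[str, List[str]]:
    """Map each frame to the checks it failed or never had recorded."""
    failures = {}
    for frame, checks in results.items():
        missed = [c for c in QA_CHECKS if not checks.get(c, False)]
        if missed:
            failures[frame] = missed
    return failures
```

An empty result means every frame passed every check. A check with no recorded verdict counts as a failure, so nothing is rendered by accident.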

Step 4: Write the Script

Rules for natural conversation:

Write it like a real conversation, NOT like people reading ad copy in turns
Include reactions ("wait, hold on", "that is wild"), interruptions, and follow-up questions
Vary turn length — short reactions (1 sentence) mixed with longer explanations (2-3 sentences)
The host should ask real questions, not set up obvious talking points
Keep total duration target in mind: ~2.5 words/second for natural speech at 1.05x rate

Duration guide:

| Target | Words |
| --- | --- |
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
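
The duration guide follows directly from the ~2.5 words/second figure. The sketch below reproduces the table values and estimates the duration of a draft script; the function names are hypothetical, and words are counted by splitting on whitespace.

```python
WORDS_PER_SECOND = 2.5  # natural speech at speaking_rate 1.05, per the guide


def word_budget(target_seconds: float) -> int:
    """Approximate word count for a target duration, rounding halves up."""
    return int(target_seconds * WORDS_PER_SECOND + 0.5)


def estimated_seconds(script: str) -> float:
    """Estimate spoken duration from a whitespace-delimited word count."""
    return len(script.split()) / WORDS_PER_SECOND
```

For example, `word_budget(15)` gives 38 (37.5 rounded up), matching the first row of the table.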

Step 5: Generate TTS Audio

Use `inworld/text-to-speech-2` for each turn.

```bash
infsh app run inworld/text-to-speech-2 --input '{
  "text": "...",
  "voice_id": "...",
  "speaking_rate": 1.05,
  "audio_encoding": "MP3"
}' --save output.json
```

Voice selection:

Generate samples with the same line across candidate voices BEFORE committing
Let the user listen and approve voices
Good podcast voices: Tyler, Nate, Lauren, Kelsey, Naomi, Anjali (EN_US)
Use `inworld/text-to-speech-2:voices` to list all available voices

Speaking rate:

Default to 1.05 for natural podcast pacing
Use 1.1 for short snappy reactions
NEVER go below 1.0 — sounds slow and disengaging
Keep rate consistent per character across all their turns

All TTS turns can run in parallel (cheap, fast ~2-8s each).
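
Because the TTS turns are independent, they can be dispatched from a thread pool. The sketch below builds one command per turn and runs them concurrently through a caller-supplied `run` function (for example a wrapper around `subprocess.run`). The helper names, the `tts_NN.json` output naming, and the pool size are assumptions. Rates are stored per speaker to keep each character consistent, and any rate below 1.0 is rejected.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

MIN_RATE = 1.0  # the guide says never to go below 1.0


def tts_command(text: str, voice_id: str, rate: float, save_path: str) -> List[str]:
    """Build the argv for one inworld/text-to-speech-2 call (does not run it)."""
    if rate < MIN_RATE:
        raise ValueError(f"speaking_rate {rate} is below {MIN_RATE}")
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speaking_rate": rate,
        "audio_encoding": "MP3",
    }
    return ["infsh", "app", "run", "inworld/text-to-speech-2",
            "--input", json.dumps(payload), "--save", save_path]


def run_tts_parallel(
    turns: List[Tuple[str, str]],            # (speaker, text) in script order
    voices: Dict[str, Tuple[str, float]],    # speaker -> (voice_id, rate)
    run: Callable[[List[str]], object],      # e.g. a wrapper around subprocess.run
    max_workers: int = 4,
) -> List[object]:
    """Run every turn's TTS call concurrently; results keep script order."""
    commands = [
        tts_command(text, *voices[speaker], save_path=f"tts_{i:02d}.json")
        for i, (speaker, text) in enumerate(turns, start=1)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

`pool.map` returns results in input order, so the output list lines up with the script even when calls finish out of order.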

Step 6: Generate Video Clips

Use `pruna/p-video-avatar` for each turn.

```bash
infsh app run pruna/p-video-avatar --input '{
  "image": "<character_frame_url>",
  "audio": "<tts_audio_url>",
  "resolution": "720p",
  "video_prompt": "..."
}' --save output.json
```

Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.

Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:

T1: Speaker A — front
T2: Speaker B — front
T3: Speaker C — front (or close-up)
T4: Speaker A — close-up
T5: Speaker B — close-up
T6: Speaker A — front
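
One simple rule that happens to reproduce the six-turn example is to alternate angles per speaker, starting each speaker on a front shot. The sketch below implements that rule and a strictly sequential runner. Both function names are hypothetical, and the per-speaker rule is one possible plan, not a method the guide prescribes.

```python
from typing import Callable, Dict, List, Tuple

ANGLES = ("front", "close-up")


def assign_angles(speakers: List[str]) -> List[Tuple[str, str]]:
    """Give each speaker 'front' on their first turn, then alternate per speaker."""
    seen: Dict[str, int] = {}
    plan = []
    for speaker in speakers:
        count = seen.get(speaker, 0)
        plan.append((speaker, ANGLES[count % len(ANGLES)]))
        seen[speaker] = count + 1
    return plan


def run_clips_sequentially(commands: List[List[str]],
                           run: Callable[[List[str]], object]) -> List[object]:
    """Run avatar jobs one at a time; each call finishes before the next starts."""
    return [run(cmd) for cmd in commands]
```

Passing a blocking `run` (such as a wrapper around `subprocess.run`) keeps only one `pruna/p-video-avatar` job active at a time, which avoids the GPU contention described above.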

Step 7: Merge

Use `infsh/media-merger` to stitch all clips into the final video.

Build the input JSON and save it as `merger_input.json`:

```json
{
  "media_files": [
    {"file": "<clip1_url>"},
    {"file": "<clip2_url>"},
    ...
  ],
  "fps": 24,
  "output_format": "mp4"
}
```

Then run the merger:

```bash
infsh app run infsh/media-merger --input merger_input.json --save final.json
```

Merger is free and takes 2-6 minutes depending on total duration.
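
The merger input can be generated from the ordered list of clip URLs rather than written by hand. A minimal sketch, assuming the clip URLs were collected in turn order; the helper names are hypothetical, while the field names and defaults come from the JSON above.

```python
import json
from typing import List


def build_merger_input(clip_urls: List[str], fps: int = 24,
                       output_format: str = "mp4") -> dict:
    """Build the media-merger payload with clips in playback order."""
    if not clip_urls:
        raise ValueError("at least one clip is required")
    return {
        "media_files": [{"file": url} for url in clip_urls],
        "fps": fps,
        "output_format": output_format,
    }


def write_merger_input(clip_urls: List[str], path: str = "merger_input.json") -> None:
    """Write the payload to the file passed to --input in the command above."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_merger_input(clip_urls), fh, indent=2)
```

The order of `clip_urls` determines the order of the final video, so it should match the turn order of the script.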

Rules

Gemini does not preserve human facial identity — generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained profile_id, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
NEVER run p-video-avatar clips in parallel — they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
NEVER set speaking_rate below 1.0 — it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that 0.9 rate "felt weird and disengaging."
ALWAYS QA frames before generating video — video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
ALWAYS use tight framing for individual speaker shots — wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
ALWAYS pass the logo as a reference image when generating branded characters — describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
ALWAYS get voice approval before full production — generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
Script should read like a conversation, not an ad — people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.

App Reference

| App | Purpose |
| --- | --- |
| `pruna/p-image` | Generate portraits from text |
| `phota/train` | Train identity profile from 30-50 face images |
| `phota/generate` | Generate images with trained identity via `[[profile_id]]` |
| `phota/edit` | Edit images preserving identity of known subjects |
| `google/gemini-3-pro-image-preview` | Image gen/edit, mascots, style transfer |
| `inworld/text-to-speech-2` | Text to speech, 100+ languages, voice steering |
| `pruna/p-video-avatar` | Portrait + audio → talking head video |
| `infsh/media-merger` | Concatenate video clips into one video |

Use `belt task cost <task-id>` to check the cost of any individual task.
