minimax-multimodal-toolkit

MiniMax Multi-Modal Toolkit

Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.

Output Directory

All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory). Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.

Rules:

- Before running any script, ensure `minimax-output/` exists in the agent's working directory (create it if needed: `mkdir -p minimax-output`)
- Always use absolute paths, or paths relative to the agent's working directory: `--output minimax-output/video.mp4`
- Never `cd` into the skill directory to run scripts; run them from the agent's working directory using the full script path
- Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
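The output rules above can be wrapped in a small helper. This is a minimal sketch, not part of the toolkit: `out_path` is a hypothetical function name, and it assumes the shell's current directory is the agent's working directory.

```shell
# Hypothetical helper (not part of the toolkit): build an output path under
# minimax-output/ in the current working directory, creating the folder first.
out_path() {
  local dir="$PWD/minimax-output"
  mkdir -p "$dir" || return 1        # same effect as: mkdir -p minimax-output
  printf '%s/%s\n' "$dir" "$1"       # absolute path to pass to --output / -o
}
```

A call such as `bash scripts/tts/generate_voice.sh tts "Hello world" -o "$(out_path hello.mp3)"` would then always carry an explicit output argument.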

Prerequisites

bash

brew install ffmpeg jq              # macOS (or apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh

No Python or pip required: all scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.

API Host Configuration

MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:

Region	Platform URL	API Host Value
China Mainland（中国大陆）	https://platform.minimaxi.com	`https://api.minimaxi.com`
Global（全球）	https://platform.minimax.io	`https://api.minimax.io`

bash

# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"

# or Global
export MINIMAX_API_HOST="https://api.minimax.io"

**IMPORTANT — When API Host is missing:**
Before running any script, check if `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Instruct and help user to set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence

API Key Configuration

Set the `MINIMAX_API_KEY` environment variable before running any script:

bash

export MINIMAX_API_KEY="your-api-key-here"

The key starts with `sk-api-` or `sk-cp-` and can be obtained from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global).

**IMPORTANT: When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Instruct and help the user to set it via `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
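Both environment checks can be done in one pre-flight step before any script runs. The following is a sketch only: `check_minimax_env` is a hypothetical name, and the prefix test simply mirrors the `sk-api-` / `sk-cp-` prefixes stated above.

```shell
# Hypothetical pre-flight check (not part of the toolkit).
# Returns 0 when both variables look usable, 1 otherwise.
check_minimax_env() {
  local ok=0
  if [ -z "${MINIMAX_API_HOST:-}" ]; then
    echo "MINIMAX_API_HOST is not set (use https://api.minimaxi.com or https://api.minimax.io)" >&2
    ok=1
  fi
  case "${MINIMAX_API_KEY:-}" in
    sk-api-*|sk-cp-*) ;;                                   # documented key prefixes
    "") echo "MINIMAX_API_KEY is not set" >&2; ok=1 ;;
    *)  echo "MINIMAX_API_KEY does not start with sk-api- or sk-cp-" >&2; ok=1 ;;
  esac
  return "$ok"
}
```

If the check fails, the agent would ask the user for the missing value instead of running the script.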

Key Capabilities

Capability	Description	Entry point
TTS	Text-to-speech synthesis with multiple voices and emotions	`scripts/tts/generate_voice.sh`
Voice Cloning	Clone a voice from an audio sample (10s–5min)	`scripts/tts/generate_voice.sh clone`
Voice Design	Create a custom voice from a text description	`scripts/tts/generate_voice.sh design`
Music Generation	Generate songs with lyrics or instrumental tracks	`scripts/music/generate_music.sh`
Image Generation	Text-to-image, image-to-image with character reference	`scripts/image/generate_image.sh`
Video Generation	Text-to-video, image-to-video, subject reference, templates	`scripts/video/generate_video.sh`
Long Video	Multi-scene chained video with crossfade transitions	`scripts/video/generate_long_video.sh`
Media Tools	Audio/video format conversion, concatenation, trimming, extraction	`scripts/media_tools.sh`

TTS (Text-to-Speech)

Entry point: `scripts/tts/generate_voice.sh`

IMPORTANT: Single voice vs Multi-segment — Choose the right approach

User intent	Approach
Single voice / no multi-character need	`tts` command — generate the entire text in one call
Multiple characters / narrator + dialogue	`generate` command with segments.json

Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline; just pass the full text to `tts` in one call.

Only use multi-segment `generate` when:

- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds 10,000 characters (API limit per request); in this case, split into segments with the same voice
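The decision between `tts` and `generate` can be expressed as a tiny function. This is an illustrative sketch: `choose_tts_command` and its `yes`/`no` second argument are hypothetical, and the length check uses the shell's own string length, which may count bytes rather than characters for non-ASCII text depending on shell and locale.

```shell
# Hypothetical helper: pick the generate_voice.sh subcommand.
#   $1 = full text, $2 = "yes" if multiple voices/characters are needed
choose_tts_command() {
  local text="$1" multi_voice="${2:-no}"
  if [ "$multi_voice" = "yes" ] || [ "${#text}" -gt 10000 ]; then
    echo generate      # multi-segment pipeline with segments.json
  else
    echo tts           # default: whole text in one call
  fi
}

choose_tts_command "Hello world"   # prints: tts
```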

Single-voice generation (DEFAULT)

bash

bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3

Multi-segment generation (multi-voice / audiobook / podcast)

Complete workflow: follow ALL steps in order:

1. Write segments.json: split text into segments with voice assignments (see format and rules below)
2. Run the `generate` command: this reads segments.json, generates audio for EACH segment via the TTS API, then merges them into a single output file with crossfade

bash

# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json (this is the CRITICAL step)
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200

**Do NOT skip Step 2.** Writing segments.json alone does nothing; you MUST run the `generate` command to actually produce audio.

Voice management

bash

# List all available voices
bash scripts/tts/generate_voice.sh list-voices

# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice

# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator

Audio processing

bash

bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3

TTS Models

Model	Notes
speech-2.8-hd	Recommended, auto emotion matching
speech-2.8-turbo	Faster variant
speech-2.6-hd	Previous gen, manual emotion
speech-2.6-turbo	Previous gen, faster

segments.json Format

Default crossfade between segments: 200ms (`--crossfade 200`).

json

[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]

Leave `emotion` empty for speech-2.8 models (auto-matched from text).

IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.

Rule: Narration and dialogue are ALWAYS separate segments.

A sentence like `"Tom said: The weather is great today!"` must be split into two segments:

- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`

Example — Audiobook with narrator + 2 characters:

json

[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]

Key principles:

- The narrator uses a consistent, neutral narrator voice throughout
- Each character has a dedicated voice_id, maintained consistently across all their dialogue
- Split at dialogue boundaries: `"He said:"` is narrator, the quoted content is the character
- Do NOT merge narrator text and character speech into a single segment
- For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
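The narration/dialogue split can be illustrated with a naive string split. This is only a sketch of the idea: `split_dialogue` is a hypothetical helper, it handles just the simple `Speaker said: speech` pattern with an ASCII colon followed by a space, and real text (quotes, full-width colons, dialogue tags in mid-sentence) needs more careful handling.

```shell
# Hypothetical helper: split "Narration: speech" at the first ": ".
# Prints the narrator part on line 1 and the character part on line 2.
split_dialogue() {
  local line="$1"
  case "$line" in
    *": "*)
      printf '%s\n' "${line%%: *}:"   # narrator segment, colon kept
      printf '%s\n' "${line#*: }"     # character segment
      ;;
    *)
      printf '%s\n' "$line"           # no dialogue marker: one narrator segment
      ;;
  esac
}
```

Each printed line would then become its own entry in segments.json, with the narrator voice_id on the first and the character's voice_id on the second.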

Music Generation

Entry point: `scripts/music/generate_music.sh`

IMPORTANT: Instrumental vs Lyrics — When to use which

Scenario	Mode	Action
BGM for video / voice / podcast	Instrumental (default)	Use `--instrumental` directly, do NOT ask user
User explicitly asks to "create music" / "make a song"	Ask user first	Ask whether they want instrumental or with lyrics

When adding background music to video or voice content, always default to instrumental mode (`--instrumental`). Do not ask the user: BGM should never have vocals competing with the main content.

When the user explicitly asks to create/generate music as the primary task, ask them whether they want:

- Instrumental (pure music, no vocals)
- With lyrics (song with vocals; the user provides the lyrics or you help write them)

bash

# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download

# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download

# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download

Music Model

Default model: `music-2.5`

`music-2.5` does not support `--instrumental` directly. When instrumental music is needed, the script automatically applies a workaround: it sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals) and appends `pure music, no lyrics` to the prompt.

This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.
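For illustration, the instrumental workaround amounts to two string substitutions. The sketch below is an assumption about its shape, not the script's actual code: the function names are hypothetical and the `, ` separator used when appending to the prompt is a guess.

```shell
# Hypothetical sketch of the instrumental workaround (real script may differ).
instrumental_lyrics() {
  printf '%s\n' "[intro] [outro]"            # empty structural tags, no vocals
}

instrumental_prompt() {
  local prompt="$1"
  if [ -n "$prompt" ]; then
    printf '%s, pure music, no lyrics\n' "$prompt"   # separator is an assumption
  else
    printf '%s\n' "pure music, no lyrics"
  fi
}
```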

Image Generation

Entry point: `scripts/image/generate_image.sh`

Model: `image-01`, photorealistic image generation from text prompts, with optional character reference for image-to-image.

IMPORTANT: Mode Selection — t2i vs i2i

User intent	Mode
Generate image from text description (default)	`t2i` — text-to-image
Generate image with a character reference photo (keep same person)	`i2i` — image-to-image

Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.

用户意图	模式
根据文本描述生成图像（默认）	`t2i` ——文本转图像
基于人物参考照片生成图像（保留同一人物）	`i2i` ——图像转图像

默认行为： 当用户要求生成图像但未提及参考照片时，使用

t2i

模式（默认）。仅当用户提供人物参考图像或明确要求基于现有人物外观生成图像时，才使用

i2i

模式。

IMPORTANT: Aspect Ratio — Infer from user context

Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:

User intent / context	Recommended ratio	Resolution
头像、图标、社交媒体头像、avatar、icon、profile pic	`1:1`	1024×1024
风景、横幅、桌面壁纸、landscape、banner、desktop wallpaper	`16:9`	1280×720
传统照片、经典比例、classic photo	`4:3`	1152×864
摄影作品、杂志封面、photography、magazine	`3:2`	1248×832
人像竖图、海报、portrait photo、poster	`2:3`	832×1248
竖版海报、书籍封面、tall poster、book cover	`3:4`	864×1152
手机壁纸、社交媒体故事、phone wallpaper、story、reel	`9:16`	720×1280
超宽全景、电影画幅、panoramic、cinematic ultrawide	`21:9`	1344×576
No specific need stated / ambiguous	`1:1`	1024×1024
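A keyword lookup is one simple way to apply the table above. The sketch below is illustrative only: `infer_aspect_ratio` is a hypothetical helper, it matches only the English keywords from the table, and plain substring matching will misfire on some requests, so treat it as a starting point rather than a rule.

```shell
# Hypothetical helper: map a request to an aspect ratio by keyword.
infer_aspect_ratio() {
  local req
  req="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$req" in
    *"phone wallpaper"*|*story*|*reel*)          echo "9:16" ;;
    *panoramic*|*"cinematic ultrawide"*)         echo "21:9" ;;
    *landscape*|*banner*|*"desktop wallpaper"*)  echo "16:9" ;;
    *"book cover"*|*"tall poster"*)              echo "3:4"  ;;
    *portrait*|*poster*)                         echo "2:3"  ;;
    *photography*|*magazine*)                    echo "3:2"  ;;
    *"classic photo"*)                           echo "4:3"  ;;
    *)                                           echo "1:1"  ;;  # avatar, icon, or ambiguous
  esac
}
```

The result would be passed to the script as `--aspect-ratio "$(infer_aspect_ratio "$request")"`.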

IMPORTANT: Image Count — When to generate multiple images

User intent	Count ( `-n` )
Default / single image request	`1` (default)
User says "几张", "多张", "一些" / "a few", "several"	`3`
User says "多种方案", "备选" / "variations", "options"	`3` – `4`
User explicitly specifies a number	Use the specified number (1–9)

Text-to-Image Examples

bash

# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png

# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png

# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png

# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png

# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png

Image-to-Image (Character Reference)

Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).

bash

# Character reference: place same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png

# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png

Aspect Ratio Reference

Ratio	Resolution	Best for
`1:1`	1024×1024	Default, avatars, icons, social media
`16:9`	1280×720	Landscape, banner, desktop wallpaper
`4:3`	1152×864	Classic photo, presentations
`3:2`	1248×832	Photography, magazine layout
`2:3`	832×1248	Portrait photo, poster
`3:4`	864×1152	Book cover, tall poster
`9:16`	720×1280	Phone wallpaper, social story/reel
`21:9`	1344×576	Ultra-wide panoramic, cinematic

Key Options

Option	Description
`--prompt TEXT`	Image description, max 1500 chars (required)
`--aspect-ratio RATIO`	Aspect ratio (see table above). Infer from user context
`--width PX` / `--height PX`	Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` if both set
`-n N`	Number of images to generate, 1–9 (default 1)
`--seed N`	Random seed for reproducibility. Same seed + same params → similar results
`--prompt-optimizer`	Enable automatic prompt optimization by the API
`--ref-image FILE`	Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB)
`--no-download`	Print image URLs instead of downloading files
`--aigc-watermark`	Add AIGC watermark to generated images
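The `--width` / `--height` constraints in the table (512–2048, multiple of 8, both given together) can be checked before calling the script. This is a minimal sketch under those stated limits; `valid_image_dims` is a hypothetical helper, not a toolkit function.

```shell
# Hypothetical validator for custom image dimensions.
# Usage: valid_image_dims WIDTH HEIGHT  -> exit status 0 if both are acceptable
valid_image_dims() {
  local v
  for v in "$1" "$2"; do
    case "$v" in
      ''|*[!0-9]*|0*) return 1 ;;      # missing, non-numeric, or leading zero
    esac
    [ "$v" -ge 512 ] && [ "$v" -le 2048 ] || return 1   # allowed range
    [ $(( v % 8 )) -eq 0 ] || return 1                  # multiple of 8
  done
  return 0
}
```

For example, `valid_image_dims 1024 768` succeeds, while `valid_image_dims 1000 770` fails because 770 is not a multiple of 8.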

Video Generation

IMPORTANT: Single vs Multi-Segment — Choose the right script

User intent	Script to use
Default / no special request	`scripts/video/generate_video.sh` (single segment, 10s, 768P)
User explicitly asks for "long video", "multi-scene", "story", or duration > 10s	`scripts/video/generate_long_video.sh` (multi-segment)

Default behavior: Always use single-segment `generate_video.sh` with duration 10s and resolution 768P unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 10 seconds. Do NOT automatically split into multiple segments; a single 10s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.

Entry point (single video): `scripts/video/generate_video.sh`

Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`

Video Model Constraints (MUST follow)

Duration limits by model and resolution:

Model	720P	768P	1080P
MiniMax-Hailuo-2.3	-	6s or 10s	6s only
MiniMax-Hailuo-2.3-Fast	-	6s or 10s	6s only
MiniMax-Hailuo-02	-	6s or 10s	6s only
T2V-01 / T2V-01-Director	6s only	-	-
I2V-01 / I2V-01-Director / I2V-01-live	6s only	-	-
S2V-01 (ref)	6s only	-	-

Resolution options by model and duration:

Model	6s	10s
MiniMax-Hailuo-2.3	768P (default), 1080P	768P only
MiniMax-Hailuo-2.3-Fast	768P (default), 1080P	768P only
MiniMax-Hailuo-02	512P, 768P (default), 1080P	512P, 768P (default)
Other models	720P (default)	Not supported

Key rules:

- Default: 10s + 768P (best balance of length and quality for MiniMax-Hailuo-2.3)
- 1080P only supports 6s duration: if the user requests 1080P, set `--duration 6`
- 10s duration only works with 768P (or 512P on Hailuo-02); never combine 10s + 1080P
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
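The two constraint tables above can be encoded as a lookup that rejects unsupported combinations before a request is sent. This is a sketch built only from those tables; `valid_video_combo` is a hypothetical helper, and the generation scripts may perform their own checks.

```shell
# Hypothetical validator for model / duration / resolution combinations.
# Usage: valid_video_combo MODEL DURATION_SECONDS RESOLUTION
valid_video_combo() {
  local model="$1" combo="$2:$3"
  case "$model" in
    MiniMax-Hailuo-2.3|MiniMax-Hailuo-2.3-Fast)
      case "$combo" in
        6:768P|6:1080P|10:768P) return 0 ;;
      esac ;;
    MiniMax-Hailuo-02)
      case "$combo" in
        6:512P|6:768P|6:1080P|10:512P|10:768P) return 0 ;;
      esac ;;
    T2V-01|T2V-01-Director|I2V-01|I2V-01-Director|I2V-01-live|S2V-01)
      case "$combo" in
        6:720P) return 0 ;;
      esac ;;
  esac
  return 1   # unknown model or unsupported combination
}
```

For example, `valid_video_combo MiniMax-Hailuo-2.3 10 1080P` fails, which is the signal to drop to `--duration 6` or to 768P.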

IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.

Optimization steps:

1. Apply the Professional Formula: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
   - BAD: `"A puppy in a park"`
   - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`
2. Add camera instructions using `[指令]` (instruction) syntax: `[推进]` (push in), `[拉远]` (pull out), `[跟随]` (follow), `[固定]` (fixed), `[左摇]` (pan left), etc.
3. Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
4. Keep to 1-2 key actions for 6-10 second videos; do not overcrowd with events
5. For i2v mode (image-to-video): Focus the prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
   - BAD: `"A lake with mountains"` (just repeating the image)
   - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`
6. For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.

bash

# Text-to-video (default: 10s, 768P)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Text-to-video with 1080P (must use --duration 6)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera" \
  --duration 6 --resolution 1080P \
  --output minimax-output/puppy_hd.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4

# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4

Long-form Video (Multi-scene)

Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 10 seconds per segment.

Workflow:

Segment 1: t2v — generated purely from the optimized text prompt
Segment 2+: i2v — the previous segment's last frame becomes
```
first_frame_image
```
, prompt describes motion and change from that ending state
All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
Optional: AI-generated background music is overlaid

Prompt rules for each segment:

Each segment prompt MUST be independently optimized using the Professional Formula
Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
Segment 2+ (i2v): Focus on what changes and moves from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
Each segment covers only 10 seconds of action — keep it focused
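
The chaining rules above can be sketched as a dry run that only prints the commands a manual chain might execute. This is an illustration, not the internals of `generate_long_video.sh`: the `plan_chain` name, the `seg_N.mp4` / `frame_N.jpg` file names, and the ffmpeg last-frame extraction step are all assumptions. The prompts are printed inside plain double quotes, so prompts that contain double quotes would need extra escaping.

```shell
# Dry-run sketch: print one command per step instead of running anything.
# Assumed (hypothetical) layout: minimax-output/tmp/seg_N.mp4 and frame_N.jpg.
plan_chain() {
  local tmp="minimax-output/tmp" i=1 prev prompt
  for prompt in "$@"; do
    if [ "$i" -eq 1 ]; then
      # Segment 1: text-to-video from the full scene description
      printf 'bash scripts/video/generate_video.sh --mode t2v --prompt "%s" --output %s/seg_%d.mp4\n' \
        "$prompt" "$tmp" "$i"
    else
      prev=$((i - 1))
      # Grab the last frame of the previous segment (seek 1s before the end,
      # keep overwriting one image so the final frame is what remains)
      printf 'ffmpeg -y -sseof -1 -i %s/seg_%d.mp4 -update 1 -q:v 2 %s/frame_%d.jpg\n' \
        "$tmp" "$prev" "$tmp" "$prev"
      # Segment 2+: image-to-video, prompt describes only motion and change
      printf 'bash scripts/video/generate_video.sh --mode i2v --prompt "%s" --first-frame %s/frame_%d.jpg --output %s/seg_%d.mp4\n' \
        "$prompt" "$tmp" "$prev" "$tmp" "$i"
    fi
    i=$((i + 1))
  done
}

plan_chain "full scene description" "what moves next" "how it ends"
```

Printing the plan first makes it easy to check that every segment after the first receives the previous segment's last frame before any API call is made.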

Example: 3-segment story with optimized per-segment prompts (default: 10s/segment, 768P)

bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4

With custom settings

bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 10 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4

Add Background Music

bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4

Template Video

bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4

Video Models

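| Mode | Default Model | Default Duration | Default Resolution | Notes |
|------|---------------|------------------|--------------------|-------|
| t2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |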
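
For scripting around these defaults, a small lookup can mirror the table. The sketch below is only a convenience wrapper: `video_defaults` and `check_duration` are hypothetical helper names, not part of the toolkit, and the values are copied from the table above rather than queried from the API.

```shell
# Print "<model> <duration-seconds> <resolution>" for a generation mode.
# Values are copied from the Video Models table; update both together.
video_defaults() {
  case "$1" in
    t2v|i2v) echo "MiniMax-Hailuo-2.3 10 768P" ;;
    sef)     echo "MiniMax-Hailuo-02 6 768P" ;;
    ref)     echo "S2V-01 6 720P" ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Reject durations the table rules out. Only the documented
# "ref is 6s only" constraint is checked; other modes pass through.
check_duration() {
  if [ "$1" = "ref" ] && [ "$2" != "6" ]; then
    echo "ref mode supports 6s only (got ${2}s)" >&2
    return 1
  fi
}

video_defaults ref   # prints: S2V-01 6 720P
```

Calling `check_duration "$mode" "$duration"` before `generate_video.sh` catches an invalid `--duration` for ref mode without spending an API request.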

Media Tools (Audio/Video Processing)

Entry point: `scripts/media_tools.sh`

Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via the MiniMax API.

Video Format Conversion

Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)

bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov

With quality / resolution / fps options

bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30

Audio Format Conversion

Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)

bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2

Video Concatenation

Concatenate with crossfade transition (default 0.5s)

bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4

Hard cut (no crossfade)

bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
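
To see what a crossfade implies in plain ffmpeg terms, the sketch below computes where each transition would start if the clips were chained with ffmpeg's `xfade` filter, whose `offset` is measured from the start of the already-merged video. That `concat-video` uses `xfade` internally is an assumption, and `xfade_offsets` is a hypothetical helper name.

```shell
# Print the start offset (seconds) of each crossfade for a chain of clips.
# $1 = crossfade length; remaining arguments = clip durations in order.
xfade_offsets() {
  local x="$1"
  shift
  echo "$@" | awk -v x="$x" '{
    total = $1                      # length of the merged video so far
    for (i = 2; i <= NF; i++) {
      printf "%.1f\n", total - x    # transition starts x seconds before the end
      total = total + $i - x        # each join shortens the result by x
    }
  }'
}

# Three 10s clips with the default 0.5s crossfade
xfade_offsets 0.5 10 10 10
# prints:
# 9.5
# 19.0
```

With `--crossfade 0` the offsets collapse to the plain cumulative durations, which corresponds to a hard cut.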

Audio Concatenation

Simple concatenation

bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3

With crossfade

bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1

Extract Audio from Video

Extract as mp3

bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

Extract as wav with higher bitrate

bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k

Video Trimming

Trim by start/end time (seconds)

bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15

Trim by start + duration

bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
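
The two trim forms describe the same kind of window. As a minimal sketch, the hypothetical helper `trim_window` below normalizes either form into ffmpeg's `-ss` (start) and `-t` (length) options; how `trim-video` maps its flags internally is an assumption.

```shell
# Normalize a trim request into ffmpeg-style "-ss START -t LENGTH".
# usage: trim_window START end END
#        trim_window START duration LENGTH
trim_window() {
  case "$2" in
    end)
      # Length is end minus start; reject windows that are empty or reversed
      awk -v s="$1" -v e="$3" \
        'BEGIN { if (e <= s) exit 1; printf "-ss %s -t %s\n", s, e - s }'
      ;;
    duration)
      printf '%s %s %s %s\n' -ss "$1" -t "$3"
      ;;
    *)
      return 2
      ;;
  esac
}

trim_window 5 end 15        # prints: -ss 5 -t 10
trim_window 10 duration 8   # prints: -ss 10 -t 8
```

So `--start 5 --end 15` and `--start 5 --duration 10` would select the same 10-second clip.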

Add Audio to Video (Overlay / Replace)

Mix audio with existing video audio

bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3

Replace original audio entirely

bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace
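
For readers who want to see what mixing with volume and fades can look like in raw ffmpeg, the sketch below builds a `-filter_complex` string from the `volume`, `afade`, and `amix` filters. It is one possible equivalent, not the actual implementation of `add-audio`; `mix_filter` and the `[bgm]` / `[aout]` labels are hypothetical names, and the video length has to be supplied by the caller (for example from the `probe` command).

```shell
# Build a filter graph that lowers the background track, fades it in and
# out, and mixes it with the video's own audio.
# $1 = video length (s), $2 = volume, $3 = fade-in (s), $4 = fade-out (s)
mix_filter() {
  awk -v len="$1" -v vol="$2" -v fin="$3" -v fout="$4" 'BEGIN {
    out_start = len - fout   # fade-out begins this many seconds into the video
    printf "[1:a]volume=%s,afade=t=in:st=0:d=%s,afade=t=out:st=%s:d=%s[bgm];", vol, fin, out_start, fout
    printf "[0:a][bgm]amix=inputs=2:duration=first[aout]\n"
  }'
}

# 20s video, matching --volume 0.3 --fade-in 2 --fade-out 3
mix_filter 20 0.3 2 3
# prints: [1:a]volume=0.3,afade=t=in:st=0:d=2,afade=t=out:st=17:d=3[bgm];[0:a][bgm]amix=inputs=2:duration=first[aout]
```

The resulting string could be passed as `ffmpeg -i video.mp4 -i bgm.mp3 -filter_complex "$(mix_filter 20 0.3 2 3)" -map 0:v -map "[aout]" -c:v copy output.mp4`. Replacing the original audio needs no filter at all: mapping `0:v` and `1:a` directly is enough.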

Media File Info

bash scripts/media_tools.sh probe input.mp4

Script Architecture

scripts/
├── check_environment.sh         # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh               # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh        # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh        # Music generation CLI
├── image/
│   └── generate_image.sh        # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh        # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh   # Multi-scene long video
    ├── generate_template_video.sh # Template-based video
    └── add_bgm.sh              # Background music overlay

References

Read these for detailed API parameters, voice catalogs, and prompt engineering:

tts-guide.md — TTS setup, voice management, audio processing, segment format, troubleshooting
tts-voice-catalog.md — Full voice catalog with IDs, descriptions, and parameter reference
music-api.md — Music generation API: endpoints, parameters, response format
image-api.md — Image generation API: text-to-image, image-to-image, parameters
video-api.md — Video API: endpoints, models, parameters, camera instructions, templates
video-prompt-guide.md — Video prompt engineering: formulas, styles, image-to-video tips
