minimax-multimodal-toolkit

MiniMax Multi-Modal Toolkit

Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.

Output Directory

All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).
Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.
Rules:
  1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create it if needed: `mkdir -p minimax-output`)
  2. Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
  3. Never `cd` into the skill directory to run scripts. Run them from the agent's working directory using the full script path
  4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
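
The output-directory rules above can be wrapped in a small helper. This is only a sketch: `output_path` is a hypothetical function name, not something the toolkit ships, and it assumes it is run from the agent's working directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the toolkit): make sure minimax-output/
# exists under the current working directory, then print a path inside it
# that can be passed to --output / -o.
output_path() {
  local name=$1
  mkdir -p minimax-output || return 1
  printf 'minimax-output/%s\n' "$name"
}

out=$(output_path "hello.mp3")
echo "Would pass: -o $out"
```

Building the path in one place makes it harder to forget the explicit output argument on an individual script call.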

Prerequisites

bash
brew install ffmpeg jq              # macOS (or apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh

No Python or pip required. All scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.

API Host Configuration

MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:
| Region | Platform URL | API Host Value |
| --- | --- | --- |
| China Mainland | https://platform.minimaxi.com | https://api.minimaxi.com |
| Global | https://platform.minimax.io | https://api.minimax.io |
bash
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"
# or Global
export MINIMAX_API_HOST="https://api.minimax.io"

**IMPORTANT: When API Host is missing:**
Before running any script, check if `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Instruct and help the user to set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence

API Key Configuration

Set the `MINIMAX_API_KEY` environment variable before running any script:
bash
export MINIMAX_API_KEY="your-api-key-here"

The key starts with `sk-api-` or `sk-cp-` and is obtainable from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global).
IMPORTANT: When API Key is missing: Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
  1. Ask the user to provide their MiniMax API key
  2. Instruct and help the user to set it via `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
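
The two "missing configuration" checks can be combined into one pre-flight function. This is a minimal sketch, assuming the only valid hosts are the two listed in this document and that every key uses one of the two prefixes mentioned above; `check_minimax_env` is a hypothetical name and is separate from `scripts/check_environment.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the toolkit). It reports problems
# on stderr and returns non-zero instead of exiting, so the caller can decide
# how to ask the user for the missing value.
check_minimax_env() {
  local status=0
  case "${MINIMAX_API_HOST:-}" in
    https://api.minimaxi.com|https://api.minimax.io) ;;
    "") echo "MINIMAX_API_HOST is not set (China Mainland or Global?)" >&2; status=1 ;;
    *)  echo "Unexpected MINIMAX_API_HOST: ${MINIMAX_API_HOST}" >&2; status=1 ;;
  esac
  case "${MINIMAX_API_KEY:-}" in
    sk-api-*|sk-cp-*) ;;
    "") echo "MINIMAX_API_KEY is not set" >&2; status=1 ;;
    *)  echo "MINIMAX_API_KEY does not start with sk-api- or sk-cp-" >&2; status=1 ;;
  esac
  return "$status"
}

check_minimax_env || echo "Ask the user for the missing value before running any script."
```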

Key Capabilities

| Capability | Description | Entry point |
| --- | --- | --- |
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.sh` |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | `scripts/tts/generate_voice.sh clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.sh design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.sh` |
| Image Generation | Text-to-image, image-to-image with character reference | `scripts/image/generate_image.sh` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.sh` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.sh` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | `scripts/media_tools.sh` |

TTS (Text-to-Speech)

Entry point: `scripts/tts/generate_voice.sh`

IMPORTANT: Single voice vs Multi-segment (choose the right approach)

| User intent | Approach |
| --- | --- |
| Single voice / no multi-character need | `tts` command: generate the entire text in one call |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |
Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline; just pass the full text to `tts` in one call.
Only use multi-segment `generate` when:
  • The user explicitly needs multiple voices/characters
  • The text requires narrator + character dialogue separation
  • The text exceeds 10,000 characters (API limit per request). In this case, split into segments with the same voice
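
The decision rule above can be sketched as a tiny function. `choose_tts_command` is a hypothetical helper; it assumes the caller already knows whether multiple voices are needed, and that `${#text}` counts characters rather than bytes (which holds in a UTF-8 locale).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "tts" or "generate" following the rules above.
#   $1 = full text, $2 = "yes" if multiple voices/characters are needed.
choose_tts_command() {
  local text=$1 multi_voice=${2:-no}
  if [ "$multi_voice" = "yes" ] || [ "${#text}" -gt 10000 ]; then
    echo generate   # multi-segment pipeline with segments.json
  else
    echo tts        # single call with one voice (the default)
  fi
}

echo "Plain request uses: $(choose_tts_command "Hello world")"
```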

Single-voice generation (DEFAULT)

bash
bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3

Multi-segment generation (multi-voice / audiobook / podcast)

Complete workflow (follow ALL steps in order):
  1. Write segments.json: split text into segments with voice assignments (see format and rules below)
  2. Run the `generate` command: this reads segments.json, generates audio for EACH segment via the TTS API, then merges them into a single output file with crossfade
bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json. This is the CRITICAL step:
# it generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200

**Do NOT skip Step 2.** Writing segments.json alone does nothing; you MUST run the `generate` command to actually produce audio.

Voice management

bash
# List all available voices
bash scripts/tts/generate_voice.sh list-voices

# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice

# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator

Audio processing

bash
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3

TTS Models

| Model | Notes |
| --- | --- |
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |

segments.json Format

Default crossfade between segments: 200ms (`--crossfade 200`).
json
[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]

Leave `emotion` empty for speech-2.8 models (auto-matched from text).

IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
Rule: Narration and dialogue are ALWAYS separate segments.
A sentence like "Tom said: The weather is great today!" must be split into two segments:
  • Segment 1 (narrator voice): "Tom said:"
  • Segment 2 (character voice): "The weather is great today!"
Example: audiobook with narrator + 2 characters:
json
[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]

Key principles:
  1. Narrator uses a consistent neutral narrator voice throughout
  2. Each character has a dedicated voice_id, maintained consistently across all their dialogue
  3. Split at dialogue boundaries: "He said:" is narrator, the quoted content is the character
  4. Do NOT merge narrator text and character speech into a single segment
  5. For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
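
To illustrate the split rule, here is a rough sketch that cuts a line of the form `narration: dialogue` at the first colon and prints two segment objects. The function names are hypothetical, the colon-based split is a simplification (real text usually needs judgment about where dialogue starts), and the escaping only handles backslashes and double quotes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Hypothetical helper: print one segment object (text, voice_id, emotion).
segment() {
  printf '{ "text": "%s", "voice_id": "%s", "emotion": "%s" }' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "${3:-}")"
}

# Split "<narration>: <dialogue>" at the first colon into two segments.
split_line() {
  local line=$1 narrator=$2 character=$3
  local lead="${line%%:*}:"      # narration, colon included
  local speech="${line#*:}"      # everything after the first colon
  speech="${speech# }"           # drop one leading space
  printf '[\n  %s,\n  %s\n]\n' \
    "$(segment "$lead" "$narrator")" "$(segment "$speech" "$character")"
}

split_line "Tom said: The weather is great today!" narrator-voice tom-voice
```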

Music Generation

Entry point: `scripts/music/generate_music.sh`

IMPORTANT: Instrumental vs Lyrics (when to use which)

| Scenario | Mode | Action |
| --- | --- | --- |
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
When adding background music to video or voice content, always default to instrumental mode (`--instrumental`). Do not ask the user: BGM should never have vocals competing with the main content.
When the user explicitly asks to create/generate music as the primary task, ask them whether they want:
  • Instrumental (pure music, no vocals)
  • With lyrics (song with vocals; the user provides lyrics or you help write them)
bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download

# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download

# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download

Music Model

Default model: `music-2.5`
`music-2.5` does not support `--instrumental` directly. When instrumental music is needed, the script automatically applies a workaround:
  • Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals) and appends `pure music, no lyrics` to the prompt
This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.

Image Generation

Entry point: `scripts/image/generate_image.sh`
Model: `image-01`, photorealistic image generation from text prompts, with optional character reference for image-to-image.

IMPORTANT: Mode Selection — t2i vs i2i

| User intent | Mode |
| --- | --- |
| Generate image from text description (default) | `t2i` (text-to-image) |
| Generate image with a character reference photo (keep same person) | `i2i` (image-to-image) |
Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.

IMPORTANT: Aspect Ratio — Infer from user context

Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:
| User intent / context | Recommended ratio | Resolution |
| --- | --- | --- |
| 头像、图标、社交媒体头像, avatar, icon, profile pic | `1:1` | 1024×1024 |
| 风景、横幅、桌面壁纸, landscape, banner, desktop wallpaper | `16:9` | 1280×720 |
| 传统照片、经典比例, classic photo | `4:3` | 1152×864 |
| 摄影作品、杂志封面, photography, magazine | `3:2` | 1248×832 |
| 人像竖图、海报, portrait photo, poster | `2:3` | 832×1248 |
| 竖版海报、书籍封面, tall poster, book cover | `3:4` | 864×1152 |
| 手机壁纸、社交媒体故事, phone wallpaper, story, reel | `9:16` | 720×1280 |
| 超宽全景、电影画幅, panoramic, cinematic ultrawide | `21:9` | 1344×576 |
| No specific requirement / ambiguous | `1:1` | 1024×1024 |
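
As a rough illustration of inferring the ratio from context, the sketch below maps English keywords from the table to a ratio. `pick_aspect_ratio` is a hypothetical helper; plain keyword matching and the order of the patterns are assumptions, and a real request usually needs more judgment than a `case` statement.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a free-text request to an aspect ratio using the
# English keywords from the table above. More specific phrases are listed
# before shorter ones ("tall poster" before "poster").
pick_aspect_ratio() {
  local request
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    *avatar*|*icon*|*"profile pic"*)            echo "1:1" ;;
    *"phone wallpaper"*|*story*|*reel*)         echo "9:16" ;;
    *panoramic*|*ultrawide*)                    echo "21:9" ;;
    *"book cover"*|*"tall poster"*)             echo "3:4" ;;
    *portrait*|*poster*)                        echo "2:3" ;;
    *landscape*|*banner*|*"desktop wallpaper"*) echo "16:9" ;;
    *"classic photo"*)                          echo "4:3" ;;
    *photography*|*magazine*)                   echo "3:2" ;;
    *)                                          echo "1:1" ;;  # ambiguous
  esac
}

echo "--aspect-ratio $(pick_aspect_ratio "Mountain landscape with misty valleys")"
```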

IMPORTANT: Image Count — When to generate multiple images

| User intent | Count (`-n`) |
| --- | --- |
| Default / single image request | 1 (default) |
| User says "几张", "多张", "一些" / "a few", "several" | 3 |
| User says "多种方案", "备选" / "variations", "options" | 3–4 |
| User explicitly specifies a number | Use the specified number (1–9) |

Text-to-Image Examples

bash
# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png

# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png

# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png

# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png

# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png

Image-to-Image (Character Reference)

Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
bash
# Character reference: place same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png

# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png

Aspect Ratio Reference

| Ratio | Resolution | Best for |
| --- | --- | --- |
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
| `16:9` | 1280×720 | Landscape, banner, desktop wallpaper |
| `4:3` | 1152×864 | Classic photo, presentations |
| `3:2` | 1248×832 | Photography, magazine layout |
| `2:3` | 832×1248 | Portrait photo, poster |
| `3:4` | 864×1152 | Book cover, tall poster |
| `9:16` | 720×1280 | Phone wallpaper, social story/reel |
| `21:9` | 1344×576 | Ultra-wide panoramic, cinematic |

Key Options

| Option | Description |
| --- | --- |
| `--prompt TEXT` | Image description, max 1500 chars (required) |
| `--aspect-ratio RATIO` | Aspect ratio (see table above). Infer from user context |
| `--width PX` / `--height PX` | Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` if both set |
| `-n N` | Number of images to generate, 1–9 (default 1) |
| `--seed N` | Random seed for reproducibility. Same seed + same params → similar results |
| `--prompt-optimizer` | Enable automatic prompt optimization by the API |
| `--ref-image FILE` | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| `--no-download` | Print image URLs instead of downloading files |
| `--aigc-watermark` | Add AIGC watermark to generated images |
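
The constraints on `--width` / `--height` can be checked before calling the script. This is a sketch only: `valid_dimension` and `valid_size` are hypothetical helpers, and whether the script itself performs the same validation is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: a dimension must be an integer in 512..2048 and a
# multiple of 8. The 10# prefix forces base 10 so "0512" is not read as octal.
valid_dimension() {
  local px=$1
  [[ $px =~ ^[0-9]+$ ]] || return 1
  (( 10#$px >= 512 && 10#$px <= 2048 && 10#$px % 8 == 0 ))
}

# Width and height are required together, so exactly two arguments.
valid_size() {
  [ "$#" -eq 2 ] && valid_dimension "$1" && valid_dimension "$2"
}

valid_size 1024 768 && echo "1024x768 is acceptable"
valid_size 1020 768 || echo "1020 is not a multiple of 8"
```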

Video Generation

IMPORTANT: Single vs Multi-Segment (choose the right script)

| User intent | Script to use |
| --- | --- |
| Default / no special request | `scripts/video/generate_video.sh` (single segment, 10s, 768P) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` (multi-segment) |
Default behavior: Always use single-segment `generate_video.sh` with duration 10s and resolution 768P unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 10 seconds. Do NOT automatically split into multiple segments; a single 10s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.
Entry point (single video): `scripts/video/generate_video.sh`
Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`

Video Model Constraints (MUST follow)

Duration limits by model and resolution:
| Model | 720P | 768P | 1080P |
| --- | --- | --- | --- |
| MiniMax-Hailuo-2.3 | - | 6s or 10s | 6s only |
| MiniMax-Hailuo-2.3-Fast | - | 6s or 10s | 6s only |
| MiniMax-Hailuo-02 | - | 6s or 10s | 6s only |
| T2V-01 / T2V-01-Director | 6s only | - | - |
| I2V-01 / I2V-01-Director / I2V-01-live | 6s only | - | - |
| S2V-01 (ref) | 6s only | - | - |
Resolution options by model and duration:
| Model | 6s | 10s |
| --- | --- | --- |
| MiniMax-Hailuo-2.3 | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-2.3-Fast | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-02 | 512P, 768P (default), 1080P | 512P, 768P (default) |
| Other models | 720P (default) | Not supported |
Key rules:
  • Default: 10s + 768P (best balance of length and quality for MiniMax-Hailuo-2.3)
  • 1080P only supports 6s duration. If the user requests 1080P, set `--duration 6`
  • 10s duration only works with 768P (or 512P on Hailuo-02). Never combine 10s + 1080P
  • Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
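
The two tables can be encoded as a lookup that rejects unsupported combinations before any API call is made. This is a minimal sketch that transcribes the tables above; `video_combo_ok` is a hypothetical helper, and the scripts may apply their own checks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if model + duration (seconds) + resolution is
# a combination listed in the tables above, 1 otherwise.
video_combo_ok() {
  local model=$1 combo="$2:$3"
  case "$model" in
    MiniMax-Hailuo-2.3|MiniMax-Hailuo-2.3-Fast)
      case "$combo" in 6:768P|6:1080P|10:768P) return 0 ;; esac ;;
    MiniMax-Hailuo-02)
      case "$combo" in 6:512P|6:768P|6:1080P|10:512P|10:768P) return 0 ;; esac ;;
    T2V-01|T2V-01-Director|I2V-01|I2V-01-Director|I2V-01-live|S2V-01)
      case "$combo" in 6:720P) return 0 ;; esac ;;
  esac
  return 1
}

video_combo_ok MiniMax-Hailuo-2.3 10 768P && echo "default 10s + 768P is fine"
video_combo_ok MiniMax-Hailuo-2.3 10 1080P || echo "10s + 1080P is not allowed; use --duration 6"
```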

IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.
Optimization steps:
  1. Apply the Professional Formula: Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere
    • BAD: "A puppy in a park"
    • GOOD: "A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"
  2. Add camera instructions using the bracketed `[指令]` (instruction) syntax: `[推进]` (push in), `[拉远]` (pull back), `[跟随]` (follow), `[固定]` (fixed), `[左摇]` (pan left), etc.
  3. Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
  4. Keep to 1-2 key actions for 6-10 second videos. Do not overcrowd with events
  5. For i2v mode (image-to-video): Focus the prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
    • BAD: "A lake with mountains" (just repeating the image)
    • GOOD: "Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"
  6. For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
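
The Professional Formula can be pictured as filling five slots. The sketch below is purely illustrative: `build_video_prompt` is a hypothetical helper, it does not replace reading `references/video-prompt-guide.md`, and it places the movement before the scene only so the English sentence reads naturally.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assemble a prompt from the five formula parts.
#   $1 subject, $2 scene, $3 movement, $4 camera motion, $5 atmosphere
# Refuses to build a prompt if any part is missing or empty.
build_video_prompt() {
  [ "$#" -eq 5 ] || return 1
  local part
  for part in "$@"; do
    [ -n "$part" ] || return 1
  done
  printf '%s %s %s, %s, %s\n' "$1" "$3" "$2" "$4" "$5"
}

build_video_prompt \
  "A golden retriever puppy" \
  "on a sun-dappled grass path in a park" \
  "runs toward the camera" \
  "[跟随] smooth tracking shot" \
  "warm golden hour lighting, joyful atmosphere"
```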
bash
# Text-to-video (default: 10s, 768P)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Text-to-video with 1080P (must use --duration 6)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera" \
  --duration 6 --resolution 1080P \
  --output minimax-output/puppy_hd.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4

# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4

Long-form Video (Multi-scene)

Multi-scene long videos chain segments together: the first segment is generated via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 10 seconds per segment.
Workflow:
  1. Segment 1: t2v, generated purely from the optimized text prompt
  2. Segment 2+: i2v, where the previous segment's last frame becomes `first_frame_image` and the prompt describes motion and change from that ending state
  3. All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
  4. Optional: AI-generated background music is overlaid
Prompt rules for each segment:
  • Each segment prompt MUST be independently optimized using the Professional Formula
  • Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
  • Segment 2+ (i2v): Focus on what changes and moves from the previous ending frame. Do NOT repeat the visual description, because the first frame already provides it
  • Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
  • Each segment covers only 10 seconds of action, so keep it focused
bash
undefined
多场景长视频将多个片段链接在一起:第一个片段通过文本转视频(t2v)生成,后续每个片段以上一个片段的最后一帧作为首帧,通过图像转视频(i2v)生成。片段之间通过淡入淡出过渡连接,保证流畅性。默认每个片段时长为10秒。
工作流:
  1. 片段1:t2v——完全基于优化后的文本提示生成
  2. 片段2+:i2v——上一个片段的最后一帧作为
    first_frame_image
    ,提示词描述相对于该结束帧的动作和变化
  3. 所有片段通过0.5秒的淡入淡出过渡拼接,消除跳切
  4. 可选:添加AI生成的背景音乐
每个片段的提示词规则:
  • 每个片段的提示词必须使用专业公式独立优化
  • 片段1(t2v):包含主体、场景、镜头、氛围的完整场景描述
  • 片段2+(i2v):聚焦相对于上一结束帧的变化和动作,切勿重复视觉描述——首帧已经提供了视觉内容
  • 保持视觉一致性:所有片段的光线、色彩分级、风格关键词保持一致
  • 每个片段仅涵盖10秒的动作——保持内容聚焦
bash
undefined
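
The chaining rule in the workflow above can be sketched as a dry run. This is a minimal sketch, not the logic of generate_long_video.sh itself: the helper plan_segments and the frame file names under minimax-output/tmp/ are hypothetical, and nothing is generated. It only prints which mode and first frame each segment would use.

```bash
#!/usr/bin/env bash
# Dry-run planner (hypothetical helper): shows how segments chain together.
# It only prints a plan; no MiniMax API call or FFmpeg step is performed.
plan_segments() {
  local index=1 prev_frame="" scene
  for scene in "$@"; do
    if [ "$index" -eq 1 ]; then
      # Segment 1 has no previous frame, so it is plain text-to-video.
      printf 'segment %d: mode=t2v first_frame=none prompt=%s\n' "$index" "$scene"
    else
      # Later segments start from the last frame of the previous segment.
      printf 'segment %d: mode=i2v first_frame=%s prompt=%s\n' "$index" "$prev_frame" "$scene"
    fi
    # Assumed naming scheme for the extracted last frame of this segment.
    prev_frame="minimax-output/tmp/segment_${index}_last.jpg"
    index=$((index + 1))
  done
}

plan_segments "astronaut stands on a red desert planet" \
              "astronaut walks toward a glowing structure"
```

Only the first scene is planned as t2v; every later scene is planned as i2v and points at the frame left behind by its predecessor.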

Example: 3-segment story with optimized per-segment prompts (default: 10s/segment, 768P)

bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4

With custom settings

bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 10 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4
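
As a rough check on these settings, the final length can be estimated from the segment count, segment duration, and crossfade. The sketch below assumes every crossfade overlaps its two neighbouring segments by the full crossfade length, which is typical for blended transitions; whether generate_long_video.sh behaves exactly this way is not confirmed here, and estimate_total_duration is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated length of a crossfaded long video.
# total = segments * segment_duration - (segments - 1) * crossfade
estimate_total_duration() {
  LC_ALL=C awk -v n="$1" -v d="$2" -v c="$3" \
    'BEGIN { printf "%.1f\n", n * d - (n - 1) * c }'
}

estimate_total_duration 2 10 0.5   # two 10s segments, one 0.5s crossfade
estimate_total_duration 3 10 0.5   # three 10s segments, two 0.5s crossfades
```

Under this assumption, two default segments come to 19.5 seconds and three come to 29.0 seconds.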

Add Background Music

bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4

Template Video

bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4

Video Models

| Mode | Default Model | Default Duration | Default Resolution | Notes |
|------|---------------|------------------|--------------------|-------|
| t2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |
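
The defaults in the table can be mirrored in a small lookup, for example to validate arguments before calling generate_video.sh. This is only a sketch: video_defaults and check_ref_duration are hypothetical helpers that restate the table, not functions shipped with the toolkit.

```bash
#!/usr/bin/env bash
# Hypothetical lookup that mirrors the defaults table above.
# Prints "model duration_seconds resolution" or fails for an unknown mode.
video_defaults() {
  case "$1" in
    t2v|i2v) echo "MiniMax-Hailuo-2.3 10 768P" ;;
    sef)     echo "MiniMax-Hailuo-02 6 768P" ;;
    ref)     echo "S2V-01 6 720P" ;;
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# ref mode supports 6-second clips only, so reject any other duration.
check_ref_duration() {
  [ "$1" -eq 6 ]
}

video_defaults sef
check_ref_duration 6 && echo "ref duration ok"
```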

Media Tools (Audio/Video Processing)

Entry point: scripts/media_tools.sh
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via the MiniMax API.
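
The output-directory rule from the start of this document applies to these utilities as well. A minimal sketch of a helper that keeps results under minimax-output/ follows; the function name out_path is hypothetical and not part of the toolkit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: make sure minimax-output/ exists in the current
# working directory and print a path inside it for the given file name.
out_path() {
  mkdir -p minimax-output
  printf 'minimax-output/%s\n' "$(basename "$1")"
}

out_path merged.mp4
# Intended use (not executed here):
#   bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o "$(out_path merged.mp4)"
```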

Video Format Conversion

Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)

bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov

With quality / resolution / fps options

bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30

Audio Format Conversion

Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)

bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2

Video Concatenation

Concatenate with crossfade transition (default 0.5s)

bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4

Hard cut (no crossfade)

bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0

Audio Concatenation

Simple concatenation

bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3

With crossfade

bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1

Extract Audio from Video

Extract as mp3

bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

Extract as wav with higher bitrate

bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k

Video Trimming

Trim by start/end time (seconds)

bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15

Trim by start + duration

bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
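
Both trimming forms describe a clip by its span: an end time is simply the start plus a duration. A small sketch of the conversion follows. The helper name end_to_duration is hypothetical, and times are assumed to be plain seconds as in the examples above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a --start/--end pair into a --duration value.
# Times are in seconds and may be fractional; end must be after start.
end_to_duration() {
  LC_ALL=C awk -v s="$1" -v e="$2" 'BEGIN {
    if (e <= s) { print "end must be greater than start" > "/dev/stderr"; exit 1 }
    printf "%g\n", e - s
  }'
}

end_to_duration 5 15   # --start 5 --end 15 covers the same span as --start 5 --duration 10
```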

Add Audio to Video (Overlay / Replace)

Mix audio with existing video audio

bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3

Replace original audio entirely

bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace

Media File Info

bash scripts/media_tools.sh probe input.mp4

Script Architecture

scripts/
├── check_environment.sh         # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh               # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh        # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh        # Music generation CLI
├── image/
│   └── generate_image.sh        # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh        # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh   # Multi-scene long video
    ├── generate_template_video.sh # Template-based video
    └── add_bgm.sh              # Background music overlay
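
check_environment.sh is described above as verifying curl, ffmpeg, jq, xxd, and the API key. Its actual contents are not shown here, so the following is only a sketch of how the tool part of such a check is commonly written with command -v. The function name missing_tools is hypothetical, and the API key check is omitted.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: report which of the given commands are not on PATH.
# Prints one missing tool per line; returns non-zero if any are missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

if missing="$(missing_tools curl ffmpeg jq xxd)"; then
  echo "all required tools found"
else
  echo "missing tools:"
  echo "$missing"
fi
```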

References

Read these for detailed API parameters, voice catalogs, and prompt engineering:
  • tts-guide.md — TTS setup, voice management, audio processing, segment format, troubleshooting
  • tts-voice-catalog.md — Full voice catalog with IDs, descriptions, and parameter reference
  • music-api.md — Music generation API: endpoints, parameters, response format
  • image-api.md — Image generation API: text-to-image, image-to-image, parameters
  • video-api.md — Video API: endpoints, models, parameters, camera instructions, templates
  • video-prompt-guide.md — Video prompt engineering: formulas, styles, image-to-video tips