minimax-multimodal-toolkit
MiniMax Multi-Modal Toolkit
Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
Output Directory
All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory). Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.
Rules:
- Before running any script, ensure `minimax-output/` exists in the agent's working directory (create if needed: `mkdir -p minimax-output`)
- Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
- Never `cd` into the skill directory to run scripts; run from the agent's working directory using the full script path
- Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
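The rules above can be wrapped in two small helpers. This is a minimal sketch, not part of the toolkit: `OUT_DIR`, `out_path`, and `cleanup_tmp` are hypothetical names introduced here for illustration.

```bash
# Hypothetical helpers (not shipped with the toolkit).
OUT_DIR="minimax-output"

out_path() {
  # Create the output directory on demand, then print the
  # relative path for the given file name.
  mkdir -p "$OUT_DIR"
  printf '%s/%s\n' "$OUT_DIR" "$1"
}

cleanup_tmp() {
  # Remove intermediate files once they are no longer needed.
  rm -rf "$OUT_DIR/tmp"
}
```

A call such as `bash scripts/tts/generate_voice.sh tts "Hello world" -o "$(out_path hello.mp3)"` then always carries an explicit output path under the working directory.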
Prerequisites
```bash
brew install ffmpeg jq   # macOS (or: apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh
```
No Python or pip required: all scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.
API Host Configuration
MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:
| Region | Platform URL | API Host Value |
|---|---|---|
| China Mainland (中国大陆) | https://platform.minimaxi.com | `https://api.minimaxi.com` |
| Global (全球) | https://platform.minimax.io | `https://api.minimax.io` |
```bash
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"
# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
```
**IMPORTANT: When API Host is missing:**
Before running any script, check if `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Instruct and help the user to set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the global variant) in their terminal, or to add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
API Key Configuration
Set the `MINIMAX_API_KEY` environment variable before running any script:
```bash
export MINIMAX_API_KEY="your-api-key-here"
```
The key starts with `sk-api-` or `sk-cp-` and can be obtained from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global).
**IMPORTANT: When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
- Ask the user to provide their MiniMax API key
- Instruct and help the user to set it via `export MINIMAX_API_KEY="sk-..."` in their terminal, or to add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
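Both environment checks can be done in one step before any script runs. The function below is a sketch under that assumption; `check_minimax_env` is a hypothetical name and not part of the toolkit (the bundled `scripts/check_environment.sh` may already cover some of this).

```bash
# Hypothetical pre-flight check (not shipped with the toolkit).
check_minimax_env() {
  # Return 0 when both variables are set and non-empty; otherwise
  # report what is missing on stderr and return 1.
  local missing=0
  if [ -z "${MINIMAX_API_HOST:-}" ]; then
    echo "MINIMAX_API_HOST is not set (use https://api.minimaxi.com or https://api.minimax.io)" >&2
    missing=1
  fi
  if [ -z "${MINIMAX_API_KEY:-}" ]; then
    echo "MINIMAX_API_KEY is not set" >&2
    missing=1
  fi
  return "$missing"
}
```

When the function returns non-zero, the agent should stop and ask the user for the missing value rather than guessing a region or key.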
Key Capabilities
| Capability | Description | Entry point |
|---|---|---|
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.sh` |
| Voice Cloning | Clone a voice from an audio sample (10s–5min) | `scripts/tts/generate_voice.sh clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.sh design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.sh` |
| Image Generation | Text-to-image, image-to-image with character reference | `scripts/image/generate_image.sh` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.sh` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.sh` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | |
TTS (Text-to-Speech)
Entry point: `scripts/tts/generate_voice.sh`
IMPORTANT: Single voice vs multi-segment (choose the right approach)
| User intent | Approach |
|---|---|
| Single voice / no multi-character need | `tts` command |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |
Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline; just pass the full text to `tts` in one call.
Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds 10,000 characters (API limit per request); in this case, split into segments with the same voice
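The length criterion is the only one of the three that can be checked mechanically. The sketch below covers just that case; `TTS_MAX_CHARS` and `tts_mode_for` are hypothetical names, and whether the user needs multiple voices remains a judgment call.

```bash
# Hypothetical helper (not shipped with the toolkit).
TTS_MAX_CHARS=10000   # per-request limit stated above

tts_mode_for() {
  # Print "generate" when the text exceeds the per-request limit,
  # otherwise "tts". ${#1} counts characters in the current locale
  # (bytes under LC_ALL=C), so run this under a UTF-8 locale.
  if [ "${#1}" -gt "$TTS_MAX_CHARS" ]; then
    echo "generate"
  else
    echo "tts"
  fi
}
```

For example, `tts_mode_for "Hello world"` prints `tts`, so the text goes to a single `tts` call.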
Single-voice generation (DEFAULT)
```bash
bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3
```
Multi-segment generation (multi-voice / audiobook / podcast)
Complete workflow (follow ALL steps in order):
1. Write segments.json: split the text into segments with voice assignments (see format and rules below)
2. Run the `generate` command: this reads segments.json, generates audio for EACH segment via the TTS API, then merges them into a single output file with crossfade
```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)
# Step 2: Generate audio from segments.json. This is the CRITICAL step:
# it generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200
```
**Do NOT skip Step 2.** Writing segments.json alone does nothing; you MUST run the `generate` command to actually produce audio.
Voice management
```bash
# List all available voices
bash scripts/tts/generate_voice.sh list-voices
# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice
# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
```
Audio processing
```bash
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3
```
TTS Models
| Model | Notes |
|---|---|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |
segments.json Format
Default crossfade between segments: 200ms (`--crossfade 200`).
```json
[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```
Leave `emotion` empty for speech-2.8 models (auto-matched from text).
IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
Rule: Narration and dialogue are ALWAYS separate segments.
A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`
Example: audiobook with narrator + 2 characters:
```json
[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```
Key principles:
- Narrator uses a consistent neutral narrator voice throughout
- Each character has a dedicated voice_id, maintained consistently across all their dialogue
- Split at dialogue boundaries: `"He said:"` is narrator, the quoted content is the character
- Do NOT merge narrator text and character speech into a single segment
- For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
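Before running `generate`, it can help to sanity-check segments.json with `jq`, which the prerequisites already require. The following is a sketch only: `validate_segments` is a hypothetical helper, and the checks (non-empty array, non-empty `text` of at most 10,000 characters, non-empty `voice_id`) are assumptions drawn from the format and limits described above, not the script's own validation.

```bash
# Hypothetical helper (not shipped with the toolkit).
validate_segments() {
  # Succeeds only if the file holds a non-empty JSON array whose
  # items all have a non-empty string "text" (at most 10,000
  # characters) and a non-empty string "voice_id".
  jq -e '
    type == "array" and length > 0 and
    all(.[];
      (.text | type == "string" and length > 0 and length <= 10000) and
      (.voice_id | type == "string" and length > 0))
  ' "$1" > /dev/null
}
```

Usage: `validate_segments minimax-output/segments.json && bash scripts/tts/generate_voice.sh generate minimax-output/segments.json -o minimax-output/output.mp3 --crossfade 200`. The check does not verify that narration and dialogue are split correctly; that still has to be reviewed by reading the segments.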
Music Generation
Entry point: `scripts/music/generate_music.sh`
IMPORTANT: Instrumental vs Lyrics (when to use which)
| Scenario | Mode | Action |
|---|---|---|
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
When adding background music to video or voice content, always default to instrumental mode (`--instrumental`). Do not ask the user: BGM should never have vocals competing with the main content.
When the user explicitly asks to create/generate music as the primary task, ask them whether they want:
- Instrumental (pure music, no vocals)
- With lyrics (song with vocals; the user provides the lyrics or you help write them)
```bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download
# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download
# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download
```
Music Model
Default model: `music-2.5`
When `--instrumental` is passed with `music-2.5`, the script:
- Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals), appends `pure music, no lyrics` to the prompt
This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.
Image Generation
Entry point: `scripts/image/generate_image.sh`
Model: `image-01`, photorealistic image generation from text prompts, with optional character reference for image-to-image.
IMPORTANT: Mode Selection (t2i vs i2i)
| User intent | Mode |
|---|---|
| Generate image from text description (default) | `t2i` (default) |
| Generate image with a character reference photo (keep same person) | `--mode i2i` with `--ref-image` |
Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.
IMPORTANT: Aspect Ratio (infer from user context)
Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:
| User intent / context | Recommended ratio | Resolution |
|---|---|---|
| Avatar, icon, profile pic (头像、图标、社交媒体头像) | `1:1` | 1024×1024 |
| Landscape, banner, desktop wallpaper (风景、横幅、桌面壁纸) | `16:9` | 1280×720 |
| Classic photo (传统照片、经典比例) | `4:3` | 1152×864 |
| Photography, magazine (摄影作品、杂志封面) | `3:2` | 1248×832 |
| Portrait photo, poster (人像竖图、海报) | `2:3` | 832×1248 |
| Tall poster, book cover (竖版海报、书籍封面) | `3:4` | 864×1152 |
| Phone wallpaper, story, reel (手机壁纸、社交媒体故事) | `9:16` | 720×1280 |
| Panoramic, cinematic ultrawide (超宽全景、电影画幅) | `21:9` | 1344×576 |
| No specific need stated / ambiguous | `1:1` | 1024×1024 |
IMPORTANT: Image Count — When to generate multiple images
| User intent | Count (`-n`) |
|---|---|
| Default / single image request | `1` |
| User says "a few", "several" (几张, 多张, 一些) | |
| User says "variations", "options" (多种方案, 备选) | |
| User explicitly specifies a count | Use the specified number (1–9) |
Text-to-Image Examples
```bash
# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png
# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png
# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png
# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png
# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png
# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png
```
Image-to-Image (Character Reference)
Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
```bash
# Character reference: place the same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png
# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png
```
Aspect Ratio Reference
| Ratio | Resolution | Best for |
|---|---|---|
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
| `16:9` | 1280×720 | Landscape, banner, desktop wallpaper |
| `4:3` | 1152×864 | Classic photo, presentations |
| `3:2` | 1248×832 | Photography, magazine layout |
| `2:3` | 832×1248 | Portrait photo, poster |
| `3:4` | 864×1152 | Book cover, tall poster |
| `9:16` | 720×1280 | Phone wallpaper, social story/reel |
| `21:9` | 1344×576 | Ultra-wide panoramic, cinematic |
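The table can be expressed as a simple lookup. This is an illustrative sketch: `resolution_for_ratio` is a hypothetical helper, and the ratio strings are derived from the listed resolutions (for example, 1344×576 reduces to 21:9) rather than quoted from the script's help text.

```bash
# Hypothetical helper (not shipped with the toolkit).
resolution_for_ratio() {
  # Print the WIDTHxHEIGHT that corresponds to an aspect ratio,
  # or return 1 for a ratio that is not in the table.
  case "$1" in
    1:1)  echo "1024x1024" ;;
    16:9) echo "1280x720" ;;
    4:3)  echo "1152x864" ;;
    3:2)  echo "1248x832" ;;
    2:3)  echo "832x1248" ;;
    3:4)  echo "864x1152" ;;
    9:16) echo "720x1280" ;;
    21:9) echo "1344x576" ;;
    *)    return 1 ;;
  esac
}
```

A non-zero return is a cue to fall back to `--width` / `--height` or to pick the nearest listed ratio.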
Key Options
| Option | Description |
|---|---|
| `--prompt` | Image description, max 1500 chars (required) |
| `--aspect-ratio` | Aspect ratio (see table above). Infer from user context |
| `--width` / `--height` | Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` |
| `-n` | Number of images to generate, 1–9 (default 1) |
| | Random seed for reproducibility. Same seed + same params → similar results |
| `--prompt-optimizer` | Enable automatic prompt optimization by the API |
| `--ref-image` | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| | Print image URLs instead of downloading files |
| | Add AIGC watermark to generated images |
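The custom-size constraints (512–2048, multiple of 8, width and height together) are easy to check before calling the script. A minimal sketch, with `valid_dimension` and `valid_custom_size` as hypothetical helper names:

```bash
# Hypothetical helpers (not shipped with the toolkit).
valid_dimension() {
  # True when $1 is a plain decimal integer in 512..2048 that is
  # a multiple of 8.
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # 10# forces base 10 so a leading zero is not read as octal.
  [ "$((10#$1))" -ge 512 ] && [ "$((10#$1))" -le 2048 ] && [ "$((10#$1 % 8))" -eq 0 ]
}

valid_custom_size() {
  # Width and height must be supplied together and both be valid.
  [ "$#" -eq 2 ] && valid_dimension "$1" && valid_dimension "$2"
}
```

For instance, `valid_custom_size 1024 768` succeeds, while `valid_custom_size 1001 768` fails because 1001 is not a multiple of 8.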
Video Generation
IMPORTANT: Single vs Multi-Segment — Choose the right script
| User intent | Script to use |
|---|---|
| Default / no special request | `scripts/video/generate_video.sh` |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` |
Default behavior: Always use single-segment `generate_video.sh` with duration 10s and resolution 768P unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 10 seconds. Do NOT automatically split into multiple segments; a single 10s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.
Entry point (single video): `scripts/video/generate_video.sh`
Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`
Video Model Constraints (MUST follow)
Duration limits by model and resolution:
| Model | 720P | 768P | 1080P |
|---|---|---|---|
| MiniMax-Hailuo-2.3 | - | 6s or 10s | 6s only |
| MiniMax-Hailuo-2.3-Fast | - | 6s or 10s | 6s only |
| MiniMax-Hailuo-02 | - | 6s or 10s | 6s only |
| T2V-01 / T2V-01-Director | 6s only | - | - |
| I2V-01 / I2V-01-Director / I2V-01-live | 6s only | - | - |
| S2V-01 (ref) | 6s only | - | - |
Resolution options by model and duration:
| Model | 6s | 10s |
|---|---|---|
| MiniMax-Hailuo-2.3 | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-2.3-Fast | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-02 | 512P, 768P (default), 1080P | 512P, 768P (default) |
| Other models | 720P (default) | Not supported |
Key rules:
- Default: 10s + 768P (best balance of length and quality for MiniMax-Hailuo-2.3)
- 1080P only supports 6s duration: if the user requests 1080P, set `--duration 6`
- 10s duration only works with 768P (or 512P on Hailuo-02); never combine 10s + 1080P
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
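The two tables and the key rules can be encoded as a guard that rejects unsupported combinations before a request is sent. This is a sketch under the constraints listed above; `valid_video_combo` is a hypothetical helper, and the generation scripts may enforce these limits differently.

```bash
# Hypothetical helper (not shipped with the toolkit).
valid_video_combo() {
  # Usage: valid_video_combo MODEL DURATION RESOLUTION
  # Returns 0 only for combinations listed in the tables above.
  local model=$1 duration=$2 resolution=$3
  case "$model" in
    MiniMax-Hailuo-2.3|MiniMax-Hailuo-2.3-Fast)
      case "$duration:$resolution" in
        6:768P|6:1080P|10:768P) return 0 ;;
      esac ;;
    MiniMax-Hailuo-02)
      case "$duration:$resolution" in
        6:512P|6:768P|6:1080P|10:512P|10:768P) return 0 ;;
      esac ;;
    T2V-01|T2V-01-Director|I2V-01|I2V-01-Director|I2V-01-live|S2V-01)
      [ "$duration:$resolution" = "6:720P" ] && return 0 ;;
  esac
  return 1
}
```

With this guard, `valid_video_combo MiniMax-Hailuo-2.3 10 1080P` fails, which signals that a 1080P request has to drop to `--duration 6`.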
IMPORTANT: Prompt Optimization (MUST follow before generating any video)
Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.
Optimization steps:
1. Apply the Professional Formula: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
   - BAD: `"A puppy in a park"`
   - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`
2. Add camera instructions using `[指令]` ("[instruction]") syntax: `[推进]` (push in), `[拉远]` (pull out), `[跟随]` (follow), `[固定]` (fixed), `[左摇]` (pan left), etc.
3. Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
4. Keep to 1-2 key actions for 6-10 second videos; do not overcrowd with events
5. For i2v mode (image-to-video): Focus the prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
   - BAD: `"A lake with mountains"` (just repeating the image)
   - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`
6. For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
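The Professional Formula can be treated as an ordered list of components joined into one prompt string. The helper below is only a sketch of that assembly step; `build_video_prompt` is a hypothetical name, and writing good components still depends on `references/video-prompt-guide.md`.

```bash
# Hypothetical helper (not shipped with the toolkit).
build_video_prompt() {
  # Usage: build_video_prompt SUBJECT SCENE MOVEMENT CAMERA AESTHETIC
  # Joins the non-empty components with ", " in formula order.
  local part prompt=""
  for part in "$@"; do
    [ -n "$part" ] || continue
    prompt="${prompt:+$prompt, }$part"
  done
  printf '%s\n' "$prompt"
}
```

For i2v mode, the subject and scene arguments can be passed as empty strings so that only movement, camera, and atmosphere reach the prompt.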
bash
undefined调用任何视频生成脚本前,必须阅读并应用中的规则优化用户的提示词,绝不能直接将用户的原始描述作为参数传入。
references/video-prompt-guide.md--prompt优化步骤:
-
应用专业公式:
主体+场景+动作+镜头运动+美学氛围- 错误示例:
"公园里的小狗" - 正确示例:
"一只金毛寻回犬小狗在公园阳光斑驳的草地上朝镜头跑来,[跟随]平滑跟拍镜头,温暖的黄金时刻光线,浅景深,欢快氛围"
- 错误示例:
-
使用语法添加镜头说明:
[指令]、[推进]、[拉远]、[跟随]、[固定]等。[左摇] -
包含美学细节:光线(黄金时刻、戏剧性侧光)、色彩分级(暖色调、电影级)、纹理(尘埃颗粒、雨滴)、氛围(温馨、史诗感、宁静)。
-
6-10秒视频仅保留1-2个核心动作——不要添加过多事件。
-
i2v模式(图像转视频):提示词仅聚焦动作和变化,因为图像已经确定了视觉内容,切勿重复描述图像中的内容。
- 错误示例:(仅重复图像内容)
"有山的湖泊" - 正确示例:
"水面泛起轻柔的涟漪,微风吹动远处的树木,[固定]固定镜头,柔和的晨光,宁静祥和的氛围"
- 错误示例:
-
多片段长视频:每个片段的提示词必须独立优化。i2v片段(第2段及以后)需描述相对于上一段结束帧的运动/变化。
bash
undefinedText-to-video (default: 10s, 768P)
文本转视频(默认:10秒,768P)
bash scripts/video/generate_video.sh
--mode t2v
--prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful"
--output minimax-output/puppy.mp4
--mode t2v
--prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful"
--output minimax-output/puppy.mp4
bash scripts/video/generate_video.sh
--mode t2v
--prompt "一只金毛寻回犬小狗在阳光明媚的草地上朝镜头跑来,[跟随]跟拍镜头,温暖的黄金时刻,浅景深,欢快氛围"
--output minimax-output/puppy.mp4
--mode t2v
--prompt "一只金毛寻回犬小狗在阳光明媚的草地上朝镜头跑来,[跟随]跟拍镜头,温暖的黄金时刻,浅景深,欢快氛围"
--output minimax-output/puppy.mp4
Text-to-video with 1080P (must use --duration 6)
1080P文本转视频(必须设置--duration 6)
bash scripts/video/generate_video.sh
--mode t2v
--prompt "A golden retriever puppy bounds toward the camera"
--duration 6 --resolution 1080P
--output minimax-output/puppy_hd.mp4
--mode t2v
--prompt "A golden retriever puppy bounds toward the camera"
--duration 6 --resolution 1080P
--output minimax-output/puppy_hd.mp4
bash scripts/video/generate_video.sh
--mode t2v
--prompt "一只金毛寻回犬小狗朝镜头跑来"
--duration 6 --resolution 1080P
--output minimax-output/puppy_hd.mp4
--mode t2v
--prompt "一只金毛寻回犬小狗朝镜头跑来"
--duration 6 --resolution 1080P
--output minimax-output/puppy_hd.mp4
# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4
# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4
# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4
Long-form Video (Multi-scene)
Multi-scene long videos chain segments together: the first segment is generated via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. The default is 10 seconds per segment.
Workflow:
- Segment 1: t2v — generated purely from the optimized text prompt
- Segment 2+: i2v, where the previous segment's last frame becomes `first_frame_image` and the prompt describes motion and change from that ending state
- All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
- Optional: AI-generated background music is overlaid
Prompt rules for each segment:
- Each segment prompt MUST be independently optimized using the Professional Formula
- Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
- Segment 2+ (i2v): Focus on what changes and moves from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
- Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
- Each segment covers only 10 seconds of action — keep it focused
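The chaining workflow above can be sketched as a dry run that only prints the commands it would execute. This is an illustration under assumptions, not the actual logic of `generate_long_video.sh`: `plan_long_video` and the `seg_N.mp4` / `last_N.jpg` file names are hypothetical, and the `ffmpeg -sseof ... -update 1` call is one common way to grab a clip's final frame.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the per-segment commands instead of running them.
# plan_long_video and the tmp/ file names are hypothetical.
plan_long_video() {
  local out_dir="minimax-output/tmp" i=1 prev="" scene seg frame
  for scene in "$@"; do
    seg="$out_dir/seg_${i}.mp4"
    if [ "$i" -eq 1 ]; then
      # Segment 1: text-to-video from the optimized prompt alone.
      echo "bash scripts/video/generate_video.sh --mode t2v --prompt \"$scene\" --output $seg"
    else
      # Segment 2+: extract the previous segment's last frame, then run
      # image-to-video with that frame as the first frame.
      frame="$out_dir/last_$((i - 1)).jpg"
      echo "ffmpeg -y -sseof -1 -i $prev -update 1 -q:v 2 $frame"
      echo "bash scripts/video/generate_video.sh --mode i2v --prompt \"$scene\" --first-frame $frame --output $seg"
    fi
    prev="$seg"
    i=$((i + 1))
  done
}

plan_long_video "Scene 1 prompt" "Scene 2 prompt"
```

Printing the plan first makes it easy to check that every segment after the first receives the preceding segment's final frame before any API calls are made.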
bash
# Example: 3-segment story with optimized per-segment prompts (default: 10s/segment, 768P)
bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4
# With custom settings
bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 10 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4
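When planning scene counts, it can help to estimate the final length. The sketch below assumes each crossfade overlaps the two neighboring segments (as FFmpeg's `xfade` filter does), so every join shortens the result by the crossfade length; `total_duration` is a hypothetical helper, and the real script's output length may differ slightly.

```bash
#!/usr/bin/env bash
# total_duration SEGMENTS SEGMENT_SECONDS CROSSFADE_SECONDS
# Assumption: each of the (SEGMENTS - 1) joins overlaps by CROSSFADE_SECONDS.
total_duration() {
  awk -v n="$1" -v d="$2" -v c="$3" \
    'BEGIN { printf "%.1f\n", n * d - (n - 1) * c }'
}

# 3 segments of 10s with 0.5s crossfades: 3*10 - 2*0.5
total_duration 3 10 0.5   # prints 29.0
```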
Add Background Music
bash
bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4
Template Video
bash
bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4
Video Models
| Mode | Default Model | Default Duration | Default Resolution | Notes |
|---|---|---|---|---|
| t2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |
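The limits stated in this section (ref mode is 6s only, 1080P requires `--duration 6`) can be checked before spending an API call. The following is a hypothetical pre-flight check, not part of the toolkit; applying the 1080P rule to every mode is an assumption, since the source states it only for a text-to-video example.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: returns 0 when the mode/duration/resolution
# combination matches the limits described in this document.
validate_video_args() {
  local mode="$1" duration="$2" resolution="$3"
  case "$mode" in
    t2v|i2v|sef|ref) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [ "$mode" = "ref" ] && [ "$duration" -ne 6 ]; then
    echo "ref mode supports 6s only" >&2
    return 1
  fi
  # Assumption: the 1080P/6s restriction applies regardless of mode.
  if [ "$resolution" = "1080P" ] && [ "$duration" -ne 6 ]; then
    echo "1080P requires --duration 6" >&2
    return 1
  fi
  return 0
}

validate_video_args t2v 10 768P && echo "ok"
validate_video_args t2v 10 1080P || echo "rejected"
```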
Media Tools (Audio/Video Processing)
Entry point: `scripts/media_tools.sh`
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via the MiniMax API.
Video Format Conversion
bash
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov
# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30
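For several files at once, the `convert-video` subcommand can be wrapped in a loop. The sketch below is a dry run that only prints the commands; `plan_batch_convert` is a hypothetical helper, and writing results into `minimax-output/` follows the output directory rule rather than anything specific to `media_tools.sh`.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints one convert-video command per input file.
# plan_batch_convert is hypothetical; remove the echo to actually convert.
plan_batch_convert() {
  local f name
  for f in "$@"; do
    # Strip the directory and the extension to build the output name.
    name="$(basename "${f%.*}")"
    echo "bash scripts/media_tools.sh convert-video $f -o minimax-output/${name}.mp4"
  done
}

plan_batch_convert clips/a.webm clips/b.webm
```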
Audio Format Conversion
bash
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2
Video Concatenation
bash
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4
# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0
Audio Concatenation
bash
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3
# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1
Extract Audio from Video
bash
# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3
# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k
Video Trimming
bash
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15
# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8
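The two trimming forms are related by simple arithmetic: assuming `--end` is an absolute timestamp in the source file, `--start 5 --end 15` selects the same span as `--start 5 --duration 10`. The helper below is a hypothetical sketch of that conversion, not something provided by `media_tools.sh`.

```bash
#!/usr/bin/env bash
# end_to_duration START END
# Prints END - START; exits non-zero when END is not after START.
end_to_duration() {
  awk -v s="$1" -v e="$2" 'BEGIN { if (e <= s) exit 1; print e - s }'
}

end_to_duration 5 15   # prints 10
```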
Add Audio to Video (Overlay / Replace)
bash
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3
# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace
Media File Info
bash
bash scripts/media_tools.sh probe input.mp4
Script Architecture
scripts/
├── check_environment.sh            # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh                  # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh           # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh           # Music generation CLI
├── image/
│   └── generate_image.sh           # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh           # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh      # Multi-scene long video
    ├── generate_template_video.sh  # Template-based video
    └── add_bgm.sh                  # Background music overlay
References
Read these for detailed API parameters, voice catalogs, and prompt engineering:
- tts-guide.md — TTS setup, voice management, audio processing, segment format, troubleshooting
- tts-voice-catalog.md — Full voice catalog with IDs, descriptions, and parameter reference
- music-api.md — Music generation API: endpoints, parameters, response format
- image-api.md — Image generation API: text-to-image, image-to-image, parameters
- video-api.md — Video API: endpoints, models, parameters, camera instructions, templates
- video-prompt-guide.md — Video prompt engineering: formulas, styles, image-to-video tips