MiniMax Multi-Modal Toolkit

Generate voice, music, video, and image content via MiniMax APIs — the unified entry for MiniMax multimodal use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.

Output Directory

All generated files MUST be saved to
minimax-output/
under the AGENT'S current working directory (NOT the skill directory). Every script call MUST include an explicit

--output

-o

argument pointing to this location. Never omit the output argument or rely on script defaults.

Rules:

Before running any script, ensure
```
minimax-output/
```
exists in the agent's working directory (create if needed:
```
mkdir -p minimax-output
```
)
Always use absolute or relative paths from the agent's working directory:
```
--output minimax-output/video.mp4
```
Never
```
cd
```
into the skill directory to run scripts — run from the agent's working directory using the full script path
Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in
```
minimax-output/tmp/
```
. They can be cleaned up when no longer needed:
```
rm -rf minimax-output/tmp
```

Prerequisites

bash

brew install ffmpeg jq              # macOS (or apt install ffmpeg jq on Linux)
bash scripts/check_environment.sh

No Python or pip required — all scripts are pure bash using

curl

ffmpeg

jq

, and

xxd

API Host Configuration

MiniMax provides two service endpoints for different regions. Set

MINIMAX_API_HOST

before running any script:

Region	Platform URL	API Host Value
China Mainland（中国大陆）	https://platform.minimaxi.com	`https://api.minimaxi.com`
Global（全球）	https://platform.minimax.io	`https://api.minimax.io`

bash

# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"

# or Global
export MINIMAX_API_HOST="https://api.minimax.io"

IMPORTANT — When API Host is missing: Before running any script, check if

MINIMAX_API_HOST

is set in the environment. If it is NOT configured:

Ask the user which service endpoint their MiniMax account uses:
- China Mainland →
```
https://api.minimaxi.com
```
- Global →
```
https://api.minimax.io
```
Instruct and help user to set it via
```
export MINIMAX_API_HOST="https://api.minimaxi.com"
```
(or the global variant) in their terminal or add it to their shell profile (
```
~/.zshrc
```
/
```
~/.bashrc
```
) for persistence

API Key Configuration

Set the

MINIMAX_API_KEY

environment variable before running any script:

bash

export MINIMAX_API_KEY="your-api-key-here"

The key starts with

sk-api-

sk-cp-

, obtainable from https://platform.minimaxi.com (China) or https://platform.minimax.io (Global)

IMPORTANT — When API Key is missing: Before running any script, check if

MINIMAX_API_KEY

is set in the environment. If it is NOT configured:

Ask the user to provide their MiniMax API key
Instruct and help user to set it via
```
export MINIMAX_API_KEY="sk-..."
```
in their terminal or add it to their shell profile (
```
~/.zshrc
```
/
```
~/.bashrc
```
) for persistence

Key Capabilities

Capability	Description	Entry point
TTS	Text-to-speech synthesis with multiple voices and emotions	`scripts/tts/generate_voice.sh`
Voice Cloning	Clone a voice from an audio sample (10s–5min)	`scripts/tts/generate_voice.sh clone`
Voice Design	Create a custom voice from a text description	`scripts/tts/generate_voice.sh design`
Music Generation	Generate songs with lyrics or instrumental tracks	`scripts/music/generate_music.sh`
Image Generation	Text-to-image, image-to-image with character reference	`scripts/image/generate_image.sh`
Video Generation	Text-to-video, image-to-video, subject reference, templates	`scripts/video/generate_video.sh`
Long Video	Multi-scene chained video with crossfade transitions	`scripts/video/generate_long_video.sh`
Media Tools	Audio/video format conversion, concatenation, trimming, extraction	`scripts/media_tools.sh`

TTS (Text-to-Speech)

Entry point:

scripts/tts/generate_voice.sh

IMPORTANT: Single voice vs Multi-segment — Choose the right approach

User intent	Approach
Single voice / no multi-character need	`tts` command — generate the entire text in one call
Multiple characters / narrator + dialogue	`generate` command with segments.json

Default behavior: When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the

tts

command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to

tts

in one call.

Only use multi-segment

generate

when:

The user explicitly needs multiple voices/characters
The text requires narrator + character dialogue separation
The text exceeds 10,000 characters (API limit per request) — in this case, split into segments with the same voice

Single-voice generation (DEFAULT)

bash

bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3

Multi-segment generation (multi-voice / audiobook / podcast)

Complete workflow — follow ALL steps in order:

Write segments.json — split text into segments with voice assignments (see format and rules below)
Run
generate
command — this reads segments.json, generates audio for EACH segment via TTS API, then merges them into a single output file with crossfade

bash

# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)

# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
  -o minimax-output/output.mp3 --crossfade 200

Do NOT skip Step 2. Writing segments.json alone does nothing — you MUST run the

generate

command to actually produce audio.

Voice management

bash

# List all available voices
bash scripts/tts/generate_voice.sh list-voices

# Voice cloning (from audio sample, 10s–5min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice

# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator

Audio processing

bash

bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3

TTS Models

Model	Notes
speech-2.8-hd	Recommended, auto emotion matching
speech-2.8-turbo	Faster variant
speech-2.6-hd	Previous gen, manual emotion
speech-2.6-turbo	Previous gen, faster

segments.json Format

Default crossfade between segments: 200ms (

--crossfade 200

json

[
  { "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
  { "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]

Leave

emotion

empty for speech-2.8 models (auto-matched from text).

IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.

Rule: Narration and dialogue are ALWAYS separate segments.

A sentence like

"Tom said: The weather is great today!"

must be split into two segments:

Segment 1 (narrator voice):
```
"Tom said:"
```
Segment 2 (character voice):
```
"The weather is great today!"
```

Example — Audiobook with narrator + 2 characters:

json

[
  { "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
  { "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
  { "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
  { "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]

Key principles:

Narrator uses a consistent neutral narrator voice throughout
Each character has a dedicated voice_id, maintained consistently across all their dialogue
Split at dialogue boundaries —
```
"He said:"
```
is narrator, the quoted content is the character
Do NOT merge narrator text and character speech into a single segment
For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments

Music Generation

Entry point:

scripts/music/generate_music.sh

IMPORTANT: Instrumental vs Lyrics — When to use which

Scenario	Mode	Action
BGM for video / voice / podcast	Instrumental (default)	Use `--instrumental` directly, do NOT ask user
User explicitly asks to "create music" / "make a song"	Ask user first	Ask whether they want instrumental or with lyrics

When adding background music to video or voice content, always default to instrumental mode (

--instrumental

). Do not ask the user — BGM should never have vocals competing with the main content.

When the user explicitly asks to create/generate music as the primary task, ask them whether they want:

Instrumental (pure music, no vocals)
With lyrics (song with vocals — user provides or you help write lyrics)

bash

# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
  --instrumental \
  --prompt "ambient electronic, atmospheric" \
  --output minimax-output/ambient.mp3 --download

# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
  --prompt "indie folk, melancholic" \
  --output minimax-output/song.mp3 --download

# With style fields
bash scripts/music/generate_music.sh \
  --lyrics "[verse]\nLyrics here" \
  --genre "pop" --mood "upbeat" --tempo "fast" \
  --output minimax-output/pop_track.mp3 --download

Music Model

Default model:

music-2.5

music-2.5

does not support

--instrumental

directly. When instrumental music is needed, the script automatically applies a workaround:

Sets lyrics to
```
[intro] [outro]
```
(empty structural tags, no actual vocals), appends
```
pure music, no lyrics
```
to the prompt

This produces instrumental-style output without requiring manual intervention. You can always use

--instrumental

and the script handles the rest.

Image Generation

Entry point:

scripts/image/generate_image.sh

Model:

image-01

— photorealistic image generation from text prompts, with optional character reference for image-to-image.

IMPORTANT: Mode Selection — t2i vs i2i

User intent	Mode
Generate image from text description (default)	`t2i` — text-to-image
Generate image with a character reference photo (keep same person)	`i2i` — image-to-image

Default behavior: When the user asks to generate/create an image without mentioning a reference photo, use

t2i

mode (default). Only use

i2i

mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.

IMPORTANT: Aspect Ratio — Infer from user context

Do NOT always default to

1:1

. Analyze the user's request and choose the most appropriate aspect ratio:

User intent / context	Recommended ratio	Resolution
头像、图标、社交媒体头像、avatar、icon、profile pic	`1:1`	1024×1024
风景、横幅、桌面壁纸、landscape、banner、desktop wallpaper	`16:9`	1280×720
传统照片、经典比例、classic photo	`4:3`	1152×864
摄影作品、杂志封面、photography、magazine	`3:2`	1248×832
人像竖图、海报、portrait photo、poster	`2:3`	832×1248
竖版海报、书籍封面、tall poster、book cover	`3:4`	864×1152
手机壁纸、社交媒体故事、phone wallpaper、story、reel	`9:16`	720×1280
超宽全景、电影画幅、panoramic、cinematic ultrawide	`21:9`	1344×576
未指定特定需求 / ambiguous	`1:1`	1024×1024

IMPORTANT: Image Count — When to generate multiple images

User intent	Count ( `-n` )
Default / single image request	`1` (default)
用户说"几张"、"多张"、"一些" / "a few", "several"	`3`
用户说"多种方案"、"备选" / "variations", "options"	`3` – `4`
用户明确指定数量	Use the specified number (1–9)

Text-to-Image Examples

bash

# Basic text-to-image
bash scripts/image/generate_image.sh \
  --prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
  -o minimax-output/cat.png

# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
  --prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
  --aspect-ratio 16:9 \
  -o minimax-output/landscape.png

# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
  --prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
  --aspect-ratio 9:16 \
  -o minimax-output/wallpaper.png

# Multiple variations
bash scripts/image/generate_image.sh \
  --prompt "Abstract geometric art, vibrant colors" \
  -n 3 \
  -o minimax-output/art.png

# With prompt optimizer
bash scripts/image/generate_image.sh \
  --prompt "A man standing on Venice Beach, 90s documentary style" \
  --aspect-ratio 16:9 --prompt-optimizer \
  -o minimax-output/beach.png

# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
  --prompt "Product photo of a luxury watch on marble surface" \
  --width 1024 --height 768 \
  -o minimax-output/watch.png

Image-to-Image (Character Reference)

Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).

bash

# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A girl looking into the distance from a library window, warm afternoon light" \
  --ref-image face.jpg \
  --aspect-ratio 16:9 \
  -o minimax-output/girl_library.png

# Multiple character variations
bash scripts/image/generate_image.sh \
  --mode i2i \
  --prompt "A woman in a red dress at a gala event, elegant, cinematic" \
  --ref-image face.jpg -n 3 \
  -o minimax-output/gala.png

Aspect Ratio Reference

Ratio	Resolution	Best for
`1:1`	1024×1024	Default, avatars, icons, social media
`16:9`	1280×720	Landscape, banner, desktop wallpaper
`4:3`	1152×864	Classic photo, presentations
`3:2`	1248×832	Photography, magazine layout
`2:3`	832×1248	Portrait photo, poster
`3:4`	864×1152	Book cover, tall poster
`9:16`	720×1280	Phone wallpaper, social story/reel
`21:9`	1344×576	Ultra-wide panoramic, cinematic

Key Options

Option	Description
`--prompt TEXT`	Image description, max 1500 chars (required)
`--aspect-ratio RATIO`	Aspect ratio (see table above). Infer from user context
`--width PX` / `--height PX`	Custom size, 512–2048, must be multiple of 8, both required together. Overridden by `--aspect-ratio` if both set
`-n N`	Number of images to generate, 1–9 (default 1)
`--seed N`	Random seed for reproducibility. Same seed + same params → similar results
`--prompt-optimizer`	Enable automatic prompt optimization by the API
`--ref-image FILE`	Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB)
`--no-download`	Print image URLs instead of downloading files
`--aigc-watermark`	Add AIGC watermark to generated images

Video Generation

IMPORTANT: Single vs Multi-Segment — Choose the right script

User intent	Script to use
Default / no special request	`scripts/video/generate_video.sh` (single segment, 10s, 768P)
User explicitly asks for "long video", "multi-scene", "story", or duration > 10s	`scripts/video/generate_long_video.sh` (multi-segment)

Default behavior: Always use single-segment

generate_video.sh

with duration 10s and resolution 768P unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 10 seconds. Do NOT automatically split into multiple segments — a single 10s video is the standard output. Only use

generate_long_video.sh

when the user clearly needs multi-scene or longer content.

Entry point (single video):

scripts/video/generate_video.sh

Entry point (long/multi-scene):

scripts/video/generate_long_video.sh

Video Model Constraints (MUST follow)

Duration limits by model and resolution:

Model	720P	768P	1080P
MiniMax-Hailuo-2.3	-	6s or 10s	6s only
MiniMax-Hailuo-2.3-Fast	-	6s or 10s	6s only
MiniMax-Hailuo-02	-	6s or 10s	6s only
T2V-01 / T2V-01-Director	6s only	-	-
I2V-01 / I2V-01-Director / I2V-01-live	6s only	-	-
S2V-01 (ref)	6s only	-	-

Resolution options by model and duration:

Model	6s	10s
MiniMax-Hailuo-2.3	768P (default), 1080P	768P only
MiniMax-Hailuo-2.3-Fast	768P (default), 1080P	768P only
MiniMax-Hailuo-02	512P, 768P (default), 1080P	512P, 768P (default)
Other models	720P (default)	Not supported

Key rules:

Default: 10s + 768P (best balance of length and quality for MiniMax-Hailuo-2.3)
1080P only supports 6s duration — if user requests 1080P, set
```
--duration 6
```
10s duration only works with 768P (or 512P on Hailuo-02) — never combine 10s + 1080P
Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P

IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Before calling any video generation script, you MUST optimize the user's prompt by reading and applying

references/video-prompt-guide.md

. Never pass the user's raw description directly as

--prompt

Optimization steps:

Apply the Professional Formula:

Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere

BAD:
```
"A puppy in a park"
```

GOOD:

"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"

Add camera instructions using

[指令]

syntax:

[推进]

[拉远]

[跟随]

[固定]

[左摇]

, etc.

Include aesthetic details: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
Keep to 1-2 key actions for 6-10 second videos — do not overcrowd with events
For i2v mode (image-to-video): Focus prompt on movement and change only, since the image already establishes the visual. Do NOT re-describe what's in the image.
- BAD:
```
"A lake with mountains"
```
  (just repeating the image)
- GOOD:
```
"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"
```
For multi-segment long videos: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.

bash

# Text-to-video (default: 10s, 768P)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
  --output minimax-output/puppy.mp4

# Text-to-video with 1080P (must use --duration 6)
bash scripts/video/generate_video.sh \
  --mode t2v \
  --prompt "A golden retriever puppy bounds toward the camera" \
  --duration 6 --resolution 1080P \
  --output minimax-output/puppy_hd.mp4

# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
  --mode i2v \
  --prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
  --first-frame photo.jpg \
  --output minimax-output/animated.mp4

# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
  --mode sef \
  --first-frame start.jpg --last-frame end.jpg \
  --output minimax-output/transition.mp4

# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
  --mode ref \
  --prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
  --subject-image face.jpg \
  --duration 6 \
  --output minimax-output/person.mp4

Long-form Video (Multi-scene)

Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 10 seconds per segment.

Workflow:

Segment 1: t2v — generated purely from the optimized text prompt
Segment 2+: i2v — the previous segment's last frame becomes
```
first_frame_image
```
, prompt describes motion and change from that ending state
All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
Optional: AI-generated background music is overlaid

Prompt rules for each segment:

Each segment prompt MUST be independently optimized using the Professional Formula
Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
Segment 2+ (i2v): Focus on what changes and moves from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
Each segment covers only 10 seconds of action — keep it focused

bash

# Example: 3-segment story with optimized per-segment prompts (default: 10s/segment, 768P)
bash scripts/video/generate_long_video.sh \
  --scenes \
    "A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
    "The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
    "The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
  --music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
  --output minimax-output/long_video.mp4

# With custom settings
bash scripts/video/generate_long_video.sh \
  --scenes "Scene 1 prompt" "Scene 2 prompt" \
  --segment-duration 10 \
  --resolution 768P \
  --crossfade 0.5 \
  --music-prompt "calm ambient background music" \
  --output minimax-output/long_video.mp4

Add Background Music

bash

bash scripts/video/add_bgm.sh \
  --video input.mp4 \
  --generate-bgm --instrumental \
  --music-prompt "soft piano background" \
  --bgm-volume 0.3 \
  --output minimax-output/output_with_bgm.mp4

Template Video

bash

bash scripts/video/generate_template_video.sh \
  --template-id 392753057216684038 \
  --media photo.jpg \
  --output minimax-output/template_output.mp4

Video Models

Mode	Default Model	Default Duration	Default Resolution	Notes
t2v	MiniMax-Hailuo-2.3	10s	768P	Latest text-to-video
i2v	MiniMax-Hailuo-2.3	10s	768P	Latest image-to-video
sef	MiniMax-Hailuo-02	6s	768P	Start-end frame
ref	S2V-01	6s	720P	Subject reference, 6s only

Media Tools (Audio/Video Processing)

Entry point:

scripts/media_tools.sh

Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via MiniMax API.

Video Format Conversion

bash

# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o output.mov

# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o output.mp4 \
  --crf 18 --preset medium --resolution 1920x1080 --fps 30

Audio Format Conversion

bash

# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
bash scripts/media_tools.sh convert-audio input.mp3 -o output.flac \
  --bitrate 320k --sample-rate 48000 --channels 2

Video Concatenation

bash

# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o merged.mp4

# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4 --crossfade 0

Audio Concatenation

bash

# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3

# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3 --crossfade 1

Extract Audio from Video

bash

# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.mp3

# Extract as wav with higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o audio.wav --bitrate 320k

Video Trimming

bash

# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 5 --end 15

# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o clip.mp4 --start 10 --duration 8

Add Audio to Video (Overlay / Replace)

bash

# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4 \
  --volume 0.3 --fade-in 2 --fade-out 3

# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o output.mp4 \
  --replace

Media File Info

bash

bash scripts/media_tools.sh probe input.mp4

Script Architecture

scripts/
├── check_environment.sh         # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh               # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh        # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh        # Music generation CLI
├── image/
│   └── generate_image.sh        # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh        # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh   # Multi-scene long video
    ├── generate_template_video.sh # Template-based video
    └── add_bgm.sh              # Background music overlay

References

Read these for detailed API parameters, voice catalogs, and prompt engineering:

tts-guide.md — TTS setup, voice management, audio processing, segment format, troubleshooting
tts-voice-catalog.md — Full voice catalog with IDs, descriptions, and parameter reference
music-api.md — Music generation API: endpoints, parameters, response format
image-api.md — Image generation API: text-to-image, image-to-image, parameters
video-api.md — Video API: endpoints, models, parameters, camera instructions, templates
video-prompt-guide.md — Video prompt engineering: formulas, styles, image-to-video tips

minimax-multimodal-toolkit

NPX Install

Tags

SKILL.md Content

MiniMax Multi-Modal Toolkit

Output Directory

Prerequisites

API Host Configuration

API Key Configuration

Key Capabilities

TTS (Text-to-Speech)

IMPORTANT: Single voice vs Multi-segment — Choose the right approach

Single-voice generation (DEFAULT)

Multi-segment generation (multi-voice / audiobook / podcast)

Voice management

Audio processing

TTS Models

segments.json Format

IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)

Music Generation

IMPORTANT: Instrumental vs Lyrics — When to use which

Music Model

Image Generation

IMPORTANT: Mode Selection — t2i vs i2i

IMPORTANT: Aspect Ratio — Infer from user context

IMPORTANT: Image Count — When to generate multiple images

Text-to-Image Examples

Image-to-Image (Character Reference)

Aspect Ratio Reference

Key Options

Video Generation

IMPORTANT: Single vs Multi-Segment — Choose the right script

Video Model Constraints (MUST follow)

IMPORTANT: Prompt Optimization (MUST follow before generating any video)

Long-form Video (Multi-scene)

Add Background Music

Template Video

Video Models

Media Tools (Audio/Video Processing)

Video Format Conversion

Audio Format Conversion

Video Concatenation

Audio Concatenation

Extract Audio from Video

Video Trimming

Add Audio to Video (Overlay / Replace)

Media File Info

Script Architecture

References