ai-multimodal

AI Multimodal

Process audio, images, videos, documents, and generate images/videos using Google Gemini's multimodal API.

Setup

bash

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow

API Key Rotation (Optional)

For high-volume usage or when hitting rate limits, configure multiple API keys:

bash

# Primary key (required)
export GEMINI_API_KEY="key1"

# Additional keys for rotation (optional)
export GEMINI_API_KEY_2="key2"
export GEMINI_API_KEY_3="key3"

Or in your `.env` file:

GEMINI_API_KEY=key1
GEMINI_API_KEY_2=key2
GEMINI_API_KEY_3=key3

**Features:**
- Auto-rotates on rate limit (429/RESOURCE_EXHAUSTED) errors
- 60-second cooldown per key after rate limit
- Logs rotation events with `--verbose` flag
- Backward compatible: single key still works
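
The rotation behavior listed above (try the next key on a rate limit, rest each limited key for 60 seconds) can be sketched roughly as follows. This is an illustration only, not the skill's actual implementation: `collect_keys` and `KeyRotator` are hypothetical names, and the only details taken from this document are the `GEMINI_API_KEY`, `GEMINI_API_KEY_2`, `GEMINI_API_KEY_3` naming pattern and the 60-second cooldown.

```python
import time
from typing import Callable, Dict, List, Optional

COOLDOWN_SECONDS = 60  # per-key cooldown after a rate limit, as listed above


def collect_keys(env: Dict[str, str]) -> List[str]:
    """Return GEMINI_API_KEY followed by GEMINI_API_KEY_2, _3, ... in order."""
    keys: List[str] = []
    primary = env.get("GEMINI_API_KEY")
    if primary:
        keys.append(primary)
    index = 2
    # Stop at the first missing numbered key (assumed convention).
    while env.get(f"GEMINI_API_KEY_{index}"):
        keys.append(env[f"GEMINI_API_KEY_{index}"])
        index += 1
    return keys


class KeyRotator:
    """Hypothetical helper: hands out keys and benches rate-limited ones."""

    def __init__(self, keys: List[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._keys = list(keys)
        self._clock = clock
        self._cooling_until: Dict[str, float] = {}

    def current(self) -> Optional[str]:
        """Return the first key that is not cooling down, or None if all are."""
        now = self._clock()
        for key in self._keys:
            if self._cooling_until.get(key, 0.0) <= now:
                return key
        return None

    def mark_rate_limited(self, key: str) -> None:
        """Call after a 429/RESOURCE_EXHAUSTED response for this key."""
        self._cooling_until[key] = self._clock() + COOLDOWN_SECONDS
```

A caller would pass `dict(os.environ)` to `collect_keys`, ask the rotator for `current()` before each request, and call `mark_rate_limited()` when a request fails with a rate-limit error. With a single key the rotator simply returns that key, which matches the backward-compatible case.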

Quick Start

Verify setup:

python scripts/check_setup.py

Analyze media:

python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

TIP: When you're asked to analyze an image, check whether the `gemini` command is available. If it is, use `echo "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash`. If it is not, use `python scripts/gemini_batch_process.py --files <file> --task analyze`.

Generate content:

python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"

Stdin support: You can pipe files directly via stdin (auto-detects PNG/JPG/PDF/WAV/MP3).

- `cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"`
- `python scripts/gemini_batch_process.py --files image.png --task analyze` (traditional)
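
Auto-detecting the format of piped input usually comes down to checking the first few bytes of the stream. The sketch below shows one way this could work for the five formats named above. It is an assumption about the general technique, not the script's actual code, and `sniff_mime` is a hypothetical name.

```python
from typing import Optional


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading bytes; return None if unrecognized."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):      # PNG signature
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):            # JPEG start-of-image marker
        return "image/jpeg"
    if data.startswith(b"%PDF-"):                   # PDF header
        return "application/pdf"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":  # RIFF container holding WAVE
        return "audio/wav"
    # MP3: either an ID3v2 tag or an MPEG audio frame sync (11 set bits).
    if data.startswith(b"ID3"):
        return "audio/mpeg"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

In a CLI this would typically be called on `sys.stdin.buffer.read()`, falling back to an error message when the result is `None`.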

Models

- Image generation: `imagen-4.0-generate-001` (standard), `imagen-4.0-ultra-generate-001` (quality), `imagen-4.0-fast-generate-001` (speed)
- Video generation: `veo-3.1-generate-preview` (8s clips with audio)
- Analysis: `gemini-2.5-flash` (recommended), `gemini-2.5-pro` (advanced)
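
For scripting, the model list above can be kept as a small lookup table. The sketch below is illustrative only: the model identifiers and tier labels come from this document, while the `MODELS` layout, the `pick_model` helper, and the choice of the first-listed tier as the default are assumptions.

```python
from typing import Dict, Optional

# Model identifiers as listed in this document, grouped by kind and tier.
MODELS: Dict[str, Dict[str, str]] = {
    "image": {
        "standard": "imagen-4.0-generate-001",
        "quality": "imagen-4.0-ultra-generate-001",
        "speed": "imagen-4.0-fast-generate-001",
    },
    "video": {
        "default": "veo-3.1-generate-preview",
    },
    "analysis": {
        "recommended": "gemini-2.5-flash",
        "advanced": "gemini-2.5-pro",
    },
}


def pick_model(kind: str, tier: Optional[str] = None) -> str:
    """Return the model for a kind and tier; default to the first listed tier."""
    tiers = MODELS[kind]  # raises KeyError for an unknown kind
    if tier is None:
        return next(iter(tiers.values()))
    return tiers[tier]    # raises KeyError for an unknown tier
```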

Scripts

- `gemini_batch_process.py`: CLI orchestrator for `transcribe|analyze|extract|generate|generate-video` that auto-resolves API keys, picks sensible default models per task, sends files inline or through the File API, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 + Veo workflows.
- `media_optimizer.py`: ffmpeg/Pillow-based preflight tool that compresses/resizes/converts audio, image, and video inputs, enforces target sizes/bitrates, splits long clips into hour-long chunks, and batch-processes directories so media stays within Gemini limits.
- `document_converter.py`: Gemini-powered converter that uploads PDFs/images/Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs under `docs/assets`, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.
- `check_setup.py`: Interactive readiness checker that verifies directory layout, centralized env resolver, required Python deps, and GEMINI_API_KEY availability/format, then performs a live Gemini API call and prints remediation instructions if anything fails.

Use `--help` for options.
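
The inline vs File API choice mentioned for `gemini_batch_process.py` can be pictured as a simple size check. The sketch below uses the 20MB inline and 2GB File API figures from the Limits section of this document. The function name is hypothetical, and treating MB/GB as binary units is an assumption, since the document does not say which is meant.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024            # 20MB inline limit (binary units assumed)
FILE_API_LIMIT_BYTES = 2 * 1024 * 1024 * 1024    # 2GB File API limit (binary units assumed)


def choose_upload_route(size_bytes: int) -> str:
    """Return 'inline', 'file_api', or 'too_large' for a file of this size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    return "too_large"
```

A caller would typically pass `os.path.getsize(path)` and run files that come back as `too_large` through `media_optimizer.py` first.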

References

Load for detailed guidance:

| Topic | File | Description |
| --- | --- | --- |
| Music | `references/music-generation.md` | Lyria RealTime API for background music generation, style prompts, real-time control, integration with video production. |
| Audio | `references/audio-processing.md` | Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | `references/vision-understanding.md` | Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | `references/image-generation.md` | Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | `references/video-analysis.md` | Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | `references/video-generation.md` | Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |

Limits

Formats: Audio (WAV/MP3/AAC, 9.5h), Images (PNG/JPEG/WEBP, 3.6k), Video (MP4/MOV, 6h), PDF (1k pages)

Size: 20MB inline, 2GB File API

Important:

- If you are transcribing audio longer than 15 minutes, the transcript often gets truncated because of output token limits in the Gemini API response. To get the full transcript, split the audio into smaller chunks (max 15 minutes per chunk), transcribe each chunk, and combine the results.
- If you are transcribing video longer than 15 minutes, use ffmpeg to extract the audio from the video, split the audio into chunks of at most 15 minutes, transcribe all audio segments, and then combine the transcripts into a single transcript.

Transcription Output Requirements:

- Format: Markdown
- Metadata: Duration, file size, generated date, description, file name, topics covered, etc.
- Parts: from-to (e.g., 00:00-00:15), audio chunk name, transcript, status, etc.

Transcript format:

[HH:MM:SS -> HH:MM:SS] transcript content
[HH:MM:SS -> HH:MM:SS] transcript content
...
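
The chunk-and-combine workflow described above can be sketched as a few small helpers: one that plans 15-minute ranges, one that builds an ffmpeg argument list for each range, and one that renders a line in the transcript format. All function names here are hypothetical. The ffmpeg options used (`-ss`, `-t`, `-i`, `-vn`, `-y`) are standard ffmpeg options, but the exact command the skill's scripts run is not stated in this document.

```python
from typing import List, Tuple

CHUNK_SECONDS = 15 * 60  # max 15 minutes per chunk, as recommended above


def plan_chunks(duration_seconds: float,
                chunk_seconds: int = CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Split [0, duration) into consecutive (start, end) ranges in seconds."""
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def ffmpeg_args(source: str, start: float, end: float, out_path: str) -> List[str]:
    """Build an ffmpeg command that extracts one audio-only chunk.

    -ss/-t before -i seek and limit the input; -vn drops any video stream,
    so the same command works for audio and video sources.
    """
    return ["ffmpeg", "-y", "-ss", str(start), "-t", str(end - start),
            "-i", source, "-vn", out_path]


def hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def transcript_line(start: float, end: float, text: str) -> str:
    """Render one line in the '[HH:MM:SS -> HH:MM:SS] text' format."""
    return f"[{hms(start)} -> {hms(end)}] {text}"
```

Each argument list can be passed to `subprocess.run(args, check=True)`. When combining per-chunk transcripts, timestamps returned for a chunk are relative to that chunk, so add the chunk's start offset before calling `transcript_line`.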
