ai-multimodal
AI Multimodal Processing Skill
Process audio, images, videos, and documents, and generate images, with Google Gemini's multimodal API. A unified interface for understanding and generating all multimedia content.
Core Capabilities
Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice
Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction
Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis
Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)
Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality
Capability Matrix
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
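The matrix can also be encoded as data, so a script can reject an unsupported task and modality pair before it sends a request. The sketch below is a hypothetical helper and not part of the skill's scripts; the names `CAPABILITIES` and `supports` are assumptions, and the TTS entry in the Creation row is treated as "supported for audio".

```python
# Hypothetical encoding of the capability matrix above; not part of the skill's scripts.
CAPABILITIES = {
    "transcription":     {"audio", "video"},
    "summarization":     {"audio", "image", "video", "document"},
    "qa":                {"audio", "image", "video", "document"},
    "object_detection":  {"image", "video"},
    "text_extraction":   {"image", "document"},
    "structured_output": {"audio", "image", "video", "document"},
    "creation":          {"audio", "generation"},  # audio means TTS only
    "timestamps":        {"audio", "video"},
    "segmentation":      {"image"},
}


def supports(task: str, modality: str) -> bool:
    """Return True if the matrix marks `task` as available for `modality`."""
    return modality in CAPABILITIES.get(task, set())


print(supports("transcription", "video"))  # True
print(supports("segmentation", "video"))   # False
```

Unknown task names return False rather than raising, which keeps the check safe to call on user input.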
Model Selection Guide
Gemini 2.5 Series (Recommended)
- gemini-2.5-pro: Highest quality, all features, 1M-token context
- gemini-2.5-flash: Best balance, all features, 1M-token context
- gemini-2.5-flash-lite: Lightweight, segmentation support
- gemini-2.5-flash-image: Image generation only
Feature Requirements
- Segmentation: Requires 2.5+ models
- Object Detection: Requires 2.0+ models
- Multi-video: Requires 2.5+ models
- Image Generation: Requires flash-image model
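The requirements above amount to a version check on the model name. The following is a minimal sketch under that assumption; the feature keys (`segmentation`, `object_detection`, `multi_video`, `image_generation`) and the function names are hypothetical, and the rule that a `flash-image` model does image generation only is taken from the model list above.

```python
import re

# Minimum model version per feature, as stated in the requirements above.
MIN_VERSION = {
    "segmentation": (2, 5),
    "object_detection": (2, 0),
    "multi_video": (2, 5),
}


def model_version(model):
    """Extract (major, minor) from a name such as 'gemini-2.5-flash'."""
    match = re.match(r"gemini-(\d+)\.(\d+)", model)
    if match is None:
        raise ValueError(f"unrecognized model name: {model}")
    return int(match.group(1)), int(match.group(2))


def model_supports(model, feature):
    """Check a feature against the requirements listed above."""
    if "flash-image" in model:
        # The image model is listed as "image generation only".
        return feature == "image_generation"
    if feature == "image_generation":
        return False
    return model_version(model) >= MIN_VERSION[feature]


print(model_supports("gemini-2.5-flash", "segmentation"))  # True
print(model_supports("gemini-2.0-flash", "segmentation"))  # False
```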
Context Windows
- 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
- 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
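These rates make it possible to estimate, before uploading, whether a job fits a given context window. The sketch below is illustrative only: the function names are assumptions, the video rates (~300 and ~100 tokens/second) are the approximate figures listed under Token Rates in the Cost Optimization section, and image tokens are left out because they depend on tiling.

```python
# Token rates quoted in this document; the video rates are approximate.
AUDIO_TOKENS_PER_SEC = 32
VIDEO_TOKENS_PER_SEC = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_sec=0, video_sec=0, pdf_pages=0, video_res="default"):
    """Rough input-token estimate for a mixed media request."""
    return (
        audio_sec * AUDIO_TOKENS_PER_SEC
        + video_sec * VIDEO_TOKENS_PER_SEC[video_res]
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits(tokens, context_window=1_000_000):
    """True if the estimate fits the given context window."""
    return tokens <= context_window


print(estimate_tokens(audio_sec=60))             # 1920
print(fits(estimate_tokens(video_sec=45 * 60)))  # True (810,000 tokens)
```

The estimate ignores prompt and output tokens, so it is worth leaving some headroom below the window size.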
Quick Start
Prerequisites
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
- Process environment: export GEMINI_API_KEY="your-key"
- Project root: .env
- .claude/.env
- .claude/skills/.env
- .claude/skills/ai-multimodal/.env
Get API key: https://aistudio.google.com/apikey
For Vertex AI:
```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```
Install SDK:
```bash
pip install google-genai python-dotenv pillow
```
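The lookup order above can be sketched with the standard library alone. This is not the skill's actual implementation (which may well rely on python-dotenv); `find_api_key` and `read_env_file` are hypothetical names, and the parser handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path

# .env locations in the order described above, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def find_api_key(project_root=".", environ=None):
    """Return the first GEMINI_API_KEY found, or None if there is none."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]  # process environment wins
    for rel in ENV_FILES:
        path = Path(project_root) / rel
        if path.is_file():
            key = read_env_file(path).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Passing `environ` explicitly makes the function easy to test without touching the real process environment.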
Common Patterns
Transcribe Audio:
```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```
Analyze Image:
```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```
Process Video:
```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```
Extract from PDF:
```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```
Generate Image:
```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```
Optimize Media:
```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```
Convert Documents to Markdown:
```bash
# Convert a DOCX file to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```
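The gemini_batch_process.py commands above can also be assembled from Python, for example when another tool drives the script. The sketch below only builds the argument list and uses only flags that appear in the examples; `build_command` is a hypothetical helper, and passing several paths after `--files` is an assumption about the script's argument parsing.

```python
import shlex

SCRIPT = "scripts/gemini_batch_process.py"


def build_command(task, files=None, prompt=None, output=None, model=None):
    """Assemble an argument list from the flags shown in the examples above."""
    cmd = ["python", SCRIPT]
    if files:
        cmd += ["--files", *files]  # assumption: --files accepts several paths
    cmd += ["--task", task]
    if prompt:
        cmd += ["--prompt", prompt]
    if output:
        cmd += ["--output", output]
    if model:
        cmd += ["--model", model]
    return cmd


cmd = build_command(
    "analyze",
    files=["image.jpg"],
    prompt="Describe this image",
    model="gemini-2.5-flash",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with prompts that contain spaces or quotes.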
Supported Formats
Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono
Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled
Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)
Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)
Size Limits
- Inline: <20MB total request
- File API: 2GB per file, 20GB project quota
- Retention: 48 hours auto-delete
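These limits suggest a simple decision rule for how to send a file. The sketch below is one possible reading, not the skill's actual logic: `upload_strategy` is a hypothetical name, MB and GB are assumed to be binary units, and the inline limit applies to the whole request, so the prompt also counts against it.

```python
INLINE_LIMIT = 20 * 1024 * 1024  # "<20MB total request" (binary MB assumed)
FILE_API_LIMIT = 2 * 1024 ** 3   # "2GB per file" (binary GB assumed)


def upload_strategy(size_bytes):
    """Pick how to send a file based on the size limits listed above."""
    if size_bytes < INLINE_LIMIT:
        return "inline"
    if size_bytes <= FILE_API_LIMIT:
        return "file_api"
    # Too large even for the File API: compress or split it first.
    return "split_or_compress"


print(upload_strategy(5 * 1024 * 1024))    # inline
print(upload_strategy(300 * 1024 * 1024))  # file_api
```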
Reference Navigation
For detailed implementation guidance, see:
Audio Processing
- references/audio-processing.md - Transcription, analysis, TTS
  - Timestamp handling and segment analysis
  - Multi-speaker identification
  - Non-speech audio analysis
  - Text-to-speech generation
Image Understanding
- references/vision-understanding.md - Captioning, detection, OCR
  - Object detection and localization
  - Pixel-level segmentation
  - Visual question answering
  - Multi-image comparison
Video Analysis
- references/video-analysis.md - Scene detection, temporal understanding
  - YouTube URL processing
  - Timestamp-based queries
  - Video clipping and FPS control
  - Long video optimization
Document Extraction
- references/document-extraction.md - PDF processing, structured output
  - Table and form extraction
  - Chart and diagram analysis
  - JSON schema validation
  - Multi-page handling
Image Generation
- references/image-generation.md - Text-to-image, editing
  - Prompt engineering strategies
  - Image editing and composition
  - Aspect ratio selection
  - Safety settings
Cost Optimization
Token Costs
Model Pricing (input and output):
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
Token Rates:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size
TTS Pricing:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens
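Combining the token rates with the per-model input prices gives a quick cost estimate. The sketch below copies the input prices exactly as listed in this section, so treat them as placeholders to verify against current pricing; `input_cost` is a hypothetical helper, and output tokens are not included.

```python
# Input prices in USD per 1M tokens, copied from the list above (verify before use).
INPUT_PRICE_PER_M = {
    "gemini-2.5-flash": 1.00,
    "gemini-2.5-pro": 3.00,
    "gemini-1.5-flash": 0.70,
}
AUDIO_TOKENS_PER_SEC = 32
PDF_TOKENS_PER_PAGE = 258


def input_cost(model, tokens):
    """Input cost in USD for `tokens` tokens on `model`."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_M[model]


# One hour of audio: 3600 s * 32 tokens/s = 115,200 tokens.
audio_tokens = 3600 * AUDIO_TOKENS_PER_SEC
print(f"${input_cost('gemini-2.5-flash', audio_tokens):.4f}")  # $0.1152

# A 1,000-page PDF: 1000 * 258 = 258,000 tokens.
pdf_tokens = 1000 * PDF_TOKENS_PER_PAGE
print(f"${input_cost('gemini-2.5-pro', pdf_tokens):.4f}")  # $0.7740
```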
Best Practices
- Use gemini-2.5-flash for most tasks (best price/performance)
- Use File API for files >20MB or repeated queries
- Optimize media before upload (see media_optimizer.py)
- Process specific segments instead of full videos
- Use lower FPS for static content
- Implement context caching for repeated queries
- Batch process multiple files in parallel
Rate Limits
Free Tier:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)
YouTube Limits:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only
Storage Limits:
- 20GB per project
- 2GB per file
- 48-hour retention
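A client-side limiter helps stay under the free-tier request rate. The sketch below is a generic sliding-window limiter, not the logic used by the skill's scripts; the class name and the default of 10 requests per 60 seconds (the lower bound of the RPM range above) are assumptions. The clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_requests` calls per `window` seconds (sliding window)."""

    def __init__(self, max_requests=10, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent requests, oldest first

    def acquire(self):
        """Block until another request is allowed, then record it."""
        while True:
            now = self.clock()
            # Drop timestamps that have left the window.
            while self.calls and now - self.calls[0] >= self.window:
                self.calls.popleft()
            if len(self.calls) < self.max_requests:
                self.calls.append(now)
                return
            # Wait until the oldest request falls out of the window.
            self.sleep(self.window - (now - self.calls[0]))


limiter = RateLimiter(max_requests=10, window=60.0)
limiter.acquire()  # call this before each API request
```

This only throttles requests per minute; token-per-minute and per-day quotas would need separate accounting.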
Error Handling
Common errors and solutions:
- 400: Invalid format/size - validate before upload
- 401: Invalid API key - check configuration
- 403: Permission denied - verify API key restrictions
- 404: File not found - ensure file uploaded and active
- 429: Rate limit exceeded - implement exponential backoff
- 500: Server error - retry with backoff
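The retry advice for 429 and 500 can be sketched as exponential backoff with jitter, while the other status codes fail immediately because retrying will not fix them. `ApiError` is a hypothetical stand-in for whatever exception the client raises, and the attempt count and delays are illustrative defaults.

```python
import random
import time

RETRYABLE = {429, 500}  # rate limit and server errors, per the list above


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE or is_last:
                raise  # 400/401/403/404, or retries exhausted
            # Delay doubles each attempt: 1s, 2s, 4s, ... plus random jitter.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
```

The jitter spreads out retries from parallel workers so they do not all hit the API again at the same instant.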
Scripts Overview
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode
media_optimizer.py: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization
document_converter.py: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support
Run any script with --help for detailed usage.