ai-multimodal

AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

Core Capabilities

Audio Processing

  • Transcription with timestamps (up to 9.5 hours)
  • Audio summarization and analysis
  • Speech understanding and speaker identification
  • Music and environmental sound analysis
  • Text-to-speech generation with controllable voice

Image Understanding

  • Image captioning and description
  • Object detection with bounding boxes (2.0+)
  • Pixel-level segmentation (2.5+)
  • Visual question answering
  • Multi-image comparison (up to 3,600 images)
  • OCR and text extraction

Video Analysis

  • Scene detection and summarization
  • Video Q&A with temporal understanding
  • Transcription with visual descriptions
  • YouTube URL support
  • Long video processing (up to 6 hours)
  • Frame-level analysis

Document Extraction

  • Native PDF vision processing (up to 1,000 pages)
  • Table and form extraction
  • Chart and diagram analysis
  • Multi-page document understanding
  • Structured data output (JSON schema)
  • Format conversion (PDF to HTML/JSON)

Image Generation

  • Text-to-image generation
  • Image editing and modification
  • Multi-image composition (up to 3 images)
  • Iterative refinement
  • Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
  • Controllable style and quality

Capability Matrix

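| Task | Audio | Image | Video | Document | Generation |
|------|-------|-------|-------|----------|------------|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |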

Model Selection Guide

Gemini 2.5 Series (Recommended)

  • gemini-2.5-pro: Highest quality, all features, 1M-2M context
  • gemini-2.5-flash: Best balance, all features, 1M-2M context
  • gemini-2.5-flash-lite: Lightweight, segmentation support
  • gemini-2.5-flash-image: Image generation only

Feature Requirements

  • Segmentation: Requires 2.5+ models
  • Object Detection: Requires 2.0+ models
  • Multi-video: Requires 2.5+ models
  • Image Generation: Requires flash-image model

Context Windows

  • 2M tokens: ~6 hours video (low-res) or ~2 hours (default)
  • 1M tokens: ~3 hours video (low-res) or ~1 hour (default)
  • Audio: 32 tokens/second (1 min = 1,920 tokens)
  • PDF: 258 tokens/page (fixed)
  • Image: 258-1,548 tokens based on size
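
The figures above lend themselves to a quick pre-flight check. The following is a minimal sketch, not part of the skill: the function names are hypothetical, and the video rates (~300 tokens/second at default resolution, ~100 at low resolution) are the approximate values from the Token Rates list under Cost Optimization, so results near a context limit are indicative only.

```python
# Per-unit token rates quoted in this document (video rates are approximate).
AUDIO_TOKENS_PER_SECOND = 32
VIDEO_TOKENS_PER_SECOND = {"default": 300, "low": 100}
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_seconds=0, video_seconds=0, pdf_pages=0,
                    video_resolution="default"):
    """Return a rough input-token estimate for a mixed-media request."""
    video_rate = VIDEO_TOKENS_PER_SECOND[video_resolution]
    return (
        audio_seconds * AUDIO_TOKENS_PER_SECOND
        + video_seconds * video_rate
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )


def fits_context(tokens, context_window=1_000_000):
    """True if the estimate fits within the given context window."""
    return tokens <= context_window


if __name__ == "__main__":
    print(estimate_tokens(audio_seconds=60))  # 1920, i.e. "1 min = 1,920 tokens"
    video = estimate_tokens(video_seconds=45 * 60)
    print(video, fits_context(video))
```

Images are left out on purpose: their cost depends on tiling, which the list above only gives as a 258-1,548 token range.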

Quick Start

Prerequisites

API Key Setup: Supports both Google AI Studio and Vertex AI.

The skill checks for GEMINI_API_KEY in this order:
  1. Process environment: export GEMINI_API_KEY="your-key"
  2. Project root: .env
  3. .claude/.env
  4. .claude/skills/.env
  5. .claude/skills/ai-multimodal/.env
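
The lookup order above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the skill's actual implementation: `find_api_key` and `read_env_file` are hypothetical names, and the parser only understands plain `KEY=VALUE` lines (the skill itself installs python-dotenv, which handles more syntax).

```python
import os
from pathlib import Path

# Candidate .env locations, in the order listed above (relative to the project root).
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path, name):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for relative in ENV_FILES:
        value = read_env_file(Path(project_root) / relative, name)
        if value:
            return value
    return None
```

The first match wins, so a key exported in the shell overrides anything stored in a `.env` file.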

For Vertex AI:

    export GEMINI_USE_VERTEX=true
    export VERTEX_PROJECT_ID=your-gcp-project-id
    export VERTEX_LOCATION=us-central1  # Optional

Install SDK:

    pip install google-genai python-dotenv pillow
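
One way these variables might map onto the google-genai client is sketched below. `client_kwargs` is a hypothetical helper, and falling back to `us-central1` when `VERTEX_LOCATION` is unset is an assumption; the text above only marks that variable as optional.

```python
import os


def client_kwargs(environ=None):
    """Map the environment variables above to google-genai Client arguments."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Assumed default; the document does not state one.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


# With the SDK installed, the result can be passed straight to the client:
#   from google import genai
#   client = genai.Client(**client_kwargs())
```

Keeping the mapping in a plain function makes it easy to test without network access or the SDK itself.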

Common Patterns

Transcribe Audio:

    python scripts/gemini_batch_process.py \
      --files audio.mp3 \
      --task transcribe \
      --model gemini-2.5-flash

Analyze Image:

    python scripts/gemini_batch_process.py \
      --files image.jpg \
      --task analyze \
      --prompt "Describe this image" \
      --output docs/assets/<output-name>.md \
      --model gemini-2.5-flash

Process Video:

    python scripts/gemini_batch_process.py \
      --files video.mp4 \
      --task analyze \
      --prompt "Summarize key points with timestamps" \
      --output docs/assets/<output-name>.md \
      --model gemini-2.5-flash

Extract from PDF:

    python scripts/gemini_batch_process.py \
      --files document.pdf \
      --task extract \
      --prompt "Extract table data as JSON" \
      --output docs/assets/<output-name>.md \
      --format json

Generate Image:

    python scripts/gemini_batch_process.py \
      --task generate \
      --prompt "A futuristic city at sunset" \
      --output docs/assets/<output-file-name> \
      --model gemini-2.5-flash-image \
      --aspect-ratio 16:9

Optimize Media:

    # Prepare large video for processing
    python scripts/media_optimizer.py \
      --input large-video.mp4 \
      --output docs/assets/<output-file-name> \
      --target-size 100MB

    # Batch optimize multiple files
    python scripts/media_optimizer.py \
      --input-dir ./videos \
      --output-dir docs/assets/optimized \
      --quality 85

Convert Documents to Markdown:

    # Convert to PDF
    python scripts/document_converter.py \
      --input document.docx \
      --output docs/assets/document.md

    # Extract pages
    python scripts/document_converter.py \
      --input large.pdf \
      --output docs/assets/chapter1.md \
      --pages 1-20
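
When the same task has to run over many files, the command lines can be assembled programmatically. The sketch below is hypothetical: `build_commands` is not part of the skill, it uses only flags that appear in the gemini_batch_process.py examples, and it issues one command per file so it makes no assumption about whether `--files` accepts several paths at once.

```python
import shlex
from pathlib import PurePosixPath

SCRIPT = "scripts/gemini_batch_process.py"


def build_commands(files, task, prompt=None, output_dir=None,
                   model="gemini-2.5-flash"):
    """Return one argv list per input file for the batch-processing script."""
    commands = []
    for path in files:
        argv = ["python", SCRIPT, "--files", path, "--task", task]
        if prompt:
            argv += ["--prompt", prompt]
        if output_dir:
            # Name each output after its input, e.g. talk.mp4 -> talk.md.
            stem = PurePosixPath(path).stem
            argv += ["--output", f"{output_dir}/{stem}.md"]
        argv += ["--model", model]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in build_commands(
        ["talk.mp4", "demo.mp4"],
        "analyze",
        prompt="Summarize key points with timestamps",
        output_dir="docs/assets",
    ):
        print(shlex.join(argv))
        # To execute instead of print: subprocess.run(argv, check=True)
```

Passing the argv list to `subprocess.run` rather than a shell string avoids quoting problems with prompts that contain spaces or quotes.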

Supported Formats

Audio

  • WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
  • Max 9.5 hours per request
  • Auto-downsampled to 16 Kbps mono

Images

  • PNG, JPEG, WEBP, HEIC, HEIF
  • Max 3,600 images per request
  • Resolution: ≤384px = 258 tokens, larger = tiled

Video

  • MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
  • Max 6 hours (low-res) or 2 hours (default)
  • YouTube URLs supported (public only)

Documents

  • PDF only for vision processing
  • Max 1,000 pages
  • TXT, HTML, Markdown supported (text-only)

Size Limits

  • Inline: <20MB total request
  • File API: 2GB per file, 20GB project quota
  • Retention: 48 hours auto-delete
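
These limits suggest a simple routing rule between inline data and the File API. The following sketch is illustrative only: `choose_upload` is a hypothetical helper, it treats MB and GB as binary units (an assumption), and it ignores the size of the prompt itself, which also counts toward the inline request total.

```python
MB = 1024 * 1024
GB = 1024 * MB

INLINE_LIMIT = 20 * MB    # total request size for inline data
FILE_API_LIMIT = 2 * GB   # per-file limit for the File API


def choose_upload(file_sizes):
    """Return 'inline' or 'file_api' for a list of file sizes in bytes."""
    if any(size > FILE_API_LIMIT for size in file_sizes):
        raise ValueError("a file exceeds the 2GB File API limit; split or compress it first")
    if sum(file_sizes) < INLINE_LIMIT:
        return "inline"
    return "file_api"
```

For example, `choose_upload([5 * MB, 10 * MB])` stays inline, while two files totalling 25MB are routed to the File API.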

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

  • references/audio-processing.md - Transcription, analysis, TTS
    • Timestamp handling and segment analysis
    • Multi-speaker identification
    • Non-speech audio analysis
    • Text-to-speech generation

Image Understanding

  • references/vision-understanding.md - Captioning, detection, OCR
    • Object detection and localization
    • Pixel-level segmentation
    • Visual question answering
    • Multi-image comparison

Video Analysis

  • references/video-analysis.md - Scene detection, temporal understanding
    • YouTube URL processing
    • Timestamp-based queries
    • Video clipping and FPS control
    • Long video optimization

Document Extraction

  • references/document-extraction.md - PDF processing, structured output
    • Table and form extraction
    • Chart and diagram analysis
    • JSON schema validation
    • Multi-page handling

Image Generation

  • references/image-generation.md - Text-to-image, editing
    • Prompt engineering strategies
    • Image editing and composition
    • Aspect ratio selection
    • Safety settings

Cost Optimization

Token Costs

Model Pricing (per 1M tokens):
  • Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
  • Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
  • Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output
Token Rates:
  • Audio: 32 tokens/second (1 min = 1,920 tokens)
  • Video: ~300 tokens/second (default) or ~100 (low-res)
  • PDF: 258 tokens/page (fixed)
  • Image: 258-1,548 tokens based on size
TTS Pricing:
  • Flash TTS: $10/1M tokens
  • Pro TTS: $20/1M tokens
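
Combining the prices and token rates gives a back-of-the-envelope cost estimate. This is a sketch under assumptions: `estimate_cost` is a hypothetical helper, the model identifiers are the ones used elsewhere in this document, and the price table simply copies the figures listed above, so it should be replaced with current values before the result is relied on.

```python
# USD per 1M tokens, copied from the list above; pass your own table to override.
PRICES = {
    "gemini-2.5-flash": {"input": 1.00, "output": 0.10},
    "gemini-2.5-pro": {"input": 3.00, "output": 12.00},
}


def estimate_cost(model, input_tokens, output_tokens=0, prices=PRICES):
    """Return the estimated cost in USD for one request."""
    rate = prices[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Ten minutes of audio at 32 tokens/second is 19,200 input tokens.
    audio_tokens = 10 * 60 * 32
    print(f"${estimate_cost('gemini-2.5-flash', audio_tokens):.4f}")
```

Because the table is a parameter, the same function works unchanged when prices or models differ from those listed here.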

Best Practices

  1. Use gemini-2.5-flash for most tasks (best price/performance)
  2. Use File API for files >20MB or repeated queries
  3. Optimize media before upload (see media_optimizer.py)
  4. Process specific segments instead of full videos
  5. Use lower FPS for static content
  6. Implement context caching for repeated queries
  7. Batch process multiple files in parallel
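
Item 7 can be illustrated with a small thread pool. This is a generic sketch rather than the skill's own batching logic: `process_file` stands in for whatever function sends a single file to the API, and `max_workers=4` is an arbitrary choice that should stay small enough to respect the rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(files, process_file, max_workers=4):
    """Run process_file(path) for each path; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_file, path): path for path in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed file should not stop the batch
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.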

Rate Limits

Free Tier:
  • 10-15 RPM (requests per minute)
  • 1M-4M TPM (tokens per minute)
  • 1,500 RPD (requests per day)
YouTube Limits:
  • Free tier: 8 hours/day
  • Paid tier: No length limits
  • Public videos only
Storage Limits:
  • 20GB per project
  • 2GB per file
  • 48-hour retention

Error Handling

Common errors and solutions:
  • 400: Invalid format/size - validate before upload
  • 401: Invalid API key - check configuration
  • 403: Permission denied - verify API key restrictions
  • 404: File not found - ensure file uploaded and active
  • 429: Rate limit exceeded - implement exponential backoff
  • 500: Server error - retry with backoff
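
The backoff advice for 429 and 500 responses might look like the sketch below. It is deliberately generic: `ApiError` is a hypothetical exception carrying an HTTP status code (a real SDK raises its own error types), and the delay schedule of `base_delay * 2**attempt` plus random jitter is one common choice, not something this document prescribes.

```python
import random
import time

RETRYABLE = {429, 500}  # rate limit exceeded, server error


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE or last_attempt:
                raise  # 400/401/403/404 need a fix, not a retry
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Errors such as 401 or 403 are re-raised immediately, since retrying cannot fix a bad key or a missing permission.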

Scripts Overview

All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
  • Supports all modalities (audio, image, video, PDF)
  • Progress tracking and error recovery
  • Output formats: JSON, Markdown, CSV
  • Rate limiting and retry logic
  • Dry-run mode
media_optimizer.py: Prepare media for Gemini API
  • Compress videos/audio for size limits
  • Resize images appropriately
  • Split long videos into chunks
  • Format conversion
  • Quality vs size optimization
document_converter.py: Convert documents to PDF
  • Convert DOCX, XLSX, PPTX to PDF
  • Extract page ranges
  • Optimize PDFs for Gemini
  • Extract images from PDFs
  • Batch conversion support
Run any script with --help for detailed usage.

Resources
