ai-multimodal

AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

Core Capabilities

Audio Processing

Transcription with timestamps (up to 9.5 hours)
Audio summarization and analysis
Speech understanding and speaker identification
Music and environmental sound analysis
Text-to-speech generation with controllable voice

Image Understanding

Image captioning and description
Object detection with bounding boxes (2.0+)
Pixel-level segmentation (2.5+)
Visual question answering
Multi-image comparison (up to 3,600 images)
OCR and text extraction

Video Analysis

Scene detection and summarization
Video Q&A with temporal understanding
Transcription with visual descriptions
YouTube URL support
Long video processing (up to 6 hours)
Frame-level analysis

Document Extraction

Native PDF vision processing (up to 1,000 pages)
Table and form extraction
Chart and diagram analysis
Multi-page document understanding
Structured data output (JSON schema)
Format conversion (PDF to HTML/JSON)

Image Generation

Text-to-image generation
Image editing and modification
Multi-image composition (up to 3 images)
Iterative refinement
Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
Controllable style and quality

Capability Matrix

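| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |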

Model Selection Guide

Gemini 2.5 Series (Recommended)

gemini-2.5-pro: Highest quality, all features, 1M-2M context
gemini-2.5-flash: Best balance, all features, 1M-2M context
gemini-2.5-flash-lite: Lightweight, segmentation support
gemini-2.5-flash-image: Image generation only

Feature Requirements

Segmentation: Requires 2.5+ models
Object Detection: Requires 2.0+ models
Multi-video: Requires 2.5+ models
Image Generation: Requires flash-image model
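
The requirements above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the skill's scripts: the feature keys (`object_detection`, `segmentation`, `multi_video`, `image_generation`) and the function names are hypothetical, and the version parsing assumes model names of the form `gemini-<major>.<minor>-...`.

```python
import re

# Minimum model series per feature, taken from the list above.
# The feature keys are illustrative names, not identifiers used by the skill.
MIN_SERIES = {
    "object_detection": (2, 0),
    "segmentation": (2, 5),
    "multi_video": (2, 5),
}


def model_series(model: str) -> tuple[int, int]:
    """Parse (major, minor) from a name such as 'gemini-2.5-flash'."""
    match = re.match(r"gemini-(\d+)\.(\d+)", model)
    if match is None:
        raise ValueError(f"unrecognized model name: {model!r}")
    return int(match.group(1)), int(match.group(2))


def supports(model: str, feature: str) -> bool:
    """Return True if the model meets the documented requirement for a feature."""
    is_image_model = "flash-image" in model
    if feature == "image_generation":
        # Image generation is tied to the flash-image variant, not to a series.
        return is_image_model
    if is_image_model:
        # The flash-image model is listed as image generation only.
        return False
    return model_series(model) >= MIN_SERIES[feature]
```

For example, `supports("gemini-2.0-flash", "segmentation")` is `False`, while the same call with `gemini-2.5-flash` is `True`.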

Context Windows

2M tokens: ~6 hours video (low-res) or ~2 hours (default)
1M tokens: ~3 hours video (low-res) or ~1 hour (default)
Audio: 32 tokens/second (1 min = 1,920 tokens)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size

Quick Start

Prerequisites

API Key Setup: Supports both Google AI Studio and Vertex AI.

The skill checks for `GEMINI_API_KEY` in this order:

1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Project root: `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`

Get API key: https://aistudio.google.com/apikey
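
The lookup order described above could be implemented roughly as follows. This is a sketch under assumptions, not the skill's actual code: the function names are hypothetical, the `.env` parser handles only simple `KEY=VALUE` lines, and the real scripts may rely on `python-dotenv` instead.

```python
import os
from pathlib import Path

# Candidate .env files in the documented order, relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_value(path: Path, name: str):
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate an optional leading "export ".
        key = key.strip().removeprefix("export ").strip()
        if key == name:
            return value.strip().strip("\"'")
    return None


def find_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    for rel in ENV_FILES:
        value = read_env_value(Path(project_root) / rel, name)
        if value:
            return value
    return None
```

Passing `environ` explicitly keeps the function easy to test; a caller would normally leave it unset so that `os.environ` is used.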

For Vertex AI:

```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```

Install SDK:

```bash
pip install google-genai python-dotenv pillow
```

Common Patterns

Transcribe Audio:

```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```

Analyze Image:

```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

Process Video:

```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

Extract from PDF:

```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```

Generate Image:

```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```

Optimize Media:

```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```

Convert Documents to Markdown:

```bash
# Convert to PDF
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```
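
When the same task has to run over several files, the `gemini_batch_process.py` invocations shown in this section can be generated from Python. The sketch below only assembles argument lists using the flags that appear in the examples (`--files`, `--task`, `--prompt`, `--model`); the helper name `build_commands` is hypothetical, and it emits one command per file because this document does not say whether `--files` accepts several paths at once.

```python
SCRIPT = "scripts/gemini_batch_process.py"


def build_commands(files, task, model=None, prompt=None):
    """Return one argv list per input file, mirroring the CLI examples."""
    commands = []
    for path in files:
        argv = ["python", SCRIPT, "--files", path, "--task", task]
        if prompt is not None:
            argv += ["--prompt", prompt]
        if model is not None:
            argv += ["--model", model]
        commands.append(argv)
    return commands


for argv in build_commands(["a.mp3", "b.mp3"], "transcribe", model="gemini-2.5-flash"):
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` from the project root; building lists rather than shell strings avoids quoting problems in prompts.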

Supported Formats

Audio

WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
Max 9.5 hours per request
Auto-downsampled to 16 Kbps mono

Images

PNG, JPEG, WEBP, HEIC, HEIF
Max 3,600 images per request
Resolution: ≤384px = 258 tokens, larger = tiled

Video

MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
Max 6 hours (low-res) or 2 hours (default)
YouTube URLs supported (public only)

Documents

PDF only for vision processing
Max 1,000 pages
TXT, HTML, Markdown supported (text-only)

Size Limits

Inline: <20MB total request
File API: 2GB per file, 20GB project quota
Retention: 48 hours auto-delete
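
These limits suggest a simple rule for choosing between inline data and the File API. The sketch below is illustrative only: the function name and the `reused` flag are hypothetical, and it assumes binary units (1 MB = 1024 × 1024 bytes), which this document does not specify. The inline limit applies to the whole request, so a real check should leave headroom for the prompt.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # "<20MB total request" (binary MB assumed)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # "2GB per file" (binary GB assumed)


def choose_upload_method(file_size_bytes: int, reused: bool = False) -> str:
    """Return 'inline' or 'file_api' for a file of the given size."""
    if file_size_bytes > FILE_API_LIMIT_BYTES:
        raise ValueError("file exceeds the 2GB File API limit; compress or split it first")
    # Files queried repeatedly are better uploaded once, whatever their size.
    if reused or file_size_bytes >= INLINE_LIMIT_BYTES:
        return "file_api"
    return "inline"
```

Uploaded files are deleted after 48 hours, so a long-running workflow would need to upload them again after that window.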

Reference Navigation

For detailed implementation guidance, see:

Audio Processing

```
references/audio-processing.md
```
- Transcription, analysis, TTS
- Timestamp handling and segment analysis
- Multi-speaker identification
- Non-speech audio analysis
- Text-to-speech generation

Image Understanding

```
references/vision-understanding.md
```
- Captioning, detection, OCR
- Object detection and localization
- Pixel-level segmentation
- Visual question answering
- Multi-image comparison

Video Analysis

```
references/video-analysis.md
```
- Scene detection, temporal understanding
- YouTube URL processing
- Timestamp-based queries
- Video clipping and FPS control
- Long video optimization

Document Extraction

```
references/document-extraction.md
```
- PDF processing, structured output
- Table and form extraction
- Chart and diagram analysis
- JSON schema validation
- Multi-page handling

Image Generation

```
references/image-generation.md
```
- Text-to-image, editing
- Prompt engineering strategies
- Image editing and composition
- Aspect ratio selection
- Safety settings

Cost Optimization

Token Costs

Model Pricing:

Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

Token Rates:

Audio: 32 tokens/second (1 min = 1,920 tokens)
Video: ~300 tokens/second (default) or ~100 (low-res)
PDF: 258 tokens/page (fixed)
Image: 258-1,548 tokens based on size

TTS Pricing:

Flash TTS: $10/1M tokens
Pro TTS: $20/1M tokens
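
The token rates above make it possible to estimate input size and cost before sending a request. The following is a rough sketch using only the rates quoted in this section; the function names are hypothetical, the video rates are approximate, images are left out because their cost depends on size, and the price is a parameter because pricing changes over time.

```python
AUDIO_TOKENS_PER_SEC = 32
VIDEO_TOKENS_PER_SEC = {"default": 300, "low-res": 100}  # approximate rates
PDF_TOKENS_PER_PAGE = 258


def estimate_tokens(audio_sec=0, video_sec=0, pdf_pages=0, video_res="default"):
    """Rough input-token estimate for one request (images not included)."""
    tokens = (
        audio_sec * AUDIO_TOKENS_PER_SEC
        + video_sec * VIDEO_TOKENS_PER_SEC[video_res]
        + pdf_pages * PDF_TOKENS_PER_PAGE
    )
    return round(tokens)


def estimate_cost(tokens, price_per_million):
    """Cost in dollars for a token count at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_million


print(estimate_tokens(audio_sec=600))   # 10 minutes of audio -> 19200
print(estimate_tokens(video_sec=3600))  # 1 hour of default-res video -> 1080000
```

At roughly 300 tokens per second, one hour of default-resolution video uses about a million tokens, which is why low-res mode or clipping matters for long videos.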

Best Practices

- Use `gemini-2.5-flash` for most tasks (best price/performance)
- Use the File API for files >20MB or repeated queries
- Optimize media before upload (see `media_optimizer.py`)
- Process specific segments instead of full videos
- Use lower FPS for static content
- Implement context caching for repeated queries
- Batch process multiple files in parallel

Rate Limits

Free Tier:

10-15 RPM (requests per minute)
1M-4M TPM (tokens per minute)
1,500 RPD (requests per day)

YouTube Limits:

Free tier: 8 hours/day
Paid tier: No length limits
Public videos only

Storage Limits:

20GB per project
2GB per file
48-hour retention
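
One way to stay under a requests-per-minute limit is to space calls evenly on the client side. The sketch below is an assumption about how a caller might do this, not a description of the rate limiting built into the skill's scripts; the `Throttle` class is hypothetical, and the clock and sleep functions are injectable so the behaviour can be tested without waiting.

```python
import time


class Throttle:
    """Space calls at least 60/rpm seconds apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until the next request is allowed, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

With `Throttle(rpm=10)`, calling `wait()` before each request leaves at least six seconds between requests. The token-per-minute and requests-per-day limits are not covered by this sketch.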

Error Handling

Common errors and solutions:

400: Invalid format/size - validate before upload
401: Invalid API key - check configuration
403: Permission denied - verify API key restrictions
404: File not found - ensure file uploaded and active
429: Rate limit exceeded - implement exponential backoff
500: Server error - retry with backoff
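
The retry advice for 429 and 500 responses can be sketched as a small wrapper. This is a minimal illustration, not the retry logic used by the skill's scripts: `ApiError` is a hypothetical exception type standing in for whatever error the client library raises, and the attempt count, delays, and jitter are arbitrary defaults.

```python
import random
import time

RETRYABLE_STATUS = {429, 500}  # rate limit and server errors from the list above


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying retryable errors with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise  # 400/401/403/404 need a fix, not a retry
            delay = base_delay * 2 ** attempt
            # Small random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Errors such as 400, 401, 403, and 404 are re-raised immediately, since repeating the same request would not change the outcome.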

Scripts Overview

All scripts support unified API key detection and error handling:

gemini_batch_process.py: Batch process multiple media files

Supports all modalities (audio, image, video, PDF)
Progress tracking and error recovery
Output formats: JSON, Markdown, CSV
Rate limiting and retry logic
Dry-run mode

media_optimizer.py: Prepare media for Gemini API

Compress videos/audio for size limits
Resize images appropriately
Split long videos into chunks
Format conversion
Quality vs size optimization

document_converter.py: Convert documents to PDF

Convert DOCX, XLSX, PPTX to PDF
Extract page ranges
Optimize PDFs for Gemini
Extract images from PDFs
Batch conversion support

Run any script with `--help` for detailed usage.
