gemini-3-multimodal
Gemini 3 Pro Multimodal Input Processing
Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media) - see gemini-3-image-generation for OUTPUT (generating images).

Overview
Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.
Key Capabilities
核心功能
- Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
- Video Processing: Up to 1 hour of video, frame analysis, OCR
- Audio Processing: Up to 9.5 hours of audio, speech understanding
- PDF Documents: Native PDF support, multi-page analysis, text extraction
- Media Resolution Control: Low/medium/high resolution for token optimization
- Token Optimization: Granular control over processing costs
When to Use This Skill
- Analyzing images, photos, or screenshots
- Processing video content for insights
- Transcribing or understanding audio/speech
- Extracting information from PDF documents
- Building multimodal applications
- Optimizing media processing costs
Quick Start
Prerequisites
- Gemini API setup (see gemini-3-pro-api skill)
- Media files in supported formats
Python Quick Start
```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
    "What's in this image?",
    image_file
])
print(response.text)
```

Node.js Quick Start
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
  mimeType: "image/jpeg"
});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  "What's in this image?",
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
```

Core Tasks
Task 1: Analyze Image Content
Goal: Extract information, objects, text, or insights from images.
Use Cases:
- Object detection and recognition
- OCR (text extraction from images)
- Visual Q&A
- Code generation from UI screenshots
- Chart/diagram analysis
- Product identification
Python Example:
```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure model with high resolution for best quality
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high"  # 1,120 tokens per image
    }
)

# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)

# Analyze with specific prompt
response = model.generate_content([
    """Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose
4. If code/UI: describe the functionality
""",
    image_file
])
print(response.text)

# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")
```
**Node.js Example:**
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
  mimeType: "image/png"
});

// Configure model with high resolution
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    thinking_level: "high",
    media_resolution: "high" // Best quality for OCR
  }
});

const result = await model.generateContent([
  `Analyze this image and provide:
1. Main objects and their locations
2. Any visible text (OCR)
3. Overall context and purpose`,
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);
console.log(result.response.text());
```

Resolution Options:

| Resolution | Tokens per Image | Best For |
|---|---|---|
| `low` | 280 | Quick analysis, low detail |
| `medium` | 560 | Balanced quality/cost |
| `high` | 1,120 | OCR, fine details, small text |

Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF

See: references/image-understanding.md for advanced patterns
Task 2: Process Video Content
Goal: Analyze video content, extract insights, perform frame-by-frame analysis.
Use Cases:
- Video summarization
- Object tracking
- Scene detection
- Video OCR
- Content moderation
- Educational video analysis
Python Example:
```python
import google.generativeai as genai
from pathlib import Path
import time

genai.configure(api_key="YOUR_API_KEY")

# Configure for video processing
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 70 tokens/frame (balanced)
    }
)

# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)

# Wait for processing
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Analyze video
response = model.generate_content([
    """Analyze this video and provide:
1. Overall summary of content
2. Key scenes and timestamps
3. Main topics covered
4. Any visible text throughout the video
""",
    video_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
```
**Node.js Example:**
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload video
const uploadResult = await fileManager.uploadFile("tutorial.mp4", {
  mimeType: "video/mp4"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}
if (file.state === FileState.FAILED) {
  throw new Error("Video processing failed");
}

// Analyze video
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this video and provide:
1. Overall summary
2. Key scenes and timestamps
3. Main topics covered`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
```

Video Specs:
- Max Duration: 1 hour
- Formats: MP4, MOV, AVI, etc.
- Resolution Options: Low (70 tokens/frame), Medium (70 tokens/frame), High (280 tokens/frame)
- OCR: Available with high resolution

See: references/video-processing.md for advanced patterns
Task 3: Process Audio/Speech
Goal: Transcribe and understand audio content, process speech.
Use Cases:
- Audio transcription
- Speech analysis
- Podcast summarization
- Meeting notes
- Language understanding
- Audio classification
Python Example:
```python
import google.generativeai as genai
from pathlib import Path
import time

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)

# Wait for processing
while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)

# Process audio
response = model.generate_content([
    """Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key speakers (if multiple)
4. Important timestamps
5. Action items or conclusions
""",
    audio_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
```
**Node.js Example:**
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload audio
const uploadResult = await fileManager.uploadFile("podcast.mp3", {
  mimeType: "audio/mp3"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  `Process this audio and provide:
1. Full transcription
2. Summary of main points
3. Key timestamps`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
```

Audio Specs:
- Max Duration: 9.5 hours
- Formats: WAV, MP3, FLAC, AAC, etc.
- Languages: Supports multiple languages

See: references/audio-processing.md for advanced patterns
Task 4: Process PDF Documents
Goal: Extract and analyze content from PDF documents.
Use Cases:
- Document analysis
- Information extraction
- Form processing
- Research paper analysis
- Contract review
- Multi-page document understanding
Python Example:
```python
import google.generativeai as genai
from pathlib import Path
import time

genai.configure(api_key="YOUR_API_KEY")

# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 560 tokens/page (saturation point)
    }
)

# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)

# Wait for processing
while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)

# Analyze PDF
response = model.generate_content([
    """Analyze this PDF document and provide:
1. Document type and purpose
2. Main sections and structure
3. Key findings or arguments
4. Important data or statistics
5. Conclusions or recommendations
""",
    pdf_file
])
print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")
```
**Node.js Example:**
```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF
const uploadResult = await fileManager.uploadFile("research_paper.pdf", {
  mimeType: "application/pdf"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

// Analyze with medium resolution (recommended)
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this PDF and extract:
1. Main sections
2. Key findings
3. Important data`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);
console.log(result.response.text());
```

PDF Processing Tips:
- Recommended Resolution: `medium` (560 tokens/page) - saturation point for quality
- Multi-page: Automatically processes all pages
- Native Support: No conversion to images needed
- Text Extraction: High-quality text extraction built-in

See: references/document-processing.md for advanced patterns
Task 5: Optimize Media Processing Costs
Goal: Balance quality and token consumption based on use case.
Strategy:
| Media Type | Resolution | Tokens | Use Case |
|---|---|---|---|
| Images | `low` | 280 | Quick scan, thumbnails |
| Images | `medium` | 560 | General analysis |
| Images | `high` | 1,120 | OCR, fine details, code |
| PDFs | `medium` | 560/page | Recommended (saturation point) |
| PDFs | `high` | 1,120/page | Diminishing returns |
| Video | `medium` | 70/frame | Most use cases |
| Video | `high` | 280/frame | OCR from video |
Python Optimization Example:
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
    """Analyze image with appropriate resolution"""
    resolution = "high" if need_ocr else "medium"
    model = genai.GenerativeModel(
        "gemini-3-pro-preview",
        generation_config={
            "media_resolution": resolution
        }
    )
    image_file = genai.upload_file(image_path)
    response = model.generate_content([
        "Describe this image" if not need_ocr else "Extract all text from this image",
        image_file
    ])
    # Log token usage for cost tracking
    tokens = response.usage_metadata.total_token_count
    cost = (tokens / 1_000_000) * 2.00  # Input pricing
    print(f"Resolution: {resolution}, Tokens: {tokens}, Cost: ${cost:.6f}")
    return response.text

# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False)    # medium
analyze_image_optimized("document.png", need_ocr=True)  # high
```
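The strategy table can also drive a rough pre-upload cost estimate. A minimal sketch, assuming the per-item token counts and the $2.00 per 1M input-token price quoted in this guide (media tokens are only part of a request's input, so treat the result as a lower bound and verify against current pricing):

```python
# Token counts per (media type, resolution), as listed in the table above.
MEDIA_TOKENS = {
    ("image", "low"): 280,
    ("image", "medium"): 560,
    ("image", "high"): 1120,
    ("pdf_page", "medium"): 560,
    ("pdf_page", "high"): 1120,
    ("video_frame", "medium"): 70,
    ("video_frame", "high"): 280,
}

def estimate_media_cost(media_type, resolution, count, price_per_million=2.00):
    """Return (total_tokens, estimated_input_cost_usd) for `count` items."""
    tokens = MEDIA_TOKENS[(media_type, resolution)] * count
    return tokens, (tokens / 1_000_000) * price_per_million

# e.g. a 100-page PDF at medium resolution
tokens, cost = estimate_media_cost("pdf_page", "medium", 100)
print(f"{tokens} tokens, ~${cost:.3f}")
```

This makes it easy to compare options up front, e.g. high-resolution images cost 4x the tokens of low resolution before any prompt text is counted.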
**Per-Item Resolution Control:**
```python
# Set different resolutions for different media in the same request
response = model.generate_content([
    "Compare these images",
    {"file": image1, "media_resolution": "high"},  # High detail
    {"file": image2, "media_resolution": "low"},   # Low detail OK
])
```
**Cost Monitoring:**
```python
def log_media_costs(response):
    """Log media processing costs"""
    usage = response.usage_metadata
    # Pricing for ≤200k context
    input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
    output_cost = (usage.candidates_token_count / 1_000_000) * 12.00
    print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
    print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
    print(f"Total cost: ${input_cost + output_cost:.6f}")
```

See: references/token-optimization.md for comprehensive strategies
Media Resolution Control
Resolution Options

| Setting | Images | PDFs | Video (per frame) | Recommendation |
|---|---|---|---|---|
| `low` | 280 tokens | 280 tokens | 70 tokens | Quick analysis, low detail |
| `medium` | 560 tokens | 560 tokens | 70 tokens | Balanced quality/cost |
| `high` | 1,120 tokens | 1,120 tokens | 280 tokens | OCR, fine text, details |
Configuration
Global Setting (all media):
```python
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "media_resolution": "high"  # Applies to all media
    }
)
```

Per-Item Setting (mixed resolutions):

```python
response = model.generate_content([
    "Analyze these files",
    {"file": high_detail_image, "media_resolution": "high"},
    {"file": low_detail_image, "media_resolution": "low"}
])
```
Best Practices
- Images: Use `high` for OCR/text extraction, `medium` for general analysis
- PDFs: Use `medium` (saturation point - higher resolutions show diminishing returns)
- Video: Use `low` or `medium` unless OCR needed
- Cost Control: Start with `low`, increase only if quality insufficient

See: references/media-resolution.md for detailed guide

File Management
Upload Files
```python
import google.generativeai as genai
import time

# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")

# Check processing status
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)
print(f"Status: {file.state.name}")
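The Node.js examples pass `mimeType` explicitly; in Python the standard library can derive it from the filename. A small optional convenience sketch (`upload_file` can usually infer the type on its own, and whether it accepts a `mime_type` keyword depends on your SDK version):

```python
import mimetypes

def guess_media_mime(path):
    """Best-effort MIME type from a filename, e.g. for explicit uploads."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    return mime
```

Usage (assuming your SDK version exposes the keyword): `genai.upload_file("photo.jpg", mime_type=guess_media_mime("photo.jpg"))`.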
List Uploaded Files
```python
# List all files
for file in genai.list_files():
    print(f"{file.name} - {file.display_name}")
```

Delete Files

```python
# Delete specific file
genai.delete_file(file.name)

# Delete all files
for file in genai.list_files():
    genai.delete_file(file.name)
    print(f"Deleted: {file.name}")
```
File Lifecycle
- Upload: Immediate
- Processing: Async (especially for video/audio)
- Storage: Files persist until deleted
- Expiration: Files may expire after a period (check the docs)
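Because processing is asynchronous, the upload-then-poll pattern used throughout this skill is worth factoring into one helper with a timeout, so a stuck or failed file cannot hang a pipeline. A sketch; `get_file` is injected (pass `genai.get_file`) so the loop is easy to test:

```python
import time

def wait_for_file(get_file, name, timeout=300, interval=5):
    """Poll an uploaded file until it leaves the PROCESSING state.

    get_file: callable returning an object with .state.name (e.g. genai.get_file)
    Raises TimeoutError if still PROCESSING after `timeout` seconds,
    ValueError if the file ends up FAILED.
    """
    deadline = time.monotonic() + timeout
    file = get_file(name)
    while file.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"{name} still PROCESSING after {timeout}s")
        time.sleep(interval)
        file = get_file(name)
    if file.state.name == "FAILED":
        raise ValueError(f"{name} failed processing")
    return file
```

Usage: `video_file = wait_for_file(genai.get_file, video_file.name, timeout=600)` for large videos.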
Multi-File Processing

Process Multiple Images

```python
# Upload multiple images
images = [
    genai.upload_file("photo1.jpg"),
    genai.upload_file("photo2.jpg"),
    genai.upload_file("photo3.jpg")
]

# Analyze together
response = model.generate_content([
    "Compare these images and identify common elements",
    *images
])
print(response.text)
```
Mixed Media Types
检查处理状态
python
undefinedwhile file.state.name == "PROCESSING":
time.sleep(5)
file = genai.get_file(file.name)
print(f"Status: {file.state.name}")
undefinedCombine different media types
列出已上传文件
image = genai.upload_file("chart.png")
pdf = genai.upload_file("report.pdf")
response = model.generate_content([
"Does the chart match the data in the report?",
image,
pdf
])
---python
undefinedReferences
Core Guides
- Image Understanding - Complete image analysis patterns
- Video Processing - Video analysis and frame extraction
- Audio Processing - Audio transcription and analysis
- Document Processing - PDF and document extraction
Optimization
- Media Resolution - Resolution control and quality tuning
- OCR Extraction - Text extraction best practices
- Token Optimization - Cost control and efficiency
Scripts
- Analyze Image Script - Production-ready image analysis
- Process Video Script - Video processing automation
- Process Audio Script - Audio transcription
- Process PDF Script - PDF extraction
Official Resources
Related Skills
- gemini-3-pro-api - Basic setup, authentication, text generation
- gemini-3-image-generation - Image OUTPUT (generating images)
- gemini-3-advanced - Function calling, tools, caching, batch processing
Common Use Cases
Visual Q&A Application
Combine image understanding with chat:
```python
model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()

# Upload image
image = genai.upload_file("product.jpg")

# Ask questions about it
response1 = chat.send_message(["What product is this?", image])
response2 = chat.send_message("What are its main features?")
response3 = chat.send_message("What's the price range for similar products?")
```
Document Analysis Pipeline
Process multiple PDFs and extract insights:
```python
import google.generativeai as genai
from pathlib import Path
import json
import time

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"media_resolution": "medium"}
)

# Process all PDFs in directory
pdf_dir = Path("documents/")
results = {}

for pdf_path in pdf_dir.glob("*.pdf"):
    pdf_file = genai.upload_file(pdf_path)
    # Wait for processing
    while pdf_file.state.name == "PROCESSING":
        time.sleep(5)
        pdf_file = genai.get_file(pdf_file.name)
    # Extract key information
    response = model.generate_content([
        "Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
        pdf_file
    ])
    results[pdf_path.name] = response.text
    # Clean up
    genai.delete_file(pdf_file.name)

# Save results
with open("analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
Video Content Moderation
Analyze video for specific content:
```python
import time

video = genai.upload_file("user_upload.mp4")

# Wait for processing
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

response = model.generate_content([
    """Analyze this video for:
1. Inappropriate content (yes/no)
2. Violence or harmful content (yes/no)
3. Overall content rating (G/PG/PG-13/R)
4. Brief justification
Provide a structured response.
""",
    video
])
print(response.text)
```
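The moderation prompt asks for a numbered, structured reply, but free-text model output still varies, so downstream code needs a tolerant parse. A rough sketch of that step (the field names are this guide's own labels; for production use, requesting JSON output is more robust than parsing prose):

```python
import re

MODERATION_FIELDS = ["inappropriate", "violence", "rating", "justification"]

def parse_moderation_reply(text):
    """Map numbered lines ('1. no', '2) no', ...) onto the four fields above."""
    result = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d)[.)]\s*(.+)", line)
        if m and 1 <= int(m.group(1)) <= len(MODERATION_FIELDS):
            result[MODERATION_FIELDS[int(m.group(1)) - 1]] = m.group(2).strip()
    return result
```

Lines that don't match the numbered pattern are simply skipped, so extra model commentary doesn't break the parse.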
Troubleshooting
Issue: File processing stuck at "PROCESSING"
Solution: Large files (especially video) can take time. Wait 30-60 seconds between checks. If stuck > 5 minutes, file may have failed.
Issue: Low quality OCR results
Solution: Use `media_resolution: "high"` for images with text. Ensure the image is clear and high resolution.
Issue: High token costs
Solution: Use appropriate media resolution. Start with `low`, increase only if needed. For PDFs, `medium` is usually sufficient.
Issue: Video analysis missing details
Solution: Use `media_resolution: "high"` for better frame analysis, or provide more specific prompts about what to look for.
Issue: Audio transcription inaccurate
Solution: Ensure audio quality is good (no excessive background noise). Provide context in prompt about accent, language, or domain.
Summary
This skill provides comprehensive multimodal input processing capabilities:
✅ Image analysis with OCR and object detection
✅ Video processing up to 1 hour
✅ Audio transcription up to 9.5 hours
✅ Native PDF document processing
✅ Granular media resolution control
✅ Token optimization strategies
✅ Multi-file processing
✅ Production-ready examples
Ready to analyze multimodal content? Start with the task that matches your use case above!