gemini-3-multimodal

Gemini 3 Pro Multimodal Input Processing

Comprehensive guide for processing multimodal inputs with Gemini 3 Pro, including image understanding, video analysis, audio processing, and PDF document extraction. This skill focuses on INPUT processing (analyzing media); see `gemini-3-image-generation` for OUTPUT (generating images).

Overview

Gemini 3 Pro provides native multimodal capabilities for understanding and analyzing various media types. This skill covers all input processing operations with granular control over quality, performance, and token consumption.

Key Capabilities

  • Image Understanding: Object detection, OCR, visual Q&A, code from screenshots
  • Video Processing: Up to 1 hour of video, frame analysis, OCR
  • Audio Processing: Up to 9.5 hours of audio, speech understanding
  • PDF Documents: Native PDF support, multi-page analysis, text extraction
  • Media Resolution Control: Low/medium/high resolution for token optimization
  • Token Optimization: Granular control over processing costs

When to Use This Skill

  • Analyzing images, photos, or screenshots
  • Processing video content for insights
  • Transcribing or understanding audio/speech
  • Extracting information from PDF documents
  • Building multimodal applications
  • Optimizing media processing costs

Quick Start

Prerequisites

  • Gemini API setup (see the `gemini-3-pro-api` skill)
  • Media files in supported formats

Python Quick Start

python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload and analyze image
image_file = genai.upload_file(Path("photo.jpg"))
response = model.generate_content([
    "What's in this image?",
    image_file
])
print(response.text)

Node.js Quick Start

typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";
import fs from "fs";

const genAI = new GoogleGenerativeAI("YOUR_API_KEY");
const fileManager = new GoogleAIFileManager("YOUR_API_KEY");

// Upload and analyze image
const uploadResult = await fileManager.uploadFile("photo.jpg", {
  mimeType: "image/jpeg"
});

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });
const result = await model.generateContent([
  "What's in this image?",
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());

Core Tasks

Task 1: Analyze Image Content

Goal: Extract information, objects, text, or insights from images.
Use Cases:
  • Object detection and recognition
  • OCR (text extraction from images)
  • Visual Q&A
  • Code generation from UI screenshots
  • Chart/diagram analysis
  • Product identification
Python Example:
python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure model with high resolution for best quality
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "high"  # 1,120 tokens per image
    }
)

# Upload image
image_path = Path("screenshot.png")
image_file = genai.upload_file(image_path)

# Analyze with specific prompt
response = model.generate_content([
    """Analyze this image and provide:
    1. Main objects and their locations
    2. Any visible text (OCR)
    3. Overall context and purpose
    4. If code/UI: describe the functionality
    """,
    image_file
])

print(response.text)

# Check token usage
print(f"Tokens used: {response.usage_metadata.total_token_count}")

**Node.js Example:**

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload image
const uploadResult = await fileManager.uploadFile("screenshot.png", {
  mimeType: "image/png"
});

// Configure model with high resolution
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    thinking_level: "high",
    media_resolution: "high"  // Best quality for OCR
  }
});

const result = await model.generateContent([
  `Analyze this image and provide:
  1. Main objects and their locations
  2. Any visible text (OCR)
  3. Overall context and purpose`,
  { fileData: { fileUri: uploadResult.file.uri, mimeType: uploadResult.file.mimeType } }
]);

console.log(result.response.text());
```

Resolution Options:

| Resolution | Tokens per Image | Best For |
|------------|------------------|----------|
| `low` | 280 tokens | Quick analysis, low detail |
| `medium` | 560 tokens | Balanced quality/cost |
| `high` | 1,120 tokens | OCR, fine details, small text |

Supported Formats: JPEG, PNG, WEBP, HEIC, HEIF

See: `references/image-understanding.md` for advanced patterns
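
A small pre-upload check can catch unsupported image files before any API call is made. The sketch below maps the formats listed above to MIME types. The table name and the `image_mime_type` helper are hypothetical, and the MIME strings are the standard registered ones rather than values stated in this guide.

```python
from pathlib import Path

# Image formats listed as supported in this guide, keyed by file extension.
# The MIME strings are an assumption based on standard registrations.
SUPPORTED_IMAGE_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path):
    """Return the MIME type for a supported image path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_IMAGE_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported image format: {suffix or '(none)'}") from None
```

The returned string could be passed as the `mimeType` argument in the Node.js upload call shown earlier, or used simply to reject files before uploading.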

Task 2: Process Video Content

Goal: Analyze video content, extract insights, perform frame-by-frame analysis.
Use Cases:
  • Video summarization
  • Object tracking
  • Scene detection
  • Video OCR
  • Content moderation
  • Educational video analysis
Python Example:
python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure for video processing
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 70 tokens/frame (balanced)
    }
)

# Upload video (up to 1 hour supported)
video_path = Path("tutorial.mp4")
video_file = genai.upload_file(video_path)

# Wait for processing
import time

while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Analyze video
response = model.generate_content([
    """Analyze this video and provide:
    1. Overall summary of content
    2. Key scenes and timestamps
    3. Main topics covered
    4. Any visible text throughout the video
    """,
    video_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

**Node.js Example:**

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload video
const uploadResult = await fileManager.uploadFile("tutorial.mp4", {
  mimeType: "video/mp4"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

if (file.state === FileState.FAILED) {
  throw new Error("Video processing failed");
}

// Analyze video
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this video and provide:
  1. Overall summary
  2. Key scenes and timestamps
  3. Main topics covered`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());
```

Video Specs:
  • Max Duration: 1 hour
  • Formats: MP4, MOV, AVI, etc.
  • Resolution Options: Low (70 tokens/frame), Medium (70 tokens/frame), High (280 tokens/frame)
  • OCR: Available with high resolution

See: `references/video-processing.md` for advanced patterns
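
The per-frame figures above make it possible to budget a video request before uploading. The sketch below is a rough estimate only: the `estimate_video_tokens` helper is hypothetical, and the default sampling rate of one frame per second is an assumption that this guide does not state. Audio-track tokens are ignored.

```python
import math

# Per-frame token costs and the duration limit from the Video Specs above.
VIDEO_TOKENS_PER_FRAME = {"low": 70, "medium": 70, "high": 280}
MAX_VIDEO_SECONDS = 60 * 60  # 1 hour


def estimate_video_tokens(duration_seconds, resolution="medium", frames_per_second=1.0):
    """Estimate frame tokens for a video.

    frames_per_second is an assumed sampling rate, not a documented value.
    """
    if resolution not in VIDEO_TOKENS_PER_FRAME:
        raise ValueError(f"Unknown resolution: {resolution!r}")
    if duration_seconds < 0 or frames_per_second <= 0:
        raise ValueError("Duration must be >= 0 and sampling rate must be > 0")
    if duration_seconds > MAX_VIDEO_SECONDS:
        raise ValueError("Video exceeds the 1 hour limit")
    frames = math.ceil(duration_seconds * frames_per_second)
    return frames * VIDEO_TOKENS_PER_FRAME[resolution]
```

Under these assumptions, switching a ten-minute clip from `medium` to `high` quadruples the frame tokens, which is why `high` is worth reserving for video OCR.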

Task 3: Process Audio/Speech

Goal: Transcribe and understand audio content, process speech.
Use Cases:
  • Audio transcription
  • Speech analysis
  • Podcast summarization
  • Meeting notes
  • Language understanding
  • Audio classification
Python Example:
python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-3-pro-preview")

# Upload audio file (up to 9.5 hours supported)
audio_path = Path("podcast.mp3")
audio_file = genai.upload_file(audio_path)

# Wait for processing
import time

while audio_file.state.name == "PROCESSING":
    time.sleep(5)
    audio_file = genai.get_file(audio_file.name)

# Process audio
response = model.generate_content([
    """Process this audio and provide:
    1. Full transcription
    2. Summary of main points
    3. Key speakers (if multiple)
    4. Important timestamps
    5. Action items or conclusions
    """,
    audio_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

**Node.js Example:**

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload audio
const uploadResult = await fileManager.uploadFile("podcast.mp3", {
  mimeType: "audio/mp3"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

const model = genAI.getGenerativeModel({ model: "gemini-3-pro-preview" });

const result = await model.generateContent([
  `Process this audio and provide:
  1. Full transcription
  2. Summary of main points
  3. Key timestamps`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());
```

Audio Specs:
  • Max Duration: 9.5 hours
  • Formats: WAV, MP3, FLAC, AAC, etc.
  • Languages: Supports multiple languages

See: `references/audio-processing.md` for advanced patterns

Task 4: Process PDF Documents

Goal: Extract and analyze content from PDF documents.
Use Cases:
  • Document analysis
  • Information extraction
  • Form processing
  • Research paper analysis
  • Contract review
  • Multi-page document understanding
Python Example:
python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")

# Configure with medium resolution (recommended for PDFs)
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "thinking_level": "high",
        "media_resolution": "medium"  # 560 tokens/page (saturation point)
    }
)

# Upload PDF
pdf_path = Path("research_paper.pdf")
pdf_file = genai.upload_file(pdf_path)

# Wait for processing
import time

while pdf_file.state.name == "PROCESSING":
    time.sleep(5)
    pdf_file = genai.get_file(pdf_file.name)

# Analyze PDF
response = model.generate_content([
    """Analyze this PDF document and provide:
    1. Document type and purpose
    2. Main sections and structure
    3. Key findings or arguments
    4. Important data or statistics
    5. Conclusions or recommendations
    """,
    pdf_file
])

print(response.text)
print(f"Tokens used: {response.usage_metadata.total_token_count}")

**Node.js Example:**

```typescript
import { GoogleGenerativeAI } from "@google/generative-ai";
import { GoogleAIFileManager, FileState } from "@google/generative-ai/server";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const fileManager = new GoogleAIFileManager(process.env.GEMINI_API_KEY!);

// Upload PDF
const uploadResult = await fileManager.uploadFile("research_paper.pdf", {
  mimeType: "application/pdf"
});

// Wait for processing
let file = await fileManager.getFile(uploadResult.file.name);
while (file.state === FileState.PROCESSING) {
  await new Promise(resolve => setTimeout(resolve, 5000));
  file = await fileManager.getFile(uploadResult.file.name);
}

// Analyze with medium resolution (recommended)
const model = genAI.getGenerativeModel({
  model: "gemini-3-pro-preview",
  generationConfig: {
    media_resolution: "medium"
  }
});

const result = await model.generateContent([
  `Analyze this PDF and extract:
  1. Main sections
  2. Key findings
  3. Important data`,
  { fileData: { fileUri: file.uri, mimeType: file.mimeType } }
]);

console.log(result.response.text());
```

PDF Processing Tips:
  • Recommended Resolution: `medium` (560 tokens/page) - saturation point for quality
  • Multi-page: Automatically processes all pages
  • Native Support: No conversion to images needed
  • Text Extraction: High-quality text extraction built-in

See: `references/document-processing.md` for advanced patterns

Task 5: Optimize Media Processing Costs

Goal: Balance quality and token consumption based on use case.
Strategy:
| Media Type | Resolution | Tokens | Use Case |
|------------|------------|--------|----------|
| Images | `low` | 280 | Quick scan, thumbnails |
| Images | `medium` | 560 | General analysis |
| Images | `high` | 1,120 | OCR, fine details, code |
| PDFs | `medium` | 560/page | Recommended (saturation point) |
| PDFs | `high` | 1,120/page | Diminishing returns |
| Video | `low` / `medium` | 70/frame | Most use cases |
| Video | `high` | 280/frame | OCR from video |

Python Optimization Example:
python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Different resolutions for different use cases
def analyze_image_optimized(image_path, need_ocr=False):
    """Analyze image with appropriate resolution"""
    resolution = "high" if need_ocr else "medium"

    model = genai.GenerativeModel(
        "gemini-3-pro-preview",
        generation_config={
            "media_resolution": resolution
        }
    )

    image_file = genai.upload_file(image_path)
    response = model.generate_content([
        "Describe this image" if not need_ocr else "Extract all text from this image",
        image_file
    ])

    # Log input token usage for cost tracking (output tokens are priced separately)
    tokens = response.usage_metadata.prompt_token_count
    cost = (tokens / 1_000_000) * 2.00  # Input pricing
    print(f"Resolution: {resolution}, Input tokens: {tokens}, Input cost: ${cost:.6f}")

    return response.text

# Use appropriate resolution
analyze_image_optimized("photo.jpg", need_ocr=False)  # medium
analyze_image_optimized("document.png", need_ocr=True)  # high

**Per-Item Resolution Control:**

```python
# Set different resolutions for different media in same request
response = model.generate_content([
    "Compare these images",
    {"file": image1, "media_resolution": "high"},  # High detail
    {"file": image2, "media_resolution": "low"},   # Low detail OK
])
```

**Cost Monitoring:**

```python
def log_media_costs(response):
    """Log media processing costs"""
    usage = response.usage_metadata

    # Pricing for ≤200k context
    input_cost = (usage.prompt_token_count / 1_000_000) * 2.00
    output_cost = (usage.candidates_token_count / 1_000_000) * 12.00

    print(f"Input tokens: {usage.prompt_token_count} (${input_cost:.6f})")
    print(f"Output tokens: {usage.candidates_token_count} (${output_cost:.6f})")
    print(f"Total cost: ${input_cost + output_cost:.6f}")
```

See: `references/token-optimization.md` for comprehensive strategies
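
For batch jobs, the per-response logging above can be extended to a running total. The following is a minimal sketch: `estimate_cost` and `CostTracker` are hypothetical names, and the prices are the per-million-token figures used in this guide for contexts of 200k tokens or fewer, which may change.

```python
# Prices per 1M tokens as used in this guide (<=200k context); verify before relying on them.
INPUT_PRICE_PER_MILLION = 2.00
OUTPUT_PRICE_PER_MILLION = 12.00


def estimate_cost(prompt_tokens, output_tokens):
    """Return input, output, and total cost in USD for one request."""
    input_cost = prompt_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


class CostTracker:
    """Accumulates token counts across several requests."""

    def __init__(self):
        self.prompt_tokens = 0
        self.output_tokens = 0

    def add(self, prompt_tokens, output_tokens):
        """Record the token counts reported for one response."""
        self.prompt_tokens += prompt_tokens
        self.output_tokens += output_tokens

    def total_cost(self):
        """Total estimated cost in USD for everything recorded so far."""
        return estimate_cost(self.prompt_tokens, self.output_tokens)["total"]
```

In practice, `add` would be called with `usage.prompt_token_count` and `usage.candidates_token_count` from each response, the same fields that `log_media_costs` reads.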

Media Resolution Control

Resolution Options

| Setting | Images | PDFs | Video (per frame) | Recommendation |
|---------|--------|------|-------------------|----------------|
| `low` | 280 tokens | 280 tokens | 70 tokens | Quick analysis, low detail |
| `medium` | 560 tokens | 560 tokens | 70 tokens | Balanced quality/cost |
| `high` | 1,120 tokens | 1,120 tokens | 280 tokens | OCR, fine text, details |
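
The resolution figures can be turned into a quick estimate for a mixed request. This is a sketch under the assumption that media tokens simply add up per image, per PDF page, and per video frame. The `estimate_media_tokens` helper and the `kind` labels are hypothetical, and prompt text tokens are not included.

```python
# Token costs per item, taken from the Resolution Options in this guide.
TOKENS_PER_ITEM = {
    "image": {"low": 280, "medium": 560, "high": 1120},
    "pdf_page": {"low": 280, "medium": 560, "high": 1120},
    "video_frame": {"low": 70, "medium": 70, "high": 280},
}


def estimate_media_tokens(items):
    """Sum media tokens for an iterable of (kind, resolution, count) tuples."""
    total = 0
    for kind, resolution, count in items:
        if count < 0:
            raise ValueError("count must be non-negative")
        try:
            per_item = TOKENS_PER_ITEM[kind][resolution]
        except KeyError:
            raise ValueError(f"Unknown kind/resolution: {kind!r}/{resolution!r}") from None
        total += per_item * count
    return total
```

For example, two `high` images plus a ten-page PDF at `medium` come to 2 × 1,120 + 10 × 560 = 7,840 media tokens.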

Configuration

Global Setting (all media):
python
model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={
        "media_resolution": "high"  # Applies to all media
    }
)
Per-Item Setting (mixed resolutions):
python
response = model.generate_content([
    "Analyze these files",
    {"file": high_detail_image, "media_resolution": "high"},
    {"file": low_detail_image, "media_resolution": "low"}
])

Best Practices

  1. Images: Use `high` for OCR/text extraction, `medium` for general analysis
  2. PDFs: Use `medium` (saturation point - higher resolutions show diminishing returns)
  3. Video: Use `low` or `medium` unless OCR is needed
  4. Cost Control: Start with `low` and increase only if quality is insufficient

See: `references/media-resolution.md` for detailed guide
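
These recommendations can be encoded as a small lookup so that calling code does not hard-code resolution strings. The sketch below is one possible reading of the list: both function names are hypothetical, and choosing `low` rather than `medium` for non-OCR video is an arbitrary pick between the two options the guide allows.

```python
RESOLUTION_LADDER = ["low", "medium", "high"]


def recommended_resolution(media_type, needs_ocr=False):
    """Map a media type and OCR requirement to a starting resolution."""
    media_type = media_type.lower()
    if media_type == "image":
        return "high" if needs_ocr else "medium"
    if media_type == "pdf":
        return "medium"  # described in this guide as the saturation point
    if media_type == "video":
        return "high" if needs_ocr else "low"
    raise ValueError(f"Unknown media type: {media_type!r}")


def next_resolution(current):
    """Return the next higher resolution, or None if already at the top."""
    index = RESOLUTION_LADDER.index(current)  # raises ValueError if unknown
    if index + 1 < len(RESOLUTION_LADDER):
        return RESOLUTION_LADDER[index + 1]
    return None
```

`next_resolution` supports the cost-control practice: start low, and step up only when the output quality is insufficient.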

File Management

Upload Files

python
import time

import google.generativeai as genai

# Upload file
file = genai.upload_file("path/to/file.jpg")
print(f"Uploaded: {file.name}")

# Check processing status
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)

print(f"Status: {file.state.name}")

List Uploaded Files

python
# List all files
for file in genai.list_files():
    print(f"{file.name} - {file.display_name}")

Delete Files

python
# Delete specific file
genai.delete_file(file.name)

# Delete all files
for file in genai.list_files():
    genai.delete_file(file.name)
    print(f"Deleted: {file.name}")

File Lifecycle

  • Upload: Immediate
  • Processing: Async (especially for video/audio)
  • Storage: Files persist until deleted or expired
  • Expiration: Files may expire after a retention period (check the official docs)

Multi-File Processing

Process Multiple Images

python
# Upload multiple images
images = [
    genai.upload_file("photo1.jpg"),
    genai.upload_file("photo2.jpg"),
    genai.upload_file("photo3.jpg")
]

# Analyze together
response = model.generate_content([
    "Compare these images and identify common elements",
    *images
])

print(response.text)

Mixed Media Types

python
# Combine different media types
image = genai.upload_file("chart.png")
pdf = genai.upload_file("report.pdf")

response = model.generate_content([
    "Does the chart match the data in the report?",
    image,
    pdf
])

---

References

Core Guides
  • Image Understanding - Complete image analysis patterns
  • Video Processing - Video analysis and frame extraction
  • Audio Processing - Audio transcription and analysis
  • Document Processing - PDF and document extraction
Optimization
  • Media Resolution - Resolution control and quality tuning
  • OCR Extraction - Text extraction best practices
  • Token Optimization - Cost control and efficiency
Scripts
  • Analyze Image Script - Production-ready image analysis
  • Process Video Script - Video processing automation
  • Process Audio Script - Audio transcription
  • Process PDF Script - PDF extraction
Official Resources

Related Skills

  • gemini-3-pro-api - Basic setup, authentication, text generation
  • gemini-3-image-generation - Image OUTPUT (generating images)
  • gemini-3-advanced - Function calling, tools, caching, batch processing

Common Use Cases

Visual Q&A Application

Combine image understanding with chat:
python
model = genai.GenerativeModel("gemini-3-pro-preview")
chat = model.start_chat()

# Upload image
image = genai.upload_file("product.jpg")

# Ask questions about it
response1 = chat.send_message(["What product is this?", image])
response2 = chat.send_message("What are its main features?")
response3 = chat.send_message("What's the price range for similar products?")

Document Analysis Pipeline

Process multiple PDFs and extract insights:
python
import json
import time
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    "gemini-3-pro-preview",
    generation_config={"media_resolution": "medium"}
)

# Process all PDFs in directory
pdf_dir = Path("documents/")
results = {}

for pdf_path in pdf_dir.glob("*.pdf"):
    pdf_file = genai.upload_file(pdf_path)

    # Wait for processing
    while pdf_file.state.name == "PROCESSING":
        time.sleep(5)
        pdf_file = genai.get_file(pdf_file.name)

    # Extract key information
    response = model.generate_content([
        "Extract: 1) Document type, 2) Key dates, 3) Important numbers, 4) Summary",
        pdf_file
    ])

    results[pdf_path.name] = response.text

    # Clean up
    genai.delete_file(pdf_file.name)

# Save results
with open("analysis_results.json", "w") as f:
    json.dump(results, f, indent=2)

Video Content Moderation

Analyze video for specific content:
python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-3-pro-preview")

video = genai.upload_file("user_upload.mp4")

# Wait for processing
while video.state.name == "PROCESSING":
    time.sleep(10)
    video = genai.get_file(video.name)

response = model.generate_content([
    """Analyze this video for:
1. Inappropriate content (yes/no)
2. Violence or harmful content (yes/no)
3. Overall content rating (G/PG/PG-13/R)
4. Brief justification

Provide structured response.
""",
    video
])
print(response.text)
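
The moderation reply is free text, so downstream code usually needs to pull the four answers out of it. The sketch below is one possible approach, not part of the Gemini API: `parse_moderation_reply` is a hypothetical helper, and it assumes the model answers each numbered item on its own line in the form `1. Label: answer`. The model is not guaranteed to follow that layout, so treat missing keys as "could not parse".

```python
import re

# Matches lines such as "3. Overall content rating: PG" (assumed reply layout).
_ITEM = re.compile(r"^\s*(\d+)[.)]\s*(.+?)\s*:\s*(.+?)\s*$")


def parse_moderation_reply(text):
    """Return {item_number: answer} for every line shaped like '1. Label: answer'."""
    answers = {}
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            answers[int(match.group(1))] = match.group(3)
    return answers
```

It could be called as `parse_moderation_reply(response.text)`; lines that do not fit the assumed layout are skipped rather than raising an error.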

---

Troubleshooting

Issue: File processing stuck at "PROCESSING"

Solution: Large files (especially video) can take time to process. Wait 30-60 seconds between status checks. If a file is still in the PROCESSING state after 5 minutes, it may have failed.
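
The advice above can be turned into a polling loop with a hard timeout. This is a minimal sketch: `wait_until_processed` and its parameters are hypothetical names, and the state lookup is passed in as a callable so the loop itself does not depend on any particular SDK call.

```python
import time


def wait_until_processed(fetch_state, poll_seconds=30.0, timeout_seconds=300.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it no longer returns "PROCESSING".

    Returns the final state name, or raises TimeoutError if the file is
    still processing once timeout_seconds have elapsed.
    """
    deadline = clock() + timeout_seconds
    state = fetch_state()
    while state == "PROCESSING":
        if clock() >= deadline:
            raise TimeoutError(f"still PROCESSING after {timeout_seconds} seconds")
        sleep(poll_seconds)
        state = fetch_state()
    return state
```

With the calls used elsewhere in this guide, it could be invoked as `wait_until_processed(lambda: genai.get_file(video.name).state.name)`. The defaults mirror the 30-second interval and 5-minute limit suggested above.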

Issue: Low quality OCR results

Solution: Use media_resolution: "high" for images with text. Ensure the image is clear and high resolution.

Issue: High token costs

Solution: Use an appropriate media resolution. Start with low and increase only if needed. For PDFs, medium is usually sufficient.
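
One way to apply the "start low, increase only if needed" advice is to try each resolution in order and stop at the first acceptable answer. A minimal sketch, assuming a hypothetical `analyze(resolution)` callable that wraps your own model call and a hypothetical `is_good_enough(result)` check that you define for your task:

```python
RESOLUTIONS = ("low", "medium", "high")


def analyze_with_escalation(analyze, is_good_enough, resolutions=RESOLUTIONS):
    """Try each resolution from cheapest to most expensive.

    Returns (resolution, result) for the first result that passes
    is_good_enough, or the last attempt if none of them pass.
    """
    if not resolutions:
        raise ValueError("at least one resolution is required")
    for resolution in resolutions:
        result = analyze(resolution)
        if is_good_enough(result):
            break
    return resolution, result
```

Each failed attempt still consumes tokens, so escalation only saves money when the lower resolutions succeed most of the time. For inputs that are known to need detail, such as dense text, going straight to a higher setting is likely cheaper.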

Issue: Video analysis missing details

Solution: Use media_resolution: "high" for better frame analysis, or provide more specific prompts about what to look for.

Issue: Audio transcription inaccurate

Solution: Ensure the audio quality is good (no excessive background noise). Provide context in the prompt about the speaker's accent, the language, or the subject domain.
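
Adding that context can be as simple as composing the prompt from whatever is known about the recording. The sketch below is illustrative only: `build_transcription_prompt` is a hypothetical helper, and the wording of each hint is an assumption rather than a recommended phrasing.

```python
def build_transcription_prompt(language=None, accent=None, domain=None):
    """Compose a transcription prompt that includes whatever context is known."""
    hints = []
    if language:
        hints.append(f"The speech is in {language}.")
    if accent:
        hints.append(f"The speaker has a {accent} accent.")
    if domain:
        hints.append(f"The recording is about {domain}; expect domain-specific terms.")
    return " ".join(["Transcribe this audio accurately."] + hints)
```

The resulting string can be passed alongside an uploaded audio file in the same way as the other examples in this guide, for instance `model.generate_content([build_transcription_prompt(language="English", domain="cardiology"), audio_file])`, where `audio_file` is assumed to come from `genai.upload_file`.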


Summary

This skill provides comprehensive multimodal input processing capabilities:
✅ Image analysis with OCR and object detection
✅ Video processing up to 1 hour
✅ Audio transcription up to 9.5 hours
✅ Native PDF document processing
✅ Granular media resolution control
✅ Token optimization strategies
✅ Multi-file processing
✅ Production-ready examples
Ready to analyze multimodal content? Start with the task that matches your use case above!