gemini-document-processing


Gemini Document Processing

Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.

Core Capabilities

PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
Multimodal Analysis: Process text, images, diagrams, charts, and tables
Structured Extraction: Output to JSON with schema validation
Document Q&A: Answer questions based on document content
Summarization: Generate summaries preserving context
Format Conversion: Transcribe to HTML while preserving layout

When to Use This Skill

Use this skill when you need to:

Extract structured data from PDF documents (invoices, resumes, forms)
Summarize long documents or reports
Answer questions about PDF content
Analyze documents with complex layouts, charts, or diagrams
Convert PDFs to structured formats (JSON, HTML)
Process multiple documents in batch
Build document processing pipelines

Quick Setup

1. API Key Configuration

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill checks for `GEMINI_API_KEY` in this priority order:

1. Process environment variable
2. Project root `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.env` file in the skill directory (`.claude/skills/gemini-document-processing/.env`)

Get your API key: https://aistudio.google.com/apikey

Environment Variable (Recommended)

bash

export GEMINI_API_KEY="your-api-key-here"

Or in .env file:

bash

echo "GEMINI_API_KEY=your-api-key-here" > .env
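The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the priority rule, not the skill's actual implementation: the names `resolve_api_key`, `read_env_file`, and `ENV_FILE_CANDIDATES` are hypothetical, and the sketch parses simple `KEY=VALUE` lines by hand, whereas the real script may rely on `python-dotenv`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Candidate .env locations, highest priority first, relative to the project root.
ENV_FILE_CANDIDATES = (
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-document-processing/.env",
)


def read_env_file(path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=VALUE file, or None."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None


def resolve_api_key(
    project_root: Path, environ: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Check the process environment first, then each .env file in order."""
    value = environ.get("GEMINI_API_KEY")
    if value:
        return value
    for relative in ENV_FILE_CANDIDATES:
        value = read_env_file(project_root / relative, "GEMINI_API_KEY")
        if value:
            return value
    return None
```

Calling `resolve_api_key(Path.cwd())` from the project root returns the first key found, or `None` if no location provides one.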

Option 2: Vertex AI

To use Vertex AI instead:

bash

# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1

Or in `.env` file:
```bash
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
```
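As a rough sketch of how these variables could be turned into SDK client settings, the helper below maps them onto keyword arguments for `genai.Client`. The function name `client_kwargs_from_env` is hypothetical, and treating only the literal string `true` (case-insensitive) as enabling Vertex AI is an assumption; the skill's own script may interpret the flag differently.

```python
import os
from typing import Any, Dict, Mapping


def client_kwargs_from_env(environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keyword arguments for genai.Client from the variables above."""
    # Assumption: only the string "true" (any case) switches to Vertex AI.
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID must be set when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # The location is optional and falls back to us-central1.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}
```

With `google-genai` installed, the result can be passed straight through: `client = genai.Client(**client_kwargs_from_env())`.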

2. Install Dependencies

bash

pip install google-genai python-dotenv

Common Use Cases

1. Extract Structured Data from PDF

bash

# Use the provided script
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file invoice.pdf \
  --prompt "Extract invoice details as JSON" \
  --format json

2. Summarize Long Document

bash

# Process and summarize
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file report.pdf \
  --prompt "Provide a concise executive summary"

3. Answer Questions About Document

bash

# Q&A on document content
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file contract.pdf \
  --prompt "What are the key terms and conditions?"

4. Process with Python SDK

python

from google import genai

client = genai.Client()

# Read PDF
with open('document.pdf', 'rb') as f:
    pdf_data = f.read()

# Process document
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract key information from this document',
        genai.types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ]
)

print(response.text)

5. Structured Output with JSON Schema

python

from google import genai
from pydantic import BaseModel

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str

client = genai.Client()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract invoice details',
        genai.types.Part.from_bytes(
            data=open('invoice.pdf', 'rb').read(),
            mime_type='application/pdf'
        )
    ],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=InvoiceData
    )
)

invoice_data = InvoiceData.model_validate_json(response.text)

Key Constraints

Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
Size: < 20MB use inline encoding, > 20MB use File API
Pages: Max 1,000 pages per document
Storage: File API stores uploaded files for 48 hours only
Cost: 258 tokens per page (fixed, regardless of content density)
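The page and size limits above lend themselves to a quick pre-flight check. The sketch below is illustrative only: the function names are hypothetical, 1MB is assumed to mean 1024 * 1024 bytes, and a file of exactly 20MB is routed to the File API because the constraints do not say which side of the boundary it falls on. The estimate covers the PDF pages only, not the prompt or the response.

```python
TOKENS_PER_PAGE = 258                   # fixed per-page cost stated above
MAX_PAGES = 1_000                       # per-document page limit stated above
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20MB, assuming 1MB = 1024 * 1024 bytes


def estimate_pdf_tokens(page_count: int) -> int:
    """Estimate input tokens consumed by the PDF pages alone."""
    if not 1 <= page_count <= MAX_PAGES:
        raise ValueError(f"page_count must be between 1 and {MAX_PAGES}")
    return page_count * TOKENS_PER_PAGE


def choose_upload_method(size_bytes: int) -> str:
    """Return 'inline' for files under the limit, otherwise 'file_api'."""
    return "inline" if size_bytes < INLINE_LIMIT_BYTES else "file_api"


print(estimate_pdf_tokens(12))                 # 12 * 258 = 3096
print(choose_upload_method(5 * 1024 * 1024))   # inline
```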

Performance Tips

Use Inline Encoding for PDFs < 20MB (simpler, single request)
Use File API for larger files or repeated queries (enables context caching)
Place Prompt After PDF for single-page documents
Use Context Caching when querying same PDF multiple times
Process in Parallel for multiple independent documents
Use gemini-2.5-flash for best price/performance ratio
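For the parallel-processing tip, one possible shape is a thread pool that fans independent documents out to a worker function. Threads are a reasonable fit here on the assumption that each call spends most of its time waiting on the network. The names `process_in_parallel` and `process_one` are hypothetical; `process_one` stands in for whatever function sends a single PDF to Gemini and returns its text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def process_in_parallel(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run process_one on every path concurrently and map path -> result."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order and re-raises worker errors.
        results = list(pool.map(process_one, path_list))
    return dict(zip(path_list, results))
```

A small `max_workers` value is a cautious default; how far it can be raised depends on the rate limits of the account in use.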

Decision Guide

PDF < 20MB?
├─ Yes → Use inline base64 encoding
└─ No  → Use File API

Need structured JSON output?
├─ Yes → Define response_schema with Pydantic
└─ No  → Get text response

Multiple queries on same PDF?
├─ Yes → Use File API + Context Caching
└─ No  → Inline encoding is sufficient
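For the File API branch of the guide, a minimal sketch might upload the PDF once and reuse the returned file handle for every question. The helper name `ask_many` is hypothetical, and the `file=` keyword of `client.files.upload` is assumed from recent `google-genai` releases. Context caching is not shown here; this only avoids re-sending the file bytes with each request.

```python
from typing import Any, Dict, Sequence


def ask_many(
    client: Any,
    pdf_path: str,
    prompts: Sequence[str],
    model: str = "gemini-2.5-flash",
) -> Dict[str, str]:
    """Upload a PDF once through the File API, then run each prompt against it."""
    # One upload; the returned file object is reused in every request below.
    uploaded = client.files.upload(file=pdf_path)
    answers: Dict[str, str] = {}
    for prompt in prompts:
        response = client.models.generate_content(
            model=model,
            contents=[uploaded, prompt],
        )
        answers[prompt] = response.text
    return answers
```

With a real client this would be called as `ask_many(genai.Client(), "large-document.pdf", ["Question 1", "Question 2"])`.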

Script Reference

The skill includes a ready-to-use processing script:

bash

# Basic usage
python scripts/process-document.py --file document.pdf --prompt "Your prompt"

# With JSON output
python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json

# With File API (for large files)
python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api

# Multiple prompts
python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"
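The script's source is not reproduced here, so the following is only a guess at how the flags used above could be declared with `argparse`. The flag names come from the usage examples; the `text` default for `--format`, the help strings, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Process a document with Gemini")
    parser.add_argument("--file", required=True, help="Path to the document")
    # action="append" collects every repeated --prompt into a list.
    parser.add_argument(
        "--prompt",
        action="append",
        required=True,
        help="Prompt to run; repeat the flag for several prompts",
    )
    # Assumption: plain text is the default output format.
    parser.add_argument("--format", choices=["text", "json"], default="text")
    parser.add_argument(
        "--use-file-api",
        action="store_true",
        help="Upload through the File API instead of inline encoding",
    )
    return parser


args = build_parser().parse_args(
    ["--file", "document.pdf", "--prompt", "Question 1", "--prompt", "Question 2"]
)
print(args.prompt)  # ['Question 1', 'Question 2']
```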

References

For comprehensive documentation, see:

- `references/gemini-document-processing-report.md` - Complete API reference
- `references/quick-reference.md` - Quick lookup guide
- `references/code-examples.md` - Additional code patterns

Troubleshooting

**API Key Not Found:**

bash

# Check API key is set
./scripts/check-api-key.sh

**File Too Large:**
- Use File API for files > 20MB
- Add `--use-file-api` flag to the script

**Vision Not Working:**
- Ensure file is PDF format
- Other formats (TXT, HTML) don't support vision processing

Support

API Documentation: https://ai.google.dev/gemini-api/docs/document-processing
Get API Key: https://aistudio.google.com/apikey
Model Info: https://ai.google.dev/gemini-api/docs/models/gemini
