gemini-document-processing
Gemini Document Processing
Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.
Core Capabilities
- PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
- Multimodal Analysis: Process text, images, diagrams, charts, and tables
- Structured Extraction: Output to JSON with schema validation
- Document Q&A: Answer questions based on document content
- Summarization: Generate summaries preserving context
- Format Conversion: Transcribe to HTML while preserving layout
When to Use This Skill
Use this skill when you need to:
- Extract structured data from PDF documents (invoices, resumes, forms)
- Summarize long documents or reports
- Answer questions about PDF content
- Analyze documents with complex layouts, charts, or diagrams
- Convert PDFs to structured formats (JSON, HTML)
- Process multiple documents in batch
- Build document processing pipelines
Quick Setup
1. API Key Configuration
The skill supports both Google AI Studio and Vertex AI endpoints.
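As a rough sketch of how a script might pick between the two endpoints, the snippet below reads the environment variables described under Option 1 and Option 2. The helper names `client_kwargs` and `make_client` are hypothetical and not part of the skill, and the skill's own script may resolve its configuration differently.

```python
import os
from typing import Mapping


def client_kwargs(environ: Mapping[str, str]) -> dict:
    """Build keyword arguments for genai.Client from environment settings."""
    if environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true":
        return {
            "vertexai": True,
            "project": environ["VERTEX_PROJECT_ID"],
            # Falls back to us-central1, the default this document states.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    # Default path: Google AI Studio with an API key.
    return {"api_key": environ["GEMINI_API_KEY"]}


def make_client(environ: Mapping[str, str] = os.environ):
    """Create a client; the SDK is imported lazily so client_kwargs has no dependencies."""
    from google import genai  # requires: pip install google-genai

    return genai.Client(**client_kwargs(environ))
```

Keeping the decision in a plain function makes it easy to check without network access or credentials.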
Option 1: Google AI Studio (Default)
The skill checks for `GEMINI_API_KEY` in this priority order:
- Process environment variable
- `.env` file in the project root
- `.claude/.env` file
- `.claude/skills/.env` file
- `.env` file in the skill directory (`.claude/skills/gemini-document-processing/.env`)
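The lookup order above can be sketched roughly as follows. This is an illustration, not the skill's actual implementation: `find_api_key` and `read_env_file` are hypothetical names, and the small parser only handles plain `KEY=VALUE` lines. The skill installs python-dotenv, which would normally do the parsing; the hand-rolled version here just keeps the sketch dependency-free.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Candidate .env files, highest priority first (checked after the process environment).
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/gemini-document-processing/.env",
]


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_api_key(
    root: Path = Path("."),
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None if no source defines it."""
    environ = os.environ if environ is None else environ
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    for rel in ENV_FILES:
        key = read_env_file(root / rel).get("GEMINI_API_KEY")
        if key:
            return key
    return None
```

Calling `find_api_key()` from the project root returns the key from the first source that defines it.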
Get your API key: https://aistudio.google.com/apikey
Environment Variable (Recommended)
```bash
export GEMINI_API_KEY="your-api-key-here"
```
Or in .env file:
```bash
echo "GEMINI_API_KEY=your-api-key-here" > .env
```
Option 2: Vertex AI
To use Vertex AI instead:
```bash
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1
```
Or in `.env` file:
```bash
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
```
2. Install Dependencies
```bash
pip install google-genai python-dotenv
```
Common Use Cases
1. Extract Structured Data from PDF
```bash
# Use the provided script
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file invoice.pdf \
  --prompt "Extract invoice details as JSON" \
  --format json
```
2. Summarize Long Document
```bash
# Process and summarize
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file report.pdf \
  --prompt "Provide a concise executive summary"
```
3. Answer Questions About Document
```bash
# Q&A on document content
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file contract.pdf \
  --prompt "What are the key terms and conditions?"
```
4. Process with Python SDK
```python
from google import genai

client = genai.Client()

# Read PDF
with open('document.pdf', 'rb') as f:
    pdf_data = f.read()

# Process document
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract key information from this document',
        genai.types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ]
)
print(response.text)
```
5. Structured Output with JSON Schema
```python
from google import genai
from pydantic import BaseModel


# Schema the model's JSON output must follow
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str


client = genai.Client()

# Read the PDF and close the file handle promptly
with open('invoice.pdf', 'rb') as f:
    pdf_data = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract invoice details',
        genai.types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=InvoiceData
    )
)

# Validate the returned JSON against the schema
invoice_data = InvoiceData.model_validate_json(response.text)
```
Key Constraints
- Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
- Size: Use inline encoding for PDFs under 20MB and the File API for larger files
- Pages: Max 1,000 pages per document
- Storage: The File API keeps uploaded files for 48 hours only
- Cost: 258 tokens per page (fixed, regardless of content density)
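Because the per-page cost is fixed, the page tokens for a document can be estimated before sending it. A minimal sketch, using only the figures listed above; `estimate_page_tokens` is a hypothetical helper, and the estimate covers the PDF pages only, not the prompt text or the model's output.

```python
TOKENS_PER_PAGE = 258  # fixed per-page cost stated in the constraints above
MAX_PAGES = 1_000      # per-document page limit


def estimate_page_tokens(pages: int) -> int:
    """Estimate input tokens consumed by the PDF pages alone."""
    if not 0 < pages <= MAX_PAGES:
        raise ValueError(f"pages must be between 1 and {MAX_PAGES}, got {pages}")
    return pages * TOKENS_PER_PAGE


print(estimate_page_tokens(50))     # 12900
print(estimate_page_tokens(1_000))  # 258000
```

A document at the 1,000-page limit therefore accounts for 258,000 input tokens on its own.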
Performance Tips
- Use Inline Encoding for PDFs < 20MB (simpler, single request)
- Use File API for larger files or repeated queries (enables context caching)
- Place Prompt After PDF for single-page documents
- Use Context Caching when querying same PDF multiple times
- Process in Parallel for multiple independent documents
- Use gemini-2.5-flash for best price/performance ratio
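For the parallel-processing tip, one possible approach is a small thread pool, since each request spends most of its time waiting on the network. The sketch below is an assumption about how this could be wired up, not the skill's own code: `process_many` and `summarize_pdf` are hypothetical names, and `max_workers=4` is an arbitrary starting point that should be tuned against your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run process_one on each path concurrently and map path -> result."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; an exception raised by any
        # worker is re-raised here when its result is consumed.
        results = pool.map(process_one, paths)
        return dict(zip(paths, results))


def summarize_pdf(path: str) -> str:
    """Example worker: summarize one PDF with inline encoding."""
    from google import genai  # requires: pip install google-genai
    from google.genai import types

    client = genai.Client()
    with open(path, "rb") as f:
        pdf_data = f.read()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=pdf_data, mime_type="application/pdf"),
            "Provide a concise executive summary",
        ],
    )
    return response.text
```

Usage would look like `process_many(["a.pdf", "b.pdf"], summarize_pdf)`. Passing the worker as a parameter keeps the pooling logic independent of the API call.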
Decision Guide
PDF < 20MB?
├─ Yes → Use inline base64 encoding
└─ No → Use File API
Need structured JSON output?
├─ Yes → Define response_schema with Pydantic
└─ No → Get text response
Multiple queries on same PDF?
├─ Yes → Use File API + Context Caching
└─ No → Inline encoding is sufficient
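The decision guide can be expressed as a small function. This is a sketch under stated assumptions: `Plan`, `plan_request`, and `pdf_content` are hypothetical names, "20MB" is read as 20 MiB, and a file of exactly that size is sent to the File API to stay on the safe side.

```python
from dataclasses import dataclass

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # "20MB" read as 20 MiB (assumption)


@dataclass(frozen=True)
class Plan:
    use_file_api: bool
    use_context_cache: bool
    use_response_schema: bool


def plan_request(size_bytes: int, repeated_queries: bool, want_json: bool) -> Plan:
    """Apply the three questions of the decision guide."""
    # Large files need the File API; repeated queries use it to enable caching.
    use_file_api = size_bytes >= INLINE_LIMIT_BYTES or repeated_queries
    return Plan(
        use_file_api=use_file_api,
        use_context_cache=repeated_queries,
        use_response_schema=want_json,
    )


def pdf_content(client, path: str, plan: Plan):
    """Return the PDF in a form that can be placed in `contents`."""
    if plan.use_file_api:
        # Uploads via the File API; uploaded files are kept for 48 hours.
        return client.files.upload(file=path)
    from google.genai import types  # requires: pip install google-genai

    with open(path, "rb") as f:
        return types.Part.from_bytes(data=f.read(), mime_type="application/pdf")
```

The value returned by `pdf_content` can be passed alongside the prompt in the `contents` list of `client.models.generate_content`.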
Script Reference
The skill includes a ready-to-use processing script:
```bash
# Basic usage
python scripts/process-document.py --file document.pdf --prompt "Your prompt"

# With JSON output
python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json

# With File API (for large files)
python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api

# Multiple prompts
python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"
```
References
For comprehensive documentation, see:
- `references/gemini-document-processing-report.md` - Complete API reference
- `references/quick-reference.md` - Quick lookup guide
- `references/code-examples.md` - Additional code patterns
Troubleshooting
**API Key Not Found:**
```bash
# Check API key is set
./scripts/check-api-key.sh
```
**File Too Large:**
- Use File API for files > 20MB
- Add `--use-file-api` flag to the script
**Vision Not Working:**
- Ensure file is PDF format
- Other formats (TXT, HTML) don't support vision processing
Support
- API Documentation: https://ai.google.dev/gemini-api/docs/document-processing
- Get API Key: https://aistudio.google.com/apikey
- Model Info: https://ai.google.dev/gemini-api/docs/models/gemini