gemini-document-processing

Gemini Document Processing

Process and analyze PDF documents using Google Gemini's native vision capabilities. Extract structured information, summarize content, answer questions, and understand complex documents with text, images, diagrams, charts, and tables.

Core Capabilities

  • PDF Vision Processing: Native understanding of PDFs up to 1,000 pages (258 tokens/page)
  • Multimodal Analysis: Process text, images, diagrams, charts, and tables
  • Structured Extraction: Output to JSON with schema validation
  • Document Q&A: Answer questions based on document content
  • Summarization: Generate summaries preserving context
  • Format Conversion: Transcribe to HTML while preserving layout

When to Use This Skill

Use this skill when you need to:
  • Extract structured data from PDF documents (invoices, resumes, forms)
  • Summarize long documents or reports
  • Answer questions about PDF content
  • Analyze documents with complex layouts, charts, or diagrams
  • Convert PDFs to structured formats (JSON, HTML)
  • Process multiple documents in batch
  • Build document processing pipelines

Quick Setup

1. API Key Configuration

The skill supports both Google AI Studio and Vertex AI endpoints.

Option 1: Google AI Studio (Default)

The skill checks for `GEMINI_API_KEY` in this priority order:
  1. Process environment variable
  2. Project root `.env`
  3. `.claude/.env`
  4. `.claude/skills/.env`
  5. `.env` file in skill directory (`.claude/skills/gemini-document-processing/.env`)
Environment Variable (Recommended)
bash
export GEMINI_API_KEY="your-api-key-here"
Or in .env file:
bash
echo "GEMINI_API_KEY=your-api-key-here" > .env
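The lookup order for `GEMINI_API_KEY` can be sketched in plain Python. This is a minimal illustration, not the skill's actual loader: `resolve_api_key` and `read_env_file` are hypothetical names, and the parsing handles only simple `KEY=value` lines, whereas a real `.env` loader such as python-dotenv covers more syntax.

```python
import os
from pathlib import Path

# Skill directory relative to the project root, as listed in the priority order
SKILL_DIR = Path(".claude/skills/gemini-document-processing")


def read_env_file(path, name):
    """Return the value of `name` from a simple KEY=value file, or None."""
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")
    return None


def resolve_api_key(project_root=".", environ=None, name="GEMINI_API_KEY"):
    """Check the process environment first, then each .env file in priority order."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    root = Path(project_root)
    candidates = [
        root / ".env",
        root / ".claude" / ".env",
        root / ".claude" / "skills" / ".env",
        root / SKILL_DIR / ".env",
    ]
    for candidate in candidates:
        value = read_env_file(candidate, name)
        if value:
            return value
    return None
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.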

Option 2: Vertex AI

To use Vertex AI instead:
```bash
# Enable Vertex AI
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional, defaults to us-central1
```

Or in `.env` file:
```bash
GEMINI_USE_VERTEX=true
VERTEX_PROJECT_ID=your-gcp-project-id
VERTEX_LOCATION=us-central1
```
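How these variables might map onto the SDK client is sketched below. The helper name `client_kwargs_from_env` is hypothetical and the skill's own script may wire this differently; treating only the string `true` (case-insensitive) as enabling Vertex AI is also an assumption. The `vertexai`, `project`, `location`, and `api_key` keyword arguments are those accepted by `genai.Client` in the google-genai SDK.

```python
import os


def client_kwargs_from_env(environ=None):
    """Translate the skill's environment variables into genai.Client keyword arguments."""
    environ = os.environ if environ is None else environ
    # Assumption: only the literal value "true" switches the client to Vertex AI
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Documented default when VERTEX_LOCATION is unset
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": environ.get("GEMINI_API_KEY")}


# Usage (requires `pip install google-genai`):
# from google import genai
# client = genai.Client(**client_kwargs_from_env())
```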

2. Install Dependencies

bash
pip install google-genai python-dotenv

Common Use Cases

1. Extract Structured Data from PDF

bash
# Use the provided script
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file invoice.pdf \
  --prompt "Extract invoice details as JSON" \
  --format json

2. Summarize Long Document

bash
# Process and summarize
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file report.pdf \
  --prompt "Provide a concise executive summary"

3. Answer Questions About Document

bash
# Q&A on document content
python .claude/skills/gemini-document-processing/scripts/process-document.py \
  --file contract.pdf \
  --prompt "What are the key terms and conditions?"

4. Process with Python SDK

python
from google import genai

client = genai.Client()

# Read PDF
with open('document.pdf', 'rb') as f:
    pdf_data = f.read()

# Process document
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract key information from this document',
        genai.types.Part.from_bytes(
            data=pdf_data,
            mime_type='application/pdf'
        )
    ]
)
print(response.text)
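For PDFs above the 20MB inline limit, or when several questions target the same file, the upload can happen once through the File API and the returned handle can be reused. The sketch below assumes the google-genai `client.files.upload(file=...)` call; `ask_uploaded_pdf` is a hypothetical helper, and the client is passed in so the function holds no credentials of its own.

```python
def ask_uploaded_pdf(client, pdf_path, prompts, model="gemini-2.5-flash"):
    """Upload a PDF once via the File API, then ask each prompt against it."""
    uploaded = client.files.upload(file=pdf_path)
    answers = {}
    for prompt in prompts:
        response = client.models.generate_content(
            model=model,
            contents=[uploaded, prompt],
        )
        answers[prompt] = response.text
    return answers


# Usage (requires `pip install google-genai` and a configured API key):
# from google import genai
# answers = ask_uploaded_pdf(genai.Client(), "large-document.pdf",
#                            ["Summarize", "List the key dates"])
```

Uploaded files are kept for 48 hours only, so a long-running pipeline would need to re-upload after that window.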

5. Structured Output with JSON Schema

python
from google import genai
from pydantic import BaseModel

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str

client = genai.Client()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract invoice details',
        genai.types.Part.from_bytes(
            data=open('invoice.pdf', 'rb').read(),
            mime_type='application/pdf'
        )
    ],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=InvoiceData
    )
)

invoice_data = InvoiceData.model_validate_json(response.text)

Key Constraints

  • Format: Only PDFs get vision processing (TXT, HTML, Markdown are text-only)
  • Size: Use inline encoding for PDFs under 20MB; use the File API for larger files
  • Pages: Max 1,000 pages per document
  • Storage: File API stores for 48 hours only
  • Cost: 258 tokens per page (fixed, regardless of content density)
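Because the per-page cost is fixed, the input size of a PDF can be budgeted from its page count alone. A small sketch with a hypothetical helper, `estimate_pdf_tokens`; it covers only the PDF pages, not the prompt text or the model's output tokens.

```python
TOKENS_PER_PAGE = 258   # fixed per-page cost from the constraints above
MAX_PAGES = 1000        # per-document page limit from the constraints above


def estimate_pdf_tokens(page_count):
    """Estimate input tokens for a PDF from its page count alone."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    return page_count * TOKENS_PER_PAGE


print(estimate_pdf_tokens(40))    # 10320
print(estimate_pdf_tokens(1000))  # 258000
```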

Performance Tips

  1. Use Inline Encoding for PDFs < 20MB (simpler, single request)
  2. Use File API for larger files or repeated queries (enables context caching)
  3. Place Prompt After PDF for single-page documents
  4. Use Context Caching when querying same PDF multiple times
  5. Process in Parallel for multiple independent documents
  6. Use gemini-2.5-flash for best price/performance ratio
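Tip 5 can be sketched with a thread pool, which suits this workload because each request spends most of its time waiting on the network. `process_many` is a hypothetical helper and `process_one` stands for whatever per-document function you already have (for example, a wrapper around `generate_content`); the default of 4 workers is an arbitrary choice that should be tuned to your API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def process_many(paths, process_one, max_workers=4):
    """Run process_one(path) for each path concurrently; results keep input order."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(process_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep one failure from aborting the batch
                results[path] = exc
    return results
```

Failures are returned as exception objects next to the successful results, so the caller can decide whether to retry individual documents.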

Decision Guide

PDF < 20MB?
├─ Yes → Use inline base64 encoding
└─ No  → Use File API

Need structured JSON output?
├─ Yes → Define response_schema with Pydantic
└─ No  → Get text response

Multiple queries on same PDF?
├─ Yes → Use File API + Context Caching
└─ No  → Inline encoding is sufficient
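The three questions in the guide can be folded into one small function. This is an illustrative sketch: `plan_request` and the string labels it returns are hypothetical, and reading "20MB" as 20 × 1024 × 1024 bytes is an assumption.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # the 20MB threshold, read here as MiB


def plan_request(size_bytes, want_json=False, repeated_queries=False):
    """Map the three questions of the decision guide onto a request plan."""
    # Large files and repeated queries both lead to the File API branch
    use_file_api = size_bytes >= INLINE_LIMIT_BYTES or repeated_queries
    return {
        "upload": "file_api" if use_file_api else "inline",
        "context_caching": repeated_queries,
        "output": "json_schema" if want_json else "text",
    }


print(plan_request(3_000_000, want_json=True))
# {'upload': 'inline', 'context_caching': False, 'output': 'json_schema'}
```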

Script Reference

The skill includes a ready-to-use processing script:
bash
# Basic usage
python scripts/process-document.py --file document.pdf --prompt "Your prompt"

# With JSON output
python scripts/process-document.py --file document.pdf --prompt "Extract data" --format json

# With File API (for large files)
python scripts/process-document.py --file large-document.pdf --prompt "Summarize" --use-file-api

# Multiple prompts
python scripts/process-document.py --file document.pdf --prompt "Question 1" --prompt "Question 2"
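The script's source is not reproduced here, but a command line with these flags could be declared with `argparse` roughly as follows. Treat it as a guess at the shape, not the script's real code: `build_parser` is a hypothetical name, and the `text` default for `--format` is an assumption, since only `json` appears in the examples.

```python
import argparse


def build_parser():
    """Declare the flags used in the examples: --file, --prompt, --format, --use-file-api."""
    parser = argparse.ArgumentParser(description="Process a document with Gemini")
    parser.add_argument("--file", required=True, help="Path to the document")
    # action="append" collects every repeated --prompt into a list
    parser.add_argument("--prompt", action="append", required=True,
                        help="Prompt to run; repeat the flag for several prompts")
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format (the 'text' default is assumed)")
    parser.add_argument("--use-file-api", action="store_true",
                        help="Upload through the File API instead of inline bytes")
    return parser


args = build_parser().parse_args(
    ["--file", "document.pdf", "--prompt", "Question 1", "--prompt", "Question 2"]
)
print(args.prompt)        # ['Question 1', 'Question 2']
print(args.use_file_api)  # False
```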

References

For comprehensive documentation, see:
  • references/gemini-document-processing-report.md - Complete API reference
  • references/quick-reference.md - Quick lookup guide
  • references/code-examples.md - Additional code patterns

Troubleshooting

API Key Not Found:
bash
# Check API key is set
./scripts/check-api-key.sh

File Too Large:
  • Use File API for files > 20MB
  • Add `--use-file-api` flag to the script

Vision Not Working:
  • Ensure file is PDF format
  • Other formats (TXT, HTML) don't support vision processing
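Two of these problems, a non-PDF input and an oversized file, can be caught before any request is sent. The check below is a sketch with a hypothetical `preflight` helper; it relies on the fact that PDF files begin with the bytes `%PDF-`, and it reads the 20MB threshold as 20 × 1024 × 1024 bytes, which is an assumption.

```python
from pathlib import Path

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20MB threshold, read here as MiB


def preflight(path):
    """Return a list of human-readable problems; an empty list means none found."""
    file = Path(path)
    if not file.is_file():
        return [f"{path}: file not found"]
    problems = []
    with file.open("rb") as handle:
        if handle.read(5) != b"%PDF-":  # PDF files start with this signature
            problems.append("not a PDF: vision processing will not apply")
    if file.stat().st_size > INLINE_LIMIT_BYTES:
        problems.append("larger than 20MB: pass --use-file-api")
    return problems
```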

Support
