Vision & Multimodal Skill

视觉与多模态Skill

Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.

利用Claude的视觉能力进行图像分析、文档处理和多模态理解。

When to Use This Skill

何时使用该Skill

Image analysis and description
Document/PDF processing
Screenshot analysis
OCR-like text extraction
Visual comparison
Chart and diagram interpretation

图像分析与描述
文档/PDF处理
截图分析
类OCR文本提取
视觉对比
图表与示意图解读

Supported Formats

支持的格式

Format	Status	Best For
JPEG	✓	Photos, natural scenes
PNG	✓	Screenshots, UI, text
GIF	✓	Animated (first frame)
WebP	✓	Modern, compressed
PDF	✓	Documents (via Files API)

格式	状态	最佳适用场景
JPEG	✓	照片、自然场景
PNG	✓	截图、UI界面、文本
GIF	✓	动图（仅分析第一帧）
WebP	✓	现代压缩格式
PDF	✓	文档（通过Files API）

Image Size Guidelines

图像尺寸指南

Minimum: 200 pixels (smaller = reduced accuracy)
Optimal: 1000x1000 pixels
Maximum: 8000x8000 pixels
Token cost: ~(width × height) / 1000
Tip: Resize to 1568px max dimension for 30-50% token savings

最小尺寸： 200像素（尺寸越小，准确率越低）
最佳尺寸： 1000x1000像素
最大尺寸： 8000x8000像素
Token成本： ~(宽度 × 高度) / 1000
提示： 将最大维度调整为1568px可节省30-50%的Token

Core Patterns

核心使用模式

Pattern 1: Single Image Analysis

模式1：单图分析

python

import anthropic
import base64

client = anthropic.Anthropic()

python

import anthropic
import base64

client = anthropic.Anthropic()

Load and encode image

with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )

undefined

with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )

undefined

Pattern 2: Image from URL

模式2：从URL获取图像

python

import httpx

python

import httpx

Fetch and encode from URL

image_url = "https://example.com/image.jpg" response = httpx.get(image_url) image_data = base64.standard_b64encode(response.content).decode("utf-8")

Then use same pattern as above

undefined

undefined

Pattern 3: Multiple Images

模式3：多图处理

python

undefined

python

undefined

Compare multiple images (up to 100 per request)

messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]

undefined

messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]

undefined

Pattern 4: Few-Shot with Images

模式4：带图像的少样本示例

python

undefined

python

undefined

Teach by example

messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}

]

undefined

messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},

# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}

]

undefined

Pattern 5: PDF Processing

模式5：PDF处理

python

undefined

python

undefined

Using Files API (beta)

with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, {"type": "text", "text": "Summarize this document."} ] }] )

undefined

with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, { "type": "text", "text": "Summarize this document."} } ] }] )

undefined

Prompt Engineering for Vision

视觉任务的提示词工程

Strategy 1: Role Assignment

策略1：角色分配

python

prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""

python

prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""

Strategy 2: Step-by-Step Thinking

策略2：分步思考

python

prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""

python

prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""

Strategy 3: Structured Output

策略3：结构化输出

python

prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""

python

prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""

Image Optimization

图像优化

python

from PIL import Image
import io

def optimize_for_claude(image_path, max_dimension=1568):
    """Resize image to reduce token usage by 30-50%"""
    with Image.open(image_path) as img:
        # Calculate new dimensions
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

python

from PIL import Image
import io

def optimize_for_claude(image_path, max_dimension=1568):
    """Resize image to reduce token usage by 30-50%"""
    with Image.open(image_path) as img:
        # Calculate new dimensions
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

Common Use Cases

常见使用场景

Text Extraction (OCR-like)

文本提取（类OCR）

python

prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""

python

prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""

Table Extraction

表格提取

python

prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""

python

prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""

Chart Analysis

图表分析

python

prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""

python

prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""

Best Practices

最佳实践

DO:

建议：

Use high-quality images (≥1000px)
Resize large images to save tokens
Provide context about what to look for
Use few-shot examples for consistent output

使用高质量图像（≥1000px）
调整大尺寸图像以节省Token
提供分析的上下文信息
使用少样本示例保证输出一致性

DON'T:

避免：

Send images smaller than 200px
Expect perfect OCR for handwriting
Send very large images (>8000px)
Ignore token costs for multiple images

发送小于200px的图像
期望手写文本的完美OCR识别
发送超过8000px的超大图像
忽略多图处理的Token成本

Limitations

局限性

Cannot identify specific individuals
May struggle with very small text
Animated GIFs: only first frame analyzed
Some specialized symbols may be misread

无法识别特定个人
可能难以识别极小文本
动图GIF：仅分析第一帧
部分特殊符号可能被误读

vision-multimodal

Original

Translation

Vision & Multimodal Skill

视觉与多模态Skill

When to Use This Skill

何时使用该Skill

Supported Formats

支持的格式

Image Size Guidelines

图像尺寸指南

Core Patterns

核心使用模式

Pattern 1: Single Image Analysis

模式1：单图分析

Load and encode image

Load and encode image

Pattern 2: Image from URL

模式2：从URL获取图像

Fetch and encode from URL

Fetch and encode from URL

Then use same pattern as above

Then use same pattern as above

Pattern 3: Multiple Images

模式3：多图处理

Compare multiple images (up to 100 per request)

Compare multiple images (up to 100 per request)

Pattern 4: Few-Shot with Images

模式4：带图像的少样本示例

Teach by example

Teach by example

Pattern 5: PDF Processing

模式5：PDF处理

Using Files API (beta)

Using Files API (beta)

Prompt Engineering for Vision

视觉任务的提示词工程

Strategy 1: Role Assignment

策略1：角色分配

Strategy 2: Step-by-Step Thinking

策略2：分步思考

Strategy 3: Structured Output

策略3：结构化输出

Image Optimization

图像优化

Common Use Cases

常见使用场景

Text Extraction (OCR-like)

文本提取（类OCR）

Table Extraction

表格提取

Chart Analysis

图表分析

Best Practices

最佳实践

DO:

建议：

DON'T:

避免：

Limitations

局限性

See Also

相关链接