vision-multimodal

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Vision & Multimodal Skill

视觉与多模态Skill

Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.
利用Claude的视觉能力进行图像分析、文档处理和多模态理解。

When to Use This Skill

何时使用该Skill

  • Image analysis and description
  • Document/PDF processing
  • Screenshot analysis
  • OCR-like text extraction
  • Visual comparison
  • Chart and diagram interpretation
  • 图像分析与描述
  • 文档/PDF处理
  • 截图分析
  • 类OCR文本提取
  • 视觉对比
  • 图表与示意图解读

Supported Formats

支持的格式

FormatStatusBest For
JPEGPhotos, natural scenes
PNGScreenshots, UI, text
GIFAnimated (first frame)
WebPModern, compressed
PDFDocuments (via Files API)
格式状态最佳适用场景
JPEG照片、自然场景
PNG截图、UI界面、文本
GIF动图(仅分析第一帧)
WebP现代压缩格式
PDF文档(通过Files API)

Image Size Guidelines

图像尺寸指南

  • Minimum: 200 pixels (smaller = reduced accuracy)
  • Optimal: 1000x1000 pixels
  • Maximum: 8000x8000 pixels
  • Token cost: ~(width × height) / 1000
  • Tip: Resize to 1568px max dimension for 30-50% token savings
  • 最小尺寸: 200像素(尺寸越小,准确率越低)
  • 最佳尺寸: 1000x1000像素
  • 最大尺寸: 8000x8000像素
  • Token成本: ~(宽度 × 高度) / 1000
  • 提示: 将最大维度调整为1568px可节省30-50%的Token

Core Patterns

核心使用模式

Pattern 1: Single Image Analysis

模式1:单图分析

python
import anthropic
import base64

client = anthropic.Anthropic()
python
import anthropic
import base64

client = anthropic.Anthropic()

Load and encode image

Load and encode image

with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )
undefined
with open("image.jpg", "rb") as f: image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": [ { "type": "image", "source": { "type": "base64", "media_type": "image/jpeg", "data": image_data } }, { "type": "text", "text": "Describe this image in detail." } ] }] )
undefined

Pattern 2: Image from URL

模式2:从URL获取图像

python
import httpx
python
import httpx

Fetch and encode from URL

Fetch and encode from URL

image_url = "https://example.com/image.jpg" response = httpx.get(image_url) image_data = base64.standard_b64encode(response.content).decode("utf-8")
image_url = "https://example.com/image.jpg" response = httpx.get(image_url) image_data = base64.standard_b64encode(response.content).decode("utf-8")

Then use same pattern as above

Then use same pattern as above

undefined
undefined

Pattern 3: Multiple Images

模式3:多图处理

python
undefined
python
undefined

Compare multiple images (up to 100 per request)

Compare multiple images (up to 100 per request)

messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]
undefined
messages = [{ "role": "user", "content": [ {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}}, {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}}, {"type": "text", "text": "Compare these two images and list the differences."} ] }]
undefined

Pattern 4: Few-Shot with Images

模式4:带图像的少样本示例

python
undefined
python
undefined

Teach by example

Teach by example

messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}
]
undefined
messages = [ # Example 1 {"role": "user", "content": [ {"type": "image", "source": {...}}, {"type": "text", "text": "Classify this image."} ]}, {"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
    {"type": "image", "source": {...}},
    {"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},

# Target image
{"role": "user", "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
    {"type": "text", "text": "Classify this image."}
]}
]
undefined

Pattern 5: PDF Processing

模式5:PDF处理

python
undefined
python
undefined

Using Files API (beta)

Using Files API (beta)

with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, {"type": "text", "text": "Summarize this document."} ] }] )
undefined
with open("document.pdf", "rb") as f: pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, messages=[{ "role": "user", "content": [ { "type": "document", "source": { "type": "base64", "media_type": "application/pdf", "data": pdf_data } }, { "type": "text", "text": "Summarize this document."} } ] }] )
undefined

Prompt Engineering for Vision

视觉任务的提示词工程

Strategy 1: Role Assignment

策略1:角色分配

python
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""
python
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.

Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""

Strategy 2: Step-by-Step Thinking

策略2:分步思考

python
prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""
python
prompt = """Before answering, analyze the image systematically:

<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>

Then provide your answer based on this analysis."""

Strategy 3: Structured Output

策略3:结构化输出

python
prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""
python
prompt = """Extract information from this receipt and return as JSON:

{
    "vendor": "",
    "date": "",
    "items": [{"name": "", "price": 0}],
    "total": 0
}"""

Image Optimization

图像优化

python
from PIL import Image
import io

def optimize_for_claude(image_path, max_dimension=1568):
    """Resize image to reduce token usage by 30-50%"""
    with Image.open(image_path) as img:
        # Calculate new dimensions
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
python
from PIL import Image
import io

def optimize_for_claude(image_path, max_dimension=1568):
    """Resize image to reduce token usage by 30-50%"""
    with Image.open(image_path) as img:
        # Calculate new dimensions
        ratio = min(max_dimension / img.width, max_dimension / img.height)
        if ratio < 1:
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG", quality=85)
        return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")

Common Use Cases

常见使用场景

Text Extraction (OCR-like)

文本提取(类OCR)

python
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""
python
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""

Table Extraction

表格提取

python
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""
python
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""

Chart Analysis

图表分析

python
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""
python
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""

Best Practices

最佳实践

DO:

建议:

  • Use high-quality images (≥1000px)
  • Resize large images to save tokens
  • Provide context about what to look for
  • Use few-shot examples for consistent output
  • 使用高质量图像(≥1000px)
  • 调整大尺寸图像以节省Token
  • 提供分析的上下文信息
  • 使用少样本示例保证输出一致性

DON'T:

避免:

  • Send images smaller than 200px
  • Expect perfect OCR for handwriting
  • Send very large images (>8000px)
  • Ignore token costs for multiple images
  • 发送小于200px的图像
  • 期望手写文本的完美OCR识别
  • 发送超过8000px的超大图像
  • 忽略多图处理的Token成本

Limitations

局限性

  • Cannot identify specific individuals
  • May struggle with very small text
  • Animated GIFs: only first frame analyzed
  • Some specialized symbols may be misread
  • 无法识别特定个人
  • 可能难以识别极小文本
  • 动图GIF:仅分析第一帧
  • 部分特殊符号可能被误读

See Also

相关链接

  • [[llm-integration]] - API basics
  • [[extended-thinking]] - Complex reasoning
  • [[citations-retrieval]] - Document citations
  • [[llm-integration]] - API基础
  • [[extended-thinking]] - 复杂推理
  • [[citations-retrieval]] - 文档引用