multimodal-llm
Multimodal LLM Patterns
Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
Total: 6 rules across 2 categories (Vision, Audio)
Vision: Image Analysis
Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.
| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | | Base64 encoding, multi-image, bounding boxes |
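To avoid the oversized-image mistake called out below, it helps to resize before encoding. A minimal sketch of two helpers: a pure dimension calculation against an assumed 2048px longest-side limit (check your provider's actual limit), and a builder for the Anthropic-style base64 image block used in the example later in this document. Actual pixel resizing would be done with a library such as Pillow (e.g. `Image.open(path).resize(fit_within(*img.size))`), which is not shown here.

```python
import base64

MAX_SIDE = 2048  # assumed longest-side limit; verify against your provider's docs


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down proportionally so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return int(width * scale), int(height * scale)


def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build an Anthropic-style base64 image content block from raw image bytes."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }
```

Multi-image comparison then just means appending several `image_block(...)` entries to the same `content` list before the text prompt.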
Vision: Document Understanding
Extract structured data from documents, charts, and PDFs using vision models.
| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | | PDF page ranges, detail levels, OCR strategies |
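For PDF processing, Claude's Messages API also accepts documents as base64 content blocks. A minimal sketch of the payload helpers, with the `document` block shape assumed by analogy to the image block above (verify the exact fields against your provider's docs); the page-range prompt is one simple way to scope the model's attention without pre-splitting the PDF:

```python
import base64


def pdf_block(pdf_bytes: bytes) -> dict:
    """Base64 PDF document block (Anthropic-style; assumed shape, verify
    against the provider's documentation before use)."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(pdf_bytes).decode("utf-8"),
        },
    }


def page_range_prompt(first: int, last: int, question: str) -> str:
    """Scope the model's attention to a page range via the prompt text."""
    return f"Look only at pages {first}-{last}. {question}"
```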
Vision: Model Selection
Choose the right vision provider based on accuracy, cost, and context window needs.
| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | | Provider comparison, token costs, image limits |
Audio: Speech-to-Text
Convert audio to text with speaker diarization, timestamps, and sentiment analysis.
| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
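Diarized STT output usually arrives as a list of per-speaker utterances with timestamps. A small post-processing sketch that renders such output as a readable transcript, merging consecutive turns by the same speaker; the utterance dicts with `speaker`, `start_ms`, and `text` keys are an assumed shape modeled loosely on AssemblyAI-style utterances, so adapt the field names to your provider's response:

```python
def format_transcript(utterances: list[dict]) -> str:
    """Render diarized utterances as '[MM:SS] Speaker: text' lines,
    merging consecutive turns by the same speaker into one line."""
    lines: list[str] = []
    last_speaker = None
    for u in utterances:
        minutes, seconds = divmod(u["start_ms"] // 1000, 60)
        if u["speaker"] == last_speaker:
            # Same speaker continuing: append to the previous line
            lines[-1] += " " + u["text"]
        else:
            lines.append(f"[{minutes:02d}:{seconds:02d}] {u['speaker']}: {u['text']}")
            last_speaker = u["speaker"]
    return "\n".join(lines)
```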
Audio: Text-to-Speech
Generate natural speech from text with voice selection and expressive cues.
| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | | Gemini TTS, voice config, auditory cues |
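TTS APIs typically cap input length per request, so long text must be split before synthesis. A sketch that breaks text at sentence boundaries so the audio never cuts mid-sentence; the 4000-character default is an assumed placeholder, not any particular provider's limit:

```python
import re


def chunk_for_tts(text: str, limit: int = 4000) -> list[str]:
    """Split text into chunks at most `limit` characters long, breaking at
    sentence-ending punctuation so synthesized audio stays natural."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent as a separate synthesis request and the resulting audio segments concatenated.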
Audio: Model Selection
Select the right audio/voice provider for real-time, transcription, or TTS use cases.
| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | | Real-time voice comparison, STT benchmarks, pricing |
Key Decisions
| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
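The decision table above can be encoded as a simple lookup if you want to keep the recommendation in code. The entries below are transcribed directly from the table; model names and pricing change frequently, so treat this as a snapshot to verify before deploying:

```python
# Recommendations transcribed from the Key Decisions table above.
RECOMMENDED = {
    "high_accuracy_vision": "Claude Opus 4.6 or GPT-5",
    "long_documents": "Gemini 2.5 Pro",
    "cost_efficient_vision": "Gemini 2.5 Flash",
    "video_analysis": "Gemini 2.5/3 Pro",
    "voice_assistant": "Grok Voice Agent",
    "emotional_voice": "Gemini Live API",
    "long_audio_transcription": "Gemini 2.5 Pro",
    "speaker_diarization": "AssemblyAI or Gemini",
    "self_hosted_stt": "Whisper Large V3",
}


def pick_model(use_case: str) -> str:
    """Look up the recommended provider/model for a use case key."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise ValueError(
            f"Unknown use case {use_case!r}; options: {sorted(RECOMMENDED)}"
        )
```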
Example
```python
import anthropic, base64

client = anthropic.Anthropic()

# Read and base64-encode the image for the request payload
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)
```
Common Mistakes
- Not setting `max_tokens` on vision requests (responses truncated)
- Sending oversized images without resizing (>2048px)
- Using `high` detail level for simple yes/no classification
- Using STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
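On the rate-limit point: vision and audio requests are large and hit limits faster than text-only calls, so it is worth wrapping them in retry logic. A minimal exponential-backoff sketch; pass your SDK's rate-limit exception class via `retryable` (e.g. `anthropic.RateLimitError`), and note the `sleep` parameter exists mainly so the behavior can be tested without waiting:

```python
import random
import time


def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Retry `call()` with exponential backoff plus a little jitter.

    Re-raises the last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 100ms of jitter
            sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
```

Usage would look like `with_backoff(lambda: client.messages.create(...), retryable=(anthropic.RateLimitError,))`.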
Related Skills
- `rag-retrieval` - Multimodal RAG with image + text retrieval
- `llm-integration` - General LLM function calling patterns
- `streaming-api-patterns` - WebSocket patterns for real-time audio