Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.
npx skill4agent add yonatangross/orchestkit multimodal-llm

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | | Base64 encoding, multi-image, bounding boxes |
| Document Vision | | PDF page ranges, detail levels, OCR strategies |
| Vision Models | | Provider comparison, token costs, image limits |
| Speech-to-Text | | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
| Text-to-Speech | | Gemini TTS, voice config, auditory cues |
| Audio Models | | Real-time voice comparison, STT benchmarks, pricing |

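
The Image Analysis rule lists base64 encoding and multi-image input as key patterns. A minimal sketch of how several images might be packed into one message's content list is shown here. The helper names `image_block` and `multi_image_content` are hypothetical, the set of accepted media types is an assumption to check against your provider's documentation, and the block layout mirrors the base64 image format used in this skill's quick-start snippet.

```python
import base64
import mimetypes
from pathlib import Path

# Assumption: media types the target vision API accepts; verify with your provider.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(path):
    """Hypothetical helper: build one base64 image content block from a file."""
    media_type, _ = mimetypes.guess_type(str(path))
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image type for {path}: {media_type}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


def multi_image_content(paths, prompt):
    """Hypothetical helper: label each image, then append the text prompt."""
    content = []
    for index, path in enumerate(paths, start=1):
        # A short text label before each image lets the prompt refer to it by number.
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(image_block(path))
    content.append({"type": "text", "text": prompt})
    return content
```

The returned list could be passed as the `content` of a single user message, for example `multi_image_content(["before.png", "after.png"], "What changed between these images?")`. Guessing the media type from the file extension is a convenience only; it does not inspect the file's actual bytes.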
| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
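
If these recommendations need to drive routing in code, the table can be kept as a plain lookup. This is only a sketch: the `RECOMMENDATIONS` dictionary and the `recommend` function are hypothetical names, and the values are copied verbatim from the table above rather than being real model identifiers for any SDK.

```python
# Decision table from above, keyed by lowercase decision name.
# Values are descriptive labels from the table, not API model IDs.
RECOMMENDATIONS = {
    "high accuracy vision": "Claude Opus 4.6 or GPT-5",
    "long documents": "Gemini 2.5 Pro (1M context)",
    "cost-efficient vision": "Gemini 2.5 Flash ($0.15/M tokens)",
    "video analysis": "Gemini 2.5/3 Pro (native video)",
    "voice assistant": "Grok Voice Agent (fastest, <1s)",
    "emotional voice ai": "Gemini Live API",
    "long audio transcription": "Gemini 2.5 Pro (9.5hr)",
    "speaker diarization": "AssemblyAI or Gemini",
    "self-hosted stt": "Whisper Large V3",
}


def recommend(decision):
    """Hypothetical helper: return the table's recommendation for a decision."""
    key = decision.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown decision {decision!r}; expected one of: {known}")
    return RECOMMENDATIONS[key]
```

For example, `recommend("Self-hosted STT")` returns the string `"Whisper Large V3"`. Matching is case-insensitive, so the keys can be written as they appear in the table.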
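
import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Base64-encode the image so it can be sent inline with the request.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always set an explicit output limit
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"},
    ]}],
)

# The first content block of the reply holds the generated description.
print(response.content[0].text)

Related skills: rag-retrieval, llm-integration, streaming-api-patterns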