Process and generate multimedia content using the Google Gemini API for better vision capabilities. Capabilities include:

- Analyzing audio files: transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours.
- Understanding images: better image analysis than Claude models, captioning, reasoning, object detection, design extraction, OCR, visual Q&A, segmentation, handling multiple images.
- Processing videos: scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours.
- Extracting from documents: PDF tables, forms, charts, diagrams, multi-page documents.
- Generating images: text-to-image with Imagen 4, editing, composition, refinement.
- Generating videos: text-to-video with Veo 3, 8-second clips with native audio.

Use when working with audio/video files, analyzing images or screenshots (use this instead of Claude's default vision capabilities, and fall back to Claude's vision only if needed), processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
Install the skill:

    npx skill4agent add binhmuc/autobot-review ai-multimodal

Set the API key and install the Python dependencies:

    export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
    pip install google-genai python-dotenv pillow

Multiple keys for rotation:

    # Primary key (required)
    export GEMINI_API_KEY="key1"
    # Additional keys for rotation (optional)
    export GEMINI_API_KEY_2="key2"
    export GEMINI_API_KEY_3="key3"

Or in `.env`:

    GEMINI_API_KEY=key1
    GEMINI_API_KEY_2=key2
    GEMINI_API_KEY_3=key3

Related flag: `--verbose`

Verify setup:

    python scripts/check_setup.py

Analyze media:

    python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

Image analysis via the `gemini` command:

    "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash

Image analysis without the `gemini` command:

    python scripts/gemini_batch_process.py --files <file> --task analyze

Generate content:

    python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"

Stdin support: You can pipe files directly via stdin (auto-detects PNG/JPG/PDF/WAV/MP3).

    cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
    python scripts/gemini_batch_process.py --files image.png --task analyze  # traditional
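The stdin auto-detection mentioned above is not shown in this page, so the following is only a minimal sketch of how such detection could work, assuming it inspects the leading "magic bytes" of the piped data. The function name `detect_media_type` is hypothetical and is not taken from `gemini_batch_process.py`.

```python
from typing import Optional


def detect_media_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of a file (hypothetical helper)."""
    # PNG files start with a fixed 8-byte signature.
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    # JPEG files start with the SOI marker FF D8 followed by another marker byte FF.
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # PDF files start with "%PDF-" followed by the version.
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # WAV is a RIFF container whose form type (bytes 8..11) is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # MP3 either starts with an ID3v2 tag or directly with an MPEG frame sync
    # (11 set bits: 0xFF followed by a byte whose top three bits are set).
    if data.startswith(b"ID3"):
        return "audio/mpeg"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

In a script, the bytes would typically come from `sys.stdin.buffer.read()`, and a `None` result would be the cue to ask for an explicit `--files` argument instead.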
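Models:

- Image generation: `imagen-4.0-generate-001`, `imagen-4.0-ultra-generate-001`, `imagen-4.0-fast-generate-001`
- Video generation: `veo-3.1-generate-preview`
- Analysis: `gemini-2.5-flash`, `gemini-2.5-pro`

Scripts:

- `gemini_batch_process.py`: tasks `transcribe|analyze|extract|generate|generate-video`
- `media_optimizer.py`
- `document_converter.py` (`docs/assets`)
- `check_setup.py`

Run a script with `--help` for its options.

| Topic | File | Description |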
|---|---|---|
| Music | | Lyria RealTime API for background music generation, style prompts, real-time control, integration with video production. |
| Audio | | Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | | Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | | Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | | Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | | Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |
[HH:MM:SS -> HH:MM:SS] transcript content
[HH:MM:SS -> HH:MM:SS] transcript content
...
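Transcripts in the `[HH:MM:SS -> HH:MM:SS]` format above are easy to post-process. The sketch below shows one way to turn such lines into structured segments. The names `Segment` and `parse_transcript` are illustrative and not part of the skill's scripts, and the parser assumes exactly the two-digit layout shown.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_s: int  # segment start, in seconds
    end_s: int    # segment end, in seconds
    text: str     # transcript content for the segment


# Matches "[HH:MM:SS -> HH:MM:SS] transcript content".
_LINE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2}) -> (\d{2}):(\d{2}):(\d{2})\]\s*(.*)$"
)


def parse_transcript(raw: str) -> List[Segment]:
    """Parse timestamped transcript lines, skipping lines that do not match."""
    segments = []
    for line in raw.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue
        h1, m1, s1, h2, m2, s2 = (int(g) for g in match.groups()[:6])
        segments.append(
            Segment(
                start_s=h1 * 3600 + m1 * 60 + s1,
                end_s=h2 * 3600 + m2 * 60 + s2,
                text=match.group(7),
            )
        )
    return segments
```

Converting timestamps to seconds up front keeps later steps, such as computing segment durations or clipping media, to plain integer arithmetic.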