Local vision-language model for image analysis using SmolVLM-2B
npx skill4agent add tdimino/claude-code-minoan smolvlm

python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png
python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png "What text is visible?"

# Extract text (OCR)
python ~/.claude/skills/smolvlm/scripts/view_image.py screenshot.png "Extract all text"
# UI analysis
python ~/.claude/skills/smolvlm/scripts/view_image.py ui.png "Describe the UI elements"
# Detailed description
python ~/.claude/skills/smolvlm/scripts/view_image.py photo.jpg --detailed

Example prompts:

- "Describe this image"
- "Describe this image in detail, including colors, composition, and any text"
- "Extract all visible text from this image"
- "What text appears in this screenshot?"
- "Read the text in this document"
- "Describe the user interface elements"
- "What buttons and controls are visible?"
- "Identify the application and its current state"
- "How many [objects] are in this image?"
- "What color is the [object]?"
- "Is there a [object] in this image?"
- "What programming language is shown?"
- "Describe what this code does"
- "Identify any errors in this code screenshot"

| Spec | Value |
|---|---|
| Model | SmolVLM-2B-Instruct |
| Size | ~4GB |
| Peak Memory | 5.8GB |
| Speed | ~94 tok/s (M-series) |
| Supported Formats | PNG, JPG, JPEG, GIF, WebP |
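
The supported formats and the command shapes shown in the usage examples can be combined into a small pre-flight helper. The following is a minimal sketch, not part of the skill itself: the names `SUPPORTED_EXTENSIONS`, `is_supported`, and `build_command` are hypothetical, and it assumes the script accepts the image path first, then an optional prompt, then an optional `--detailed` flag, as the examples suggest. Whether a prompt and `--detailed` can be combined is not stated in the source.

```python
from pathlib import Path
from typing import List, Optional

# Extensions taken from the "Supported Formats" row of the spec table.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}

# Script location as written in the usage examples.
SCRIPT = "~/.claude/skills/smolvlm/scripts/view_image.py"


def is_supported(image_path: str) -> bool:
    """Return True if the file extension is one of the listed formats."""
    return Path(image_path).suffix.lower() in SUPPORTED_EXTENSIONS


def build_command(
    image_path: str,
    prompt: Optional[str] = None,
    detailed: bool = False,
) -> List[str]:
    """Build an argv list mirroring the command-line examples (hypothetical helper)."""
    if not is_supported(image_path):
        raise ValueError(f"Unsupported image format: {image_path}")
    cmd = ["python", str(Path(SCRIPT).expanduser()), image_path]
    if prompt is not None:
        cmd.append(prompt)
    if detailed:
        # Assumption: the flag is accepted after the positional arguments.
        cmd.append("--detailed")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building an argv list rather than a shell string avoids quoting problems when a prompt contains spaces or quotes.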
uv pip install mlx-vlm --system
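
To confirm the dependency is visible to the interpreter before running the script, a quick check along these lines may help. This is a sketch under one assumption: that the `mlx-vlm` package is imported as `mlx_vlm`. The helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: the mlx-vlm distribution installs a module named "mlx_vlm".
    if has_module("mlx_vlm"):
        print("mlx_vlm is available")
    else:
        print("mlx_vlm not found; try: uv pip install mlx-vlm --system")
```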