Use the mm CLI to index, explore, query, and extract content from multimodal directories containing images, videos, PDFs, code, and other files. Triggers: exploring a directory's contents, listing/finding files by type or size, extracting text from PDFs, getting image metadata, searching across file contents, counting tokens, viewing directory trees, extracting PDF page mosaics, video keyframe extraction, 'what files are in this folder', 'find all images', 'show me the PDFs', 'how much storage do videos use', 'extract text from this PDF', 'search documents for X', 'analyze this directory', 'how many tokens', 'show the tree'.
npx skill4agent add vlm-run/skills mm-cli-skill

`mm`, `--format json`

# First run `mm --help` or `mm --version` to confirm mm isn't already installed
pip install mm-ctx
# Alternative: shell installer
# macOS / Linux
curl -LsSf https://vlm-run.github.io/mm/install/install.sh | sh
# Windows (PowerShell)
irm https://vlm-run.github.io/mm/install/install.ps1 | iex

| Command | Purpose |
|---|---|
| `mm find` | Locate/list files by name/kind/ext/size, tabular listing, tree view, schema |
| `mm cat` | Content extraction (auto-detected by file type × mode) |
| `mm grep` | Content search: text and semantic (via embeddings) |
| `mm wc` | Count files, bytes, lines, tokens |
| `mm bench` | Benchmark suite with statistical analysis |
| `mm config` | Extraction mode settings (show, init, set, reset-db, reset-profiles, reset) |
| `mm profile` | Manage LLM provider profiles (list, add, update, use, remove) |
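
When scripting against the commands summarized in the table above, a thin Python wrapper can keep the argument handling in one place. The sketch below is not part of mm: `build_mm_args` and `run_mm_json` are hypothetical helper names, and it assumes that `mm` is on `PATH` and that `--format json` prints a single JSON document on stdout.

```python
import json
import shutil
import subprocess
from typing import Any, List


def build_mm_args(subcommand: str, *args: str, json_output: bool = True) -> List[str]:
    """Assemble an argv list, e.g. ['mm', 'find', 'data', '--format', 'json']."""
    argv = ["mm", subcommand, *args]
    if json_output:
        # --format json is the machine-readable output flag used in this document.
        argv += ["--format", "json"]
    return argv


def run_mm_json(subcommand: str, *args: str) -> Any:
    """Run mm and parse stdout as JSON (assumption: one JSON document on stdout)."""
    if shutil.which("mm") is None:
        raise FileNotFoundError("mm not found on PATH; install it first (pip install mm-ctx)")
    result = subprocess.run(
        build_mm_args(subcommand, *args),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Building the argument list separately from running it makes the command easy to log or test without mm installed, for example `build_mm_args("find", "data", "--kind", "image")`.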
mm find <dir> --tree --depth 1
mm wc <dir> --by-kind
`find`, `grep`, `cat`
mm cat <file> -m accurate

mm find <dir> --kind image # all images
mm find <dir> --kind video # all videos
mm find <dir> --kind document # all PDFs/docs
mm find <dir> --kind audio # audio files
mm find <dir> --name "test_.*\.py" # filter by name (regex)
mm find <dir> -n config # filter by name (substring)
mm find <dir> --ext .png,.webp # by extension
mm find <dir> --min-size 1mb --max-size 10mb # by size range
mm find <dir> --kind image --limit 5 --format json # JSON output, capped
mm find <dir> --sort size --reverse --limit 10 # largest files

`--format json`

# Tabular listing (default)
mm find <dir> # all files
mm find <dir> --columns name,kind,size --limit 10 # select columns
mm find <dir> --sort size --reverse --format json # sorted JSON
# Tree view
mm find <dir> --tree # full tree with sizes
mm find <dir> --tree --depth 1 # top-level dirs only
mm find <dir> --tree --kind image # only image files
mm find <dir> --tree --format json # JSON tree structure
# Schema inspection
mm find <dir> --schema # Rich table with column docs
mm find <dir> --schema --format json # machine-readable
# Include gitignored files
mm find <dir> --no-ignore # bypass .gitignore rules
mm find <dir> --no-ignore --kind video # gitignored videos
mm find <dir> --no-ignore --tree # tree including ignored dirs

`files`

| Column | Type | Description |
|---|---|---|
| path | string | Relative path from scan root |
| name | string | File name with extension |
| stem | string | File name without extension |
| ext | string | Extension including dot |
| size | uint64 | File size in bytes |
| modified | timestamp | Last modification time |
| created | timestamp | Creation time |
| mime | string | MIME type |
| kind | string | File kind as used by `--kind` (e.g. image, video, document, audio, code, text) |
| is_binary | bool | Whether file is binary |
| depth | uint16 | Directory depth (0 = top-level) |
| parent | string | Parent directory path |
| width | uint32 | Pixel width (images from header, videos via native parsing). Null for non-media. |
| height | uint32 | Pixel height (images from header, videos via native parsing). Null for non-media. |
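
The `kind` and `size` columns are enough to answer questions such as "how much storage do videos use". The sketch below aggregates bytes per kind from rows shaped like the schema above. The helper name `bytes_by_kind` and the sample rows are hypothetical. Real rows would presumably come from `mm find <dir> --format json`, assuming that output is a list of objects keyed by column name.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def bytes_by_kind(rows: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Sum the `size` column (bytes) for each value of the `kind` column."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[str(row["kind"])] += int(row["size"])
    return dict(totals)


# Hypothetical rows for illustration only (not real mm output).
sample_rows = [
    {"name": "a.png", "kind": "image", "size": 2048},
    {"name": "b.mp4", "kind": "video", "size": 1_000_000},
    {"name": "c.mp4", "kind": "video", "size": 500_000},
]

print(bytes_by_kind(sample_rows))  # {'image': 2048, 'video': 1500000}
```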
`--mode fast`, `-m accurate`

# Fast mode (default): local extraction, no LLM
mm cat <file> # text/metadata extraction
mm cat photo.png # image metadata (dims, MIME, hash, EXIF)
mm cat video.mp4 # video metadata (resolution, duration, codecs)
mm cat paper.pdf # text extraction via pypdfium2
# Accurate mode — LLM-powered descriptions
mm cat photo.png -m accurate # VLM caption
mm cat video.mp4 -m accurate # mosaic → VLM description
mm cat audio.mp3 -m accurate # transcript → LLM summary
mm cat paper.pdf -m accurate # text → LLM summary
# Head / tail
mm cat <file> -n 20 # first 20 lines
mm cat <file> -n -10 # last 10 lines
# Cache control
mm cat <file> --no-cache # bypass cache, force fresh run
# Output formats
mm cat <file> --format json # JSON output

`-p`, `--pipeline`

# Named encoder (encodes media into VLM-ready JSON messages)
mm cat photo.png -p image-resize # Fit to 1024px, base64 encode
mm cat photo.png -p image-tile # Resized overview + all tiles in one Message
mm cat video.mp4 -p video-frame-sample # Extract frames at 1fps
mm cat video.mp4 -p video-chunk # Chunk into 60s segments
mm cat doc.pdf -p document-rasterize # Render pages as images
mm cat doc.pdf -p document-rasterize-text # Rasterize + extract text
# YAML pipeline file
mm cat photo.png -p custom-pipeline.yaml
# Multiple pipelines (dispatched by kind field in YAML)
mm cat *.jpg *.mp4 -p image.yaml -p video.yaml
# List available encoders and pipelines
mm cat --list-pipelines

| Name | Media | Description |
|---|---|---|
| `image-resize` | image | Default. Fit to 1024px bounding box |
| `image-tile` | image | Resized overview + tile crops in one Message |
| `video-frame-sample` | video | Extract frames at fps (requires ffmpeg) |
| | video | Frames + Whisper transcript (accurate mode default) |
| `video-chunk` | video | Chunk into time-based segments with overlap |
| | video | Build mosaic grids from sampled frames |
| | video | Scene detection → representative frames per shot |
| | video | Scene detection → mosaic grid per shot |
| | video | Pass video file as a Gemini Part |
| | video | Chunk video into Gemini Parts |
| | audio | Transcribe audio via Whisper (fast/accurate default) |
| | audio | Pass audio file as a Gemini Part |
| | document | Extract text per page from PDF/DOCX/PPTX |
| `document-rasterize` | document | Render PDF pages as images (requires pypdfium2) |
| `document-rasterize-text` | document | Rasterize + extract text, interleaved |
| | document | Pass document file as a Gemini Part |
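
Because encoders are tied to a media type, a script that handles mixed files needs to choose one per file. The sketch below shows one way to do that, using only encoder names that appear in the `-p` examples in this document. The helpers `guess_kind` and `encoder_command` are hypothetical, and the MIME-based kind guess from Python's standard library is a stand-in that may not match mm's own kind detection.

```python
import mimetypes
from typing import List, Optional

# Encoder names taken from the -p examples in this document.
DEFAULT_ENCODERS = {
    "image": "image-resize",
    "video": "video-frame-sample",
    "document": "document-rasterize",
}


def guess_kind(path: str) -> Optional[str]:
    """Rough media-kind guess from the file extension (stand-in, not mm's logic)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return None
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("video/"):
        return "video"
    if mime == "application/pdf":
        return "document"
    return None


def encoder_command(path: str) -> List[str]:
    """Build an `mm cat` argv, adding -p <encoder> when the kind is recognized."""
    kind = guess_kind(path)
    if kind is None:
        return ["mm", "cat", path]  # fall back to plain extraction
    return ["mm", "cat", path, "-p", DEFAULT_ENCODERS[kind]]
```

Running `mm cat --list-pipelines` remains the authoritative way to see which encoder names are available in a given installation.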
`.py`, `python/mm/encoders/image/`, `python/mm/encoders/video/`, `~/.config/mm/encoders/`, `name`

from pathlib import Path
from mm.encoders import register_encoder

@register_encoder(media_types=("image",))
def my_custom(path: Path, **kw):
    """Registered as 'my-custom' (auto-named from function)."""
    import base64, io
    from PIL import Image

    # Convert to RGB first: JPEG cannot store an alpha channel (e.g. RGBA PNGs).
    img = Image.open(path).convert("RGB")
    img.thumbnail((1024, 1024))  # shrink in place to fit a 1024px box
    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=90)
    b64 = base64.b64encode(buf.getvalue()).decode()
    yield {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    ]}

from mm import process_image, process_image_tiled, process_video, process_document
from pathlib import Path
msg = process_image(Path("photo.png"), max_width=1024) # Single Message dict
tiles = list(process_image_tiled(Path("scan.png"), tile_size=1024)) # Multiple Messages
chunks = list(process_video(Path("video.mp4"))) # Multiple Messages
pages = list(process_document(Path("doc.pdf"))) # Multiple Messages
# Via Context
from mm import Context
ctx = Context("~/data")
messages = ctx.encode("photo.png", strategy="resize")

mm cat photo.png -m accurate --encode.strategy image-tile # override encoder
mm cat photo.png -m accurate --encode.pyfunc ~/my_filter.py # custom transform

mm cat photo.png -m accurate --generate.max-tokens 1024 # increase token limit
mm cat photo.png -m accurate --generate.temperature 0.5 # higher temperature
mm cat photo.png -m accurate --generate.prompt "List 3 main objects in this image."
mm cat photo.png -m accurate --generate.json-mode true # request JSON response

mm cat photo.png -m accurate \
  --encode.strategy image-tile \
  --generate.max-tokens 512 \
  --generate.prompt "Analyze this architecture diagram."

`kind`, `generate`

# custom-image.yaml
kind: image
mode: accurate
encode:
  strategy: resize
  max_width: 512
generate:
  prompt: "What is in this image? One sentence only."
  max_tokens: 64

mm cat photo.png -p custom-image.yaml

# encode-only.yaml
kind: document
mode: fast
encode:
  strategy: null
# generate omitted = encode-only, no LLM call

kind: image
mode: accurate
encode:
  strategy: image-tile
  max_width: 2048
generate:
  prompt: "Describe this image in detail."
  max_tokens: 512
---
kind: video
mode: accurate
encode:
  mosaic_tile: "8x6"
  mosaic_count: 2
  frame_selection: scene
generate:
  prompt: "Summarize this video."
  max_tokens: 1024

mm cat *.jpg *.mp4 -p multi-pipeline.yaml

mm cat photo.png -p my-pipeline.yaml --generate.max-tokens 128

`~/.config/mm/mm.toml`

[pipelines]
image.fast = "~/.config/mm/pipelines/image/fast.yaml"
video.accurate = "/path/to/my-video-accurate.yaml"

`--encode.pyfunc`

# my_transform.py:
# def transform(parts, context):
#     extra = {"type": "text", "text": "Focus on the data flow."}
#     return parts + [extra]
mm cat photo.png -m accurate --encode.pyfunc ~/my_transform.py

kind: image
mode: accurate
encode:
  strategy: resize
  max_width: 512
  pyfunc: ~/my_transforms/filter.py
generate:
  prompt: "Analyze this image."
  max_tokens: 128

`def`

encode:
  pyfunc: |
    def transform(parts, context):
        return [p for p in parts if p.get("type") == "image_url"]

mm wc <dir> # summary totals
mm wc <dir> --by-kind # breakdown by file kind
mm wc <dir> --kind document # only documents
mm wc <dir> --format json # machine-readable

# Text search (regex matching)
mm grep "pattern" <dir> # search all files
mm grep "attention" <dir> --kind document # search only documents
mm grep "TODO" <dir> --kind code # search code files
mm grep "invoice" <dir> --kind document --format json # JSON output
mm grep "error" <dir> -C 2 # 2 context lines
mm grep "invoice" <dir> --count # match counts per file
mm grep "Quantum Phase" <dir> -i # case-insensitive search
mm grep "TODO" <dir> --ignore-case --kind code # case-insensitive in code
mm grep "secret" <dir> --no-ignore # search gitignored files too
# Semantic search (vector similarity via embeddings)
mm grep "financial projections" <dir> -s # semantic search across all files
mm grep "architecture overview" <dir> -s --format json # JSON with distances
mm grep "revenue forecast" <dir> -s --index # auto-index unindexed files before search

`--kind code`, `--kind text`

mm bench <dir> # full benchmark
mm bench <dir> --rounds 5 # more measurement rounds
mm bench <dir> --mode accurate # include accurate-mode benchmarks
mm bench <dir> --format json # JSON output for archival

mm config show # show current config
mm config init # create config with default profile
mm config init --force # overwrite existing config
mm config set mode.fast.whisper_model tiny # set a config value
mm config set mode.accurate.beam_size 5 # set a config value
mm config reset-db # delete all databases and caches
mm config reset-profiles # restore profiles to defaults
mm config reset # reset everything (db + profiles)

`~/.config/mm/mm.toml`

mm profile list # list all profiles (● = active)
mm profile add openrouter --base-url https://openrouter.ai/api/v1 --model vlm-1
mm profile update openrouter --model gemma4:e2b
mm profile use openrouter # switch active profile
mm profile remove openrouter

mm --profile openrouter cat photo.png -m accurate # one-off override
MM_PROFILE=openrouter mm cat photo.png -m accurate # env override

Output formats: `--format json`, `--format tsv`, `--format csv`, `--format dataset-jsonl`, `--format dataset-hf`

mm find <dir> --kind image | mm cat # find images, extract metadata
mm find <dir> --kind document --min-size 10mb | wc -l # count large PDFs
mm find <dir> --kind video --format json | jq '.[].name' # extract video names

`find`, `wc`, `find --tree --depth 1`, `wc --by-kind`, `--format json`, `find`, `cat`, `mm cat video.mp4 -m accurate`, `--mode fast`, `--mode accurate`, `--no-cache`, `-m accurate`, `-p`, `--encode.pyfunc`, `--list-pipelines`
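
The `transform(parts, context)` hook used with `--encode.pyfunc` earlier in this document can be tried out in isolation before handing it to mm. The sketch below combines the two transforms shown there: it keeps only `image_url` parts and appends one text instruction. It assumes that parts are plain dicts with a `type` key, as in the examples above. The shape of `context` is not described in this document, so it is left untyped and unused here.

```python
from typing import Any, Dict, List

Part = Dict[str, Any]


def transform(parts: List[Part], context: Any) -> List[Part]:
    """Keep image parts only, then append a single text instruction."""
    images = [p for p in parts if p.get("type") == "image_url"]
    hint = {"type": "text", "text": "Focus on the data flow."}
    return images + [hint]


# Hypothetical parts for a quick local check (not produced by mm).
example_parts = [
    {"type": "text", "text": "original caption"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,AAAA"}},
]
result = transform(example_parts, context=None)
print([p["type"] for p in result])  # ['image_url', 'text']
```

Saved as a file such as `~/my_transform.py`, a function of this shape would be passed the same way as in the examples above: `mm cat photo.png -m accurate --encode.pyfunc ~/my_transform.py`.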