Use when "CLIP", "Whisper", "Stable Diffusion", "SDXL", "speech-to-text", "text-to-image", "image generation", "transcription", "zero-shot classification", "image-text similarity", "inpainting", "ControlNet"
```sh
npx skill4agent add eyadsibai/ltk multimodal-models
```

## Models

| Model | Modality | Task |
|---|---|---|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

## CLIP

### Tasks

| Task | How |
|---|---|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |
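A minimal sketch of zero-shot classification, assuming the Hugging Face `transformers` CLIP API; the model id is the standard ViT-B/32 checkpoint (~600 MB download on first use), and `photo.jpg` is a placeholder path:

```python
# Zero-shot classification with CLIP: score an image against free-text labels.
import numpy as np

def softmax(x):
    """Turn raw image-text similarity logits into label probabilities."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify(image_path, labels):
    """Return {label: probability} for an image, with no task-specific training."""
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image[0].detach().numpy()
    return dict(zip(labels, softmax(logits)))

# classify("photo.jpg", ["a photo of a cat", "a photo of a dog"])
```

Descriptive labels ("a photo of a cat" rather than "cat") typically score better, since that phrasing is closer to CLIP's training captions.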

### Model Variants

| Model | Parameters | Trade-off |
|---|---|---|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

### Key Concepts

| Concept | Description |
|---|---|
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |
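The normalization row is the one that bites in practice: CLIP similarity is cosine similarity, i.e. a dot product of L2-normalized embeddings, and skipping normalization silently skews scores. A numpy-only sketch of the math, with toy 4-d vectors standing in for CLIP's real embeddings:

```python
import numpy as np

def normalize(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine_sim(a, b):
    """Pairwise cosine similarity between two batches of embeddings."""
    return normalize(a) @ normalize(b).T

# Toy "embeddings": one image vs two text labels
img = np.array([[1.0, 0.0, 0.0, 0.0]])
txt = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
print(cosine_sim(img, txt))  # [[1. 0.]]
```

Because of the normalization, scaling an embedding does not change its similarity scores, which is exactly the invariance you want for comparing across images.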

## Whisper

### Tasks

| Task | Configuration |
|---|---|
| Transcription | Default |
| Translation to English | `task="translate"` |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |
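The configurations above, sketched against the `openai-whisper` Python API; the audio path is a placeholder, and note that the turbo model does not support `task="translate"`, so the sketch falls back to a multilingual model for translation:

```python
# Transcription, translation, and SRT subtitle output with openai-whisper.
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 63.5 -> '00:01:03,500'."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcribe_to_srt(audio_path, translate=False):
    import whisper

    # turbo is fastest but transcription-only; translation needs a regular model
    model = whisper.load_model("small" if translate else "turbo")
    result = model.transcribe(
        audio_path, task="translate" if translate else "transcribe"
    )
    cues = []
    for i, seg in enumerate(result["segments"], 1):
        cues.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                    f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)

# open("subs.srt", "w").write(transcribe_to_srt("talk.mp3"))
```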

### Model Variants

| Model | Size | Speed | Recommendation |
|---|---|---|---|
| turbo | 809M | Fast | Recommended |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |

### Key Concepts

| Concept | Description |
|---|---|
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves technical terms accuracy |
| Timestamps | Segment-level or word-level |
| faster-whisper | 4× faster alternative implementation |
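faster-whisper exposes a different API from openai-whisper: it returns a lazy generator of segments, so decoding happens as you iterate. A sketch tying the concepts above together (the audio path and the `initial_prompt` vocabulary are placeholder assumptions):

```python
def format_segment(start, end, text):
    """Render one segment as '[  0.00 ->   2.50] text'."""
    return f"[{start:6.2f} -> {end:6.2f}] {text.strip()}"

def transcribe_fast(audio_path):
    from faster_whisper import WhisperModel

    # int8 on CPU; on GPU use device="cuda", compute_type="float16"
    model = WhisperModel("small", device="cpu", compute_type="int8")
    segments, info = model.transcribe(
        audio_path,
        language="en",                               # skip auto-detect for speed
        initial_prompt="Kubernetes, kubectl, etcd",  # bias decoding toward jargon
        word_timestamps=True,
    )
    print(f"detected language: {info.language}")
    for seg in segments:  # lazy generator; decoding happens here
        print(format_segment(seg.start, seg.end, seg.text))
```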

## Stable Diffusion

### Tasks

| Task | Pipeline |
|---|---|
| Text-to-image | `StableDiffusionPipeline` / `StableDiffusionXLPipeline` |
| Style transfer | `StableDiffusionImg2ImgPipeline` |
| Fill regions | `StableDiffusionInpaintPipeline` |
| Guided generation | `StableDiffusionControlNetPipeline` |
| Custom styles | LoRA adapters |

### Model Versions

| Model | Resolution | Quality |
|---|---|---|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

### Generation Parameters

| Parameter | Effect | Typical Value |
|---|---|---|
| num_inference_steps | Quality vs speed | 20-50 |
| guidance_scale | Prompt adherence | 7-12 |
| negative_prompt | Avoid artifacts | "blurry, low quality" |
| strength (img2img) | How much to change | 0.5-0.8 |
| seed | Reproducibility | Fixed number |
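These knobs map directly onto a diffusers pipeline call. A hedged sketch using SDXL; the checkpoint id is current at the time of writing and may change, and the prompt is arbitrary:

```python
def generation_kwargs(steps=30, guidance=7.5,
                      negative="blurry, low quality, deformed"):
    """Bundle the table's parameters into kwargs for a diffusers pipeline call."""
    return {
        "num_inference_steps": steps,  # 20-50: more steps = higher quality, slower
        "guidance_scale": guidance,    # 7-12: stronger prompt adherence
        "negative_prompt": negative,   # steer away from common artifacts
    }

def generate(prompt, seed=42):
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    # A fixed-seed Generator makes the run reproducible
    gen = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, generator=gen, **generation_kwargs()).images[0]

# generate("a watercolor fox in a misty forest").save("fox.png")
```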

### Control Methods

| Method | Input | Use Case |
|---|---|---|
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |
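Inpainting is the method easiest to get wrong about mask polarity: white marks the region to repaint, black is preserved. A sketch with diffusers (checkpoint id and the rectangular mask geometry are assumptions):

```python
import numpy as np

def rect_mask(h, w, top, left, bottom, right):
    """Inpainting mask: white (255) where the model should repaint."""
    m = np.zeros((h, w), dtype=np.uint8)
    m[top:bottom, left:right] = 255
    return m

def inpaint(image_path, prompt):
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    init = Image.open(image_path).convert("RGB").resize((512, 512))
    mask = Image.fromarray(rect_mask(512, 512, 100, 100, 300, 300))
    return pipe(prompt=prompt, image=init, mask_image=mask).images[0]

# inpaint("room.png", "a potted plant on the table").save("room_edited.png")
```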

### Memory Optimization

| Technique | Effect |
|---|---|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
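A sketch applying these techniques to a diffusers pipeline. Each method name in `LOW_VRAM` is a real diffusers API; the `apply_optimizations` helper is my own wrapper, and the checkpoint id is an assumption:

```python
def apply_optimizations(pipe, method_names):
    """Call each named optimization method the pipeline supports; return what ran."""
    applied = []
    for name in method_names:
        method = getattr(pipe, name, None)
        if callable(method):
            method()
            applied.append(name)
    return applied

LOW_VRAM = [
    "enable_model_cpu_offload",   # move submodules to GPU only while in use
    "enable_attention_slicing",   # compute attention in chunks (slower, leaner)
    "enable_vae_tiling",          # decode large images tile by tile
]

def low_vram_pipeline(model_id="stabilityai/stable-diffusion-2-1"):
    import torch
    from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    )
    # DPM-Solver++ reaches comparable quality in ~20-25 steps instead of 50
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    apply_optimizations(pipe, LOW_VRAM)
    return pipe
```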

## VRAM Requirements

| Model | VRAM Needed |
|---|---|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |

## Best Practices

| Practice | Why |
|---|---|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
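The embedding-cache practice can be sketched as a small disk cache keyed by file path; `compute` stands in for a CLIP image-encoder call, so the cache itself needs no GPU:

```python
import hashlib
import os
import numpy as np

def cached_embedding(path, compute, cache_dir="emb_cache"):
    """Return the embedding for `path`, computing and caching it on a miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha1(path.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)        # cache hit: no model call
    emb = np.asarray(compute(path))       # e.g. a CLIP image-encoder call
    np.save(cache_file, emb)
    return emb
```

For image-search workloads this matters because every query compares against the same image embeddings; encoding them once and caching turns per-query cost into a single cheap text-encoder call.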