Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

Model Overview

Model	Modality	Task
CLIP	Image + Text	Zero-shot classification, similarity
Whisper	Audio → Text	Transcription, translation
Stable Diffusion	Text → Image	Image generation, editing

CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

CLIP Use Cases

Task	How
Zero-shot classification	Compare image to text label embeddings
Image search	Find images matching text query
Content moderation	Classify against safety categories
Image similarity	Compare image embeddings

CLIP Models

Model	Parameters	Trade-off
ViT-B/32	151M	Recommended balance
ViT-L/14	428M	Best quality, slower
RN50	102M	Fastest, lower quality

CLIP Concepts

Concept	Description
Dual encoder	Separate encoders for image and text
Contrastive learning	Trained to match image-text pairs
Normalization	L2-normalize embeddings before computing cosine similarity
Descriptive labels	Better labels = better zero-shot accuracy

Key concept: CLIP embeds images and text in the same vector space, so classification reduces to finding the text embedding nearest to the image embedding.
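
As a rough illustration of the nearest-text-embedding idea, the sketch below classifies one image embedding against a few label embeddings using only the standard library. The three-dimensional vectors are made-up stand-ins for real encoder output, and the helper names (`l2_normalize`, `zero_shot_classify`) are hypothetical. The `temperature=100.0` scale is an assumption that mirrors the scaling used in the CLIP README example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, label_embeddings, temperature=100.0):
    """Return (best_label, probabilities) for one image against text labels."""
    image = l2_normalize(image_embedding)
    similarities = {}
    for label, embedding in label_embeddings.items():
        text = l2_normalize(embedding)
        similarities[label] = sum(a * b for a, b in zip(image, text))

    # Softmax over scaled similarities; subtract the max for numerical stability.
    logits = {label: temperature * s for label, s in similarities.items()}
    max_logit = max(logits.values())
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    probabilities = {label: e / total for label, e in exps.items()}
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


# Toy embeddings (assumed values, not real CLIP output).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [2.0, 0.4, 0.1]

best_label, probabilities = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

With a real model, the vectors would come from the image and text encoders. Descriptive labels such as "a photo of a cat" tend to work better than bare class names, which is why the toy labels are phrased that way.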

CLIP Limitations

Not for fine-grained classification
No spatial understanding (whole image only)
May reflect training data biases

Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

Whisper Use Cases

Task	Configuration
Transcription	Default `transcribe` task
Translation to English	`task="translate"`
Subtitles	Output format SRT/VTT
Word timestamps	`word_timestamps=True`
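
For the subtitles row, a minimal sketch of turning transcription segments into SRT text is shown below. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field, which is the shape of the `segments` list in the openai-whisper transcription result. The helper names are hypothetical.

```python
def format_srt_timestamp(seconds):
    """Convert seconds (float) to the SRT 'HH:MM:SS,mmm' timestamp format."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment["start"])
        end = format_srt_timestamp(segment["end"])
        text = segment["text"].strip()
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hand-written sample segments standing in for a real transcription result.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Today we cover multimodal models."},
]
print(segments_to_srt(sample_segments))
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.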

Whisper Models

Model	Size	Speed	Recommendation
turbo	809M	Fast	Recommended
large	1550M	Slow	Maximum quality
small	244M	Medium	Good balance
base	74M	Fast	Quick tests
tiny	39M	Fastest	Prototyping only

Whisper Concepts

Concept	Description
Language detection	Auto-detects, or specify for speed
Initial prompt	Improves technical terms accuracy
Timestamps	Segment-level or word-level
faster-whisper	Alternative implementation, up to 4× faster

Key concept: Specify the language when it is known; auto-detection adds latency.
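
One way to wire these concepts together is sketched below, assuming the openai-whisper package. The keyword names (`task`, `language`, `initial_prompt`, `word_timestamps`) are options accepted by its `transcribe` method. The helper names and the "Glossary: ..." prompt wording are hypothetical choices for illustration, not a recommended format.

```python
def build_transcribe_options(language=None, translate=False,
                             glossary=None, word_timestamps=False):
    """Build keyword arguments for a Whisper transcribe call."""
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        # Passing the language skips the auto-detection pass.
        options["language"] = language
    if glossary:
        # Seeding the prompt with domain terms can help with technical vocabulary.
        options["initial_prompt"] = "Glossary: " + ", ".join(glossary)
    if word_timestamps:
        options["word_timestamps"] = True
    return options


def transcribe_file(path, model_name="turbo", **options):
    """Load a model and transcribe one file (requires openai-whisper)."""
    import whisper  # imported lazily so the option builder works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, **options)


options = build_transcribe_options(language="en", glossary=["CLIP", "LoRA"])
print(options)
```

A call such as `transcribe_file("talk.mp3", **options)` would then run the model; the file name here is a placeholder.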

Whisper Limitations

May hallucinate on silence/noise
No speaker diarization (who said what)
Accuracy degrades on >30 min audio
Not suitable for real-time captioning

Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

SD Use Cases

Task	Pipeline
Text-to-image	`DiffusionPipeline`
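Style transfer	`StableDiffusionImg2ImgPipeline` (image-to-image)
Fill regions	`StableDiffusionInpaintPipeline` (inpainting)
Guided generation	`StableDiffusionControlNetPipeline` (ControlNet)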
Custom styles	LoRA adapters

SD Models

Model	Resolution	Quality
SDXL	1024×1024	Best
SD 1.5	512×512	Good, faster
SD 2.1	768×768	Middle ground

Key Parameters

Parameter	Effect	Typical Value
num_inference_steps	Quality vs speed	20-50
guidance_scale	Prompt adherence	7-12
negative_prompt	Avoid artifacts	"blurry, low quality"
strength (img2img)	How much to change	0.5-0.8
seed	Reproducibility	Fixed number
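
The parameters in the table map onto a Diffusers pipeline call roughly as sketched below. The helper names, the default values, and the validation rule are assumptions for illustration; `DiffusionPipeline.from_pretrained`, `torch.Generator(...).manual_seed(...)`, and the keyword names passed to the pipeline are the parts taken from the libraries themselves.

```python
def build_generation_kwargs(prompt, negative_prompt="blurry, low quality",
                            steps=30, guidance_scale=7.5):
    """Collect text-to-image call arguments, using the table's typical ranges."""
    if steps < 1:
        raise ValueError("steps must be a positive integer")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,      # quality vs speed
        "guidance_scale": guidance_scale,  # prompt adherence
    }


def generate_image(model_id, seed, device="cuda", **generation_kwargs):
    """Run one seeded generation (requires torch, diffusers, and a GPU)."""
    import torch  # imported lazily so the argument builder works without it
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to(device)
    # A fixed seed on the generator makes the sampled noise reproducible.
    generator = torch.Generator(device=device).manual_seed(seed)
    return pipe(generator=generator, **generation_kwargs).images[0]


kwargs = build_generation_kwargs("a lighthouse at dusk, oil painting")
print(kwargs)
```

A call such as `generate_image("stabilityai/stable-diffusion-xl-base-1.0", seed=42, **kwargs)` would then produce a PIL image; the model ID is one possible choice, not a requirement.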

Control Methods

Method	Input	Use Case
ControlNet	Edge/depth/pose	Structural guidance
LoRA	Trained weights	Custom styles
Img2Img	Source image	Style transfer
Inpainting	Image + mask	Fill regions

Memory Optimization

Technique	Effect
CPU offload	Reduces VRAM usage
Attention slicing	Trades speed for memory
VAE tiling	Large image support
xFormers	Faster attention
DPM scheduler	Fewer steps needed

Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.
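
Several of the memory techniques above are single method calls on a Diffusers pipeline. The sketch below groups three of them behind a hypothetical helper, `apply_memory_optimizations`; the flag names and defaults are assumptions, and `pipe` is expected to be an already loaded Stable Diffusion pipeline.

```python
def apply_memory_optimizations(pipe, cpu_offload=True,
                               attention_slicing=True, vae_tiling=False):
    """Enable selected memory savers on a pipeline and report what was applied."""
    applied = []
    if cpu_offload:
        # Keeps submodules on the CPU until needed (relies on the accelerate package).
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")
    if attention_slicing:
        # Computes attention in slices: lower peak memory, somewhat slower.
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    if vae_tiling:
        # Decodes the image in tiles so large resolutions fit in memory.
        pipe.vae.enable_tiling()
        applied.append("vae_tiling")
    return applied
```

xFormers attention (`pipe.enable_xformers_memory_efficient_attention()`) and a DPM-Solver scheduler (`DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)`) are left out of the helper because they need extra packages or replace a pipeline component, but they follow the same pattern.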

SD Limitations

GPU strongly recommended (CPU very slow)
Large VRAM requirements for SDXL
May generate anatomical errors
Prompt engineering matters

Common Patterns

Embedding and Similarity

All three models use embeddings:

CLIP: Image/text embeddings for similarity
Whisper: Audio embeddings for transcription
SD: Text embeddings for image conditioning

GPU Acceleration

Model	VRAM Needed
CLIP ViT-B/32	~2 GB
Whisper turbo	~6 GB
SD 1.5	~6 GB
SDXL	~10 GB
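
The table can be turned into a quick pre-flight check, sketched below. The figures are the approximate values from the table, while the function names and the 1 GB headroom default are assumptions. `detect_total_vram_gb` relies on PyTorch and a CUDA device being available.

```python
# Approximate VRAM needs in GB, copied from the table above.
VRAM_NEEDED_GB = {
    "CLIP ViT-B/32": 2,
    "Whisper turbo": 6,
    "SD 1.5": 6,
    "SDXL": 10,
}


def models_that_fit(available_gb, headroom_gb=1.0):
    """List models whose approximate need fits within the available VRAM."""
    usable = available_gb - headroom_gb
    return [name for name, need in VRAM_NEEDED_GB.items() if need <= usable]


def detect_total_vram_gb(device_index=0):
    """Return total memory of a CUDA device in GB (requires torch and a GPU)."""
    import torch  # imported lazily so the lookup above works without it

    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return total_bytes / 1024**3


print(models_that_fit(8))
```

Actual usage varies with precision, batch size, and resolution, so the result is a rough filter rather than a guarantee.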

Best Practices

Practice	Why
Use recommended model sizes	Best quality/speed balance
Cache embeddings (CLIP)	Expensive to recompute
Specify language (Whisper)	Faster than auto-detect
Use negative prompts (SD)	Avoid common artifacts
Set seeds for reproducibility	Consistent results
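
To illustrate the embedding-caching practice, here is a minimal in-memory sketch keyed by a hash of the input bytes. `EmbeddingCache` and `fake_encoder` are hypothetical; in practice the encoder would be a call into the CLIP image or text encoder, and the store might live on disk or in a vector database.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by content hash so each input is encoded once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, data: bytes):
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            self._store[key] = self._encode_fn(data)
            self.misses += 1
        return self._store[key]


def fake_encoder(data: bytes):
    """Stand-in for an expensive encoder: returns a tiny deterministic vector."""
    return [len(data), sum(data) % 256]


cache = EmbeddingCache(fake_encoder)
first = cache.get(b"image-bytes-1")
second = cache.get(b"image-bytes-1")  # served from the cache
third = cache.get(b"image-bytes-2")   # new content, encoder runs again
print(cache.misses)
```

Hashing the content rather than the file name means renamed or duplicated files still hit the cache.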

Resources

CLIP: https://github.com/openai/CLIP
Whisper: https://github.com/openai/whisper
Diffusers: https://huggingface.co/docs/diffusers
