
LLaVA - Large Language and Vision Assistant

Open-source vision-language model for conversational image understanding.

When to use LLaVA

Use when:
  • Building vision-language chatbots
  • Visual question answering (VQA)
  • Image description and captioning
  • Multi-turn image conversations
  • Visual instruction following
  • Document understanding with images
Metrics:
  • 23,000+ GitHub stars
  • GPT-4V level capabilities (targeted)
  • Apache 2.0 License
  • Multiple model sizes (7B-34B params)
Use alternatives instead:
  • GPT-4V: Highest quality, API-based
  • CLIP: Simple zero-shot classification
  • BLIP-2: Better for captioning only
  • Flamingo: Research, not open-source

Quick start

Installation

```bash
# Clone repository
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# Install
pip install -e .
```

Basic usage

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load image
image = Image.open("image.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )
response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
```

Available models

| Model | Parameters | VRAM | Quality |
|---|---|---|---|
| LLaVA-v1.5-7B | 7B | ~14 GB | Good |
| LLaVA-v1.5-13B | 13B | ~28 GB | Better |
| LLaVA-v1.6-34B | 34B | ~70 GB | Best |
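The table above maps directly to a hardware decision. A minimal sketch (a hypothetical helper, not part of LLaVA, using the FP16 figures listed) that picks the largest variant fitting a VRAM budget:

```python
# Hypothetical helper: choose the largest LLaVA variant whose FP16
# footprint (from the table above) fits the available VRAM.
MODEL_VRAM_GB = {
    "liuhaotian/llava-v1.5-7b": 14,
    "liuhaotian/llava-v1.5-13b": 28,
    "liuhaotian/llava-v1.6-34b": 70,
}

def pick_model(vram_gb):
    """Return the largest model that fits, or None if even 7B does not."""
    fitting = [(need, name) for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(pick_model(24))  # liuhaotian/llava-v1.5-7b
```

With 4-bit quantization (below) the effective budgets shrink roughly 4×, so the same idea applies with a smaller table.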

```python
# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization for lower VRAM
load_4bit = True  # pass to load_pretrained_model; reduces VRAM by ~4×
```

CLI usage

```bash
# Single image query
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation (then type questions interactively)
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
```
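For scripting, the invocations above can be assembled programmatically. A sketch (the flag names are the ones shown above; the helper itself is hypothetical):

```python
import shlex

# Hypothetical wrapper: build the argv list for llava.serve.cli.
def llava_cli_command(model_path, image_file, query=None):
    """Omit query to get the interactive multi-turn mode."""
    args = ["python", "-m", "llava.serve.cli",
            "--model-path", model_path,
            "--image-file", image_file]
    if query is not None:
        args += ["--query", query]
    return args

print(shlex.join(llava_cli_command("liuhaotian/llava-v1.5-7b", "image.jpg")))
```

The list form can be handed to `subprocess.run` directly, avoiding shell-quoting issues with question strings.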

Web UI (Gradio)

```bash
# Launch Gradio interface
python -m llava.serve.gradio_web_server \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit  # Optional: reduce VRAM
```

Multi-turn conversations

```python
# Initialize conversation
conv = conv_templates["llava_v1"].copy()

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # Add previous response to the history
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
```
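Under the hood, `conv.get_prompt()` flattens this history into a single string. A rough illustrative re-implementation of a llava_v1-style layout (the real system prompt and separators live in `llava.conversation`; the ones here are approximations, not copied from the library):

```python
# Illustrative only: flatten (role, text) pairs into one prompt string.
SYSTEM = ("A chat between a curious human and an artificial intelligence "
          "assistant.")

def build_prompt(messages, system=SYSTEM):
    parts = [system]
    for role, text in messages:
        # A None message is the open slot the model is asked to complete.
        parts.append(f"{role}:" if text is None else f"{role}: {text}")
    return " ".join(parts)

prompt = build_prompt([
    ("USER", "<image>\nWhat is in this image?"),
    ("ASSISTANT", None),
])
print(prompt)
```

The key property to preserve when appending turns is that the prompt ends with the open assistant slot, so generation continues from exactly that point.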

Common tasks

Image captioning

```python
# ask() is shorthand for a small wrapper around the generation code
# from Quick start (prompt in, decoded text out).
question = "Describe this image in detail."
response = ask(model, image, question)
```

Visual question answering

```python
question = "How many people are in the image?"
response = ask(model, image, question)
```

Object detection (textual)

```python
question = "List all the objects you can see in this image."
response = ask(model, image, question)
```

Scene understanding

```python
question = "What is happening in this scene?"
response = ask(model, image, question)
```

Document understanding

```python
question = "What is the main topic of this document?"
response = ask(model, document_image, question)
```
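The five tasks above differ only in the question string, so they can live in one lookup table. The task names below are my own labels; the prompts are copied from the snippets above:

```python
TASK_PROMPTS = {
    "caption":  "Describe this image in detail.",
    "vqa":      "How many people are in the image?",
    "objects":  "List all the objects you can see in this image.",
    "scene":    "What is happening in this scene?",
    "document": "What is the main topic of this document?",
}

def question_for(task):
    return TASK_PROMPTS[task]

# Usage (ask() as in the snippets above):
# response = ask(model, image, question_for("caption"))
```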

Training custom model

```bash
# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: Visual instruction tuning (150K instruction data)
bash scripts/v1_5/finetune.sh
```

Quantization (reduce VRAM)

```python
# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # Reduces VRAM ~4×
)

# 8-bit quantization: pass load_8bit=True instead (reduces VRAM ~2×)
```
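The VRAM figures follow roughly from bytes-per-parameter: FP16 stores 2 bytes per weight and 4-bit stores 0.5, which is where the ~4× reduction comes from. A back-of-envelope check (weights only; activations, KV cache, and the vision tower add overhead on top):

```python
def weight_vram_gb(params_billion, bits_per_param):
    """Rough weight-only footprint: parameters × bits / 8, in GB."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(7, 16))  # 14.0 — matches the ~14 GB FP16 figure for 7B
print(weight_vram_gb(13, 4))  # 6.5 — 4-bit 13B; the reported ~8 GB includes overhead
```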

Best practices

  1. Start with 7B model - Good quality, manageable VRAM
  2. Use 4-bit quantization - Reduces VRAM significantly
  3. GPU required - CPU inference extremely slow
  4. Clear prompts - Specific questions get better answers
  5. Multi-turn conversations - Maintain conversation context
  6. Temperature 0.2-0.7 - Balance creativity/consistency
  7. max_new_tokens 512-1024 - For detailed responses
  8. Batch processing - Process multiple images sequentially
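Practice 8 (sequential batch processing) can be sketched as a plain loop around whatever single-image helper you use; `ask_fn` below stands in for the `ask(...)` calls shown earlier:

```python
# Sketch: run images one at a time. Sequential processing keeps peak VRAM
# at the single-image level instead of multiplying it by the batch size.
def caption_batch(image_paths, ask_fn, question="Describe this image in detail."):
    results = {}
    for path in image_paths:
        results[path] = ask_fn(path, question)
    return results

# Usage with a stand-in function (swap in real LLaVA inference):
out = caption_batch(["a.jpg", "b.jpg"], lambda p, q: f"caption of {p}")
```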

Performance

| Model | VRAM (FP16) | VRAM (4-bit) | Speed (tokens/s) |
|---|---|---|---|
| 7B | ~14 GB | ~4 GB | ~20 |
| 13B | ~28 GB | ~8 GB | ~12 |
| 34B | ~70 GB | ~18 GB | ~5 |

Measured on an A100 GPU.
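Combining the throughput figures above with the generation settings from Quick start gives a rough latency estimate:

```python
def decode_seconds(max_new_tokens, tokens_per_sec):
    """Worst-case decode time if the full token budget is generated."""
    return max_new_tokens / tokens_per_sec

print(decode_seconds(512, 20))  # 25.6 s: 7B generating max_new_tokens=512
```

Responses usually stop early at an end-of-sequence token, so this is an upper bound, not a typical latency.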

Benchmarks

LLaVA achieves competitive scores on:
  • VQAv2: 78.5%
  • GQA: 62.0%
  • MM-Vet: 35.4%
  • MMBench: 64.3%

Limitations

  1. Hallucinations - May describe things not in image
  2. Spatial reasoning - Struggles with precise locations
  3. Small text - Difficulty reading fine print
  4. Object counting - Imprecise for many objects
  5. VRAM requirements - Need powerful GPU
  6. Inference speed - Slower than CLIP

Integration with frameworks

LangChain

```python
from langchain.llms.base import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self) -> str:  # required by the LLM base class
        return "llava"

    def _call(self, prompt, stop=None):
        # Run custom LLaVA inference here and return the generated text
        return response

llm = LLaVALLM()
```

Gradio App

```python
import gradio as gr

# ChatInterface calls fn(message, history, *additional_inputs)
def chat(message, history, image):
    response = ask_llava(model, image, message)
    return response

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()
```

Resources
