
LLaVA - Large Language and Vision Assistant

Open-source vision-language model for conversational image understanding.

When to use LLaVA

Use when:
  • Building vision-language chatbots
  • Visual question answering (VQA)
  • Image description and captioning
  • Multi-turn image conversations
  • Visual instruction following
  • Document understanding with images
Metrics:
  • 23,000+ GitHub stars
  • GPT-4V level capabilities (targeted)
  • Apache 2.0 License
  • Multiple model sizes (7B-34B params)
Use alternatives instead:
  • GPT-4V: Highest quality, API-based
  • CLIP: Simple zero-shot classification
  • BLIP-2: Better for captioning only
  • Flamingo: Research, not open-source

Quick start

Installation

```bash
# Clone repository
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# Install
pip install -e .
```

Basic usage

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load image
image = Image.open("image.jpg")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )
response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
```

Available models

| Model | Parameters | VRAM | Quality |
|---|---|---|---|
| LLaVA-v1.5-7B | 7B | ~14 GB | Good |
| LLaVA-v1.5-13B | 13B | ~28 GB | Better |
| LLaVA-v1.6-34B | 34B | ~70 GB | Best |
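The table above maps directly to a hardware decision. A minimal sketch (a hypothetical helper, not part of LLaVA, using the FP16 figures listed) that picks the largest variant fitting a VRAM budget:

```python
# Hypothetical helper: choose the largest LLaVA variant whose FP16
# footprint (from the table above) fits the available VRAM.
MODEL_VRAM_GB = {
    "liuhaotian/llava-v1.5-7b": 14,
    "liuhaotian/llava-v1.5-13b": 28,
    "liuhaotian/llava-v1.6-34b": 70,
}

def pick_model(vram_gb):
    """Return the largest model that fits, or None if even 7B does not."""
    fitting = [(need, name) for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(pick_model(24))  # liuhaotian/llava-v1.5-7b
```

With 4-bit quantization (below) the effective budgets shrink roughly 4×, so the same idea applies with a smaller table.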

```python
# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization for lower VRAM
load_4bit = True  # pass to load_pretrained_model; reduces VRAM by ~4×
```

CLI usage

```bash
# Single image query
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation (then type questions interactively)
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
```
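For scripting, the invocations above can be assembled programmatically. A sketch (the flag names are the ones shown above; the helper itself is hypothetical):

```python
import shlex

# Hypothetical wrapper: build the argv list for llava.serve.cli.
def llava_cli_command(model_path, image_file, query=None):
    """Omit query to get the interactive multi-turn mode."""
    args = ["python", "-m", "llava.serve.cli",
            "--model-path", model_path,
            "--image-file", image_file]
    if query is not None:
        args += ["--query", query]
    return args

print(shlex.join(llava_cli_command("liuhaotian/llava-v1.5-7b", "image.jpg")))
```

The list form can be handed to `subprocess.run` directly, avoiding shell-quoting issues with question strings.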

Web UI (Gradio)

```bash
# Launch Gradio interface
python -m llava.serve.gradio_web_server \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit  # Optional: reduce VRAM
```

Multi-turn conversations

```python
# Initialize conversation
conv = conv_templates["llava_v1"].copy()

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # Add previous response to the history
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
```
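Under the hood, `conv.get_prompt()` flattens this history into a single string. A rough illustrative re-implementation of a llava_v1-style layout (the real system prompt and separators live in `llava.conversation`; the ones here are approximations, not copied from the library):

```python
# Illustrative only: flatten (role, text) pairs into one prompt string.
SYSTEM = ("A chat between a curious human and an artificial intelligence "
          "assistant.")

def build_prompt(messages, system=SYSTEM):
    parts = [system]
    for role, text in messages:
        # A None message is the open slot the model is asked to complete.
        parts.append(f"{role}:" if text is None else f"{role}: {text}")
    return " ".join(parts)

prompt = build_prompt([
    ("USER", "<image>\nWhat is in this image?"),
    ("ASSISTANT", None),
])
print(prompt)
```

The key property to preserve when appending turns is that the prompt ends with the open assistant slot, so generation continues from exactly that point.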

Common tasks

Image captioning

```python
# ask() is shorthand for a small wrapper around the generation code
# from Quick start (prompt in, decoded text out).
question = "Describe this image in detail."
response = ask(model, image, question)
```

Visual question answering

```python
question = "How many people are in the image?"
response = ask(model, image, question)
```

Object detection (textual)

```python
question = "List all the objects you can see in this image."
response = ask(model, image, question)
```

Scene understanding

```python
question = "What is happening in this scene?"
response = ask(model, image, question)
```

Document understanding

```python
question = "What is the main topic of this document?"
response = ask(model, document_image, question)
```
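The five tasks above differ only in the question string, so they can live in one lookup table. The task names below are my own labels; the prompts are copied from the snippets above:

```python
TASK_PROMPTS = {
    "caption":  "Describe this image in detail.",
    "vqa":      "How many people are in the image?",
    "objects":  "List all the objects you can see in this image.",
    "scene":    "What is happening in this scene?",
    "document": "What is the main topic of this document?",
}

def question_for(task):
    return TASK_PROMPTS[task]

# Usage (ask() as in the snippets above):
# response = ask(model, image, question_for("caption"))
```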

Training custom model

```bash
# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: Visual instruction tuning (150K instruction data)
bash scripts/v1_5/finetune.sh
```

Quantization (reduce VRAM)

```python
# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # Reduces VRAM ~4×
)

# 8-bit quantization: pass load_8bit=True instead (reduces VRAM ~2×)
```
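The VRAM figures follow roughly from bytes-per-parameter: FP16 stores 2 bytes per weight and 4-bit stores 0.5, which is where the ~4× reduction comes from. A back-of-envelope check (weights only; activations, KV cache, and the vision tower add overhead on top):

```python
def weight_vram_gb(params_billion, bits_per_param):
    """Rough weight-only footprint: parameters × bits / 8, in GB."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(7, 16))  # 14.0 — matches the ~14 GB FP16 figure for 7B
print(weight_vram_gb(13, 4))  # 6.5 — 4-bit 13B; the reported ~8 GB includes overhead
```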

Best practices

  1. Start with 7B model - Good quality, manageable VRAM
  2. Use 4-bit quantization - Reduces VRAM significantly
  3. GPU required - CPU inference extremely slow
  4. Clear prompts - Specific questions get better answers
  5. Multi-turn conversations - Maintain conversation context
  6. Temperature 0.2-0.7 - Balance creativity/consistency
  7. max_new_tokens 512-1024 - For detailed responses
  8. Batch processing - Process multiple images sequentially
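Practice 8 (sequential batch processing) can be sketched as a plain loop around whatever single-image helper you use; `ask_fn` below stands in for the `ask(...)` calls shown earlier:

```python
# Sketch: run images one at a time. Sequential processing keeps peak VRAM
# at the single-image level instead of multiplying it by the batch size.
def caption_batch(image_paths, ask_fn, question="Describe this image in detail."):
    results = {}
    for path in image_paths:
        results[path] = ask_fn(path, question)
    return results

# Usage with a stand-in function (swap in real LLaVA inference):
out = caption_batch(["a.jpg", "b.jpg"], lambda p, q: f"caption of {p}")
```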

Performance

| Model | VRAM (FP16) | VRAM (4-bit) | Speed (tokens/s) |
|---|---|---|---|
| 7B | ~14 GB | ~4 GB | ~20 |
| 13B | ~28 GB | ~8 GB | ~12 |
| 34B | ~70 GB | ~18 GB | ~5 |

Measured on an A100 GPU.
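Combining the throughput figures above with the generation settings from Quick start gives a rough latency estimate:

```python
def decode_seconds(max_new_tokens, tokens_per_sec):
    """Worst-case decode time if the full token budget is generated."""
    return max_new_tokens / tokens_per_sec

print(decode_seconds(512, 20))  # 25.6 s: 7B generating max_new_tokens=512
```

Responses usually stop early at an end-of-sequence token, so this is an upper bound, not a typical latency.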

Benchmarks

LLaVA achieves competitive scores on:
  • VQAv2: 78.5%
  • GQA: 62.0%
  • MM-Vet: 35.4%
  • MMBench: 64.3%

Limitations

  1. Hallucinations - May describe things not in image
  2. Spatial reasoning - Struggles with precise locations
  3. Small text - Difficulty reading fine print
  4. Object counting - Imprecise for many objects
  5. VRAM requirements - Need powerful GPU
  6. Inference speed - Slower than CLIP

Integration with frameworks

LangChain

```python
from langchain.llms.base import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self) -> str:  # required by the LLM base class
        return "llava"

    def _call(self, prompt, stop=None):
        # Run custom LLaVA inference here and return the generated text
        return response

llm = LLaVALLM()
```

Gradio App

```python
import gradio as gr

# ChatInterface calls fn(message, history, *additional_inputs)
def chat(message, history, image):
    response = ask_llava(model, image, message)
    return response

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()
```

Resources
