Llava

Name: Llava
Author: orchestra-research

orchestra-research/ai-research-skills

396 installs
11.2k repo stars
Updated June 16, 2026

llava is a Claude Code skill that guides developers through LLaVA two-stage pretraining and visual instruction fine-tuning with correct JSON conversation format and GPU shell scripts.

About

llava is an AI research skill for training LLaVA vision-language models in two stages. Stage 1 feature alignment pretrains on 558K CC3M image-caption pairs using CLIP ViT-L/14 and Vicuna-7B or LLaMA-2-7B base models via scripts/v1_5/pretrain.sh, taking roughly 20 hours on 8× A100 GPUs. Stage 2 visual instruction tuning fine-tunes on 150K GPT-generated multimodal instruction samples through scripts/v1_5/finetune.sh with JSON conversation formatting. Developers reach for llava when building custom LLaVA checkpoints and need the correct data formats, base model choices, and bash training scripts instead of misconfigured single-stage fine-tunes that fail to align vision and language modules.

Two training stages: feature alignment pretrain (~20h on 8× A100) then visual instruction tuning (~24h on 8× A100)
Stage 1 uses 558K image-caption pairs; stage 2 uses 150K GPT-generated multimodal instruction samples
Provides v1_5 pretrain.sh and finetune.sh entrypoints with Vicuna-7B or LLaMA-2-7B and CLIP ViT-L/14
Documents JSON instruction format with <image> tokens and human/gpt conversation turns

Llava by the numbers

396 all-time installs (skills.sh)
+36 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #507 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill llava

Installs	396
Repo stars	11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills

How do you train LLaVA vision-language models?

Run LLaVA's two training stages, feature-alignment pretraining followed by visual instruction fine-tuning, using the correct JSON conversation format and the provided GPU shell scripts.

Who is it for?

ML engineers training LLaVA v1.5 models who need two-stage pretrain and instruction-tuning scripts with documented dataset sizes and GPU requirements.

Skip if: you are fine-tuning BLIP-2 or CLIP-only models and do not need the LLaVA projector or conversation JSON pipelines.

When should I use this skill?

Use it when a user asks to train LLaVA, run pretrain.sh or finetune.sh, or format multimodal instruction JSON for visual tuning.

What you get

LLaVA checkpoint after two-stage training, JSON instruction dataset, and pretrain/finetune shell scripts.

Pretrained projector checkpoint
Instruction-tuned LLaVA weights
JSON conversation dataset

By the numbers

Stage 1 uses 558K CC3M image-caption pairs
Stage 2 uses 150K GPT-generated multimodal instruction samples
~20 hours training time on 8× A100 GPUs for Stage 1

Files

SKILL.md (Markdown)

LLaVA - Large Language and Vision Assistant

Open-source vision-language model for conversational image understanding.

When to use LLaVA

Use when:

Building vision-language chatbots
Visual question answering (VQA)
Image description and captioning
Multi-turn image conversations
Visual instruction following
Document understanding with images

Metrics:

23,000+ GitHub stars
Targets GPT-4V-level capabilities
Apache 2.0 License
Multiple model sizes (7B-34B params)

Use alternatives instead:

GPT-4V: Highest quality, API-based
CLIP: Simple zero-shot classification
BLIP-2: Better for captioning only
Flamingo: Research, not open-source

Quick start

Installation

# Clone repository
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA

# Install
pip install -e .

Basic usage

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import torch

# Load model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Load image (convert to RGB so grayscale or RGBA files are handled consistently)
image = Image.open("image.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_tensor = image_tensor.to(model.device, dtype=torch.float16)

# Create conversation
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512
    )

response = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(response)
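The call to tokenizer_image_token above is what turns the `<image>` placeholder into a special index that the model later swaps for visual features. The sketch below illustrates that idea with a toy whitespace tokenizer. It is a simplified stand-in, not the library's implementation: `toy_tokenize` and `tokenize_with_image` are hypothetical names, and details such as BOS-token handling are left out. The value -200 matches `IMAGE_TOKEN_INDEX` in llava.constants.

```python
IMAGE_TOKEN_INDEX = -200  # same value as llava.constants.IMAGE_TOKEN_INDEX

_vocab = {}  # toy vocabulary, grown on the fly (hypothetical, for illustration)

def toy_tokenize(text):
    """Map each whitespace-separated word to a small positive integer ID."""
    return [_vocab.setdefault(word, len(_vocab) + 1) for word in text.split()]

def tokenize_with_image(prompt, tokenize, image_token="<image>"):
    """Tokenize the text around each placeholder and splice in the image index."""
    ids = []
    for i, chunk in enumerate(prompt.split(image_token)):
        if i > 0:
            ids.append(IMAGE_TOKEN_INDEX)  # one marker per <image> placeholder
        ids.extend(tokenize(chunk))
    return ids

ids = tokenize_with_image("USER: <image>\nWhat is in this image? ASSISTANT:", toy_tokenize)
print(ids)
```

The negative index can never collide with a real vocabulary ID, which is why it works as a marker for where the image features belong.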

Available models

Model	Parameters	VRAM	Quality
LLaVA-v1.5-7B	7B	~14 GB	Good
LLaVA-v1.5-13B	13B	~28 GB	Better
LLaVA-v1.6-34B	34B	~70 GB	Best

# Load different models
model_7b = "liuhaotian/llava-v1.5-7b"
model_13b = "liuhaotian/llava-v1.5-13b"
model_34b = "liuhaotian/llava-v1.6-34b"

# 4-bit quantization for lower VRAM
load_4bit = True  # Reduces VRAM by ~4×

CLI usage

# Single image query (one-shot; llava.serve.cli is interactive and has no --query flag)
python -m llava.eval.run_llava \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg \
    --query "What is in this image?"

# Multi-turn conversation
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file image.jpg
# Then type questions interactively

Web UI (Gradio)

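# Launch Gradio interface: run the controller, web server, and model worker in separate terminals
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server \
    --controller http://localhost:10000 --model-list-mode reload
python -m llava.serve.model_worker --host 0.0.0.0 \
    --controller http://localhost:10000 --port 40000 \
    --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit  # Optional: reduce VRAM

# Open the URL printed by the web server (Gradio defaults to http://localhost:7860)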

Multi-turn conversations

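# Initialize conversation
# (generate() below is a hypothetical helper wrapping the "Basic usage" steps:
# build the prompt from conv, tokenize it, call model.generate, and decode)
conv = conv_templates["llava_v1"].copy()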

# Turn 1
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nWhat is in this image?")
conv.append_message(conv.roles[1], None)
response1 = generate(conv, model, image)  # "A dog playing in a park"

# Turn 2
conv.messages[-1][1] = response1  # Add previous response
conv.append_message(conv.roles[0], "What breed is the dog?")
conv.append_message(conv.roles[1], None)
response2 = generate(conv, model, image)  # "Golden Retriever"

# Turn 3
conv.messages[-1][1] = response2
conv.append_message(conv.roles[0], "What time of day is it?")
conv.append_message(conv.roles[1], None)
response3 = generate(conv, model, image)
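The pattern in the turns above is: append the user question, append an empty assistant slot, generate, then write the response back into that slot before the next turn. A minimal sketch of that bookkeeping follows, using a plain list instead of the real conversation object. The helper names (`start_turn`, `finish_turn`, `render`) and the simplified prompt rendering are assumptions for illustration; the actual llava_v1 template adds a system prompt and separators.

```python
def start_turn(messages, question):
    """Add the user question plus an open assistant slot for the model to fill."""
    messages.append(["USER", question])
    messages.append(["ASSISTANT", None])

def finish_turn(messages, response):
    """Store the model's response in the open slot so later turns can see it."""
    if not messages or messages[-1][1] is not None:
        raise ValueError("no open assistant slot to fill")
    messages[-1][1] = response

def render(messages):
    """Flatten the history into one prompt string (simplified rendering)."""
    parts = []
    for role, text in messages:
        parts.append(f"{role}:" if text is None else f"{role}: {text}")
    return " ".join(parts)

messages = []
start_turn(messages, "<image>\nWhat is in this image?")
finish_turn(messages, "A dog playing in a park")  # stand-in for a model response
start_turn(messages, "What breed is the dog?")
print(render(messages))
```

Forgetting the write-back step is the usual failure mode: the next prompt then contains an empty assistant turn and the model loses the earlier answer.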

Common tasks

Image captioning

# ask() is a hypothetical helper wrapping the "Basic usage" steps above
question = "Describe this image in detail."
response = ask(model, image, question)

Visual question answering

question = "How many people are in the image?"
response = ask(model, image, question)

Object detection (textual)

question = "List all the objects you can see in this image."
response = ask(model, image, question)

Scene understanding

question = "What is happening in this scene?"
response = ask(model, image, question)

Document understanding

question = "What is the main topic of this document?"
response = ask(model, document_image, question)

Training custom model

# Stage 1: Feature alignment (558K image-caption pairs)
bash scripts/v1_5/pretrain.sh

# Stage 2: Visual instruction tuning (150K instruction data)
bash scripts/v1_5/finetune.sh

Quantization (reduce VRAM)

# 4-bit quantization
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="liuhaotian/llava-v1.5-13b",
    model_base=None,
    model_name=get_model_name_from_path("liuhaotian/llava-v1.5-13b"),
    load_4bit=True  # Reduces VRAM ~4×
)

# 8-bit quantization: pass this keyword argument instead of load_4bit
load_8bit=True  # Reduces VRAM ~2×

Best practices

1. Start with 7B model - Good quality, manageable VRAM
2. Use 4-bit quantization - Reduces VRAM significantly
3. GPU required - CPU inference extremely slow
4. Clear prompts - Specific questions get better answers
5. Multi-turn conversations - Maintain conversation context
6. Temperature 0.2-0.7 - Balance creativity/consistency
7. max_new_tokens 512-1024 - For detailed responses
8. Batch processing - Process multiple images sequentially

Performance

Model	VRAM (FP16)	VRAM (4-bit)	Speed (tokens/s)
7B	~14 GB	~4 GB	~20
13B	~28 GB	~8 GB	~12
34B	~70 GB	~18 GB	~5

On A100 GPU
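The VRAM columns roughly track parameter count times bytes per parameter. The following back-of-the-envelope sketch computes the memory for the weights alone; it is an estimate under that single assumption, so real usage (activations, KV cache, the vision tower, quantization overhead) will be higher, which is consistent with the slightly larger figures in the table. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9)]:
    fp16 = weight_memory_gb(params, 16)  # 2 bytes per parameter
    int4 = weight_memory_gb(params, 4)   # 0.5 bytes per parameter
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

For the 7B model this gives about 14 GB in FP16 and 3.5 GB in 4-bit, a weights-only lower bound.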

Benchmarks

LLaVA achieves competitive scores on:

VQAv2: 78.5%
GQA: 62.0%
MM-Vet: 35.4%
MMBench: 64.3%

Limitations

1. Hallucinations - May describe things not in image
2. Spatial reasoning - Struggles with precise locations
3. Small text - Difficulty reading fine print
4. Object counting - Imprecise for many objects
5. VRAM requirements - Need powerful GPU
6. Inference speed - Slower than CLIP

Integration with frameworks

LangChain

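from langchain_core.language_models.llms import LLM

class LLaVALLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "llava"

    def _call(self, prompt, stop=None, run_manager=None, **kwargs):
        # Custom LLaVA inference; ask_llava is a hypothetical helper, and
        # model and image are assumed to be loaded elsewhere
        return ask_llava(model, image, prompt)

llm = LLaVALLM()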

Gradio App

import gradio as gr

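def chat(message, history, image):
    # ChatInterface calls fn(message, history, *additional_inputs)
    # ask_llava is a hypothetical helper wrapping LLaVA inference
    response = ask_llava(model, image, message)
    return response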

demo = gr.ChatInterface(
    chat,
    additional_inputs=[gr.Image(type="pil")],
    title="LLaVA Chat"
)
demo.launch()

Resources

GitHub: https://github.com/haotian-liu/LLaVA ⭐ 23,000+
Paper: https://arxiv.org/abs/2304.08485
Demo: https://llava.hliu.cc
Models: https://huggingface.co/liuhaotian
License: Apache 2.0

LLaVA Training Guide

Guide to training and fine-tuning LLaVA models.

Training stages

Stage 1: Feature alignment (Pretraining)

Purpose: Align vision encoder with language model

Data: 558K image-caption pairs (CC3M subset)

# Train the projector from scratch (or download a pretrained projector and skip to Stage 2)
bash scripts/v1_5/pretrain.sh

Configuration:

Base model: Vicuna-7B or LLaMA-2-7B
Vision encoder: CLIP ViT-L/14
Training time: ~20 hours on 8× A100

Stage 2: Visual instruction tuning

Purpose: Teach model to follow visual instructions

Data: 150K GPT-generated multimodal instruction data

# Fine-tune with instruction data
bash scripts/v1_5/finetune.sh

Configuration:

Epochs: 1
Batch size: 128 (across 8 GPUs)
Learning rate: 2e-5
Training time: ~24 hours on 8× A100
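The batch size of 128 is a global figure: per-device batch size times gradient accumulation steps times GPU count. The sketch below shows the arithmetic and how to keep the same global batch on fewer GPUs by raising gradient accumulation. The per-device value of 16 and accumulation of 1 are taken from the fine-tune script in this guide; the helper names are hypothetical.

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus

def grad_accum_for(target_global: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach the target global batch size."""
    per_step = per_device * num_gpus
    if target_global % per_step != 0:
        raise ValueError("target is not a multiple of per_device * num_gpus")
    return target_global // per_step

print(global_batch_size(per_device=16, grad_accum=1, num_gpus=8))   # 128
print(grad_accum_for(target_global=128, per_device=16, num_gpus=2))  # 4
```

Keeping the global batch size fixed means the learning rate of 2e-5 does not need retuning when the hardware changes.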

Data format

Instruction data format

[
    {
        "id": "001",
        "image": "path/to/image.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat is in this image?"
            },
            {
                "from": "gpt",
                "value": "The image shows a dog playing in a park."
            },
            {
                "from": "human",
                "value": "What breed is the dog?"
            },
            {
                "from": "gpt",
                "value": "It appears to be a Golden Retriever."
            }
        ]
    }
]
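A malformed record usually surfaces only after training has started, so it can help to check the file first. The following is a minimal validation sketch for the format above. The rules it enforces (turns alternate human/gpt, and `<image>` appears exactly once, in the first human turn) are assumptions read off the example rather than an official schema, and `validate_records` is a hypothetical helper.

```python
def validate_records(records):
    """Return a list of problems found; an empty list means the data looks OK."""
    if not isinstance(records, list):
        return ["top level must be a JSON array of records"]
    problems = []
    for i, rec in enumerate(records):
        for key in ("id", "image", "conversations"):
            if key not in rec:
                problems.append(f"record {i}: missing key '{key}'")
        turns = rec.get("conversations", [])
        if not turns or len(turns) % 2 != 0:
            problems.append(f"record {i}: expected complete human/gpt pairs")
        for j, turn in enumerate(turns):
            expected = "human" if j % 2 == 0 else "gpt"
            if turn.get("from") != expected:
                problems.append(f"record {i}, turn {j}: expected '{expected}'")
        # Assumed convention: one <image> placeholder, in the first human turn
        count = sum(t.get("value", "").count("<image>") for t in turns)
        if count != 1 or "<image>" not in turns[0].get("value", ""):
            problems.append(f"record {i}: expected one <image> in the first turn")
    return problems

sample = [{
    "id": "001",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is in this image?"},
        {"from": "gpt", "value": "The image shows a dog playing in a park."},
        {"from": "human", "value": "What breed is the dog?"},
        {"from": "gpt", "value": "It appears to be a Golden Retriever."},
    ],
}]
print(validate_records(sample))  # prints []
```

In practice the records would come from `json.load` on the dataset file rather than an inline literal.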

Fine-tuning on custom data

Prepare your data

import json

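# Create instruction data
# your_dataset: iterable of (image_path, [(question, answer), ...]) pairs
data = []
for image_path, qa_pairs in your_dataset:
    conversations = []
    for turn, (q, a) in enumerate(qa_pairs):
        # Only the first human turn carries the <image> placeholder
        prefix = "<image>\n" if turn == 0 else ""
        conversations.append({"from": "human", "value": f"{prefix}{q}"})
        conversations.append({"from": "gpt", "value": a})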

    data.append({
        "id": str(len(data)),
        "image": image_path,
        "conversations": conversations
    })

# Save
with open("custom_data.json", "w") as f:
    json.dump(data, f, indent=2)

Fine-tune script

#!/bin/bash

# Set paths
DATA_PATH="custom_data.json"
IMAGE_FOLDER="path/to/images"
MODEL_PATH="liuhaotian/llava-v1.5-7b"
OUTPUT_DIR="./checkpoints/llava-custom"

# Fine-tune
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $MODEL_PATH \
    --version v1 \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
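The script combines `--learning_rate 2e-5`, `--warmup_ratio 0.03`, and `--lr_scheduler_type "cosine"`. The sketch below shows what that schedule looks like over a run: linear warmup followed by cosine decay to zero. It is an illustration of the schedule shape, not the trainer's own code; `lr_at` is a hypothetical helper and the 10,000-sample dataset size is an assumed example value.

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

num_samples = 10_000                 # assumed custom dataset size
global_batch = 16 * 1 * 8            # per-device batch x accumulation x GPUs
total_steps = math.ceil(num_samples / global_batch)  # one epoch

for step in (0, 1, 3, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With a small dataset the warmup lasts only a few steps, so a noisy loss at the very start of training is expected.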

LoRA fine-tuning (memory efficient)

from peft import LoraConfig, get_peft_model

# LoRA config
lora_config = LoraConfig(
    r=8,  # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

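# Apply LoRA (base_model is the LLaVA model loaded beforehand)
# The LLaVA repo also ships scripts/v1_5/finetune_lora.sh as a ready-made entrypoint
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()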

# Train with much lower memory
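To see why LoRA trains with much lower memory, it helps to count the parameters the config above actually adds. Each adapted linear layer gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch assumes a LLaMA-7B-style language model (hidden size 4096, 32 layers, square q_proj and v_proj) and counts adapters in the language model only; those dimensions are assumptions, not values read from the checkpoint.

```python
def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the A (r x d_in) and B (d_out x r) adapter matrices."""
    return r * (d_in + d_out)

hidden_size = 4096        # assumed LLaMA-7B hidden size
num_layers = 32           # assumed LLaMA-7B decoder layers
target_modules = ["q_proj", "v_proj"]
rank = 8                  # r from the LoRA config

per_module = lora_params_per_layer(hidden_size, hidden_size, rank)
total = per_module * len(target_modules) * num_layers
print(f"trainable LoRA parameters: {total:,}")  # 4,194,304
print(f"share of a 7B model: {total / 7e9:.4%}")
```

Optimizer state and gradients are only kept for these adapter weights, which is where most of the memory saving comes from.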

Hardware requirements

Full fine-tuning

7B model: 8× A100 (40GB)
13B model: 8× A100 (80GB)
Training time: 20-48 hours

LoRA fine-tuning

7B model: 1× A100 (40GB)
13B model: 2× A100 (40GB)
Training time: 10-24 hours

Best practices

1. Start with pretrained - Don't train from scratch
2. Use LoRA for efficiency - 10× less memory
3. Quality over quantity - 1K high-quality > 10K low-quality
4. Multi-turn conversations - More engaging than single Q&A
5. Diverse images - Cover different scenarios
6. Clear instructions - Specific questions get better answers
7. Monitor loss - Should decrease smoothly
8. Save checkpoints - Training can fail
9. Test regularly - Validate on held-out set
10. Use DeepSpeed - For multi-GPU training

Resources

Training script: https://github.com/haotian-liu/LLaVA/tree/main/scripts
Data format: https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md
Paper: https://arxiv.org/abs/2304.08485

Related skills

Microsoft Foundry: Deploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface. (478k · 1.3k)

Ai Research Reproduction: Orchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs. (164k · 507)

Run Train: Safely execute selected deep learning training commands with standardized evidence capture. (164k · 507)

Explore Run: Safely run isolated exploratory experiments with clear recording and conservative selection before committing changes. (164k · 507)

Paper Context Resolver: Fetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav… (141k · 507)

Repo Intake And Plan: Scan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time. (140k · 507)

How it compares

Use llava for two-stage LLaVA v1.5 training pipelines; use blip-2-vision-language when fine-tuning Salesforce BLIP-2 for captioning or VQA only.

FAQ

How long does LLaVA Stage 1 pretraining take?

llava documents Stage 1 feature alignment on 558K image-caption pairs taking roughly 20 hours on 8× A100 GPUs when running scripts/v1_5/pretrain.sh with CLIP ViT-L/14 and Vicuna-7B.

What data does LLaVA Stage 2 instruction tuning use?

llava Stage 2 runs scripts/v1_5/finetune.sh on 150K GPT-generated multimodal instruction samples formatted as JSON conversations to teach visual instruction following.

Is Llava safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & ML · llm · research
