
Blip 2 Vision Language
Fine-tune Salesforce BLIP-2 for image captioning or VQA with LoRA or Q-Former-only training without guessing freeze and PEFT settings.
Overview
BLIP-2 Vision Language is an agent skill for the Build phase that documents LoRA and Q-Former fine-tuning workflows for Salesforce BLIP-2 vision-language models.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill blip-2-vision-language
What is this skill?
- LoRA fine-tuning recipe on blip2-opt-2.7b with q_proj, v_proj, k_proj, out_proj (~4M trainable of ~3.8B params)
- Q-Former-only freeze pattern that keeps vision and LLM weights fixed while tuning the bridge
- Custom CaptionDataset + DataLoader pattern for paired image–text fine-tuning
- float16 + device_map auto loading via Hugging Face Blip2ForConditionalGeneration and Blip2Processor
- LoRA example: ~4M trainable parameters of ~3.8B total (~0.1%)
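The trainable-parameter figure quoted in the list above can be sanity-checked by hand. The sketch below counts LoRA parameters from first principles. It assumes the adapters attach only to the attention layers of the OPT-2.7B language model, and that this model has 32 decoder layers with hidden size 2560; neither value is stated on this page. Each adapted linear layer adds two low-rank matrices, A (r × in) and B (out × r).

```python
# Back-of-the-envelope LoRA parameter count.
# ASSUMPTIONS (not from this page): OPT-2.7B has 32 decoder layers and
# hidden size 2560; the four target modules match only its attention layers.

def lora_param_count(rank, in_features, out_features):
    """Parameters added by one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * in_features + out_features * rank

HIDDEN_SIZE = 2560
NUM_LAYERS = 32
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "out_proj"]
RANK = 16

per_module = lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE)
total_lora = per_module * len(TARGET_MODULES) * NUM_LAYERS

BASE_PARAMS = 3.8e9  # approximate total quoted in the list above
fraction = total_lora / (BASE_PARAMS + total_lora)

print(f"per adapted layer: {per_module:,}")
print(f"total LoRA params: {total_lora:,}")
print(f"trainable fraction: {100 * fraction:.2f}%")
```

Under these assumptions the estimate lands on the order of ten million parameters, which is higher than the ~4M quoted, so treat the numbers printed by `model.print_trainable_parameters()` on your own setup as the authoritative figure.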
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need domain-accurate image captions or answers from BLIP-2 but the base checkpoint misses your visual vocabulary and you are unsure which layers to train.
Who is it for?
Solo builders fine-tuning BLIP-2 on a small custom image–caption dataset with a limited GPU budget and limited PEFT experience.
Skip if: You only need one-off inference from a hosted vision API and have no custom training pipeline.
When should I use this skill?
You are fine-tuning BLIP-2 with LoRA or Q-Former-only training and need freeze rules, target modules, and a caption dataset scaffold.
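The Q-Former-only freeze rule is a substring test on parameter names, so its effect can be previewed without loading the model. A minimal sketch, assuming parameter names shaped like those in the Hugging Face BLIP-2 implementation (the specific names below are illustrative examples, not an exhaustive dump, and `is_trainable` is a hypothetical helper):

```python
# Preview which parameters a name-based freeze rule would leave trainable.
# The names below are illustrative examples; check model.named_parameters()
# on your own checkpoint for the real list.

def is_trainable(param_name, keyword="qformer"):
    """Mirror of the freeze rule: train only names containing the keyword."""
    return keyword in param_name.lower()

example_names = [
    "vision_model.encoder.layers.0.self_attn.qkv.weight",
    "query_tokens",
    "qformer.encoder.layer.0.attention.attention.query.weight",
    "language_projection.weight",
    "language_model.model.decoder.layers.0.self_attn.q_proj.weight",
]

plan = {name: is_trainable(name) for name in example_names}
for name, trainable in plan.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")
```

With this rule only the `qformer.*` entry trains. Names such as `query_tokens` and `language_projection.weight` do not contain the keyword and stay frozen, so if you want those bridge parameters to train as well you would need to widen the test.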
What do I get? / Deliverables
You get a copy-ready training setup (LoRA or Q-Former-only) with frozen-weight rules and a caption dataset template, so you can start a fine-tune run with the freeze and PEFT settings already decided.
- LoRA or Q-Former-only training configuration for Blip2ForConditionalGeneration
- CaptionDataset and DataLoader pattern for fine-tuning data
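The CaptionDataset pattern expects a list of records, each with an image path and a caption. One way to build that list is sketched here, assuming your annotations live in a JSON Lines file with `image_path` and `caption` keys. The file layout and the `load_caption_records` helper are hypothetical and not part of the skill.

```python
import json
import os

def load_caption_records(jsonl_path, image_root=""):
    """Read {"image_path", "caption"} records, skipping incomplete ones.

    Hypothetical helper: the JSONL layout is an assumption for illustration.
    """
    records = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            caption = str(row.get("caption", "")).strip()
            image_path = row.get("image_path")
            if not caption or not image_path:
                continue  # drop rows missing either field
            records.append({
                "image_path": os.path.join(image_root, image_path),
                "caption": caption,
            })
    return records
```

A call such as `train_data = load_caption_records("captions.jsonl", image_root="images")` would then produce the list passed to the dataset. Checking that each image file actually exists is left out for brevity.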
Journey fit
Vision-language model training is core product engineering once you are past idea validation and need a custom multimodal stack. Backend subphase covers model loading, training loops, datasets, and GPU-oriented PyTorch code rather than UI or distribution work.
How it compares
Use for BLIP-2-specific PEFT and Q-Former patterns instead of generic Hugging Face fine-tuning cheat sheets that skip multimodal freeze logic.
Common Questions / FAQ
Who is blip-2-vision-language for?
Indie developers and small ML-minded teams building multimodal products who already use PyTorch and Transformers and want BLIP-2 fine-tuning steps without reading the full paper stack first.
When should I use blip-2-vision-language?
During build when you are implementing training scripts for captioning or visual QA, especially after a prototype proves BLIP-2 works but domain terms or layout understanding fail on real user images.
Is blip-2-vision-language safe to install?
Treat it like any third-party agent skill: review the Security Audits panel on this Prism page and inspect the skill bundle before granting shell, network, or filesystem access on your training machine.
Workflow Chain
Then invoke: unsloth
SKILL.md
# BLIP-2 Advanced Usage Guide

## Fine-tuning BLIP-2

### LoRA fine-tuning (recommended)

```python
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

# Load base model and its processor
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

# Configure LoRA for the language model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~4M, all params: ~3.8B (0.1%)
```

### Fine-tuning Q-Former only

```python
# Freeze everything except Q-Former
for name, param in model.named_parameters():
    if "qformer" not in name.lower():
        param.requires_grad = False
    else:
        param.requires_grad = True

# Check trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
```

### Custom dataset for fine-tuning

```python
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class CaptionDataset(Dataset):
    def __init__(self, data, processor, max_length=128):
        self.data = data  # List of {"image_path": str, "caption": str}
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        image = Image.open(item["image_path"]).convert("RGB")

        # Process inputs
        encoding = self.processor(
            images=image,
            text=item["caption"],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )

        # Remove batch dimension
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}

        # Labels for language modeling; padding positions are set to -100
        # so the loss ignores them
        labels = encoding["input_ids"].clone()
        labels[encoding["attention_mask"] == 0] = -100
        encoding["labels"] = labels
        return encoding

# Create dataloader
# train_data is your own list of {"image_path": ..., "caption": ...} dicts
dataset = CaptionDataset(train_data, processor)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
```

### Training loop

```python
from torch.optim import AdamW  # transformers.AdamW is deprecated
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm

# Optimizer (only the parameters left trainable by LoRA or the freeze step)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=1e-5, weight_decay=0.01)

# Scheduler
num_epochs = 3
num_training_steps = len(dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_training_steps // 10,
    num_training_steps=num_training_steps
)

# Training
model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for batch in tqdm(dataloader, desc=f"Epoch {epoch+1}"):
        batch = {k: v.to("cuda") for k, v in batch.items()}
        # Match the float16 weights the model was loaded with
        batch["pixel_values"] = batch["pixel_values"].to(torch.float16)

        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        torch.nn.utils.clip_grad_norm_(trainable_params, 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1} - Loss: {avg_loss:.4f}")

# Save fine-tuned model
model.save_pretrained("blip2-finetuned")
processor.save_pretrained("blip2-finetuned")
```

### Fine-tuning with LAVIS

```python
from lavis.models import load_model_and_preprocess
from lavis.common.registry import registry
from lavis.datasets.builders import load_dataset

# Load model
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_opt",
    model_type="pretrain_opt2.7b",
    is_eval=False,  # Training mode
    device="cuda"
)

# Load dataset
dataset = load_dataset("coco_caption")

# Get trainer class
runner_cls = registry.get_runner_class(