
Huggingface Vision Trainer
Fine-tune SAM2.1 on a vision matting dataset with Hugging Face Trainer, custom loss, and dataset collation for bbox-conditioned masks.
Overview
Hugging Face Vision Trainer is an agent skill for the Build phase that guides fine-tuning SAM2.1 with Hugging Face Trainer on a matting dataset using custom dataset and loss setup.
Install
npx skills add https://github.com/huggingface/skills --skill huggingface-vision-trainer
What is this skill?
- End-to-end SAM2.1 fine-tune on merve/MicroMat-mini with train/test split
- SAMDataset and custom collator aligned to SAM2 processor expectations
- Ground-truth mask visualization with bbox overlays via matplotlib
- HF ecosystem stack: transformers, datasets, monai, trackio
- Uses dataset merve/MicroMat-mini with 10% test split
- Documents SAM2.1 fine-tuning with Hugging Face Trainer and custom loss
Adoption & trust: 818 installs on skills.sh; 10.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need tighter segmentation or matting on your own images but generic SAM2 checkpoints miss domain-specific mask quality.
Who is it for?
Indie ML builders already comfortable with Python and GPU training who want SAM2 fine-tuning recipes on HF datasets.
Skip if: you are a non-technical founder looking for a no-code vision API, or you need production deployment hardening without reading training code.
When should I use this skill?
You are implementing SAM2 or similar vision fine-tuning with HF Trainer, custom collators, and matting or mask supervision.
What do I get? / Deliverables
You get a reproducible training notebook workflow (dataset split, SAM2 processor wiring, and Trainer-ready batches) for fine-tuning a vision model on your task.
- Train/val split pipeline for MicroMat-mini
- SAMDataset + collator pattern for SAM2 processor
- Visualization and training-ready sample batches
Journey fit
Shelved under Build because the skill walks through training pipelines, datasets, and model customization rather than distribution or production monitoring. The Backend subphase fits ML training jobs, collators, and HF Trainer configuration rather than UI polish.
How it compares
This is a training recipe skill, not an MCP server or a one-click hosted AutoTrain button.
Common Questions / FAQ
Who is huggingface-vision-trainer for?
Developers fine-tuning vision segmentation models with Hugging Face Trainer who want SAM2.1 matting examples on MicroMat-mini.
When should I use huggingface-vision-trainer?
Use it in Build backend work when implementing custom vision features that need fine-tuned masks before you Ship test checkpoints.
Is huggingface-vision-trainer safe to install?
Training skills pull packages and datasets from the network; review the Security Audits panel on this page and pin dependency versions in your environment.
SKILL.md
# Fine-tuning SAM2 with HF Trainer

Fine-tune SAM2.1 on a small part of the MicroMat dataset for image matting, using the Hugging Face Trainer with a custom loss function.

```python
!pip install -q transformers datasets monai trackio
```

## Load and explore the dataset

```python
from datasets import load_dataset

dataset = load_dataset("merve/MicroMat-mini", split="train")
dataset
```

```python
dataset = dataset.train_test_split(test_size=0.1)
train_ds = dataset["train"]
val_ds = dataset["test"]
```

```python
import json

train_ds[0]
```

```python
json.loads(train_ds["prompt"][0])["bbox"]
```

## Visualize a sample

```python
import matplotlib.pyplot as plt
import numpy as np


def show_mask(mask, ax, bbox):
    color = np.array([0.12, 0.56, 1.0, 0.6])
    mask = np.array(mask)
    h, w = mask.shape
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
    x0, y0, x1, y1 = bbox
    ax.add_patch(
        plt.Rectangle(
            (x0, y0), x1 - x0, y1 - y0, fill=False, edgecolor="lime", linewidth=2
        )
    )


example = train_ds[0]
image = np.array(example["image"])
ground_truth_mask = np.array(example["mask"])

fig, ax = plt.subplots()
ax.imshow(image)
show_mask(ground_truth_mask, ax, json.loads(example["prompt"])["bbox"])
ax.set_title("Ground truth mask")
ax.set_axis_off()
plt.show()
```

## Build the dataset and collator

`SAMDataset` wraps each sample into the format expected by the SAM2 processor. Ground-truth masks are stored under the key `"labels"` so the Trainer automatically pops them before calling `model.forward()`.
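Because the masks travel under `"labels"` and are removed from the inputs before the forward pass, the loss has to be computed from the model outputs outside the model. The sketch below shows one way such a custom loss could look: binary cross-entropy plus a soft Dice term in plain PyTorch. The name `mask_loss`, the equal weighting of the two terms, and the assumption that the outputs expose low-resolution logits as `pred_masks` with shape `(batch, 1, 1, H, W)` are illustrative choices, not taken from the original notebook (which installs `monai` and may use its loss classes instead). The `(outputs, labels, num_items_in_batch)` signature is meant to match what `Trainer`'s `compute_loss_func` argument expects in recent `transformers` releases.

```python
import torch
import torch.nn.functional as F


def mask_loss(outputs, labels, num_items_in_batch=None, eps=1e-6):
    """BCE + soft Dice on mask logits (hypothetical helper, illustrative only)."""
    # Assumption: outputs.pred_masks holds logits shaped (B, 1, 1, H, W)
    # and labels holds binary masks shaped (B, 1, H, W).
    logits = outputs.pred_masks.flatten(1)  # (B, H*W)
    target = labels.float().flatten(1)      # (B, H*W), values in {0, 1}

    # Pixel-wise term, averaged over every pixel in the batch.
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Overlap term, computed per sample and then averaged.
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)

    return bce + dice.mean()
```

A function like this could be handed to the Trainer as `compute_loss_func=mask_loss`, provided the installed `transformers` version supports that argument. The resulting loss is a scalar tensor that shrinks as predicted masks overlap the ground truth more closely.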
```python
from torch.utils.data import Dataset
import torch
import torch.nn.functional as F


class SAMDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        image = item["image"]
        prompt = json.loads(item["prompt"])["bbox"]
        inputs = self.processor(image, input_boxes=[[prompt]], return_tensors="pt")
        inputs["labels"] = (np.array(item["mask"]) > 0).astype(np.float32)
        inputs["original_image_size"] = torch.tensor(image.size[::-1])
        return inputs


def collate_fn(batch):
    pixel_values = torch.cat([item["pixel_values"] for item in batch], dim=0)
    original_sizes = torch.stack([item["original_sizes"] for item in batch])
    input_boxes = torch.cat([item["input_boxes"] for item in batch], dim=0)
    labels = torch.cat(
        [
            F.interpolate(
                torch.as_tensor(x["labels"]).unsqueeze(0).unsqueeze(0).float(),
                size=(256, 256),
                mode="nearest",
            )
            for x in batch
        ],
        dim=0,
    ).long()
    return {
        "pixel_values": pixel_values,
        "original_sizes": original_sizes,
        "input_boxes": input_boxes,
        "labels": labels,
        "original_image_size": torch.stack(
            [item["original_image_size"] for item in batch]
        ),
        "multimask_output": False,
    }
```

```python
from transformers import Sam2Processor

processor = Sam2Processor.from_pretrained("facebook/sam2.1-hiera-small")
train_dataset = SAMDataset(dataset=train_ds, processor=processor)
val_dataset = SAMDataset(dataset=val_ds, processor=processor)
```

## Load model and freeze encoder layers

```python
from transformers import Sam2Model

model = Sam2Model.from_pretrained("facebook/sam2.1-hiera-small")
for name, param in model.named_parameters():
    if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
        param.requires_grad_(False)
```

## Inference before training

```python
item = val_ds[1]
img = item["image"]
bbox = json.loads(item["prompt"])["bbox"]
inputs = processor(images=img, input_boxes=[[bbox]], return_tensors="pt").to(
    model.device
)
with torch.no_grad():
    outputs = model(**inputs)
masks