Huggingface Vision Trainer

Name: Huggingface Vision Trainer
Author: huggingface

huggingface/skills

Fine-tune SAM2.1 on a vision matting dataset with Hugging Face Trainer, custom loss, and dataset collation for bbox-conditioned masks.

Overview

Hugging Face Vision Trainer is an agent skill for the Build phase. It guides an agent through fine-tuning SAM2.1 with the Hugging Face Trainer on a matting dataset, using a custom dataset wrapper, collator, and loss.

Install

npx skills add https://github.com/huggingface/skills --skill huggingface-vision-trainer

What is this skill?

End-to-end SAM2.1 fine-tune on merve/MicroMat-mini with a 90/10 train/test split
SAMDataset and custom collator aligned to SAM2 processor expectations
Ground-truth mask visualization with bbox overlays via matplotlib
HF ecosystem stack: transformers, datasets, monai, trackio
Documents how to pair the Hugging Face Trainer with a custom loss

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 818 installs on skills.sh; 10.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You need tighter segmentation or matting on your own images but generic SAM2 checkpoints miss domain-specific mask quality.

Who is it for?

Indie ML builders who are already comfortable with Python and GPUs and want SAM2 fine-tuning recipes on HF datasets.

Skip if: you are a non-technical founder looking for a no-code vision API, or you need production deployment hardening without reading training code.

When should I use this skill?

You are implementing SAM2 or similar vision fine-tuning with HF Trainer, custom collators, and matting or mask supervision.

What do I get? / Deliverables

You get a reproducible training notebook path for fine-tuning a vision model on your task: a dataset split, SAM2 processor wiring, and Trainer-ready batches.

Train/val split pipeline for MicroMat-mini
SAMDataset + collator pattern for SAM2 processor
Visualization and training-ready sample batches

Journey fit

Primary fit

Build: Backend, data & payments

Canonical shelf on Build because the skill walks through training pipelines, datasets, and model customization—not distribution or prod monitoring. Backend subphase fits ML training jobs, collators, and HF Trainer configuration rather than UI polish.

How it compares

This is a training recipe skill, not an MCP server or a one-click hosted AutoTrain button.

Common Questions / FAQ

Who is huggingface-vision-trainer for?

Developers fine-tuning vision segmentation models with Hugging Face Trainer who want SAM2.1 matting examples on MicroMat-mini.

When should I use huggingface-vision-trainer?

Use it during Build-phase backend work, when a custom vision feature needs fine-tuned masks before you test checkpoints in the Ship phase.

Is huggingface-vision-trainer safe to install?

Training skills pull packages and datasets from the network; review the Security Audits panel on this page and pin dependency versions in your environment.

SKILL.md


# Fine-tuning SAM2 with HF Trainer

Fine-tune SAM2.1 on a small part of the MicroMat dataset for image matting,
using the Hugging Face Trainer with a custom loss function.

```python
!pip install -q transformers datasets monai trackio
```

## Load and explore the dataset

```python
from datasets import load_dataset

dataset = load_dataset("merve/MicroMat-mini", split="train")
dataset
```

```python
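# Hold out 10% for validation. The fixed seed is an addition for
# reproducibility; any integer works.
dataset = dataset.train_test_split(test_size=0.1, seed=42)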
train_ds = dataset["train"]
val_ds = dataset["test"]
```

```python
import json

train_ds[0]
```

```python
# Index the row first so only one example is read, not the whole column.
json.loads(train_ds[0]["prompt"])["bbox"]
```
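Each prompt is a JSON string whose `"bbox"` field is later unpacked as `x0, y0, x1, y1`. The following is a minimal sketch of a parsing helper that also rejects malformed boxes. The helper name `parse_bbox` and the sample prompt string are hypothetical, and the validation rules are an assumption about what counts as a usable box, not something the dataset guarantees.

```python
import json


def parse_bbox(prompt_json):
    """Return (x0, y0, x1, y1) as floats from a prompt JSON string."""
    bbox = json.loads(prompt_json)["bbox"]
    if len(bbox) != 4:
        raise ValueError(f"expected 4 bbox values, got {len(bbox)}")
    x0, y0, x1, y1 = (float(v) for v in bbox)
    # Assumed sanity rule: a usable box has positive width and height.
    if x1 <= x0 or y1 <= y0:
        raise ValueError(f"degenerate box: {bbox}")
    return x0, y0, x1, y1


# Hypothetical prompt string, shaped like the dataset's "prompt" column.
sample_prompt = '{"bbox": [10, 20, 110, 220]}'
print(parse_bbox(sample_prompt))  # (10.0, 20.0, 110.0, 220.0)
```

Running a check like this over the whole split before training can surface bad annotations early, rather than as a shape or NaN error mid-epoch.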

## Visualize a sample

```python
import matplotlib.pyplot as plt
import numpy as np


def show_mask(mask, ax, bbox):
    color = np.array([0.12, 0.56, 1.0, 0.6])
    mask = np.array(mask)
    h, w = mask.shape
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, 4)
    ax.imshow(mask_image)
    x0, y0, x1, y1 = bbox
    ax.add_patch(
        plt.Rectangle(
            (x0, y0), x1 - x0, y1 - y0, fill=False, edgecolor="lime", linewidth=2
        )
    )


example = train_ds[0]
image = np.array(example["image"])
ground_truth_mask = np.array(example["mask"])

fig, ax = plt.subplots()
ax.imshow(image)
show_mask(ground_truth_mask, ax, json.loads(example["prompt"])["bbox"])
ax.set_title("Ground truth mask")
ax.set_axis_off()
plt.show()
```
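The overlay in `show_mask` relies on NumPy broadcasting: an `(h, w, 1)` mask multiplied by a `(1, 1, 4)` color yields an `(h, w, 4)` RGBA image. If the mask is stored as 8-bit values (an assumption; the dtype is not stated here), the product can exceed 1.0 and matplotlib will clip it. The sketch below binarizes first, using the same `> 0` threshold that `SAMDataset` applies further down. The helper name `mask_to_rgba` and the toy mask values are made up for illustration.

```python
import numpy as np


def mask_to_rgba(mask, color=(0.12, 0.56, 1.0, 0.6)):
    """Turn a 2-D mask into an (h, w, 4) RGBA overlay with values in [0, 1]."""
    binary = (np.asarray(mask) > 0).astype(np.float32)  # foreground becomes 1.0
    rgba = np.asarray(color, dtype=np.float32)
    # (h, w, 1) * (1, 1, 4) broadcasts to (h, w, 4).
    return binary[:, :, None] * rgba[None, None, :]


toy_mask = np.array([[0, 255], [128, 0]], dtype=np.uint8)  # made-up values
overlay = mask_to_rgba(toy_mask)
print(overlay.shape)  # (2, 2, 4)
```

Background pixels end up fully transparent (alpha 0), so the image underneath stays visible.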

## Build the dataset and collator

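`SAMDataset` wraps each sample into the format expected by the SAM2 processor.
Ground-truth masks are stored under the key `"labels"` so the Trainer can
separate them from the model inputs: with a custom loss function configured, it
pops `"labels"` from the batch before calling `model.forward()` and passes them
to the loss.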

```python
from torch.utils.data import Dataset
import torch
import torch.nn.functional as F


class SAMDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        image = item["image"]
        prompt = json.loads(item["prompt"])["bbox"]
        inputs = self.processor(image, input_boxes=[[prompt]], return_tensors="pt")
        inputs["labels"] = (np.array(item["mask"]) > 0).astype(np.float32)
        inputs["original_image_size"] = torch.tensor(image.size[::-1])
        return inputs


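def collate_fn(batch):
    # pixel_values and input_boxes come from a single-image processor call and
    # already carry a leading batch dimension, so they are concatenated on dim 0.
    pixel_values = torch.cat([item["pixel_values"] for item in batch], dim=0)
    original_sizes = torch.stack([item["original_sizes"] for item in batch])
    input_boxes = torch.cat([item["input_boxes"] for item in batch], dim=0)
    # Resize every binary mask to 256x256 (presumably the resolution of the
    # model's predicted masks); nearest-neighbour keeps the values at 0 or 1.
    labels = torch.cat(
        [
            F.interpolate(
                torch.as_tensor(x["labels"]).unsqueeze(0).unsqueeze(0).float(),
                size=(256, 256),
                mode="nearest",
            )
            for x in batch
        ],
        dim=0,
    ).long()

    return {
        "pixel_values": pixel_values,
        "original_sizes": original_sizes,
        "input_boxes": input_boxes,
        "labels": labels,
        "original_image_size": torch.stack(
            [item["original_image_size"] for item in batch]
        ),
        # Constant flag forwarded to the model: request one mask per box.
        "multimask_output": False,
    }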
```
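The collator yields integer labels of shape `(batch, 1, 256, 256)`. As a rough illustration of a custom loss that could consume them, here is a minimal sketch of a soft Dice plus binary cross-entropy loss in plain PyTorch. This is an assumption, not the loss the skill uses: the `monai` install suggests a MONAI loss, and the name `dice_bce_loss` is hypothetical. Which model output field holds the mask logits, and how the function is wired in (for example through the Trainer's `compute_loss_func` argument in recent transformers versions), is also left as an assumption.

```python
import torch
import torch.nn.functional as F


def dice_bce_loss(logits, labels, eps=1e-6):
    """Soft Dice plus binary cross-entropy on mask logits.

    logits: float tensor (batch, 1, H, W) of raw, pre-sigmoid scores.
    labels: tensor (batch, 1, H, W) holding 0/1 values, any dtype.
    """
    targets = labels.float()
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)  # reduce over everything except the batch dimension
    intersection = (probs * targets).sum(dim=dims)
    denom = probs.sum(dim=dims) + targets.sum(dim=dims)
    dice = 1.0 - (2.0 * intersection + eps) / (denom + eps)
    return bce + dice.mean()


# Toy labels prepared the same way as in collate_fn: resize, then cast to long.
raw_mask = torch.zeros(1, 1, 512, 384)
raw_mask[..., 100:300, 50:200] = 1.0
labels = F.interpolate(raw_mask, size=(256, 256), mode="nearest").long()

good_logits = (labels.float() * 2 - 1) * 10.0  # confident and correct
bad_logits = -good_logits                      # confident and wrong
good_loss = dice_bce_loss(good_logits, labels)
bad_loss = dice_bce_loss(bad_logits, labels)
print(good_loss.item(), bad_loss.item())
```

Combining the two terms is a common choice for masks where foreground pixels are a small fraction of the image: cross-entropy gives per-pixel gradients, while Dice is less sensitive to class imbalance.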

```python
from transformers import Sam2Processor

processor = Sam2Processor.from_pretrained("facebook/sam2.1-hiera-small")

train_dataset = SAMDataset(dataset=train_ds, processor=processor)
val_dataset = SAMDataset(dataset=val_ds, processor=processor)
```

## Load model and freeze encoder layers

```python
from transformers import Sam2Model

model = Sam2Model.from_pretrained("facebook/sam2.1-hiera-small")

for name, param in model.named_parameters():
    if name.startswith("vision_encoder") or name.startswith("prompt_encoder"):
        param.requires_grad_(False)
```
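Freezing by name prefix is easy to get subtly wrong, for instance if a prefix does not match any parameter. The sketch below shows one way to verify the result by counting trainable parameters. `ToySam` is a hypothetical stand-in whose submodule names mirror the prefixes used above; its layer sizes are arbitrary, and the helper names are made up for illustration.

```python
import torch.nn as nn


class ToySam(nn.Module):
    """Hypothetical stand-in with the same top-level prefixes as the loop above."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.prompt_encoder = nn.Linear(4, 8)
        self.mask_decoder = nn.Linear(8, 2)


def freeze_by_prefix(model, prefixes):
    """Disable gradients for parameters whose name starts with any prefix."""
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad_(False)


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


toy = ToySam()
freeze_by_prefix(toy, ["vision_encoder", "prompt_encoder"])
trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / {total}")  # trainable: 18 / 130
```

Calling `count_parameters(model)` on the real model after the freezing loop gives a quick check that only the intended part remains trainable.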

## Inference before training

```python
item = val_ds[1]
img = item["image"]
bbox = json.loads(item["prompt"])["bbox"]
inputs = processor(images=img, input_boxes=[[bbox]], return_tensors="pt").to(
    model.device
)

with torch.no_grad():
    outputs = model(**inputs)

```
