Segment Anything Model

Name: Segment Anything Model
Author: orchestra-research

orchestra-research/ai-research-skills

398 installs
11.2k repo stars
Updated June 16, 2026
orchestra-research/ai-research-skills

segment-anything-model is an agent skill that guides developers implementing Meta SAM, SAM 2 video tracking, and Grounded SAM text-to-mask pipelines for computer-vision features.

About

segment-anything-model is a Claude Code skill from orchestra-research/ai-research-skills for adding Meta segmentation to products. It covers SAM 2 video segmentation with build_sam2_video_predictor, init_state on video files, point prompts via add_new_points, and propagate_in_video mask streaming. Grounded SAM text-to-mask workflows connect natural-language prompts to masks for agent vision pipelines. Developers reach for segment-anything-model when shipping interactive segmentation, video object tracking, or text-conditioned masking without writing integration code from scratch.

SAM 2 video path: build_sam2_video_predictor, init_state, point prompts, and propagate_in_video loop
Side-by-side SAM vs SAM 2 table covering memory bank, tracking, and Hiera model sizes
Grounded SAM pipeline: Grounding DINO load_model/predict plus SamPredictor for text-prompted masks
Concrete dependency commands for segment-anything-2 git install and groundingdino-py
Advanced usage focus beyond hello-world—video propagation and text-to-mask composition

Segment Anything Model by the numbers

398 all-time installs (skills.sh)
+36 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #499 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill segment-anything-model

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/orchestra-research/ai-research-skills/segment-anything-model.svg)](https://skillselion.com/skills/orchestra-research/ai-research-skills/segment-anything-model)

Installs	398
repo stars	★ 11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you integrate SAM 2 video segmentation in Python?

Implement Meta SAM, SAM 2 video tracking, and Grounded SAM text-to-mask pipelines when you add segmentation to a CV or agent vision feature.

Who is it for?

CV engineers adding Meta SAM or SAM 2 segmentation to Python agent or vision backends.

Skip if: Teams needing only prebuilt cloud vision APIs without custom mask pipelines.

When should I use this skill?

A developer is integrating SAM, SAM 2 video tracking, or Grounded SAM into a computer-vision feature.

What you get

SAM 2 predictor setup code, point-prompt calls, video propagation loop, and Grounded SAM text-to-mask pipeline stubs.

SAM 2 predictor initialization code
video mask propagation loop

By the numbers

Uses sam2_hiera_l.yaml config with sam2_hiera_large.pt weights

Files

SKILL.mdMarkdownGitHub ↗

Segment Anything Model (SAM)

Comprehensive guide to using Meta AI's Segment Anything Model for zero-shot image segmentation.

When to use SAM

Use SAM when:

Need to segment any object in images without task-specific training
Building interactive annotation tools with point/box prompts
Generating training data for other vision models
Need zero-shot transfer to new image domains
Building object detection/segmentation pipelines
Processing medical, satellite, or domain-specific images

Key features:

Zero-shot segmentation: Works on any image domain without fine-tuning
Flexible prompts: Points, bounding boxes, or previous masks
Automatic segmentation: Generate all object masks automatically
High quality: Trained on 1.1 billion masks from 11 million images
Multiple model sizes: ViT-B (fastest), ViT-L, ViT-H (most accurate)
ONNX export: Deploy in browsers and edge devices

Use alternatives instead:

YOLO/Detectron2: For real-time object detection with classes
Mask2Former: For semantic/panoptic segmentation with categories
GroundingDINO + SAM: For text-prompted segmentation
SAM 2: For video segmentation tasks

Quick start

Installation

# From GitHub
pip install git+https://github.com/facebookresearch/segment-anything.git

# Optional dependencies
pip install opencv-python pycocotools matplotlib

# Or use HuggingFace transformers
pip install transformers

Download checkpoints

# ViT-H (largest, most accurate) - 2.4GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# ViT-L (medium) - 1.2GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth

# ViT-B (smallest, fastest) - 375MB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

Basic usage with SamPredictor

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# Create predictor
predictor = SamPredictor(sam)

# Set image (computes embeddings once)
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Predict with point prompts
input_point = np.array([[500, 375]])  # (x, y) coordinates
input_label = np.array([1])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True  # Returns 3 mask options
)

# Select best mask
best_mask = masks[np.argmax(scores)]

HuggingFace Transformers

import torch
from PIL import Image
from transformers import SamModel, SamProcessor

# Load model and processor
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to("cuda")

# Process image with point prompt
image = Image.open("image.jpg")
input_points = [[[450, 600]]]  # Batch of points

inputs = processor(image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate masks
with torch.no_grad():
    outputs = model(**inputs)

# Post-process masks to original size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

Core concepts

Model architecture

SAM Architecture:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Image Encoder  │────▶│ Prompt Encoder  │────▶│  Mask Decoder   │
│     (ViT)       │     │ (Points/Boxes)  │     │ (Transformer)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
   Image Embeddings      Prompt Embeddings         Masks + IoU
   (computed once)       (per prompt)             predictions

Model variants

Model	Checkpoint	Size	Speed	Accuracy
ViT-H	`vit_h`	2.4 GB	Slowest	Best
ViT-L	`vit_l`	1.2 GB	Medium	Good
ViT-B	`vit_b`	375 MB	Fastest	Good

Prompt types

Prompt	Description	Use Case
Point (foreground)	Click on object	Single object selection
Point (background)	Click outside object	Exclude regions
Bounding box	Rectangle around object	Larger objects
Previous mask	Low-res mask input	Iterative refinement

Interactive segmentation

Point prompts

# Single foreground point
input_point = np.array([[500, 375]])
input_label = np.array([1])

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

# Multiple points (foreground + background)
input_points = np.array([[500, 375], [600, 400], [450, 300]])
input_labels = np.array([1, 1, 0])  # 2 foreground, 1 background

masks, scores, logits = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False  # Single mask when prompts are clear
)

Box prompts

# Bounding box [x1, y1, x2, y2]
input_box = np.array([425, 600, 700, 875])

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False
)

Combined prompts

# Box + points for precise control
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False
)

Iterative refinement

# Initial prediction
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True
)

# Refine with additional point using previous mask
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [550, 400]]),
    point_labels=np.array([1, 0]),  # Add background point
    mask_input=logits[np.argmax(scores)][None, :, :],  # Use best mask
    multimask_output=False
)

Automatic mask generation

Basic automatic segmentation

from segment_anything import SamAutomaticMaskGenerator

# Create generator
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate all masks
masks = mask_generator.generate(image)

# Each mask contains:
# - segmentation: binary mask
# - bbox: [x, y, w, h]
# - area: pixel count
# - predicted_iou: quality score
# - stability_score: robustness score
# - point_coords: generating point

Customized generation

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,          # Grid density (more = more masks)
    pred_iou_thresh=0.88,        # Quality threshold
    stability_score_thresh=0.95,  # Stability threshold
    crop_n_layers=1,             # Multi-scale crops
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,    # Remove tiny masks
)

masks = mask_generator.generate(image)

Filtering masks

# Sort by area (largest first)
masks = sorted(masks, key=lambda x: x['area'], reverse=True)

# Filter by predicted IoU
high_quality = [m for m in masks if m['predicted_iou'] > 0.9]

# Filter by stability score
stable_masks = [m for m in masks if m['stability_score'] > 0.95]

Batched inference

Multiple images

# Process multiple images efficiently
images = [cv2.imread(f"image_{i}.jpg") for i in range(10)]

all_masks = []
for image in images:
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True
    )
    all_masks.append(masks)

Multiple prompts per image

# Process multiple prompts efficiently (one image encoding)
predictor.set_image(image)

# Batch of point prompts
points = [
    np.array([[100, 100]]),
    np.array([[200, 200]]),
    np.array([[300, 300]])
]

all_masks = []
for point in points:
    masks, scores, _ = predictor.predict(
        point_coords=point,
        point_labels=np.array([1]),
        multimask_output=True
    )
    all_masks.append(masks[np.argmax(scores)])

ONNX deployment

Export model

python scripts/export_onnx_model.py \
    --checkpoint sam_vit_h_4b8939.pth \
    --model-type vit_h \
    --output sam_onnx.onnx \
    --return-single-mask

Use ONNX model

import onnxruntime

# Load ONNX model
ort_session = onnxruntime.InferenceSession("sam_onnx.onnx")

# Run inference (image embeddings computed separately)
masks = ort_session.run(
    None,
    {
        "image_embeddings": image_embeddings,
        "point_coords": point_coords,
        "point_labels": point_labels,
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
        "has_mask_input": np.array([0], dtype=np.float32),
        "orig_im_size": np.array([h, w], dtype=np.float32)
    }
)

Common workflows

Workflow 1: Annotation tool

import cv2

# Load model
predictor = SamPredictor(sam)
predictor.set_image(image)

def on_click(event, x, y, flags, param):
    if event == cv2.EVENT_LBUTTONDOWN:
        # Foreground point
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),
            multimask_output=True
        )
        # Display best mask
        display_mask(masks[np.argmax(scores)])

Workflow 2: Object extraction

def extract_object(image, point):
    """Extract object at point with transparent background."""
    predictor.set_image(image)

    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),
        multimask_output=True
    )

    best_mask = masks[np.argmax(scores)]

    # Create RGBA output
    rgba = np.zeros((image.shape[0], image.shape[1], 4), dtype=np.uint8)
    rgba[:, :, :3] = image
    rgba[:, :, 3] = best_mask * 255

    return rgba

Workflow 3: Medical image segmentation

# Process medical images (grayscale to RGB)
medical_image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
rgb_image = cv2.cvtColor(medical_image, cv2.COLOR_GRAY2RGB)

predictor.set_image(rgb_image)

# Segment region of interest
masks, scores, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),  # ROI bounding box
    multimask_output=True
)

Output format

Mask data structure

# SamAutomaticMaskGenerator output
{
    "segmentation": np.ndarray,  # H×W binary mask
    "bbox": [x, y, w, h],        # Bounding box
    "area": int,                 # Pixel count
    "predicted_iou": float,      # 0-1 quality score
    "stability_score": float,    # 0-1 robustness score
    "crop_box": [x, y, w, h],    # Generation crop region
    "point_coords": [[x, y]],    # Input point
}

COCO RLE format

from pycocotools import mask as mask_utils

# Encode mask to RLE
rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
rle["counts"] = rle["counts"].decode("utf-8")

# Decode RLE to mask
decoded_mask = mask_utils.decode(rle)

Performance optimization

GPU memory

# Use smaller model for limited VRAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Process images in batches
# Clear CUDA cache between large batches
torch.cuda.empty_cache()

Speed optimization

# Use half precision
sam = sam.half()

# Reduce points for automatic generation
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=16,  # Default is 32
)

# Use ONNX for deployment
# Export with --return-single-mask for faster inference

Common issues

Issue	Solution
Out of memory	Use ViT-B model, reduce image size
Slow inference	Use ViT-B, reduce points_per_side
Poor mask quality	Try different prompts, use box + points
Edge artifacts	Use stability_score filtering
Small objects missed	Increase points_per_side

References

[Advanced Usage](references/advanced-usage.md) - Batching, fine-tuning, integration
[Troubleshooting](references/troubleshooting.md) - Common issues and solutions

Resources

GitHub: https://github.com/facebookresearch/segment-anything
Paper: https://arxiv.org/abs/2304.02643
Demo: https://segment-anything.com
SAM 2 (Video): https://github.com/facebookresearch/segment-anything-2
HuggingFace: https://huggingface.co/facebook/sam-vit-huge

Segment Anything Advanced Usage Guide

SAM 2 (Video Segmentation)

Overview

SAM 2 extends SAM to video segmentation with streaming memory architecture:

pip install git+https://github.com/facebookresearch/segment-anything-2.git

Video segmentation

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

# Initialize with video
predictor.init_state(video_path="video.mp4")

# Add prompt on first frame
predictor.add_new_points(
    frame_idx=0,
    obj_id=1,
    points=[[100, 200]],
    labels=[1]
)

# Propagate through video
for frame_idx, masks in predictor.propagate_in_video():
    # masks contains segmentation for all tracked objects
    process_frame(frame_idx, masks)

SAM 2 vs SAM comparison

Feature	SAM	SAM 2
Input	Images only	Images + Videos
Architecture	ViT + Decoder	Hiera + Memory
Memory	Per-image	Streaming memory bank
Tracking	No	Yes, across frames
Models	ViT-B/L/H	Hiera-T/S/B+/L

Grounded SAM (Text-Prompted Segmentation)

Setup

pip install groundingdino-py
pip install git+https://github.com/facebookresearch/segment-anything.git

Text-to-mask pipeline

from groundingdino.util.inference import load_model, predict
from segment_anything import sam_model_registry, SamPredictor
import cv2

# Load Grounding DINO
grounding_model = load_model("groundingdino_swint_ogc.pth", "GroundingDINO_SwinT_OGC.py")

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def text_to_mask(image, text_prompt, box_threshold=0.3, text_threshold=0.25):
    """Generate masks from text description."""
    # Get bounding boxes from text
    boxes, logits, phrases = predict(
        model=grounding_model,
        image=image,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    # Generate masks with SAM
    predictor.set_image(image)

    masks = []
    for box in boxes:
        # Convert normalized box to pixel coordinates
        h, w = image.shape[:2]
        box_pixels = box * np.array([w, h, w, h])

        mask, score, _ = predictor.predict(
            box=box_pixels,
            multimask_output=False
        )
        masks.append(mask[0])

    return masks, boxes, phrases

# Usage
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

masks, boxes, phrases = text_to_mask(image, "person . dog . car")

Batched Processing

Efficient multi-image processing

import torch
from segment_anything import SamPredictor, sam_model_registry

class BatchedSAM:
    def __init__(self, checkpoint, model_type="vit_h", device="cuda"):
        self.sam = sam_model_registry[model_type](checkpoint=checkpoint)
        self.sam.to(device)
        self.predictor = SamPredictor(self.sam)
        self.device = device

    def process_batch(self, images, prompts):
        """Process multiple images with corresponding prompts."""
        results = []

        for image, prompt in zip(images, prompts):
            self.predictor.set_image(image)

            if "point" in prompt:
                masks, scores, _ = self.predictor.predict(
                    point_coords=prompt["point"],
                    point_labels=prompt["label"],
                    multimask_output=True
                )
            elif "box" in prompt:
                masks, scores, _ = self.predictor.predict(
                    box=prompt["box"],
                    multimask_output=False
                )

            results.append({
                "masks": masks,
                "scores": scores,
                "best_mask": masks[np.argmax(scores)]
            })

        return results

# Usage
batch_sam = BatchedSAM("sam_vit_h_4b8939.pth")

images = [cv2.imread(f"image_{i}.jpg") for i in range(10)]
prompts = [{"point": np.array([[100, 100]]), "label": np.array([1])} for _ in range(10)]

results = batch_sam.process_batch(images, prompts)

Parallel automatic mask generation

from concurrent.futures import ThreadPoolExecutor
from segment_anything import SamAutomaticMaskGenerator

def generate_masks_parallel(images, num_workers=4):
    """Generate masks for multiple images in parallel."""
    # Note: Each worker needs its own model instance
    def worker_init():
        sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
        return SamAutomaticMaskGenerator(sam)

    generators = [worker_init() for _ in range(num_workers)]

    def process_image(args):
        idx, image = args
        generator = generators[idx % num_workers]
        return generator.generate(image)

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        results = list(executor.map(process_image, enumerate(images)))

    return results

Custom Integration

FastAPI service

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
import numpy as np
import cv2
import io

app = FastAPI()

# Load model once
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

class PointPrompt(BaseModel):
    x: int
    y: int
    label: int = 1

@app.post("/segment/point")
async def segment_with_point(
    file: UploadFile = File(...),
    points: list[PointPrompt] = []
):
    # Read image
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Set image
    predictor.set_image(image)

    # Prepare prompts
    point_coords = np.array([[p.x, p.y] for p in points])
    point_labels = np.array([p.label for p in points])

    # Generate masks
    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True
    )

    best_idx = np.argmax(scores)

    return {
        "mask": masks[best_idx].tolist(),
        "score": float(scores[best_idx]),
        "all_scores": scores.tolist()
    }

@app.post("/segment/auto")
async def segment_automatic(file: UploadFile = File(...)):
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    mask_generator = SamAutomaticMaskGenerator(sam)
    masks = mask_generator.generate(image)

    return {
        "num_masks": len(masks),
        "masks": [
            {
                "bbox": m["bbox"],
                "area": m["area"],
                "predicted_iou": m["predicted_iou"],
                "stability_score": m["stability_score"]
            }
            for m in masks
        ]
    }

Gradio interface

import gradio as gr
import numpy as np

# Load model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def segment_image(image, evt: gr.SelectData):
    """Segment object at clicked point."""
    predictor.set_image(image)

    point = np.array([[evt.index[0], evt.index[1]]])
    label = np.array([1])

    masks, scores, _ = predictor.predict(
        point_coords=point,
        point_labels=label,
        multimask_output=True
    )

    best_mask = masks[np.argmax(scores)]

    # Overlay mask on image
    overlay = image.copy()
    overlay[best_mask] = overlay[best_mask] * 0.5 + np.array([255, 0, 0]) * 0.5

    return overlay

with gr.Blocks() as demo:
    gr.Markdown("# SAM Interactive Segmentation")
    gr.Markdown("Click on an object to segment it")

    with gr.Row():
        input_image = gr.Image(label="Input Image", interactive=True)
        output_image = gr.Image(label="Segmented Image")

    input_image.select(segment_image, inputs=[input_image], outputs=[output_image])

demo.launch()

Fine-Tuning SAM

LoRA fine-tuning (experimental)

from peft import LoraConfig, get_peft_model
from transformers import SamModel

# Load model
model = SamModel.from_pretrained("facebook/sam-vit-base")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["qkv"],  # Attention layers
    lora_dropout=0.1,
    bias="none",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Training loop (simplified)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    outputs = model(
        pixel_values=batch["pixel_values"],
        input_points=batch["input_points"],
        input_labels=batch["input_labels"]
    )

    # Custom loss (e.g., IoU loss with ground truth)
    loss = compute_loss(outputs.pred_masks, batch["gt_masks"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

MedSAM (Medical imaging)

# MedSAM is a fine-tuned SAM for medical images
# https://github.com/bowang-lab/MedSAM

from segment_anything import sam_model_registry, SamPredictor
import torch

# Load MedSAM checkpoint
medsam = sam_model_registry["vit_b"](checkpoint="medsam_vit_b.pth")
medsam.to("cuda")

predictor = SamPredictor(medsam)

# Process medical image
# Convert grayscale to RGB if needed
medical_image = cv2.imread("ct_scan.png", cv2.IMREAD_GRAYSCALE)
rgb_image = np.stack([medical_image] * 3, axis=-1)

predictor.set_image(rgb_image)

# Segment with box prompt (common for medical imaging)
masks, scores, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),
    multimask_output=False
)

Advanced Mask Processing

Mask refinement

import cv2
from scipy import ndimage

def refine_mask(mask, kernel_size=5, iterations=2):
    """Refine mask with morphological operations."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))

    # Close small holes
    closed = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel, iterations=iterations)

    # Remove small noise
    opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel, iterations=iterations)

    return opened.astype(bool)

def fill_holes(mask):
    """Fill holes in mask."""
    filled = ndimage.binary_fill_holes(mask)
    return filled

def remove_small_regions(mask, min_area=100):
    """Remove small disconnected regions."""
    labeled, num_features = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, range(1, num_features + 1))

    # Keep only regions larger than min_area
    mask_clean = np.zeros_like(mask)
    for i, size in enumerate(sizes, 1):
        if size >= min_area:
            mask_clean[labeled == i] = True

    return mask_clean

Mask to polygon conversion

import cv2

def mask_to_polygons(mask, epsilon_factor=0.01):
    """Convert binary mask to polygon coordinates."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )

    polygons = []
    for contour in contours:
        epsilon = epsilon_factor * cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, epsilon, True)
        polygon = approx.squeeze().tolist()
        if len(polygon) >= 3:  # Valid polygon
            polygons.append(polygon)

    return polygons

def polygons_to_mask(polygons, height, width):
    """Convert polygons back to binary mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for polygon in polygons:
        pts = np.array(polygon, dtype=np.int32)
        cv2.fillPoly(mask, [pts], 1)
    return mask.astype(bool)

Multi-scale segmentation

def multiscale_segment(image, predictor, point, scales=[0.5, 1.0, 2.0]):
    """Generate masks at multiple scales and combine."""
    h, w = image.shape[:2]
    masks_all = []

    for scale in scales:
        # Resize image
        new_h, new_w = int(h * scale), int(w * scale)
        scaled_image = cv2.resize(image, (new_w, new_h))
        scaled_point = (point * scale).astype(int)

        # Segment
        predictor.set_image(scaled_image)
        masks, scores, _ = predictor.predict(
            point_coords=scaled_point.reshape(1, 2),
            point_labels=np.array([1]),
            multimask_output=True
        )

        # Resize mask back
        best_mask = masks[np.argmax(scores)]
        original_mask = cv2.resize(best_mask.astype(np.uint8), (w, h)) > 0.5

        masks_all.append(original_mask)

    # Combine masks (majority voting)
    combined = np.stack(masks_all, axis=0)
    final_mask = np.sum(combined, axis=0) >= len(scales) // 2 + 1

    return final_mask

Performance Optimization

TensorRT acceleration

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

def export_to_tensorrt(onnx_path, engine_path, fp16=True):
    """Convert ONNX model to TensorRT engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1GB

    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    engine = builder.build_engine(network, config)

    with open(engine_path, 'wb') as f:
        f.write(engine.serialize())

    return engine

Memory-efficient inference

class MemoryEfficientSAM:
    def __init__(self, checkpoint, model_type="vit_b"):
        self.sam = sam_model_registry[model_type](checkpoint=checkpoint)
        self.sam.eval()
        self.predictor = None

    def __enter__(self):
        self.sam.to("cuda")
        self.predictor = SamPredictor(self.sam)
        return self

    def __exit__(self, *args):
        self.sam.to("cpu")
        torch.cuda.empty_cache()

    def segment(self, image, points, labels):
        self.predictor.set_image(image)
        masks, scores, _ = self.predictor.predict(
            point_coords=points,
            point_labels=labels,
            multimask_output=True
        )
        return masks, scores

# Usage with context manager (auto-cleanup)
with MemoryEfficientSAM("sam_vit_b_01ec64.pth") as sam:
    masks, scores = sam.segment(image, points, labels)
# CUDA memory freed automatically

Dataset Generation

Create segmentation dataset

import json

def generate_dataset(images_dir, output_dir, mask_generator):
    """Generate segmentation dataset from images."""
    annotations = []

    for img_path in Path(images_dir).glob("*.jpg"):
        image = cv2.imread(str(img_path))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # Generate masks
        masks = mask_generator.generate(image)

        # Filter high-quality masks
        good_masks = [m for m in masks if m["predicted_iou"] > 0.9]

        # Save annotations
        for i, mask_data in enumerate(good_masks):
            annotation = {
                "image_id": img_path.stem,
                "mask_id": i,
                "bbox": mask_data["bbox"],
                "area": mask_data["area"],
                "segmentation": mask_to_rle(mask_data["segmentation"]),
                "predicted_iou": mask_data["predicted_iou"],
                "stability_score": mask_data["stability_score"]
            }
            annotations.append(annotation)

    # Save dataset
    with open(output_dir / "annotations.json", "w") as f:
        json.dump(annotations, f)

    return annotations

Segment Anything Troubleshooting Guide

Installation Issues

CUDA not available

Error: RuntimeError: CUDA not available

Solutions:

# Check CUDA availability
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)

# Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# If CUDA works but SAM doesn't use it
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")  # Explicitly move to GPU

Import errors

Error: ModuleNotFoundError: No module named 'segment_anything'

Solutions:

# Install from GitHub
pip install git+https://github.com/facebookresearch/segment-anything.git

# Or clone and install
git clone https://github.com/facebookresearch/segment-anything.git
cd segment-anything
pip install -e .

# Verify installation
python -c "from segment_anything import sam_model_registry; print('OK')"

Missing dependencies

Error: ModuleNotFoundError: No module named 'cv2' or similar

Solutions:

# Install all optional dependencies
pip install opencv-python pycocotools matplotlib onnxruntime onnx

# For pycocotools on Windows
pip install pycocotools-windows

Model Loading Issues

Checkpoint not found

Error: FileNotFoundError: checkpoint file not found

Solutions:

# Download correct checkpoint
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# Verify file integrity
md5sum sam_vit_h_4b8939.pth
# Expected: a7bf3b02f3ebf1267aba913ff637d9a2

# Use absolute path
sam = sam_model_registry["vit_h"](checkpoint="/full/path/to/sam_vit_h_4b8939.pth")

Model type mismatch

Error: KeyError: 'unexpected key in state_dict'

Solutions:

# Ensure model type matches checkpoint
# vit_h checkpoint → vit_h model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# vit_l checkpoint → vit_l model
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")

# vit_b checkpoint → vit_b model
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

Out of memory during load

Error: CUDA out of memory during model loading

Solutions:

# Use smaller model
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Load to CPU first, then move
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cpu")
torch.cuda.empty_cache()
sam.to("cuda")

# Use half precision
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam = sam.half()
sam.to("cuda")

Inference Issues

Image format errors

Error: ValueError: expected input to have 3 channels

Solutions:

import cv2

# Ensure RGB format
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # BGR to RGB

# Convert grayscale to RGB
if len(image.shape) == 2:
    image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)

# Handle RGBA
if image.shape[2] == 4:
    image = image[:, :, :3]  # Drop alpha channel

Coordinate errors

Error: IndexError: index out of bounds or incorrect mask location

Solutions:

# Ensure points are (x, y) not (row, col)
# x = column index, y = row index
point = np.array([[x, y]])  # Correct

# Verify coordinates are within image bounds
h, w = image.shape[:2]
assert 0 <= x < w and 0 <= y < h, "Point outside image"

# For bounding boxes: [x1, y1, x2, y2]
box = np.array([x1, y1, x2, y2])
assert x1 < x2 and y1 < y2, "Invalid box coordinates"

Empty or incorrect masks

Problem: Masks don't match expected object

Solutions:

# Try multiple prompts
input_points = np.array([[x1, y1], [x2, y2]])
input_labels = np.array([1, 1])  # Multiple foreground points

# Add background points
input_points = np.array([[obj_x, obj_y], [bg_x, bg_y]])
input_labels = np.array([1, 0])  # 1=foreground, 0=background

# Use box prompt for large objects
box = np.array([x1, y1, x2, y2])
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# Combine box and point
masks, scores, _ = predictor.predict(
    point_coords=np.array([[center_x, center_y]]),
    point_labels=np.array([1]),
    box=np.array([x1, y1, x2, y2]),
    multimask_output=True
)

# Check scores and select best
print(f"Scores: {scores}")
best_mask = masks[np.argmax(scores)]

Slow inference

Problem: Prediction takes too long

Solutions:

# Use smaller model
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Reuse image embeddings
predictor.set_image(image)  # Compute once
for point in points:
    masks, _, _ = predictor.predict(...)  # Fast, reuses embeddings

# Reduce automatic generation points
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=16,  # Default is 32
)

# Use ONNX for deployment
# Export: python scripts/export_onnx_model.py --return-single-mask

Automatic Mask Generation Issues

Too many masks

Problem: Generating thousands of overlapping masks

Solutions:

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=16,          # Reduce from 32
    pred_iou_thresh=0.92,        # Increase from 0.88
    stability_score_thresh=0.98,  # Increase from 0.95
    box_nms_thresh=0.5,          # More aggressive NMS
    min_mask_region_area=500,    # Remove small masks
)

Too few masks

Problem: Missing objects in automatic generation

Solutions:

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=64,          # Increase density
    pred_iou_thresh=0.80,        # Lower threshold
    stability_score_thresh=0.85,  # Lower threshold
    crop_n_layers=2,             # Add multi-scale
    min_mask_region_area=0,      # Keep all masks
)

Small objects missed

Problem: Automatic generation misses small objects

Solutions:

# Use crop layers for multi-scale detection
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    crop_n_layers=2,
    crop_n_points_downscale_factor=1,  # Don't reduce points in crops
    min_mask_region_area=10,  # Very small minimum
)

# Or process image patches
def segment_with_patches(image, patch_size=512, overlap=64):
    h, w = image.shape[:2]
    all_masks = []

    for y in range(0, h, patch_size - overlap):
        for x in range(0, w, patch_size - overlap):
            patch = image[y:y+patch_size, x:x+patch_size]
            masks = mask_generator.generate(patch)

            # Offset masks to original coordinates
            for m in masks:
                m['bbox'][0] += x
                m['bbox'][1] += y
                # Offset segmentation mask too

            all_masks.extend(masks)

    return all_masks

Memory Issues

CUDA out of memory

Error: torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

# Use smaller model
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Clear cache between images
torch.cuda.empty_cache()

# Process images sequentially, not batched
for image in images:
    predictor.set_image(image)
    masks, _, _ = predictor.predict(...)
    torch.cuda.empty_cache()

# Reduce image size
max_size = 1024
h, w = image.shape[:2]
if max(h, w) > max_size:
    scale = max_size / max(h, w)
    image = cv2.resize(image, (int(w*scale), int(h*scale)))

# Use CPU for large batch processing
sam.to("cpu")

RAM out of memory

Problem: System runs out of RAM

Solutions:

# Process images one at a time
for img_path in image_paths:
    image = cv2.imread(img_path)
    masks = process_image(image)
    save_results(masks)
    del image, masks
    gc.collect()

# Use generators instead of lists
def generate_masks_lazy(image_paths):
    for path in image_paths:
        image = cv2.imread(path)
        masks = mask_generator.generate(image)
        yield path, masks

ONNX Export Issues

Export fails

Error: Various export errors

Solutions:

# Install correct ONNX version
pip install onnx==1.14.0 onnxruntime==1.15.0

# Use correct opset version
python scripts/export_onnx_model.py \
    --checkpoint sam_vit_h_4b8939.pth \
    --model-type vit_h \
    --output sam.onnx \
    --opset 17

ONNX runtime errors

Error: ONNXRuntimeError during inference

Solutions:

import onnxruntime

# Check available providers
print(onnxruntime.get_available_providers())

# Use CPU provider if GPU fails
session = onnxruntime.InferenceSession(
    "sam.onnx",
    providers=['CPUExecutionProvider']
)

# Verify input shapes
for input in session.get_inputs():
    print(f"{input.name}: {input.shape}")

HuggingFace Integration Issues

Processor errors

Error: Issues with SamProcessor

Solutions:

from transformers import SamModel, SamProcessor

# Use matching processor and model
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

# Ensure input format
input_points = [[[x, y]]]  # Nested list for batch dimension
inputs = processor(image, input_points=input_points, return_tensors="pt")

# Post-process correctly
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

Quality Issues

Jagged mask edges

Problem: Masks have rough, pixelated edges

Solutions:

import cv2
from scipy import ndimage

def smooth_mask(mask, sigma=2):
    """Smooth mask edges."""
    # Gaussian blur
    smooth = ndimage.gaussian_filter(mask.astype(float), sigma=sigma)
    return smooth > 0.5

def refine_edges(mask, kernel_size=5):
    """Refine mask edges with morphological operations."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    # Close small gaps
    closed = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    # Open to remove noise
    opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)
    return opened.astype(bool)

Incomplete segmentation

Problem: Mask doesn't cover entire object

Solutions:

# Add multiple points
input_points = np.array([
    [obj_center_x, obj_center_y],
    [obj_left_x, obj_center_y],
    [obj_right_x, obj_center_y],
    [obj_center_x, obj_top_y],
    [obj_center_x, obj_bottom_y]
])
input_labels = np.array([1, 1, 1, 1, 1])

# Use bounding box
masks, _, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),
    multimask_output=False
)

# Iterative refinement
mask_input = None
for point in points:
    masks, scores, logits = predictor.predict(
        point_coords=point.reshape(1, 2),
        point_labels=np.array([1]),
        mask_input=mask_input,
        multimask_output=False
    )
    mask_input = logits

Common Error Messages

Error	Cause	Solution
`CUDA out of memory`	GPU memory full	Use smaller model, clear cache
`expected 3 channels`	Wrong image format	Convert to RGB
`index out of bounds`	Invalid coordinates	Check point/box bounds
`checkpoint not found`	Wrong path	Use absolute path
`unexpected key`	Model/checkpoint mismatch	Match model type
`invalid box coordinates`	x1 > x2 or y1 > y2	Fix box format

Getting Help

1. GitHub Issues: https://github.com/facebookresearch/segment-anything/issues 2. HuggingFace Forums: https://discuss.huggingface.co 3. Paper: https://arxiv.org/abs/2304.02643

Reporting Issues

Include:

Python version
PyTorch version: python -c "import torch; print(torch.__version__)"
CUDA version: python -c "import torch; print(torch.version.cuda)"
SAM model type (vit_b/l/h)
Full error traceback
Minimal reproducible code

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Ai Research ReproductionOrchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs.164k507

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

How it compares

Use when you need open-source Meta SAM integration rather than proprietary cloud segmentation APIs.

FAQ

How does segment-anything-model handle video?

segment-anything-model documents SAM 2 with build_sam2_video_predictor, init_state on a video path, add_new_points on frame 0, and propagate_in_video to stream masks across frames using the sam2_hiera_large checkpoint.

What install command does SAM 2 need?

segment-anything-model specifies pip install git+https://github.com/facebookresearch/segment-anything-2.git before importing sam2.build_sam and constructing the video predictor with a YAML config and weights file.

Is Segment Anything Model safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLresearchautomation