
Segment Anything Model
Implement Meta SAM, SAM 2 video tracking, and Grounded SAM text-to-mask pipelines when you add segmentation to a CV or agent vision feature.
Overview
Segment Anything Model is an agent skill for the Build phase that guides advanced SAM, SAM 2 video segmentation, and Grounded SAM integration in Python.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill segment-anything-model
What is this skill?
- SAM 2 video path: build_sam2_video_predictor, init_state, point prompts, and propagate_in_video loop
- Side-by-side SAM vs SAM 2 table covering memory bank, tracking, and Hiera model sizes
- Grounded SAM pipeline: Grounding DINO load_model/predict plus SamPredictor for text-prompted masks
- Concrete dependency commands for segment-anything-2 git install and groundingdino-py
- Advanced usage focus beyond hello-world—video propagation and text-to-mask composition
- SAM 2 model family includes Hiera-T/S/B+/L sizes
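One detail behind the Grounded SAM composition listed above: Grounding DINO's `predict` returns boxes as normalized (center-x, center-y, width, height), while `SamPredictor.predict` takes a pixel-space `[x0, y0, x1, y1]` box. The following is a minimal sketch of that conversion in plain Python. The helper name `cxcywh_to_xyxy_pixels` is hypothetical and not part of either library, and clamping to the image bounds is an added assumption.

```python
def cxcywh_to_xyxy_pixels(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1).

    Hypothetical helper for illustration; not part of Grounding DINO or SAM.
    """
    cx, cy, w, h = box
    # Scale the normalized values to pixel units.
    cx, w = cx * image_width, w * image_width
    cy, h = cy * image_height, h * image_height
    # Move from center/size to corner coordinates.
    x0, y0 = cx - w / 2, cy - h / 2
    x1, y1 = cx + w / 2, cy + h / 2
    # Clamp to the image so the box prompt never leaves the frame (assumption).
    return (
        max(0.0, x0),
        max(0.0, y0),
        min(float(image_width), x1),
        min(float(image_height), y1),
    )
```

The returned tuple can be wrapped in `numpy.array(...)` before being passed as the `box` argument.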
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need production-style segmentation or video masks but only have fragmented install notes and no clear predictor API flow for SAM 2 or text-grounded masks.
Who is it for?
Indie ML builders adding open-vocabulary or interactive segmentation to a Python backend or internal research prototype.
Skip if: You only need a hosted vision API and will not run local GPU weights or build a custom CV pipeline.
When should I use this skill?
Use when implementing Segment Anything, SAM 2 video segmentation, or Grounded SAM text-prompted masks in a Python project.
What do I get? / Deliverables
You can stand up SAM 2 video propagation or Grounded SAM inference patterns with the documented install stack and predictor calls wired into your app.
- Working predictor initialization code
- Video mask propagation loop or text-to-mask pipeline sketch
- Dependency install commands for SAM2 and Grounding DINO
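Before wiring predictors into an app, it can help to confirm that the install commands actually produced importable packages. The sketch below uses only the standard library. The list of top-level module names (`torch`, `sam2`, `segment_anything`, `groundingdino`) is an assumption based on the imports used in this guide, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed top-level import names for the stack described in this guide.
REQUIRED_MODULES = ["torch", "sam2", "segment_anything", "groundingdino"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All segmentation dependencies are importable.")
```

This only checks that each package can be located. It does not import them, load weights, or confirm GPU availability.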
Journey fit
Build is where segmentation models get wired into products; this guide is integration-oriented code and setup, not market research. The Integrations category fits its scope: pip installs, predictor APIs, and composing Grounding DINO with SAM predictors in application code.
How it compares
Hands-on SAM/SAM2 integration reference—not a no-code labeling SaaS or a generic image-generation skill.
Common Questions / FAQ
Who is segment-anything-model for?
Developers integrating Facebook Research segmentation models into Python apps, especially video tracking and text-prompted masks.
When should I use segment-anything-model?
During Build-phase integration work: when implementing SAM 2 video segmentation, comparing SAM vs SAM 2, or composing Grounding DINO with SAM predictors.
Is segment-anything-model safe to install?
The skill pulls third-party Git and pip dependencies; review the Security Audits panel on this page and vet model checkpoints and GPU runtimes before production deploy.
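One way to vet a downloaded checkpoint is to compare its SHA-256 digest with a value obtained from a source you trust. The sketch below assumes you already have such a reference digest; none is given here, and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_matches(path, expected_sha256):
    """Return True if the file's digest equals the expected hex digest (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.strip().lower())
```

A call such as `checkpoint_matches("sam_vit_h_4b8939.pth", expected)` would run before loading the weights, with `expected` supplied by you.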
SKILL.md
# Segment Anything Advanced Usage Guide

## SAM 2 (Video Segmentation)

### Overview

SAM 2 extends SAM to video segmentation with a streaming memory architecture:

```bash
pip install git+https://github.com/facebookresearch/segment-anything-2.git
```

### Video segmentation

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

# Initialize with video; init_state returns the inference state that later calls need.
# Depending on the installed version, video_path may need to be a directory of
# JPEG frames rather than an .mp4 file.
inference_state = predictor.init_state(video_path="video.mp4")

# Add prompt on first frame (label 1 = foreground click)
predictor.add_new_points(
    inference_state=inference_state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[100, 200]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate through video
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
    # mask_logits holds one logit map per tracked object; threshold at 0 for binary masks
    masks = (mask_logits > 0.0).cpu().numpy()
    process_frame(frame_idx, masks)  # process_frame is your own per-frame handler
```

### SAM 2 vs SAM comparison

| Feature | SAM | SAM 2 |
|---------|-----|-------|
| Input | Images only | Images + Videos |
| Architecture | ViT + Decoder | Hiera + Memory |
| Memory | Per-image | Streaming memory bank |
| Tracking | No | Yes, across frames |
| Models | ViT-B/L/H | Hiera-T/S/B+/L |

## Grounded SAM (Text-Prompted Segmentation)

### Setup

```bash
pip install groundingdino-py
pip install git+https://github.com/facebookresearch/segment-anything.git
```

### Text-to-mask pipeline

```python
import numpy as np
from groundingdino.util.inference import load_image, load_model, predict
from segment_anything import sam_model_registry, SamPredictor

# Load Grounding DINO (config path first, then checkpoint path)
grounding_model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")

# Load SAM
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def text_to_mask(image_path, text_prompt, box_threshold=0.3, text_threshold=0.25):
    """Generate masks from text description."""
    # load_image returns the RGB array (for SAM) and the transformed tensor (for Grounding DINO)
    image_source, image_tensor = load_image(image_path)

    # Get bounding boxes from text
    boxes, logits, phrases = predict(
        model=grounding_model,
        image=image_tensor,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold
    )

    # Generate masks with SAM
    predictor.set_image(image_source)
    h, w = image_source.shape[:2]
    masks = []
    for box in boxes:
        # Boxes are normalized (cx, cy, w, h); SAM expects pixel (x0, y0, x1, y1)
        cx, cy, bw, bh = box.numpy() * np.array([w, h, w, h])
        box_pixels = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
        mask, score, _ = predictor.predict(
            box=box_pixels,
            multimask_output=False
        )
        masks.append(mask[0])
    return masks, boxes, phrases

# Usage
masks, boxes, phrases = text_to_mask("image.jpg", "person . dog . car")
```

## Batched Processing

### Efficient multi-image processing

```python
import cv2
import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

class BatchedSAM:
    def __init__(self, checkpoint, model_type="vit_h", device="cuda"):
        self.sam = sam_model_registry[model_type](checkpoint=checkpoint)
        self.sam.to(device)
        self.predictor = SamPredictor(self.sam)
        self.device = device

    def process_batch(self, images, prompts):
        """Process multiple images with corresponding prompts."""
        results = []
        for image, prompt in zip(images, prompts):
            self.predictor.set_image(image)
            if "point" in prompt:
                masks, scores, _ = self.predictor.predict(
                    point_coords=prompt["point"],
                    point_labels=prompt["label"],
                    multimask_output=True
                )
            elif "box" in prompt:
                masks, scores, _ = self.predictor.predict(
                    box=prompt["box"],
                    multimask_output=False
                )
            else:
                raise ValueError("prompt must contain a 'point' or 'box' key")
            results.append({
                "masks": masks,
                "scores": scores,
                "best_mask": masks[np.argmax(scores)]
            })
        return results

# Usage
batch_sam = BatchedSAM("sam_vit_h_4b8939.pth")
images = [cv2.imread(f"image_{i}.jpg") for i in range(