Clip

Name: Clip
Author: orchestra-research

orchestra-research/ai-research-skills

431 installs
11.2k repo stars
Updated June 16, 2026

clip is a Claude Code skill that adds OpenAI CLIP zero-shot image classification and semantic image search to Python products using copy-paste PyTorch workflows and ViT-B/32 embeddings.

About

clip is an Orchestra Research skill (version 1.0.0, MIT license) for integrating OpenAI's CLIP vision-language model without fine-tuning. CLIP was trained on 400M image-text pairs. The skill documents five model variants, from RN50 (102M parameters) through ViT-L/14 (428M parameters), and defaults to ViT-B/32 for a balance of speed and quality. Workflows cover zero-shot classification with labels passed to clip.tokenize, cosine-similarity image-text matching, semantic search over an index of image embeddings, batched similarity matrices (for example 10 images × 3 texts), and moderation categories such as NSFW or violent content. Performance notes cite image encoding at roughly 20 ms on a V100 GPU versus roughly 200 ms on CPU, and the skill shows ChromaDB integration for vector storage. Use clip for general-purpose image understanding and search. It is not suited to fine-grained detection, and the skill points to LLaVA for vision-language chat and to Segment Anything for segmentation.

Zero-shot image classification with text prompts and softmax probabilities
Semantic image search by embedding a database and scoring text queries
ViT-B/32 load pattern with preprocess, encode_image, and encode_text
L2-normalized feature vectors for cosine-style similarity search
Practical Python snippets for indexing image paths and ranking matches

Clip by the numbers

431 all-time installs (skills.sh)
+30 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #474 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill clip


Installs	431
repo stars	★ 11.2k
Security audit	3 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you add zero-shot image search with CLIP?

Add CLIP-powered zero-shot classification and semantic image search to a product or research prototype with copy-paste PyTorch workflows.

Who is it for?

ML engineers adding zero-shot image classification, semantic image search, or content moderation to Python prototypes without training custom vision models.

Skip if: you need fine-grained object detection, conversational vision (LLaVA), image segmentation (SAM), or production fine-tuned classifiers.

When should I use this skill?

User asks for CLIP zero-shot classification, image-text similarity, semantic image search, or vision-language moderation in PyTorch.

What you get

PyTorch CLIP inference scripts, normalized image/text embeddings, similarity-ranked labels or search results, and optional Chroma/FAISS vector index integration.

CLIP inference scripts
Embedding indexes
Zero-shot classification outputs

By the numbers

Trained on 400M image-text pairs per skill documentation
Documents 5 CLIP model variants from RN50 to ViT-L/14
Cites ~20ms GPU vs ~200ms CPU image encoding benchmarks

Files

SKILL.md (Markdown, GitHub ↗)

CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images from natural language.

When to use CLIP

Use when:

Zero-shot image classification (no training data needed)
Image-text similarity/matching
Semantic image search
Content moderation (detect NSFW, violence)
Visual question answering
Cross-modal retrieval (image→text, text→image)

Metrics:

25,300+ GitHub stars
Trained on 400M image-text pairs
Zero-shot accuracy on ImageNet matches a supervised ResNet-50
MIT License

Use alternatives instead:

BLIP-2: Better captioning
LLaVA: Vision-language chat
Segment Anything: Image segmentation

Quick start

Installation

pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm

Zero-shot classification

import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels (kept in one list so tokens and printout stay in sync)
labels = ["a dog", "a cat", "a bird", "a car"]
text = clip.tokenize(labels).to(device)

# Compute similarity
with torch.no_grad():
    # The forward pass encodes both inputs and returns cosine similarities
    # multiplied by the model's learned logit scale
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
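The probabilities in this workflow come from a softmax over scaled cosine similarities. The following is a minimal NumPy sketch of that step only. It uses made-up 4-dimensional vectors in place of real CLIP embeddings, the function name `zero_shot_probs` is hypothetical, and the logit scale of 100 is an assumption about the order of magnitude rather than a value read from the model.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N labels."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)   # shape (N,)
    logits = logits - logits.max()                   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy vectors standing in for CLIP embeddings (not real model output)
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # stands in for "a dog"
    [0.0, 1.0, 0.0, 0.0],   # stands in for "a cat"
    [0.0, 0.0, 1.0, 0.0],   # stands in for "a bird"
])
probs = zero_shot_probs(image_emb, text_embs)
print(probs.argmax())
```

Because the similarities are multiplied by a large scale before the softmax, small differences in cosine similarity turn into large differences in probability, which is why the top label often appears very confident.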

Available models

# Models (sorted by size)
models = [
    "RN50",           # ResNet-50
    "RN101",          # ResNet-101
    "ViT-B/32",       # Vision Transformer (recommended)
    "ViT-B/16",       # Better quality, slower
    "ViT-L/14",       # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")

Model	Parameters	Speed	Quality
RN50	102M	Fast	Good
ViT-B/32	151M	Medium	Better
ViT-L/14	428M	Slow	Best

Image-text similarity

# Compute embeddings for one image and one caption
caption = clip.tokenize(["a dog"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(caption)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() requires a single image-caption pair)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")

Semantic image search

# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
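The search step can also run on cached embeddings without PyTorch. The sketch below is one way to do that in NumPy, with random unit vectors standing in for CLIP embeddings; the helper name `search` is hypothetical. It clamps `k` to the collection size, since `torch.topk` raises an error when asked for more items than exist.

```python
import numpy as np

def search(query_emb, image_embs, paths, k=3):
    """Return the k best (path, score) pairs; inputs must be L2-normalized."""
    scores = image_embs @ query_emb
    k = min(k, len(paths))              # never ask for more results than images
    top = np.argsort(-scores)[:k]       # indices sorted by descending score
    return [(paths[i], float(scores[i])) for i in top]

# Random unit vectors standing in for CLIP image embeddings
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
paths = [f"img{i}.jpg" for i in range(5)]

# Query with a copy of one stored embedding so the best match is known
results = search(image_embs[3], image_embs, paths, k=10)
print(results[0][0])
```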

Content moderation

# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")

Batch processing

# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # torch.Size([10, 3])
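Stacking every image into one tensor works for ten files but may exhaust memory on a large collection. A minimal sketch of splitting the file list into fixed-size chunks follows; the helper name `batched` and the batch size of 4 are assumptions for illustration, and each chunk would then be preprocessed, stacked, and encoded as in the batch workflow.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

paths = [f"img{i}.jpg" for i in range(10)]
batches = list(batched(paths, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```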

Integration with vector databases

# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

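# Query with text (normalized the same way as the stored image embeddings)
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# text_embedding has shape (1, dim), so tolist() already yields a list of vectors
results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),
    n_results=min(5, len(image_paths))
)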
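Some vector stores rank by Euclidean distance unless configured otherwise. For L2-normalized embeddings this gives the same ordering as cosine similarity, because the squared distance between unit vectors equals 2 minus twice their dot product. The sketch below checks that identity in NumPy on random unit vectors, which stand in for CLIP embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)
db = rng.normal(size=(100, 16))
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = rng.normal(size=16)
query /= np.linalg.norm(query)

cosine = db @ query
sq_l2 = ((db - query) ** 2).sum(axis=1)

# Identity for unit vectors: squared distance = 2 - 2 * cosine similarity
assert np.allclose(sq_l2, 2 - 2 * cosine)

# Nearest by distance and highest by cosine therefore pick the same items
top_by_distance = np.argsort(sq_l2)[:5]
top_by_cosine = np.argsort(-cosine)[:5]
print(np.array_equal(top_by_distance, top_by_cosine))
```

This only holds when both the stored vectors and the query are normalized, which is one reason to normalize before inserting into the collection.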

Best practices

1. Use ViT-B/32 for most cases - Good balance
2. Normalize embeddings - Required for cosine similarity
3. Batch processing - More efficient
4. Cache embeddings - Expensive to recompute
5. Use descriptive labels - Better zero-shot performance
6. GPU recommended - 10-50× faster
7. Preprocess images - Use provided preprocess function
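Caching embeddings can be as simple as saving the array to disk and reloading it on the next run. This is a minimal sketch using NumPy's `.npy` format; `load_or_compute` and `fake_encoder` are hypothetical names, and the encoder here returns placeholder values rather than real CLIP output.

```python
import os
import tempfile
import numpy as np

def load_or_compute(cache_path, compute_fn):
    """Load embeddings from `cache_path`, or compute and save them on a miss."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeddings = compute_fn()
    np.save(cache_path, embeddings)
    return embeddings

calls = []

def fake_encoder():
    # Placeholder for a real CLIP encoding pass over the image collection
    calls.append(1)
    return np.ones((3, 512), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "image_embeddings.npy")
    first = load_or_compute(cache_path, fake_encoder)    # computes and saves
    second = load_or_compute(cache_path, fake_encoder)   # loads from disk

print(len(calls))  # 1
```

A real cache would also need to be invalidated when the image set or the model variant changes, since embeddings from different CLIP models are not interchangeable.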

Performance

Operation	CPU	GPU (V100)
Image encoding	~200ms	~20ms
Text encoding	~50ms	~5ms
Similarity compute	<1ms	<1ms
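To see what these per-image figures mean for a whole collection, here is a short back-of-the-envelope calculation. The collection size of 10,000 images is hypothetical, and the estimate assumes images are encoded one at a time at the table's approximate latencies, ignoring disk I/O and batching gains.

```python
def indexing_seconds(num_images, ms_per_image):
    """Total encoding time in seconds when images are processed one at a time."""
    return num_images * ms_per_image / 1000

num_images = 10_000  # hypothetical collection size
cpu = indexing_seconds(num_images, 200)  # ~200 ms per image on CPU
gpu = indexing_seconds(num_images, 20)   # ~20 ms per image on a V100
print(f"CPU: {cpu / 60:.1f} min, GPU: {gpu / 60:.1f} min")
# CPU: 33.3 min, GPU: 3.3 min
```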

Limitations

1. Not for fine-grained tasks - Best for broad categories
2. Requires descriptive text - Vague labels perform poorly
3. Biased on web data - May have dataset biases
4. No bounding boxes - Whole image only
5. Limited spatial understanding - Position/counting weak

Resources

GitHub: https://github.com/openai/CLIP ⭐ 25,300+
Paper: https://arxiv.org/abs/2103.00020
Colab: https://colab.research.google.com/github/openai/clip/
License: MIT

CLIP Applications Guide

Practical applications and use cases for CLIP.

Zero-shot image classification

import torch
import clip
from PIL import Image

# Load the model and keep all inputs on the same device
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Define categories
categories = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a bird",
    "a photo of a car",
    "a photo of a person"
]

# Prepare image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(categories).to(device)

# Classify
with torch.no_grad():
    # The forward pass encodes image and text and returns scaled similarities
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for category, prob in zip(categories, probs[0]):
    print(f"{category}: {prob:.2%}")
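When the class names live in a plain list, the descriptive prompts can be generated rather than typed out. A small sketch, assuming the "a photo of a {}" wording used in the categories list; the helper name `build_prompts` is hypothetical, and its output would be passed to clip.tokenize.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into descriptive prompts for clip.tokenize."""
    return [template.format(name) for name in class_names]

class_names = ["dog", "cat", "bird"]
prompts = build_prompts(class_names)
print(prompts)  # ['a photo of a dog', 'a photo of a cat', 'a photo of a bird']
```

Keeping the template as a parameter makes it easy to try other wordings, such as "a photo of an {}", and compare them on a validation set.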

Semantic image search

# Index images
image_database = []
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

for img_path in image_paths:
    # Inputs must live on the same device as the model
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
        features /= features.norm(dim=-1, keepdim=True)
    image_database.append((img_path, features))

# Search with text
query = "a sunset over mountains"
text_input = clip.tokenize([query]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_input)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Find matches
similarities = []
for img_path, img_features in image_database:
    similarity = (text_features @ img_features.T).item()
    similarities.append((img_path, similarity))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
for img_path, score in similarities[:3]:
    print(f"{img_path}: {score:.3f}")

Content moderation

# Define safety categories
categories = [
    "safe for work content",
    "not safe for work content",
    "violent or graphic content",
    "hate speech or offensive content",
    "spam or misleading content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits, _ = model(image, text)
    probs = logits.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
confidence = probs[0, max_idx].item()

if confidence > 0.7:
    print(f"Classified as: {categories[max_idx]} ({confidence:.2%})")
else:
    print(f"Uncertain classification (confidence: {confidence:.2%})")

Image-to-text retrieval

# Text database
captions = [
    "A beautiful sunset over the ocean",
    "A cute dog playing in the park",
    "A modern city skyline at night",
    "A delicious pizza with toppings"
]

# Encode all captions in one batch, on the same device as the model
caption_tokens = clip.tokenize(captions).to(device)
with torch.no_grad():
    caption_features = model.encode_text(caption_tokens)
    caption_features /= caption_features.norm(dim=-1, keepdim=True)

# Find matching captions for image
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

similarities = (image_features @ caption_features.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{captions[idx]}: {score:.3f}")

Visual question answering

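# Answer simple questions by contrasting two competing descriptions
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Each pair lists two alternatives; a softmax over all six statements at once
# would rarely push any single one above 0.5
pairs = [
    ("a photo showing people", "a photo showing animals"),
    ("a photo taken indoors", "a photo taken outdoors"),
    ("a photo taken during daytime", "a photo taken at night"),
]

text = clip.tokenize([statement for pair in pairs for statement in pair]).to(device)

with torch.no_grad():
    logits, _ = model(image, text)
    # Softmax within each pair, so the two probabilities sum to 1
    probs = logits.view(len(pairs), 2).softmax(dim=-1)

# Report the more likely description of each pair
for (first, second), (p_first, p_second) in zip(pairs, probs.tolist()):
    answer = first if p_first > p_second else second
    print(f"{answer} ({max(p_first, p_second):.2%})")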

Image deduplication

# Detect duplicate/similar images
def compute_similarity(img1_path, img2_path):
    img1 = preprocess(Image.open(img1_path)).unsqueeze(0).to(device)
    img2 = preprocess(Image.open(img2_path)).unsqueeze(0).to(device)

    with torch.no_grad():
        feat1 = model.encode_image(img1)
        feat2 = model.encode_image(img2)

        feat1 /= feat1.norm(dim=-1, keepdim=True)
        feat2 /= feat2.norm(dim=-1, keepdim=True)

        similarity = (feat1 @ feat2.T).item()

    return similarity

# Check for duplicates
threshold = 0.95
image_pairs = [("img1.jpg", "img2.jpg"), ("img1.jpg", "img3.jpg")]

for img1, img2 in image_pairs:
    sim = compute_similarity(img1, img2)
    if sim > threshold:
        print(f"{img1} and {img2} are duplicates (similarity: {sim:.3f})")
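Comparing images pair by pair re-encodes each image many times. If the normalized embeddings are already computed, all pairs can be checked with a single matrix product. The sketch below uses toy 2-dimensional unit vectors in place of CLIP embeddings; `find_duplicates` is a hypothetical helper and the 0.95 threshold is carried over from the example above.

```python
import numpy as np

def find_duplicates(embs, paths, threshold=0.95):
    """Return (path_a, path_b, similarity) for every pair above `threshold`."""
    sims = embs @ embs.T
    # k=1 keeps only pairs with i < j, skipping self-matches and mirrored pairs
    rows, cols = np.triu_indices(len(paths), k=1)
    return [
        (paths[i], paths[j], float(sims[i, j]))
        for i, j in zip(rows, cols)
        if sims[i, j] > threshold
    ]

# Toy vectors: the first two are nearly identical, the third is unrelated
embs = np.array([[1.0, 0.0], [0.999, 0.0447], [0.0, 1.0]])
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

dupes = find_duplicates(embs, paths)
print(dupes)
```

The similarity matrix grows with the square of the collection size, so for very large collections an approximate nearest-neighbor index is likely a better fit than this brute-force approach.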

Best practices

1. Use descriptive labels - "a photo of X" works better than just "X"
2. Normalize embeddings - Always normalize for cosine similarity
3. Batch processing - Process multiple images/texts together
4. Cache embeddings - Expensive to recompute
5. Set appropriate thresholds - Test on validation data
6. Use GPU - 10-50× faster than CPU
7. Consider model size - ViT-B/32 good default, ViT-L/14 for best quality
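Choosing a threshold on validation data can be sketched as a small sweep that reports precision and recall at each candidate value. The scores and labels below are invented for illustration, and `precision_recall` is a hypothetical helper, not part of CLIP.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for scores > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical validation set: similarity scores and ground-truth labels
scores = [0.98, 0.96, 0.91, 0.80, 0.60]
labels = [True, True, True, False, False]

for threshold in (0.70, 0.90, 0.95):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data a threshold of 0.90 separates the two groups cleanly, while 0.70 lets a false positive through and 0.95 misses a true match. Real data rarely separates this neatly, so the choice depends on which error is more costly.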

Resources

Paper: https://arxiv.org/abs/2103.00020
GitHub: https://github.com/openai/CLIP
Colab: https://colab.research.google.com/github/openai/clip/


How it compares

Use clip for quick zero-shot vision-language tasks; switch to BLIP-2 or LLaVA skills in the same repo for captioning or vision chat workloads.

FAQ

Which CLIP model does the clip skill recommend by default?

The clip skill recommends ViT-B/32 as the default CLIP model for most cases, balancing 151M parameters with medium speed and strong quality. It also documents RN50, RN101, ViT-B/16, and ViT-L/14 variants with parameter counts and speed trade-offs.

What can CLIP do without fine-tuning?

The clip skill covers zero-shot image classification, image-text similarity scoring, semantic image search, content moderation labels, and cross-modal retrieval using natural-language prompts. CLIP was trained on 400M image-text pairs and matches ResNet-50 ImageNet performance zero-shot.

Is Clip safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: Data Science & ML, llm, research
