
Clip
Add CLIP-powered zero-shot classification and semantic image search to a product or research prototype with copy-paste PyTorch workflows.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill clip
What is this skill?
- Zero-shot image classification with text prompts and softmax probabilities
- Semantic image search by embedding a database and scoring text queries
- ViT-B/32 load pattern with preprocess, encode_image, and encode_text
- L2-normalized feature vectors for cosine-style similarity search
- Practical Python snippets for indexing image paths and ranking matches
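To make the L2-normalization point concrete, here is a minimal sketch in plain Python with no CLIP dependency. The three-dimensional vectors and the helper names (`l2_normalize`, `dot`, `rank_by_similarity`) are hypothetical stand-ins for real CLIP embeddings and are not part of the skill. It shows that once vectors are scaled to unit length, a plain dot product gives the cosine similarity used for ranking.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, database):
    """Return (path, score) pairs sorted from most to least similar."""
    q = l2_normalize(query_vec)
    scored = [(path, dot(q, l2_normalize(vec))) for path, vec in database]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy vectors standing in for image embeddings (assumption: real CLIP
# ViT-B/32 features are much higher-dimensional).
toy_database = [
    ("img1.jpg", [3.0, 4.0, 0.0]),
    ("img2.jpg", [0.0, 0.0, 2.0]),
    ("img3.jpg", [6.0, 8.0, 1.0]),
]
toy_query = [1.5, 2.0, 0.0]  # stand-in for a text embedding

ranking = rank_by_similarity(toy_query, toy_database)
for path, score in ranking:
    print(f"{path}: {score:.3f}")
```

Because normalization removes magnitude, `img1.jpg` scores highest here even though its raw vector is shorter than that of `img3.jpg`: only direction matters.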
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
CLIP integration happens when you wire multimodal features into an app or pipeline, not during early market ideation. Embedding OpenAI CLIP to encode, search, and classify images is third-party model integration work, which falls under Build.
Common Questions / FAQ
Is Clip safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# CLIP Applications Guide

Practical applications and use cases for CLIP.

## Zero-shot image classification

```python
import torch
import clip
from PIL import Image

# clip.load places the model on the GPU when one is available,
# so keep the inputs on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Define categories
categories = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a bird",
    "a photo of a car",
    "a photo of a person"
]

# Prepare image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(categories).to(device)

# Classify
with torch.no_grad():
    # Standalone embeddings (the forward pass below recomputes them internally)
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
for category, prob in zip(categories, probs[0]):
    print(f"{category}: {prob:.2%}")
```

## Semantic image search

```python
# Index images
image_database = []
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

for img_path in image_paths:
    # Use a separate name so the `image` tensor from above is not overwritten
    img = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(img)
        features /= features.norm(dim=-1, keepdim=True)
    image_database.append((img_path, features))

# Search with text
query = "a sunset over mountains"
text_input = clip.tokenize([query]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_input)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Find matches
similarities = []
for img_path, img_features in image_database:
    # Dot product of unit vectors = cosine similarity
    similarity = (text_features @ img_features.T).item()
    similarities.append((img_path, similarity))

# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
for img_path, score in similarities[:3]:
    print(f"{img_path}: {score:.3f}")
```

## Content moderation

```python
# Define safety categories
categories = [
    "safe for work content",
    "not safe for work content",
    "violent or graphic content",
    "hate speech or offensive content",
    "spam or misleading content"
]

text = clip.tokenize(categories).to(device)

# Check image (reuses the preprocessed `image` tensor from the classification example)
with torch.no_grad():
    logits, _ = model(image, text)
    probs = logits.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
confidence = probs[0, max_idx].item()

if confidence > 0.7:
    print(f"Classified as: {categories[max_idx]} ({confidence:.2%})")
else:
    print(f"Uncertain classification (confidence: {confidence:.2%})")
```

## Image-to-text retrieval

```python
# Text database
captions = [
    "A beautiful sunset over the ocean",
    "A cute dog playing in the park",
    "A modern city skyline at night",
    "A delicious pizza with toppings"
]

# Encode captions
caption_features = []
for caption in captions:
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        features = model.encode_text(text)
        features /= features.norm(dim=-1, keepdim=True)
    caption_features.append(features)

caption_features = torch.cat(caption_features)

# Find matching captions for image
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

similarities = (image_features @ caption_features.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices.tolist(), top_k.values.tolist()):
    print(f"{captions[idx]}: {score:.3f}")
```

## Visual question answering

```python
# Create yes/no questions
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

questions = [
    "a photo showing people",
    "a photo showing animals",
    "a photo taken indoors",
    "a photo taken outdoors",
    "a photo taken during daytime",
    "a photo taken at night"
]

text = clip.tokenize(questions).to(device)

with torch.no_grad():
    logits, _ = model(image, text)
    # Softmax runs across all prompts, so the probabilities sum to 1
    # and at most one prompt can exceed the 0.5 threshold below.
    probs = logits.softmax(dim=-1)

# Answer questions
for question, prob in zip(questions, probs[0]):
    answer = "Yes" if prob > 0.5 else "No"
    print(f"{question}: {answer} ({prob:.2%})")
```

## Image deduplication

```python
# Detect duplicate/similar im