Senior Computer Vision

Name: Senior Computer Vision
Author: alirezarezvani

alirezarezvani/claude-skills

965 installs
23.5k repo stars
Updated July 17, 2026

senior-computer-vision is a production computer vision engineering skill that guides object detection, image segmentation, training, and ONNX/TensorRT deployment for developers building visual AI systems with PyTorch and

About

senior-computer-vision is a production computer vision engineering skill covering object detection, image segmentation, and visual AI system deployment. It spans CNN and Vision Transformer architectures, detection models including YOLO, Faster R-CNN, and DETR, plus segmentation with Mask R-CNN and SAM. Framework coverage includes PyTorch, torchvision, Ultralytics, Detectron2, and MMDetection, with deployment paths through ONNX and TensorRT for optimized inference. Developers reach for senior-computer-vision when building detection pipelines, training custom models, optimizing inference latency, or deploying vision systems to production. The skill provides expert-level guidance across the full vision ML lifecycle from architecture selection through exported runtime models.

Covers CNNs, Vision Transformers, YOLO, Faster R-CNN, DETR, Mask R-CNN and SAM models
Includes complete workflows for object detection pipelines, model optimization, deployment and custom dataset preparation
Supports PyTorch, torchvision, Ultralytics, Detectron2 and MMDetection frameworks
Production deployment guidance with ONNX and TensorRT
Architecture selection guide plus reference documentation for vision engineering tasks

Senior Computer Vision by the numbers

965 all-time installs (skills.sh)
+20 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #1,102 of 16,565 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-computer-vision


Installs	965
repo stars	★ 23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you deploy YOLO detection models with TensorRT?

Get expert guidance on building, training, optimizing and deploying production-grade computer vision systems using modern architectures and frameworks.

Who is it for?

ML engineers building production object detection or segmentation systems with PyTorch, Ultralytics, or Detectron2.

Skip if: Beginners needing basic image classification tutorials or projects with no camera, video, or visual input requirements.

When should I use this skill?

The user builds object detection pipelines, trains YOLO or DETR models, optimizes vision inference, or deploys segmentation with Mask R-CNN or SAM.

What you get

Trained detection or segmentation models exported to ONNX/TensorRT with optimized inference pipelines and framework-specific training configs.

trained detection models
ONNX/TensorRT exports
inference pipeline configs

Files

SKILL.md (Markdown)

Senior Computer Vision Engineer

Production computer vision engineering skill for object detection, image segmentation, and visual AI system deployment.

Quick Start
Core Expertise
Tech Stack
Workflow 1: Object Detection Pipeline
Workflow 2: Model Optimization and Deployment
Workflow 3: Custom Dataset Preparation
Architecture Selection Guide
Reference Documentation

Quick Start

# Generate training configuration for YOLO or Faster R-CNN
python scripts/vision_model_trainer.py models/ --task detection --arch yolov8

# Analyze model for optimization opportunities (quantization, pruning)
python scripts/inference_optimizer.py model.pt --target onnx --benchmark

# Build dataset pipeline with augmentations
python scripts/dataset_pipeline_builder.py images/ --format coco --augment

Core Expertise

This skill provides guidance on:

Object Detection: YOLO family (v5-v11), Faster R-CNN, DETR, RT-DETR
Instance Segmentation: Mask R-CNN, YOLACT, SOLOv2
Semantic Segmentation: DeepLabV3+, SegFormer, SAM (Segment Anything)
Image Classification: ResNet, EfficientNet, Vision Transformers (ViT, DeiT)
Video Analysis: Object tracking (ByteTrack, SORT), action recognition
3D Vision: Depth estimation, point cloud processing, NeRF
Production Deployment: ONNX, TensorRT, OpenVINO, CoreML

Tech Stack

Category	Technologies
Frameworks	PyTorch, torchvision, timm
Detection	Ultralytics (YOLO), Detectron2, MMDetection
Segmentation	segment-anything, mmsegmentation
Optimization	ONNX, TensorRT, OpenVINO, torch.compile
Image Processing	OpenCV, Pillow, albumentations
Annotation	CVAT, Label Studio, Roboflow
Experiment Tracking	MLflow, Weights & Biases
Serving	Triton Inference Server, TorchServe

Workflow 1: Object Detection Pipeline

Use this workflow when building an object detection system from scratch.

Step 1: Define Detection Requirements

Analyze the detection task requirements:

Detection Requirements Analysis:
- Target objects: [list specific classes to detect]
- Real-time requirement: [yes/no, target FPS]
- Accuracy priority: [speed vs accuracy trade-off]
- Deployment target: [cloud GPU, edge device, mobile]
- Dataset size: [number of images, annotations per class]

Step 2: Select Detection Architecture

Choose architecture based on requirements:

Requirement	Recommended Architecture	Why
Real-time (>30 FPS)	YOLOv8/v11, RT-DETR	Single-stage, optimized for speed
High accuracy	Faster R-CNN, DINO	Two-stage proposals (Faster R-CNN) or multi-scale transformer queries (DINO), better localization
Small objects	YOLO + SAHI, Faster R-CNN + FPN	Multi-scale detection
Edge deployment	YOLOv8n, MobileNetV3-SSD	Lightweight architectures
Transformer-based	DETR, DINO, RT-DETR	End-to-end, no NMS required

Step 3: Prepare Dataset

Convert annotations to required format:

# COCO format (recommended)
python scripts/dataset_pipeline_builder.py data/images/ \
    --annotations data/labels/ \
    --format coco \
    --split 0.8 0.1 0.1 \
    --output data/coco/

# Verify dataset
python -c "from pycocotools.coco import COCO; coco = COCO('data/coco/train.json'); print(f'Images: {len(coco.imgs)}, Categories: {len(coco.cats)}')"

Step 4: Configure Training

Generate training configuration:

# For Ultralytics YOLO
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch yolov8m \
    --epochs 100 \
    --batch 16 \
    --imgsz 640 \
    --output configs/

# For Detectron2
python scripts/vision_model_trainer.py data/coco/ \
    --task detection \
    --arch faster_rcnn_R_50_FPN \
    --framework detectron2 \
    --output configs/

Step 5: Train and Validate

# Ultralytics training
yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640

# Detectron2 training
python train_net.py --config-file configs/faster_rcnn.yaml --num-gpus 1

# Validate on test set
yolo detect val model=runs/detect/train/weights/best.pt data=data.yaml

Step 6: Evaluate Results

Key metrics to analyze:

Metric	Target	Description
mAP@50	>0.7	Mean Average Precision at IoU 0.5
mAP@50:95	>0.5	COCO primary metric
Precision	>0.8	Low false positives
Recall	>0.8	Low missed detections
Inference time	<33ms	For 30 FPS real-time

Workflow 2: Model Optimization and Deployment

Use this workflow when preparing a trained model for production deployment.

Step 1: Benchmark Baseline Performance

# Measure current model performance
python scripts/inference_optimizer.py model.pt \
    --benchmark \
    --input-size 640 640 \
    --batch-sizes 1 4 8 16 \
    --warmup 10 \
    --iterations 100

Expected output:

Baseline Performance (PyTorch FP32):
- Batch 1: 45.2ms (22.1 FPS)
- Batch 4: 89.4ms (44.7 FPS)
- Batch 8: 165.3ms (48.4 FPS)
- Memory: 2.1 GB
- Parameters: 25.9M
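
The FPS figures in the sample output follow directly from batch size and latency. A minimal sketch of that arithmetic, using the numbers above (the function name `throughput_fps` is illustrative and not part of the skill's scripts):

```python
def throughput_fps(batch_size: int, latency_ms: float) -> float:
    """Images per second when one batch takes latency_ms to run."""
    return batch_size / (latency_ms / 1000.0)

# Latencies (ms) per batch size, copied from the sample benchmark output
baseline = {1: 45.2, 4: 89.4, 8: 165.3}
fps_by_batch = {b: round(throughput_fps(b, ms), 1) for b, ms in baseline.items()}
print(fps_by_batch)
```

Larger batches raise throughput but also raise per-request latency, so the batch size that maximizes FPS is not necessarily the right one for a real-time budget.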

Step 2: Select Optimization Strategy

Deployment Target	Optimization Path
NVIDIA GPU (cloud)	PyTorch → ONNX → TensorRT FP16
NVIDIA GPU (edge)	PyTorch → TensorRT INT8
Intel CPU	PyTorch → ONNX → OpenVINO
Apple Silicon	PyTorch → CoreML
Generic CPU	PyTorch → ONNX Runtime
Mobile	PyTorch → TFLite or ONNX Mobile

Step 3: Export to ONNX

# Export with dynamic batch size
python scripts/inference_optimizer.py model.pt \
    --export onnx \
    --input-size 640 640 \
    --dynamic-batch \
    --simplify \
    --output model.onnx

# Verify ONNX model
python -c "import onnx; model = onnx.load('model.onnx'); onnx.checker.check_model(model); print('ONNX model valid')"

Step 4: Apply Quantization (Optional)

For INT8 quantization with calibration:

# Quantize to INT8 using a calibration dataset
python scripts/inference_optimizer.py model.onnx \
    --quantize int8 \
    --calibration-data data/calibration/ \
    --calibration-samples 500 \
    --output model_int8.onnx

Quantization impact analysis:

Precision	Size	Speed	Accuracy Drop
FP32	100%	1x	0%
FP16	50%	1.5-2x	<0.5%
INT8	25%	2-4x	1-3%
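
To make the size and accuracy trade-off in the table concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy. It is an illustration only: real toolchains such as TensorRT or ONNX Runtime also calibrate activation ranges and often quantize per channel, and the helper names here are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)  # stand-in for one layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = float(np.abs(w - w_hat).max())   # bounded by half a quantization step
size_ratio = q.nbytes / w.nbytes           # INT8 storage relative to FP32
```

The rounding error per weight is at most half of `scale`, which is why accuracy loss stays small when the weight distribution has no extreme outliers.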

Step 5: Convert to Target Runtime

# TensorRT (NVIDIA GPU)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16

# OpenVINO (Intel); mo is the legacy Model Optimizer CLI, and OpenVINO 2023.1+ also ships ovc
mo --input_model model.onnx --output_dir openvino/

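# CoreML (Apple): current coremltools converts traced PyTorch (TorchScript) models rather than ONNX files
# model_traced.pt is a placeholder for a TorchScript export of the same network
python -c "import coremltools as ct; model = ct.convert('model_traced.pt', source='pytorch', convert_to='mlprogram', inputs=[ct.TensorType(shape=(1, 3, 640, 640))]); model.save('model.mlpackage')"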

Step 6: Benchmark Optimized Model

python scripts/inference_optimizer.py model.engine \
    --benchmark \
    --runtime tensorrt \
    --compare model.pt

Expected speedup:

Optimization Results:
- Original (PyTorch FP32): 45.2ms
- Optimized (TensorRT FP16): 12.8ms
- Speedup: 3.5x
- Accuracy change: -0.3% mAP

Workflow 3: Custom Dataset Preparation

Use this workflow when preparing a computer vision dataset for training.

Step 1: Audit Raw Data

# Analyze image dataset
python scripts/dataset_pipeline_builder.py data/raw/ \
    --analyze \
    --output analysis/

Analysis report includes:

Dataset Analysis:
- Total images: 5,234
- Image sizes: 640x480 to 4096x3072 (variable)
- Formats: JPEG (4,891), PNG (343)
- Corrupted: 12 files
- Duplicates: 45 pairs

Annotation Analysis:
- Format detected: Pascal VOC XML
- Total annotations: 28,456
- Classes: 5 (car, person, bicycle, dog, cat)
- Distribution: car (12,340), person (8,234), bicycle (3,456), dog (2,890), cat (1,536)
- Empty images: 234

Step 2: Clean and Validate

# Remove corrupted and duplicate images
python scripts/dataset_pipeline_builder.py data/raw/ \
    --clean \
    --remove-corrupted \
    --remove-duplicates \
    --output data/cleaned/

Step 3: Convert Annotation Format

# Convert VOC to COCO format
python scripts/dataset_pipeline_builder.py data/cleaned/ \
    --annotations data/annotations/ \
    --input-format voc \
    --output-format coco \
    --output data/coco/

Supported format conversions:

From	To
Pascal VOC XML	COCO JSON
YOLO TXT	COCO JSON
COCO JSON	YOLO TXT
LabelMe JSON	COCO JSON
CVAT XML	COCO JSON
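
The conversions above mostly come down to changing the box parameterization. A minimal sketch of the two most common steps, assuming Pascal VOC boxes are `[xmin, ymin, xmax, ymax]` in pixels, COCO boxes are `[x, y, width, height]` in pixels, and YOLO boxes are center-based and normalized (function names are illustrative, not the script's API):

```python
def voc_to_coco(box):
    """[xmin, ymin, xmax, ymax] -> [x, y, width, height], top-left origin, pixels."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]

def coco_to_yolo(box, img_w, img_h):
    """[x, y, w, h] in pixels -> [cx, cy, w, h] normalized to the 0-1 range."""
    x, y, w, h = box
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]

voc_box = [100, 200, 300, 400]               # annotated on a 640x480 image
coco_box = voc_to_coco(voc_box)
yolo_box = coco_to_yolo(coco_box, img_w=640, img_h=480)
```

Class IDs also need remapping between formats (COCO category IDs are arbitrary integers, YOLO expects contiguous zero-based indices), which this sketch leaves out.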

Step 4: Apply Augmentations

# Generate augmentation config
python scripts/dataset_pipeline_builder.py data/coco/ \
    --augment \
    --aug-config configs/augmentation.yaml \
    --output data/augmented/

Recommended augmentations for detection:

# configs/augmentation.yaml
augmentations:
  geometric:
    - horizontal_flip: { p: 0.5 }
    - vertical_flip: { p: 0.1 }  # Only if orientation invariant
    - rotate: { limit: 15, p: 0.3 }
    - scale: { scale_limit: 0.2, p: 0.5 }

  color:
    - brightness_contrast: { brightness_limit: 0.2, contrast_limit: 0.2, p: 0.5 }
    - hue_saturation: { hue_shift_limit: 20, sat_shift_limit: 30, p: 0.3 }
    - blur: { blur_limit: 3, p: 0.1 }

  advanced:
    - mosaic: { p: 0.5 }  # YOLO-style mosaic
    - mixup: { p: 0.1 }   # Image mixing
    - cutout: { num_holes: 8, max_h_size: 32, max_w_size: 32, p: 0.3 }
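
Geometric augmentations must transform the boxes together with the pixels. As a minimal sketch of what `horizontal_flip` implies for annotations, assuming boxes are stored as `[x1, y1, x2, y2]` in pixels (the helper name is hypothetical; libraries such as albumentations handle this internally):

```python
import numpy as np

def hflip_boxes(boxes_xyxy: np.ndarray, img_w: int) -> np.ndarray:
    """Mirror [x1, y1, x2, y2] boxes around the vertical image axis."""
    flipped = boxes_xyxy.copy()
    flipped[:, 0] = img_w - boxes_xyxy[:, 2]  # new x1 comes from the old x2
    flipped[:, 2] = img_w - boxes_xyxy[:, 0]  # new x2 comes from the old x1
    return flipped

boxes = np.array([[10, 20, 110, 220]], dtype=np.float32)
flipped = hflip_boxes(boxes, img_w=640)
restored = hflip_boxes(flipped, img_w=640)    # flipping twice is the identity
```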

Step 5: Create Train/Val/Test Splits

python scripts/dataset_pipeline_builder.py data/augmented/ \
    --split 0.8 0.1 0.1 \
    --stratify \
    --seed 42 \
    --output data/final/

Split strategy guidelines:

Dataset Size	Train	Val	Test
<1,000 images	70%	15%	15%
1,000-10,000	80%	10%	10%
>10,000	90%	5%	5%
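
A minimal sketch of a seeded ratio split, mirroring the `--split 0.8 0.1 0.1 --seed 42` options above. It does not stratify by class, which the script's `--stratify` flag is meant to do, and the function name is an assumption for illustration:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle deterministically, then cut into train/val/test by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    # Whatever remains after train and val goes to test
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(1000)])
```

Splitting should happen before augmentation is materialized to disk if augmented copies of one source image could otherwise land in both train and test.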

Step 6: Generate Dataset Configuration

# For Ultralytics YOLO
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config yolo \
    --output data.yaml

# For Detectron2
python scripts/dataset_pipeline_builder.py data/final/ \
    --generate-config detectron2 \
    --output detectron2_config.py

Architecture Selection Guide

Object Detection Architectures

Architecture	Speed	Accuracy	Best For
YOLOv8n	1.2ms	37.3 mAP	Edge, mobile, real-time
YOLOv8s	2.1ms	44.9 mAP	Balanced speed/accuracy
YOLOv8m	4.2ms	50.2 mAP	General purpose
YOLOv8l	6.8ms	52.9 mAP	High accuracy
YOLOv8x	10.1ms	53.9 mAP	Maximum accuracy
RT-DETR-L	5.3ms	53.0 mAP	Transformer, no NMS
Faster R-CNN R50	46ms	40.2 mAP	Two-stage, high quality
DINO-4scale	85ms	49.0 mAP	SOTA transformer

Segmentation Architectures

Architecture	Type	Speed	Best For
YOLOv8-seg	Instance	4.5ms	Real-time instance seg
Mask R-CNN	Instance	67ms	High-quality masks
SAM	Promptable	50ms	Zero-shot segmentation
DeepLabV3+	Semantic	25ms	Scene parsing
SegFormer	Semantic	15ms	Efficient semantic seg

CNN vs Vision Transformer Trade-offs

Aspect	CNN (YOLO, R-CNN)	Transformer (DETR, DINO)
Training data needed	1K-10K images	10K-100K+ images
Training time	Fast	Slow (needs more epochs)
Inference speed	Faster	Slower
Small objects	Good with FPN	Needs multi-scale
Global context	Limited	Excellent
Positional encoding	Implicit	Explicit

Reference Documentation

→ See references/reference-docs-and-commands.md for details

Performance Targets

Metric	Real-time	High Accuracy	Edge
FPS	>30	>10	>15
mAP@50	>0.6	>0.8	>0.5
Latency P99	<50ms	<150ms	<100ms
GPU Memory	<4GB	<8GB	<2GB
Model Size	<50MB	<200MB	<20MB

Resources

Architecture Guide: references/computer_vision_architectures.md
Optimization Guide: references/object_detection_optimization.md
Deployment Guide: references/production_vision_systems.md
Scripts: scripts/ directory for automation tools

Computer Vision Architectures

Comprehensive guide to CNN and Vision Transformer architectures for object detection, segmentation, and image classification.

Backbone Architectures
Detection Architectures
Segmentation Architectures
Vision Transformers
Feature Pyramid Networks
Architecture Selection

---

Backbone Architectures

Backbone networks extract feature representations from images. The choice of backbone affects both accuracy and inference speed.

ResNet Family

ResNet introduced residual connections that enable training of very deep networks.

Variant	Params	GFLOPs	Top-1 Acc	Use Case
ResNet-18	11.7M	1.8	69.8%	Edge, mobile
ResNet-34	21.8M	3.7	73.3%	Balanced
ResNet-50	25.6M	4.1	76.1%	Standard backbone
ResNet-101	44.5M	7.8	77.4%	High accuracy
ResNet-152	60.2M	11.6	78.3%	Maximum accuracy

Residual Block Architecture:

Input
  |
  +---> Conv 1x1 (reduce channels)
  |         |
  |     Conv 3x3
  |         |
  |     Conv 1x1 (expand channels)
  |         |
  +-----> Add <----+
            |
         ReLU
            |
         Output

When to use ResNet:

Standard detection/segmentation tasks
When pretrained weights are important
Moderate compute budget
Well-understood, stable architecture

EfficientNet Family

EfficientNet uses compound scaling to balance depth, width, and resolution.

Variant	Params	GFLOPs	Top-1 Acc	Relative Speed
EfficientNet-B0	5.3M	0.4	77.1%	1x
EfficientNet-B1	7.8M	0.7	79.1%	0.7x
EfficientNet-B2	9.2M	1.0	80.1%	0.6x
EfficientNet-B3	12M	1.8	81.6%	0.4x
EfficientNet-B4	19M	4.2	82.9%	0.25x
EfficientNet-B5	30M	9.9	83.6%	0.15x
EfficientNet-B6	43M	19	84.0%	0.1x
EfficientNet-B7	66M	37	84.3%	0.05x

Key innovations:

Mobile Inverted Bottleneck (MBConv) blocks
Squeeze-and-Excitation attention
Compound scaling coefficients
Swish activation function

When to use EfficientNet:

Mobile and edge deployment
When parameter efficiency matters
Classification tasks
Limited compute resources

ConvNeXt

ConvNeXt modernizes ResNet with techniques from Vision Transformers.

Variant	Params	GFLOPs	Top-1 Acc
ConvNeXt-T	29M	4.5	82.1%
ConvNeXt-S	50M	8.7	83.1%
ConvNeXt-B	89M	15.4	83.8%
ConvNeXt-L	198M	34.4	84.3%
ConvNeXt-XL	350M	60.9	84.7%

Key design choices:

7x7 depthwise convolutions (mirroring the 7x7 attention window of Swin Transformer)
Layer normalization instead of batch norm
GELU activation
Fewer but wider stages
Inverted bottleneck design

ConvNeXt Block:

Input
  |
  +---> DWConv 7x7
  |         |
  |     LayerNorm
  |         |
  |     Linear (4x channels)
  |         |
  |     GELU
  |         |
  |     Linear (1x channels)
  |         |
  +-----> Add <----+
            |
         Output

CSPNet (Cross Stage Partial)

CSPNet is the backbone design used in YOLO v4-v8.

Key features:

Gradient flow optimization
Reduced computation while maintaining accuracy
Cross-stage partial connections
Optimized for real-time detection

CSP Block:

Input
  |
  +----> Split ----+
  |                |
  |            Conv Block
  |                |
  |            Conv Block
  |                |
  +----> Concat <--+
            |
         Output

---

Detection Architectures

Two-Stage Detectors

Two-stage detectors first propose regions, then classify and refine them.

Faster R-CNN

Architecture:

1. Backbone: Feature extraction (ResNet, etc.)
2. RPN (Region Proposal Network): Generate object proposals
3. RoI Pooling/Align: Extract fixed-size features
4. Classification Head: Classify and refine boxes

Image → Backbone → Feature Map
                      |
                      +→ RPN → Proposals
                      |           |
                      +→ RoI Align ← +
                            |
                      FC Layers
                            |
                    Class + BBox

RPN Details:

Sliding window over feature map
Anchor boxes at each position (3 scales × 3 ratios = 9)
Predicts objectness score and box refinement
NMS to reduce proposals (typically 300-2000)
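
The "3 scales × 3 ratios = 9" anchors per position can be generated as follows. This is a minimal sketch: the scale values (128, 256, 512) are an assumption taken from the original Faster R-CNN setup, and ratios are interpreted as height/width with the anchor area held constant.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) anchors as [x1, y1, x2, y2], centered at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:              # r = height / width; area stays s * s
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
```

Shifting this base set to every feature-map location (multiplied by the stride) yields the full anchor grid that the RPN scores.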

Performance characteristics:

mAP@50:95: ~40-42 (COCO, R50-FPN)
Inference: ~50-100ms per image
Better localization than single-stage
Slower but more accurate

Cascade R-CNN

Multi-stage refinement with increasing IoU thresholds.

Stage 1 (IoU 0.5) → Stage 2 (IoU 0.6) → Stage 3 (IoU 0.7)

Benefits:

Progressive refinement
Better high-IoU predictions
+3-4 mAP over Faster R-CNN
Minimal additional cost per stage

Single-Stage Detectors

Single-stage detectors predict boxes and classes in one pass.

YOLO Family

YOLOv8 Architecture:

Input Image
     |
  Backbone (CSPDarknet)
     |
  +--+--+--+
  |  |  |  |
 P3 P4 P5 (multi-scale features)
  |  |  |
  Neck (PANet + C2f)
  |  |  |
  Head (Decoupled)
     |
 Boxes + Classes

Key YOLOv8 innovations:

C2f module (faster CSP variant)
Anchor-free detection head
Decoupled classification/regression heads
Task-aligned assigner (TAL)
Distribution focal loss (DFL)

YOLO variant comparison:

Model	Size (px)	Params	mAP@50:95	Speed (ms)
YOLOv5n	640	1.9M	28.0	1.2
YOLOv5s	640	7.2M	37.4	1.8
YOLOv5m	640	21.2M	45.4	3.5
YOLOv8n	640	3.2M	37.3	1.2
YOLOv8s	640	11.2M	44.9	2.1
YOLOv8m	640	25.9M	50.2	4.2
YOLOv8l	640	43.7M	52.9	6.8
YOLOv8x	640	68.2M	53.9	10.1

SSD (Single Shot Detector)

Multi-scale detection with default boxes.

Architecture:

VGG16 or MobileNet backbone
Additional convolution layers for multi-scale
Default boxes at each scale
Direct classification and regression

When to use SSD:

Edge deployment (SSD-MobileNet)
When YOLO alternatives needed
Simple architecture requirements

RetinaNet

Focal loss to handle class imbalance.

Key innovation:

FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

Where:

γ (focusing parameter) = 2 typically
α (class-balancing weight) = 0.25 for the foreground class, with 1 − α = 0.75 applied to background
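
A worked example shows how strongly the focusing term suppresses easy examples. This minimal sketch evaluates the formula for a single binary prediction, assuming the RetinaNet convention that α weights the positive class and 1 − α the negative class:

```python
import math

def binary_focal_loss(p: float, is_positive: bool,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one prediction."""
    p_t = p if is_positive else 1.0 - p
    alpha_t = alpha if is_positive else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.9, is_positive=True)   # well-classified positive
hard = binary_focal_loss(0.1, is_positive=True)   # badly misclassified positive
ce_easy = -math.log(0.9)                          # plain cross-entropy for comparison
```

With γ = 2, the easy example's loss is scaled by (1 − 0.9)² = 0.01 relative to α-weighted cross-entropy, so the many easy background locations no longer dominate the gradient.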

Benefits:

Handles extreme foreground-background imbalance
Matches two-stage accuracy
Single-stage speed

---

Segmentation Architectures

Instance Segmentation

Mask R-CNN

Extends Faster R-CNN with mask prediction branch.

RoI Features → FC Layers → Class + BBox
      |
      +→ Conv Layers → Mask (28×28 per class)

Key details:

RoI Align (bilinear interpolation, no quantization)
Per-class binary mask prediction
Decoupled mask and classification
14×14 or 28×28 mask resolution

Performance:

mAP (box): ~39 on COCO
mAP (mask): ~35 on COCO
Inference: ~100-200ms

YOLACT / YOLACT++

Real-time instance segmentation.

Approach:

1. Generate prototype masks (global)
2. Predict mask coefficients per instance
3. Linear combination: mask = Σ(coefficients × prototypes)
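
The linear-combination step can be sketched in a few lines of NumPy. The shapes here (138×138 prototypes, k = 32) are illustrative assumptions, and the sigmoid follows the YOLACT paper's mask assembly; cropping masks to their predicted boxes is left out.

```python
import numpy as np

def assemble_masks(prototypes: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
    """prototypes: (H, W, k); coefficients: (n, k) -> masks: (n, H, W) in [0, 1]."""
    logits = np.einsum("hwk,nk->nhw", prototypes, coefficients)
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid turns the combination into probabilities

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(138, 138, 32))   # k = 32 image-wide prototype masks
coefficients = rng.normal(size=(5, 32))        # one coefficient vector per detected instance
masks = assemble_masks(prototypes, coefficients)
```

Because the prototypes are computed once per image, adding more instances only costs one extra matrix-vector product each, which is what keeps the method fast.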

Benefits:

Real-time (~30 FPS)
Simpler than Mask R-CNN
Global prototypes capture spatial info

YOLOv8-Seg

Adds segmentation head to YOLOv8.

Performance:

mAP (box): 44.6
mAP (mask): 36.8
Speed: 4.5ms

Semantic Segmentation

DeepLabV3+

Atrous convolutions for multi-scale context.

Key components:

1. ASPP (Atrous Spatial Pyramid Pooling)

Parallel atrous convolutions at different rates
Captures multi-scale context
Rates: 6, 12, 18 typically

2. Encoder-Decoder

Encoder: Backbone + ASPP
Decoder: Upsample with skip connections

Image → Backbone → ASPP → Decoder → Segmentation
              ↘    ↗
          Low-level features
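
To see why the ASPP rates capture multi-scale context, it helps to compute how many pixels a dilated 3×3 kernel spans. A minimal sketch using the standard dilation arithmetic (the function name is illustrative):

```python
def effective_kernel(kernel_size: int, rate: int) -> int:
    """Span in pixels covered by a dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Rate 1 is an ordinary convolution; 6, 12, 18 are the typical ASPP rates
spans = {rate: effective_kernel(3, rate) for rate in (1, 6, 12, 18)}
print(spans)
```

Each branch keeps nine weights per channel but looks at a 13, 25, or 37 pixel window of the feature map, so the parallel branches see context at several scales for the same parameter cost.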

Performance:

mIoU: 89.0 on PASCAL VOC 2012 test, 82.1 on Cityscapes test
Inference: ~25ms (ResNet-50)

SegFormer

Transformer-based semantic segmentation.

Architecture:

1. Hierarchical Transformer Encoder

Multi-scale feature maps
Efficient self-attention
Overlapping patch embedding

2. MLP Decoder

Simple MLP aggregation
No complex decoders needed

Benefits:

No positional encoding needed
Efficient attention mechanism
Strong multi-scale features

Promptable Segmentation

SAM (Segment Anything Model)

Zero-shot segmentation with prompts.

Architecture:

1. Image Encoder: ViT-H (632M params)
2. Prompt Encoder: Points, boxes, masks, text
3. Mask Decoder: Lightweight transformer

Prompts supported:

Points (foreground/background)
Bounding boxes
Rough masks
Text (explored in the SAM paper via CLIP embeddings; not supported by the released checkpoints)

Usage patterns:

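import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path is a placeholder; `image` is an HxWx3 uint8 RGB array
sam = sam_model_registry["vit_h"](checkpoint="path/to/sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# Point prompt (predict returns masks, quality scores, and low-res logits)
masks, scores, logits = predictor.predict(point_coords=np.array([[500, 375]]), point_labels=np.array([1]))

# Box prompt in XYXY pixel coordinates
masks, scores, logits = predictor.predict(box=np.array([100, 100, 400, 400]))

# Multiple points: 1=foreground, 0=background
masks, scores, logits = predictor.predict(point_coords=np.array([[500, 375], [200, 300]]),
                                          point_labels=np.array([1, 0]))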

---

Vision Transformers

ViT (Vision Transformer)

Original vision transformer architecture.

Architecture:

Image → Patch Embedding → [CLS] + Position Embedding
                              ↓
                    Transformer Encoder ×L
                              ↓
                         [CLS] token
                              ↓
                     Classification Head

Key details:

Patch size: 16×16 or 14×14 typically
Position embeddings: Learned 1D
[CLS] token for classification
Standard transformer encoder blocks

Variants:

Model	Patch	Layers	Hidden	Heads	Params
ViT-Ti	16	12	192	3	5.7M
ViT-S	16	12	384	6	22M
ViT-B	16	12	768	12	86M
ViT-L	16	24	1024	16	304M
ViT-H	14	32	1280	16	632M
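
The patch size determines the sequence length the transformer has to process. A minimal sketch of that arithmetic, assuming a square 224×224 input and one [CLS] token (the function name is illustrative):

```python
def vit_sequence_length(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Number of tokens fed to the encoder for a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if cls_token else 0)

tokens_b16 = vit_sequence_length(224, 16)   # ViT-B/16: 14 x 14 patches + [CLS]
tokens_h14 = vit_sequence_length(224, 14)   # ViT-H/14: 16 x 16 patches + [CLS]
```

Since global self-attention cost grows with the square of the token count, moving to a smaller patch size or a larger input resolution is much more expensive than the parameter counts in the table suggest.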

DeiT (Data-efficient Image Transformers)

Training ViT without massive datasets.

Key innovations:

Knowledge distillation from CNN teachers
Strong data augmentation
Regularization (stochastic depth, label smoothing)
Distillation token (learns from teacher)

Training recipe:

RandAugment
Mixup (α=0.8)
CutMix (α=1.0)
Random erasing (p=0.25)
Stochastic depth (p=0.1)

Swin Transformer

Hierarchical transformer with shifted windows.

Key innovations:

1. Shifted Window Attention

Local attention within windows
Cross-window connection via shifting
O(n) complexity vs O(n²) for global attention

2. Hierarchical Feature Maps

Patch merging between stages
Similar to CNN feature pyramids
Direct use in detection/segmentation
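
The complexity claim can be checked numerically with the cost formulas given in the Swin Transformer paper for global attention (MSA) and window attention (W-MSA). This is a sketch under the assumption of an h×w token grid with C channels and M×M windows; the example values match Swin-T's first stage.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Global multi-head self-attention over h*w tokens: quadratic in h*w."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h: int, w: int, c: int, m: int = 7) -> int:
    """Window attention with M x M windows: linear in h*w for fixed M."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

h = w = 56   # stage-1 token grid for a 224x224 input with 4x4 patches
c = 96       # stage-1 channel width of Swin-T
ratio = msa_flops(h, w, c) / wmsa_flops(h, w, c)
```

Doubling the resolution quadruples the window-attention cost but multiplies the quadratic term of global attention by sixteen, which is why windowed attention scales to detection-sized inputs.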

Architecture:

Stage 1: 56×56, 96-dim   → Patch Merge
Stage 2: 28×28, 192-dim  → Patch Merge
Stage 3: 14×14, 384-dim  → Patch Merge
Stage 4: 7×7, 768-dim

Variants:

Model	Params	GFLOPs	Top-1
Swin-T	29M	4.5	81.3%
Swin-S	50M	8.7	83.0%
Swin-B	88M	15.4	83.5%
Swin-L	197M	34.5	84.5%

---

Feature Pyramid Networks

FPN variants for multi-scale detection.

Original FPN

Top-down pathway with lateral connections.

P5 ← C5 (1/32)
 ↓
P4 ← C4 + Upsample(P5) (1/16)
 ↓
P3 ← C3 + Upsample(P4) (1/8)
 ↓
P2 ← C2 + Upsample(P3) (1/4)

PANet (Path Aggregation Network)

Bottom-up augmentation after FPN.

FPN top-down → Bottom-up augmentation
P2 → N2 ↘
P3 → N3 → N3 ↘
P4 → N4 → N4 → N4 ↘
P5 → N5 → N5 → N5 → N5

Benefits:

Shorter path from low-level to high-level
Better localization signals
+1-2 mAP improvement

BiFPN (Bidirectional FPN)

Weighted bidirectional feature fusion.

Key innovations:

Learnable fusion weights
Bidirectional cross-scale connections
Repeated blocks for iterative refinement

Fusion formula:

O = Σ(w_i × I_i) / (ε + Σ w_i)

Where the weights w_i are learned, kept non-negative with a ReLU, and ε is a small constant for numerical stability (fast normalized fusion).
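
A minimal NumPy sketch of the fusion formula, assuming the inputs have already been resized to a common shape and that the learnable weights are clamped with a ReLU as in the EfficientDet paper (the function name and ε value are illustrative):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """O = sum(w_i * I_i) / (eps + sum(w_i)), with each w_i clamped to be non-negative."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # ReLU keeps weights >= 0
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused / (eps + w.sum())

a = np.full((2, 2), 1.0)
b = np.full((2, 2), 3.0)
equal = fast_normalized_fusion([a, b], weights=[1.0, 1.0])     # close to the plain average
clamped = fast_normalized_fusion([a, b], weights=[-1.0, 1.0])  # negative weight is ignored
```

This avoids the softmax over weights used by earlier attention-style fusion while still keeping the output a bounded weighted average.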

NAS-FPN

Neural architecture search for FPN design.

Searched on COCO:

7 fusion cells
Optimized connection patterns
3-4 mAP improvement over FPN

---

Architecture Selection

Decision Matrix

Requirement	Recommended	Alternative
Real-time (>30 FPS)	YOLOv8s	RT-DETR-S
Edge (<4GB RAM)	YOLOv8n	MobileNetV3-SSD
High accuracy	DINO, Cascade R-CNN	YOLOv8x
Instance segmentation	Mask R-CNN	YOLOv8-seg
Semantic segmentation	SegFormer	DeepLabV3+
Zero-shot	SAM	CLIP+segmentation
Small objects	YOLO+SAHI	Cascade R-CNN
Video real-time	YOLOv8 + ByteTrack	YOLOX + SORT

Training Data Requirements

Architecture	Minimum Images	Recommended
YOLO (fine-tune)	100-500	1,000-5,000
YOLO (from scratch)	5,000+	10,000+
Faster R-CNN	1,000+	5,000+
DETR/DINO	10,000+	50,000+
ViT backbone	10,000+	100,000+
SAM (fine-tune)	100-1,000	5,000+

Compute Requirements

Architecture	Training GPU	Inference GPU
YOLOv8n	4GB VRAM	2GB VRAM
YOLOv8m	8GB VRAM	4GB VRAM
YOLOv8x	16GB VRAM	8GB VRAM
Faster R-CNN R50	8GB VRAM	4GB VRAM
Mask R-CNN R101	16GB VRAM	8GB VRAM
DINO-4scale	32GB VRAM	16GB VRAM
SAM ViT-H	32GB VRAM	8GB VRAM

---

Code Examples

Load Pretrained Backbone (timm)

import timm
import torch

# List available models
print(timm.list_models('*resnet*'))

# Load pretrained
backbone = timm.create_model('resnet50', pretrained=True, features_only=True)

# Get feature maps
features = backbone(torch.randn(1, 3, 224, 224))
for f in features:
    print(f.shape)
# torch.Size([1, 64, 112, 112])
# torch.Size([1, 256, 56, 56])
# torch.Size([1, 512, 28, 28])
# torch.Size([1, 1024, 14, 14])
# torch.Size([1, 2048, 7, 7])

Custom Detection Backbone

import torch.nn as nn
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class DetectionBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="DEFAULT")  # pretrained=True is deprecated since torchvision 0.13

        self.layer1 = nn.Sequential(backbone.conv1, backbone.bn1,
                                     backbone.relu, backbone.maxpool,
                                     backbone.layer1)
        self.layer2 = backbone.layer2
        self.layer3 = backbone.layer3
        self.layer4 = backbone.layer4

        self.fpn = FeaturePyramidNetwork(
            in_channels_list=[256, 512, 1024, 2048],
            out_channels=256
        )

    def forward(self, x):
        c1 = self.layer1(x)
        c2 = self.layer2(c1)
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)

        features = {'feat0': c1, 'feat1': c2, 'feat2': c3, 'feat3': c4}
        pyramid = self.fpn(features)
        return pyramid

Vision Transformer with Detection Head

import timm
import torch

# Swin Transformer for detection
swin = timm.create_model('swin_base_patch4_window7_224',
                          pretrained=True,
                          features_only=True,
                          out_indices=[0, 1, 2, 3])

# Get multi-scale features (shapes below are shown as NCHW; some timm releases return Swin feature maps as NHWC)
x = torch.randn(1, 3, 224, 224)
features = swin(x)
for i, f in enumerate(features):
    print(f"Stage {i}: {f.shape}")
# Stage 0: torch.Size([1, 128, 56, 56])
# Stage 1: torch.Size([1, 256, 28, 28])
# Stage 2: torch.Size([1, 512, 14, 14])
# Stage 3: torch.Size([1, 1024, 7, 7])

---

Resources

Object Detection Optimization

Comprehensive guide to optimizing object detection models for accuracy and inference speed.

Non-Maximum Suppression
Anchor Design and Optimization
Loss Functions
Training Strategies
Data Augmentation
Model Optimization Techniques
Hyperparameter Tuning

---

Non-Maximum Suppression

NMS removes redundant overlapping detections to produce final predictions.

Standard NMS

Basic algorithm:

1. Sort boxes by confidence score
2. Select highest confidence box
3. Remove boxes with IoU > threshold
4. Repeat until no boxes remain

def nms(boxes, scores, iou_threshold=0.5):
    """
    boxes: (N, 4) in format [x1, y1, x2, y2]
    scores: (N,)
    """
    order = scores.argsort()[::-1]
    keep = []

    while len(order) > 0:
        i = order[0]
        keep.append(i)

        if len(order) == 1:
            break

        # Calculate IoU with remaining boxes
        ious = compute_iou(boxes[i], boxes[order[1:]])

        # Keep boxes with IoU <= threshold
        mask = ious <= iou_threshold
        order = order[1:][mask]

    return keep
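
The `nms` function above calls a `compute_iou` helper that is not shown. A minimal NumPy sketch of such a helper, assuming one reference box of shape (4,) is compared against an (M, 4) array and all boxes use `[x1, y1, x2, y2]`:

```python
import numpy as np

def compute_iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box (4,) and many boxes (M, 4), all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at zero so non-overlapping boxes get zero intersection
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

boxes = np.array([[0, 0, 10, 10], [5, 0, 15, 10], [20, 20, 30, 30]], dtype=float)
ious = compute_iou(boxes[0], boxes[1:])
```

With the default `iou_threshold=0.5`, the half-overlapping second box (IoU 1/3) would survive suppression, which illustrates why the threshold matters for crowded scenes.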

Parameters:

iou_threshold: 0.5-0.7 typical (lower = more suppression)
score_threshold: 0.25-0.5 (filter low-confidence first)

Soft-NMS

Reduces scores instead of removing boxes entirely.

Formula:

score = score * exp(-IoU^2 / sigma)

Benefits:

Better for overlapping objects
+1-2% mAP improvement
Slightly slower than hard NMS

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian penalty soft-NMS (requires `import numpy as np`)"""
    scores = scores.copy()  # avoid mutating the caller's array
    order = scores.argsort()[::-1]
    keep = []

    while len(order) > 0:
        i = order[0]
        keep.append(i)

        if len(order) == 1:
            break

        ious = compute_iou(boxes[i], boxes[order[1:]])

        # Gaussian penalty
        weights = np.exp(-ious**2 / sigma)
        scores[order[1:]] *= weights

        # Re-sort by updated scores
        mask = scores[order[1:]] > score_threshold
        order = order[1:][mask]
        order = order[scores[order].argsort()[::-1]]

    return keep

DIoU-NMS

Uses Distance-IoU instead of standard IoU.

Formula:

DIoU = IoU - (d^2 / c^2)

Where:

d = center distance between boxes
c = diagonal of smallest enclosing box

Benefits:

Better for occluded objects
Less likely to suppress a box whose center is far from the kept box
Works well with DIoU loss
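
A minimal pure-Python sketch of the DIoU formula for two boxes, assuming `[x1, y1, x2, y2]` coordinates. In DIoU-NMS this value replaces plain IoU in the suppression test:

```python
def diou(a, b):
    """Distance-IoU of two [x1, y1, x2, y2] boxes: IoU - d^2 / c^2."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # d^2: squared distance between the two box centers
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return inter / union - d2 / c2

same = diou([0, 0, 10, 10], [0, 0, 10, 10])       # identical boxes
shifted = diou([0, 0, 10, 10], [5, 0, 15, 10])    # IoU 1/3, centers 5 px apart
```

The shifted pair scores below its plain IoU of 1/3 because of the center-distance penalty, so two overlapping boxes with distinct centers are more likely to both be kept.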

Batched NMS

NMS per class (prevents cross-class suppression).

import torchvision

def batched_nms(boxes, scores, classes, iou_threshold):
    """Per-class NMS"""
    # Offset boxes by class ID to prevent cross-class suppression
    max_coordinate = boxes.max()
    offsets = classes * (max_coordinate + 1)
    boxes_for_nms = boxes + offsets[:, None]

    keep = torchvision.ops.nms(boxes_for_nms, scores, iou_threshold)
    return keep

NMS-Free Detection (DETR-style)

Transformer-based detectors eliminate NMS.

How DETR avoids NMS:

Object queries are learned embeddings
Bipartite matching in training
Each query outputs exactly one detection
Set-based loss enforces uniqueness

Benefits:

End-to-end differentiable
No hand-crafted post-processing
Better for complex scenes

---

Anchor Design and Optimization

Anchor-Based Detection

Traditional detectors use predefined anchor boxes.

Anchor parameters:

Scales: [32, 64, 128, 256, 512] pixels
Ratios: [0.5, 1.0, 2.0] (height/width)
Stride: Feature map stride (8, 16, 32)

Anchor assignment:

Positive: IoU > 0.7 with ground truth
Negative: IoU < 0.3 with all ground truths
Ignored: 0.3 < IoU < 0.7
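
These thresholds translate into a simple labeling rule over each anchor's best IoU with any ground-truth box. A minimal sketch, assuming the per-anchor maximum IoU has already been computed; it omits the extra RPN rule that the highest-IoU anchor for each ground truth is always positive:

```python
import numpy as np

def assign_anchor_labels(max_iou_per_anchor: np.ndarray,
                         pos_thresh: float = 0.7,
                         neg_thresh: float = 0.3) -> np.ndarray:
    """Return 1 (positive), 0 (negative), or -1 (ignored) for each anchor."""
    labels = np.full(max_iou_per_anchor.shape, -1, dtype=np.int64)
    labels[max_iou_per_anchor < neg_thresh] = 0
    labels[max_iou_per_anchor > pos_thresh] = 1
    return labels

labels = assign_anchor_labels(np.array([0.05, 0.29, 0.5, 0.71, 0.95]))
```

Ignored anchors contribute nothing to the loss, which keeps ambiguous matches from pushing the classifier in either direction.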

K-Means Anchor Clustering

Optimize anchors for your dataset.

import numpy as np
from sklearn.cluster import KMeans

def optimize_anchors(annotations, num_anchors=9, image_size=640):
    """
    annotations: list of (width, height) for each bounding box
    """
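    # Rescale so the largest box side equals image_size (a rough normalization; assumes boxes come from images of one resolution)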
    boxes = np.array(annotations)
    boxes = boxes / boxes.max() * image_size

    # K-means clustering
    kmeans = KMeans(n_clusters=num_anchors, random_state=42)
    kmeans.fit(boxes)

    # Get anchor sizes
    anchors = kmeans.cluster_centers_

    # Sort by area
    areas = anchors[:, 0] * anchors[:, 1]
    anchors = anchors[np.argsort(areas)]

    # Calculate mean IoU with ground truth
    mean_iou = calculate_anchor_fit(boxes, anchors)
    print(f"Optimized anchors (mean IoU: {mean_iou:.3f}):")
    print(anchors.astype(int))

    return anchors

def calculate_anchor_fit(boxes, anchors):
    """Calculate how well anchors fit the boxes"""
    ious = []
    for box in boxes:
        box_area = box[0] * box[1]
        anchor_areas = anchors[:, 0] * anchors[:, 1]
        intersections = np.minimum(box[0], anchors[:, 0]) * \
                       np.minimum(box[1], anchors[:, 1])
        unions = box_area + anchor_areas - intersections
        max_iou = (intersections / unions).max()
        ious.append(max_iou)
    return np.mean(ious)

Anchor-Free Detection

Modern detectors predict boxes without anchors.

FCOS-style (center-based):

Predict (l, t, r, b) distances from center
Centerness score for quality
Multi-scale assignment

YOLO v8 style:

Predict (l, t, r, b) distances from each anchor point, then decode to a box
Task-aligned assigner
Distribution focal loss for regression

Benefits of anchor-free:

No hyperparameter tuning for anchors
Simpler architecture
Better generalization

Anchor Assignment Strategies

ATSS (Adaptive Training Sample Selection):

1. For each GT, select the k closest anchors per level
2. Calculate IoU for the selected anchors
3. IoU threshold = mean + std of those IoUs
4. Assign positives where IoU > threshold
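
The adaptive threshold in ATSS can be sketched for a single ground-truth box as follows. This is a simplified illustration: it takes the candidate IoUs as given (the k-closest selection is assumed to have happened already), uses the sample standard deviation as an assumption of this sketch, and the function name `atss_positives` is hypothetical.

```python
import statistics


def atss_positives(candidate_ious):
    """
    candidate_ious: IoUs between one GT box and its candidate anchors
                    (the k closest per level, pooled across levels).
    Returns (threshold, indices of candidates kept as positives).
    """
    mean = statistics.mean(candidate_ious)
    std = statistics.stdev(candidate_ious) if len(candidate_ious) > 1 else 0.0
    threshold = mean + std
    positives = [i for i, iou in enumerate(candidate_ious) if iou > threshold]
    return threshold, positives


threshold, positives = atss_positives([0.1, 0.2, 0.3, 0.8])
print(f"threshold={threshold:.3f}, positives={positives}")
```

A GT with one clearly best-fitting candidate produces a high threshold and few positives, while a GT whose candidates all fit similarly produces a threshold close to the mean.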

TAL (Task-Aligned Assigner - YOLO v8):

score = cls_score^alpha * IoU^beta

Where alpha=0.5, beta=6.0 (weights classification and localization)
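
To make the effect of the exponents concrete, here is a small plain-Python sketch of the alignment score and a top-k selection over candidates. The helper names and the example values are illustrative assumptions; the actual assigner also applies spatial constraints that are not shown here.

```python
def task_alignment(cls_score, iou, alpha=0.5, beta=6.0):
    """Alignment metric: cls_score^alpha * IoU^beta."""
    return (cls_score ** alpha) * (iou ** beta)


def topk_candidates(cls_scores, ious, k=2):
    """Indices of the k candidates with the highest alignment metric."""
    metrics = [task_alignment(s, u) for s, u in zip(cls_scores, ious)]
    order = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)
    return order[:k]


# A confident but poorly localized candidate loses to well-localized ones
cls_scores = [0.9, 0.5, 0.3]
ious = [0.3, 0.8, 0.9]
print(topk_candidates(cls_scores, ious, k=2))
```

With beta=6.0 the IoU term dominates, so candidates with good localization are strongly preferred even when their classification score is lower.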

---

Loss Functions

Classification Losses

Cross-Entropy Loss

Standard multi-class classification:

loss = -log(p_correct_class)

Focal Loss

Handles class imbalance by down-weighting easy examples.

import torch
import torch.nn.functional as F

def focal_loss(pred, target, gamma=2.0, alpha=0.25):
    """
    pred: (N, num_classes) raw logits (F.cross_entropy applies softmax)
    target: (N,) ground truth class indices
    """
    ce_loss = F.cross_entropy(pred, target, reduction='none')
    pt = torch.exp(-ce_loss)  # probability of correct class

    # Focal term: (1 - pt)^gamma
    focal_term = (1 - pt) ** gamma

    # Alpha weighting: alpha for foreground, 1 - alpha for background
    # (assumes class index 0 is the background class)
    is_foreground = (target > 0).to(pred.dtype)
    alpha_t = alpha * is_foreground + (1 - alpha) * (1 - is_foreground)

    loss = alpha_t * focal_term * ce_loss
    return loss.mean()

Hyperparameters:

gamma: 2.0 typical, higher = more focus on hard examples
alpha: 0.25 for foreground class weight
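
A quick numeric check shows how the focal term shifts weight toward hard examples. The snippet below is a plain-Python sketch of the modulating factor only; `focal_weight` is a name chosen for this example.

```python
def focal_weight(pt, gamma=2.0):
    """Modulating factor (1 - pt)^gamma for a correct-class probability pt."""
    return (1.0 - pt) ** gamma


easy = focal_weight(0.9)   # well-classified example
hard = focal_weight(0.1)   # badly misclassified example
print(f"easy: {easy:.4f}, hard: {hard:.4f}, ratio: {hard / easy:.0f}x")
```

With gamma=2.0, an example predicted at pt=0.9 is scaled by about 0.01 while one at pt=0.1 is scaled by about 0.81, so the hard example carries roughly 81 times more weight. Setting gamma=0 recovers plain cross-entropy weighting.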

Quality Focal Loss (QFL)

Combines classification with IoU quality.

def quality_focal_loss(pred, target, beta=2.0):
    """
    pred: predicted probabilities after sigmoid, in (0, 1)
    target: IoU values (0-1) instead of binary
    """
    ce = F.binary_cross_entropy(pred, target, reduction='none')
    focal_weight = torch.abs(pred - target) ** beta
    loss = focal_weight * ce
    return loss.mean()

Regression Losses

Smooth L1 Loss

def smooth_l1_loss(pred, target, beta=1.0):
    diff = torch.abs(pred - target)
    loss = torch.where(
        diff < beta,
        0.5 * diff ** 2 / beta,
        diff - 0.5 * beta
    )
    return loss.mean()

IoU-Based Losses

IoU Loss:

L_IoU = 1 - IoU

GIoU (Generalized IoU):

GIoU = IoU - (C - U) / C
L_GIoU = 1 - GIoU

Where C = area of smallest enclosing box, U = union area.
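
The following plain-Python sketch works through the GIoU formula for boxes that do not overlap, which is the case where plain IoU gives no useful signal. It assumes (x1, y1, x2, y2) boxes; the function name `giou` is a placeholder for this example.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    # U: union area
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union

    # C: area of the smallest box enclosing both
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * \
              (max(a[3], b[3]) - min(a[1], b[1]))

    return iou - (enclose - union) / enclose


# Both pairs have IoU = 0, but GIoU still ranks the closer pair higher
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))
print(giou((0, 0, 1, 1), (4, 0, 5, 1)))
```

For the first pair, U = 2 and C = 3, so GIoU = 0 - (3 - 2) / 3 = -1/3. For the second pair, C = 5, so GIoU = -3/5. The loss therefore keeps decreasing as a non-overlapping prediction moves toward the target.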

DIoU (Distance IoU):

DIoU = IoU - d^2 / c^2
L_DIoU = 1 - DIoU

Where d = center distance, c = diagonal of enclosing box.

CIoU (Complete IoU):

CIoU = IoU - d^2 / c^2 - alpha*v
v = (4/pi^2) * (arctan(w_gt/h_gt) - arctan(w/h))^2
alpha = v / (1 - IoU + v)
L_CIoU = 1 - CIoU

Comparison:

Loss	Handles	Best For
L1/L2	Basic regression	Simple tasks
IoU	Overlap	Standard detection
GIoU	Non-overlapping	Distant boxes
DIoU	Center distance	Faster convergence
CIoU	Aspect ratio	Best accuracy

import math

import torch

def ciou_loss(pred_boxes, target_boxes, eps=1e-7):
    """
    pred_boxes, target_boxes: (N, 4) as [x1, y1, x2, y2]
    Returns the per-box loss, shape (N,).
    """
    # Widths and heights
    pred_w = pred_boxes[:, 2] - pred_boxes[:, 0]
    pred_h = pred_boxes[:, 3] - pred_boxes[:, 1]
    target_w = target_boxes[:, 2] - target_boxes[:, 0]
    target_h = target_boxes[:, 3] - target_boxes[:, 1]

    # Standard IoU
    inter_w = (torch.min(pred_boxes[:, 2], target_boxes[:, 2]) -
               torch.max(pred_boxes[:, 0], target_boxes[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred_boxes[:, 3], target_boxes[:, 3]) -
               torch.max(pred_boxes[:, 1], target_boxes[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    union = pred_w * pred_h + target_w * target_h - inter
    iou = inter / (union + eps)

    # Enclosing box diagonal
    enclose_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0])
    enclose_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1])
    enclose_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2])
    enclose_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3])
    c_sq = (enclose_x2 - enclose_x1)**2 + (enclose_y2 - enclose_y1)**2

    # Center distance
    pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2
    pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2
    target_cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2
    target_cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2
    d_sq = (pred_cx - target_cx)**2 + (pred_cy - target_cy)**2

    # Aspect ratio term (eps guards against zero-height boxes)
    v = (4 / math.pi**2) * (
        torch.atan(target_w / (target_h + eps)) - torch.atan(pred_w / (pred_h + eps))
    )**2
    with torch.no_grad():  # alpha is treated as a constant weight
        alpha_term = v / (1 - iou + v + eps)

    ciou = iou - d_sq / (c_sq + eps) - alpha_term * v
    return 1 - ciou

Distribution Focal Loss (DFL)

Used in YOLO v8 for regression.

Concept:

Predict distribution over discrete positions
Each regression target is a soft label
Allows uncertainty estimation

def dfl_loss(pred_dist, target, reg_max=16):
    """
    pred_dist: (N, reg_max) logits over the discrete bins 0..reg_max-1
    target: (N,) continuous target values in [0, reg_max - 1]
    """
    # Keep the target strictly below the last bin so target_right stays in range
    target = target.clamp(min=0, max=reg_max - 1 - 0.01)

    # Convert continuous target to soft label
    target_left = target.floor().long()
    target_right = target_left + 1
    weight_right = target - target_left.float()
    weight_left = 1 - weight_right

    # Cross-entropy with soft targets
    loss_left = F.cross_entropy(pred_dist, target_left, reduction='none')
    loss_right = F.cross_entropy(pred_dist, target_right, reduction='none')

    loss = weight_left * loss_left + weight_right * loss_right
    return loss.mean()

---

Training Strategies

Learning Rate Schedules

Warmup:

# Linear warmup for first N epochs
if epoch < warmup_epochs:
    lr = base_lr * (epoch + 1) / warmup_epochs

Cosine Annealing:

lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total_epochs))

Step Decay:

# Reduce by factor at milestones
lr = base_lr * (0.1 ** (milestones_passed))
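
The three schedules above can be combined into small standalone functions for inspection or plotting. This is a plain-Python sketch under a few assumptions: the cosine phase is applied only to the epochs after warmup, and the default values (`base_lr=0.01`, `lr_min=0.0001`, `warmup_epochs=3`, the milestone epochs) are illustrative choices rather than values prescribed by any framework.

```python
import math


def warmup_cosine_lr(epoch, total_epochs, base_lr=0.01, lr_min=0.0001,
                     warmup_epochs=3):
    """Linear warmup followed by cosine annealing down to lr_min."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1]
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lr_min + 0.5 * (base_lr - lr_min) * (1 + math.cos(math.pi * progress))


def step_decay_lr(epoch, base_lr=0.01, milestones=(60, 90), factor=0.1):
    """Multiply the learning rate by `factor` at each milestone passed."""
    milestones_passed = sum(epoch >= m for m in milestones)
    return base_lr * factor ** milestones_passed


for epoch in (0, 2, 3, 50, 100):
    print(epoch, round(warmup_cosine_lr(epoch, total_epochs=100), 6),
          round(step_decay_lr(epoch), 6))
```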

Recommended schedule for detection:

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=0.0005)

# Linear warmup for the first warmup_epochs
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1,
    total_iters=warmup_epochs
)

# Cosine decay over the remaining epochs
cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=total_epochs - warmup_epochs,
    eta_min=0.0001
)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[warmup_epochs]
)

Exponential Moving Average (EMA)

Smooths model weights for better stability.

class EMA:
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = (
                    self.decay * self.shadow[name] +
                    (1 - self.decay) * param.data
                )

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data.copy_(self.shadow[name])

Usage:

Update EMA after each training step
Use EMA weights for validation/inference
Decay: 0.9999 typical (higher = slower update)
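
To get a feel for what "higher = slower update" means, the sketch below applies the same update rule to a single scalar instead of model weights. It is a toy illustration; `ema_after` is a hypothetical helper, and the constant target value stands in for weights that have stopped changing.

```python
def ema_after(steps, decay, value=1.0, start=0.0):
    """Run `steps` EMA updates toward a constant `value` and return the shadow."""
    shadow = start
    for _ in range(steps):
        shadow = decay * shadow + (1 - decay) * value
    return shadow


# Fraction of the gap closed after N updates is 1 - decay^N
print(ema_after(10, decay=0.9))
print(ema_after(10_000, decay=0.9999))
```

With decay=0.9999 the shadow weights cover only about 63% of the distance to the current weights after 10,000 updates, so short training runs may call for a smaller decay or a decay warmup.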

Multi-Scale Training

Train with varying input sizes.

# Random size each batch
sizes = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768]
input_size = random.choice(sizes)

# Resize batch to selected size
images = F.interpolate(images, size=input_size, mode='bilinear')

Benefits:

Better scale invariance
Typically +1-2% mAP, depending on the dataset

Trade-off:

Slower training (the input size, and therefore memory use and throughput, changes every batch)

Gradient Accumulation

Simulate larger batch sizes.

accumulation_steps = 4
optimizer.zero_grad()

for i, (images, targets) in enumerate(dataloader):
    loss = model(images, targets) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Mixed Precision Training

Use FP16 for speed and memory.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for images, targets in dataloader:
    optimizer.zero_grad()

    with autocast():
        loss = model(images, targets)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Benefits:

Up to 2-3x faster training on GPUs with Tensor Cores
Up to ~50% lower activation memory
Minimal accuracy loss

---

Data Augmentation

Geometric Augmentations

import albumentations as A

geometric = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.3),
    A.RandomScale(scale_limit=0.2, p=0.5),
    A.Affine(translate_percent={'x': (-0.1, 0.1), 'y': (-0.1, 0.1)}, p=0.3),
], bbox_params=A.BboxParams(format='coco', label_fields=['class_labels']))

Color Augmentations

color = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.5),
    A.CLAHE(clip_limit=2.0, p=0.1),
    A.GaussianBlur(blur_limit=3, p=0.1),
    A.GaussNoise(var_limit=(10, 50), p=0.1),
])

Mosaic Augmentation

Combines 4 images into one (YOLO-style).

def mosaic_augmentation(images, labels, input_size=640):
    """
    images: list of 4 images
    labels: list of 4 label arrays
    """
    result_image = np.zeros((input_size, input_size, 3), dtype=np.uint8)
    result_labels = []

    # Random center point
    cx = int(random.uniform(input_size * 0.25, input_size * 0.75))
    cy = int(random.uniform(input_size * 0.25, input_size * 0.75))

    positions = [
        (0, 0, cx, cy),           # top-left
        (cx, 0, input_size, cy),  # top-right
        (0, cy, cx, input_size),  # bottom-left
        (cx, cy, input_size, input_size),  # bottom-right
    ]

    for i, (x1, y1, x2, y2) in enumerate(positions):
        img = images[i]
        h, w = y2 - y1, x2 - x1

        # Resize and place
        img_resized = cv2.resize(img, (w, h))
        result_image[y1:y2, x1:x2] = img_resized

        # Transform labels
        for label in labels[i]:
            # Scale and shift bounding boxes
            new_label = transform_bbox(label, img.shape, (h, w), (x1, y1))
            result_labels.append(new_label)

    return result_image, result_labels

MixUp

Blends two images and labels.

def mixup(image1, labels1, image2, labels2, alpha=0.5):
    """
    alpha: mixing ratio (0.5 = equal blend)
    """
    # Blend images
    mixed_image = (alpha * image1 + (1 - alpha) * image2).astype(np.uint8)

    # Blend labels with soft weights
    labels1_weighted = [(box, cls, alpha) for box, cls in labels1]
    labels2_weighted = [(box, cls, 1-alpha) for box, cls in labels2]

    mixed_labels = labels1_weighted + labels2_weighted
    return mixed_image, mixed_labels

Copy-Paste Augmentation

Paste objects from one image to another.

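def copy_paste(background, bg_labels, source, src_labels, src_masks):
    """
    Paste segmented objects onto background.

    src_masks: list of (H, W) binary masks aligned with source
    src_labels: class label for each mask
    Labels are returned as (x1, y1, x2, y2, cls) tuples (format assumed here).
    """
    result = background.copy()
    labels = list(bg_labels)
    bg_h, bg_w = background.shape[:2]

    for mask, label in zip(src_masks, src_labels):
        # Crop the object to its tight bounding box
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            continue
        y1, y2, x1, x2 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        obj, obj_mask = source[y1:y2, x1:x2], mask[y1:y2, x1:x2] > 0
        h, w = obj_mask.shape
        if h > bg_h or w > bg_w:
            continue  # object does not fit on the background

        # Random position
        x_offset = random.randint(0, bg_w - w)
        y_offset = random.randint(0, bg_h - h)

        # Paste with mask
        region = result[y_offset:y_offset + h, x_offset:x_offset + w]
        region[obj_mask] = obj[obj_mask]

        # Add new label
        labels.append((x_offset, y_offset, x_offset + w, y_offset + h, label))

    return result, labels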

Cutout / Random Erasing

Randomly erase patches.

def cutout(image, num_holes=8, max_h_size=32, max_w_size=32):
    h, w = image.shape[:2]
    result = image.copy()

    for _ in range(num_holes):
        y = random.randint(0, h)
        x = random.randint(0, w)
        h_size = random.randint(1, max_h_size)
        w_size = random.randint(1, max_w_size)

        y1, y2 = max(0, y - h_size // 2), min(h, y + h_size // 2)
        x1, x2 = max(0, x - w_size // 2), min(w, x + w_size // 2)

        result[y1:y2, x1:x2] = 0  # or random color

    return result

---

Model Optimization Techniques

Pruning

Remove unimportant weights.

Magnitude Pruning:

import torch.nn.utils.prune as prune

# Prune 30% of weights with smallest magnitude
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.3)

Structured Pruning (channels):

# Prune entire channels
prune.ln_structured(module, name='weight', amount=0.3, n=2, dim=0)

Knowledge Distillation

Train smaller model with larger teacher.

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """
    Combine soft targets from teacher with hard labels
    """
    # Soft targets
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss *= temperature ** 2  # Scale by T^2

    # Hard targets
    hard_loss = F.cross_entropy(student_logits, labels)

    # Combined loss
    return alpha * soft_loss + (1 - alpha) * hard_loss

Quantization

Reduce precision for faster inference.

Post-Training Quantization:

import torch.quantization

# Prepare model
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate with representative data
with torch.no_grad():
    for images in calibration_loader:
        model(images)

# Convert to quantized model
torch.quantization.convert(model, inplace=True)

Quantization-Aware Training:

# Insert fake quantization during training
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Train with fake quantization
for epoch in range(num_epochs):
    train(model_prepared)

# Convert to quantized (switch to eval mode first)
model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)

---

Hyperparameter Tuning

Key Hyperparameters

Parameter	Range	Default	Impact
Learning rate	1e-4 to 1e-1	0.01	Critical
Batch size	4 to 64	16	Memory/speed
Weight decay	1e-5 to 1e-3	5e-4	Regularization
Momentum	0.9 to 0.99	0.937	Optimization
Warmup epochs	1 to 10	3	Stability
IoU threshold (NMS)	0.4 to 0.7	0.5	Recall/precision
Confidence threshold	0.1 to 0.5	0.25	Detection count
Image size	320 to 1280	640	Accuracy/speed

Tuning Strategy

1. Baseline: Use default hyperparameters
2. Learning rate: Grid search [1e-3, 5e-3, 1e-2, 5e-2]
3. Batch size: Maximum that fits in memory
4. Augmentation: Start minimal, add progressively
5. Epochs: Train until validation loss plateaus
6. NMS threshold: Tune on validation set

Automated Hyperparameter Optimization

import optuna

def objective(trial):
    # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-3, log=True)
    mosaic_prob = trial.suggest_float('mosaic_prob', 0.0, 1.0)

    model = create_model()
    train_model(model, lr=lr, weight_decay=weight_decay, mosaic_prob=mosaic_prob)
    mAP = test_model(model)

    return mAP

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best params: {study.best_params}")
print(f"Best mAP: {study.best_value}")

---

Detection-Specific Tips

Small Object Detection

1. Higher resolution: 1280px instead of 640px
2. SAHI (Slicing): Inference on overlapping tiles
3. More FPN levels: P2 level (1/4 scale)
4. Anchor adjustment: Smaller anchors for small objects
5. Copy-paste augmentation: Increase small object frequency
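
The slicing idea can be sketched without any library: compute overlapping tile windows, run the detector on each, and shift the resulting boxes back by the tile origin. The snippet below shows only the window computation in plain Python. The function names, the 640px tile size, and the 20% overlap are assumptions for illustration and do not reproduce the SAHI library's API.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles that cover [0, length) with fractional overlap."""
    if length <= tile:
        return [0]
    stride = max(1, int(round(tile * (1 - overlap))))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the edge
    return starts


def slice_boxes(width, height, tile=640, overlap=0.2):
    """Tile windows as (x1, y1, x2, y2), row by row."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


tiles = slice_boxes(1280, 1280)
print(len(tiles), tiles[0], tiles[-1])
```

Detections from all tiles are usually merged with NMS afterwards, since objects inside the overlap regions are detected more than once.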

Handling Class Imbalance

1. Focal loss: gamma=2.0, alpha=0.25
2. Over-sampling: Repeat rare class images
3. Class weights: Inverse frequency weighting
4. Copy-paste: Augment rare classes
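
Inverse frequency weighting can be computed directly from the label counts. The sketch below uses the common "balanced" normalization, total / (num_classes * count), which is one reasonable choice among several; the function name and the toy labels are assumptions for this example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / frequency, averaging to ~1."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


labels = ["car"] * 8 + ["bike"] * 2
print(inverse_frequency_weights(labels))
```

The resulting weights can be passed to a weighted classification loss, or used as sampling probabilities when over-sampling images that contain rare classes.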

Improving Localization

1. CIoU loss: Includes aspect ratio term
2. Cascade detection: Progressive refinement
3. Higher IoU threshold: 0.6-0.7 for positive samples
4. Deformable convolutions: Learn spatial offsets

Reducing False Positives

1. Higher confidence threshold: 0.4-0.5
2. More negative samples: Hard negative mining
3. Background class weight: Increase penalty
4. Ensemble: Multiple model voting

---

Resources

Production Vision Systems

Comprehensive guide to deploying computer vision models in production environments.

Model Export and Optimization
TensorRT Deployment
ONNX Runtime Deployment
Edge Device Deployment
Model Serving
Video Processing Pipelines
Monitoring and Observability
Scaling and Performance

---

Model Export and Optimization

PyTorch to ONNX Export

Basic export:

import torch
import torch.onnx

def export_to_onnx(model, input_shape, output_path, dynamic_batch=True):
    """
    Export PyTorch model to ONNX format.

    Args:
        model: PyTorch model
        input_shape: (C, H, W) input dimensions
        output_path: Path to save .onnx file
        dynamic_batch: Allow variable batch sizes
    """
    model.eval()

    # Create dummy input
    dummy_input = torch.randn(1, *input_shape)

    # Dynamic axes for variable batch size
    dynamic_axes = None
    if dynamic_batch:
        dynamic_axes = {
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }

    # Export
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes=dynamic_axes
    )

    print(f"Exported to {output_path}")
    return output_path

ONNX Model Optimization

Simplify and optimize ONNX graph:

import onnx
from onnxsim import simplify

def optimize_onnx(input_path, output_path):
    """
    Simplify ONNX model for faster inference.
    """
    # Load model
    model = onnx.load(input_path)

    # Check validity
    onnx.checker.check_model(model)

    # Simplify
    model_simplified, check = simplify(model)

    if check:
        onnx.save(model_simplified, output_path)
        print(f"Simplified model saved to {output_path}")

        # Print size reduction
        import os
        original_size = os.path.getsize(input_path) / 1024 / 1024
        simplified_size = os.path.getsize(output_path) / 1024 / 1024
        print(f"Size: {original_size:.2f}MB -> {simplified_size:.2f}MB")
    else:
        print("Simplification failed, saving original")
        onnx.save(model, output_path)

    return output_path

Model Size Analysis

def analyze_model(model_path):
    """
    Analyze ONNX model structure and size.
    """
    model = onnx.load(model_path)

    # Count parameters
    total_params = 0
    param_sizes = {}

    for initializer in model.graph.initializer:
        param_count = 1
        for dim in initializer.dims:
            param_count *= dim
        total_params += param_count
        param_sizes[initializer.name] = param_count

    # Print summary
    print(f"Total parameters: {total_params:,}")
    print(f"Model size: {total_params * 4 / 1024 / 1024:.2f} MB (FP32)")
    print(f"Model size: {total_params * 2 / 1024 / 1024:.2f} MB (FP16)")
    print(f"Model size: {total_params / 1024 / 1024:.2f} MB (INT8)")

    # Top 10 largest layers
    print("\nLargest layers:")
    sorted_params = sorted(param_sizes.items(), key=lambda x: x[1], reverse=True)
    for name, size in sorted_params[:10]:
        print(f"  {name}: {size:,} params")

    return total_params

---

TensorRT Deployment

TensorRT Engine Build

import tensorrt as trt

def build_tensorrt_engine(onnx_path, engine_path, precision='fp16',
                          max_batch_size=8, workspace_gb=4):
    """
    Build TensorRT engine from ONNX model.

    Args:
        onnx_path: Path to ONNX model
        engine_path: Path to save TensorRT engine
        precision: 'fp32', 'fp16', or 'int8'
        max_batch_size: Maximum batch size
        workspace_gb: GPU memory workspace in GB
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    # Parse ONNX
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise RuntimeError("ONNX parsing failed")

    # Configure builder
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,
                                  workspace_gb * 1024 * 1024 * 1024)

    # Set precision
    if precision == 'fp16':
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == 'int8':
        config.set_flag(trt.BuilderFlag.INT8)
        # Requires calibrator for INT8

    # Set optimization profile for dynamic shapes
    profile = builder.create_optimization_profile()
    input_name = network.get_input(0).name
    input_shape = network.get_input(0).shape

    # Min, optimal, max batch sizes
    min_shape = (1,) + tuple(input_shape[1:])
    opt_shape = (max(1, max_batch_size // 2),) + tuple(input_shape[1:])
    max_shape = (max_batch_size,) + tuple(input_shape[1:])

    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
    config.add_optimization_profile(profile)

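    # Build engine (returns None on failure)
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("TensorRT engine build failed")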

    # Save engine
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)

    print(f"TensorRT engine saved to {engine_path}")
    return engine_path

TensorRT Inference

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA context on import
import tensorrt as trt

class TensorRTInference:
    def __init__(self, engine_path):
        """
        Load TensorRT engine and prepare for inference.
        """
        self.logger = trt.Logger(trt.Logger.WARNING)

        # Load engine
        with open(engine_path, 'rb') as f:
            engine_data = f.read()

        runtime = trt.Runtime(self.logger)
        self.engine = runtime.deserialize_cuda_engine(engine_data)
        self.context = self.engine.create_execution_context()

        # Allocate buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self.stream = cuda.Stream()

        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            shape = self.engine.get_tensor_shape(name)
            size = trt.volume(shape)

            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)

            self.bindings.append(int(device_mem))

            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
                self.inputs.append({'host': host_mem, 'device': device_mem,
                                   'shape': shape, 'name': name})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem,
                                    'shape': shape, 'name': name})

    def infer(self, input_data):
        """
        Run inference on input data.

        Args:
            input_data: numpy array (batch, C, H, W)

        Returns:
            Output numpy array
        """
        # Copy input to host buffer
        np.copyto(self.inputs[0]['host'], input_data.ravel())

        # Transfer input to device
        cuda.memcpy_htod_async(
            self.inputs[0]['device'],
            self.inputs[0]['host'],
            self.stream
        )

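        # Run inference (name-based tensor API, TensorRT 8.5+;
        # execute_async_v2 was removed in TensorRT 10)
        for tensor in self.inputs + self.outputs:
            self.context.set_tensor_address(tensor['name'], int(tensor['device']))
        self.context.execute_async_v3(stream_handle=self.stream.handle)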

        # Transfer output from device
        cuda.memcpy_dtoh_async(
            self.outputs[0]['host'],
            self.outputs[0]['device'],
            self.stream
        )

        # Synchronize
        self.stream.synchronize()

        # Reshape output
        output = self.outputs[0]['host'].reshape(self.outputs[0]['shape'])
        return output

INT8 Calibration

import os

class Int8Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calibration_data, cache_file, batch_size=8):
        """
        INT8 calibrator for TensorRT.

        Args:
            calibration_data: List of numpy arrays
            cache_file: Path to save calibration cache
            batch_size: Calibration batch size
        """
        super().__init__()
        self.calibration_data = calibration_data
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.current_index = 0

        # Allocate device buffer
        self.device_input = cuda.mem_alloc(
            calibration_data[0].nbytes * batch_size
        )

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current_index + self.batch_size > len(self.calibration_data):
            return None

        # Get batch
        batch = self.calibration_data[
            self.current_index:self.current_index + self.batch_size
        ]
        batch = np.stack(batch, axis=0)

        # Copy to device
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size

        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

---

ONNX Runtime Deployment

Basic ONNX Runtime Inference

import onnxruntime as ort

class ONNXInference:
    def __init__(self, model_path, device='cuda'):
        """
        Initialize ONNX Runtime session.

        Args:
            model_path: Path to ONNX model
            device: 'cuda' or 'cpu'
        """
        # Set execution providers
        if device == 'cuda':
            providers = [
                ('CUDAExecutionProvider', {
                    'device_id': 0,
                    'arena_extend_strategy': 'kNextPowerOfTwo',
                    'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4GB
                    'cudnn_conv_algo_search': 'EXHAUSTIVE',
                }),
                'CPUExecutionProvider'
            ]
        else:
            providers = ['CPUExecutionProvider']

        # Session options
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4

        # Create session
        self.session = ort.InferenceSession(
            model_path,
            sess_options=sess_options,
            providers=providers
        )

        # Get input/output info
        self.input_name = self.session.get_inputs()[0].name
        self.input_shape = self.session.get_inputs()[0].shape
        self.output_name = self.session.get_outputs()[0].name

        print(f"Loaded model: {model_path}")
        print(f"Input: {self.input_name} {self.input_shape}")
        print(f"Provider: {self.session.get_providers()[0]}")

    def infer(self, input_data):
        """
        Run inference.

        Args:
            input_data: numpy array (batch, C, H, W)

        Returns:
            Model output
        """
        outputs = self.session.run(
            [self.output_name],
            {self.input_name: input_data.astype(np.float32)}
        )
        return outputs[0]

    def benchmark(self, input_shape, num_iterations=100, warmup=10):
        """
        Benchmark inference speed.
        """
        import time

        dummy_input = np.random.randn(*input_shape).astype(np.float32)

        # Warmup
        for _ in range(warmup):
            self.infer(dummy_input)

        # Benchmark
        start = time.perf_counter()
        for _ in range(num_iterations):
            self.infer(dummy_input)
        end = time.perf_counter()

        avg_time = (end - start) / num_iterations * 1000
        fps = 1000 / avg_time * input_shape[0]

        print(f"Average latency: {avg_time:.2f}ms")
        print(f"Throughput: {fps:.1f} images/sec")

        return avg_time, fps

---

Edge Device Deployment

NVIDIA Jetson Optimization

def optimize_for_jetson(model_path, output_path, jetson_model='orin'):
    """
    Optimize model for NVIDIA Jetson deployment.

    Args:
        model_path: Path to ONNX model
        output_path: Path to save optimized engine
        jetson_model: 'nano', 'xavier', 'orin'
    """
    # Jetson-specific configurations
    configs = {
        'nano': {'precision': 'fp16', 'workspace': 1, 'dla': False},
        'xavier': {'precision': 'fp16', 'workspace': 2, 'dla': True},
        'orin': {'precision': 'int8', 'workspace': 4, 'dla': True},
    }

    config = configs[jetson_model]

    # Build engine with Jetson-optimized settings
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

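    with open(model_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")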

    builder_config = builder.create_builder_config()
    builder_config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE,
        config['workspace'] * 1024 * 1024 * 1024
    )

    if config['precision'] == 'fp16':
        builder_config.set_flag(trt.BuilderFlag.FP16)
    elif config['precision'] == 'int8':
        builder_config.set_flag(trt.BuilderFlag.INT8)
        # INT8 also needs a calibrator (or Q/DQ nodes in the ONNX graph)

    # Enable DLA if supported
    if config['dla'] and builder.num_DLA_cores > 0:
        builder_config.default_device_type = trt.DeviceType.DLA
        builder_config.DLA_core = 0
        builder_config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

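    # Build and save (build returns None on failure)
    serialized = builder.build_serialized_network(network, builder_config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")
    with open(output_path, 'wb') as f:
        f.write(serialized)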

    print(f"Jetson-optimized engine saved to {output_path}")

OpenVINO for Intel Devices

from openvino.runtime import Core

class OpenVINOInference:
    def __init__(self, model_path, device='CPU'):
        """
        Initialize OpenVINO inference.

        Args:
            model_path: Path to ONNX or OpenVINO IR model
            device: 'CPU', 'GPU', or another installed device plugin
                    ('MYRIAD' for Intel NCS only exists in OpenVINO 2022.3 and earlier)
        """
        self.core = Core()

        # Load and compile model
        self.model = self.core.read_model(model_path)
        self.compiled = self.core.compile_model(self.model, device)

        # Get input/output info
        self.input_layer = self.compiled.input(0)
        self.output_layer = self.compiled.output(0)

        print(f"Loaded model on {device}")
        print(f"Input shape: {self.input_layer.shape}")

    def infer(self, input_data):
        """
        Run inference.
        """
        result = self.compiled([input_data])
        return result[self.output_layer]

    def benchmark(self, input_shape, num_iterations=100):
        """
        Benchmark inference speed.
        """
        import time

        dummy = np.random.randn(*input_shape).astype(np.float32)

        # Warmup
        for _ in range(10):
            self.infer(dummy)

        # Benchmark
        start = time.perf_counter()
        for _ in range(num_iterations):
            self.infer(dummy)
        elapsed = time.perf_counter() - start

        latency = elapsed / num_iterations * 1000
        print(f"Latency: {latency:.2f}ms")
        return latency


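def convert_to_openvino(onnx_path, output_dir, precision='FP16'):
    """
    Convert ONNX to OpenVINO IR format.
    """
    import openvino as ov

    # convert_model returns an in-memory model; save_model writes the .xml/.bin pair
    # (both functions are available in OpenVINO 2023.1 and later)
    ov_model = ov.convert_model(onnx_path)
    ov.save_model(
        ov_model,
        f"{output_dir}/model.xml",
        compress_to_fp16=(precision == 'FP16')
    )
    print(f"Converted to OpenVINO IR at {output_dir}")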

CoreML for Apple Silicon

import coremltools as ct
import torch

def convert_to_coreml(model_or_path, output_path, compute_units='ALL'):
    """
    Convert to CoreML for Apple devices.

    Args:
        model_or_path: PyTorch model or ONNX path
        output_path: Path to save .mlpackage
        compute_units: 'ALL', 'CPU_AND_GPU', 'CPU_AND_NE'
    """
    # Map compute units
    units_map = {
        'ALL': ct.ComputeUnit.ALL,
        'CPU_AND_GPU': ct.ComputeUnit.CPU_AND_GPU,
        'CPU_AND_NE': ct.ComputeUnit.CPU_AND_NE,  # Neural Engine
    }

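    # Convert from ONNX
    # Note: recent coremltools releases no longer accept ONNX input, so this branch
    # only works on older releases. Prefer the PyTorch path below where possible.
    if isinstance(model_or_path, str) and model_or_path.endswith('.onnx'):
        mlmodel = ct.convert(
            model_or_path,
            compute_units=units_map[compute_units],
            minimum_deployment_target=ct.target.macOS13  # or iOS16
        )
    else:
        # Convert from PyTorch (trace in inference mode so dropout/batch norm are frozen)
        model_or_path.eval()
        traced = torch.jit.trace(model_or_path, torch.randn(1, 3, 640, 640))
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(shape=(1, 3, 640, 640))],
            compute_units=units_map[compute_units],
            convert_to='mlprogram',  # ML Program format, needed for .mlpackage output
        )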

    mlmodel.save(output_path)
    print(f"CoreML model saved to {output_path}")

---

Model Serving

Triton Inference Server

Configuration file (config.pbtxt):

name: "yolov8"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]

output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
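
The output dims in the configuration above follow from the model's head layout. As a rough check, assuming an anchor-free YOLOv8-style detection head trained on 80 classes with strides 8, 16, and 32 (none of which the config states explicitly), the shape can be derived with a small hypothetical helper:

```python
def yolo_output_dims(input_size=640, num_classes=80, strides=(8, 16, 32)):
    """Return (channels, num_predictions) for a YOLOv8-style detection head.

    Assumes one prediction per grid cell at each stride, with 4 box
    coordinates followed by one score per class.
    """
    channels = 4 + num_classes
    num_predictions = sum((input_size // s) ** 2 for s in strides)
    return channels, num_predictions


print(yolo_output_dims())  # (84, 8400)
```

A model fine-tuned on a different number of classes or exported at a different input size needs the output dims in config.pbtxt adjusted to match.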

Triton client:

import tritonclient.http as httpclient

class TritonClient:
    def __init__(self, url='localhost:8000', model_name='yolov8'):
        self.client = httpclient.InferenceServerClient(url=url)
        self.model_name = model_name

        # Check model is ready
        if not self.client.is_model_ready(model_name):
            raise RuntimeError(f"Model {model_name} is not ready")

    def infer(self, images):
        """
        Send inference request to Triton.

        Args:
            images: numpy array (batch, C, H, W)
        """
        # Create input
        inputs = [
            httpclient.InferInput("images", images.shape, "FP32")
        ]
        inputs[0].set_data_from_numpy(images)

        # Create output request
        outputs = [
            httpclient.InferRequestedOutput("output0")
        ]

        # Send request
        response = self.client.infer(
            model_name=self.model_name,
            inputs=inputs,
            outputs=outputs
        )

        return response.as_numpy("output0")

TorchServe Deployment

Model handler (handler.py):

from ts.torch_handler.base_handler import BaseHandler
import torch
import cv2
import numpy as np

class YOLOHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.input_size = 640
        self.conf_threshold = 0.25
        self.iou_threshold = 0.45

    def preprocess(self, data):
        """Preprocess input images."""
        images = []
        for row in data:
            image = row.get("data") or row.get("body")

            if isinstance(image, (bytes, bytearray)):
                image = np.frombuffer(image, dtype=np.uint8)
                image = cv2.imdecode(image, cv2.IMREAD_COLOR)

            # Resize and normalize
            image = cv2.resize(image, (self.input_size, self.input_size))
            image = image.astype(np.float32) / 255.0
            image = np.transpose(image, (2, 0, 1))
            images.append(image)

        # Move the batch to the device BaseHandler selected during initialization
        return torch.from_numpy(np.stack(images)).to(self.device)

    def inference(self, data):
        """Run model inference."""
        with torch.no_grad():
            outputs = self.model(data)
        return outputs

    def postprocess(self, outputs):
        """Postprocess model outputs."""
        results = []
        for output in outputs:
            # Apply NMS and format results
            # (self._nms is not defined in this listing and must be supplied by the handler)
            detections = self._nms(output, self.conf_threshold, self.iou_threshold)
            results.append(detections.tolist())
        return results
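
The handler above calls an NMS helper without showing it. A minimal sketch of greedy, class-agnostic NMS in plain Python follows. The function names and the row layout `(x1, y1, x2, y2, score)` are assumptions for illustration, not the handler's actual implementation; a production handler would more likely use a vectorized routine such as `torchvision.ops.nms`.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_threshold=0.25, iou_threshold=0.45):
    """Greedy class-agnostic NMS over rows of (x1, y1, x2, y2, score)."""
    # Drop low-confidence rows, then visit the rest from highest score down
    candidates = sorted(
        (d for d in detections if d[4] >= conf_threshold),
        key=lambda d: d[4],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap any already-kept box too much
        if all(iou(det[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

For example, two heavily overlapping boxes with scores 0.9 and 0.8 collapse to the 0.9 box, while a distant third box is kept.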

TorchServe configuration (config.properties):

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=4
job_queue_size=100
model_store=/opt/ml/model
load_models=yolov8.mar

FastAPI Serving

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
import uvicorn
import numpy as np
import cv2

app = FastAPI(title="YOLO Detection API")

# Global model
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = ONNXInference("models/yolov8m.onnx", device='cuda')

@app.post("/detect")
async def detect(file: UploadFile = File(...), conf: float = 0.25):
    """
    Detect objects in uploaded image.
    """
    # Read image
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
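    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if image is None:
        # imdecode returns None for empty or non-image uploads
        return JSONResponse({"error": "Could not decode image"}, status_code=400)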

    # Preprocess
    input_image = preprocess_image(image, 640)

    # Inference
    outputs = model.infer(input_image)

    # Postprocess
    detections = postprocess_detections(outputs, conf, 0.45)

    return JSONResponse({
        "detections": detections,
        "image_size": list(image.shape[:2])
    })

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

---

Video Processing Pipelines

Real-Time Video Detection

import cv2
import numpy as np
import time
from collections import deque

class VideoDetector:
    def __init__(self, model, conf_threshold=0.25, track=True):
        self.model = model
        self.conf_threshold = conf_threshold
        self.track = track
        self.tracker = ByteTrack() if track else None
        self.fps_buffer = deque(maxlen=30)

    def process_video(self, source, output_path=None, show=True):
        """
        Process video stream with detection.

        Args:
            source: Video file path, camera index, or RTSP URL
            output_path: Path to save output video
            show: Display results in window
        """
        cap = cv2.VideoCapture(source)

        if output_path:
            fourcc = cv2.VideoWriter_fourcc(*'mp4v')
            fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # some cameras and streams report 0
            width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
            height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
            writer = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

        frame_count = 0
        start_time = time.time()

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Inference
            t0 = time.perf_counter()
            detections = self._detect(frame)

            # Tracking
            if self.track and len(detections) > 0:
                detections = self.tracker.update(detections)

            # Calculate FPS
            inference_time = time.perf_counter() - t0
            self.fps_buffer.append(1 / inference_time)
            avg_fps = sum(self.fps_buffer) / len(self.fps_buffer)

            # Draw results
            frame = self._draw_detections(frame, detections, avg_fps)

            # Output
            if output_path:
                writer.write(frame)

            if show:
                cv2.imshow('Detection', frame)
                if cv2.waitKey(1) == ord('q'):
                    break

            frame_count += 1

        # Cleanup
        cap.release()
        if output_path:
            writer.release()
        cv2.destroyAllWindows()

        # Print statistics
        total_time = time.time() - start_time
        print(f"Processed {frame_count} frames in {total_time:.1f}s")
        print(f"Average FPS: {frame_count / total_time:.1f}")

    def _detect(self, frame):
        """Run detection on single frame."""
        # Preprocess
        input_tensor = self._preprocess(frame)

        # Inference
        outputs = self.model.infer(input_tensor)

        # Postprocess
        detections = self._postprocess(outputs, frame.shape[:2])
        return detections

    def _preprocess(self, frame):
        """Preprocess frame for model input."""
        # Resize
        input_size = 640
        image = cv2.resize(frame, (input_size, input_size))

        # Normalize and transpose
        image = image.astype(np.float32) / 255.0
        image = np.transpose(image, (2, 0, 1))
        image = np.expand_dims(image, axis=0)

        return image

    def _draw_detections(self, frame, detections, fps):
        """Draw detections on frame."""
        for det in detections:
            x1, y1, x2, y2 = det['bbox']
            cls = det['class']
            conf = det['confidence']
            track_id = det.get('track_id', None)

            # Draw box
            color = self._get_color(cls)
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)

            # Draw label
            label = f"{cls}: {conf:.2f}"
            if track_id is not None:  # track ID 0 is valid
                label = f"ID:{track_id} {label}"

            cv2.putText(frame, label, (int(x1), int(y1) - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

        # Draw FPS
        cv2.putText(frame, f"FPS: {fps:.1f}", (10, 30),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        return frame
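
The `_postprocess` step referenced above receives the original frame shape because `_preprocess` stretches every frame to 640x640, so predicted boxes have to be mapped back to frame pixels. A minimal sketch of that mapping, assuming plain (non-letterboxed) resizing as in `_preprocess` and a hypothetical helper name:

```python
def scale_box_to_frame(box, frame_shape, input_size=640):
    """Map (x1, y1, x2, y2) from model input space to original frame pixels.

    frame_shape is (height, width), as returned by frame.shape[:2].
    """
    frame_h, frame_w = frame_shape
    sx = frame_w / input_size  # horizontal stretch applied by cv2.resize
    sy = frame_h / input_size  # vertical stretch applied by cv2.resize

    def clamp(value, limit):
        # Keep coordinates inside the frame
        return max(0.0, min(float(limit), value))

    x1, y1, x2, y2 = box
    return (
        clamp(x1 * sx, frame_w),
        clamp(y1 * sy, frame_h),
        clamp(x2 * sx, frame_w),
        clamp(y2 * sy, frame_h),
    )
```

If preprocessing is switched to aspect-preserving letterboxing, the padding offsets would also need to be subtracted before scaling.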

Batch Video Processing

import concurrent.futures
from pathlib import Path

def process_videos_batch(video_paths, model, output_dir, max_workers=4):
    """
    Process multiple videos in parallel.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    def process_single(video_path):
        detector = VideoDetector(model)
        output_path = output_dir / f"{Path(video_path).stem}_detected.mp4"
        detector.process_video(video_path, str(output_path), show=False)
        return output_path

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_single, vp): vp for vp in video_paths}

        for future in concurrent.futures.as_completed(futures):
            video_path = futures[future]
            try:
                output_path = future.result()
                print(f"Completed: {video_path} -> {output_path}")
            except Exception as e:
                print(f"Failed: {video_path} - {e}")

---

Monitoring and Observability

Prometheus Metrics

import time

import torch
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
INFERENCE_COUNT = Counter(
    'model_inference_total',
    'Total number of inferences',
    ['model_name', 'status']
)

INFERENCE_LATENCY = Histogram(
    'model_inference_latency_seconds',
    'Inference latency in seconds',
    ['model_name'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

GPU_MEMORY = Gauge(
    'gpu_memory_used_bytes',
    'GPU memory usage in bytes',
    ['device']
)

DETECTIONS_COUNT = Counter(
    'detections_total',
    'Total detections by class',
    ['model_name', 'class_name']
)

class MetricsWrapper:
    def __init__(self, model, model_name='yolov8'):
        self.model = model
        self.model_name = model_name

    def infer(self, input_data):
        """Inference with metrics."""
        start_time = time.perf_counter()

        try:
            result = self.model.infer(input_data)
            INFERENCE_COUNT.labels(self.model_name, 'success').inc()

            # Count detections by class
            for det in result:
                DETECTIONS_COUNT.labels(self.model_name, det['class']).inc()

            return result

        except Exception:
            INFERENCE_COUNT.labels(self.model_name, 'error').inc()
            raise

        finally:
            latency = time.perf_counter() - start_time
            INFERENCE_LATENCY.labels(self.model_name).observe(latency)

            # Update GPU memory
            if torch.cuda.is_available():
                memory = torch.cuda.memory_allocated()
                GPU_MEMORY.labels('cuda:0').set(memory)

# Start metrics server
start_http_server(9090)

Logging Configuration

import logging
import json
from datetime import datetime, timezone

class StructuredLogger:
    def __init__(self, name, level=logging.INFO):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(level)

        # JSON formatter (attach once so repeated construction does not duplicate output)
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(JsonFormatter())
            self.logger.addHandler(handler)

    def log_inference(self, model_name, latency, num_detections, input_shape):
        self.logger.info(json.dumps({
            'event': 'inference',
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'model_name': model_name,
            'latency_ms': latency * 1000,
            'num_detections': num_detections,
            'input_shape': list(input_shape)
        }))

    def log_error(self, model_name, error, input_shape):
        self.logger.error(json.dumps({
            'event': 'inference_error',
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'model_name': model_name,
            'error': str(error),
            'error_type': type(error).__name__,
            'input_shape': list(input_shape)
        }))

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return record.getMessage()
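
The formatter above passes through messages that were already serialized by the caller. An alternative, sketched here under the assumption that callers pass structured fields through `extra={"fields": {...}}` (a convention chosen for this example, not something the logger above does), is to let the formatter build the JSON from the log record itself:

```python
import json
import logging
from datetime import datetime, timezone


class RecordJsonFormatter(logging.Formatter):
    """Serialize each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge fields supplied via extra={"fields": {...}}, if any
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps non-JSON types (e.g. numpy scalars) from raising
        return json.dumps(payload, default=str)
```

With this formatter attached to a handler, a call such as `logger.info("inference", extra={"fields": {"model_name": "yolov8", "latency_ms": 12.5}})` emits one JSON line that also carries the level and logger name.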

---

Scaling and Performance

Batch Processing Optimization

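import asyncio

import numpy as np

class BatchProcessor:
    def __init__(self, model, max_batch_size=8, max_wait_ms=100):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.lock = asyncio.Lock()
        self._flush_task = None

    async def process(self, image, request_id):
        """Add image to batch and wait for result."""
        future = asyncio.get_running_loop().create_future()

        async with self.lock:
            self.queue.append((request_id, image, future))

            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()
            elif self._flush_task is None or self._flush_task.done():
                # Flush a partial batch once max_wait_ms has elapsed
                self._flush_task = asyncio.create_task(self._flush_after_wait())

        # Wait for result with timeout
        result = await asyncio.wait_for(future, timeout=5.0)
        return result

    async def _flush_after_wait(self):
        """Process whatever is queued after the maximum wait time."""
        await asyncio.sleep(self.max_wait_ms / 1000)
        async with self.lock:
            if self.queue:
                await self._process_batch()

    async def _process_batch(self):
        """Process accumulated batch. The caller must hold self.lock."""
        batch_items = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]

        # Stack images
        images = np.stack([item[1] for item in batch_items])

        # Inference in a worker thread so the event loop is not blocked (Python 3.9+)
        try:
            outputs = await asyncio.to_thread(self.model.infer, images)
        except Exception as exc:
            for _, _, future in batch_items:
                if not future.done():
                    future.set_exception(exc)
            return

        # Return results (skip futures already cancelled by a timeout)
        for i, (_, _, future) in enumerate(batch_items):
            if not future.done():
                future.set_result(outputs[i])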

Multi-GPU Inference

import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel

class MultiGPUInference:
    def __init__(self, model, device_ids=None):
        """
        Wrap model for multi-GPU inference.

        Args:
            model: PyTorch model
            device_ids: List of GPU IDs, e.g., [0, 1, 2, 3]
        """
        if device_ids is None:
            device_ids = list(range(torch.cuda.device_count()))

        self.device = torch.device('cuda:0')
        self.model = DataParallel(model, device_ids=device_ids)
        self.model.to(self.device)
        self.model.eval()  # switch to inference mode (disables dropout, freezes batch norm stats)

    def infer(self, images):
        """
        Run inference across GPUs.
        """
        with torch.no_grad():
            images = torch.from_numpy(images).to(self.device)
            outputs = self.model(images)
        return outputs.cpu().numpy()

Performance Benchmarking

import time

import numpy as np

def comprehensive_benchmark(model, input_sizes, batch_sizes, num_iterations=100):
    """
    Benchmark model across different configurations.
    """
    results = []

    for input_size in input_sizes:
        for batch_size in batch_sizes:
            # Create input
            dummy = np.random.randn(batch_size, 3, input_size, input_size).astype(np.float32)

            # Warmup
            for _ in range(10):
                model.infer(dummy)

            # Benchmark
            latencies = []
            for _ in range(num_iterations):
                start = time.perf_counter()
                model.infer(dummy)
                latencies.append(time.perf_counter() - start)

            # Calculate statistics
            latencies = np.array(latencies) * 1000  # Convert to ms
            result = {
                'input_size': input_size,
                'batch_size': batch_size,
                'mean_latency_ms': np.mean(latencies),
                'std_latency_ms': np.std(latencies),
                'p50_latency_ms': np.percentile(latencies, 50),
                'p95_latency_ms': np.percentile(latencies, 95),
                'p99_latency_ms': np.percentile(latencies, 99),
                'throughput_fps': batch_size * 1000 / np.mean(latencies)
            }
            results.append(result)

            print(f"Size: {input_size}, Batch: {batch_size}")
            print(f"  Latency: {result['mean_latency_ms']:.2f}ms (p99: {result['p99_latency_ms']:.2f}ms)")
            print(f"  Throughput: {result['throughput_fps']:.1f} FPS")

    return results
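
The same latency summary can be computed without NumPy. The sketch below uses the nearest-rank percentile definition with integer arithmetic; the function name is hypothetical, and its results can differ slightly from `np.percentile`, which interpolates between samples by default.

```python
import statistics


def summarize_latencies(latencies_ms, batch_size=1):
    """Summarize per-call latencies (in ms) using only the standard library."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest rank: smallest sample with at least p% of samples at or below it.
        # -(-a // b) is integer ceiling division, which avoids float rounding.
        rank = max(1, -(-p * n // 100))
        return ordered[rank - 1]

    mean = statistics.fmean(ordered)
    return {
        "mean_latency_ms": mean,
        "p50_latency_ms": percentile(50),
        "p95_latency_ms": percentile(95),
        "p99_latency_ms": percentile(99),
        "throughput_fps": batch_size * 1000 / mean,
    }
```

Reporting p95 and p99 next to the mean matters for serving workloads, since a few slow calls can dominate user-visible latency without moving the average much.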

---

Resources

senior-computer-vision reference

Reference Documentation

1. Computer Vision Architectures

See references/computer_vision_architectures.md for:

CNN backbone architectures (ResNet, EfficientNet, ConvNeXt)
Vision Transformer variants (ViT, DeiT, Swin)
Detection heads (anchor-based vs anchor-free)
Feature Pyramid Networks (FPN, BiFPN, PANet)
Neck architectures for multi-scale detection

2. Object Detection Optimization

See references/object_detection_optimization.md for:

Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS)
Anchor optimization and anchor-free alternatives
Loss function design (focal loss, GIoU, CIoU, DIoU)
Training strategies (warmup, cosine annealing, EMA)
Data augmentation for detection (mosaic, mixup, copy-paste)

3. Production Vision Systems

See references/production_vision_systems.md for:

ONNX export and optimization
TensorRT deployment pipeline
Batch inference optimization
Edge device deployment (Jetson, Intel NCS)
Model serving with Triton
Video processing pipelines

Common Commands

Ultralytics YOLO

# Training
yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640

# Validation
yolo detect val model=best.pt data=coco.yaml

# Inference
yolo detect predict model=best.pt source=images/ save=True

# Export
yolo export model=best.pt format=onnx simplify=True dynamic=True

Detectron2

# Training
python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \
    --num-gpus 1 OUTPUT_DIR ./output

# Evaluation
python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \
    MODEL.WEIGHTS output/model_final.pth

# Inference
python demo.py --config-file configs/faster_rcnn.yaml \
    --input images/*.jpg --output results/ \
    --opts MODEL.WEIGHTS output/model_final.pth

MMDetection

# Training
python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py

# Testing (MMDetection 3.x takes metrics from the config; on 2.x append --eval bbox)
python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth

# Inference
python demo/image_demo.py demo.jpg configs/faster_rcnn.py checkpoints/latest.pth

Model Optimization

# ONNX export and simplify
python -c "import torch; model = torch.load('model.pt', weights_only=False).eval(); torch.onnx.export(model, torch.randn(1,3,640,640), 'model.onnx', opset_version=17)"
python -m onnxsim model.onnx model_sim.onnx

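# TensorRT conversion (workspace size in MiB; --workspace is deprecated in recent TensorRT releases)
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --memPoolSize=workspace:4096

# Benchmark (--batch applies only to implicit-batch engines, not ONNX-derived ones)
trtexec --loadEngine=model.engine --iterations=1000 --avgRuns=100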

#!/usr/bin/env python3
"""
Inference Optimizer

Analyzes and benchmarks vision models, and provides optimization recommendations.
Supports PyTorch, ONNX, and TensorRT models.

Usage:
    python inference_optimizer.py model.pt --benchmark
    python inference_optimizer.py model.pt --export onnx --output model.onnx
    python inference_optimizer.py model.onnx --analyze
"""

import os
import sys
import json
import argparse
import logging
import time
from pathlib import Path
from typing import Dict, List, Optional, Any, Tuple
from datetime import datetime
import statistics

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


# Model format signatures
MODEL_FORMATS = {
    '.pt': 'pytorch',
    '.pth': 'pytorch',
    '.onnx': 'onnx',
    '.engine': 'tensorrt',
    '.trt': 'tensorrt',
    '.xml': 'openvino',
    '.mlpackage': 'coreml',
    '.mlmodel': 'coreml',
}

# Optimization recommendations
OPTIMIZATION_PATHS = {
    ('pytorch', 'gpu'): ['onnx', 'tensorrt_fp16'],
    ('pytorch', 'cpu'): ['onnx', 'onnxruntime'],
    ('pytorch', 'edge'): ['onnx', 'tensorrt_int8'],
    ('pytorch', 'mobile'): ['onnx', 'tflite'],
    ('pytorch', 'apple'): ['coreml'],
    ('pytorch', 'intel'): ['onnx', 'openvino'],
    ('onnx', 'gpu'): ['tensorrt_fp16'],
    ('onnx', 'cpu'): ['onnxruntime'],
}


class InferenceOptimizer:
    """Analyzes and optimizes vision model inference."""

    def __init__(self, model_path: str):
        self.model_path = Path(model_path)
        self.model_format = self._detect_format()
        self.model_info = {}
        self.benchmark_results = {}

    def _detect_format(self) -> str:
        """Detect model format from file extension."""
        suffix = self.model_path.suffix.lower()
        if suffix in MODEL_FORMATS:
            return MODEL_FORMATS[suffix]
        raise ValueError(f"Unknown model format: {suffix}")

    def analyze_model(self) -> Dict[str, Any]:
        """Analyze model structure and size."""
        logger.info(f"Analyzing model: {self.model_path}")

        analysis = {
            'path': str(self.model_path),
            'format': self.model_format,
            'file_size_mb': self.model_path.stat().st_size / 1024 / 1024,
            'parameters': None,
            'layers': [],
            'input_shape': None,
            'output_shape': None,
            'ops_count': None,
        }

        if self.model_format == 'onnx':
            analysis.update(self._analyze_onnx())
        elif self.model_format == 'pytorch':
            analysis.update(self._analyze_pytorch())

        self.model_info = analysis
        return analysis

    def _analyze_onnx(self) -> Dict[str, Any]:
        """Analyze ONNX model."""
        try:
            import onnx
            model = onnx.load(str(self.model_path))
            onnx.checker.check_model(model)

            # Count parameters
            total_params = 0
            for initializer in model.graph.initializer:
                param_count = 1
                for dim in initializer.dims:
                    param_count *= dim
                total_params += param_count

            # Get input/output shapes
            inputs = []
            for inp in model.graph.input:
                shape = [d.dim_value if d.dim_value else -1
                        for d in inp.type.tensor_type.shape.dim]
                inputs.append({'name': inp.name, 'shape': shape})

            outputs = []
            for out in model.graph.output:
                shape = [d.dim_value if d.dim_value else -1
                        for d in out.type.tensor_type.shape.dim]
                outputs.append({'name': out.name, 'shape': shape})

            # Count operators
            op_counts = {}
            for node in model.graph.node:
                op_type = node.op_type
                op_counts[op_type] = op_counts.get(op_type, 0) + 1

            return {
                'parameters': total_params,
                'inputs': inputs,
                'outputs': outputs,
                'operator_counts': op_counts,
                'num_nodes': len(model.graph.node),
                'opset_version': model.opset_import[0].version if model.opset_import else None,
            }

        except ImportError:
            logger.warning("onnx package not installed, skipping detailed analysis")
            return {}
        except Exception as e:
            logger.error(f"Error analyzing ONNX model: {e}")
            return {'error': str(e)}

    def _analyze_pytorch(self) -> Dict[str, Any]:
        """Analyze PyTorch model."""
        try:
            import torch

            # Try to load as checkpoint
            checkpoint = torch.load(str(self.model_path), map_location='cpu')

            # Handle different checkpoint formats
            if isinstance(checkpoint, dict):
                if 'model' in checkpoint:
                    state_dict = checkpoint['model']
                elif 'state_dict' in checkpoint:
                    state_dict = checkpoint['state_dict']
                else:
                    state_dict = checkpoint
            else:
                # Assume it's the model itself
                if hasattr(checkpoint, 'state_dict'):
                    state_dict = checkpoint.state_dict()
                else:
                    return {'error': 'Could not extract state dict'}

            # Count parameters
            total_params = 0
            layer_info = []
            for name, param in state_dict.items():
                if hasattr(param, 'numel'):
                    param_count = param.numel()
                    total_params += param_count
                    layer_info.append({
                        'name': name,
                        'shape': list(param.shape),
                        'params': param_count,
                        'dtype': str(param.dtype)
                    })

            return {
                'parameters': total_params,
                'layers': layer_info[:20],  # First 20 layers
                'num_layers': len(layer_info),
            }

        except ImportError:
            logger.warning("torch package not installed, skipping detailed analysis")
            return {}
        except Exception as e:
            logger.error(f"Error analyzing PyTorch model: {e}")
            return {'error': str(e)}

    def benchmark(self, input_size: Tuple[int, int] = (640, 640),
                  batch_sizes: List[int] = None,
                  num_iterations: int = 100,
                  warmup: int = 10) -> Dict[str, Any]:
        """Benchmark model inference speed."""
        if batch_sizes is None:
            batch_sizes = [1, 4, 8, 16]

        logger.info(f"Benchmarking model with input size {input_size}")

        results = {
            'input_size': input_size,
            'num_iterations': num_iterations,
            'warmup_iterations': warmup,
            'batch_results': [],
            'device': 'cpu',
        }

        try:
            if self.model_format == 'onnx':
                results.update(self._benchmark_onnx(input_size, batch_sizes,
                                                    num_iterations, warmup))
            elif self.model_format == 'pytorch':
                results.update(self._benchmark_pytorch(input_size, batch_sizes,
                                                       num_iterations, warmup))
            else:
                results['error'] = f"Benchmarking not supported for {self.model_format}"

        except Exception as e:
            results['error'] = str(e)
            logger.error(f"Benchmark failed: {e}")

        self.benchmark_results = results
        return results

    def _benchmark_onnx(self, input_size: Tuple[int, int],
                        batch_sizes: List[int],
                        num_iterations: int, warmup: int) -> Dict[str, Any]:
        """Benchmark ONNX model."""
        import numpy as np

        try:
            import onnxruntime as ort

            # Try GPU first, fall back to CPU
            providers = ['CPUExecutionProvider']
            try:
                if 'CUDAExecutionProvider' in ort.get_available_providers():
                    providers = ['CUDAExecutionProvider'] + providers
            except Exception:
                pass

            session = ort.InferenceSession(str(self.model_path), providers=providers)
            input_name = session.get_inputs()[0].name
            device = 'cuda' if 'CUDA' in session.get_providers()[0] else 'cpu'

            results = {'device': device, 'provider': session.get_providers()[0]}
            batch_results = []

            for batch_size in batch_sizes:
                # Create dummy input
                dummy = np.random.randn(batch_size, 3, *input_size).astype(np.float32)

                # Warmup
                for _ in range(warmup):
                    session.run(None, {input_name: dummy})

                # Benchmark
                latencies = []
                for _ in range(num_iterations):
                    start = time.perf_counter()
                    session.run(None, {input_name: dummy})
                    latencies.append((time.perf_counter() - start) * 1000)

                batch_result = {
                    'batch_size': batch_size,
                    'mean_latency_ms': statistics.mean(latencies),
                    'std_latency_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0,
                    'min_latency_ms': min(latencies),
                    'max_latency_ms': max(latencies),
                    'p50_latency_ms': sorted(latencies)[len(latencies) // 2],
                    'p95_latency_ms': sorted(latencies)[int(len(latencies) * 0.95)],
                    'p99_latency_ms': sorted(latencies)[int(len(latencies) * 0.99)],
                    'throughput_fps': batch_size * 1000 / statistics.mean(latencies),
                }
                batch_results.append(batch_result)

                logger.info(f"Batch {batch_size}: {batch_result['mean_latency_ms']:.2f}ms, "
                           f"{batch_result['throughput_fps']:.1f} FPS")

            results['batch_results'] = batch_results
            return results

        except ImportError:
            return {'error': 'onnxruntime not installed'}

    def _benchmark_pytorch(self, input_size: Tuple[int, int],
                          batch_sizes: List[int],
                          num_iterations: int, warmup: int) -> Dict[str, Any]:
        """Benchmark PyTorch model."""
        try:
            import torch
            import numpy as np

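            # Load model
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            # weights_only=False is needed to unpickle a full nn.Module on PyTorch 2.6+;
            # only use it with checkpoints from a trusted source
            checkpoint = torch.load(str(self.model_path), map_location=device,
                                    weights_only=False)

            # Handle different checkpoint formats
            if isinstance(checkpoint, dict) and 'model' in checkpoint:
                model = checkpoint['model']
            elif hasattr(checkpoint, 'forward'):
                model = checkpoint
            else:
                return {'error': 'Could not load model for benchmarking'}

            # Cast to FP32 so the model matches the FP32 dummy input
            # (some checkpoints store FP16 weights)
            model.to(device).float()
            model.train(False)  # inference mode, equivalent to model.eval()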

            results = {'device': str(device)}
            batch_results = []

            with torch.no_grad():
                for batch_size in batch_sizes:
                    dummy = torch.randn(batch_size, 3, *input_size, device=device)

                    # Warmup
                    for _ in range(warmup):
                        _ = model(dummy)
                    if device.type == 'cuda':
                        torch.cuda.synchronize()

                    # Benchmark
                    latencies = []
                    for _ in range(num_iterations):
                        if device.type == 'cuda':
                            torch.cuda.synchronize()
                        start = time.perf_counter()
                        _ = model(dummy)
                        if device.type == 'cuda':
                            torch.cuda.synchronize()
                        latencies.append((time.perf_counter() - start) * 1000)

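                    # Report the same percentile fields as the ONNX benchmark so the
                    # summary table's P99 column is populated for PyTorch models too.
                    ordered = sorted(latencies)
                    batch_result = {
                        'batch_size': batch_size,
                        'mean_latency_ms': statistics.mean(latencies),
                        'std_latency_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0,
                        'min_latency_ms': min(latencies),
                        'max_latency_ms': max(latencies),
                        'p50_latency_ms': ordered[len(ordered) // 2],
                        'p95_latency_ms': ordered[int(len(ordered) * 0.95)],
                        'p99_latency_ms': ordered[int(len(ordered) * 0.99)],
                        'throughput_fps': batch_size * 1000 / statistics.mean(latencies),
                    }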
                    batch_results.append(batch_result)

                    logger.info(f"Batch {batch_size}: {batch_result['mean_latency_ms']:.2f}ms, "
                               f"{batch_result['throughput_fps']:.1f} FPS")

            results['batch_results'] = batch_results
            return results

        except ImportError:
            return {'error': 'torch not installed'}
        except Exception as e:
            return {'error': str(e)}

    def get_optimization_recommendations(self, target: str = 'gpu') -> List[Dict[str, Any]]:
        """Get optimization recommendations for target platform."""
        recommendations = []

        key = (self.model_format, target)
        if key in OPTIMIZATION_PATHS:
            path = OPTIMIZATION_PATHS[key]
            for step in path:
                rec = {
                    'step': step,
                    'description': self._get_step_description(step),
                    'expected_speedup': self._get_expected_speedup(step),
                    'command': self._get_step_command(step),
                }
                recommendations.append(rec)

        # Add general recommendations
        if self.model_info:
            params = self.model_info.get('parameters', 0)
            if params and params > 50_000_000:
                recommendations.append({
                    'step': 'pruning',
                    'description': f'Model has {params/1e6:.1f}M parameters. '
                                 'Consider structured pruning to reduce size.',
                    'expected_speedup': '1.5-2x',
                })

            file_size = self.model_info.get('file_size_mb', 0)
            if file_size > 100:
                recommendations.append({
                    'step': 'quantization',
                    'description': f'Model size is {file_size:.1f}MB. '
                                 'INT8 quantization can reduce by 75%.',
                    'expected_speedup': '2-4x',
                })

        return recommendations

    def _get_step_description(self, step: str) -> str:
        """Get description for optimization step."""
        descriptions = {
            'onnx': 'Export to ONNX format for framework-agnostic deployment',
            'tensorrt_fp16': 'Convert to TensorRT with FP16 precision for NVIDIA GPUs',
            'tensorrt_int8': 'Convert to TensorRT with INT8 quantization for edge devices',
            'onnxruntime': 'Use ONNX Runtime for optimized CPU/GPU inference',
            'openvino': 'Convert to OpenVINO for Intel CPU/GPU optimization',
            'coreml': 'Convert to CoreML for Apple Silicon acceleration',
            'tflite': 'Convert to TensorFlow Lite for mobile deployment',
        }
        return descriptions.get(step, step)

    def _get_expected_speedup(self, step: str) -> str:
        """Get expected speedup for optimization step."""
        speedups = {
            'onnx': '1-1.5x',
            'tensorrt_fp16': '2-4x',
            'tensorrt_int8': '3-6x',
            'onnxruntime': '1.2-2x',
            'openvino': '1.5-3x',
            'coreml': '2-5x (on Apple Silicon)',
            'tflite': '1-2x',
        }
        return speedups.get(step, 'varies')

    def _get_step_command(self, step: str) -> str:
        """Get command for optimization step."""
        model_name = self.model_path.stem
        commands = {
            'onnx': f'yolo export model={model_name}.pt format=onnx',
            'tensorrt_fp16': f'trtexec --onnx={model_name}.onnx --saveEngine={model_name}.engine --fp16',
            'tensorrt_int8': f'trtexec --onnx={model_name}.onnx --saveEngine={model_name}.engine --int8',
            'onnxruntime': 'pip install onnxruntime-gpu',
            'openvino': f'mo --input_model {model_name}.onnx --output_dir openvino/',
            'coreml': f'yolo export model={model_name}.pt format=coreml',
            'tflite': f'yolo export model={model_name}.pt format=tflite',
        }
        return commands.get(step, '')

    def print_summary(self):
        """Print analysis and benchmark summary."""
        print("\n" + "=" * 70)
        print("MODEL ANALYSIS SUMMARY")
        print("=" * 70)

        if self.model_info:
            print(f"Path:        {self.model_info.get('path', 'N/A')}")
            print(f"Format:      {self.model_info.get('format', 'N/A')}")
            print(f"File Size:   {self.model_info.get('file_size_mb', 0):.2f} MB")

            params = self.model_info.get('parameters')
            if params:
                print(f"Parameters:  {params:,} ({params/1e6:.2f}M)")

            if 'num_nodes' in self.model_info:
                print(f"Nodes:       {self.model_info['num_nodes']}")

        if self.benchmark_results and 'batch_results' in self.benchmark_results:
            print("\n" + "-" * 70)
            print("BENCHMARK RESULTS")
            print("-" * 70)
            print(f"Device:      {self.benchmark_results.get('device', 'N/A')}")
            print(f"Input Size:  {self.benchmark_results.get('input_size', 'N/A')}")
            print()
            print(f"{'Batch':<8} {'Latency (ms)':<15} {'Throughput (FPS)':<18} {'P99 (ms)':<12}")
            print("-" * 55)

            for result in self.benchmark_results['batch_results']:
                print(f"{result['batch_size']:<8} "
                      f"{result['mean_latency_ms']:<15.2f} "
                      f"{result['throughput_fps']:<18.1f} "
                      f"{result.get('p99_latency_ms', 0):<12.2f}")

        print("=" * 70 + "\n")


def main():
    parser = argparse.ArgumentParser(
        description="Analyze and optimize vision model inference"
    )
    parser.add_argument('model_path', help='Path to model file')
    parser.add_argument('--analyze', action='store_true',
                       help='Analyze model structure')
    parser.add_argument('--benchmark', action='store_true',
                       help='Benchmark inference speed')
    parser.add_argument('--input-size', type=int, nargs=2, default=[640, 640],
                       metavar=('H', 'W'), help='Input image size')
    parser.add_argument('--batch-sizes', type=int, nargs='+', default=[1, 4, 8],
                       help='Batch sizes to benchmark')
    parser.add_argument('--iterations', type=int, default=100,
                       help='Number of benchmark iterations')
    parser.add_argument('--warmup', type=int, default=10,
                       help='Number of warmup iterations')
    parser.add_argument('--target', choices=['gpu', 'cpu', 'edge', 'mobile', 'apple', 'intel'],
                       default='gpu', help='Target deployment platform')
    parser.add_argument('--recommend', action='store_true',
                       help='Show optimization recommendations')
    parser.add_argument('--json', action='store_true',
                       help='Output as JSON')
    parser.add_argument('--output', '-o', help='Output file path')

    args = parser.parse_args()

    if not Path(args.model_path).exists():
        logger.error(f"Model not found: {args.model_path}")
        sys.exit(1)

    try:
        optimizer = InferenceOptimizer(args.model_path)
    except ValueError as e:
        logger.error(str(e))
        sys.exit(1)

    results = {}

    # Analyze model
    if args.analyze or not (args.benchmark or args.recommend):
        results['analysis'] = optimizer.analyze_model()

    # Benchmark
    if args.benchmark:
        results['benchmark'] = optimizer.benchmark(
            input_size=tuple(args.input_size),
            batch_sizes=args.batch_sizes,
            num_iterations=args.iterations,
            warmup=args.warmup
        )

    # Recommendations
    if args.recommend:
        if not optimizer.model_info:
            optimizer.analyze_model()
        results['recommendations'] = optimizer.get_optimization_recommendations(args.target)

    # Output
    if args.json:
        print(json.dumps(results, indent=2, default=str))
    else:
        optimizer.print_summary()

        if args.recommend and 'recommendations' in results:
            print("OPTIMIZATION RECOMMENDATIONS")
            print("-" * 70)
            for i, rec in enumerate(results['recommendations'], 1):
                print(f"\n{i}. {rec['step'].upper()}")
                print(f"   {rec['description']}")
                print(f"   Expected speedup: {rec['expected_speedup']}")
                if rec.get('command'):
                    print(f"   Command: {rec['command']}")
            print()

    # Save to file
    if args.output:
        with open(args.output, 'w') as f:
            json.dump(results, f, indent=2, default=str)
        logger.info(f"Results saved to {args.output}")


if __name__ == '__main__':
    main()
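
The benchmark methods above reduce a list of per-iteration latencies to a mean, spread, percentiles, and throughput. The following is a minimal standalone sketch of that reduction, assuming latencies are already in milliseconds. The helper name `summarize_latencies` is hypothetical and not part of the script, and it uses integer arithmetic for the percentile index instead of the script's `int(len(x) * 0.95)`.

```python
import statistics
from typing import Dict, List


def summarize_latencies(latencies_ms: List[float], batch_size: int) -> Dict[str, float]:
    """Summarize per-iteration latencies (hypothetical helper, not part of the script)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def pct(p: int) -> float:
        # Index floor(n * p / 100), clamped to the last sample.
        return ordered[min(n - 1, (n * p) // 100)]

    mean = statistics.mean(ordered)
    return {
        "batch_size": batch_size,
        "mean_latency_ms": mean,
        "std_latency_ms": statistics.stdev(ordered) if n > 1 else 0.0,
        "min_latency_ms": ordered[0],
        "max_latency_ms": ordered[-1],
        "p50_latency_ms": pct(50),
        "p95_latency_ms": pct(95),
        "p99_latency_ms": pct(99),
        # Images per second: batch_size images every `mean` milliseconds.
        "throughput_fps": batch_size * 1000 / mean,
    }


if __name__ == "__main__":
    samples = [float(ms) for ms in range(1, 101)]  # synthetic latencies, 1..100 ms
    summary = summarize_latencies(samples, batch_size=4)
    print(summary["p50_latency_ms"], summary["p99_latency_ms"])
```

With only a handful of iterations the high percentiles collapse onto the maximum sample, so tail figures such as P99 are only informative when the iteration count is well above 100.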


How it compares

Choose senior-computer-vision over general ML skills when the task involves detection, segmentation, or vision-specific deployment with YOLO, DETR, or SAM rather than tabular or NLP models.

FAQ

Which detection models does senior-computer-vision cover?

senior-computer-vision covers YOLO, Faster R-CNN, and DETR for object detection plus Mask R-CNN and SAM for image segmentation. It also includes CNN and Vision Transformer architectures with training guidance across PyTorch, Ultralytics, Detectron2, and MMDetection.

How does senior-computer-vision handle production deployment?

senior-computer-vision guides production deployment using ONNX and TensorRT for optimized inference. Developers use it when building detection pipelines, training custom models, tuning inference latency, and shipping visual AI systems beyond notebook prototypes.
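
As a rough illustration of that ONNX-to-TensorRT path, the sketch below assembles the two shell commands involved. The command strings are the ones the script's `_get_step_command` emits; the function name `deployment_commands` and the `precision` parameter are hypothetical. It assumes an Ultralytics-style `.pt` checkpoint and that the commands run from the directory containing the model.

```python
from pathlib import Path
from typing import List


def deployment_commands(model_path: str, precision: str = "fp16") -> List[str]:
    """Return shell commands for a PyTorch -> ONNX -> TensorRT path (hypothetical helper)."""
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unsupported precision: {precision}")
    # Only the file stem is used, so the commands are relative to the model's directory.
    name = Path(model_path).stem
    return [
        f"yolo export model={name}.pt format=onnx",
        f"trtexec --onnx={name}.onnx --saveEngine={name}.engine --{precision}",
    ]


if __name__ == "__main__":
    for command in deployment_commands("weights/yolov8n.pt"):
        print(command)
```

The function only builds strings and does not run anything. Benchmarking the model before and after each step shows whether the conversion actually pays off on the target hardware.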

Is Senior Computer Vision safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

