Vision Framework

Name: Vision Framework
Author: dpearson2699

dpearson2699/swift-ios-skills

2.7k installs
944 repo stars
Updated July 15, 2026

vision-framework implements iOS on-device computer vision with modern async Vision requests, VisionKit scanning, and Core ML integration.

About

The vision-framework skill implements computer vision on iOS using on-device Vision APIs for text recognition, face detection, barcodes, segmentation, object tracking, document scanning, and Core ML inference. It documents two API generations: the modern iOS 18+ API, which calls async perform on Swift types such as RecognizeTextRequest, and the legacy VNImageRequestHandler completion-handler pattern for older deployment targets. Coverage includes OCR at accurate and fast recognition levels, face rectangles, barcode symbologies, iOS 26 document scanning, image segmentation, video object tracking with stateful class requests, and VNCoreMLRequest for custom models. VisionKit's DataScannerViewController supports live camera scanning. Patterns target iOS 26 with Swift 6.3 and include common mistakes and a review checklist covering orientation, language codes, and request handler lifecycle. Developers invoke it when adding OCR, barcode scanning, face detection, or custom Core ML Vision inference to Swift iOS applications.

Modern iOS 18+ async perform API vs. legacy VNRequest handlers.
OCR, face detection, barcodes, segmentation, tracking, and document scan patterns.
VisionKit DataScannerViewController for live camera scanning.
VNCoreMLRequest integration for custom on-device model inference.
Review checklist and common mistakes for Vision request setup.

Vision Framework by the numbers

2,724 all-time installs (skills.sh)
+113 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #69 of 1,039 Mobile Development skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

At a glance

vision-framework capabilities & compatibility

Capabilities: modern and legacy Vision request patterns · text recognition with language and accuracy levels · barcode and face detection configuration · VisionKit DataScannerViewController integration · Core ML custom model inference via Vision
Use cases: frontend
Platforms: macOS

From the docs

What vision-framework says it does

Prefer the modern API for new code.

SKILL.md

let observations = try await request.perform(on: image)

SKILL.md

Detect text, faces, barcodes, objects, and body poses in images and video using on-device computer vision.

SKILL.md

npx skills add https://github.com/dpearson2699/swift-ios-skills --skill vision-framework

Security audit	3 / 3 scanners passed

How do I add OCR, barcode scanning, or face detection to an iOS app with Vision framework?

Implement on-device OCR, face detection, barcode scanning, and Core ML Vision requests in iOS apps.

Who is it for?

iOS developers adding on-device text, barcode, face, or custom ML vision features.

Skip if: you need server-side image ML, Android CV, or SwiftUI gesture-only work without vision.

When should I use this skill?

User asks about Vision OCR, barcode scanning, face detection, DataScannerViewController, or VNCoreMLRequest.

What you get

Working Vision request code with correct API generation, orientation handling, and VisionKit scanner if needed.

Vision VNRequest handlers
DataScannerViewController integration
VNCoreMLRequest inference pipeline

By the numbers

Covers 6+ Vision capabilities: OCR, faces, barcodes, segmentation, tracking, document scanning
Documents both the iOS 18+ Swift-native API and legacy VNRequest patterns

Files

SKILL.md

Vision Framework

Detect text, faces, barcodes, objects, and body poses in images and video using on-device computer vision. Patterns target iOS 26+ with Swift 6.3, backward-compatible where noted.

See references/vision-requests.md for complete code patterns and references/visionkit-scanner.md for DataScannerViewController integration.

Two API Generations
Request Pattern (Modern API)
Text Recognition (OCR)
Face Detection
Barcode Detection
Document Scanning (iOS 26+)
Image Segmentation
Object Tracking
Other Request Types
Core ML Integration
VisionKit: DataScannerViewController
Common Mistakes
Review Checklist
References

Two API Generations

Vision has two distinct API layers. Prefer the modern API for new code.

Aspect	Modern (iOS 18+)	Legacy
Pattern	`let result = try await request.perform(on: image)`	`VNImageRequestHandler` + completion handler
Request types	Swift types — structs and classes (`RecognizeTextRequest`, `DetectFaceRectanglesRequest`)	ObjC classes (`VNRecognizeTextRequest`, `VNDetectFaceRectanglesRequest`)
Concurrency	Native async/await	Completion handlers or synchronous `perform`
Observations	Typed return values	Cast `results` from `[Any]`
Availability	iOS 18+ / macOS 15+	iOS 11+

The modern API uses the ImageProcessingRequest protocol. Each request type has a perform(on:orientation:) method that accepts CGImage, CIImage, CVPixelBuffer, CMSampleBuffer, Data, or URL. Most requests are structs; stateful requests for video tracking (e.g., TrackObjectRequest, TrackRectangleRequest, DetectTrajectoriesRequest) are final classes.
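As a rough sketch of how the orientation argument of perform(on:orientation:) might be supplied when the source is a UIImage: the snippet below maps UIImage.Orientation onto CGImagePropertyOrientation and passes the result along. The function name recognizeUprightText is hypothetical, and the assumption that the orientation parameter takes a CGImagePropertyOrientation should be checked against the SDK you build with.

```swift
import UIKit
import ImageIO
import Vision

extension CGImagePropertyOrientation {
    /// Maps UIKit's orientation cases onto their EXIF-style equivalents.
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up: self = .up
        case .upMirrored: self = .upMirrored
        case .down: self = .down
        case .downMirrored: self = .downMirrored
        case .left: self = .left
        case .leftMirrored: self = .leftMirrored
        case .right: self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up
        }
    }
}

/// Hypothetical helper: runs OCR while telling Vision how the pixels are rotated.
@available(iOS 18.0, *)
func recognizeUprightText(in image: UIImage) async throws -> [String] {
    // UIImage.cgImage drops orientation metadata, so it is passed separately.
    guard let cgImage = image.cgImage else { return [] }

    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate

    // Assumption: the orientation parameter is a CGImagePropertyOrientation.
    let observations = try await request.perform(
        on: cgImage,
        orientation: CGImagePropertyOrientation(image.imageOrientation)
    )
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
```

Photos taken in portrait typically carry a .right orientation, which is where skipping this step tends to produce empty or garbled OCR results.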

Request Pattern (Modern API)

All modern Vision requests follow the same pattern: create a request struct, call perform(on:), and handle the typed result.

import Vision

func recognizeText(in image: CGImage) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [Locale.Language(identifier: "en-US")]

    let observations = try await request.perform(on: image)
    return observations.compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}

Legacy Pattern (Pre-iOS 18)

Use VNImageRequestHandler with completion-based requests when targeting older deployment versions.

import Vision

func recognizeTextLegacy(in image: CGImage) throws -> [String] {
    var recognized: [String] = []
    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        recognized = observations.compactMap { $0.topCandidates(1).first?.string }
    }
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([request])
    return recognized
}
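When one code base has to support both generations, the two patterns above can sit behind a single entry point. The following is a minimal sketch, assuming an availability check is an acceptable way to choose between them; the function name recognizeTextCompat is hypothetical and not part of the skill.

```swift
import Vision

/// Hypothetical wrapper: uses the modern API where available, legacy otherwise.
func recognizeTextCompat(in image: CGImage) async throws -> [String] {
    if #available(iOS 18.0, macOS 15.0, *) {
        // Modern path: typed observations returned directly from perform(on:).
        var request = RecognizeTextRequest()
        request.recognitionLevel = .accurate
        let observations = try await request.perform(on: image)
        return observations.compactMap { $0.topCandidates(1).first?.string }
    } else {
        // Legacy path: synchronous handler, results read from the request.
        // Note that this call blocks the current thread until it finishes.
        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .accurate
        let handler = VNImageRequestHandler(cgImage: image)
        try handler.perform([request])
        return (request.results ?? []).compactMap {
            $0.topCandidates(1).first?.string
        }
    }
}
```

Keeping the branch inside one function means call sites do not need their own availability checks, and the legacy branch can be deleted once the deployment target reaches iOS 18.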

Text Recognition (OCR)

Modern: RecognizeTextRequest (iOS 18+)

var request = RecognizeTextRequest()
request.recognitionLevel = .accurate       // .fast for real-time
request.recognitionLanguages = [
    Locale.Language(identifier: "en-US"),
    Locale.Language(identifier: "fr-FR"),
]
request.usesLanguageCorrection = true
request.customWords = ["SwiftUI", "Xcode"] // domain-specific terms

let observations = try await request.perform(on: cgImage)
for observation in observations {
    guard let candidate = observation.topCandidates(1).first else { continue }
    let text = candidate.string
    let confidence = candidate.confidence  // 0.0 ... 1.0
    let bounds = observation.boundingBox   // normalized coordinates
}

Legacy: VNRecognizeTextRequest

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
request.recognitionLanguages = ["en-US", "fr-FR"]
request.usesLanguageCorrection = true

Key differences: Modern API uses Locale.Language for languages; legacy uses string identifiers. Both support .accurate (best quality) and .fast (real-time suitable) recognition levels.
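If language settings are stored as plain strings (for example, shared with a legacy code path), they can be converted for the modern API. This is a small sketch using only Foundation; the helper name languages(from:) is hypothetical.

```swift
import Foundation

/// Hypothetical helper: turns legacy-style identifier strings into
/// the Locale.Language values the modern request expects.
func languages(from identifiers: [String]) -> [Locale.Language] {
    identifiers.map { Locale.Language(identifier: $0) }
}

let legacyIdentifiers = ["en-US", "fr-FR"]
let modernLanguages = languages(from: legacyIdentifiers)

// The language and region components are available separately if needed.
for language in modernLanguages {
    let code = language.languageCode?.identifier ?? "?"
    let region = language.region?.identifier ?? "?"
    print("\(code) / \(region)")
}
```

The resulting array can be assigned to recognitionLanguages on a RecognizeTextRequest, while the original strings remain usable for VNRecognizeTextRequest.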

Face Detection

Detect face rectangles, landmarks (eyes, nose, mouth), and capture quality.

// Modern API
let faceRequest = DetectFaceRectanglesRequest()
let faces = try await faceRequest.perform(on: cgImage)

for face in faces {
    let boundingBox = face.boundingBox   // normalized CGRect
    let roll = face.roll                 // Measurement<UnitAngle>
    let yaw = face.yaw                  // Measurement<UnitAngle>
}

// Landmarks (eyes, nose, mouth contours)
let landmarkRequest = DetectFaceLandmarksRequest()
let landmarkFaces = try await landmarkRequest.perform(on: cgImage)
for face in landmarkFaces {
    let landmarks = face.landmarks
    let leftEye = landmarks?.leftEye?.normalizedPoints
    let nose = landmarks?.nose?.normalizedPoints
}

Coordinate System

Vision uses a normalized coordinate system with origin at the bottom-left. Convert to UIKit (top-left origin) before display:

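func convertToUIKit(_ rect: CGRect, imageSize: CGSize) -> CGRect {
    // `rect` is normalized (0...1, bottom-left origin): flip y in normalized
    // space, then scale to image pixels.
    CGRect(
        x: rect.origin.x * imageSize.width,
        y: (1 - rect.origin.y - rect.height) * imageSize.height,
        width: rect.width * imageSize.width,
        height: rect.height * imageSize.height
    )
}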

Barcode Detection

Detect 1D and 2D barcodes including QR codes.

var request = DetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128, .pdf417]

let barcodes = try await request.perform(on: cgImage)
for barcode in barcodes {
    let payload = barcode.payloadString          // decoded content
    let symbology = barcode.symbology            // .qr, .ean13, etc.
    let bounds = barcode.boundingBox             // normalized rect
}

Common symbologies: .qr, .aztec, .pdf417, .dataMatrix, .ean8, .ean13, .code39, .code128, .upce, .itf14.

Document Scanning (iOS 26+)

RecognizeDocumentsRequest provides structured document reading with layout understanding beyond basic OCR. Returns DocumentObservation objects with a nested Container structure for paragraphs, tables, lists, and barcodes.

var request = RecognizeDocumentsRequest()
let documents = try await request.perform(on: cgImage)

for observation in documents {
    let container = observation.document

    // Full text content
    let fullText = container.text

    // Structured access to paragraphs
    for paragraph in container.paragraphs {
        let paragraphText = paragraph.text
    }

    // Tables and lists
    for table in container.tables { /* structured table data */ }
    for list in container.lists { /* structured list data */ }

    // Embedded barcodes detected within the document
    for barcode in container.barcodes { /* barcode data */ }

    // Document title if detected
    if let title = container.title { print(title) }
}

For simpler document camera scanning, use VisionKit's VNDocumentCameraViewController which provides a full-screen camera UI with auto-capture, perspective correction, and multi-page scanning.
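A minimal sketch of the VNDocumentCameraViewController route might look like the following. The class name DocumentScanCoordinator and its onFinish callback are hypothetical; the delegate methods are the ones VisionKit defines for this controller.

```swift
import UIKit
import VisionKit

/// Hypothetical coordinator that collects scanned pages as UIImages.
@MainActor
final class DocumentScanCoordinator: NSObject, VNDocumentCameraViewControllerDelegate {
    var onFinish: (([UIImage]) -> Void)?

    /// Returns nil on devices where the document camera is unsupported.
    func makeScanner() -> VNDocumentCameraViewController? {
        guard VNDocumentCameraViewController.isSupported else { return nil }
        let controller = VNDocumentCameraViewController()
        controller.delegate = self
        return controller
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFinishWith scan: VNDocumentCameraScan
    ) {
        // Each page is already perspective-corrected by the camera UI.
        let pages = (0..<scan.pageCount).map { scan.imageOfPage(at: $0) }
        controller.dismiss(animated: true) { [onFinish] in onFinish?(pages) }
    }

    func documentCameraViewControllerDidCancel(
        _ controller: VNDocumentCameraViewController
    ) {
        controller.dismiss(animated: true)
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFailWithError error: Error
    ) {
        // This sketch only dismisses; a real app would surface the error.
        controller.dismiss(animated: true)
    }
}
```

The returned page images can then be passed through a text or document recognition request if structured content is needed.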

Image Segmentation

Modern: GeneratePersonSegmentationRequest (iOS 18+)

var request = GeneratePersonSegmentationRequest()
request.qualityLevel = .accurate  // .balanced, .fast

let mask = try await request.perform(on: cgImage)
// mask is a PersonSegmentationObservation with a pixelBuffer property
let maskBuffer = mask.pixelBuffer
// Apply mask using Core Image: CIFilter.blendWithMask()

Legacy: VNGeneratePersonSegmentationRequest

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .accurate  // .balanced, .fast
request.outputPixelFormat = kCVPixelFormatType_OneComponent8

let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([request])

guard let mask = request.results?.first?.pixelBuffer else { return }
// Apply mask using Core Image: CIFilter.blendWithMask()

Quality levels:

.accurate -- best quality, slowest (~1s), full resolution
.balanced -- good quality, moderate speed (~100ms), 960x540
.fast -- lowest quality, fastest (~10ms), 256x144, suitable for real-time

Instance Segmentation (iOS 18+)

Separate masks per person for individual effects.

// Modern API (iOS 18+)
let request = GeneratePersonInstanceMaskRequest()
let observation = try await request.perform(on: cgImage)
let indices = observation.allInstances

for index in indices {
    let mask = try observation.generateMask(forInstances: IndexSet(integer: index))
    // mask is a CVPixelBuffer with only this person visible
}

// Legacy API (iOS 17+)
let request = VNGeneratePersonInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([request])

guard let result = request.results?.first else { return }
let indices = result.allInstances
for index in indices {
    let instanceMask = try result.generateMaskedImage(
        ofInstances: IndexSet(integer: index),
        from: handler,
        croppedToInstancesExtent: false
    )
}

See references/vision-requests.md for mask composition and Core Image filter integration patterns.

Object Tracking

Modern: TrackObjectRequest (iOS 18+)

TrackObjectRequest is a stateful request that maintains tracking context across frames. Conforms to both ImageProcessingRequest and StatefulRequest.

// Initialize with a detected object's bounding box
let initialObservation = DetectedObjectObservation(boundingBox: detectedRect)
var request = TrackObjectRequest(observation: initialObservation)
request.trackingLevel = .accurate

// For each video frame:
let results = try await request.perform(on: pixelBuffer)
if let tracked = results.first {
    let updatedBounds = tracked.boundingBox
    let confidence = tracked.confidence
}

Legacy: VNTrackObjectRequest

let trackRequest = VNTrackObjectRequest(detectedObjectObservation: initialObservation)
trackRequest.trackingLevel = .accurate

let sequenceHandler = VNSequenceRequestHandler()
// For each frame:
try sequenceHandler.perform([trackRequest], on: pixelBuffer)
if let result = trackRequest.results?.first {
    let updatedBounds = result.boundingBox
    trackRequest.inputObservation = result
}

Other Request Types

Vision provides additional requests covered in references/vision-requests.md:

Request	Purpose
`ClassifyImageRequest`	Classify scene content (outdoor, food, animal, etc.)
`GenerateAttentionBasedSaliencyImageRequest`	Heat map of where viewers focus attention
`GenerateObjectnessBasedSaliencyImageRequest`	Heat map of object-like regions
`GenerateForegroundInstanceMaskRequest`	Foreground object segmentation (not person-specific)
`DetectRectanglesRequest`	Detect rectangular shapes (documents, cards, screens)
`DetectHorizonRequest`	Detect horizon angle for auto-leveling photos
`DetectHumanBodyPoseRequest`	Detect body joints (shoulders, elbows, knees)
`DetectHumanBodyPose3DRequest`	3D human body pose estimation
`DetectHumanHandPoseRequest`	Detect hand joints and finger positions
`DetectAnimalBodyPoseRequest`	Detect animal body joint positions
`DetectFaceCaptureQualityRequest`	Face capture quality scoring (0–1) for photo selection
`TrackRectangleRequest`	Track rectangular objects across video frames
`TrackOpticalFlowRequest`	Optical flow between video frames
`DetectTrajectoriesRequest`	Detect object trajectories in video

All modern request types above are iOS 18+ / macOS 15+.

Core ML Integration

Run custom Core ML models through Vision for automatic image preprocessing (resizing, normalization, color space conversion).

// Modern API (iOS 18+)
let model = try MLModel(contentsOf: modelURL)
let request = CoreMLRequest(model: .init(model))
let results = try await request.perform(on: cgImage)

// Classification model
if let classification = results.first as? ClassificationObservation {
    let label = classification.identifier
    let confidence = classification.confidence
}

// Legacy API
let vnModel = try VNCoreMLModel(for: model)
let legacyRequest = VNCoreMLRequest(model: vnModel) { request, error in
    guard let results = request.results as? [VNClassificationObservation] else { return }
    let topResult = results.first
}
let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([legacyRequest])

For model conversion and optimization, see the coreml skill.

VisionKit: DataScannerViewController

DataScannerViewController provides a full-screen live camera scanner for text and barcodes. See references/visionkit-scanner.md for complete patterns.

Quick Start

import VisionKit

// Check availability (requires A12+ chip and camera)
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else { return }

let scanner = DataScannerViewController(
    recognizedDataTypes: [
        .text(languages: ["en"]),
        .barcode(symbologies: [.qr, .ean13])
    ],
    qualityLevel: .balanced,
    recognizesMultipleItems: true,
    isHighFrameRateTrackingEnabled: true,
    isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}

SwiftUI Integration

Wrap DataScannerViewController in UIViewControllerRepresentable. See references/visionkit-scanner.md for the full implementation.
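Until the reference file is at hand, the wrapper could be sketched roughly as follows. The names LiveScannerView and onScan are hypothetical, and this is not the implementation from references/visionkit-scanner.md; it only shows the general shape of a representable with a coordinator acting as the scanner delegate.

```swift
import SwiftUI
import VisionKit

/// Hypothetical wrapper: reports the text or barcode payload the user taps.
struct LiveScannerView: UIViewControllerRepresentable {
    let onScan: (String) -> Void

    func makeCoordinator() -> Coordinator {
        Coordinator(onScan: onScan)
    }

    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: [.text(), .barcode(symbologies: [.qr])],
            qualityLevel: .balanced,
            isHighlightingEnabled: true
        )
        scanner.delegate = context.coordinator
        return scanner
    }

    func updateUIViewController(_ scanner: DataScannerViewController, context: Context) {
        // Attempt to start scanning; errors (such as an unavailable camera)
        // are ignored in this sketch.
        guard !scanner.isScanning else { return }
        try? scanner.startScanning()
    }

    static func dismantleUIViewController(
        _ scanner: DataScannerViewController,
        coordinator: Coordinator
    ) {
        scanner.stopScanning()
    }

    @MainActor
    final class Coordinator: NSObject, DataScannerViewControllerDelegate {
        private let onScan: (String) -> Void

        init(onScan: @escaping (String) -> Void) {
            self.onScan = onScan
        }

        func dataScanner(
            _ dataScanner: DataScannerViewController,
            didTapOn item: RecognizedItem
        ) {
            switch item {
            case .text(let text):
                onScan(text.transcript)
            case .barcode(let barcode):
                if let payload = barcode.payloadStringValue { onScan(payload) }
            @unknown default:
                break
            }
        }
    }
}
```

The view should only be shown after DataScannerViewController.isSupported and isAvailable have both been checked, as in the Quick Start above.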

Common Mistakes

DON'T: Use the legacy VNImageRequestHandler API for new iOS 18+ projects. DO: Use modern struct-based requests with perform(on:) and async/await. Why: Modern API provides type safety, better Swift concurrency support, and cleaner error handling.

DON'T: Forget to convert normalized coordinates before drawing bounding boxes. DO: Use VNImageRectForNormalizedRect(_:_:_:) or manual conversion from bottom-left origin to UIKit top-left origin. Why: Vision uses normalized coordinates (0...1) with bottom-left origin; UIKit uses points with top-left origin.

DON'T: Run Vision requests on the main thread. DO: Perform requests on a background thread or use async/await from a detached task. Why: Image analysis is CPU/GPU-intensive and blocks the UI if run on the main actor.

DON'T: Use .accurate recognition level for real-time camera feeds. DO: Use .fast for live video, .accurate for still images or offline processing. Why: Accurate recognition is too slow for 30fps video; fast recognition trades quality for speed.

DON'T: Ignore the confidence score on observations. DO: Filter results by confidence threshold (e.g., > 0.5) appropriate for your use case. Why: Low-confidence results are often incorrect and degrade user experience.

DON'T: Create a new VNImageRequestHandler for each frame when tracking objects. DO: Use VNSequenceRequestHandler for video frame sequences. Why: Sequence handler maintains temporal context for tracking; per-frame handlers lose state.

DON'T: Request all barcode symbologies when you only need QR codes. DO: Specify only the symbologies you need in the request. Why: Fewer symbologies means faster detection and fewer false positives.

DON'T: Assume DataScannerViewController is available on all devices. DO: Check both isSupported (hardware) and isAvailable (user permissions) before presenting. Why: Requires A12+ chip; isAvailable also checks camera access authorization.
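Two of the points above, confidence filtering and coordinate conversion, can be illustrated without any Vision types. This is a sketch under stated assumptions: Detection is a hypothetical stand-in for an observation, and the 0.5 threshold is only the example value mentioned above, not a recommendation.

```swift
import Foundation

/// Hypothetical stand-in for any Vision result that carries a confidence
/// score and a normalized bounding box.
struct Detection {
    let label: String
    let confidence: Float      // 0.0 ... 1.0
    let normalizedBox: CGRect  // unit square, bottom-left origin
}

/// Keeps detections above `threshold`, highest confidence first.
func confident(_ detections: [Detection], threshold: Float = 0.5) -> [Detection] {
    detections
        .filter { $0.confidence > threshold }
        .sorted { $0.confidence > $1.confidence }
}

/// Maps a normalized, bottom-left-origin rect into a top-left-origin
/// view of the given size.
func viewRect(for normalizedBox: CGRect, in size: CGSize) -> CGRect {
    CGRect(
        x: normalizedBox.minX * size.width,
        y: (1 - normalizedBox.maxY) * size.height,
        width: normalizedBox.width * size.width,
        height: normalizedBox.height * size.height
    )
}

let detections = [
    Detection(label: "sign", confidence: 0.9,
              normalizedBox: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)),
    Detection(label: "noise", confidence: 0.2,
              normalizedBox: CGRect(x: 0, y: 0, width: 0.1, height: 0.1)),
]
let kept = confident(detections)
let frame = viewRect(for: kept[0].normalizedBox, in: CGSize(width: 200, height: 100))
```

This conversion assumes the view shows the whole image without letterboxing; an aspect-fit or aspect-fill display needs an additional scale and offset.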

Review Checklist

[ ] Uses modern Vision API (iOS 18+) unless targeting older deployments
[ ] Vision requests run off the main thread (async/await or background queue)
[ ] Normalized coordinates converted before UI display
[ ] Confidence threshold applied to filter low-quality observations
[ ] Recognition level matches use case (.fast for video, .accurate for stills)
[ ] Language hints set for text recognition when input language is known
[ ] Barcode symbologies limited to only those needed
[ ] DataScannerViewController availability checked before presentation
[ ] Camera usage description (NSCameraUsageDescription) in Info.plist for VisionKit
[ ] Person segmentation quality level appropriate for use case
[ ] VNSequenceRequestHandler used for video frame tracking (not per-frame handler)
[ ] Error handling covers request failures and empty results

References

Vision request patterns: references/vision-requests.md
VisionKit scanner integration: references/visionkit-scanner.md
Apple docs: Vision | VisionKit | RecognizeTextRequest | DataScannerViewController

Vision Request Patterns

Complete implementation patterns for Vision framework requests covering text recognition, face detection, barcode scanning, segmentation, classification, and video processing. All patterns target iOS 26+ with Swift 6.3 unless noted.

Complete Text Recognition Pipeline
Face Detection with Landmarks
Barcode Detection with All Symbologies
Person Segmentation with Mask Application
Instance Segmentation (iOS 18+)
Image Classification
Saliency Detection
Rectangle Detection
Horizon Detection
Batch Processing Multiple Requests
Video Frame Processing with CMSampleBuffer
Object Tracking Across Video Frames
Coordinate Normalization Utilities
Performance Considerations

Complete Text Recognition Pipeline

Full pipeline from image loading through text extraction with coordinate mapping.

import Vision
import UIKit

@MainActor
final class TextRecognizer {
    func recognizeText(in image: UIImage) async throws -> [RecognizedTextBlock] {
        guard let cgImage = image.cgImage else {
            throw TextRecognitionError.invalidImage
        }

        var request = RecognizeTextRequest()
        request.recognitionLevel = .accurate
        request.recognitionLanguages = [
            Locale.Language(identifier: "en-US"),
        ]
        request.usesLanguageCorrection = true

        let observations = try await request.perform(on: cgImage)
        let imageSize = CGSize(
            width: cgImage.width,
            height: cgImage.height
        )

        return observations.compactMap { observation in
            guard let candidate = observation.topCandidates(1).first else { return nil }
            let boundingBox = observation.boundingBox
            let imageRect = VNImageRectForNormalizedRect(
                boundingBox,
                Int(imageSize.width),
                Int(imageSize.height)
            )
            return RecognizedTextBlock(
                text: candidate.string,
                confidence: candidate.confidence,
                boundingBox: imageRect
            )
        }
    }
}

struct RecognizedTextBlock: Sendable {
    let text: String
    let confidence: Float
    let boundingBox: CGRect
}

enum TextRecognitionError: Error {
    case invalidImage
}

Text Recognition with Language Hints

func recognizeMultilingualText(in cgImage: CGImage) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [
        Locale.Language(identifier: "en-US"),
        Locale.Language(identifier: "fr-FR"),
        Locale.Language(identifier: "de-DE"),
    ]
    request.usesLanguageCorrection = true
    request.customWords = ["iOS", "SwiftUI", "Xcode"]

    let observations = try await request.perform(on: cgImage)
    return observations.compactMap { $0.topCandidates(1).first?.string }
}

Fast Text Recognition for Live Video

func recognizeTextFast(in sampleBuffer: CMSampleBuffer) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .fast
    request.recognitionLanguages = [Locale.Language(identifier: "en-US")]

    let observations = try await request.perform(on: sampleBuffer)
    return observations.compactMap { $0.topCandidates(1).first?.string }
}

Legacy Text Recognition (Pre-iOS 18)

import Vision

func recognizeTextLegacy(
    in cgImage: CGImage,
    completion: @escaping ([String]) -> Void
) {
    let request = VNRecognizeTextRequest { request, error in
        guard error == nil,
              let observations = request.results as? [VNRecognizedTextObservation]
        else {
            completion([])
            return
        }
        let strings = observations.compactMap {
            $0.topCandidates(1).first?.string
        }
        completion(strings)
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US"]
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: cgImage)
    DispatchQueue.global(qos: .userInitiated).async {
        try? handler.perform([request])
    }
}

Face Detection with Landmarks

import Vision

struct DetectedFace: Sendable {
    let boundingBox: CGRect
    let landmarks: FaceLandmarkPoints?
    let roll: Measurement<UnitAngle>
    let yaw: Measurement<UnitAngle>
    let captureQuality: FaceObservation.CaptureQuality?
}

struct FaceLandmarkPoints: Sendable {
    let leftEye: [CGPoint]
    let rightEye: [CGPoint]
    let nose: [CGPoint]
    let outerLips: [CGPoint]
    let faceContour: [CGPoint]
}

func detectFaces(in cgImage: CGImage) async throws -> [DetectedFace] {
    // Detect face rectangles
    let rectRequest = DetectFaceRectanglesRequest()
    let faces = try await rectRequest.perform(on: cgImage)

    // Detect landmarks for detailed features
    let landmarkRequest = DetectFaceLandmarksRequest()
    let landmarkFaces = try await landmarkRequest.perform(on: cgImage)

    // Detect capture quality for photo selection
    let qualityRequest = DetectFaceCaptureQualityRequest()
    let qualityFaces = try await qualityRequest.perform(on: cgImage)

    return faces.enumerated().map { index, face in
        let landmarks: FaceLandmarkPoints?
        if index < landmarkFaces.count,
           let lm = landmarkFaces[index].landmarks {
            landmarks = FaceLandmarkPoints(
                leftEye: lm.leftEye?.normalizedPoints ?? [],
                rightEye: lm.rightEye?.normalizedPoints ?? [],
                nose: lm.nose?.normalizedPoints ?? [],
                outerLips: lm.outerLips?.normalizedPoints ?? [],
                faceContour: lm.faceContour?.normalizedPoints ?? []
            )
        } else {
            landmarks = nil
        }

        let quality: FaceObservation.CaptureQuality?
        if index < qualityFaces.count {
            quality = qualityFaces[index].captureQuality
        } else {
            quality = nil
        }

        return DetectedFace(
            boundingBox: face.boundingBox,
            landmarks: landmarks,
            roll: face.roll,
            yaw: face.yaw,
            captureQuality: quality
        )
    }
}

Barcode Detection with All Symbologies

import Vision

struct DetectedBarcode: Sendable {
    let payload: String?
    let symbology: VNBarcodeSymbology
    let boundingBox: CGRect
}

func detectBarcodes(
    in cgImage: CGImage,
    symbologies: [VNBarcodeSymbology] = [.qr, .ean13, .code128]
) async throws -> [DetectedBarcode] {
    var request = DetectBarcodesRequest()
    request.symbologies = symbologies

    let observations = try await request.perform(on: cgImage)
    return observations.map { barcode in
        DetectedBarcode(
            payload: barcode.payloadString,
            symbology: barcode.symbology,
            boundingBox: barcode.boundingBox
        )
    }
}

// Detect only QR codes with URL content
func detectQRCodes(in cgImage: CGImage) async throws -> [URL] {
    var request = DetectBarcodesRequest()
    request.symbologies = [.qr]

    let observations = try await request.perform(on: cgImage)
    return observations.compactMap { barcode in
        guard let payload = barcode.payloadString else { return nil }
        return URL(string: payload)
    }
}

Supported Symbologies Reference

// 1D barcodes
let linearSymbologies: [VNBarcodeSymbology] = [
    .codabar, .code39, .code39Checksum, .code39FullASCII,
    .code39FullASCIIChecksum, .code93, .code93i, .code128,
    .ean8, .ean13, .gs1DataBar, .gs1DataBarExpanded,
    .gs1DataBarLimited, .i2of5, .i2of5Checksum, .itf14,
    .msiPlessey, .upce,
]

// 2D barcodes
let matrixSymbologies: [VNBarcodeSymbology] = [
    .qr, .aztec, .dataMatrix, .pdf417, .microPDF417, .microQR,
]

Person Segmentation with Mask Application

Modern API (iOS 18+)

import Vision
import CoreImage
import CoreImage.CIFilterBuiltins

func segmentPerson(in cgImage: CGImage) async throws -> CIImage {
    var request = GeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate  // .balanced, .fast

    let observation = try await request.perform(on: cgImage)
    let maskBuffer = observation.pixelBuffer

    let originalImage = CIImage(cgImage: cgImage)
    let maskImage = CIImage(cvPixelBuffer: maskBuffer)

    // Scale mask to match original image size
    let scaleX = originalImage.extent.width / maskImage.extent.width
    let scaleY = originalImage.extent.height / maskImage.extent.height
    let scaledMask = maskImage.transformed(by: CGAffineTransform(
        scaleX: scaleX, y: scaleY
    ))

    return scaledMask
}

// Apply background blur using person mask
func blurBackground(of cgImage: CGImage, blurRadius: Double = 20.0) async throws -> CIImage {
    let mask = try await segmentPerson(in: cgImage)
    let original = CIImage(cgImage: cgImage)

    let blurFilter = CIFilter.gaussianBlur()
    blurFilter.inputImage = original
    blurFilter.radius = Float(blurRadius)
    guard let blurredImage = blurFilter.outputImage else {
        throw SegmentationError.noMask
    }

    let blendFilter = CIFilter.blendWithMask()
    blendFilter.inputImage = original         // foreground (person)
    blendFilter.backgroundImage = blurredImage // blurred background
    blendFilter.maskImage = mask

    guard let result = blendFilter.outputImage else {
        throw SegmentationError.noMask
    }
    return result
}

enum SegmentationError: Error {
    case noMask
}

Legacy API (Pre-iOS 18)

func segmentPersonLegacy(in cgImage: CGImage) throws -> CVPixelBuffer {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = .accurate
    request.outputPixelFormat = kCVPixelFormatType_OneComponent8

    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])

    guard let maskBuffer = request.results?.first?.pixelBuffer else {
        throw SegmentationError.noMask
    }
    return maskBuffer
}

Instance Segmentation (iOS 18+)

Separate masks per person for individual effects.

// Modern API (iOS 18+)
func segmentIndividualPeople(in cgImage: CGImage) async throws -> [CVPixelBuffer] {
    let request = GeneratePersonInstanceMaskRequest()
    let observation = try await request.perform(on: cgImage)

    let indices = observation.allInstances
    return try indices.map { index in
        try observation.generateMask(forInstances: IndexSet(integer: index))
    }
}

// Legacy API (iOS 17+)
func segmentIndividualPeopleLegacy(in cgImage: CGImage) throws -> [CVPixelBuffer] {
    let request = VNGeneratePersonInstanceMaskRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([request])

    guard let result = request.results?.first else { return [] }
    let indices = result.allInstances

    return try indices.map { index in
        try result.generateMask(forInstances: IndexSet(integer: index))
    }
}

Image Classification

import Vision

func classifyImage(_ cgImage: CGImage, maxResults: Int = 5) async throws -> [(String, Float)] {
    let request = ClassifyImageRequest()
    let observations = try await request.perform(on: cgImage)

    return observations.prefix(maxResults).map { observation in
        (observation.identifier, observation.confidence)
    }
}

Saliency Detection

Identify the most visually important or attention-grabbing regions.

// Attention-based saliency (what humans would look at)
func detectAttentionSaliency(in cgImage: CGImage) async throws -> [CGRect] {
    let request = GenerateAttentionBasedSaliencyImageRequest()
    let results = try await request.perform(on: cgImage)
    guard let saliency = results.first else { return [] }
    return saliency.salientObjects?.map(\.boundingBox) ?? []
}

// Objectness-based saliency (distinct objects)
func detectObjectSaliency(in cgImage: CGImage) async throws -> [CGRect] {
    let request = GenerateObjectnessBasedSaliencyImageRequest()
    let results = try await request.perform(on: cgImage)
    guard let saliency = results.first else { return [] }
    return saliency.salientObjects?.map(\.boundingBox) ?? []
}
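A common follow-up to saliency detection is cropping around the salient region. The sketch below is one way to do that, assuming the boxes are normalized rects like those returned above; `salientCropRegion(for:padding:)` and its default padding are hypothetical names and values, not part of Vision.

```swift
import Foundation

/// Hypothetical helper: union of normalized bounding boxes, padded and
/// clamped to the unit square. Returns nil when there are no boxes.
func salientCropRegion(for boxes: [CGRect], padding: CGFloat = 0.05) -> CGRect? {
    guard let first = boxes.first else { return nil }
    var minX = first.minX, minY = first.minY
    var maxX = first.maxX, maxY = first.maxY
    for box in boxes.dropFirst() {
        minX = min(minX, box.minX)
        minY = min(minY, box.minY)
        maxX = max(maxX, box.maxX)
        maxY = max(maxY, box.maxY)
    }
    // Pad the union, then clamp it to the normalized 0...1 range.
    minX = max(0, minX - padding)
    minY = max(0, minY - padding)
    maxX = min(1, maxX + padding)
    maxY = min(1, maxY + padding)
    return CGRect(x: minX, y: minY, width: maxX - minX, height: maxY - minY)
}
```

The result is still normalized with a bottom-left origin, so it needs the same conversion as any other Vision rect before it is used to crop pixels.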

Rectangle Detection

Detect rectangular shapes for document edges, business cards, etc.

func detectRectangles(in cgImage: CGImage) async throws -> [CGRect] {
    var request = DetectRectanglesRequest()
    request.minimumAspectRatio = 0.3
    request.maximumAspectRatio = 1.0
    request.minimumSize = 0.1
    request.maximumObservations = 5

    let observations = try await request.perform(on: cgImage)
    // boundingBox is a NormalizedRect in the modern API; convert it to CGRect
    return observations.map(\.boundingBox.cgRect)
}

Horizon Detection

Detect the horizon angle for auto-straightening photos.

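func detectHorizon(in cgImage: CGImage) async throws -> CGFloat? {
    let request = DetectHorizonRequest()
    // Assumes perform(on:) yields at most one HorizonObservation whose
    // angle is a Measurement<UnitAngle>; the result here is in radians
    guard let horizon = try await request.perform(on: cgImage) else { return nil }
    return CGFloat(horizon.angle.converted(to: .radians).value)
}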

Batch Processing Multiple Requests

Run multiple requests on the same image simultaneously for efficiency.

func analyzeImage(_ cgImage: CGImage) async throws -> ImageAnalysisResult {
    async let textResults = {
        var req = RecognizeTextRequest()
        req.recognitionLevel = .accurate
        return try await req.perform(on: cgImage)
    }()

    async let faceResults = {
        let req = DetectFaceRectanglesRequest()
        return try await req.perform(on: cgImage)
    }()

    async let barcodeResults = {
        var req = DetectBarcodesRequest()
        req.symbologies = [.qr, .ean13]
        return try await req.perform(on: cgImage)
    }()

    let text = try await textResults
    let faces = try await faceResults
    let barcodes = try await barcodeResults

    return ImageAnalysisResult(
        recognizedText: text.compactMap { $0.topCandidates(1).first?.string },
        faceCount: faces.count,
        barcodePayloads: barcodes.compactMap(\.payloadString)
    )
}

struct ImageAnalysisResult: Sendable {
    let recognizedText: [String]
    let faceCount: Int
    let barcodePayloads: [String]
}

Legacy Batch Processing

With the legacy API, pass multiple requests to a single handler call.

func analyzeImageLegacy(_ cgImage: CGImage) throws {
    let textRequest = VNRecognizeTextRequest { request, error in
        // Handle text results
    }
    let faceRequest = VNDetectFaceRectanglesRequest { request, error in
        // Handle face results
    }
    let barcodeRequest = VNDetectBarcodesRequest { request, error in
        // Handle barcode results
    }

    let handler = VNImageRequestHandler(cgImage: cgImage)
    try handler.perform([textRequest, faceRequest, barcodeRequest])
}

Video Frame Processing with CMSampleBuffer

Process live camera frames from AVCaptureSession.

import AVFoundation
import Vision

final class VisionVideoProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate, Sendable {
    private let processingQueue = DispatchQueue(label: "vision.processing", qos: .userInitiated)

    func setupCapture(session: AVCaptureSession) {
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: processingQueue)
        output.alwaysDiscardsLateVideoFrames = true

        if session.canAddOutput(output) {
            session.addOutput(output)
        }
    }

    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        Task {
            do {
                var request = RecognizeTextRequest()
                request.recognitionLevel = .fast
                let observations = try await request.perform(on: sampleBuffer)
                let strings = observations.compactMap {
                    $0.topCandidates(1).first?.string
                }
                // Dispatch results to main actor for UI update
                await MainActor.run {
                    // Update UI with recognized strings
                }
            } catch {
                // Handle error
            }
        }
    }
}
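Because captureOutput above returns as soon as it has spawned its Task, alwaysDiscardsLateVideoFrames alone may not stop recognition work from piling up on slower devices. The sketch below shows one way to gate frames; `FrameThrottle` and its members are hypothetical names, and the type assumes it is only touched from the delegate's serial queue.

```swift
import Foundation

/// Hypothetical helper that admits at most one frame per `minimumInterval`
/// and never more than one frame at a time.
struct FrameThrottle {
    let minimumInterval: TimeInterval
    private var lastAcceptedTime: TimeInterval?
    private(set) var isBusy = false

    init(minimumInterval: TimeInterval) {
        self.minimumInterval = minimumInterval
    }

    /// Returns true if the frame with this timestamp (in seconds) should be processed.
    mutating func shouldProcess(frameAt time: TimeInterval) -> Bool {
        // Skip frames while an earlier frame is still being analyzed.
        guard !isBusy else { return false }
        // Skip frames that arrive too soon after the last accepted one.
        if let last = lastAcceptedTime, time - last < minimumInterval {
            return false
        }
        lastAcceptedTime = time
        isBusy = true
        return true
    }

    /// Call when the Vision request for the accepted frame has finished.
    mutating func finishProcessing() {
        isBusy = false
    }
}
```

In a capture delegate, the timestamp could come from the sample buffer's presentation time, and finishProcessing() would be called when the request completes or throws.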

Object Tracking Across Video Frames

Modern API (iOS 18+)

TrackObjectRequest is a stateful request that maintains tracking context internally. No need for a separate sequence handler.

import Vision

final class ObjectTracker {
    private var request: TrackObjectRequest?

    /// Initialize tracking with a bounding box in normalized coordinates
    func startTracking(boundingBox: CGRect) {
        let observation = DetectedObjectObservation(boundingBox: boundingBox)
        var req = TrackObjectRequest(observation: observation)
        req.trackingLevel = .accurate
        request = req
    }

    /// Track object in next video frame
    func track(in pixelBuffer: CVPixelBuffer) async throws -> CGRect? {
        guard var req = request else { return nil }

        let results = try await req.perform(on: pixelBuffer)
        guard let tracked = results.first, tracked.confidence > 0.3 else {
            request = nil
            return nil
        }

        request = req  // preserve stateful tracking context
        return tracked.boundingBox
    }

    func stopTracking() {
        request = nil
    }
}

Legacy API

final class LegacyObjectTracker {
    private var sequenceHandler = VNSequenceRequestHandler()
    private var currentObservation: VNDetectedObjectObservation?

    func startTracking(boundingBox: CGRect) {
        currentObservation = VNDetectedObjectObservation(boundingBox: boundingBox)
    }

    func track(in pixelBuffer: CVPixelBuffer) throws -> CGRect? {
        guard let observation = currentObservation else { return nil }

        let trackRequest = VNTrackObjectRequest(detectedObjectObservation: observation)
        trackRequest.trackingLevel = .accurate

        try sequenceHandler.perform([trackRequest], on: pixelBuffer)

        guard let result = trackRequest.results?.first as? VNDetectedObjectObservation,
              result.confidence > 0.3 else {
            currentObservation = nil
            return nil
        }

        currentObservation = result
        return result.boundingBox
    }

    func stopTracking() {
        currentObservation = nil
    }
}
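Both trackers above give up on the first frame whose confidence falls to 0.3 or below. A more forgiving variant, sketched here as an assumption rather than a Vision feature, tolerates a few weak frames in a row before declaring the object lost. `TrackingLossDetector` and `maxConsecutiveMisses` are hypothetical names.

```swift
/// Hypothetical helper: reports tracking loss only after several
/// consecutive low-confidence frames.
struct TrackingLossDetector {
    let confidenceThreshold: Float
    let maxConsecutiveMisses: Int
    private(set) var consecutiveMisses = 0

    init(confidenceThreshold: Float = 0.3, maxConsecutiveMisses: Int = 3) {
        self.confidenceThreshold = confidenceThreshold
        self.maxConsecutiveMisses = maxConsecutiveMisses
    }

    /// Feed the confidence of each tracked frame.
    /// Returns true once tracking should be stopped.
    mutating func update(confidence: Float) -> Bool {
        if confidence > confidenceThreshold {
            consecutiveMisses = 0      // a good frame resets the streak
        } else {
            consecutiveMisses += 1     // another weak frame in a row
        }
        return consecutiveMisses >= maxConsecutiveMisses
    }
}
```

Whether a weak observation should still seed the next frame is a judgment call; keeping the last confident bounding box is one option.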

Coordinate Normalization Utilities

Vision uses normalized coordinates (0...1) with bottom-left origin. These utilities convert to UIKit/SwiftUI coordinate systems.

import Vision
import UIKit

enum VisionCoordinateConverter {
    /// Convert normalized Vision rect to image-pixel coordinates
    static func toImageCoordinates(
        _ normalizedRect: CGRect,
        imageWidth: Int,
        imageHeight: Int
    ) -> CGRect {
        VNImageRectForNormalizedRect(normalizedRect, imageWidth, imageHeight)
    }

    /// Convert normalized Vision point to image-pixel coordinates
    static func toImageCoordinates(
        _ normalizedPoint: CGPoint,
        imageWidth: Int,
        imageHeight: Int
    ) -> CGPoint {
        VNImagePointForNormalizedPoint(normalizedPoint, imageWidth, imageHeight)
    }

    /// Convert Vision rect (bottom-left origin) to UIKit rect (top-left origin)
    static func toUIKitCoordinates(
        _ normalizedRect: CGRect,
        viewSize: CGSize
    ) -> CGRect {
        let imageRect = VNImageRectForNormalizedRect(
            normalizedRect,
            Int(viewSize.width),
            Int(viewSize.height)
        )
        // Flip Y axis: Vision origin is bottom-left, UIKit is top-left
        return CGRect(
            x: imageRect.origin.x,
            y: viewSize.height - imageRect.origin.y - imageRect.height,
            width: imageRect.width,
            height: imageRect.height
        )
    }

    /// Convert an array of normalized points to UIKit points
    static func toUIKitPoints(
        _ normalizedPoints: [CGPoint],
        viewSize: CGSize
    ) -> [CGPoint] {
        normalizedPoints.map { point in
            CGPoint(
                x: point.x * viewSize.width,
                y: (1.0 - point.y) * viewSize.height  // flip Y
            )
        }
    }
}
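To make the Y flip concrete, here is the same arithmetic without any Vision calls, followed by a worked example. `uiKitRect(fromNormalized:in:)` is a hypothetical name, and the sketch assumes the image fills the view exactly (no aspect-fit letterboxing).

```swift
import Foundation

/// Scales a normalized, bottom-left-origin rect to a view and flips the Y axis.
func uiKitRect(fromNormalized rect: CGRect, in viewSize: CGSize) -> CGRect {
    let x = rect.minX * viewSize.width
    let width = rect.width * viewSize.width
    let height = rect.height * viewSize.height
    // The rect's top edge sits at maxY in bottom-left space, so its
    // distance from the top of the view is (1 - maxY) of the height.
    let y = (1 - rect.maxY) * viewSize.height
    return CGRect(x: x, y: y, width: width, height: height)
}

// A box covering the middle half of the width, from 50% to 75% of the
// height measured from the bottom, in a 400 x 800 point view.
let box = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let converted = uiKitRect(fromNormalized: box, in: CGSize(width: 400, height: 800))
// converted is (x: 100, y: 200, width: 200, height: 200)
```

The box starts 200 points from the top because its upper edge is at 75% of the height from the bottom, which is 25% from the top.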

Performance Considerations

Recognition Level Selection

Use Case	Level	Typical Latency
Live camera preview	`.fast`	~30ms per frame
Photo library scan	`.accurate`	~200-500ms per image
Batch document OCR	`.accurate`	~200-500ms per page
Barcode scanner	`.fast` or `.balanced`	~15-50ms per frame

Memory Management

Reuse VNSequenceRequestHandler across video frames (do not recreate per frame)
For batch processing, process one image at a time to avoid memory spikes
Release CVPixelBuffer references promptly after processing
Use autoreleasepool in tight synchronous loops (legacy API) that process many images; its closure is synchronous, so it cannot wrap an await

func batchProcess(images: [CGImage]) async throws -> [[String]] {
    var allResults: [[String]] = []

    for image in images {
        var request = RecognizeTextRequest()
        request.recognitionLevel = .accurate
        let obs = try await request.perform(on: image)
        let result = obs.compactMap { $0.topCandidates(1).first?.string }
        allResults.append(result)
    }
    return allResults
}

Threading

Modern API (perform(on:)) is async and safe to call from any context
Legacy API: create VNImageRequestHandler and call perform on a background queue
Never block the main thread with Vision requests
VNSequenceRequestHandler is not thread-safe -- use from a single serial queue
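One way to enforce the single-serial-queue rule is to hide the handler behind a small wrapper. The sketch below is generic so that it stands alone; `SerialConfined` and `withResource` are hypothetical names, and in an app the resource would be a VNSequenceRequestHandler.

```swift
import Dispatch

/// Hypothetical wrapper that confines a non-thread-safe resource
/// to one private serial queue.
final class SerialConfined<Resource>: @unchecked Sendable {
    private let queue: DispatchQueue
    private var resource: Resource

    init(_ resource: Resource, label: String) {
        self.resource = resource
        self.queue = DispatchQueue(label: label)  // serial by default
    }

    /// Runs `body` with exclusive access to the resource and returns its result.
    /// Calling this from inside `body` would deadlock.
    func withResource<T>(_ body: (inout Resource) throws -> T) rethrows -> T {
        try queue.sync { try body(&resource) }
    }
}
```

Callers on any thread go through withResource, so every use of the wrapped value is serialized on the private queue.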

Request Reuse

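Modern request structs are value types and cheap to create. Do not try to cache and reuse them across calls -- just create a fresh one each time. The exception is a stateful request such as TrackObjectRequest, which must be kept alive across frames to preserve its tracking context.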

For the legacy API, VNImageRequestHandler is tied to a single image. Create a new handler for each image you process. VNSequenceRequestHandler can be reused across frames in a sequence.

VisionKit Scanner Patterns

Complete implementation patterns for DataScannerViewController and VNDocumentCameraViewController covering availability checking, configuration, SwiftUI integration, delegate handling, custom overlays, and camera permissions. All patterns target iOS 26+ with Swift 6.3 unless noted.

Camera Permission Setup
DataScannerViewController
Delegate Methods
SwiftUI Integration
Custom Overlay UI
VNDocumentCameraViewController

Camera Permission Setup

Add the camera usage description to Info.plist before using any scanner:

<key>NSCameraUsageDescription</key>
<string>Camera access is needed to scan text and barcodes.</string>

Request permission before presenting the scanner:

import AVFoundation

func requestCameraAccess() async -> Bool {
    let status = AVCaptureDevice.authorizationStatus(for: .video)
    switch status {
    case .authorized:
        return true
    case .notDetermined:
        return await AVCaptureDevice.requestAccess(for: .video)
    case .denied, .restricted:
        return false
    @unknown default:
        return false
    }
}

DataScannerViewController

DataScannerViewController provides a full-screen live camera scanner for text and barcodes with built-in highlighting and interaction. Available on devices with an A12 chip or later (iOS 16+).

Availability Checking

Always check both hardware support and runtime availability before presenting.

import VisionKit

@MainActor
func canUseDataScanner() -> Bool {
    // Hardware check: requires A12 Bionic or later
    guard DataScannerViewController.isSupported else {
        return false
    }
    // Runtime check: camera authorized and not restricted
    guard DataScannerViewController.isAvailable else {
        return false
    }
    return true
}

isSupported checks hardware capability (A12+). isAvailable checks that the camera is authorized and not restricted by device management. Both must be true.
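Since the two flags fail for different reasons, it can help to map them to distinct UI states instead of a single Bool. A minimal sketch, with the hypothetical names `ScannerAvailability` and `scannerAvailability(isSupported:isAvailable:)`:

```swift
/// Hypothetical three-way availability state derived from the two flags.
enum ScannerAvailability: Equatable {
    case ready
    case unsupportedHardware   // isSupported is false: no fix on this device
    case cameraUnavailable     // isAvailable is false: permission or restriction
}

func scannerAvailability(isSupported: Bool, isAvailable: Bool) -> ScannerAvailability {
    // Hardware support is checked first because nothing else matters without it.
    guard isSupported else { return .unsupportedHardware }
    guard isAvailable else { return .cameraUnavailable }
    return .ready
}
```

On the main actor this would be called with DataScannerViewController.isSupported and DataScannerViewController.isAvailable. The cameraUnavailable case is the one where pointing the user to Settings may help.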

Configuration and Initialization

import VisionKit

@MainActor
func createTextScanner() -> DataScannerViewController {
    DataScannerViewController(
        recognizedDataTypes: [
            .text(languages: ["en"]),
        ],
        qualityLevel: .balanced,
        recognizesMultipleItems: true,
        isHighFrameRateTrackingEnabled: true,
        isPinchToZoomEnabled: true,
        isGuidanceEnabled: true,
        isHighlightingEnabled: true
    )
}

@MainActor
func createBarcodeScanner() -> DataScannerViewController {
    DataScannerViewController(
        recognizedDataTypes: [
            .barcode(symbologies: [.qr, .ean13, .code128]),
        ],
        qualityLevel: .fast,
        recognizesMultipleItems: false,
        isHighFrameRateTrackingEnabled: false,
        isPinchToZoomEnabled: false,
        isGuidanceEnabled: true,
        isHighlightingEnabled: true
    )
}

@MainActor
func createMixedScanner() -> DataScannerViewController {
    DataScannerViewController(
        recognizedDataTypes: [
            .text(languages: ["en"]),
            .barcode(symbologies: [.qr, .ean13]),
        ],
        qualityLevel: .balanced,
        recognizesMultipleItems: true,
        isHighFrameRateTrackingEnabled: true,
        isPinchToZoomEnabled: true,
        isGuidanceEnabled: true,
        isHighlightingEnabled: true
    )
}

Recognized Data Types

// Text with language hints
let textType: DataScannerViewController.RecognizedDataType =
    .text(languages: ["en", "fr", "de"])

// Text filtered by content type
let emailType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .emailAddress)
let urlType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .URL)
let phoneType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .telephoneNumber)
let addressType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .fullStreetAddress)
let flightType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .flightNumber)
let trackingType: DataScannerViewController.RecognizedDataType =
    .text(textContentType: .shipmentTrackingNumber)

// Barcode with specific symbologies
let qrOnly: DataScannerViewController.RecognizedDataType =
    .barcode(symbologies: [.qr])
let retailBarcodes: DataScannerViewController.RecognizedDataType =
    .barcode(symbologies: [.ean8, .ean13, .upce, .code128])

Quality Levels

Level	Use Case	Notes
`.fast`	Barcode scanning, quick text grab	Lowest latency
`.balanced`	General purpose text + barcode	Default choice
`.accurate`	Detailed OCR, small text	Higher latency

Starting and Stopping

@MainActor
func presentScanner(_ scanner: DataScannerViewController,
                    from presenter: UIViewController) {
    scanner.delegate = presenter as? DataScannerViewControllerDelegate
    presenter.present(scanner, animated: true) {
        try? scanner.startScanning()
    }
}

@MainActor
func dismissScanner(_ scanner: DataScannerViewController) {
    scanner.stopScanning()
    scanner.dismiss(animated: true)
}

Delegate Methods

Implement DataScannerViewControllerDelegate to handle recognized items and scanner lifecycle events.

import Vision      // for VNBarcodeSymbology
import VisionKit

final class ScannerCoordinator: NSObject, DataScannerViewControllerDelegate {

    var onTextRecognized: ((String) -> Void)?
    var onBarcodeRecognized: ((String, VNBarcodeSymbology) -> Void)?

    // Called when the user taps on a recognized item
    func dataScanner(
        _ scanner: DataScannerViewController,
        didTapOn item: RecognizedItem
    ) {
        switch item {
        case .text(let text):
            onTextRecognized?(text.transcript)
        case .barcode(let barcode):
            if let payload = barcode.payloadStringValue {
                onBarcodeRecognized?(payload, barcode.observation.symbology)
            }
        @unknown default:
            break
        }
    }

    // Called when new items appear in the camera view
    func dataScanner(
        _ scanner: DataScannerViewController,
        didAdd addedItems: [RecognizedItem],
        allItems: [RecognizedItem]
    ) {
        for item in addedItems {
            switch item {
            case .text(let text):
                print("New text: \(text.transcript)")
            case .barcode(let barcode):
                print("New barcode: \(barcode.payloadStringValue ?? "nil")")
            @unknown default:
                break
            }
        }
    }

    // Called when items are updated (position or content changes)
    func dataScanner(
        _ scanner: DataScannerViewController,
        didUpdate updatedItems: [RecognizedItem],
        allItems: [RecognizedItem]
    ) {
        // Handle position or content updates
    }

    // Called when items leave the camera view
    func dataScanner(
        _ scanner: DataScannerViewController,
        didRemove removedItems: [RecognizedItem],
        allItems: [RecognizedItem]
    ) {
        // Clean up UI for removed items
    }

    // Called when the scanner becomes unavailable (e.g., camera access revoked)
    func dataScanner(
        _ scanner: DataScannerViewController,
        becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable
    ) {
        // Handle unavailability -- dismiss or show fallback
    }
}

Async Sequence for Recognized Items

Use recognizedItems for a reactive stream of all currently visible items:

func observeRecognizedItems(_ scanner: DataScannerViewController) async {
    for await items in scanner.recognizedItems {
        let texts = items.compactMap { item -> String? in
            guard case .text(let text) = item else { return nil }
            return text.transcript
        }
        let barcodes = items.compactMap { item -> String? in
            guard case .barcode(let barcode) = item else { return nil }
            return barcode.payloadStringValue
        }
        await MainActor.run {
            // Update UI with current texts and barcodes
        }
    }
}

Capturing a Photo

Capture a still image from the scanner for further processing:

func captureAndProcess(_ scanner: DataScannerViewController) async throws {
    let photo = try await scanner.capturePhoto()
    // photo is a UIImage -- process with Vision or save
}

SwiftUI Integration

Wrap DataScannerViewController in UIViewControllerRepresentable for use in SwiftUI views.

Full DataScanner Representable

import SwiftUI
import VisionKit

struct DataScannerRepresentable: UIViewControllerRepresentable {
    let recognizedDataTypes: Set<DataScannerViewController.RecognizedDataType>
    let qualityLevel: DataScannerViewController.QualityLevel
    let recognizesMultipleItems: Bool
    @Binding var recognizedText: [String]
    @Binding var recognizedBarcodes: [String]

    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: recognizedDataTypes,
            qualityLevel: qualityLevel,
            recognizesMultipleItems: recognizesMultipleItems,
            isHighFrameRateTrackingEnabled: true,
            isPinchToZoomEnabled: true,
            isGuidanceEnabled: true,
            isHighlightingEnabled: true
        )
        scanner.delegate = context.coordinator
        return scanner
    }

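    func updateUIViewController(
        _ controller: DataScannerViewController,
        context: Context
    ) {
        // Scanning does not begin on its own; start it once the controller is in place
        if !controller.isScanning {
            try? controller.startScanning()
        }
    }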

    func makeCoordinator() -> Coordinator {
        Coordinator(parent: self)
    }

    static func dismantleUIViewController(
        _ controller: DataScannerViewController,
        coordinator: Coordinator
    ) {
        controller.stopScanning()
    }

    @MainActor
    final class Coordinator: NSObject, DataScannerViewControllerDelegate {
        let parent: DataScannerRepresentable

        init(parent: DataScannerRepresentable) {
            self.parent = parent
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didTapOn item: RecognizedItem
        ) {
            switch item {
            case .text(let text):
                parent.recognizedText.append(text.transcript)
            case .barcode(let barcode):
                if let payload = barcode.payloadStringValue {
                    parent.recognizedBarcodes.append(payload)
                }
            @unknown default:
                break
            }
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didAdd addedItems: [RecognizedItem],
            allItems: [RecognizedItem]
        ) {
            // Handle newly recognized items
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didUpdate updatedItems: [RecognizedItem],
            allItems: [RecognizedItem]
        ) {
            // Handle item updates
        }

        func dataScanner(
            _ scanner: DataScannerViewController,
            didRemove removedItems: [RecognizedItem],
            allItems: [RecognizedItem]
        ) {
            // Handle removed items
        }
    }
}

SwiftUI Scanner View

import SwiftUI
import VisionKit

struct ScannerView: View {
    @State private var recognizedText: [String] = []
    @State private var recognizedBarcodes: [String] = []
    @State private var isShowingScanner = false

    var body: some View {
        VStack {
            if DataScannerViewController.isSupported {
                Button("Scan") {
                    isShowingScanner = true
                }
                .fullScreenCover(isPresented: $isShowingScanner) {
                    NavigationStack {
                        DataScannerRepresentable(
                            recognizedDataTypes: [
                                .text(languages: ["en"]),
                                .barcode(symbologies: [.qr]),
                            ],
                            qualityLevel: .balanced,
                            recognizesMultipleItems: true,
                            recognizedText: $recognizedText,
                            recognizedBarcodes: $recognizedBarcodes
                        )
                        .ignoresSafeArea()
                        .toolbar {
                            ToolbarItem(placement: .cancellationAction) {
                                Button("Done") {
                                    isShowingScanner = false
                                }
                            }
                        }
                    }
                }
            } else {
                ContentUnavailableView(
                    "Scanner Not Available",
                    systemImage: "camera.fill",
                    description: Text("This device does not support scanning.")
                )
            }

            List {
                Section("Text") {
                    ForEach(recognizedText, id: \.self) { text in
                        Text(text)
                    }
                }
                Section("Barcodes") {
                    ForEach(recognizedBarcodes, id: \.self) { barcode in
                        Text(barcode)
                    }
                }
            }
        }
    }
}
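The lists above use id: \.self, which assumes every string is unique, yet tapping the same item twice appends the same transcript twice. One way to keep the arrays duplicate-free is a small helper such as the hypothetical `appendIfNew(_:)` below.

```swift
extension Array where Element: Equatable {
    /// Appends `newElement` only if it is not already present.
    /// Returns true when the element was added.
    @discardableResult
    mutating func appendIfNew(_ newElement: Element) -> Bool {
        guard !contains(newElement) else { return false }
        append(newElement)
        return true
    }
}
```

The coordinator could then call recognizedText.appendIfNew(text.transcript) through its binding instead of append. For long lists, an ordered set would avoid the linear contains check.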

Starting the Scanner After Presentation

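The scanner must be started after the view controller is fully presented. In UIKit, start it in the present completion handler. In a representable there is no completion handler, so defer the call, for example with onAppear and a coordinator flag or with a main-actor Task: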

struct AutoStartScannerRepresentable: UIViewControllerRepresentable {
    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: [.text(languages: ["en"])],
            qualityLevel: .balanced,
            recognizesMultipleItems: false,
            isHighFrameRateTrackingEnabled: true,
            isHighlightingEnabled: true
        )
        scanner.delegate = context.coordinator
        // Defer the start to a later main-actor turn, after this method returns.
        // This is not a timed delay and does not by itself guarantee that presentation has finished.
        Task { @MainActor in
            try? scanner.startScanning()
        }
        return scanner
    }

    func updateUIViewController(
        _ controller: DataScannerViewController,
        context: Context
    ) {}

    func makeCoordinator() -> ScannerCoordinator {
        ScannerCoordinator()
    }

    static func dismantleUIViewController(
        _ controller: DataScannerViewController,
        coordinator: ScannerCoordinator
    ) {
        controller.stopScanning()
    }
}

Custom Overlay UI

Add custom views on top of the scanner for region-of-interest indicators, instructions, or result display.

Overlay with Region of Interest

struct ScannerWithOverlay: View {
    @State private var isShowingScanner = false
    @State private var lastScannedText = ""

    var body: some View {
        ZStack {
            AutoStartScannerRepresentable()
                .ignoresSafeArea()

            VStack {
                // Top instruction bar
                Text("Point camera at text or barcode")
                    .font(.subheadline)
                    .padding(.horizontal)
                    .padding(.vertical)
                    .background(.ultraThinMaterial, in: Capsule())
                    .padding(.top)

                Spacer()

                // Scan region indicator
                RoundedRectangle(cornerRadius: 12)
                    .strokeBorder(.white.opacity(0.6), lineWidth: 2)
                    .frame(width: 280, height: 180)

                Spacer()

                // Result display
                if !lastScannedText.isEmpty {
                    Text(lastScannedText)
                        .font(.body)
                        .padding()
                        .frame(maxWidth: .infinity)
                        .background(.ultraThinMaterial)
                        .clipShape(.rect(cornerRadius: 12))
                        .padding()
                }
            }
        }
    }
}

VNDocumentCameraViewController

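VNDocumentCameraViewController provides a full-screen document camera with auto-capture, perspective correction, and multi-page scanning. It is available from iOS 13; check VNDocumentCameraViewController.isSupported before presenting, since not every device or environment can run it.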

UIKit Presentation

import VisionKit

final class DocumentScannerPresenter: NSObject,
    VNDocumentCameraViewControllerDelegate
{
    weak var presenter: UIViewController?

    func showDocumentScanner() {
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        presenter?.present(scanner, animated: true)
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFinishWith scan: VNDocumentCameraScan
    ) {
        controller.dismiss(animated: true)
        for pageIndex in 0..<scan.pageCount {
            let pageImage = scan.imageOfPage(at: pageIndex)
            // Process each scanned page image
        }
    }

    func documentCameraViewControllerDidCancel(
        _ controller: VNDocumentCameraViewController
    ) {
        controller.dismiss(animated: true)
    }

    func documentCameraViewController(
        _ controller: VNDocumentCameraViewController,
        didFailWithError error: Error
    ) {
        controller.dismiss(animated: true)
        // Handle scanning error
    }
}

SwiftUI Document Scanner

import SwiftUI
import VisionKit

struct DocumentScannerRepresentable: UIViewControllerRepresentable {
    @Binding var scannedImages: [UIImage]
    @Environment(\.dismiss) private var dismiss

    func makeUIViewController(context: Context) -> VNDocumentCameraViewController {
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = context.coordinator
        return scanner
    }

    func updateUIViewController(
        _ controller: VNDocumentCameraViewController,
        context: Context
    ) {}

    func makeCoordinator() -> Coordinator {
        Coordinator(parent: self)
    }

    @MainActor
    final class Coordinator: NSObject, VNDocumentCameraViewControllerDelegate {
        let parent: DocumentScannerRepresentable

        init(parent: DocumentScannerRepresentable) {
            self.parent = parent
        }

        func documentCameraViewController(
            _ controller: VNDocumentCameraViewController,
            didFinishWith scan: VNDocumentCameraScan
        ) {
            parent.scannedImages = (0..<scan.pageCount).map { scan.imageOfPage(at: $0) }
            parent.dismiss()
        }

        func documentCameraViewControllerDidCancel(
            _ controller: VNDocumentCameraViewController
        ) {
            parent.dismiss()
        }

        func documentCameraViewController(
            _ controller: VNDocumentCameraViewController,
            didFailWithError error: Error
        ) {
            parent.dismiss()
        }
    }
}

Document Scanner with OCR Pipeline

Combine document scanning with Vision text recognition for a complete OCR flow:

import SwiftUI
import VisionKit
import Vision

@MainActor
@Observable
final class DocumentOCRModel {
    var scannedPages: [UIImage] = []
    var extractedText: [String] = []
    var isProcessing = false

    func processScannedPages() async {
        isProcessing = true
        defer { isProcessing = false }

        extractedText = []
        for page in scannedPages {
            guard let cgImage = page.cgImage else { continue }
            do {
                var request = RecognizeTextRequest()
                request.recognitionLevel = .accurate
                request.recognitionLanguages = [Locale.Language(identifier: "en-US")]
                request.usesLanguageCorrection = true

                let observations = try await request.perform(on: cgImage)
                let pageText = observations
                    .compactMap { $0.topCandidates(1).first?.string }
                    .joined(separator: "\n")
                extractedText.append(pageText)
            } catch {
                extractedText.append("[Recognition failed]")
            }
        }
    }
}

struct DocumentOCRView: View {
    @State private var model = DocumentOCRModel()
    @State private var isShowingScanner = false

    var body: some View {
        NavigationStack {
            List {
                if model.isProcessing {
                    ProgressView("Recognizing text...")
                }
                ForEach(Array(model.extractedText.enumerated()), id: \.offset) { index, text in
                    Section("Page \(index + 1)") {
                        Text(text)
                            .font(.body)
                            .textSelection(.enabled)
                    }
                }
            }
            .navigationTitle("Document OCR")
            .toolbar {
                Button("Scan") {
                    isShowingScanner = true
                }
            }
            .fullScreenCover(isPresented: $isShowingScanner) {
                DocumentScannerRepresentable(scannedImages: $model.scannedPages)
            }
            .onChange(of: model.scannedPages) {
                Task { await model.processScannedPages() }
            }
        }
    }
}
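The pipeline above joins lines in whatever order the observations arrive. If an explicit top-to-bottom, left-to-right order is needed, one simple approach is to bucket lines into rows by vertical position. This is a sketch under stated assumptions: `RecognizedLine`, `inReadingOrder(_:rowHeight:)`, and the 0.02 row height are hypothetical, and the boxes are normalized with a bottom-left origin.

```swift
import Foundation

/// Hypothetical value pairing recognized text with its normalized box.
struct RecognizedLine {
    let text: String
    let boundingBox: CGRect   // normalized, bottom-left origin
}

/// Sorts lines top-to-bottom, then left-to-right within the same row.
func inReadingOrder(_ lines: [RecognizedLine], rowHeight: CGFloat = 0.02) -> [RecognizedLine] {
    func row(_ line: RecognizedLine) -> Int {
        // A larger midY is higher on the page, so invert before bucketing.
        Int(((1 - line.boundingBox.midY) / rowHeight).rounded(.down))
    }
    return lines.sorted { a, b in
        let (rowA, rowB) = (row(a), row(b))
        if rowA != rowB { return rowA < rowB }
        return a.boundingBox.minX < b.boundingBox.minX
    }
}
```

Fixed-height rows are crude: two lines that straddle a bucket boundary can land in different rows, and multi-column layouts need more than this.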

Performance Considerations

DataScannerViewController

Use .fast quality for barcode-only scanning
Set recognizesMultipleItems = false when only one result is needed
Disable isHighFrameRateTrackingEnabled for barcode scanning to save power
Limit recognizedDataTypes to only what you need
Stop scanning when processing results to avoid wasted CPU cycles

VNDocumentCameraViewController

Pages are returned as UIImage at full resolution -- resize before processing if memory is a concern

Process pages sequentially to avoid memory spikes
Use autoreleasepool when processing many pages in a loop
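For the resize step, the target size can be computed independently of how the image is drawn. A minimal sketch, where `scaledSize(for:maxDimension:)` is a hypothetical name and the choice of limit is up to the app:

```swift
import Foundation

/// Returns a size whose longest side is at most `maxDimension`,
/// preserving aspect ratio. Smaller inputs are returned unchanged.
func scaledSize(for size: CGSize, maxDimension: CGFloat) -> CGSize {
    let longest = max(size.width, size.height)
    guard longest > maxDimension, longest > 0 else { return size }
    let scale = maxDimension / longest
    return CGSize(
        width: (size.width * scale).rounded(),
        height: (size.height * scale).rounded()
    )
}
```

The result could be passed to UIImage.preparingThumbnail(of:) or a UIGraphicsImageRenderer. Shrinking too far can hurt OCR accuracy on small text, so the limit is worth testing against real scans.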


How it compares

Choose vision-framework over cloud OCR APIs when building privacy-preserving on-device vision with Apple Vision, VisionKit, and Core ML on iOS.

FAQ

Which Vision API should new code use?

Prefer the modern iOS 18+ async perform API on request structs such as RecognizeTextRequest; use the legacy VNImageRequestHandler pattern for older deployment targets.

Does Vision run on device or cloud?

Vision requests run on-device for text, faces, barcodes, and Core ML model inference.

When should I use VisionKit DataScannerViewController?

Use it when you need a live camera scanning UI with real-time barcode or text detection hosted in a view controller.

Is Vision Framework safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

