Apple On Device Ai

Name: Apple On Device Ai
Author: dpearson2699

dpearson2699/swift-ios-skills

2.9k installs
944 repo stars
Updated July 15, 2026

apple-on-device-ai is a Swift iOS skill that selects and integrates Foundation Models, Core ML, MLX Swift, or llama.cpp for on-device inference.

About

The apple-on-device-ai skill routes on-device machine learning work across Apple Foundation Models, Core ML, MLX Swift, and llama.cpp, with selection criteria by use case and OS version. Foundation Models targets text generation, summarization, structured output with Generable types, and tool calling on iOS 26+ devices with Apple Intelligence, always after availability and locale checks. Core ML covers custom vision, NLP, and audio models converted via coremltools with quantization and Neural Engine optimization. MLX Swift delivers the highest sustained LLM throughput on Apple Silicon, while llama.cpp supports cross-platform inference with GGUF models. Sections document LanguageModelSession management, streaming PartiallyGenerated output, Tool protocol registration, error handling for guardrails and context limits, and multi-backend fallback architecture with a coordinator actor. Common mistakes include skipping availability checks, issuing concurrent requests on one session, placing untrusted content in instructions, omitting model.eval() before tracing, and exceeding 60% of total RAM on iOS for MLX models.

Framework router for Foundation Models, Core ML, MLX Swift, and llama.cpp.
Foundation Models availability, locale, Generable, and tool calling patterns.
Core ML conversion, quantization, and Neural Engine optimization overview.
Multi-backend fallback coordinator actor for mixed runtimes.
Review checklist for availability, token budget, and device testing.

Apple On Device Ai by the numbers

2,863 all-time installs (skills.sh)
+129 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #57 of 1,039 Mobile Development skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

At a glance

apple-on-device-ai capabilities & compatibility

Capabilities: framework selection router by use case and os · foundation models session, generable, and tool c · core ml conversion and optimization overview · mlx swift and llama.cpp inference patterns · availability and locale guardrails with fallback · multi backend coordinator actor serialization
Use cases: orchestration · api development · frontend
Platforms: macOS
Pricing: Free

From the docs

What apple-on-device-ai says it does

Always check before using. Never crash on unavailability.

SKILL.md

One request at a time per session (check `session.isResponding`)

SKILL.md

Never exceed 60% of total RAM on iOS

SKILL.md

npx skills add https://github.com/dpearson2699/swift-ios-skills --skill apple-on-device-ai


Installs	2.9k
repo stars	★ 944
Security audit	2 / 3 scanners passed
Last updated	July 15, 2026
Repository	dpearson2699/swift-ios-skills ↗

Which Apple on-device AI framework should I use for text, vision, or open-source LLM inference on iOS and macOS?

Choose and integrate on-device AI with Foundation Models, Core ML, MLX Swift, or llama.cpp on Apple platforms.

Who is it for?

iOS developers adding private on-device AI with Apple Intelligence, custom Core ML models, or MLX and GGUF runtimes.

Skip if: you only need server-side OpenAI calls, Android inference, or pure UI layout without model integration.

When should I use this skill?

User builds on-device AI, Foundation Models sessions, Core ML conversion, MLX Swift LLMs, or llama.cpp GGUF on Apple platforms.

What you get

Framework choice, availability-gated session setup, structured output or model loading pattern, and fallback architecture guidance.

on-device inference integration
converted Core ML model
guided generation schema

By the numbers

Covers 4 on-device AI runtimes: Foundation Models, Core ML, MLX Swift, and llama.cpp

Files

SKILL.md

On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.

Framework Selection Router
Apple Foundation Models Overview
Core ML Overview
MLX Swift Overview
Multi-Backend Architecture
Performance Best Practices
Common Mistakes
Review Checklist
References

Framework Selection Router

Use this decision tree to pick the right framework for your use case.

Apple Foundation Models

When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. No app-managed API key, network round trip, or model hosting; still handle system model asset readiness.

Best for:

Generating text or structured data with @Generable types
Summarization, classification, content tagging
Tool-augmented generation with the Tool protocol
Apps that need guaranteed on-device privacy

Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.

Core ML

When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.

Best for:

Image classification, object detection, segmentation
Custom NLP classifiers, sentiment analysis models
Audio/speech models via SoundAnalysis integration
Any scenario needing Neural Engine optimization
Models requiring quantization, palettization, or pruning

MLX Swift

When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.

Best for:

Highest sustained token generation on Apple Silicon
Running Hugging Face models from mlx-community
Research requiring automatic differentiation
Fine-tuning workflows on Mac

llama.cpp

When to use: Cross-platform LLM inference using GGUF model format. Production deployments needing broad device support.

Best for:

GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
Cross-platform apps (iOS + Android + desktop)
Maximum compatibility with open-source model ecosystem

Quick Reference

Scenario	Framework
Text generation on Apple Intelligence devices (iOS 26+)	Foundation Models
Structured output from on-device LLM	Foundation Models (`@Generable`)
Image classification, object detection	Core ML
Custom model from PyTorch/TensorFlow	Core ML + coremltools
Running specific open-source LLMs	MLX Swift or llama.cpp
Maximum throughput on Apple Silicon	MLX Swift
Cross-platform LLM inference	llama.cpp
OCR and text recognition	Vision framework
Sentiment analysis, NER, tokenization	Natural Language framework
Training custom classifiers on device	Create ML
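The LLM and Core ML rows of the table above can be read as a small decision function. The sketch below is illustrative only: `Requirements`, `pick_framework`, and the task labels are hypothetical names introduced here, not part of the skill or of any Apple API, and the Vision, Natural Language, and Create ML rows are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    # Hypothetical task labels: "text", "vision", "custom_model", "open_llm"
    task: str
    # True when the device runs iOS 26+/macOS 26+ with Apple Intelligence enabled
    apple_intelligence: bool = False
    # True when the same model must also run on Android or desktop
    cross_platform: bool = False


def pick_framework(req: Requirements) -> str:
    """Return a framework name following the quick-reference table."""
    if req.task == "text" and req.apple_intelligence:
        return "Foundation Models"
    if req.task in ("vision", "custom_model"):
        return "Core ML"
    if req.task in ("text", "open_llm"):
        # Open-source LLM path: GGUF for portability, MLX for throughput
        return "llama.cpp" if req.cross_platform else "MLX Swift"
    raise ValueError(f"no route for task {req.task!r}")


print(pick_framework(Requirements(task="text", apple_intelligence=True)))
print(pick_framework(Requirements(task="open_llm", cross_platform=True)))
```

A real router would also consult the runtime availability check described in the Foundation Models section rather than a static flag.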

Apple Foundation Models Overview

On-device language model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).

Token budget covers input + output; check contextSize for the limit
Resolve locale before generation by checking supportsLocale(_:) against Locale.current and preferred fallbacks; do not raw-match supportedLanguages
Guardrails always enforced, cannot be disabled

Availability Checking (Required)

Always check before using. Never crash on unavailability.

import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    guard SystemLanguageModel.default.supportsLocale(Locale.current) else {
        // Use locale fallback before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // System model assets are not ready; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
case .unavailable(let reason):
    // Unknown or future unavailable reason; use fallback and log reason
    print("Foundation Models unavailable: \(reason)")
}

Session Management

// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}

Key rules:

Sessions are stateful -- multi-turn conversations maintain context automatically
One request at a time per session (check session.isResponding)
Call session.prewarm() before user interaction for faster first response
Save/restore transcripts: LanguageModelSession(model: model, tools: [], transcript: savedTranscript)

Structured Output with `@Generable`

The @Generable macro creates compile-time schemas for type-safe output:

@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)

`@Guide` Constraints

Constraint	Purpose
`description:`	Natural language hint for generation
`.anyOf([values])`	Restrict to enumerated string values
`.count(n)`	Fixed array length
`.range(min...max)`	Numeric range
`.minimum(n)` / `.maximum(n)`	One-sided numeric bound
`.minimumCount(n)` / `.maximumCount(n)`	Array length bounds
`.constant(value)`	Always returns this value
`.pattern(regex)`	String format enforcement
`.element(guide)`	Guide applied to each array element

Properties generate in declaration order. Place foundational data before dependent data for better results.

Streaming Structured Output

let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}

Tool Calling

struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}

Register only necessary tools at session creation. Tool is Sendable; tool descriptors and @Generable schemas consume the shared context window. The model chooses when to call tools, so prefetch deterministic required data into the prompt and reserve autonomous tools for dynamic lookups.

Error Handling

do {
    let response = try await session.respond(to: prompt)
    print(response.content)
} catch let error as LanguageModelSession.GenerationError {
    // Each case needs at least one statement to compile; `break` stands in
    // for the handling described in the comment.
    switch error {
    case .guardrailViolation:
        // Content triggered safety filters
        break
    case .exceededContextWindowSize:
        // Too many tokens; summarize and retry
        break
    case .concurrentRequests:
        // Another request is in progress on this session
        break
    case .unsupportedLanguageOrLocale:
        // Current locale not supported
        break
    case .unsupportedGuide:
        // A @Guide constraint is not supported
        break
    case .assetsUnavailable:
        // Model assets not available on device
        break
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
        _ = refusal
    case .rateLimited:
        // Too many requests; back off and retry
        break
    case .decodingFailure:
        // Response could not be decoded into the expected type
        break
    default: break
    }
}

Generation Options

let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)

Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).

Prompt Design Rules

1. Be concise -- use tokenCount(for:) to monitor the context window budget
2. Use bracketed placeholders in instructions: [descriptive example]
3. Use "DO NOT" in all caps for prohibitions
4. Provide up to 5 few-shot examples for consistency
5. Use length qualifiers: "in a few words", "in three sentences"

Safety and Guardrails

Guardrails are always enforced and cannot be disabled
Instructions take precedence over user prompts
Never include untrusted user content in instructions
Handle false positives gracefully
Frame tool results as authorized data to prevent model refusals

Use Cases

Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:

.general -- Default for text generation, summarization, dialog
.contentTagging -- Optimized for categorization and labeling tasks

Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):

let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)

See references/foundation-models.md for the complete Foundation Models API reference.

Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).

Model Formats

Extension	Contents (model type)	When to Use
`.mlpackage`	Directory (mlprogram)	All new models (iOS 15+)
`.mlmodel`	Single file (neuralnetwork)	Legacy only (iOS 11-14)
`.mlmodelc`	Compiled	Pre-compiled for faster loading

Always use mlprogram (.mlpackage) for new work.

Conversion Pipeline (coremltools)

import torch
import coremltools as ct

# PyTorch conversion (torch.jit.trace)
# Assumes `model` is your torch.nn.Module and `example_input` is a tensor
# matching the declared input shape, e.g. torch.rand(1, 3, 224, 224)
model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")

Optimization Techniques

Technique	Size Reduction	Accuracy Impact	Best Compute Unit
INT8 per-channel	~4x	Low	CPU/GPU
INT4 per-block	~8x	Medium	GPU
Palettization 4-bit	~8x	Low-Medium	Neural Engine
W8A8 (weights+activations)	~4x	Low	ANE (A17 Pro/M4+)
Pruning 75%	~4x	Medium	CPU/ANE
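The size-reduction column can be sanity-checked with simple arithmetic. The sketch below assumes the ratios are measured against a Float32 baseline and ignores quantization scales, lookup tables, and sparse-index overhead. The function names are hypothetical. Against a Float16 baseline, which is the mlprogram default precision, the quantization ratios would be half as large.

```python
def quantized_ratio(bits: int, baseline_bits: int = 32) -> float:
    """Size reduction from storing each weight in `bits` instead of `baseline_bits`."""
    return baseline_bits / bits


def pruned_ratio(sparsity: float) -> float:
    """Size reduction when a fraction `sparsity` of weights is zeroed and not stored."""
    return 1.0 / (1.0 - sparsity)


def estimated_size_mb(num_params: int, bits: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits / 8 / 1e6


print(quantized_ratio(8))    # INT8 vs Float32
print(quantized_ratio(4))    # INT4 or 4-bit palettization vs Float32
print(pruned_ratio(0.75))    # 75% of weights removed
print(estimated_size_mb(25_000_000, 8))
```

These figures are upper bounds on the savings; measured package sizes and accuracy still need to be validated per model.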

Boundary with `coreml`

This skill owns Python-side conversion, compression, profiling, and framework selection. Use the sibling coreml skill for Swift app integration, prediction APIs, runtime configuration, Vision request wiring, and detailed model loading.

See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.

MLX Swift Overview

Swift API for MLX, Apple's open-source array framework for machine learning on Apple Silicon. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Loading and Running LLMs

import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: TokenizersLoader(),
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)
let session = ChatSession(container)
print(try await session.respond(to: "Hello"))

Model Selection by Device

Device	RAM	Recommended Model	RAM Usage
iPhone 12-14	4-6 GB	SmolLM2-135M or Qwen 2.5 0.5B	~0.3 GB
iPhone 15 Pro+	8 GB	Gemma 3n E4B 4-bit	~3.5 GB
Mac 8 GB	8 GB	Llama 3.2 3B 4-bit	~3 GB
Mac 16 GB+	16 GB+	Mistral 7B 4-bit	~6 GB

Memory Management

1. Never exceed 60% of total RAM on iOS
2. Set MLX cache limits: Memory.cacheLimit = 512 * 1024 * 1024
3. Unload MLX and llama.cpp models on backgrounding or memory pressure; for MLX, also call Memory.clearCache() after generation-heavy phases
4. Use "Increased Memory Limit" entitlement for larger models
5. Validate MLX Swift and llama.cpp on physical Apple Silicon; Simulator cannot exercise Metal-dependent inference, memory, or performance
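The 60% rule can be checked against the device table with a few lines of arithmetic. This is a minimal sketch with hypothetical function names. It treats the table's "RAM Usage" figure as the whole footprint, so KV cache growth and the app's own memory would have to be added on top in practice.

```python
def ram_budget_gb(total_ram_gb: float, fraction: float = 0.6) -> float:
    """Upper bound on model memory under the 60%-of-total-RAM guideline."""
    return total_ram_gb * fraction


def fits_ram_budget(model_gb: float, total_ram_gb: float, fraction: float = 0.6) -> bool:
    """True when the estimated model footprint stays within the budget."""
    return model_gb <= ram_budget_gb(total_ram_gb, fraction)


# Rows taken from the device table: (label, total RAM in GB, model RAM usage in GB)
candidates = [
    ("iPhone 15 Pro+ / Gemma 3n E4B 4-bit", 8.0, 3.5),
    ("8 GB device / Mistral 7B 4-bit", 8.0, 6.0),
]
for label, total, usage in candidates:
    print(label, "fits" if fits_ram_budget(usage, total) else "exceeds budget")
```

On an 8 GB device the budget works out to about 4.8 GB, which is why the table pairs the 7B model with 16 GB machines.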

See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.

Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):

func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}

Serialize all model access through a coordinator actor to prevent contention:

actor ModelCoordinator {
    func withExclusiveAccess<T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}

For custom Core ML models, cover only the conversion/optimization handoff here; route Swift app integration, model loading, Vision wiring, and prediction lifecycle to the sibling coreml skill. Keep private user content, such as journals, on device unless the product explicitly opts into a nonlocal fallback.

Performance Best Practices

1. Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
2. Call session.prewarm() for Foundation Models before user interaction
3. Pre-compile Core ML models to .mlmodelc for faster loading
4. Use EnumeratedShapes over RangeDim for Neural Engine optimization
5. Use 4-bit palettization for best Neural Engine memory/latency gains
6. Hand off detailed Vision, Natural Language, and Swift Core ML runtime integration to the sibling framework skills

Common Mistakes

1. No availability check. Starting generation without checking SystemLanguageModel.default.availability leaves unsupported devices with failures instead of fallback UI.
2. No fallback UI. Users on pre-iOS 26 or devices without Apple Intelligence see nothing. Always provide a graceful degradation path.
3. Exceeding the context window. The token budget covers input + output. Monitor usage via tokenCount(for:) and summarize when needed.
4. Concurrent requests on one session. LanguageModelSession supports one request at a time. Check session.isResponding or serialize access.
5. Untrusted content in instructions. User input placed in the instructions parameter bypasses guardrail boundaries. Keep user content in the prompt.
6. Forgetting `model.eval()` before Core ML tracing. PyTorch models must be in eval mode before torch.jit.trace. Training-mode artifacts corrupt output.
7. Using neuralnetwork format. Always use mlprogram (.mlpackage) for new Core ML models. The legacy neuralnetwork format is deprecated.
8. Exceeding 60% RAM on iOS (MLX Swift). Large models cause OOM kills.
9. Trusting MLX simulator results. Validate Metal-dependent behavior on physical devices; Simulator is only a UI/control-flow smoke test.
10. Not clearing MLX caches. Pair model unload with Memory.clearCache().

Review Checklist

[ ] Framework selection matches use case and target OS version
[ ] Foundation Models: availability checked before every API call
[ ] Foundation Models: graceful fallback when model unavailable
[ ] Foundation Models: session prewarm called before user interaction
[ ] Foundation Models: @Generable properties in logical generation order
[ ] Foundation Models: token budget accounted for (check contextSize)
[ ] Core ML: model format is mlprogram (.mlpackage) for iOS 15+
[ ] Core ML: conversion, deployment target, and compression validated
[ ] MLX Swift: model size appropriate for target device RAM
[ ] MLX Swift: cache limits set, caches cleared, models unloaded
[ ] All model access serialized through coordinator actor
[ ] Concurrency: model types and tool implementations are Sendable-conformant or @MainActor-isolated
[ ] Physical device testing performed (not simulator)

References

Foundation Models API -- LanguageModelSession, @Generable, tool calling, prompt design
Core ML Conversion -- Model conversion from PyTorch, TensorFlow, other frameworks
Core ML Optimization -- Quantization, palettization, pruning, performance tuning
MLX Swift & llama.cpp -- MLX Swift patterns, llama.cpp integration, memory management

{
  "skill_name": "apple-on-device-ai",
  "evals": [
    {
      "id": 0,
      "name": "foundation-models-tool-calling-plan",
      "prompt": "I'm adding an iOS 26 meal-planning feature that uses Apple's on-device Foundation Models to return a typed MealPlan, call a pantry lookup tool, stream partial results into SwiftUI, and fall back when Apple Intelligence is unavailable. Sketch the implementation plan and review checklist for another engineer.",
      "expected_output": "A Foundation Models implementation outline that is availability-safe, locale-aware, schema-conscious, tool-calling-aware, and ready for SwiftUI streaming without overclaiming model readiness.",
      "files": [],
      "expectations": [
        "Checks `SystemLanguageModel.default.availability` before generation and handles `appleIntelligenceNotEnabled`, `modelNotReady`, and `deviceNotEligible` with fallback UI.",
        "Uses `LanguageModelSession`, `@Generable`, concise `@Guide` constraints, and structured streaming with partial generated content.",
        "Prefers `supportsLocale(_:)` for locale gating and accounts for the shared input/output context window with `contextSize` or token counting.",
        "Defines a small set of `Sendable` tools at session creation, notes that tool schemas consume context, and does not promise forced tool invocation.",
        "Covers generation error handling for guardrails, context overflow, concurrent requests, unsupported guide or locale, assets unavailable, refusals, and rate limiting."
      ]
    },
    {
      "id": 1,
      "name": "coreml-conversion-optimization-review",
      "prompt": "Review this conversion plan before I hand it to an ML engineer: export a PyTorch vision transformer straight from training mode, convert it to a `.mlmodel` neuralnetwork with dynamic RangeDim inputs, quantize everything to INT4, and profile with MLComputePlan on iOS 17.0. Target devices are iPhone 15 Pro and M4 iPad.",
      "expected_output": "A correction-focused Core ML conversion review that replaces stale or risky guidance with modern mlprogram, explicit deployment target, shape, compression, and profiling advice.",
      "files": [],
      "expectations": [
        "Requires `model.eval()` before tracing/export and recommends validating outputs against the source model after conversion.",
        "Uses mlprogram `.mlpackage` for new models with an explicit `minimum_deployment_target` instead of legacy neuralnetwork `.mlmodel` output.",
        "Prefers `EnumeratedShapes` over broad `RangeDim` when valid shapes are known, especially for Neural Engine performance.",
        "Warns that compression must be accuracy-tested and should avoid blindly quantizing every tensor.",
        "Corrects `MLComputePlan` availability to iOS 17.4+ and distinguishes profiling from Swift app integration details."
      ]
    },
    {
      "id": 2,
      "name": "backend-selection-boundary",
      "prompt": "We have a private journaling app that needs summarization on iOS 26 when possible, an open-source LLM fallback for some Apple Silicon devices, and a custom sentiment classifier we already trained. Decide which Apple on-device AI stack belongs to each piece, and call out what should be delegated to the sibling Core ML integration skill rather than explained here.",
      "expected_output": "A boundary-aware backend selection answer that routes text generation to Foundation Models, custom classifier deployment to Core ML conversion/optimization, open-source fallback to MLX Swift or llama.cpp, and detailed Swift Core ML integration to the sibling Core ML skill.",
      "files": [],
      "expectations": [
        "Selects Foundation Models for on-device summarization when Apple Intelligence is available and includes fallback behavior for unsupported devices.",
        "Routes the trained sentiment classifier through Core ML conversion and optimization guidance rather than treating it as a Foundation Models prompt.",
        "Chooses MLX Swift or llama.cpp for open-source LLM fallback based on Apple Silicon throughput, GGUF compatibility, device memory, and cross-platform needs.",
        "Mentions MLX memory limits, background unloading, and physical-device validation for Metal-dependent behavior.",
        "Keeps detailed Swift Core ML app integration, Vision request wiring, and model loading code as a handoff to the sibling Core ML skill."
      ]
    }
  ]
}

Core ML Model Conversion Reference

Complete reference for converting models to Core ML format using coremltools. Use this reference for Python-side export, conversion, deployment-target, shape, and compression decisions. For Swift app runtime wiring, hand off to the sibling coreml skill; in conversion reviews, describe runtime availability in prose unless the user explicitly asks for Swift integration code.

coremltools Installation
Architecture Overview
Model Formats
Unified Conversion API
Converting from PyTorch
Converting from TensorFlow
Converting from scikit-learn
Converting from XGBoost
ONNX Conversion (Deprecated)
Input and Output Types
Flexible Input Shapes
Deployment Targets
Compute Precision
Compute Units
Stateful Models (iOS 18+)
Multifunction Models (iOS 18+)
Model Utilities
Graph Pass Control
Custom Composite Operators
Common Mistakes to Avoid

coremltools Installation

pip install coremltools

Use a fresh virtual environment and verify the wheel matrix for your Python and source-framework versions. The 9.0 release publishes wheels through CPython 3.13 and adds iOS 26 / macOS 26 deployment targets.

Architecture Overview

Your App (SwiftUI / UIKit)
  |-- Vision, Natural Language, SoundAnalysis, Foundation Models
  |-- Core ML (model loading, prediction, compilation)
  |-- Metal Performance Shaders Graph (GPU) / Accelerate (CPU) / Neural Engine (ANE)

Model Formats

Extension	Container	Model Type	When to Use
`.mlpackage`	Directory	mlprogram	All new models (iOS 15+)
`.mlmodel`	Single file	neuralnetwork	Legacy only (iOS 11-14)
`.mlmodelc`	Compiled	Either	Pre-compiled for faster loading

Always use mlprogram (.mlpackage) for new work. Neural network format is frozen and receives no new features.

mlprogram vs neuralnetwork

Aspect	neuralnetwork	mlprogram
GPU precision	Float16 only	Float16 and Float32
Optimization APIs	Limited	Full (quantize, palettize, prune)
Stateful models	No	Yes (iOS 18+)
Multifunction models	No	Yes (iOS 18+)
On-device training	Supported	Not supported
Weight storage	Embedded in protobuf	Separated (memory-efficient)

Unified Conversion API

import coremltools as ct

mlmodel = ct.convert(
    model,                          # PyTorch traced/exported model or TF model
    source='auto',                  # 'auto', 'pytorch', 'tensorflow'
    inputs=None,                    # list of TensorType/ImageType
    outputs=None,                   # list of TensorType/ImageType
    minimum_deployment_target=None, # ct.target.iOS15 through ct.target.iOS26
    convert_to='mlprogram',         # 'mlprogram' (default) or 'neuralnetwork'
    compute_precision=None,         # ct.precision.FLOAT16 (default), FLOAT32
    compute_units=ct.ComputeUnit.ALL,
    skip_model_load=False,          # True when converting on Linux
    states=None,                    # list of StateType for stateful models
    pass_pipeline=None,             # PassPipeline for graph optimization
)
mlmodel.save("Model.mlpackage")

Converting from PyTorch

torch.jit.trace (Recommended)

import torch
import coremltools as ct

model = MyModel()
model.eval()  # CRITICAL: always call eval() before tracing

example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape, name="input")],
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("MyModel.mlpackage")

torch.export (Beta)

import torch
import coremltools as ct

model.eval()
example_inputs = (torch.rand(1, 3, 224, 224),)

# Dynamic shapes defined at export time
batch_dim = torch.export.Dim(name="batch", min=1, max=128)
exported = torch.export.export(model, example_inputs,
    dynamic_shapes={"x": {0: batch_dim}})

mlmodel = ct.convert(exported)

Key difference: torch.export defines dynamic shapes upfront (auto-converted to RangeDim). torch.jit.trace defines shapes in ct.convert() via RangeDim/EnumeratedShapes.

Converting from TensorFlow

import tensorflow as tf
import coremltools as ct

# Keras model
tf_model = tf.keras.applications.MobileNetV2()
mlmodel = ct.convert(tf_model)

# SavedModel directory
mlmodel = ct.convert("/path/to/saved_model/")

# HDF5 file
mlmodel = ct.convert("/path/to/model.h5")

# Frozen graph (.pb)
mlmodel = ct.convert("frozen_graph.pb",
    inputs=[ct.TensorType(shape=(1, 224, 224, 3))])

Converting from scikit-learn

from sklearn.linear_model import LinearRegression
import coremltools as ct

model = LinearRegression()
model.fit(X_train, y_train)

mlmodel = ct.converters.sklearn.convert(
    model, ["feature1", "feature2"], "prediction"
)
mlmodel.save("Regressor.mlmodel")

Converting from XGBoost

import xgboost
import coremltools as ct

model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
# The converter's mode defaults to 'regressor'; request a classifier explicitly
mlmodel = ct.converters.xgboost.convert(model, mode="classifier")

ONNX Conversion (Deprecated)

ONNX direct conversion is deprecated since coremltools 6. Convert from the original framework (PyTorch or TensorFlow) instead. If you only have an ONNX file, convert back to PyTorch first using onnx2torch.

Input and Output Types

TensorType

ct.TensorType(
    name="input",              # must match model input name
    shape=(1, 3, 224, 224),    # tuple of int, RangeDim, or EnumeratedShapes
    dtype=np.float32,          # np.float32, np.float16, np.int32, np.int8 (iOS26+)
    default_value=None,        # np.ndarray: makes input optional at runtime
)

ImageType

ct.ImageType(
    name="image",
    shape=(1, 3, 224, 224),
    scale=1/(0.226*255.0),           # one scalar for all channels: 1 / (average std * 255)
    bias=[-0.485/0.229, -0.456/0.224, -0.406/0.225],  # per-channel -mean/std (ImageNet)
    color_layout=ct.colorlayout.RGB, # RGB, BGR, GRAYSCALE, GRAYSCALE_FLOAT16
    channel_first=True,              # True for PyTorch (NCHW), False for TF (NHWC)
)
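ImageType preprocessing has the form `out = scale * pixel + bias[channel]`, with a single scale shared by all channels, so ImageNet normalization `(pixel/255 - mean) / std` can only be matched approximately. The sketch below is pure Python with hypothetical helper names and no coremltools dependency. It compares the exact formula with the scale-and-bias form, folding the average standard deviation into the scale. A scale of `1/255` combined with a `-mean/std` bias would leave out the division by `std` on the pixel term.

```python
MEAN = (0.485, 0.456, 0.406)   # ImageNet per-channel means (R, G, B)
STD = (0.229, 0.224, 0.225)    # ImageNet per-channel standard deviations

AVG_STD = sum(STD) / len(STD)              # approximately 0.226
SCALE = 1.0 / (AVG_STD * 255.0)            # single scalar shared by all channels
BIAS = [-m / s for m, s in zip(MEAN, STD)] # per-channel offsets


def exact_normalize(pixel: float, channel: int) -> float:
    """Reference normalization as usually applied on the PyTorch side."""
    return (pixel / 255.0 - MEAN[channel]) / STD[channel]


def scale_bias_normalize(pixel: float, channel: int) -> float:
    """What a scale-and-bias preprocessing step computes."""
    return SCALE * pixel + BIAS[channel]


for channel in range(3):
    worst = max(
        abs(exact_normalize(p, channel) - scale_bias_normalize(p, channel))
        for p in range(256)
    )
    print(f"channel {channel}: max abs difference {worst:.4f}")
```

If the residual per-channel error matters for a given model, moving normalization into the traced model itself avoids the approximation.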

StateType (iOS 18+)

ct.StateType(
    wrapped_type=ct.TensorType(shape=(1, 8, 128, 64), dtype=np.float16),
    name="kv_cache",
)

Flexible Input Shapes

Fixed Shape

inputs=[ct.TensorType(shape=(1, 3, 224, 224))]

RangeDim (Variable Dimensions)

inputs=[ct.TensorType(shape=(
    1, 3,
    ct.RangeDim(lower_bound=128, upper_bound=512, default=224),
    ct.RangeDim(lower_bound=128, upper_bound=512, default=224),
))]

EnumeratedShapes (Best Performance)

inputs=[ct.TensorType(shape=ct.EnumeratedShapes(
    shapes=[(1,3,224,224), (1,3,384,384), (1,3,512,512)],
    default=(1,3,224,224),
))]

Rule: Prefer EnumeratedShapes over RangeDim when you have a known set of sizes. EnumeratedShapes allows the Neural Engine to optimize for each shape at compilation time. RangeDim only optimizes for the default shape.

Rule: Before iOS 18, only ONE input can use EnumeratedShapes. Starting iOS 18, multiple inputs can use EnumeratedShapes.

Deployment Targets

Target	Model Type	Key Feature Unlocks
`ct.target.iOS13`	neuralnetwork	Basic neural network
`ct.target.iOS15`	mlprogram	FP16 precision, typed tensors
`ct.target.iOS16`	mlprogram	Palettized weights, sparse weights
`ct.target.iOS17`	mlprogram	W8A8 activation quantization (A17 Pro+)
`ct.target.iOS18`	mlprogram	Stateful models, multifunction, per-block quantization
`ct.target.iOS26`	mlprogram	INT8 I/O dtype, state read/write

Corresponding macOS targets: macOS10_15, macOS12, macOS13, macOS14, macOS15, macOS26.

Compute Precision

Value	Description
`ct.precision.FLOAT16`	Default for mlprogram. Smaller, faster.
`ct.precision.FLOAT32`	Higher accuracy. Use when FP16 causes issues.

Mixed Precision (Selective Per-Op)

def keep_layernorm_fp32(op):
    if op.op_type == "layer_norm":
        return False  # keep in FP32
    return True  # convert to FP16

mlmodel = ct.convert(model,
    compute_precision=ct.transform.FP16ComputePrecision(
        op_selector=keep_layernorm_fp32
    ))

Compute Units

Value	Description	When to Use
`.all`	CPU + GPU + Neural Engine	Default, recommended
`.cpuOnly`	CPU exclusively	Debugging, FP32 accuracy
`.cpuAndGPU`	CPU and GPU, no ANE	When ANE causes issues
`.cpuAndNeuralEngine`	CPU and ANE, no GPU	Energy efficiency (macOS 13+)

Stateful Models (iOS 18+)

Persist intermediate values across inference runs. Critical for LLM KV-cache.

Python Conversion

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=(1,), name="x")],
    outputs=[ct.TensorType(name="y")],
    states=[ct.StateType(
        wrapped_type=ct.TensorType(shape=(1,)),
        name="accumulator",
    )],
    minimum_deployment_target=ct.target.iOS18,
)

# Python prediction with state
state = mlmodel.make_state()
result = mlmodel.predict({"x": np.array([2.0])}, state=state)

Swift Usage

let model = try MLModel(contentsOf: modelURL, configuration: config)
let state = model.makeState()
let input = try MLDictionaryFeatureProvider(
    dictionary: ["x": MLFeatureValue(double: 2.0)]
)
let output = try model.prediction(from: input, using: state)

Impact: Llama 3.1 with stateful KV-cache achieves 16.26 tokens/s vs 1.25 tokens/s without (13x improvement).
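To see why a persistent KV-cache matters, the following is a deliberately simplified cost model in plain Python. It only counts how many token positions are fed through the model during generation and ignores the cost of attention over cached entries. The function name and the prompt and output lengths are illustrative assumptions, not measurements.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the model while generating."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens visible at this step
        if use_cache and step > 0:
            total += 1               # only the newest token is fed in
        else:
            total += seq_len         # the whole sequence is (re)processed
    return total

without_cache = tokens_processed(100, 50, use_cache=False)
with_cache = tokens_processed(100, 50, use_cache=True)

# Ratio of the two throughput figures quoted above.
reported_speedup = 16.26 / 1.25
```

Without state, work grows quadratically with the number of generated tokens. With state, each step after the first adds a constant amount, which is the effect the stateful model API is designed to capture.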

Multifunction Models (iOS 18+)

Pack multiple model functions (e.g., LoRA adapters) into a single .mlpackage. Shared weights are deduplicated.

desc = ct.utils.MultiFunctionDescriptor()
desc.add_function("base.mlpackage", src_function_name="main",
                  target_function_name="base")
desc.add_function("adapter1.mlpackage", src_function_name="main",
                  target_function_name="style_1")
desc.default_function_name = "base"
ct.utils.save_multifunction(desc, "combined.mlpackage")

let config = MLModelConfiguration()
config.functionName = "style_1"
let model = try MLModel(contentsOf: modelURL, configuration: config)

Model Utilities

# Inspect model
spec = mlmodel.get_spec()
print(spec.description)

# Rename inputs/outputs
ct.utils.rename_feature(spec, "old_name", "new_name")

# Set metadata
mlmodel.author = "Author"
mlmodel.short_description = "Description"
mlmodel.input_description["image"] = "RGB image 224x224"

# Split large models for debugging
ct.models.utils.bisect_model("large.mlpackage", "./output/")

# Create pipeline from multiple models
pipeline = ct.models.utils.make_pipeline(model1, model2)

# Randomize weights (for testing)
random_model = ct.models.utils.randomize_weights(mlmodel)

Graph Pass Control

# Skip specific optimization passes
pipeline = ct.PassPipeline()
pipeline.remove_passes({"common::fuse_conv_batchnorm"})
mlmodel = ct.convert(model, pass_pipeline=pipeline)

# Predefined pipelines
ct.PassPipeline.EMPTY              # no passes
ct.PassPipeline.CLEANUP            # minimal cleanup
ct.PassPipeline.DEFAULT_PRUNING    # optimized for pruned models
ct.PassPipeline.DEFAULT_PALETTIZATION  # optimized for palettized models

Custom Composite Operators

When PyTorch uses ops not natively supported:

from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op
from coremltools.converters.mil.frontend.torch.ops import _get_inputs
from coremltools.converters.mil import Builder as mb

@register_torch_op
def selu(context, node):
    x = _get_inputs(context, node, expected=1)[0]
    x = mb.elu(x=x, alpha=1.6732632423543772)
    x = mb.mul(x=x, y=1.0507009873554805, name=node.name)
    context.add(x)

Common Mistakes to Avoid

1. Forgetting model.eval(). PyTorch models MUST be in eval mode before tracing or exporting.
2. Using RangeDim when EnumeratedShapes would work. Known input sizes should use EnumeratedShapes for better Neural Engine performance.
3. Targeting neuralnetwork format for new models. Always use mlprogram.
4. Not specifying minimum_deployment_target. Always set it explicitly.
5. Using Float32 when Float16 suffices. Float16 is the default and correct for most models.
6. Converting from ONNX directly. ONNX conversion is deprecated. Convert from the original framework.
7. Not testing on physical devices. Simulator does not support Metal GPU or Neural Engine.

Core ML Optimization Reference

Complete reference for optimizing Core ML models: quantization, palettization, pruning, performance tuning, and profiling. Keep conversion/optimization handoffs focused on Python-side tooling and profiling decisions. Defer Swift prediction/runtime wiring to the sibling coreml skill unless the user explicitly asks for app integration code.

Optimization Technique Selection
Post-Training Weight Quantization (Data-Free)
Palettization (Weight Clustering)
Pruning (Weight Sparsification)
Joint Compression (Stacking Techniques)
Per-Op Configuration
Quantization-Aware Training (QAT)
Swift Integration
MLTensor (iOS 18+)
Neural Engine Best Practices
Model Loading Optimization
Profiling
Common Optimization Mistakes

Optimization Technique Selection

Technique	Size Reduction	Accuracy Impact	Best Compute Unit	Min OS
INT8 per-channel	~4x	Low	CPU/GPU	iOS 16
INT4 per-block	~8x	Medium	GPU	iOS 18
Palettization 4-bit	~8x	Low-Medium	Neural Engine	iOS 16
Palettization 2-bit	~16x	Medium-High	Neural Engine	iOS 16
W8A8 (weights+activations)	~4x	Low	ANE (A17 Pro/M4+)	iOS 17
Pruning 50%	~2x	Low	CPU/ANE	iOS 16
Pruning 75%	~4x	Medium	CPU/ANE	iOS 16

Post-Training Weight Quantization (Data-Free)

INT8 Per-Channel Symmetric

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("model.mlpackage")

op_config = cto.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",  # or "linear" (asymmetric with zero-point)
    weight_threshold=512,     # only quantize tensors with > N elements
)
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed = cto.coreml.linear_quantize_weights(model, config=config)
compressed.save("model_int8.mlpackage")
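As a rough picture of what symmetric linear quantization does to one output channel, here is a plain-Python sketch. It is not the coremltools implementation: the function names are made up for illustration, and the [-127, 127] range is one common convention that is assumed here.

```python
def quantize_symmetric_int8(weights):
    """Quantize one channel to int8 with a single scale and no zero point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

channel = [0.02, -0.51, 0.33, 1.27, -1.0]
q, scale = quantize_symmetric_int8(channel)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(channel, restored))
```

Per-channel granularity means each output channel gets its own scale, so one channel with large weights does not waste resolution for the others. The reconstruction error per weight is bounded by half a quantization step.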

INT4 Per-Block (PyTorch, Data-Free)

import coremltools.optimize as cto

config = cto.torch.quantization.PostTrainingQuantizerConfig.from_dict({
    "global_config": {
        "weight_dtype": "int4",
        "granularity": "per_block",
        "block_size": 128,
    }
})
quantizer = cto.torch.quantization.PostTrainingQuantizer(model, config)
quantized_model = quantizer.compress()

GPTQ Calibration-Based Quantization

config = cto.torch.layerwise_compression.LayerwiseCompressorConfig.from_dict({
    "global_config": {
        "algorithm": "gptq",
        "weight_dtype": 4,
        "granularity": "per_block",
        "block_size": 128,
    },
    "calibration_nsamples": 16,
})
compressor = cto.torch.layerwise_compression.LayerwiseCompressor(model, config)
compressed_model = compressor.compress(calibration_dataloader)

Palettization (Weight Clustering)

Especially effective on the Neural Engine. 4-bit palettization typically preserves accuracy better than 4-bit linear quantization.

Post-Conversion Palettization

op_config = cto.coreml.OpPalettizerConfig(
    mode="kmeans",                     # "kmeans" or "uniform"
    nbits=4,                           # {1, 2, 3, 4, 6, 8}
    granularity="per_grouped_channel", # iOS 18+ for grouped
    group_size=16,
)
config = cto.coreml.OptimizationConfig(global_config=op_config)
palettized = cto.coreml.palettize_weights(model, config=config)
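Palettization stores a small lookup table (LUT) plus an n-bit index per weight. The sketch below shows the idea for the "uniform" mode in plain Python. It is an illustration under simplifying assumptions (one LUT for the whole tensor, evenly spaced entries between the minimum and maximum), and the function name is hypothetical.

```python
def palettize_uniform(weights, nbits):
    """Map each weight to the nearest of 2**nbits evenly spaced LUT entries."""
    levels = 2 ** nbits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    lut = [lo + i * step for i in range(levels)]
    # Each weight is replaced by the index of its nearest LUT entry.
    indices = [min(levels - 1, round((w - lo) / step)) for w in weights]
    return lut, indices

weights = [-1.0, -0.4, 0.0, 0.35, 0.9, 2.0]
lut, indices = palettize_uniform(weights, nbits=2)
restored = [lut[i] for i in indices]
```

The "kmeans" mode differs only in how the LUT is chosen: entries are cluster centroids fitted to the weight distribution, which usually tracks the weights more closely than evenly spaced values.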

Available Bit Widths

Bits	Unique Values	Size Reduction (vs. FP32)	Typical Quality
8	256	~4x	Excellent
6	64	~5.3x	Very good
4	16	~8x	Good
3	8	~10.7x	Moderate
2	4	~16x	Fair
1	2	~32x	Poor (binary)
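The figures in a table like this follow from simple arithmetic: n bits give 2^n palette entries, and the size reduction is the baseline bit width divided by n. The short Python sketch below makes the baseline explicit, since the ratio halves when measured against FP16 instead of FP32. It ignores the small overhead of storing the lookup table itself, and the function name is illustrative.

```python
def palette_stats(nbits, baseline_bits=32):
    """Return (unique values, approximate size reduction) for n-bit palettes."""
    unique_values = 2 ** nbits
    reduction = baseline_bits / nbits
    return unique_values, reduction

# Ratios relative to FP32 weights.
table = {n: palette_stats(n) for n in (8, 6, 4, 3, 2, 1)}

# The same 8-bit palette measured against FP16 weights.
fp16_8bit = palette_stats(8, baseline_bits=16)
```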

Pruning (Weight Sparsification)

Magnitude Pruning

config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpMagnitudePrunerConfig(
        target_sparsity=0.75,
        weight_threshold=2048,
    )
)
pruned = cto.coreml.prune_weights(model, config=config)

Threshold Pruning

config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(
        threshold=1e-12,
        minimum_sparsity_percentile=0.5,
    )
)
pruned = cto.coreml.prune_weights(model, config=config)
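The two pruning modes differ in what they hold fixed: magnitude pruning fixes the fraction of zeros, while threshold pruning fixes the cutoff value. The plain-Python sketch below illustrates both on a flat weight list. It is not the coremltools implementation, the function names are hypothetical, and treating values exactly at the threshold as prunable is an assumption.

```python
def magnitude_prune(weights, target_sparsity):
    """Zero the smallest-magnitude weights until target_sparsity is reached."""
    n_prune = int(len(weights) * target_sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def threshold_prune(weights, threshold):
    """Zero every weight whose magnitude is at or below the threshold."""
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.9, -0.1]
pruned = magnitude_prune(weights, target_sparsity=0.75)
sparsity = pruned.count(0.0) / len(pruned)
thresholded = threshold_prune(weights, 0.05)
```

With a very small threshold such as 1e-12, threshold pruning mostly captures weights that are already effectively zero, which is why it pairs with a minimum sparsity requirement.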

Joint Compression (Stacking Techniques)

Apply multiple compression techniques in sequence:

# Palettize first, then prune on top
palettized = cto.coreml.palettize_weights(model, pal_config)
final = cto.coreml.prune_weights(
    palettized, prune_config, joint_compression=True
)

Per-Op Configuration

Fine-grained control over which operations get compressed:

config = cto.coreml.OptimizationConfig(
    global_config=global_op_config,
    op_type_configs={
        "linear": linear_config,
        "conv": conv_config,
    },
    op_name_configs={
        "embedding_layer": None,  # None = skip compression
    },
)

Quantization-Aware Training (QAT)

Train with quantization in the loop for best accuracy:

import torch
from coremltools.optimize.torch.quantization import (
    LinearQuantizer, LinearQuantizerConfig, ModuleLinearQuantizerConfig
)

config = LinearQuantizerConfig(
    global_config=ModuleLinearQuantizerConfig(
        quantization_scheme="symmetric",
        milestones=[0, 1000, 1000, 0],
    )
)
quantizer = LinearQuantizer(model, config)
# example_inputs holds sample tensors for model.forward, not a shape list
quantizer.prepare(example_inputs=(torch.randn(1, 3, 224, 224),), inplace=True)

# Training loop
for inputs, labels in data:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    output = model(inputs)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()
    quantizer.step()  # advance the quantizer's milestone schedule

model = quantizer.finalize(inplace=True)

Swift Integration

Loading Models

// From Xcode-compiled model (auto-generated class)
let model = try MyImageClassifier(configuration: MLModelConfiguration())

// From URL at runtime
let config = MLModelConfiguration()
config.computeUnits = .all
let model = try MLModel(contentsOf: modelURL, configuration: config)

// Compile a .mlmodel/.mlpackage on device, then load the resulting .mlmodelc
let compiledURL = try MLModel.compileModel(at: sourceModelURL)
let model = try MLModel(contentsOf: compiledURL)

MLModelConfiguration

let config = MLModelConfiguration()
config.computeUnits = .all
config.allowLowPrecisionAccumulationOnGPU = true
// config.functionName = "adapter_1"  // For multifunction models (iOS 18+)

Synchronous Prediction

let input = MyModelInput(image: pixelBuffer)
let output = try model.prediction(input: input)
let label = output.classLabel

Async Prediction (iOS 17+)

let output = try await model.prediction(input: input)

Thread-safe, supports Task cancellation, integrates with Swift concurrency. ~60% faster than synchronous for batch workloads.

Batch Prediction

let batchInputs: [MyModelInput] = images.map { MyModelInput(image: $0) }
let batchOutputs = try model.predictions(inputs: batchInputs)

MLFeatureProvider

let features = try MLDictionaryFeatureProvider(dictionary: [
    "input": MLFeatureValue(pixelBuffer: pixelBuffer),
    "threshold": MLFeatureValue(double: 0.5),
])
let output = try model.prediction(from: features)

Vision Framework Integration

import Vision
import CoreML

let vnModel = try VNCoreMLModel(for: MyDetector().model)
let request = VNCoreMLRequest(model: vnModel) { request, error in
    guard let results = request.results as? [VNClassificationObservation] else { return }
    let topResult = results.first
    print("\(topResult?.identifier ?? ""): \(topResult?.confidence ?? 0)")
}
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

Natural Language Integration

import NaturalLanguage

let nlModel = try NLModel(mlModel: SentimentClassifier().model)
let sentiment = nlModel.predictedLabel(for: "Great product!")

MLTensor (iOS 18+)

Swift type for multidimensional array operations:

import CoreML

let tensor = MLTensor([1.0, 2.0, 3.0, 4.0])
let reshaped = tensor.reshaped(to: [2, 2])
let result = tensor.softmax()
let matmulResult = tensorA.matmul(tensorB)

Neural Engine Best Practices

1. Use EnumeratedShapes instead of RangeDim for ANE optimization
2. Avoid unsupported ANE ops -- they cause fallback to CPU/GPU with transfer overhead
3. Use palettization (4-bit or 6-bit) for best ANE memory/latency gains
4. W8A8 quantization on A17 Pro / M4+ enables optimized INT8 compute on ANE

Model Loading Optimization

1. Pre-compile models -- use .mlmodelc for instant loading after first compilation
2. Cache compiled models to a fixed location after MLModel.compileModel(at:)
3. Use bisect_model() for very large models that are slow to load
4. Use MLComputePlan (iOS 17.4+) for programmatic profiling

Profiling

1. Xcode Performance tab -- open .mlpackage in Xcode to see load time, prediction time, per-op compute unit assignment
2. Core ML Instrument in Instruments app -- runtime profiling
3. MLComputePlan API -- programmatic access to profiling data
4. coremltools debugging -- MLModelValidator, MLModelComparator, MLModelInspector, MLModelBenchmarker

Reshape Frequency Hint

model = ct.models.MLModel("model.mlpackage",
    optimization_hints={
        "reshapeFrequency": ct.ReshapeFrequency.Infrequent
    })

Common Optimization Mistakes

1. Applying quantization without checking accuracy. Always validate after compression. Use MLModelComparator to compare outputs.
2. Ignoring weight_threshold. Small tensors (< 512 elements) should not be quantized -- overhead outweighs the benefit.
3. Using synchronous predictions in async contexts. Use async prediction (iOS 17+) in Swift concurrency code.
4. Not pre-compiling models. First load triggers device-specific compilation, which can be slow.
5. Ignoring compute_units configuration. Default .all is correct for production. .cpuOnly is for debugging only.
6. Not testing on physical devices. Simulator does not support Metal GPU or Neural Engine.

Foundation Models API Reference

Complete reference for Apple's Foundation Models framework (iOS 26+ / macOS 26+). On-device language model optimized for Apple Silicon. No app-managed API key, model hosting, or network round trip for generation; still handle Apple Intelligence and system model asset availability.

Framework Overview
Availability Checking
Use Cases
Session Management
Generating Responses
Structured Output with `@Generable`
Tool Calling
Error Handling
Generation Options
Safety and Guardrails
Custom Adapters
Context Management
Serialized Model Access
Prompt Design Best Practices
Feedback

Framework Overview

On-device language model optimized for Apple Silicon
Context window: limited total token budget (input + output combined); check SystemLanguageModel.default.contextSize for the current limit
Prefer SystemLanguageModel.default.supportsLocale(_:) before generation; use supportedLanguages only when listing broad language support
Capabilities: Summarization, entity extraction, text understanding, short dialog, creative content, content tagging
Limitations: Not suited for complex math, code generation, or factual accuracy

SystemLanguageModel Properties

contextSize: Returns the model's maximum context window in tokens
supportedLanguages: Set<Locale.Language> values the model supports
supportsLocale(_ locale: Locale) -> Bool: Preferred locale check before generating because it accounts for fallbacks

Availability Checking

Always check before using. Never crash on unavailability.

import FoundationModels

// Quick boolean check
if SystemLanguageModel.default.isAvailable {
    // Proceed
}

// Detailed availability
switch SystemLanguageModel.default.availability {
case .available:
    let candidates = [Locale.current] + Locale.preferredLanguages.map(Locale.init(identifier:))
    guard let locale = candidates.first(where: SystemLanguageModel.default.supportsLocale) else {
        // Route to fallback UI before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to Settings > Apple Intelligence
    break
case .unavailable(.modelNotReady):
    // System model assets are downloading or unavailable for other system reasons
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence
    break
default:
    // Graceful fallback for unknown or future unavailable reasons
    break
}

Use Cases

Foundation Models supports specialized use cases:

// General purpose (default)
let model = SystemLanguageModel(useCase: .general, guardrails: .default)

// Content tagging (optimized for categorization)
let model = SystemLanguageModel(useCase: .contentTagging, guardrails: .default)

Session Management

Creating Sessions

// Basic session (uses SystemLanguageModel.default)
let session = LanguageModelSession()

// Session with system instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
    "Focus on quick, healthy recipes."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}

// Session with specific model
let model = SystemLanguageModel(useCase: .general, guardrails: .default)
let session = LanguageModelSession(model: model, tools: []) {
    "You are a helpful assistant."
}

Session Rules

1. Sessions are stateful. Multi-turn conversations maintain context automatically.
2. One request at a time per session. Check session.isResponding before new requests.
3. Prewarm with session.prewarm() before user interaction for faster first response.
4. Save and restore transcripts for session continuity: LanguageModelSession(model: model, tools: [], transcript: savedTranscript).

Prewarming

// Prewarm before user interaction
session.prewarm()

// Prewarm with a prompt prefix for faster specific responses
session.prewarm(promptPrefix: Prompt("Summarize the following text:"))

Generating Responses

Plain Text

// Simple text response
let response = try await session.respond(to: "Summarize this article: \(text)")
print(response.content) // String

// With generation options
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)

Streaming Text

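let stream = session.streamResponse(to: "Tell me a story")
for try await snapshot in stream {
    // Each snapshot holds the response so far, not just the newest tokens,
    // so replace the displayed text rather than appending to it
    print(snapshot.content)
}

// Or skip iteration and collect the full response
let response = try await stream.collect()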

Structured Output with `@Generable`

The @Generable macro creates compile-time JSON schemas for type-safe output.

Basic Usage

@Generable
struct Recipe {
    @Guide(description: "The name of the recipe")
    var name: String

    @Guide(description: "A brief description of the dish")
    var summary: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
let recipe = response.content
print(recipe.name)
print(recipe.steps)

Supported Types for `@Generable` Properties

String
Int, Double, Float
Bool
[Element] where Element is Generable or a supported scalar
Optional<T> where T is Generable or a supported scalar
Other @Generable structs (nested)
Enums conforming to @Generable

`@Guide` Constraints

@Generable
struct ProductReview {
    @Guide(description: "Product name")
    var product: String

    @Guide(description: "Rating", .range(1...5))
    var rating: Int

    @Guide(description: "Sentiment", .anyOf(["positive", "neutral", "negative"]))
    var sentiment: String

    @Guide(description: "Key themes", .count(3))
    var themes: [String]

    @Guide(description: "Summary in one sentence", .pattern(/^[A-Z].*\.$/))
    var summary: String

    @Guide(description: "Always English", .constant("en"))
    var language: String
}

Complete constraint list:

Constraint	Type	Purpose
`description:`	All	Natural language hint for generation
`.anyOf([values])`	String	Restrict to enumerated values
`.count(n)`	Array	Fixed array length
`.minimumCount(n)`	Array	Minimum array length
`.maximumCount(n)`	Array	Maximum array length
`.range(min...max)`	Numeric	Closed numeric range
`.minimum(n)`	Numeric	Lower bound
`.maximum(n)`	Numeric	Upper bound
`.constant(value)`	String	Always returns this value
`.pattern(regex)`	String	Regex format enforcement
`.element(guide)`	Array	Guide applied to each element

Property Ordering

Properties are generated in declaration order. Place foundational data before dependent data:

@Generable
struct Summary {
    var title: String       // Generated first
    var keyPoints: [String] // Generated with title context
    var conclusion: String  // Generated with full context
}

Streaming Structured Output

let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
    if let steps = snapshot.content.steps { updateStepsList(steps) }
}

Enum Support

@Generable
enum Priority: String {
    case low, medium, high, critical
}

@Generable
struct TodoItem { // not named Task, which would shadow Swift's concurrency Task
    var title: String
    var priority: Priority
}

Tool Calling

Defining Tools

struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}

Using Tools

let session = LanguageModelSession(
    tools: [WeatherTool()]
) {
    "You are a helpful assistant."
}

// The model decides autonomously when to invoke tools
let response = try await session.respond(to: "What's the weather in Tokyo?")

Tool Best Practices

Register all tools at session creation
Keep active tool sets small, usually three to five tools
Include only tools needed for the current task
Each tool adds to the context token budget (name, description, and parameter schema are included in instructions by default)
@Generable output schemas also consume the shared context window
Run deterministic or essential data fetches before calling the model, then put the result directly in the prompt
Use model-autonomous tools for dynamic lookups where the model can decide whether more app data is needed

Frame tool results as authorized user data to prevent refusals
The model calls tools autonomously; you cannot force tool invocation

Tool Protocol Details

Tool<Arguments, Output> conforms to Sendable; implement tools so captured state is concurrency-safe
The associated Arguments type must conform to ConvertibleFromGeneratedContent
The associated Output type must conform to PromptRepresentable (e.g., String, [String], custom types)

includesSchemaInInstructions: Boolean property on Tool (default true). Set to false to omit the tool's JSON schema from the system prompt, saving context tokens when the model already knows the schema.
ToolCallError: Struct on LanguageModelSession representing a tool invocation failure. Properties: tool (the tool that failed), underlyingError (the original error).
DynamicGenerationSchema: Build generation schemas at runtime for dynamic use cases where compile-time @Generable is insufficient. Construct schemas programmatically and pass to respond(to:schema:).

Error Handling

do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation:
        // Content triggered safety filters; rephrase and retry
        break
    case .exceededContextWindowSize:
        // Too many tokens; summarize earlier turns and create new session
        break
    case .concurrentRequests:
        // Another request is already in progress on this session
        break
    case .rateLimited:
        // Too many requests; back off and retry
        break
    case .unsupportedLanguageOrLocale:
        // Current locale not supported by the model
        break
    case .unsupportedGuide:
        // A @Guide constraint is not supported
        break
    case .assetsUnavailable:
        // Model assets not available on device
        break
    case .decodingFailure:
        // Failed to decode structured output
        break
    case .refusal(let refusal, _):
        // Model refused the request
        let explanation = try await refusal.explanation.content
        print("Refused: \(explanation)")
    default: break
    }
}

Generation Options

let options = GenerationOptions(
    sampling: .greedy,              // Deterministic output
    temperature: nil,               // Use default
    maximumResponseTokens: 256      // Limit response length
)

// Random sampling with top-k
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7
)

// Random sampling with probability threshold
let options = GenerationOptions(
    sampling: .random(probabilityThreshold: 0.9)
)

Sampling modes accept an optional seed parameter for reproducible output: .random(top: 40, seed: 42), .random(probabilityThreshold: 0.9, seed: 42).
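The three sampling modes differ only in which candidate tokens remain eligible before one is drawn. The plain-Python sketch below illustrates the concepts on a toy distribution. It is not the framework's implementation, temperature is left out for brevity, and all names and probabilities are made up for illustration.

```python
import random

def greedy(probs):
    """Always pick the single most likely token (deterministic)."""
    return max(probs, key=probs.get)

def top_k_candidates(probs, k):
    """Keep the k most likely tokens, as in .random(top: k)."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def nucleus_candidates(probs, threshold):
    """Keep the smallest likely set whose cumulative mass reaches threshold."""
    chosen, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        chosen.append(token)
        mass += probs[token]
        if mass >= threshold:
            break
    return chosen

def sample(probs, candidates, seed=None):
    """Draw one candidate in proportion to its probability."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    weights = [probs[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

probs = {"pasta": 0.5, "salad": 0.3, "soup": 0.15, "cake": 0.05}
top2 = top_k_candidates(probs, 2)
nucleus = nucleus_candidates(probs, 0.9)
first = sample(probs, top2, seed=42)
second = sample(probs, top2, seed=42)
```

Greedy sampling suits extraction and classification where repeatability matters. The random modes trade repeatability for variety, and a seed restores repeatability for tests.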

Safety and Guardrails

Guardrail Types

// Default guardrails (recommended)
let model = SystemLanguageModel(useCase: .general, guardrails: .default)

// Permissive content transformations (for text rewriting tasks)
let model = SystemLanguageModel(
    useCase: .general,
    guardrails: .permissiveContentTransformations
)

Safety Rules

Guardrails are always enforced and cannot be disabled
Instructions take precedence over user prompts
Never include untrusted user content in instructions
Provide curated selections over free-form input when possible
Guardrails can produce false positives; handle gracefully
Frame tool results as authorized user data

Custom Adapters

Load fine-tuned LoRA adapters for specialized model behavior:

// Requires com.apple.developer.foundation-model-adapter entitlement
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()

let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
let response = try await session.respond(to: "Generate styled text")

Adapter Management

// Check compatible adapters
let ids = SystemLanguageModel.Adapter.compatibleAdapterIdentifiers(name: "my-adapter")

// Remove obsolete adapters
try SystemLanguageModel.Adapter.removeObsoleteAdapters()

Context Management

When conversations grow long:

1. Monitor token usage against SystemLanguageModel.default.contextSize
2. Use SystemLanguageModel.default.tokenCount(for:) to estimate usage
3. Summarize earlier turns into new session instructions
4. Create fresh sessions with summary context rather than overflowing

if transcript.estimatedTokenCount > 3000 {
    let summary = try await summarizeSession(session)
    session = LanguageModelSession {
        "Previous conversation summary: \(summary)"
        "Continue helping the user."
    }
}
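The decision of when to summarize can be expressed as a budget check rather than a fixed number. The sketch below uses plain Python only to show the arithmetic. The 4096-token context size is a placeholder (the real value should come from contextSize at runtime), and the function name, reply reserve, and 10% safety margin are illustrative assumptions.

```python
def should_compact(used_tokens, context_size, reserve_for_reply, safety_margin=0.1):
    """Return True when the next reply may not fit in the remaining window."""
    # Keep a margin because token estimates are approximate.
    budget = context_size * (1.0 - safety_margin)
    return used_tokens + reserve_for_reply > budget

CONTEXT_SIZE = 4096  # placeholder; query the model for the actual limit

still_fits = should_compact(3000, CONTEXT_SIZE, reserve_for_reply=512)
needs_summary = should_compact(3300, CONTEXT_SIZE, reserve_for_reply=512)
```

Reserving room for the reply matters because input and output share one window: a transcript that fits on its own can still overflow once generation starts.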

Serialized Model Access

When multiple parts of an app need the model:

actor FoundationModelCoordinator {
    private var session: LanguageModelSession?

    func respond(to prompt: String) async throws -> String {
        if session == nil {
            session = LanguageModelSession()
        }
        guard let activeSession = session else {
            throw FoundationModelError.sessionUnavailable
        }
        let response = try await activeSession.respond(to: prompt)
        return response.content
    }
}

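Serialize all Foundation Model access through a single coordinator to prevent Neural Engine contention. Because Swift actors are reentrant, a second call can enter respond(to:) while the first is suspended at an await, so check session.isResponding or queue requests when callers can overlap.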
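The serialization idea itself is language-independent: overlapping callers wait their turn so that only one request is in flight. The sketch below demonstrates it with Python's asyncio and an explicit lock, purely as an illustration. The class, the echo reply, and the sleep standing in for inference are all hypothetical.

```python
import asyncio

class Coordinator:
    """Lets only one request run at a time, even with concurrent callers."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.active = 0
        self.max_active = 0  # highest number of overlapping requests seen

    async def respond(self, prompt):
        async with self._lock:      # later callers wait here
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            await asyncio.sleep(0)  # stand-in for model inference
            self.active -= 1
            return f"echo: {prompt}"

async def main():
    coordinator = Coordinator()
    # Three callers start at once; the lock forces them to run in turn.
    replies = await asyncio.gather(
        *(coordinator.respond(p) for p in ["a", "b", "c"])
    )
    return replies, coordinator.max_active

replies, max_active = asyncio.run(main())
```

The counter is what verifies the design: if requests ever overlapped, max_active would exceed 1.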

Prompt Design Best Practices

1. Be concise. The context window covers both input and output tokens. Check SystemLanguageModel.default.contextSize for the current limit.
2. Use bracketed placeholders in instructions: [descriptive example].
3. Use "DO NOT" in all caps for behavioral prohibitions.
4. Provide up to 5 few-shot examples for consistent output.
5. Use length qualifiers: "in a few words", "in three sentences".
6. Estimate token usage with SystemLanguageModel.default.tokenCount(for:) to avoid exceeding the context window.

Feedback

Log feedback for model improvement:

let data = session.logFeedbackAttachment(
    sentiment: .negative,
    issues: [
        LanguageModelFeedback.Issue(
            category: .didNotFollowInstructions,
            explanation: "Ignored the word count constraint"
        )
    ],
    desiredOutput: nil
)

Issue categories: .didNotFollowInstructions, .incorrect, .stereotypeOrBias, .suggestiveOrSexual, .tooVerbose, .triggeredGuardrailUnexpectedly, .unhelpful, .vulgarOrOffensive.

MLX Swift & llama.cpp Reference

Complete reference for running open-source LLMs on Apple platforms using MLX Swift and llama.cpp.

MLX Swift
llama.cpp
Multi-Backend Architecture
Built-in Apple Frameworks
Performance Best Practices
Review Checklist

MLX Swift

Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Key Characteristics

Unified memory: operations run on CPU or GPU without data transfer
Lazy computation: operations computed only when needed
Automatic differentiation for training
Metal GPU acceleration
Research-oriented but increasingly used in production

Loading and Running LLMs

import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI
import MLXLMTokenizers

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: TokenizersLoader(),
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)

let session = ChatSession(container)
print(try await session.respond(to: "Hello"))

// Use ModelContainer directly when you need streaming control.
let input = try await container.prepare(input: UserInput(prompt: "Hello"))
let stream = try await container.generate(
    input: input,
    parameters: GenerateParameters(temperature: 0.0)
)
for await event in stream {
    if case .chunk(let text) = event {
        print(text, terminator: "")
    }
}

Recommended Models by Device

Device	RAM	Recommended Model	Disk Size	RAM Usage
iPhone 12-14	4-6 GB	SmolLM2-135M or Qwen 2.5 0.5B	~278 MB	~0.3 GB
iPhone 15 Pro+	8 GB	Gemma 3n E4B 4-bit	~2.7 GB	~3.5 GB
Mac 8 GB	8 GB	Llama 3.2 3B 4-bit	~1.8 GB	~3 GB
Mac 16 GB+	16 GB+	Mistral 7B 4-bit	~4 GB	~6 GB

Memory Management Rules

1. Never exceed 60% of total RAM on iOS
2. Set MLX cache limits:

   Memory.cacheLimit = 512 * 1024 * 1024 // 512 MB

3. Monitor memory pressure with Memory.snapshot() and reduce cache under pressure
4. Unload MLX and llama.cpp models on backgrounding or memory pressure; for MLX, also call Memory.clearCache() after generation-heavy phases
5. Use "Increased Memory Limit" entitlement for larger models on iOS
6. Pre-flight memory checks before loading models
7. Validate MLX Swift and llama.cpp on physical Apple Silicon; Simulator cannot exercise Metal-dependent inference, memory, or performance
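A pre-flight check for the 60% rule reduces to one comparison. The plain-Python sketch below works through it using RAM figures from the device table above. Counting the cache limit against the same budget is an assumption made here for safety, and the function name is illustrative.

```python
GIB = 1024 ** 3

def fits_in_budget(total_ram_bytes, model_ram_bytes,
                   cache_limit_bytes=512 * 1024 * 1024, max_fraction=0.6):
    """Return True if the model plus cache stays within the RAM fraction."""
    budget = total_ram_bytes * max_fraction
    return model_ram_bytes + cache_limit_bytes <= budget

# 8 GB phone with a model that needs about 3.5 GB at runtime.
phone_8gb = fits_in_budget(8 * GIB, 3.5 * GIB)
# The same model on a 6 GB phone.
phone_6gb = fits_in_budget(6 * GIB, 3.5 * GIB)
# 16 GB Mac with a model that needs about 6 GB.
mac_16gb = fits_in_budget(16 * GIB, 6 * GIB)
```

Running this check before loading lets the app pick a smaller model or another backend instead of being terminated for memory pressure mid-load.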

Model Lifecycle Management

@Observable
class ModelManager {
    private var model: ModelContainer?
    private var generationCount = 0

    func loadModel() async throws {
        model = try await LLMModelFactory.shared.loadContainer(
            from: HubClient.default,
            using: TokenizersLoader(),
            configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
        )
    }

    func unloadModel() {
        model = nil
        Memory.clearCache()
    }
}

Key lifecycle patterns:

Track active generation count to distinguish "loaded but idle" from "generating"

Unconditional cancellation on app backgrounding
5-second delayed force-unload after backgrounding
Platform-specific memory monitoring (UIKit on iOS, DispatchSource on macOS)

Background Handling

// iOS: Observe app lifecycle
NotificationCenter.default.addObserver(
    forName: UIApplication.didEnterBackgroundNotification,
    object: nil, queue: .main
) { _ in
    modelManager.cancelGeneration()
    Task {
        try await Task.sleep(for: .seconds(5))
        modelManager.unloadModel()
    }
}
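
On macOS there is no UIApplication notification to observe, so the DispatchSource route mentioned under the lifecycle patterns is one option. A minimal sketch, assuming a hypothetical wrapper named `startMemoryPressureMonitor`; what the callback does (reduce cache, unload) is up to the app:

```swift
import Dispatch

/// Hypothetical wrapper: starts a memory-pressure monitor on the main queue.
/// The caller must keep the returned source alive and cancel() it when done.
func startMemoryPressureMonitor(
    onPressure: @escaping (DispatchSource.MemoryPressureEvent) -> Void
) -> DispatchSourceMemoryPressure {
    let source = DispatchSource.makeMemoryPressureSource(
        eventMask: [.warning, .critical],
        queue: .main
    )
    source.setEventHandler {
        // `data` holds the event(s) that triggered this callback.
        onPressure(source.data)
    }
    source.resume()
    return source
}

// Usage: react more aggressively to .critical than to .warning.
let pressureSource = startMemoryPressureMonitor { event in
    if event.contains(.critical) {
        print("critical memory pressure: unload the model")
    } else {
        print("memory pressure warning: shrink caches")
    }
}
```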

llama.cpp

C/C++ LLM inference engine. Best cross-platform support. Uses GGUF model format.

Swift Integration (swift-llama-cpp)

import SwiftLlamaCpp

let service = LlamaService(
    modelUrl: modelURL,
    config: .init(
        batchSize: 256,
        maxTokenCount: 4096,
        useGPU: true
    )
)

let messages = [
    LlamaChatMessage(role: .system, content: "You are helpful."),
    LlamaChatMessage(role: .user, content: "Hello")
]

let stream = try await service.streamCompletion(
    of: messages,
    samplingConfig: .init(temperature: 0.8)
)
for try await token in stream {
    print(token, terminator: "")
}

GGUF Quantization Levels

Level	Quality	Size	Use Case
Q2_K	Lowest	Smallest	Extreme memory constraints
Q4_K_M	Good	Balanced	Mobile devices (recommended)
Q5_K_M	Higher	Larger	When quality matters more
Q8_0	Near-original	Largest	Desktop with ample RAM
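
The table implies a simple selection rule: take the highest-quality level that still fits the budget. A minimal sketch of that rule, with hypothetical types `GGUFQuant` and `GGUFVariant`; file sizes are supplied by the caller because they depend on the specific model.

```swift
// Ordered from lowest to highest quality, following the table above.
enum GGUFQuant: Int, Comparable {
    case q2K = 0, q4KM, q5KM, q8_0

    static func < (lhs: GGUFQuant, rhs: GGUFQuant) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

// Hypothetical description of one downloadable GGUF file.
struct GGUFVariant {
    let quant: GGUFQuant
    let fileBytes: UInt64
}

/// Returns the highest-quality variant whose file size fits the budget, if any.
func bestVariant(from variants: [GGUFVariant], budgetBytes: UInt64) -> GGUFVariant? {
    variants
        .filter { $0.fileBytes <= budgetBytes }
        .max { $0.quant < $1.quant }
}
```

File size is only a lower bound on memory use; the device table earlier shows RAM usage above disk size, so the budget should leave headroom for context and runtime buffers.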

llama.cpp vs MLX Swift

Aspect	llama.cpp	MLX Swift
Model format	GGUF	Hugging Face / MLX format
Platform support	Cross-platform	Apple only
Throughput (Apple Silicon)	Good	Best
Model ecosystem	Broadest	mlx-community models
Maturity	Very mature	Evolving
Memory efficiency	Excellent	Good

Multi-Backend Architecture

When an app needs multiple AI backends:

Fallback Chain Pattern

func respond(to prompt: String) async throws -> String {
    // Try Foundation Models first when available (system-integrated backend)
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    }

    // Fall back to MLX Swift (best throughput)
    if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    }

    // Fall back to llama.cpp (broadest compatibility)
    if llamaModelAvailable() {
        return try await llamaRespond(prompt)
    }

    throw AIError.noBackendAvailable
}

Architecture Guidelines

1. Create a router that checks Foundation Models availability first
2. Fall back to MLX or llama.cpp when Foundation Models is unavailable
3. Define model tiers based on device capabilities
4. Serialize all model access through a coordinator actor to prevent contention
5. Ensure tool systems work across backends (schema translation may be needed)
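
One way to keep the router from guidelines 1 and 2 independent of any single runtime is to put each backend behind a shared protocol and walk an ordered list. This is a sketch, not the skill's implementation: `TextBackend`, `BackendRouter`, `RouterError`, and `StubBackend` are hypothetical names, and the stubs stand in for real Foundation Models, MLX, and llama.cpp wrappers.

```swift
// Hypothetical abstraction over one inference runtime.
protocol TextBackend {
    var name: String { get }
    func isAvailable() async -> Bool
    func respond(to prompt: String) async throws -> String
}

enum RouterError: Error { case noBackendAvailable }

struct BackendRouter {
    /// Ordered by preference, e.g. Foundation Models, then MLX, then llama.cpp.
    let backends: [any TextBackend]

    func respond(to prompt: String) async throws -> String {
        for backend in backends {
            if await backend.isAvailable() {
                return try await backend.respond(to: prompt)
            }
        }
        throw RouterError.noBackendAvailable
    }
}

// Stand-in backend for illustration and tests.
struct StubBackend: TextBackend {
    let name: String
    let available: Bool
    func isAvailable() async -> Bool { available }
    func respond(to prompt: String) async throws -> String { "\(name): \(prompt)" }
}
```

The sketch only falls through on availability, matching the fallback chain above; falling through on a thrown error as well is a separate design decision.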

Coordinator Actor

actor ModelCoordinator {
    private var activeBackend: Backend?

    func withExclusiveAccess<T>(
        _ work: () async throws -> T
    ) async rethrows -> T {
        try await work()
    }

    enum Backend {
        case foundationModels
        case mlx
        case llamaCpp
    }
}
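
Swift actors are reentrant: while an actor method is suspended on an await, other calls into the same actor may start. A coordinator that awaits caller-supplied work therefore does not by itself guarantee one job at a time. If strict serialization is required, one option is to chain each job onto the previous one. A minimal sketch, with `SerialCoordinator` as a hypothetical name:

```swift
actor SerialCoordinator {
    // Completes when the most recently queued job has finished.
    private var tail: Task<Void, Never>?

    /// Runs `work` only after every previously queued job has finished.
    func run<T: Sendable>(
        _ work: @escaping @Sendable () async throws -> T
    ) async throws -> T {
        let previous = tail
        let job = Task<T, Error> {
            // Wait for the prior job; its result or error is not our concern.
            _ = await previous?.value
            return try await work()
        }
        // No suspension point occurs between reading and writing `tail`,
        // so concurrent callers always form a single chain.
        tail = Task { _ = try? await job.value }
        return try await job.value
    }
}
```

The sketch does not propagate cancellation from the caller into a queued job; an app that cancels generation on backgrounding would need to add that.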

Built-in Apple Frameworks

Before reaching for custom models, consider built-in frameworks:

Natural Language Framework

No model downloads required:

NLLanguageRecognizer -- Language detection
NLTokenizer -- Word, sentence, paragraph tokenization
NLTagger -- Parts of speech, named entity recognition, sentiment
NLEmbedding -- Word and sentence vectors, similarity search
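
As a quick illustration of the first two entries, language detection and word tokenization need only a few lines and no downloaded model. The sample sentence is arbitrary:

```swift
import NaturalLanguage

let text = "On-device models keep user data private."

// Language detection: returns the most likely language, or nil if undetermined.
let language = NLLanguageRecognizer.dominantLanguage(for: text)
print(language?.rawValue ?? "unknown")

// Word tokenization: the tokenizer returns ranges into the original string.
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text
let words = tokenizer
    .tokens(for: text.startIndex..<text.endIndex)
    .map { String(text[$0]) }
print(words)
```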

Vision Framework

Built-in computer vision (legacy VN* API; for iOS 18+ prefer modern Swift equivalents like RecognizeTextRequest):

VNRecognizeTextRequest -- OCR
VNClassifyImageRequest -- Image classification
VNDetectFaceRectanglesRequest -- Face detection
VNDetectHumanBodyPoseRequest -- Body pose estimation
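
A short sketch of the legacy VN* API running two of the requests above in one pass over the same image. The function name `analyze` and its return shape are illustrative choices, not part of Vision:

```swift
import Vision
import CoreGraphics

/// Runs OCR and face detection in a single perform() call.
func analyze(_ image: CGImage) throws -> (text: [String], faceCount: Int) {
    let textRequest = VNRecognizeTextRequest()
    textRequest.recognitionLevel = .fast   // favors speed over accuracy

    let faceRequest = VNDetectFaceRectanglesRequest()

    // One handler and one perform() call for both requests.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([textRequest, faceRequest])

    // Keep the top candidate string for each recognized text region.
    let lines = (textRequest.results ?? []).compactMap {
        $0.topCandidates(1).first?.string
    }
    return (lines, faceRequest.results?.count ?? 0)
}
```

perform() is synchronous, so for camera frames or large images it is usually called off the main thread.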

Create ML

Training custom classifiers directly on device or Mac:

Image classification
Text classification
Tabular data models
Sound classification

Performance Best Practices

1. Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
2. Use session.prewarm() for Foundation Models before user interaction
3. Batch Vision framework requests in a single perform() call
4. Use .fast recognition level for real-time camera processing
5. Neural Engine (Core ML) is most energy-efficient for compatible operations
6. For MLX Swift, monitor token generation speed and adjust model size if below acceptable thresholds
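
For item 6, token generation speed can be measured with a small accumulator fed from the streaming loop. A minimal sketch, assuming a hypothetical type named `ThroughputMeter`; timestamps are passed in so the logic is independent of any clock API:

```swift
/// Hypothetical helper that measures decode speed from per-token timestamps.
struct ThroughputMeter {
    private(set) var tokenCount = 0
    private var firstTimestamp: Double?
    private var lastTimestamp: Double?

    /// Call once per streamed token with a monotonic timestamp in seconds.
    mutating func recordToken(at seconds: Double) {
        if firstTimestamp == nil { firstTimestamp = seconds }
        lastTimestamp = seconds
        tokenCount += 1
    }

    /// Tokens per second between the first and last token,
    /// or nil until at least two tokens with distinct timestamps arrive.
    var tokensPerSecond: Double? {
        guard let first = firstTimestamp, let last = lastTimestamp,
              tokenCount > 1, last > first else { return nil }
        return Double(tokenCount - 1) / (last - first)
    }
}

var meter = ThroughputMeter()
meter.recordToken(at: 0.0)
meter.recordToken(at: 0.5)
meter.recordToken(at: 1.0)
```

In an app, ProcessInfo.processInfo.systemUptime is one source of monotonic timestamps. What counts as an acceptable rate is product-specific, so the sketch leaves the threshold to the caller.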

Review Checklist

[ ] Model size appropriate for target device RAM
[ ] Memory pressure monitoring implemented
[ ] Models unloaded on app backgrounding
[ ] MLX cache limits set appropriately
[ ] Pre-flight memory check before loading large models
[ ] Fallback strategy when model unavailable
[ ] All model access serialized through coordinator
[ ] Quantization level appropriate for quality/size tradeoff
[ ] Energy efficiency considered (Neural Engine vs GPU)
[ ] Physical device testing (not simulator) for Metal-dependent code

How it compares

Choose Apple-on-device-ai over generic LLM integration skills when the target is private iOS or macOS inference using Apple-native or GGUF on-device runtimes.

FAQ

When should I use Foundation Models?

For text generation, structured Generable output, and tool calling on devices running iOS 26 or later with Apple Intelligence, after availability checks.

What is required before every Foundation Models call?

Check SystemLanguageModel.default.availability and locale support, then provide fallback UI when unavailable.

How many concurrent requests can one LanguageModelSession handle?

One request at a time; check session.isResponding or serialize access through a coordinator.

Is Apple On Device Ai safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
