Machine Learning Engineer

The primary shelf is Build because most of the skill's value lies in shipping inference APIs and serving infrastructure, not in ideation or marketing. Deployment, auto-scaling, and multi-model orchestration map to backend production systems rather than notebook-only data science.

Where it fits

Example use

Build / Integrations & version control

Wrap a scikit-learn or PyTorch artifact in a FastAPI inference service with autoscaling assumptions documented.

Example use

Wire batch prediction jobs to your warehouse or queue without blocking the main app thread.

Example use

Compress or quantize a model to meet p99 latency before launch traffic.

Example use

Plan multi-model routing and load balancing when several endpoints share one GPU pool.

How it compares

Focuses on ML serving and ops engineering, not generic backend CRUD or a one-off Jupyter experimentation checklist.

Common Questions / FAQ

Who is machine-learning-engineer for?

Solo and indie developers moving ML models into production APIs, batch systems, or edge deployments who want structured ML engineering guidance from their coding agent.

When should I use machine-learning-engineer?

In Build when designing inference services and integrations; in Operate when tuning scaling, latency, or multi-model serving under real traffic.

Is machine-learning-engineer safe to install?

The skill describes deployment patterns that may imply shell, cloud, and network access when implemented. Review the Security Audits panel on this page and constrain agent permissions accordingly.

SKILL.md


# Machine Learning Engineer

## Purpose

Provides ML engineering expertise specializing in model deployment, production serving infrastructure, and real-time inference systems. Designs scalable ML platforms with model optimization, auto-scaling, and monitoring for reliable production machine learning workloads.

## When to Use

- ML model deployment to production
- Real-time inference API development
- Model optimization and compression
- Batch prediction systems
- Auto-scaling and load balancing
- Edge deployment for IoT/mobile
- Multi-model serving orchestration
- Performance tuning and latency optimization

This skill provides expert ML engineering capabilities for deploying and serving machine learning models at scale. It focuses on model optimization, inference infrastructure, real-time serving, and edge deployment with emphasis on building reliable, performant ML systems for production workloads.

## What This Skill Does

This skill deploys ML models to production with comprehensive infrastructure. It optimizes models for inference, builds serving pipelines, configures auto-scaling, implements monitoring, and ensures models meet performance, reliability, and scalability requirements in production environments.

### ML Deployment Components

- Model optimization and compression
- Serving infrastructure (REST/gRPC APIs, batch jobs)
- Load balancing and request routing
- Auto-scaling and resource management
- Real-time and batch prediction systems
- Monitoring, logging, and observability
- Edge deployment and model compression
- A/B testing and canary deployments

## Core Capabilities

### Model Deployment Pipelines
- CI/CD integration for ML models
- Automated testing and validation
- Model performance benchmarking
- Security scanning and vulnerability assessment
- Container building and registry management
- Progressive rollout and blue-green deployment
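
The benchmarking and validation steps above could be enforced in CI as a latency gate. The following is a minimal sketch, not part of the skill itself: the helper names (`percentile`, `measure_latencies_ms`, `passes_latency_budget`) are hypothetical, the percentile uses the nearest-rank method, and any budget value is an assumption you would set per model.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def measure_latencies_ms(predict: Callable[[], object], runs: int = 200) -> List[float]:
    """Time repeated calls to a zero-argument callable (hypothetical benchmark harness)."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def passes_latency_budget(latencies_ms: Sequence[float], p99_budget_ms: float) -> bool:
    """Return True when the observed p99 latency is within the budget."""
    return percentile(latencies_ms, 99) <= p99_budget_ms
```

A pipeline step might call `measure_latencies_ms` against a candidate model and fail the build when `passes_latency_budget` returns `False`. Measurements taken on CI hardware may not match production, so treat the gate as a regression check rather than a guarantee.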

### Serving Infrastructure
- Load balancer configuration (NGINX, HAProxy)
- Request routing and model caching
- Connection pooling and health checking
- Graceful shutdown and resource allocation
- Multi-region deployment and failover
- Container orchestration (Kubernetes, ECS)

### Model Optimization
- Quantization (from FP32 to FP16, INT8, or INT4)
- Model pruning and sparsification
- Knowledge distillation techniques
- ONNX and TensorRT conversion
- Graph optimization and operator fusion
- Memory optimization and throughput tuning
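
To illustrate the arithmetic behind the quantization item above, here is a sketch of symmetric per-tensor INT8 quantization in plain Python. It is an assumption-laden illustration only: the function names are hypothetical, and a real deployment would more likely rely on framework tooling than on hand-written loops.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Returns the quantized values and the scale needed to recover approximations.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # An all-zero (or empty) tensor has no range; any non-zero scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values; each differs from the original by at most ~scale / 2."""
    return [q * scale for q in quantized]
```

The round trip shows the trade-off: storage drops to one byte per weight, while each recovered value can be off by roughly half the scale. Whether that error is acceptable has to be checked against the model's accuracy on held-out data.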

### Real-time Inference
- Request preprocessing and validation
- Model prediction execution
- Response formatting and error handling
- Timeout management and circuit breaking
- Request batching and response caching
- Streaming predictions and async processing
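
The circuit-breaking item above can be sketched as a small wrapper around the model call. This is a simplified illustration under stated assumptions: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are placeholders, and the clock is injectable only to keep the sketch testable.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of queueing requests on an unhealthy backend.
                raise CircuitOpenError("model backend marked unhealthy")
            # Cool-down elapsed: half-open, so let this trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

An inference handler could wrap its prediction in `breaker.call(lambda: model_predict(payload))`, where `model_predict` is a placeholder, and translate `CircuitOpenError` into a fast error response or a fallback model.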

### Batch Prediction Systems
- Job scheduling and orchestration
- Data partitioning and parallel processing
- Progress tracking and error handling
- Result aggregation and storage
- Cost optimization and resource management
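
Partitioning, parallel processing, error handling, and result aggregation from the list above fit together roughly as follows. This is a sketch using only the standard library; `partition`, `run_batch`, and the `predict_batch` callable are hypothetical names, and a production job would normally add retries and persistent storage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, Tuple


def partition(rows: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def run_batch(rows: Sequence,
              predict_batch: Callable[[Sequence], List],
              partition_size: int = 1000,
              workers: int = 4) -> Tuple[List, List[int]]:
    """Score partitions in parallel; return predictions plus indices of failed partitions."""
    chunks = list(partition(rows, partition_size))
    results: List = []
    failed: List[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(predict_batch, chunk) for chunk in chunks]
        # Iterate in submission order so aggregated output keeps the input order.
        for index, future in enumerate(futures):
            try:
                results.extend(future.result())
            except Exception:
                failed.append(index)  # record the partition for a later retry
    return results, failed
```

Threads suit predictors that release the GIL or call a remote endpoint; for CPU-bound pure-Python scoring, a process pool is usually the better fit. Returning failed partition indices rather than raising lets one bad partition be retried without rerunning the whole job.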

### Auto-scaling Strategies
- Metric-based scaling (CPU, GPU, request rate)
- Scale-up and scale-down policies
- Warm-up periods and predictive scaling
- Cost controls and regional distribution
- Traffic prediction and capacity planning
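
Metric-based scaling is often a proportional rule: scale the replica count by the ratio of the observed metric to its target, then clamp the result. The sketch below follows the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(current * observed / target)`. The function name, the default bounds, and the tolerance value are assumptions for illustration.

```python
import math


def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling decision, clamped to [min_replicas, max_replicas]."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = observed / target
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: hold steady to avoid flapping
    else:
        wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, `desired_replicas(4, 90, 60)` returns `6`: four replicas at 90% utilization against a 60% target need half again as many. The `max_replicas` clamp doubles as a simple cost control.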

### Multi-model Serving
- Model routing and version management
- A/B testing and traffic splitting
- Ensemble serving and model cascading
- Fallback strategies and performance isolation
- Shadow mode testing and validation
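
Traffic splitting for A/B tests is commonly made deterministic, so that the same user or request key always reaches the same model version. Here is a minimal sketch of hash-based weighted routing; the function name `pick_variant` and the variant labels are hypothetical.

```python
import hashlib
from typing import Dict


def pick_variant(request_key: str, weights: Dict[str, float]) -> str:
    """Map a key to a variant with probability proportional to its weight.

    The choice depends only on the key and the weights, so routing is sticky.
    """
    if not weights or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a point on the weight line.
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    return name  # floating-point edge case: fall back to the last variant
```

Calling `pick_variant(user_id, {"stable": 0.9, "canary": 0.1})` would send roughly a tenth of keys to the canary. Changing the weights reassigns some keys, which is worth accounting for when analysing an experiment that was re-weighted midway.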

### Edge Deployment
- Model compression

What is this skill?

Production ML deployment and real-time inference API design

Model optimization, compression, and latency tuning

Batch prediction pipelines and multi-model serving orchestration

Auto-scaling, load balancing, and edge/IoT deployment patterns

Monitoring-minded framing for reliable production ML workloads

8 explicit When-to-Use triggers including edge and multi-model orchestration

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 790 installs on skills.sh; 3/3 security scanners passed (skills.sh audits).
