MLE Workflow

Build is the canonical shelf because the skill's core job is hardening pipelines, features, and training/serving parity, not early ideation or a marketing launch. Backend fits because training jobs, batch inference, feature pipelines, and serving contracts live outside the UI layer.

Where it fits

Example uses

Validate (Scope & plan): Decide which ML lanes (labels, online serving, feature store) actually apply before committing build scope for a recommender.

Define data contracts and a reproducible training job when moving a classifier out of a notebook.

Set offline metrics and shadow-traffic checks before promoting a refreshed ranking model.

Add drift detection and rollback triggers when batch feature freshness degrades in production.

How it compares

A production ML systems workflow skill—not a notebook tutorial generator or a generic unit-test-only QA checklist.

Common Questions / FAQ

Who is mle-workflow for?

Solo builders and small teams using agentic coding tools to design, review, or harden ML features from data through operations.

When should I use mle-workflow?

In Build when designing pipelines and train/serve logic; in Ship when defining offline/online evals and promotion gates; in Operate when adding monitoring, canaries, or rollback after drift or quality regressions.

Is mle-workflow safe to install?

The skill may steer agents toward shell, data, and deployment actions—review the Security Audits panel on this page and scope permissions to your environment.

SKILL.md


# Machine Learning Engineering Workflow

Use this skill to turn model work into a production ML system with clear data contracts, repeatable training, measurable quality gates, deployable artifacts, and operational monitoring.

## When to Activate

- Planning or reviewing a production ML feature, model refresh, ranking system, recommender, classifier, embedding workflow, or forecasting pipeline
- Converting notebook code into a reusable training, evaluation, batch inference, or online inference pipeline
- Designing model promotion criteria, offline/online evals, experiment tracking, or rollback paths
- Debugging failures caused by data drift, label leakage, stale features, artifact mismatch, or inconsistent training and serving logic
- Adding model monitoring, canary rollout, shadow traffic, or post-deploy quality checks
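
As one illustration of the data-drift case above, the following is a minimal sketch of a Population Stability Index (PSI) check between a reference sample and a live sample of a single numeric feature. The function name `psi`, the equal-width binning, the `1e-6` floor, and the `0.25` alert level are assumptions made for this example and are not part of the skill; `0.25` is only a common rule of thumb, so treat it as a starting point and tune it per feature.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample.

    Bucket edges come from the reference sample (equal-width bins), which is
    an assumption of this sketch; quantile bins are a common alternative.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bucket.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a uniform reference and a copy shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb level, not prescribed by the skill
drift_detected = psi(reference, shifted) > ALERT_THRESHOLD
```

A check like this is only one possible drift signal. Categorical features, embeddings, and label or prediction drift usually need different statistics, and the project's existing monitoring stack should take precedence where one exists.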

## Scope Calibration

Use only the lanes that fit the system in front of you. This skill is useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLM workflows, anomaly detection, and batch analytics, but it should not force one architecture onto all of them.

- Do not assume every model has supervised labels, online serving, a feature store, PyTorch, GPUs, human review, A/B tests, or real-time feedback.
- Do not add heavyweight MLOps machinery when a data contract, baseline, eval script, and rollback note would make the change reviewable.
- Do make assumptions explicit when the project lacks labels, slice definitions, production traffic, or monitoring ownership, or when outcomes arrive with a delay.
- Treat examples as interchangeable scaffolds. Replace metrics, serving mode, data stores, and rollout mechanics with the project-native equivalents.
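
To make the lightweight option above concrete, here is a minimal sketch of a data contract check and a baseline comparison gate that could stand in for heavier MLOps machinery on a small change. Everything named here is hypothetical and chosen for illustration: the `Column` class, the `CONTRACT` columns, `contract_violations`, `passes_gate`, and the `0.01` tolerance. The gate also assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical contract for illustration; replace with the project's real schema.
CONTRACT = [
    Column("user_id", str),
    Column("item_id", str),
    Column("dwell_seconds", float, nullable=True),
]


def contract_violations(rows: list[dict], contract: list[Column]) -> list[str]:
    """Return human-readable violations; an empty list means the batch conforms."""
    problems = []
    for i, row in enumerate(rows):
        for col in contract:
            if col.name not in row:
                problems.append(f"row {i}: missing column {col.name!r}")
                continue
            value = row[col.name]
            if value is None:
                if not col.nullable:
                    problems.append(f"row {i}: {col.name!r} must not be null")
            elif not isinstance(value, col.dtype):
                problems.append(
                    f"row {i}: {col.name!r} expected {col.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return problems


def passes_gate(candidate: float, baseline: float, max_regression: float = 0.01) -> bool:
    """Allow promotion only if the candidate metric (higher is better, by
    assumption) is no more than max_regression below the baseline."""
    return candidate >= baseline - max_regression
```

Run against a training or inference batch, `contract_violations` gives a reviewer a list of concrete failures, and `passes_gate` turns "no worse than the baseline" into a check that can sit in an eval script. Slice-level metrics, lower-is-better metrics, and statistical significance are deliberately left out of this sketch.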

## Related Skills

- `python-patterns` and `python-testing` for Python implementation and pytest coverage
- `pytorch-patterns` for deep learning models, data loaders, device handling, and training loops
- `eval-harness` and `ai-regression-testing` for promotion gates and agent-assisted regression checks
- `database-migrations`, `postgres-patterns`, and `clickhouse-io` for data storage and analytics surfaces
- `deployment-patterns`, `docker-patterns`, and `security-review` for serving, secrets, containers, and production hardening

## Reuse the SWE Surface

Do not treat MLE as separate from software engineering. The recommended `minimal --with capability:machine-learning` install keeps the core agent surface available alongside this skill. For skill-only or agent-limited harnesses, pair `skill:mle-workflow` with `agent:mle-reviewer` where the target supports agents.

Most ECC SWE workflows apply directly to ML systems, often with stricter failure modes:

| SWE surface | MLE use |
|-------------|---------|
| `product-capability` / `architecture-decision-records` | Turn model work into explicit product contracts and record irreversible data, model, and rollout choices |
| `repo-scan` / `codebase-onboarding` / `code-tour` | Find existing training, feature, serving, eval, and monitoring paths before introducing a parallel ML stack |
| `plan` / `feature-dev` | Scope model changes as product capabilities with data, eval, serving, and rollback phases |
| `tdd-workflow` / `python-testing` | Test feature transforms, split logic, metric calculations, artifact loading, and inference schemas before implementation |
| `code-reviewer` / `mle-reviewer` | Review code quality plus ML-specific leakage, reproducibility, promotion, and monitoring risks |
| `build-fix` / `pr-test-analyzer` | Diagnose broken CI, flaky evals, missing fixtures, and environment-specific model or dependency failures |
| `quality-gate` / `test-coverage` | Require automated evidence for transforms, metrics, inference contract

What is this skill?

Scope calibration across ranking, recommenders, classifiers, forecasting, embeddings, and LLM workflows without forcing one architecture onto all of them

Lanes for data contracts, reproducible training, measurable quality gates, and artifact promotion

Deploy paths with canary, shadow traffic, and post-deploy quality checks

Operational focus on drift, label leakage, stale features, and train/serve skew debugging

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 1.1k installs on skills.sh; 210k GitHub stars; 3/3 security scanners passed (skills.sh audits).
