Data Science Expert

The primary shelf is Build (backend) because the skill centers on implementing analysis, ML, and visualization code in the product or data layer. Backend is the natural home for the statistical modeling, feature engineering, and pipeline-style Python that the skill illustrates with its DataCleaner class and ML concepts.

Where it fits

Example use

Frame success metrics and A/B test design before committing to a full ML feature.

Example use

Implement DataCleaner preprocessing and train a baseline classifier with proper validation splits.

Example use

Refine lifecycle dashboards and interpret experiment results with statistical plots.

How it compares

Use it as a skill-backed methodology layer, rather than relying on isolated Stack Overflow answers, when agents write pandas and sklearn workflows.

Common Questions / FAQ

Who is data-science-expert for?

It is for developers and solo data practitioners who want agent help across EDA, ML, stats, and visualization in Python.

When should I use data-science-expert?

Use it in Build when coding pipelines or models, in Validate when scoping metrics and experiments, or in Grow when refining analytics and A/B tests.

Is data-science-expert safe to install?

It can run Python via Bash; review the Security Audits panel on this page and restrict execution to trusted datasets and environments.

SKILL.md


# Data Science Expert

Expert guidance for data science, analytics, statistical modeling, and data visualization.

## Core Concepts

### Data Analysis
- Exploratory Data Analysis (EDA)
- Data cleaning and preprocessing
- Feature engineering
- Statistical inference
- Time series analysis
- A/B testing
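
As one illustration of the A/B testing item above, here is a minimal sketch of a two-proportion z-test using only the standard library. The function name `two_proportion_ztest` and the conversion counts are hypothetical and not part of the skill. The sketch assumes large, independent samples, so the normal approximation is reasonable.

```python
import math
from typing import Tuple


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both variants share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail of the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts, purely for illustration
z, p = two_proportion_ztest(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Fixing the sample size and significance level before looking at the data keeps the reported p-value meaningful. Repeatedly checking a running test and stopping at the first significant result inflates the false-positive rate.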

### Machine Learning
- Supervised learning (classification, regression)
- Unsupervised learning (clustering, PCA)
- Model selection and validation
- Feature importance
- Hyperparameter tuning
- Ensemble methods
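
The model selection and validation item above could look roughly like the following sketch, which assumes scikit-learn is installed. The synthetic dataset, the choice of logistic regression, and the five-fold setup are illustrative assumptions rather than anything the skill prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real feature matrix and label vector
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps validation-fold statistics from leaking into training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class balance in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A simple baseline like this gives later hyperparameter tuning or ensemble work a reference score to beat.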

### Data Visualization
- Matplotlib, Seaborn, Plotly
- Statistical plots
- Interactive dashboards
- Storytelling with data
- Best practices for visualization
- Color theory and accessibility
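
For the color and accessibility item above, one possible approach is sketched below, assuming Matplotlib is installed. It draws lines with the Okabe-Ito colorblind-safe palette and also varies the line style, so the series stay distinguishable without color. The data, labels, and output filename are made up for illustration.

```python
import math

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Okabe-Ito palette, a widely used colorblind-safe set of colors
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
LINE_STYLES = ["-", "--", ":"]

# Illustrative data only
xs = [i / 10 for i in range(63)]
series = {
    "sin(x)": [math.sin(x) for x in xs],
    "cos(x)": [math.cos(x) for x in xs],
    "sin(2x)": [math.sin(2 * x) for x in xs],
}

fig, ax = plt.subplots(figsize=(6, 4))
for i, (label, ys) in enumerate(series.items()):
    # Pair each color with a distinct line style as a redundant visual cue
    ax.plot(xs, ys, color=OKABE_ITO[i], linestyle=LINE_STYLES[i], linewidth=2, label=label)

ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Color plus line style as redundant encodings")
ax.legend()
fig.tight_layout()
fig.savefig("accessible_lines.png", dpi=150)
```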

## Data Cleaning and EDA

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # needed by the z-score branch of remove_outliers
from typing import Dict, List

class DataCleaner:
    """Clean and preprocess data"""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.cleaning_log = []

    def handle_missing_values(self, strategy: str = 'drop',
                             fill_value=None) -> pd.DataFrame:
        """Handle missing values"""
        missing_before = self.df.isnull().sum().sum()

        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            if fill_value is not None:
                self.df = self.df.fillna(fill_value)
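            else:
                # Fill numeric columns with the median, all others with the mode
                for col in self.df.columns:
                    if pd.api.types.is_numeric_dtype(self.df[col]):
                        # Assign back instead of a chained inplace fillna, which
                        # newer pandas versions warn about or silently ignore
                        self.df[col] = self.df[col].fillna(self.df[col].median())
                    else:
                        modes = self.df[col].mode()
                        # mode() is empty when the column is entirely missing
                        if not modes.empty:
                            self.df[col] = self.df[col].fillna(modes.iloc[0])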

        missing_after = self.df.isnull().sum().sum()
        self.cleaning_log.append(f"Missing values: {missing_before} -> {missing_after}")

        return self.df

    def remove_duplicates(self) -> pd.DataFrame:
        """Remove duplicate rows"""
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        after = len(self.df)

        self.cleaning_log.append(f"Duplicates removed: {before - after}")
        return self.df

    def remove_outliers(self, columns: List[str],
                       method: str = 'iqr',
                       threshold: float = 1.5) -> pd.DataFrame:
        """Remove outliers"""
        before = len(self.df)

        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1

                lower = Q1 - threshold * IQR
                upper = Q3 + threshold * IQR

                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]

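            elif method == 'zscore':
                # Here threshold is the |z| cutoff; the default of 1.5 suits the
                # IQR rule but is strict for z-scores, where 3 is more common
                z_scores = np.abs(stats.zscore(self.df[col]))
                self.df = self.df[z_scores < threshold]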

        after = len(self.df)
        self.cleaning_log.append(f"Outliers removed: {before - after}")

        return self.df

class EDA:
    """Exploratory Data Analysis"""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def summary_stats(self) -> pd.DataFrame:
        """Generate summary statistics"""
        return self.df.describe(include='all').T

    def correlation_analysis(self, method: str = 'pearson') -> pd.DataFrame:
        """Calculate correlation matrix"""
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        return self.df[numeric_cols].corr(method=method)

    def plot_distributions(self, columns: List[str] = None):
        ...  # the source is cut off here; the method body is not shown
```

What is this skill?

Covers EDA, cleaning, feature engineering, inference, time series, and A/B testing

Machine learning guidance for supervised, unsupervised, validation, and ensembles

Visualization stack: Matplotlib, Seaborn, Plotly with accessibility and storytelling notes

Includes executable Python patterns such as DataCleaner missing-value handling

Allowed-tools scope: Read, Write, Edit, Bash(python:*)

Skill version 1.0.0

Six core analysis themes including EDA and A/B testing

Bash(python:*) tool allowance for runnable examples

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 1 install on skills.sh; 28 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
