
Data Science Expert
Get expert agent guidance for EDA, cleaning, modeling, visualization, and statistical workflows with pandas, numpy, and Python tooling.
Overview
data-science-expert is an agent skill that provides expert data science, analytics, visualization, and statistical modeling guidance in Python. It is most often used in Build, with secondary fits in Validate (scoping) and Grow (analytics).
Install
npx skills add https://github.com/personamanagmentlayer/pcl --skill data-science-expert
What is this skill?
- Covers EDA, cleaning, feature engineering, inference, time series, and A/B testing
- Machine learning guidance for supervised, unsupervised, validation, and ensembles
- Visualization stack: Matplotlib, Seaborn, Plotly with accessibility and storytelling notes
- Includes executable Python patterns such as DataCleaner missing-value handling
- Allowed-tools scope: Read, Write, Edit, Bash(python:*)
- Skill version 1.0.0
- Six core analysis themes including EDA and A/B testing
- Bash(python:*) tool allowance for runnable examples
Adoption & trust: 1 install on skills.sh; 28 GitHub stars; 3/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You are building a data feature or model but need disciplined EDA, cleaning, validation, and visualization patterns beyond generic ML hype.
Who is it for?
Indie builders implementing analytics, ML experiments, or statistical reporting inside a Python codebase.
Skip if: Pure no-code BI users or teams that need governed enterprise MLOps platforms without writing Python.
When should I use this skill?
User needs expert data science, analytics, visualization, or statistical modeling help in Python.
What do I get? / Deliverables
Structured analysis and modeling steps with reusable Python patterns for clean datasets, validated models, and clear visual outputs.
- EDA and cleaning scripts or notebook-oriented code
- Model training and evaluation patterns
- Visualization code for statistical storytelling
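As a rough illustration of the "model training and evaluation patterns" deliverable, the sketch below compares a majority-class baseline with a logistic regression under stratified cross-validation, then scores once on a held-out split. It is not taken from the skill itself: the synthetic data from `make_classification`, the choice of logistic regression, and every variable name are assumptions made for illustration, and it presumes scikit-learn is installed.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); swap in your cleaned features and labels.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)

# Hold out a stratified test set that is scored only once, at the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold fits it on its own training part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
dummy = DummyClassifier(strategy="most_frequent")
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

dummy_scores = cross_val_score(dummy, X_train, y_train, cv=cv, scoring="accuracy")
model_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Refit on the full training split, then evaluate on the untouched test split.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"baseline CV accuracy: {dummy_scores.mean():.3f}")
print(f"model CV accuracy:    {model_scores.mean():.3f}")
print(f"held-out accuracy:    {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline is one way to avoid leaking test-fold statistics into training. The dummy baseline gives the model's score something concrete to beat.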
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Primary shelf is Build backend because the skill centers on implementing analysis, ML, and visualization code paths in the product or data layer. Backend fits statistical modeling, feature engineering, and pipeline-style Python the skill exemplifies with DataCleaner and ML concepts.
Where it fits
Validate: Frame success metrics and A/B test design before committing to a full ML feature.
Build: Implement DataCleaner preprocessing and train a baseline classifier with proper validation splits.
Grow: Refine lifecycle dashboards and interpret experiment results with statistical plots.
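To make the A/B test design step more concrete, here is a minimal sketch using only the Python standard library. It estimates a per-arm sample size for a target lift and runs a pooled two-proportion z-test on the results. The function names, the 10% baseline, the 2-point lift, and the example counts are all hypothetical, and the normal approximation is an assumption that suits large samples only.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical plan: 10% baseline conversion, detect a 2-point absolute lift.
n_needed = sample_size_per_arm(baseline=0.10, mde=0.02)

# Hypothetical outcome: 100/1000 conversions in control, 130/1000 in treatment.
z, p_value = two_proportion_ztest(100, 1000, 130, 1000)
print(f"users per arm: {n_needed}, z = {z:.2f}, p = {p_value:.4f}")
```

Fixing the sample size before the experiment starts guards against stopping early on a noisy result.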
How it compares
Use it as a skill-backed methodology layer instead of isolated Stack Overflow answers when agents write pandas and sklearn workflows.
Common Questions / FAQ
Who is data-science-expert for?
It is for developers and solo data practitioners who want agent help across EDA, ML, stats, and visualization in Python.
When should I use data-science-expert?
Use it in Build when coding pipelines or models, in Validate when scoping metrics and experiments, or in Grow when refining analytics and A/B tests.
Is data-science-expert safe to install?
It can run Python via Bash; review the Security Audits panel on this page and restrict execution to trusted datasets and environments.
SKILL.md
# Data Science Expert

Expert guidance for data science, analytics, statistical modeling, and data visualization.

## Core Concepts

### Data Analysis

- Exploratory Data Analysis (EDA)
- Data cleaning and preprocessing
- Feature engineering
- Statistical inference
- Time series analysis
- A/B testing

### Machine Learning

- Supervised learning (classification, regression)
- Unsupervised learning (clustering, PCA)
- Model selection and validation
- Feature importance
- Hyperparameter tuning
- Ensemble methods

### Data Visualization

- Matplotlib, Seaborn, Plotly
- Statistical plots
- Interactive dashboards
- Storytelling with data
- Best practices for visualization
- Color theory and accessibility

## Data Cleaning and EDA

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # needed by the 'zscore' outlier method below
from typing import Dict, List, Optional


class DataCleaner:
    """Clean and preprocess data"""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()  # work on a copy so the caller's frame is untouched
        self.cleaning_log = []

    def handle_missing_values(self, strategy: str = 'drop', fill_value=None) -> pd.DataFrame:
        """Handle missing values"""
        missing_before = self.df.isnull().sum().sum()

        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            if fill_value is not None:
                self.df = self.df.fillna(fill_value)
            else:
                # Fill numeric with median, categorical with mode
                for col in self.df.columns:
                    if pd.api.types.is_numeric_dtype(self.df[col]):
                        # Assign back rather than using inplace=True on a column
                        self.df[col] = self.df[col].fillna(self.df[col].median())
                    else:
                        mode = self.df[col].mode()
                        if not mode.empty:  # an all-NaN column has no mode
                            self.df[col] = self.df[col].fillna(mode[0])

        missing_after = self.df.isnull().sum().sum()
        self.cleaning_log.append(f"Missing values: {missing_before} -> {missing_after}")
        return self.df

    def remove_duplicates(self) -> pd.DataFrame:
        """Remove duplicate rows"""
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        after = len(self.df)
        self.cleaning_log.append(f"Duplicates removed: {before - after}")
        return self.df

    def remove_outliers(self, columns: List[str], method: str = 'iqr',
                        threshold: float = 1.5) -> pd.DataFrame:
        """Remove outliers"""
        before = len(self.df)

        for col in columns:
            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - threshold * IQR
                upper = Q3 + threshold * IQR
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]
            elif method == 'zscore':
                # threshold is in standard deviations here; 3 is a common choice
                z_scores = np.abs(stats.zscore(self.df[col]))
                self.df = self.df[z_scores < threshold]

        after = len(self.df)
        self.cleaning_log.append(f"Outliers removed: {before - after}")
        return self.df


class EDA:
    """Exploratory Data Analysis"""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def summary_stats(self) -> pd.DataFrame:
        """Generate summary statistics"""
        return self.df.describe(include='all').T

    def correlation_analysis(self, method: str = 'pearson') -> pd.DataFrame:
        """Calculate correlation matrix"""
        numeric_cols = self.df.select_dtypes(include=[np.number]).columns
        return self.df[numeric_cols].corr(method=method)

    def plot_distributions(self, columns: Optional[List[str]] = None):