Data Quality Auditor

Name: Data Quality Auditor
Author: alirezarezvani

alirezarezvani/claude-skills

578 installs
23.5k repo stars
Updated July 17, 2026

data-quality-auditor is a Claude Code skill that systematically audits datasets for missingness mechanisms, so developers who feed data into analysis, training, or imputation pipelines can detect bias risks before processing.

About

data-quality-auditor is an alirezarezvani Claude Code skill that applies Rubin's missingness framework—MCAR, MAR, and MNAR—to evaluate whether null values in a dataset can be safely imputed or analyzed. The skill includes a deep reference on detection heuristics, imputation safety, and systematic bias risks before data enters ML training, analytics dashboards, or ETL jobs. Developers reach for it when null rates spike, imputation choices feel arbitrary, or model performance may reflect missing-data artifacts rather than signal. The auditor produces a structured assessment of why data is missing and which downstream treatments are statistically defensible.

Classifies missing data as MCAR, MAR, or MNAR using Rubin's framework
Explains safe imputation strategies per missingness mechanism
Provides detection heuristics for each missingness type
Serves as theory reference for the Data Quality Auditor agent skill
Helps prevent systematic bias in downstream ML and analytics work

Data Quality Auditor by the numbers

578 all-time installs (skills.sh)
Ranked #414 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill data-quality-auditor


Installs	578
repo stars	★ 23.5k
Security audit	3 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you audit missing data before ML training?

Systematically audit datasets for missingness mechanisms before feeding them to analysis, training, or imputation pipelines.

Who is it for?

Data engineers and ML developers preparing datasets with significant null values who must choose safe imputation or exclusion strategies.

Skip if: Datasets with no missing values, quick exploratory plots without pipeline stakes, or teams without statistical missing-data concerns.

When should I use this skill?

A dataset shows null values before analysis, training, or imputation, and the developer needs to classify MCAR, MAR, or MNAR mechanisms.

What you get

Missingness mechanism classification, imputation safety assessment, and bias-risk report for the audited dataset.

Missingness mechanism report
Imputation safety recommendation
Bias-risk assessment

By the numbers

Covers 3 Rubin missingness mechanisms: MCAR, MAR, and MNAR

Files

SKILL.md

You are an expert data quality engineer. Your goal is to systematically assess dataset health, surface hidden issues that corrupt downstream analysis, and prescribe prioritized fixes. You move fast, think in impact, and never let "good enough" data quietly poison a model or dashboard.

---

Entry Points

Mode 1 — Full Audit (New Dataset)

Use when you have a dataset you've never assessed before.

1. Profile: Run data_profiler.py to get shape, types, completeness, and distributions
2. Missing Values: Run missing_value_analyzer.py to classify missingness patterns (MCAR/MAR/MNAR)
3. Outliers: Run outlier_detector.py to flag anomalies using IQR and Z-score methods
4. Cross-column checks: Inspect referential integrity, duplicate rows, and logical constraints
5. Score & Report: Assign a Data Quality Score (DQS) and produce the remediation plan

Mode 2 — Targeted Scan (Specific Concern)

Use when a specific column, metric, or pipeline stage is suspected.

1. Ask: What broke, when did it start, and what changed upstream?
2. Run the relevant script against the suspect columns only
3. Compare distributions against a known-good baseline if available
4. Trace issues to root cause (source system, ETL transform, ingestion lag)
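Step 3 can be sketched as a simple mean-shift check. This is a minimal illustration, not part of the skill's scripts: the function name `mean_shift_in_sigmas` and the sample values are hypothetical, and the 2-sigma cutoff follows the distribution-shift trigger described elsewhere in this skill.

```python
import statistics


def mean_shift_in_sigmas(baseline: list[float], current: list[float]) -> float:
    """How many baseline standard deviations the current mean has moved."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)  # sample std; needs at least 2 values
    cur_mean = statistics.fmean(current)
    if base_std == 0:
        # A constant baseline: any movement at all is an unbounded shift
        return 0.0 if cur_mean == base_mean else float("inf")
    return abs(cur_mean - base_mean) / base_std


# Hypothetical known-good baseline vs. the suspect batch
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
current = [110, 112, 108, 111, 109, 110, 113, 107]

shift = mean_shift_in_sigmas(baseline, current)
verdict = "investigate upstream" if shift > 2 else "within tolerance"
print(f"mean shift: {shift:.1f} sigma ({verdict})")
```

A mean check alone misses changes in spread or shape, so in practice it would sit next to a comparison of std and null rate.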

Mode 3 — Ongoing Monitoring Setup

Use when the user wants recurring quality checks on a live pipeline.

1. Identify the 5–8 critical columns driving key metrics
2. Define thresholds: acceptable null %, outlier rate, value domain
3. Generate a monitoring checklist and alerting logic from data_profiler.py --monitor
4. Schedule checks at ingestion cadence
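A threshold check of the kind step 2 describes might look like the sketch below. The `THRESHOLDS` structure, the column names, and `check_batch` are all hypothetical; the skill does not prescribe a config format, so treat this as one possible shape.

```python
# Hypothetical per-column rules: maximum null % and an optional value domain
THRESHOLDS = {
    "user_id": {"max_null_pct": 0.0},
    "plan": {"max_null_pct": 2.0, "allowed": {"free", "pro", "team"}},
}


def check_batch(rows: list[dict], thresholds: dict) -> list[str]:
    """Return one alert string per violated rule; an empty list means the batch passes."""
    alerts = []
    for col, rule in thresholds.items():
        values = [(row.get(col) or "").strip() for row in rows]
        nulls = sum(1 for v in values if v == "")
        null_pct = nulls / len(values) * 100 if values else 0.0
        if null_pct > rule["max_null_pct"]:
            alerts.append(f"{col}: null_pct {null_pct:.1f}% exceeds {rule['max_null_pct']:.1f}%")
        allowed = rule.get("allowed")
        if allowed is not None:
            bad = sorted({v for v in values if v and v not in allowed})
            if bad:
                alerts.append(f"{col}: values outside domain: {', '.join(bad)}")
    return alerts


batch = [
    {"user_id": "1", "plan": "free"},
    {"user_id": "2", "plan": "pro"},
    {"user_id": "", "plan": "gold"},
    {"user_id": "4", "plan": "team"},
]
for alert in check_batch(batch, THRESHOLDS):
    print(alert)
```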

---

Tools

`scripts/data_profiler.py`

Full dataset profile: shape, dtypes, null counts, cardinality, value distributions, and a Data Quality Score.

Features:

Per-column null %, unique count, top values, min/max/mean/std
Detects constant columns, high-cardinality text fields, mixed types
Outputs a DQS (0–100) based on completeness + consistency signals
--monitor flag prints threshold-ready summary for alerting

# Profile from CSV
python3 scripts/data_profiler.py --file data.csv

# Profile specific columns
python3 scripts/data_profiler.py --file data.csv --columns col1,col2,col3

# Output JSON for downstream use
python3 scripts/data_profiler.py --file data.csv --format json

# Generate monitoring thresholds
python3 scripts/data_profiler.py --file data.csv --monitor

`scripts/missing_value_analyzer.py`

Deep-dive into missingness: volume, patterns, and likely mechanism (MCAR/MAR/MNAR).

Features:

Null heatmap summary (text-based) and co-occurrence matrix
Pattern classification: random, systematic, correlated
Imputation strategy recommendations per column (drop / mean / median / mode / forward-fill / flag)
Estimates downstream impact if missingness is ignored

# Analyze all missing values
python3 scripts/missing_value_analyzer.py --file data.csv

# Focus on columns above a null threshold
python3 scripts/missing_value_analyzer.py --file data.csv --threshold 0.05

# Output JSON
python3 scripts/missing_value_analyzer.py --file data.csv --format json

`scripts/outlier_detector.py`

Multi-method outlier detection with business-impact context.

Features:

IQR method (robust, non-parametric)
Z-score method (normal distribution assumption)
Modified Z-score (Iglewicz-Hoaglin, robust to skew)
Per-column outlier count, %, and boundary values
Flags columns where outliers may be data errors vs. legitimate extremes

# Detect outliers across all numeric columns
python3 scripts/outlier_detector.py --file data.csv

# Use specific method
python3 scripts/outlier_detector.py --file data.csv --method iqr

# Set custom Z-score threshold
python3 scripts/outlier_detector.py --file data.csv --method zscore --threshold 2.5

# Output JSON
python3 scripts/outlier_detector.py --file data.csv --format json

---

Data Quality Score (DQS)

The DQS is a 0–100 composite score across five dimensions. Report it at the top of every audit.

Dimension	Weight	What It Measures
Completeness	30%	Null / missing rate across critical columns
Consistency	25%	Type conformance, format uniformity, no mixed types
Validity	20%	Values within expected domain (ranges, categories, regexes)
Uniqueness	15%	Duplicate rows, duplicate keys, redundant columns
Timeliness	10%	Freshness of timestamps, lag from source system

Scoring thresholds:

🟢 85–100 — Production-ready
🟡 65–84 — Usable with documented caveats
🔴 0–64 — Remediation required before use
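As a worked example of the weighting, the sketch below combines five dimension scores with the weights from the table and maps the result to a band. The dimension values are made up for illustration, and `dqs` and `band` are hypothetical helper names.

```python
# Weights taken from the dimension table above
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "validity": 0.20,
    "uniqueness": 0.15,
    "timeliness": 0.10,
}


def dqs(dimensions: dict[str, float]) -> float:
    """Weighted sum of 0-100 dimension scores, rounded to one decimal."""
    return round(sum(dimensions[name] * weight for name, weight in WEIGHTS.items()), 1)


def band(score: float) -> str:
    if score >= 85:
        return "production-ready"
    if score >= 65:
        return "usable with documented caveats"
    return "remediation required"


# Illustrative scores only, not measured from a real dataset
example = {
    "completeness": 92.0,
    "consistency": 80.0,
    "validity": 70.0,
    "uniqueness": 100.0,
    "timeliness": 50.0,
}
score = dqs(example)
print(f"DQS: {score}/100 ({band(score)})")
```

Here 27.6 + 20 + 14 + 15 + 5 gives 81.6, so strong uniqueness cannot lift a dataset with stale timestamps and validity problems into the top band.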

---

Proactive Risk Triggers

Surface these unprompted whenever you spot the signals:

Silent nulls — Nulls encoded as 0, "", "N/A", "null" strings. Completeness metrics lie until these are caught.
Leaky timestamps — Future dates, dates before system launch, or timezone mismatches that corrupt time-series joins.
Cardinality explosions — Free-text fields with thousands of unique values masquerading as categorical. Will break one-hot encoding silently.
Duplicate keys — PKs that aren't unique invalidate joins and aggregations downstream.
Distribution shift — Columns where current distribution diverges from baseline (>2σ on mean/std). Signals upstream pipeline changes.
Correlated missingness: Nulls concentrated in a specific time range, user segment, or region. This rules out MCAR; it points to MAR when the segment is itself observed, and to possible MNAR when the gap tracks the missing value.
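The silent-nulls trigger is easy to demonstrate. The sketch below compares a naive completeness check with a token-aware one; the token list and both function names are assumptions, not the skill's own implementation.

```python
# Assumed placeholder tokens; extend per source system
NULL_TOKENS = {"", "n/a", "na", "null", "none", "nan"}


def naive_null_rate(values: list[str]) -> float:
    """Counts only truly empty strings, the way a default completeness check might."""
    return sum(1 for v in values if v == "") / len(values)


def token_aware_null_rate(values: list[str], tokens: set[str] = NULL_TOKENS) -> float:
    """Also counts placeholder strings such as "N/A" and whitespace-only cells."""
    return sum(1 for v in values if v.strip().lower() in tokens) / len(values)


column = ["12", "N/A", "", "null", "7", " ", "3", "0"]
print(f"naive: {naive_null_rate(column):.1%}")
print(f"token-aware: {token_aware_null_rate(column):.1%}")
```

The `"0"` is deliberately not counted: whether zero encodes "unknown" depends on the column and needs domain confirmation.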

---

Output Artifacts

Request	Deliverable
"Profile this dataset"	Full DQS report with per-column breakdown and top issues ranked by impact
"What's wrong with column X?"	Targeted column audit: nulls, outliers, type issues, value domain violations
"Is this data ready for modeling?"	Model-readiness checklist with pass/fail per ML requirement
"Help me clean this data"	Prioritized remediation plan with specific transforms per issue
"Set up monitoring"	Threshold config + alerting checklist for critical columns
"Compare this to last month"	Distribution comparison report with drift flags

---

Remediation Playbook

Missing Values

Null %	Recommended Action
< 1%	Drop rows (if dataset is large) or impute with median/mode
1–10%	Impute; add a binary indicator column `col_was_null`
10–30%	Impute cautiously; investigate root cause; document assumption
> 30%	Flag for domain review; do not impute blindly; consider dropping column

Outliers

Likely data error (value physically impossible): cap, correct, or drop
Legitimate extreme (valid but rare): keep, document, consider log transform for modeling
Unknown (can't determine without domain input): flag, do not silently remove

Duplicates

1. Confirm uniqueness key with data owner before deduplication
2. Prefer keep='last' for event data (most recent state wins)
3. Prefer keep='first' for slowly-changing-dimension tables
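The keep-first and keep-last choices can be sketched in plain Python. The `dedupe` helper and the sample events are hypothetical, and the rows are assumed to be in arrival order.

```python
def dedupe(rows: list[dict], key: list[str], keep: str = "last") -> list[dict]:
    """Keep one row per key. Output order follows the first appearance of each key."""
    chosen: dict[tuple, dict] = {}
    for row in rows:
        k = tuple(row[col] for col in key)
        # keep="last": later rows overwrite; keep="first": only the first row is stored
        if keep == "last" or k not in chosen:
            chosen[k] = row
    return list(chosen.values())


# Hypothetical event stream, oldest first
events = [
    {"id": "a", "status": "new"},
    {"id": "b", "status": "new"},
    {"id": "a", "status": "paid"},
]
print(dedupe(events, key=["id"], keep="last"))
print(dedupe(events, key=["id"], keep="first"))
```

With `keep="last"` order `a` ends up as `paid`; with `keep="first"` it stays `new`, which is why the key and the policy both need sign-off from the data owner.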

---

Quality Loop

Tag every finding with a confidence level:

🟢 Verified — confirmed by data inspection or domain owner
🟡 Likely — strong signal but not fully confirmed
🔴 Assumed — inferred from patterns; needs domain validation

Never auto-remediate 🔴 findings without human confirmation.

---

Communication Standard

Structure all audit reports as:

Bottom Line: DQS score and one-sentence verdict (e.g., "DQS: 61/100, remediation required before production use")
What: The specific issues found (ranked by severity × breadth)
Why It Matters: Business or analytical impact of each issue
How to Act: Specific, ordered remediation steps

---

Related Skills

Skill	Use When
`finance/financial-analyst`	Data involves financial statements or accounting figures
`finance/saas-metrics-coach`	Data is subscription/event data feeding SaaS KPIs
`engineering/database-designer`	Issues trace back to schema design or normalization
`engineering/tech-debt-tracker`	Data quality issues are systemic and need to be tracked as tech debt
`product-team/product-analytics`	Auditing product event data (funnels, sessions, retention)

When NOT to use this skill:

You need to design or optimize the database schema — use engineering/database-designer
You need to build the ETL pipeline itself — use an engineering skill
The dataset is a financial model output — use finance/financial-analyst for model validation

---

References

references/data-quality-concepts.md — MCAR/MAR/MNAR theory, DQS methodology, outlier detection methods

Data Quality Concepts Reference

Deep-dive reference for the Data Quality Auditor skill. Keep SKILL.md lean — this is where the theory lives.

---

Missingness Mechanisms (Rubin, 1976)

Understanding why data is missing determines how safely it can be imputed.

MCAR — Missing Completely At Random

The probability of missingness is independent of both observed and unobserved data.
Example: A sensor drops a reading due to random hardware noise.
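Safe to impute? Yes, for central tendency. Mean/median imputation does not systematically bias estimates of the mean, but it shrinks variance and weakens correlations with other columns.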
Detection: Null rows are indistinguishable from non-null rows on all other dimensions.

MAR — Missing At Random

The probability of missingness depends on observed data, not the missing value itself.
Example: Older users are less likely to fill in a "social media handle" field — missingness depends on age (observed), not on the handle itself.
Safe to impute? Conditionally yes — impute using a model that accounts for the related observed variables.
Detection: Null rows differ systematically from non-null rows on other columns.
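One way to look for that systematic difference is to compare another column between rows where the target is null and rows where it is present. The sketch below uses a standardized mean difference; `missingness_gap`, the sample rows, and the use of the overall standard deviation as the scale are illustrative assumptions.

```python
from statistics import fmean, stdev


def missingness_gap(rows: list[dict], target: str, other: str) -> float:
    """Difference in mean of `other` (null rows minus present rows), in overall std units."""
    null_vals = [r[other] for r in rows if r[target] is None]
    seen_vals = [r[other] for r in rows if r[target] is not None]
    if not null_vals or not seen_vals:
        return 0.0
    scale = stdev(null_vals + seen_vals)
    return (fmean(null_vals) - fmean(seen_vals)) / scale if scale else 0.0


# Hypothetical users: the handle is missing mostly for older users
rows = [
    {"age": 61, "handle": None}, {"age": 58, "handle": None},
    {"age": 66, "handle": None}, {"age": 70, "handle": None},
    {"age": 23, "handle": "@a"}, {"age": 31, "handle": "@b"},
    {"age": 27, "handle": "@c"}, {"age": 35, "handle": "@d"},
    {"age": 29, "handle": "@e"}, {"age": 40, "handle": "@f"},
]
gap = missingness_gap(rows, target="handle", other="age")
print(f"age gap between null and non-null handle rows: {gap:.2f} std")
```

A gap near zero on every other column is consistent with MCAR; a large gap, as here, says missingness depends on an observed variable.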

MNAR — Missing Not At Random

The probability of missingness depends on the missing value itself (unobserved).
Example: High earners skip the income field; low performers skip the satisfaction survey.
Safe to impute? No — imputation will introduce systematic bias. Escalate to domain owner.
Detection: Difficult to confirm statistically; look for clustered nulls in time or segment slices.
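The practical difference between the three mechanisms shows up in a small simulation. The sketch below is an illustration under made-up assumptions (a synthetic income that rises with age, arbitrary drop probabilities, a fixed seed); none of the names come from the skill's scripts.

```python
import random
from statistics import fmean

rng = random.Random(42)
N = 20_000

# Synthetic population: income rises with age, plus noise
ages = [rng.uniform(20, 70) for _ in range(N)]
incomes = [1000 * a + rng.gauss(0, 5000) for a in ages]


def observe(drop_prob):
    """Copy of incomes with None wherever the value was dropped."""
    return [None if rng.random() < drop_prob(a, y) else y for a, y in zip(ages, incomes)]


mcar = observe(lambda a, y: 0.3)                          # independent of everything
mar = observe(lambda a, y: 0.6 if a > 45 else 0.1)        # depends on observed age
mnar = observe(lambda a, y: 0.6 if y > 45_000 else 0.1)   # depends on the income itself


def observed_mean(col):
    return fmean(v for v in col if v is not None)


def age_stratified_mean(col):
    """Observed mean within each age group, weighted by the group's full size."""
    total = 0.0
    for in_group in (lambda a: a <= 45, lambda a: a > 45):
        members = [v for a, v in zip(ages, col) if in_group(a)]
        seen = [v for v in members if v is not None]
        total += fmean(seen) * len(members) / N
    return total


true_mean = fmean(incomes)
print(f"true mean            {true_mean:10.0f}")
print(f"MCAR observed mean   {observed_mean(mcar):10.0f}")
print(f"MAR observed mean    {observed_mean(mar):10.0f}")
print(f"MAR, age-stratified  {age_stratified_mean(mar):10.0f}")
print(f"MNAR observed mean   {observed_mean(mnar):10.0f}")
```

Under MCAR the observed mean stays close to the truth. Under MAR it drifts, but conditioning on the observed driver (age) recovers it. Under MNAR the driver is the missing value itself, so no observed column can fully correct it.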

---

Data Quality Score (DQS) Methodology

The DQS is a weighted composite of five ISO 8000 / DAMA-aligned dimensions:

Dimension	Weight	Rationale
Completeness	30%	Nulls are the most common and impactful quality failure
Consistency	25%	Type/format violations corrupt joins and aggregations silently
Validity	20%	Out-of-domain values (negative ages, future birth dates) create invisible errors
Uniqueness	15%	Duplicate rows inflate metrics and invalidate joins
Timeliness	10%	Stale data causes decisions based on outdated state

Scoring thresholds align to production-readiness standards:

85–100: Ready for production use in models and dashboards
65–84: Usable for exploratory analysis with documented caveats
0–64: Unreliable; remediation required before use in any decision-making context

---

Outlier Detection Methods

IQR (Interquartile Range)

Formula: Outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR
Strengths: Non-parametric, robust to non-normal distributions, interpretable bounds
Weaknesses: On heavily skewed distributions the symmetric fences over-flag the long tail and can miss outliers on the short side; 1.5× multiplier is conventional, not universal
When to use: Default choice for most business datasets (revenue, counts, durations)
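The IQR rule can be sketched with the standard library. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, which may not match what outlier_detector.py uses, and `iqr_bounds` is a hypothetical helper name.

```python
from statistics import quantiles


def iqr_bounds(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


readings = [10, 12, 11, 13, 12, 11, 14, 95]
low, high = iqr_bounds(readings)
outliers = [x for x in readings if x < low or x > high]
print(f"fences: [{low}, {high}]  outliers: {outliers}")
```

For this sample Q1 is 11 and Q3 is 13.25, so the fences are 7.625 and 16.625 and only 95 falls outside.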

Z-score

Formula: Outlier if |x − μ| / σ > threshold (commonly 3.0)
Strengths: Simple, widely understood, easy to explain to stakeholders
Weaknesses: Mean and std are themselves influenced by outliers — the method is self-defeating for extreme contamination
When to use: Only when the distribution is approximately normal and contamination is < 5%
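The self-defeating behaviour is easy to reproduce on a small sample. The sketch below assumes the sample standard deviation; `zscore_outliers` is a hypothetical helper, not the script's function.

```python
from statistics import fmean, stdev


def zscore_outliers(values: list[float], threshold: float = 3.0) -> list[float]:
    """Values whose distance from the mean exceeds `threshold` sample standard deviations."""
    mu, sigma = fmean(values), stdev(values)
    if sigma == 0:
        return []
    return [x for x in values if abs(x - mu) / sigma > threshold]


readings = [10, 12, 11, 13, 12, 11, 14, 95]
print(zscore_outliers(readings))       # [] : 95 inflates the std enough to hide itself
print(zscore_outliers(readings, 2.0))  # [95]
```

The value 95 scores only about 2.47 because it drags the mean to 22.25 and the standard deviation to roughly 29.4. With eight values no point can exceed (n − 1)/√n ≈ 2.47, so a threshold of 3.0 can never fire on a sample this small.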

Modified Z-score (Iglewicz-Hoaglin)

Formula: M_i = 0.6745 × |x_i − median| / MAD; outlier if M_i > 3.5
Strengths: Uses median and MAD — both resistant to outlier influence; handles skewed distributions
Weaknesses: MAD = 0 for discrete columns with one dominant value; less intuitive
When to use: Preferred for skewed distributions (e.g. revenue, latency, page views)
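A minimal sketch of the formula, including the MAD = 0 case noted under weaknesses. Returning no outliers when MAD is zero is a choice made for this example only; a real implementation might fall back to another rule. The helper name is hypothetical.

```python
from statistics import median


def modified_z_outliers(values: list[float], threshold: float = 3.5) -> list[float]:
    """Flag x where 0.6745 * |x - median| / MAD exceeds the threshold."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # Score is undefined when more than half the values equal the median
        return []
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]


readings = [10, 12, 11, 13, 12, 11, 14, 95]
print(modified_z_outliers(readings))         # [95]
print(modified_z_outliers([5, 5, 5, 5, 9]))  # [] because MAD is 0
```

On the same readings where the plain Z-score at 3.0 flags nothing, the median is 12 and the MAD is 1, so 95 scores about 56 and is flagged.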

---

Imputation Strategies

Method	When	Risk
Mean	MCAR, continuous, symmetric distribution	Distorts variance; don't use with skewed data
Median	MCAR/MAR, continuous, skewed distribution	Safe for skewed; loses variance
Mode	MCAR/MAR, categorical	Can over-represent one category
Forward-fill	Time series with MCAR/MAR gaps	Assumes value persists — valid for slowly-changing fields
Binary indicator	Null % 1–30%	Preserves information about missingness without imputing
Model-based	MAR, high-value columns	Most accurate but computationally expensive
Drop column	> 50% missing with no business justification	Safest option if column has no predictive value

Golden rule: Always add a col_was_null indicator column when imputing with null% > 1%. This preserves the information that a value was imputed, which may itself be predictive.
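The golden rule can be sketched as median imputation that also records which rows were filled. The helper `impute_with_indicator` and the `latency_ms` column are hypothetical; rows are plain dicts with `None` for nulls.

```python
from statistics import median


def impute_with_indicator(rows: list[dict], col: str) -> list[dict]:
    """Fill nulls in `col` with the observed median and add `<col>_was_null` (0 or 1)."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = median(observed)  # raises StatisticsError if the column is entirely null
    out = []
    for r in rows:
        was_null = r[col] is None
        out.append({**r, col: fill if was_null else r[col], f"{col}_was_null": int(was_null)})
    return out


rows = [
    {"latency_ms": 120}, {"latency_ms": None}, {"latency_ms": 95},
    {"latency_ms": 300}, {"latency_ms": None},
]
for row in impute_with_indicator(rows, "latency_ms"):
    print(row)
```

The input rows are left untouched, and the indicator lets a downstream model learn from the fact of missingness rather than from the filled value.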

---

Common Silent Data Quality Failures

These are the issues that don't raise errors but corrupt results:

1. Sentinel values: 0, -1, 9999, "" used to mean "unknown" in legacy systems
2. Timezone naive timestamps: datetimes stored without timezone; comparisons silently shift by hours
3. Trailing whitespace: "active " ≠ "active" causes silent join mismatches
4. Encoding errors: UTF-8 vs Latin-1 mismatches produce garbled strings in one column
5. Scientific notation: 1e6 stored as string gets treated as a category not a number
6. Implicit schema changes: upstream adds a new category to a lookup field; existing code silently drops new rows

---

References

Rubin, D.B. (1976). "Inference and Missing Data." Biometrika 63(3): 581–592.
Iglewicz, B. & Hoaglin, D. (1993). How to Detect and Handle Outliers. ASQC Quality Press.
DAMA International (2017). DAMA-DMBOK: Data Management Body of Knowledge. 2nd ed.
ISO 8000-8: Data quality — Concepts and measuring.

#!/usr/bin/env python3
"""
data_profiler.py: Full dataset profile with Data Quality Score (DQS).

Usage:
    python3 data_profiler.py --file data.csv
    python3 data_profiler.py --file data.csv --columns col1,col2
    python3 data_profiler.py --file data.csv --format json
    python3 data_profiler.py --file data.csv --monitor
"""

# Must follow the module docstring; placed before it, the string above stops being the docstring
from __future__ import annotations

import argparse
import csv
import json
import math
import sys
from collections import Counter


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def infer_type(values: list[str]) -> str:
    """Infer dominant type from non-null string values."""
    counts = {"int": 0, "float": 0, "bool": 0, "string": 0}
    for v in values:
        v = v.strip()
        if v.lower() in ("true", "false"):
            counts["bool"] += 1
        else:
            try:
                int(v)
                counts["int"] += 1
            except ValueError:
                try:
                    float(v)
                    counts["float"] += 1
                except ValueError:
                    counts["string"] += 1
    dominant = max(counts, key=lambda k: counts[k])
    return dominant if counts[dominant] > 0 else "string"


def safe_mean(nums: list[float]) -> float | None:
    return sum(nums) / len(nums) if nums else None


def safe_std(nums: list[float], mean: float) -> float | None:
    if len(nums) < 2:
        return None
    variance = sum((x - mean) ** 2 for x in nums) / (len(nums) - 1)
    return math.sqrt(variance)


def profile_column(name: str, raw_values: list[str]) -> dict:
    total = len(raw_values)
    null_strings = {"", "null", "none", "n/a", "na", "nan", "nil"}
    null_count = sum(1 for v in raw_values if v.strip().lower() in null_strings)
    non_null = [v for v in raw_values if v.strip().lower() not in null_strings]

    col_type = infer_type(non_null)
    unique_values = set(non_null)
    top_values = Counter(non_null).most_common(5)

    profile = {
        "column": name,
        "total_rows": total,
        "null_count": null_count,
        "null_pct": round(null_count / total * 100, 2) if total else 0,
        "non_null_count": len(non_null),
        "unique_count": len(unique_values),
        "cardinality_pct": round(len(unique_values) / len(non_null) * 100, 2) if non_null else 0,
        "inferred_type": col_type,
        "top_values": top_values,
        "is_constant": len(unique_values) == 1,
        "is_high_cardinality": len(unique_values) / len(non_null) > 0.9 if len(non_null) > 10 else False,
    }

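    if col_type in ("int", "float"):
        try:
            nums = [float(v) for v in non_null]
            mean = safe_mean(nums)
            # safe_std returns None for fewer than two values, so guard before rounding
            std = safe_std(nums, mean) if mean is not None else None
            profile["min"] = min(nums)
            profile["max"] = max(nums)
            profile["mean"] = round(mean, 4) if mean is not None else None
            profile["std"] = round(std, 4) if std is not None else None
        except ValueError:
            # Column is only mostly numeric; skip numeric stats rather than fail
            pass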

    return profile


def compute_dqs(profiles: list[dict], total_rows: int) -> dict:
    """Compute Data Quality Score (0-100) across 5 dimensions."""
    if not profiles or total_rows == 0:
        return {"score": 0, "dimensions": {}}

    # Completeness (30%) — avg non-null rate
    avg_null_pct = sum(p["null_pct"] for p in profiles) / len(profiles)
    completeness = max(0, 100 - avg_null_pct)

    # Consistency (25%): penalize constant columns (mixed-type detection is not implemented here)
    constant_cols = sum(1 for p in profiles if p["is_constant"])
    consistency = max(0, 100 - (constant_cols / len(profiles)) * 100)

    # Validity (20%) — penalize high-cardinality string cols (proxy for free-text issues)
    high_card = sum(1 for p in profiles if p["is_high_cardinality"] and p["inferred_type"] == "string")
    validity = max(0, 100 - (high_card / len(profiles)) * 60)

    # Uniqueness (15%) — placeholder; duplicate detection needs full row comparison
    uniqueness = 90.0  # conservative default without row-level dedup check

    # Timeliness (10%) — placeholder; requires timestamp columns
    timeliness = 85.0  # conservative default

    score = (
        completeness * 0.30
        + consistency * 0.25
        + validity * 0.20
        + uniqueness * 0.15
        + timeliness * 0.10
    )

    return {
        "score": round(score, 1),
        "dimensions": {
            "completeness": round(completeness, 1),
            "consistency": round(consistency, 1),
            "validity": round(validity, 1),
            "uniqueness": uniqueness,
            "timeliness": timeliness,
        },
    }


def dqs_label(score: float) -> str:
    if score >= 85:
        return "PASS — Production-ready"
    elif score >= 65:
        return "WARN — Usable with documented caveats"
    else:
        return "FAIL — Remediation required before use"


def print_report(headers: list[str], profiles: list[dict], dqs: dict, total_rows: int, monitor: bool):
    print("=" * 64)
    print("DATA QUALITY AUDIT REPORT")
    print("=" * 64)
    print(f"Rows: {total_rows}  |  Columns: {len(headers)}")
    score = dqs["score"]
    indicator = "🟢" if score >= 85 else ("🟡" if score >= 65 else "🔴")
    print(f"\nData Quality Score (DQS): {score}/100  {indicator}")
    print(f"Verdict: {dqs_label(score)}")

    dims = dqs["dimensions"]
    print("\nDimension Breakdown:")
    for dim, val in dims.items():
        bar = int(val / 5)
        print(f"  {dim.capitalize():<14} {val:>5.1f}  {'█' * bar}{'░' * (20 - bar)}")

    print("\n" + "-" * 64)
    print("COLUMN PROFILES")
    print("-" * 64)

    issues = []
    for p in profiles:
        status = "🟢"
        col_issues = []
        if p["null_pct"] > 30:
            status = "🔴"
            col_issues.append(f"{p['null_pct']}% nulls — investigate root cause")
        elif p["null_pct"] > 10:
            status = "🟡"
            col_issues.append(f"{p['null_pct']}% nulls — impute cautiously")
        elif p["null_pct"] > 1:
            col_issues.append(f"{p['null_pct']}% nulls — impute with indicator")
        if p["is_constant"]:
            if status == "🟢":  # do not downgrade a 🔴 already set by the null check
                status = "🟡"
            col_issues.append("Constant column — zero variance, likely useless")
        if p["is_high_cardinality"] and p["inferred_type"] == "string":
            col_issues.append("High-cardinality string — check if categorical or free-text")

        print(f"\n  {status} {p['column']}")
        print(f"     Type: {p['inferred_type']}  |  Nulls: {p['null_count']} ({p['null_pct']}%)  |  Unique: {p['unique_count']}")
        if "min" in p:
            print(f"     Min: {p['min']}  Max: {p['max']}  Mean: {p['mean']}  Std: {p['std']}")
        if p["top_values"]:
            top = ", ".join(f"{v}({c})" for v, c in p["top_values"][:3])
            print(f"     Top values: {top}")
        for issue in col_issues:
            issues.append((p["column"], issue))
            print(f"     ⚠  {issue}")

    if issues:
        print("\n" + "-" * 64)
        print(f"ISSUES SUMMARY ({len(issues)} found)")
        print("-" * 64)
        for col, msg in issues:
            print(f"  [{col}] {msg}")

    if monitor:
        print("\n" + "-" * 64)
        print("MONITORING THRESHOLDS (copy into alerting config)")
        print("-" * 64)
        for p in profiles:
            if p["null_pct"] > 0:
                print(f"  {p['column']}: null_pct <= {min(p['null_pct'] * 1.5, 100):.1f}%")
            if "mean" in p and p["mean"] is not None:
                drift = abs(p.get("std", 0) or 0) * 2
                print(f"  {p['column']}: mean within [{p['mean'] - drift:.2f}, {p['mean'] + drift:.2f}]")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Profile a CSV dataset and compute a Data Quality Score.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--columns", help="Comma-separated list of columns to profile (default: all)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    parser.add_argument("--monitor", action="store_true", help="Print monitoring thresholds")
    args = parser.parse_args()

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty or has no data rows.", file=sys.stderr)
        sys.exit(1)

    # Tolerate spaces after commas, e.g. --columns "col1, col2"
    selected = [c.strip() for c in args.columns.split(",") if c.strip()] if args.columns else headers
    missing_cols = [c for c in selected if c not in headers]
    if missing_cols:
        print(f"Error: columns not found: {', '.join(missing_cols)}", file=sys.stderr)
        sys.exit(1)

    profiles = [profile_column(col, [row.get(col, "") for row in rows]) for col in selected]
    dqs = compute_dqs(profiles, len(rows))

    if args.format == "json":
        print(json.dumps({"total_rows": len(rows), "dqs": dqs, "columns": profiles}, indent=2))
    else:
        print_report(selected, profiles, dqs, len(rows), args.monitor)


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
missing_value_analyzer.py — Classify missingness patterns and recommend imputation strategies.

Usage:
    python3 missing_value_analyzer.py --file data.csv
    python3 missing_value_analyzer.py --file data.csv --threshold 0.05
    python3 missing_value_analyzer.py --file data.csv --format json
"""

import argparse
import csv
import json
import sys
from collections import defaultdict


NULL_STRINGS = {"", "null", "none", "n/a", "na", "nan", "nil", "undefined", "missing"}


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def is_null(val: str) -> bool:
    return val.strip().lower() in NULL_STRINGS


def compute_null_mask(headers: list[str], rows: list[dict]) -> dict[str, list[bool]]:
    return {col: [is_null(row.get(col, "")) for row in rows] for col in headers}


def null_stats(mask: list[bool]) -> dict:
    total = len(mask)
    count = sum(mask)
    return {"count": count, "pct": round(count / total * 100, 2) if total else 0}


def classify_mechanism(col: str, mask: list[bool], all_masks: dict[str, list[bool]]) -> str:
    """
    Heuristic classification of missingness mechanism:
    - MCAR: nulls appear randomly, no correlation with other columns
    - MAR:  nulls correlate with values in other observed columns
    - MNAR: nulls correlate with the missing column's own unobserved value (can't fully detect)

    Returns "None" when the column has no nulls, "Insufficient data" for fewer than
    10 rows, otherwise a string starting with "MCAR (likely)", "MAR (likely)", or
    "MNAR (possible)".
    """
    null_indices = {i for i, v in enumerate(mask) if v}
    if not null_indices:
        return "None"

    n = len(mask)
    if n < 10:
        return "Insufficient data"

    # Check correlation with other columns' nulls
    correlated_cols = []
    for other_col, other_mask in all_masks.items():
        if other_col == col:
            continue
        other_null_indices = {i for i, v in enumerate(other_mask) if v}
        if not other_null_indices:
            continue
        overlap = len(null_indices & other_null_indices)
        union = len(null_indices | other_null_indices)
        jaccard = overlap / union if union else 0
        if jaccard > 0.5:
            correlated_cols.append(other_col)

    # Check if nulls are clustered (time/positional pattern) — proxy for MNAR
    sorted_indices = sorted(null_indices)
    if len(sorted_indices) > 2:
        gaps = [sorted_indices[i + 1] - sorted_indices[i] for i in range(len(sorted_indices) - 1)]
        avg_gap = sum(gaps) / len(gaps)
        clustered = avg_gap < n / len(null_indices) * 0.5  # nulls appear closer together than random
    else:
        clustered = False

    if correlated_cols:
        return f"MAR (likely) — co-occurs with nulls in: {', '.join(correlated_cols[:3])}"
    elif clustered:
        return "MNAR (possible) — nulls are spatially clustered, may reflect a systematic gap"
    else:
        return "MCAR (likely) — nulls appear random, no strong correlation detected"


def recommend_strategy(pct: float, col_type: str) -> str:
    if pct == 0:
        return "No action needed"
    if pct < 1:
        return "Drop rows — impact is negligible"
    if pct < 10:
        strategies = {
            "int": "Impute with median + add binary indicator column",
            "float": "Impute with median + add binary indicator column",
            "string": "Impute with mode or 'Unknown' category + add indicator",
            "bool": "Impute with mode",
        }
        return strategies.get(col_type, "Impute with median/mode + add indicator")
    if pct < 30:
        return "Impute cautiously; investigate root cause; document assumption; add indicator"
    return "Do NOT impute blindly — > 30% missing. Escalate to domain owner or consider dropping column"


def infer_type(values: list[str]) -> str:
    non_null = [v for v in values if not is_null(v)]
    counts = {"int": 0, "float": 0, "bool": 0, "string": 0}
    for v in non_null[:200]:  # sample for speed
        v = v.strip()
        if v.lower() in ("true", "false"):
            counts["bool"] += 1
        else:
            try:
                int(v)
                counts["int"] += 1
            except ValueError:
                try:
                    float(v)
                    counts["float"] += 1
                except ValueError:
                    counts["string"] += 1
    return max(counts, key=lambda k: counts[k]) if any(counts.values()) else "string"


def compute_cooccurrence(headers: list[str], masks: dict[str, list[bool]], top_n: int = 5) -> list[dict]:
    """Find column pairs where nulls most frequently co-occur."""
    pairs = []
    cols = list(headers)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            mask_a, mask_b = masks[a], masks[b]
            overlap = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
            if overlap > 0:
                pairs.append({"col_a": a, "col_b": b, "co_null_rows": overlap})
    pairs.sort(key=lambda x: -x["co_null_rows"])
    return pairs[:top_n]


def print_report(headers: list[str], rows: list[dict], masks: dict, threshold: float):
    total = len(rows)
    print("=" * 64)
    print("MISSING VALUE ANALYSIS REPORT")
    print("=" * 64)
    print(f"Rows: {total}  |  Columns: {len(headers)}")

    results = []
    for col in headers:
        mask = masks[col]
        stats = null_stats(mask)
        # Skip fully complete columns and columns below the null threshold
        if stats["count"] == 0 or stats["pct"] / 100 < threshold:
            continue
        raw_vals = [row.get(col, "") for row in rows]
        col_type = infer_type(raw_vals)
        mechanism = classify_mechanism(col, mask, masks)
        strategy = recommend_strategy(stats["pct"], col_type)
        results.append({
            "column": col,
            "null_count": stats["count"],
            "null_pct": stats["pct"],
            "col_type": col_type,
            "mechanism": mechanism,
            "strategy": strategy,
        })

    fully_complete = [col for col in headers if null_stats(masks[col])["count"] == 0]
    print(f"\nFully complete columns: {len(fully_complete)}/{len(headers)}")

    if not results:
        print(f"\nNo columns exceed the null threshold ({threshold * 100:.1f}%).")
    else:
        print(f"\nColumns with missing values (threshold >= {threshold * 100:.1f}%):\n")
        for r in sorted(results, key=lambda x: -x["null_pct"]):
            indicator = "🔴" if r["null_pct"] > 30 else ("🟡" if r["null_pct"] > 10 else "🟢")
            print(f"  {indicator} {r['column']}")
            print(f"     Nulls: {r['null_count']} ({r['null_pct']}%)  |  Type: {r['col_type']}")
            print(f"     Mechanism: {r['mechanism']}")
            print(f"     Strategy:  {r['strategy']}")
            print()

    cooccur = compute_cooccurrence(headers, masks)
    if cooccur:
        print("-" * 64)
        print("NULL CO-OCCURRENCE (top pairs)")
        print("-" * 64)
        for pair in cooccur:
            print(f"  {pair['col_a']} + {pair['col_b']}  →  {pair['co_null_rows']} rows both null")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Analyze missing values in a CSV dataset.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--threshold", type=float, default=0.0,
                        help="Only show columns with null fraction above this (e.g. 0.05 = 5%%)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    args = parser.parse_args()

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty.", file=sys.stderr)
        sys.exit(1)

    masks = compute_null_mask(headers, rows)

    if args.format == "json":
        output = []
        for col in headers:
            mask = masks[col]
            stats = null_stats(mask)
            raw_vals = [row.get(col, "") for row in rows]
            col_type = infer_type(raw_vals)
            mechanism = classify_mechanism(col, mask, masks)
            strategy = recommend_strategy(stats["pct"], col_type)
            output.append({
                "column": col,
                "null_count": stats["count"],
                "null_pct": stats["pct"],
                "col_type": col_type,
                "mechanism": mechanism,
                "strategy": strategy,
            })
        print(json.dumps({"total_rows": len(rows), "columns": output}, indent=2))
    else:
        print_report(headers, rows, masks, args.threshold)


if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
outlier_detector.py: Multi-method outlier detection for numeric columns.

Methods:
  iqr     - Interquartile Range (robust, non-parametric, default)
  zscore  - Standard Z-score (assumes normal distribution)
  mzscore - Modified Z-score via Median Absolute Deviation (robust to skew)

Usage:
    python3 outlier_detector.py --file data.csv
    python3 outlier_detector.py --file data.csv --method iqr
    python3 outlier_detector.py --file data.csv --method zscore --threshold 2.5
    python3 outlier_detector.py --file data.csv --columns col1,col2
    python3 outlier_detector.py --file data.csv --format json
"""
# The docstring must come first to be the module docstring; the __future__
# import is still valid directly after it.
from __future__ import annotations

import argparse
import csv
import json
import math
import sys


NULL_STRINGS = {"", "null", "none", "n/a", "na", "nan", "nil", "undefined", "missing"}


def load_csv(filepath: str) -> tuple[list[str], list[dict]]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        headers = reader.fieldnames or []
    return headers, rows


def is_null(val: str | None) -> bool:
    # csv.DictReader fills short rows with None, so treat None as null too.
    return val is None or val.strip().lower() in NULL_STRINGS


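def to_float(val: str) -> float | None:
    """Parse a finite float; return None for non-numeric, NaN, or infinite input."""
    try:
        x = float(val.strip())
    except (ValueError, AttributeError):
        return None
    # "inf" and "-inf" parse as floats but would break the mean, std, and quartiles.
    return x if math.isfinite(x) else None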


def median(nums: list[float]) -> float:
    s = sorted(nums)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def percentile(nums: list[float], p: float) -> float:
    """Linear interpolation percentile."""
    s = sorted(nums)
    n = len(s)
    if n == 1:
        return s[0]
    idx = p / 100 * (n - 1)
    lo = int(idx)
    hi = lo + 1
    frac = idx - lo
    if hi >= n:
        return s[-1]
    return s[lo] + frac * (s[hi] - s[lo])


def mean(nums: list[float]) -> float:
    return sum(nums) / len(nums)


def std(nums: list[float], mu: float) -> float:
    if len(nums) < 2:
        return 0.0
    variance = sum((x - mu) ** 2 for x in nums) / (len(nums) - 1)
    return math.sqrt(variance)


# --- Detection methods ---

def detect_iqr(nums: list[float], multiplier: float = 1.5) -> dict:
    q1 = percentile(nums, 25)
    q3 = percentile(nums, 75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    outliers = [x for x in nums if x < lower or x > upper]
    return {
        "method": "IQR",
        "q1": round(q1, 4),
        "q3": round(q3, 4),
        "iqr": round(iqr, 4),
        "lower_bound": round(lower, 4),
        "upper_bound": round(upper, 4),
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def detect_zscore(nums: list[float], threshold: float = 3.0) -> dict:
    mu = mean(nums)
    sigma = std(nums, mu)
    if sigma == 0:
        return {"method": "Z-score", "outlier_count": 0, "outlier_pct": 0.0,
                "note": "Zero variance — all values identical"}
    zscores = [(x, abs((x - mu) / sigma)) for x in nums]
    outliers = [x for x, z in zscores if z > threshold]
    return {
        "method": "Z-score",
        "mean": round(mu, 4),
        "std": round(sigma, 4),
        "threshold": threshold,
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def detect_modified_zscore(nums: list[float], threshold: float = 3.5) -> dict:
    """Iglewicz-Hoaglin modified Z-score using Median Absolute Deviation."""
    med = median(nums)
    mad = median([abs(x - med) for x in nums])
    if mad == 0:
        return {"method": "Modified Z-score (MAD)", "outlier_count": 0, "outlier_pct": 0.0,
                "note": "MAD is zero — consider Z-score instead"}
    mzscores = [(x, 0.6745 * abs(x - med) / mad) for x in nums]
    outliers = [x for x, mz in mzscores if mz > threshold]
    return {
        "method": "Modified Z-score (MAD)",
        "median": round(med, 4),
        "mad": round(mad, 4),
        "threshold": threshold,
        "outlier_count": len(outliers),
        "outlier_pct": round(len(outliers) / len(nums) * 100, 2),
        "outlier_values": sorted(set(round(x, 4) for x in outliers))[:10],
    }


def classify_outlier_risk(pct: float, col: str) -> str:
    """Heuristic: flag whether outliers are likely data errors or legitimate extremes."""
    if pct > 10:
        return "High outlier rate — likely systematic data quality issue or wrong data type"
    if pct > 5:
        return "Elevated outlier rate — investigate source; may be mixed populations"
    if pct > 1:
        return "Moderate — review individually; could be legitimate extremes or entry errors"
    if pct > 0:
        return "Low — verify extreme values against source; likely legitimate but worth checking"
    return "Clean — no outliers detected"


def analyze_column(col: str, nums: list[float], method: str, threshold: float) -> dict:
    if len(nums) < 4:
        return {"column": col, "status": "Skipped — fewer than 4 numeric values"}

    if method == "iqr":
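        # main() already resolves the per-method default, so pass the value through;
        # remapping 3.0 to 1.5 here silently ignored an explicit --threshold 3.0.
        result = detect_iqr(nums, multiplier=threshold)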
    elif method == "zscore":
        result = detect_zscore(nums, threshold=threshold)
    elif method == "mzscore":
        result = detect_modified_zscore(nums, threshold=threshold)
    else:
        result = detect_iqr(nums)

    result["column"] = col
    result["total_numeric"] = len(nums)
    result["risk_assessment"] = classify_outlier_risk(result.get("outlier_pct", 0), col)
    return result


def print_report(results: list[dict]):
    print("=" * 64)
    print("OUTLIER DETECTION REPORT")
    print("=" * 64)

    clean = [r for r in results if r.get("outlier_count", 0) == 0 and "status" not in r]
    flagged = [r for r in results if r.get("outlier_count", 0) > 0]
    skipped = [r for r in results if "status" in r]

    print(f"\nColumns analyzed: {len(results) - len(skipped)}")
    print(f"Clean:   {len(clean)}")
    print(f"Flagged: {len(flagged)}")
    if skipped:
        print(f"Skipped: {len(skipped)} ({', '.join(r['column'] for r in skipped)})")

    if flagged:
        print("\n" + "-" * 64)
        print("FLAGGED COLUMNS")
        print("-" * 64)
        for r in sorted(flagged, key=lambda x: -x.get("outlier_pct", 0)):
            pct = r.get("outlier_pct", 0)
            indicator = "🔴" if pct > 5 else "🟡"
            print(f"\n  {indicator} {r['column']} ({r['method']})")
            print(f"     Outliers: {r['outlier_count']} / {r['total_numeric']} rows ({pct}%)")
            if "lower_bound" in r:
                print(f"     Bounds: [{r['lower_bound']}, {r['upper_bound']}]  |  IQR: {r['iqr']}")
            if "mean" in r:
                print(f"     Mean: {r['mean']}  |  Std: {r['std']}  |  Threshold: ±{r['threshold']}σ")
            if "median" in r:
                print(f"     Median: {r['median']}  |  MAD: {r['mad']}  |  Threshold: {r['threshold']}")
            if r.get("outlier_values"):
                vals = ", ".join(str(v) for v in r["outlier_values"][:8])
                print(f"     Sample outlier values: {vals}")
            print(f"     Assessment: {r['risk_assessment']}")

    if clean:
        cols = ", ".join(r["column"] for r in clean)
        print(f"\n🟢 Clean columns: {cols}")

    print("\n" + "=" * 64)


def main():
    parser = argparse.ArgumentParser(description="Detect outliers in numeric columns of a CSV dataset.")
    parser.add_argument("--file", required=True, help="Path to CSV file")
    parser.add_argument("--method", choices=["iqr", "zscore", "mzscore"], default="iqr",
                        help="Detection method (default: iqr)")
    parser.add_argument("--threshold", type=float, default=None,
                        help="Method threshold (IQR multiplier default 1.5; Z-score default 3.0; mzscore default 3.5)")
    parser.add_argument("--columns", help="Comma-separated columns to check (default: all numeric)")
    parser.add_argument("--format", choices=["text", "json"], default="text")
    args = parser.parse_args()

    # Set default thresholds per method
    if args.threshold is None:
        args.threshold = {"iqr": 1.5, "zscore": 3.0, "mzscore": 3.5}[args.method]

    try:
        headers, rows = load_csv(args.file)
    except FileNotFoundError:
        print(f"Error: file not found: {args.file}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)

    if not rows:
        print("Error: CSV file is empty.", file=sys.stderr)
        sys.exit(1)

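    # Strip whitespace so "--columns 'col1, col2'" matches the header names.
    selected = [c.strip() for c in args.columns.split(",") if c.strip()] if args.columns else headers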
    missing_cols = [c for c in selected if c not in headers]
    if missing_cols:
        print(f"Error: columns not found: {', '.join(missing_cols)}", file=sys.stderr)
        sys.exit(1)

    results = []
    for col in selected:
        raw = [row.get(col, "") for row in rows]
        nums = [n for v in raw if not is_null(v) and (n := to_float(v)) is not None]
        results.append(analyze_column(col, nums, args.method, args.threshold))

    if args.format == "json":
        print(json.dumps(results, indent=2))
    else:
        print_report(results)


if __name__ == "__main__":
    main()
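
The three methods listed in the script's docstring can disagree on the same column. The following is a minimal sketch, not part of outlier_detector.py, that re-implements them with Python's `statistics` module on a made-up sample. The function names and the sample values are assumptions chosen for illustration; the default thresholds (1.5, 3.0, 3.5) and the 0.6745 constant are taken from the script.

```python
import statistics


def iqr_outliers(nums, multiplier=1.5):
    # method="inclusive" interpolates at p * (n - 1), like percentile() in the script.
    q1, _, q3 = statistics.quantiles(nums, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [x for x in nums if x < lower or x > upper]


def zscore_outliers(nums, threshold=3.0):
    mu = statistics.fmean(nums)
    sigma = statistics.stdev(nums)  # sample standard deviation (divides by n - 1)
    if sigma == 0:
        return []
    return [x for x in nums if abs(x - mu) / sigma > threshold]


def mzscore_outliers(nums, threshold=3.5):
    med = statistics.median(nums)
    mad = statistics.median([abs(x - med) for x in nums])
    if mad == 0:
        return []
    return [x for x in nums if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical sample: eight tightly grouped values and one extreme value.
values = [10, 11, 12, 12, 13, 13, 14, 15, 120]

print(iqr_outliers(values))      # [120]
print(zscore_outliers(values))   # []
print(mzscore_outliers(values))  # [120]
```

The Z-score misses 120 because the extreme value inflates the mean and the standard deviation it is measured against. With a sample standard deviation, the largest possible |z| among n values is (n - 1) / sqrt(n), about 2.67 for n = 9, so a threshold of 3.0 can never fire on a sample this small. The quartile-based and median-based methods are not affected in the same way.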

Related skills

Microsoft Foundry: Deploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface. (478k, 1.3k)

Ai Research Reproduction: Orchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs. (164k, 507)

Run Train: Safely execute selected deep learning training commands with standardized evidence capture. (164k, 507)

Explore Run: Safely run isolated exploratory experiments with clear recording and conservative selection before committing changes. (164k, 507)

Paper Context Resolver: Fetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav… (141k, 507)

Repo Intake And Plan: Scan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time. (140k, 507)

How it compares

Pick data-quality-auditor over generic data profiling when missingness mechanism classification must precede imputation or model training decisions.

FAQ

What missingness types does data-quality-auditor classify?

data-quality-auditor applies Rubin's (1976) MCAR, MAR, and MNAR framework. Under MCAR, missingness is independent of both observed and unobserved values. Under MAR, it depends only on observed values. Under MNAR, it depends on unobserved values, including the missing value itself, so it needs careful handling.
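
To make the three definitions concrete, here is a small simulation sketch. It is not the skill's detection code: the synthetic age and income columns, the missingness probabilities, and the `age_gap` helper are all assumptions made up for illustration. Each rule decides whether `income` goes missing, and the helper compares the mean of an observed column (`age`) between rows with and without a missing income.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic rows: income rises with age, plus noise (all numbers are made up).
rows = []
for _ in range(20000):
    age = rng.uniform(20, 60)
    rows.append((age, 20 + 0.75 * age + rng.gauss(0, 5)))


def age_gap(p_missing):
    """Mean age where income is missing minus mean age where it is observed."""
    missing, observed = [], []
    for age, income in rows:
        bucket = missing if rng.random() < p_missing(age, income) else observed
        bucket.append(age)
    return statistics.fmean(missing) - statistics.fmean(observed)


# MCAR: the probability of a null ignores every value in the row.
gap_mcar = age_gap(lambda age, income: 0.30)
# MAR: the probability depends only on an observed column (age).
gap_mar = age_gap(lambda age, income: 0.80 if age > 45 else 0.05)
# MNAR: the probability depends on the value that ends up hidden (income).
gap_mnar = age_gap(lambda age, income: 0.80 if income > 55 else 0.05)

print(f"MCAR gap: {gap_mcar:+.2f}  MAR gap: {gap_mar:+.2f}  MNAR gap: {gap_mnar:+.2f}")
```

A gap near zero is consistent with MCAR, while a large gap rules MCAR out. The check cannot by itself separate MAR from MNAR: in this simulation both produce a clear gap, and in real data the MNAR rule depends on values that are never seen.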

When is mean imputation safe according to data-quality-auditor?

data-quality-auditor flags mean or median imputation as safe under MCAR, where rows with nulls are statistically indistinguishable from complete rows. Under MAR or MNAR, naive imputation can introduce systematic bias into training or analytics pipelines.
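
The bias under MAR can be seen in a toy example. This is a sketch with hypothetical data, not output from the skill: the `age_group` and income values are invented, and group-wise imputation is shown only as one possible alternative that conditions on an observed column.

```python
from statistics import fmean

# Hypothetical complete data: (age_group, income).
complete = [("young", 30), ("young", 32), ("young", 34), ("young", 36),
            ("old", 60), ("old", 62), ("old", 64), ("old", 66)]
true_mean = fmean(v for _, v in complete)  # 48.0

# MAR-style deletion: income is missing mostly for the "old" group,
# and the group itself is still observed. None marks a missing income.
mar = [("young", 30), ("young", 32), ("young", 34), ("young", 36),
       ("old", 60), ("old", None), ("old", None), ("old", None)]


def impute_global_mean(rows):
    """Fill every missing income with the mean of all observed incomes."""
    fill = fmean(v for _, v in rows if v is not None)
    return [fill if v is None else v for _, v in rows]


def impute_group_mean(rows):
    """Fill each missing income with the mean observed income of its own group."""
    groups = {}
    for g, v in rows:
        if v is not None:
            groups.setdefault(g, []).append(v)
    fills = {g: fmean(vs) for g, vs in groups.items()}
    return [fills[g] if v is None else v for g, v in rows]


global_mean = fmean(impute_global_mean(mar))  # about 38.4, far below 48.0
group_mean = fmean(impute_group_mean(mar))    # 46.5, much closer to 48.0

print(true_mean, round(global_mean, 2), group_mean)
```

Global mean imputation pulls the column mean toward the over-represented "young" group. Conditioning the fill value on the observed column that drives the missingness recovers most of the gap, which is why the mechanism should be classified before an imputation strategy is chosen.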

Data Science & ML, databases, analytics, pipelines
