
Data Quality Auditor
Audit missing data and imputation risk in datasets before models, pipelines, or dashboards ship.
Overview
Data Quality Auditor is an agent skill most often used in Build (also Validate, Operate) that classifies missing data mechanisms and guides safe imputation or escalation before biased analytics ship.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill data-quality-auditor
What is this skill?
- Classifies missingness using Rubin mechanisms: MCAR, MAR, and MNAR with imputation safety guidance
- Maps when mean/median imputation is safe versus when model-based conditional imputation is required
- Escalates MNAR patterns to domain owners instead of silently imputing biased values
- Detection heuristics: null rows vs non-null on other dimensions and clustered null patterns
- Theory reference companion—keep SKILL.md lean while deep concepts live in the reference doc
- 3 missingness mechanisms documented (MCAR, MAR, MNAR)
Adoption & trust: 526 installs on skills.sh; 17.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have nulls in critical columns but no clear rule for whether imputing, modeling, or stopping the pipeline is statistically safe.
Who is it for?
Solo builders preparing tabular or survey data for ML features, BI dashboards, or customer-facing reports where missing fields are common.
Skip if: You only need generic schema linting with no missingness analysis, or your datasets are already signed off by a dedicated data steward under a locked DQ policy.
When should I use this skill?
Before committing to imputation, feature engineering, or shipping metrics on datasets with substantial missing values.
What do I get? / Deliverables
You document the MCAR/MAR/MNAR classification and the chosen handling (impute, conditional model, or escalate), which reduces silent bias in downstream metrics and models.
- Missingness classification per column or field group
- Imputation vs escalate recommendation with rationale
- DQ notes suitable for a pipeline README or data contract
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Data-quality gates belong on the build shelf because solo builders fix schemas and ETL before downstream ML and product features depend on them. Backend subphase covers ingestion, validation rules, and pipeline contracts where missingness (MCAR, MAR, MNAR) is diagnosed.
Where it fits
Scope whether a prototype dataset’s null rate on signup fields is safe for a pricing experiment.
Define pipeline rules: impute MCAR sensor drops, model MAR by cohort, block MNAR income skips.
Pre-launch QA checklist for analytics events with systematic nulls on older user segments.
Investigate a sudden spike in null satisfaction scores after a deploy.
How it compares
Use instead of ad-hoc “fill nulls with zero” chat advice when statistical missingness mechanics matter.
Common Questions / FAQ
Who is data-quality-auditor for?
Indie builders and small teams auditing tabular datasets, ingestion jobs, and analytics tables before models or dashboards go live.
When should I use data-quality-auditor?
During Validate when scoping a dataset for a prototype, in Build when designing ETL or feature stores, and in Operate when production null rates spike or imputation drift is suspected.
Is data-quality-auditor safe to install?
Review the Security Audits panel on this Prism page for install risk, dependency exposure, and any automated audit results before enabling it in your agent.
SKILL.md - Data Quality Auditor
# Data Quality Concepts Reference

Deep-dive reference for the Data Quality Auditor skill. Keep SKILL.md lean; this is where the theory lives.

---

## Missingness Mechanisms (Rubin, 1976)

Understanding *why* data is missing determines how safely it can be imputed.

### MCAR: Missing Completely At Random

- The probability of missingness is independent of both observed and unobserved data.
- **Example:** A sensor drops a reading due to random hardware noise.
- **Safe to impute?** Yes. Imputing with mean/median introduces no systematic bias in the column's central tendency, although it shrinks the variance.
- **Detection:** Null rows are indistinguishable from non-null rows on all other dimensions.

### MAR: Missing At Random

- The probability of missingness depends on *observed* data, not the missing value itself.
- **Example:** Older users are less likely to fill in a "social media handle" field. Missingness depends on age (observed), not on the handle itself.
- **Safe to impute?** Conditionally yes. Impute using a model that accounts for the related observed variables.
- **Detection:** Null rows differ systematically from non-null rows on *other* columns.

### MNAR: Missing Not At Random

- The probability of missingness depends on the *missing value itself* (unobserved).
- **Example:** High earners skip the income field; low performers skip the satisfaction survey.
- **Safe to impute?** No. Imputing from the observed values will introduce systematic bias. Escalate to the domain owner.
- **Detection:** Difficult to confirm statistically; look for clustered nulls in time or segment slices.
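The detection bullets above compare null rows with non-null rows on other columns. One way to sketch that comparison is a standardized mean difference per covariate. This is an illustrative technique, not necessarily what the skill itself runs. The sketch assumes rows are plain dicts with `None` marking a null; the function names `missingness_gap` and `screen_column` and the `0.2` threshold are hypothetical choices made for this example.

```python
from statistics import mean, stdev


def missingness_gap(rows, target, covariate):
    """Standardized mean difference of `covariate` between rows where
    `target` is null and rows where it is present.

    Returns None when either group has fewer than two usable values.
    """
    null_vals = [r[covariate] for r in rows
                 if r[target] is None and r[covariate] is not None]
    seen_vals = [r[covariate] for r in rows
                 if r[target] is not None and r[covariate] is not None]
    if len(null_vals) < 2 or len(seen_vals) < 2:
        return None
    # Pool the two sample variances so the gap is expressed in std-dev units.
    pooled = ((stdev(null_vals) ** 2 + stdev(seen_vals) ** 2) / 2) ** 0.5
    diff = mean(null_vals) - mean(seen_vals)
    if pooled == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled


def screen_column(rows, target, covariates, threshold=0.2):
    """Return the covariates whose null/non-null gap exceeds `threshold`.

    The threshold is an assumed, illustrative cut-off; tune it per dataset.
    """
    flagged = {}
    for name in covariates:
        gap = missingness_gap(rows, target, name)
        if gap is not None and abs(gap) > threshold:
            flagged[name] = gap
    return flagged


# Toy data mirroring the MAR example: older users leave the handle blank.
users = [
    {"age": 22, "handle": "@a"}, {"age": 25, "handle": "@b"},
    {"age": 31, "handle": "@c"}, {"age": 28, "handle": "@d"},
    {"age": 58, "handle": None}, {"age": 63, "handle": None},
    {"age": 67, "handle": None},
]
flagged = screen_column(users, "handle", ["age"])
print(flagged)  # "age" is flagged: the null rows are much older
```

An empty result is consistent with MCAR but does not prove it, and a flagged covariate points to at least MAR without ruling out MNAR, so the escalation rule for suspected MNAR still applies.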
---

## Data Quality Score (DQS) Methodology

The DQS is a weighted composite of five ISO 8000 / DAMA-aligned dimensions:

| Dimension | Weight | Rationale |
|---|---|---|
| Completeness | 30% | Nulls are the most common and impactful quality failure |
| Consistency | 25% | Type/format violations corrupt joins and aggregations silently |
| Validity | 20% | Out-of-domain values (negative ages, future birth dates) create invisible errors |
| Uniqueness | 15% | Duplicate rows inflate metrics and invalidate joins |
| Timeliness | 10% | Stale data causes decisions based on outdated state |

**Scoring thresholds** align to production-readiness standards:

- 85–100: Ready for production use in models and dashboards
- 65–84: Usable for exploratory analysis with documented caveats
- 0–64: Unreliable; remediation required before use in any decision-making context

---

## Outlier Detection Methods

### IQR (Interquartile Range)

- **Formula:** Outlier if `x < Q1 − 1.5×IQR` or `x > Q3 + 1.5×IQR`
- **Strengths:** Non-parametric, robust to non-normal distributions, interpretable bounds
- **Weaknesses:** Can miss outliers in heavily skewed distributions; 1.5× multiplier is conventional, not universal
- **When to use:** Default choice for most business datasets (revenue, counts, durations)

### Z-score

- **Formula:** Outlier if `|x − μ| / σ > threshold` (commonly 3.0)
- **Strengths:** Simple, widely understood, easy to explain to stakeholders
- **Weaknesses:** Mean and std are themselves influenced by outliers, so the method is self-defeating for extreme contamination
- **When to use:** Only when the distribution is approximately normal and contamination is < 5%

### Modified Z-score (Iglewicz-Hoaglin)

- **Formula:** `M_i = 0.6745 × |x_i − median| / MAD`; outlier if `M_i > 3.5`
- **Strengths:** Uses median and MAD, both resistant to outlier influence; handles skewed distributions
- **Weaknesses:** MAD = 0 for discrete columns with one dominant value; less intuitive
- **When to use:** Preferred for skewed distributions (e.g. revenue, latency, page views)

---

## Imputation Strategies

| Method | When | Risk |
|---|---|---|
| Mean | MCAR, continuous, symmetric distribution | Distorts variance; don't use with skewed data |
| Median | MCAR/MAR, continuous, skewed distribution | Safe for skewed; loses variance |
| Mode | MCAR/MAR, categorical | Can over-represent one category |
| Forward-fill | Time series with MCAR/MAR gaps | Assumes value persis