
Explore Data
Profile a new warehouse table or uploaded file so you know shape, nulls, duplicates, and which metrics to trust before building dashboards or models.
Overview
Explore-data is an agent skill that profiles datasets (columns, nulls, distributions, and quality issues) before deeper analysis. It is most often used in Validate, and also fits Grow (analytics) and Build (integrations).
Install
npx skills add https://github.com/anthropics/knowledge-work-plugins --skill explore-data
What is this skill?
- Resolves table names and schema prefixes via warehouse MCP or reads CSV, Excel, Parquet, JSON
- Table-level profiling: row/column counts, types, and structure-before-analysis workflow
- Data quality checks: null rates, distributions, duplicates, suspicious values
- Fallback path when no connector: user upload or guided profiling queries from described schema
- Argument-hint driven entry: /explore-data <table or file>
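As a rough illustration of the per-column checks listed above, the sketch below computes null rate, distinct count, cardinality ratio, and top values for rows held as plain dictionaries. It is a minimal sketch, not the skill's implementation: the `profile_column` helper, the sample rows, and the choice to count `None` and blank strings as null are all assumptions made for the example.

```python
from collections import Counter


def profile_column(rows, column, top_n=5):
    """Profile one column of a list-of-dicts dataset (hypothetical helper)."""
    values = [row.get(column) for row in rows]
    total = len(values)
    # Assumption: None and empty/whitespace-only strings both count as null.
    non_null = [
        v for v in values
        if v is not None and not (isinstance(v, str) and v.strip() == "")
    ]
    null_count = total - len(non_null)
    counts = Counter(non_null)
    return {
        "column": column,
        "rows": total,
        "null_count": null_count,
        "null_rate": null_count / total if total else 0.0,
        "distinct": len(counts),
        # Cardinality ratio as distinct / total rows.
        "cardinality_ratio": len(counts) / total if total else 0.0,
        "top_values": counts.most_common(top_n),
    }


# Made-up sample data, for illustration only.
sample_rows = [
    {"user_id": 1, "region": "EU"},
    {"user_id": 2, "region": "US"},
    {"user_id": 3, "region": ""},
    {"user_id": 4, "region": "EU"},
]
region_profile = profile_column(sample_rows, "region")
print(region_profile)
```

For a warehouse table the same figures would normally come from aggregate queries rather than rows loaded into memory; the dictionary form here only shows what each figure means.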
Adoption & trust: 3.1k installs on skills.sh; 19.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a new table or file but do not know row counts, null rates, or whether the columns are safe to use in metrics.
Who is it for?
Founders and indies connecting a warehouse MCP or dropping ad-hoc exports who need a repeatable profiling ritual.
Skip if: you only need production ETL pipeline design, or you already have certified dbt docs and monitored data contracts with no new sources.
When should I use this skill?
Encountering a new table or file, checking null rates and column distributions, spotting duplicates or suspicious values, or deciding which dimensions and metrics to analyze.
What do I get? / Deliverables
You get a structured data profile and clearer choices on dimensions, metrics, and fixes for duplicates or bad values before modeling or shipping reports.
- Comprehensive data profile summary
- Data quality findings (nulls, duplicates, suspicious values)
- Guidance on viable dimensions and metrics
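To make the data quality findings more concrete, here is a hedged sketch of three of the checks: null-rate severity, placeholder values, and duplicate keys. The thresholds (over 5% warns, over 20% alerts) and the placeholder strings come from the SKILL.md text on this page; the function names, the case-insensitive matching, and the sample orders are assumptions for illustration.

```python
from collections import Counter

# Thresholds as stated in SKILL.md: >5% nulls is a warning, >20% an alert.
WARN_NULL_RATE = 0.05
ALERT_NULL_RATE = 0.20
# Placeholder strings named in SKILL.md; lowercase for case-insensitive
# matching (the case-insensitivity is an assumption of this sketch).
PLACEHOLDERS = {"n/a", "tbd", "test", "999999"}


def null_severity(null_rate):
    """Map a null rate (0.0 to 1.0) to 'ok', 'warn', or 'alert'."""
    if null_rate > ALERT_NULL_RATE:
        return "alert"
    if null_rate > WARN_NULL_RATE:
        return "warn"
    return "ok"


def find_placeholders(values):
    """Return string values that look like placeholders rather than data."""
    return [
        v for v in values
        if isinstance(v, str) and v.strip().lower() in PLACEHOLDERS
    ]


def duplicate_keys(rows, key_columns):
    """Return key tuples that appear more than once, with their counts."""
    counts = Counter(tuple(row.get(c) for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample data, for illustration only.
orders = [
    {"order_id": "A1", "status": "paid"},
    {"order_id": "A2", "status": "TBD"},
    {"order_id": "A2", "status": "paid"},
]
severity = null_severity(0.12)
placeholders = find_placeholders([row["status"] for row in orders])
duplicates = duplicate_keys(orders, ["order_id"])
print(severity, placeholders, duplicates)
```

A finding such as a repeated `order_id` does not by itself prove bad data; it may mean the grain of the table is not one row per order, which is worth confirming before choosing metrics.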
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Validate is the canonical shelf because exploration answers 'what can we analyze?' before you commit build scope. Scope fits deciding dimensions, metrics, and data-quality gates from an initial profile.
Where it fits
Profile a Stripe export to see which revenue fields are nullable before defining MVP metrics.
After connecting warehouse MCP, profile the events table you plan to sync into the app.
Re-profile a lifecycle table after a marketing campaign changes event volume.
How it compares
Use it instead of guessing column meaning in chat. This is a profiling workflow, not a visualization or ML training skill.
Common Questions / FAQ
Who is explore-data for?
Solo builders and analysts using agent tools with warehouse MCP or file uploads who need fast, thorough dataset profiling.
When should I use explore-data?
Use it in Validate when scoping a new data source, in Build when wiring integrations, and in Grow when onboarding a lifecycle or content analytics table.
Is explore-data safe to install?
Profiling may query live warehouses or read local files; review the Security Audits panel on this Prism page and limit MCP credentials to least privilege.
SKILL.md
# /explore-data - Profile and Explore a Dataset

> If you see unfamiliar placeholders or need to check which tools are connected, see [CONNECTORS.md](../../CONNECTORS.md).

Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.

## Usage

```
/explore-data <table_name or file>
```

## Workflow

### 1. Access the Data

**If a data warehouse MCP server is connected:**

1. Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
2. Query table metadata: column names, types, descriptions if available
3. Run profiling queries against the live data

**If a file is provided (CSV, Excel, Parquet, JSON):**

1. Read the file and load into a working dataset
2. Infer column types from the data

**If neither:**

1. Ask the user to provide a table name (with their warehouse connected) or upload a file
2. If they describe a table schema, provide guidance on what profiling queries to run

### 2. Understand Structure

Before analyzing any data, understand its structure:

**Table-level questions:**

- How many rows and columns?
- What is the grain (one row per what)?
- What is the primary key? Is it unique?
- When was the data last updated?
- How far back does the data go?

**Column classification** - categorize each column as one of:

- **Identifier**: Unique keys, foreign keys, entity IDs
- **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category)
- **Metric**: Quantitative values for measurement (revenue, count, duration, score)
- **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
- **Text**: Free-form text fields (description, notes, name)
- **Boolean**: True/false flags
- **Structural**: JSON, arrays, nested structures

### 3. Generate Data Profile

Run the following profiling checks:

**Table-level metrics:**

- Total row count
- Column count and types breakdown
- Approximate table size (if available from metadata)
- Date range coverage (min/max of date columns)

**All columns:**

- Null count and null rate
- Distinct count and cardinality ratio (distinct / total)
- Most common values (top 5-10 with frequencies)
- Least common values (bottom 5 to spot anomalies)

**Numeric columns (metrics):**

```
min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
```

**String columns (dimensions, text):**

```
min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
```

**Date/timestamp columns:**

```
min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
```

**Boolean columns:**

```
true count, false count, null count
true rate
```

**Present the profile as a clean summary table**, grouped by column type (dimensions, metrics, dates, IDs).

### 4. Identify Data Quality Issues

Apply the quality assessment framework below. Flag potential problems:

- **High null rates**: Columns with >5% nulls (warn), >20% nulls (alert)
- **Low cardinality surprises**: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
- **High cardinality surprises**: Columns that should be categorical but have too many distinct values
- **Suspicious values**: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
- **Duplicate detection**: Check if there's a natural key and whether it