
Exploring Data Catalog
Discover and inventory AWS data catalogs (Glue, S3 Tables, federated sources) with a fixed report structure before querying or modeling in Athena.
Overview
Exploring Data Catalog is an agent skill, most often used in the Build phase (and also in Validate scope), that inventories AWS Glue and related catalogs and presents schema, storage, and quality analysis in a 7-section report.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill exploring-data-catalog
What is this skill?
- 7-part output order: landscape, executive summary, inventory, unregistered assets, schema, storage, recommendations
- Catalog types: Glue, S3 Tables, Redshift-federated, Remote Iceberg with connection status
- Column classification: identifier, dimension, metric, temporal, text, boolean, structural
- Quality scoring bands: Complete (>99%), Mostly complete (95–99%), Incomplete (80–95%), Sparse (<80%)
- Flags S3 Tables not registered in Glue with registration guidance for Athena queryability
- 7-section ordered output structure
- 4 quality scoring tiers for column completeness
- 7 column classification categories
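The completeness bands can be sketched as a small Python helper. This is an illustration only, not code shipped with the skill: the function name `completeness_band` is hypothetical, and the handling of exact boundaries (99% and 95% fall into "Mostly complete", 80% into "Incomplete") is an assumption, since the skill states the ranges without saying which side each boundary belongs to. The "Sparse" band (<80%) comes from the Quality Scoring section of SKILL.md.

```python
def completeness_band(non_null: int, total: int) -> str:
    """Map a column's non-null share to one of the skill's quality bands.

    Boundary handling is an assumption: the skill lists ">99%", "95-99%",
    "80-95%" and "<80%" without specifying inclusivity.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * non_null / total
    if pct > 99:
        return "Complete"          # reliable for analysis
    if pct >= 95:
        return "Mostly complete"   # investigate the nulls before calculations
    if pct >= 80:
        return "Incomplete"        # may need imputation or filtering
    return "Sparse"                # likely unusable without cleanup


if __name__ == "__main__":
    # Hypothetical column with 970 non-null values out of 1,000 rows.
    print(completeness_band(970, 1000))
```

The counts would typically come from a query such as `COUNT(col)` versus `COUNT(*)`, which sits outside this read-only inventory skill.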
Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have data scattered across Glue, S3 Tables, and federated catalogs but no ordered inventory or queryability map for Athena and downstream apps.
Who is it for?
Solo builders wiring Claude/Cursor agents or Athena SQL to an AWS account they did not originally design.
Skip if: Purely local SQLite apps or teams with a single well-documented warehouse and no multi-catalog sprawl.
When should I use this skill?
Discovering AWS data catalogs, Glue/S3 Tables inventory, or preparing Athena-queryable metadata analysis.
What do I get? / Deliverables
You receive a structured catalog report with unregistered asset fixes, column classifications, completeness scores, and optimization recommendations.
- Catalog landscape and database inventory report
- Unregistered asset list with registration steps
- Schema, storage, and recommendation sections
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Catalog exploration is a build-phase integration task when wiring analytics, agents, or backends to an AWS data estate. The Integrations subphase covers AWS Glue, Athena, S3 Tables, and federated Iceberg/Redshift connections as data plane hooks.
Where it fits
Map Glue databases before connecting an agent to Athena for natural-language queries.
Document partitions and formats prior to writing ingestion or API read models.
See which registered tables actually support the MVP metrics without hand-waving data availability.
Use recommendations section to fix missing metadata and registration gaps after launch.
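As a rough illustration of the first two use cases, the sketch below counts tables per database and storage format using only read-only Glue list calls. It is not the skill's own implementation. The function names `detect_format` and `inventory_formats` are hypothetical, the SerDe prefixes are copied from the Format Detection section of SKILL.md, and the client is assumed to behave like a boto3 Glue client (its `get_databases` and `get_tables` paginators returning `DatabaseList` and `TableList`).

```python
from collections import Counter

# SerDe prefixes and labels as listed in SKILL.md's Format Detection section.
SERDE_FORMATS = {
    "org.apache.hadoop.hive.ql.io.parquet": "Parquet",
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe": "CSV/TSV",
    "org.openx.data.jsonserde.JsonSerDe": "JSON",
    "org.apache.hadoop.hive.serde2.OpenCSVSerde": "CSV",
    "org.apache.hadoop.hive.ql.io.orc": "ORC",
}


def detect_format(serde_library):
    """Return a readable format name for a SerDe class, or 'Unknown'."""
    if serde_library:
        for prefix, name in SERDE_FORMATS.items():
            if serde_library.startswith(prefix):
                return name
    return "Unknown"


def inventory_formats(glue_client):
    """Count tables per (database, format) via paginated read-only calls."""
    counts = Counter()
    databases = glue_client.get_paginator("get_databases")
    tables = glue_client.get_paginator("get_tables")
    for db_page in databases.paginate():
        for db in db_page["DatabaseList"]:
            for table_page in tables.paginate(DatabaseName=db["Name"]):
                for table in table_page["TableList"]:
                    # Some tables carry no SerDe info, so every lookup is defensive.
                    serde = (
                        table.get("StorageDescriptor", {})
                        .get("SerdeInfo", {})
                        .get("SerializationLibrary")
                    )
                    counts[(db["Name"], detect_format(serde))] += 1
    return counts
```

With boto3 installed and credentials configured, this could be called as `inventory_formats(boto3.client("glue", region_name=...))`, with the region supplied by the user. Passing the client in as a parameter keeps the function testable against a stub and makes it easy to scope the IAM role to Glue read permissions only.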
How it compares
AWS catalog discovery checklist for agents, not a one-off Glue console click path tutorial.
Common Questions / FAQ
Who is exploring-data-catalog for?
Indie builders and small teams using AWS Agent Toolkit who need a repeatable catalog reconnaissance before analytics or agent tooling.
When should I use exploring-data-catalog?
During build integrations with Athena/Glue, when validating which datasets support an MVP, or before operate-phase pipeline fixes on partitioning and registration.
Is exploring-data-catalog safe to install?
It implies AWS API/catalog access via your toolkit; review the Security Audits panel on this page and scope IAM minimally.
SKILL.md - Exploring Data Catalog
# Discovery Checklist

## Output Structure

Present findings in this order:

1. Catalog Landscape: catalog count by type (Glue, S3 Tables, Redshift-federated, Remote Iceberg), connection status for federated catalogs
2. Executive Summary: total databases, total tables, primary formats, estimated volume
3. Database Inventory: organized by catalog and database with table counts
4. Unregistered Assets: S3 Tables not in Glue (not queryable via Athena), with registration instructions
5. Schema Analysis: data types, nullable fields, key patterns
6. Storage Analysis: formats, partitioning strategies, S3 locations
7. Recommendations: optimization opportunities, quality issues, missing metadata, unregistered tables to register

## Column Classification

Categorize each column as one of:

- **Identifier**: Unique keys, foreign keys, entity IDs
- **Dimension**: Categorical attributes for grouping/filtering (status, type, region)
- **Metric**: Quantitative values for measurement (revenue, count, duration)
- **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
- **Text**: Free-form text fields (description, notes)
- **Boolean**: True/false flags
- **Structural**: JSON, arrays, nested structures (common in Glue tables from JSON sources)

## Quality Scoring

Rate each column's completeness:

- **Complete** (>99% non-null): reliable for analysis
- **Mostly complete** (95-99%): investigate the nulls before using in calculations
- **Incomplete** (80-95%): understand why, may need imputation or filtering
- **Sparse** (<80%): likely not usable without significant cleanup

## Column Profiling (when deep-diving a table)

For numeric columns: min, max, mean, median, p5, p95, zero count, negative count

For string columns: min/max length, empty string count, distinct values, pattern consistency

For date columns: min/max date, null dates, future dates (if unexpected), gap detection

For boolean columns: true/false/null distribution

## What to Flag

- Tables with no partition keys on datasets > 1GB
- CSV tables that should be Parquet (cost and performance)
- Databases or tables with no descriptions
- Tables with no recent data (stale/abandoned)
- Inconsistent naming conventions across databases
- Tables with high null percentages in key columns
- Columns that appear to be foreign keys (potential join targets)
- Hierarchical dimensions (country > state > city)
- Columns with suspiciously low cardinality (possible default values)
- S3 Tables not registered in Glue (exist but not queryable via Athena)
- Federated catalogs with connection errors or stale metadata

## Format Detection

Map SerDe libraries to human-readable format names:

- `org.apache.hadoop.hive.ql.io.parquet` = Parquet
- `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` = CSV/TSV
- `org.openx.data.jsonserde.JsonSerDe` = JSON
- `org.apache.hadoop.hive.serde2.OpenCSVSerde` = CSV
- `org.apache.hadoop.hive.ql.io.orc` = ORC

---
name: exploring-data-catalog
description: >-
  Full inventory and audit of AWS Glue Data Catalog assets across S3 Tables,
  Redshift-federated, and remote Iceberg catalogs. Triggers on: inventory the
  catalog, audit databases, list all tables, catalog overview, data landscape,
  enumerate catalogs, data inventory, search the catalog. Do NOT use for
  finding specific data (use finding-data-lake-assets), running queries (use
  querying-data-lake), or creating tables (use creating-data-lake-table).
version: 1
argument-hint: '[search-term|catalog-name|database-name|s3://bucket-path|table-name]'
---

Structured inventory and cataloging across your AWS data landscape: Glue Data Catalog with S3 Tables, Redshift-federated, and remote Iceberg catalogs.

## Overview

Maps data in an AWS account. Starts with catalog landscape (Glue, S3 Tables, federated), then drills into databases and tables. Read-only: no query execution.

**Constraints for parameter acquisition:**

- You MUST ask for the target AWS region upfront if not provided
- You MUST support a sing