Exploring Data Catalog

Catalog exploration is a build-phase integration task for wiring analytics, agents, or backends to an AWS data estate. The integrations subphase covers AWS Glue, Athena, S3 Tables, and federated Iceberg/Redshift connections as data-plane hooks.

Where it fits

Example use

Build: Backend, data & payments

Map Glue databases before connecting an agent to Athena for natural-language queries.

Example use

Document partitions and formats prior to writing ingestion or API read models.

Example use

See which registered tables actually support the MVP metrics without hand-waving data availability.

Example use

Use the recommendations section to fix missing metadata and registration gaps after launch.

How it compares

This is an AWS catalog discovery checklist for agents, not a one-off tutorial on a Glue console click path.

Common Questions / FAQ

Who is exploring-data-catalog for?

Indie builders and small teams using AWS Agent Toolkit who need a repeatable catalog reconnaissance before analytics or agent tooling.

When should I use exploring-data-catalog?

Use it during build-phase integrations with Athena and Glue, when validating which datasets support an MVP, or before operate-phase pipeline fixes to partitioning and registration.

Is exploring-data-catalog safe to install?

It implies AWS API and catalog access through your toolkit, so review the Security Audits panel on this page and grant only the minimal IAM permissions it needs.

SKILL.md - Exploring Data Catalog

# Discovery Checklist

## Output Structure

Present findings in this order:

1. Catalog Landscape: catalog count by type (Glue, S3 Tables, Redshift-federated, Remote Iceberg), connection status for federated catalogs
2. Executive Summary: total databases, total tables, primary formats, estimated volume
3. Database Inventory: organized by catalog and database with table counts
4. Unregistered Assets: S3 Tables not in Glue (not queryable via Athena), with registration instructions
5. Schema Analysis: data types, nullable fields, key patterns
6. Storage Analysis: formats, partitioning strategies, S3 locations
7. Recommendations: optimization opportunities, quality issues, missing metadata, unregistered tables to register
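A minimal sketch of how the ordering above could be enforced when assembling a report. The section titles come from the list; the `render_report` function, its `findings` argument, and the placeholder text are hypothetical names introduced for illustration.

```python
# Section order taken from the checklist; the rendering itself is an
# illustrative assumption, not part of the skill.
SECTION_ORDER = [
    "Catalog Landscape",
    "Executive Summary",
    "Database Inventory",
    "Unregistered Assets",
    "Schema Analysis",
    "Storage Analysis",
    "Recommendations",
]


def render_report(findings: dict) -> str:
    """Render {section title: [finding, ...]} as Markdown in checklist order.

    Empty sections keep an explicit placeholder so a reader can tell
    "nothing found" apart from "section skipped".
    """
    unknown = set(findings) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")

    lines = []
    for number, title in enumerate(SECTION_ORDER, start=1):
        lines.append(f"## {number}. {title}")
        items = findings.get(title) or ["No findings recorded."]
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Passing findings in any order still yields the seven sections in the fixed sequence, which keeps reports comparable across runs.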

## Column Classification

Categorize each column as one of:

- **Identifier**: Unique keys, foreign keys, entity IDs
- **Dimension**: Categorical attributes for grouping/filtering (status, type, region)
- **Metric**: Quantitative values for measurement (revenue, count, duration)
- **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
- **Text**: Free-form text fields (description, notes)
- **Boolean**: True/false flags
- **Structural**: JSON, arrays, nested structures (common in Glue tables from JSON sources)
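The categories above can be approximated from a column's name and Glue type alone. The sketch below is one possible heuristic: the regular expressions, the type tuples, and the `classify_column` function are assumptions made for illustration, and separating Dimension from Text properly would also need cardinality data.

```python
import re

# Heuristic name patterns: illustrative assumptions, not rules from the
# checklist. Tune them to the naming conventions in your catalog.
_ID_NAME = re.compile(r"(^id$|_id$|_key$|_uuid$)")
_TEMPORAL_NAME = re.compile(r"(_at$|_date$|_time$|^date$|^timestamp$)")
_METRIC_TYPES = ("int", "bigint", "smallint", "tinyint", "float", "double", "decimal")
_STRUCTURAL_PREFIXES = ("array", "map", "struct")
_TEXT_HINTS = ("description", "notes", "comment")


def classify_column(name: str, glue_type: str) -> str:
    """Return one of the seven checklist categories for a column."""
    name = name.lower()
    glue_type = glue_type.lower()

    if glue_type.startswith(_STRUCTURAL_PREFIXES):
        return "Structural"
    if glue_type == "boolean":
        return "Boolean"
    if glue_type in ("date", "timestamp") or _TEMPORAL_NAME.search(name):
        return "Temporal"
    # Identifiers are checked before metrics so numeric keys are not
    # mistaken for measurements.
    if _ID_NAME.search(name):
        return "Identifier"
    if glue_type.startswith(_METRIC_TYPES):
        return "Metric"
    # Without cardinality data, fall back on a name hint for free text.
    if any(hint in name for hint in _TEXT_HINTS):
        return "Text"
    return "Dimension"
```

Treat the result as a first pass to review by hand, especially for string columns.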

## Quality Scoring

Rate each column's completeness:

- **Complete** (>99% non-null): reliable for analysis
- **Mostly complete** (95-99%): investigate the nulls before using in calculations
- **Incomplete** (80-95%): understand why, may need imputation or filtering
- **Sparse** (<80%): likely not usable without significant cleanup
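The bands translate directly into a small function. In this sketch the function name `completeness_band` is hypothetical, and the handling of boundary values is an assumption: the listed ranges overlap at 95% and do not say where exactly 99% or 80% belong, so each boundary value is placed in the higher band that includes it.

```python
def completeness_band(non_null: int, total: int) -> str:
    """Map a column's non-null share to a checklist completeness band.

    Assumption: boundary values (99%, 95%, 80%) are resolved as
    99 -> Mostly complete, 95 -> Mostly complete, 80 -> Incomplete.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= non_null <= total:
        raise ValueError("non_null must be between 0 and total")

    pct = 100.0 * non_null / total
    if pct > 99:
        return "Complete"
    if pct >= 95:
        return "Mostly complete"
    if pct >= 80:
        return "Incomplete"
    return "Sparse"
```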

## Column Profiling (when deep-diving a table)

- For numeric columns: min, max, mean, median, p5, p95, zero count, negative count
- For string columns: min/max length, empty string count, distinct values, pattern consistency
- For date columns: min/max date, null dates, future dates (if unexpected), gap detection
- For boolean columns: true/false/null distribution
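As an illustration of the numeric profile, the sketch below computes the listed statistics over values that have already been fetched into memory. Because the skill itself is read-only and runs no queries, how the sample is obtained is out of scope here. The function names and the linear-interpolation percentile method are assumptions; other percentile definitions give slightly different p5 and p95 values.

```python
import statistics
from typing import Optional


def percentile(sorted_values: list, pct: float) -> float:
    """Linear-interpolation percentile over pre-sorted values (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (len(sorted_values) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def profile_numeric(values: list) -> dict:
    """Profile a numeric column; None entries are counted as nulls."""
    present = sorted(v for v in values if v is not None)
    if not present:
        raise ValueError("column has no non-null values")
    return {
        "null_count": len(values) - len(present),
        "min": present[0],
        "max": present[-1],
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "p5": percentile(present, 5),
        "p95": percentile(present, 95),
        "zero_count": sum(1 for v in present if v == 0),
        "negative_count": sum(1 for v in present if v < 0),
    }
```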

## What to Flag

- Tables with no partition keys on datasets larger than 1 GB
- CSV tables that should be Parquet (cost and performance)
- Databases or tables with no descriptions
- Tables with no recent data (stale/abandoned)
- Inconsistent naming conventions across databases
- Tables with high null percentages in key columns
- Columns that appear to be foreign keys (potential join targets)
- Hierarchical dimensions (country > state > city)
- Columns with suspiciously low cardinality (possible default values)
- S3 Tables not registered in Glue (exist but not queryable via Athena)
- Federated catalogs with connection errors or stale metadata
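Several of these flags can be derived from table metadata alone. The sketch below checks three of them against a dictionary shaped like an entry in the `TableList` of a Glue `get_tables` response. The `flag_table` function is hypothetical, `size_bytes` is assumed to be supplied by the caller (Glue table metadata does not reliably carry dataset size), and 1 GB is taken as 1024³ bytes.

```python
ONE_GB = 1024 ** 3  # assumption: "1GB" is read as 1 GiB

# SerDe classes that indicate delimited text (see Format Detection).
CSV_SERDES = (
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    "org.apache.hadoop.hive.serde2.OpenCSVSerde",
)


def flag_table(table: dict, size_bytes: int) -> list:
    """Return checklist flags for one Glue-style table definition."""
    flags = []

    if not table.get("PartitionKeys") and size_bytes > ONE_GB:
        flags.append("no partition keys on dataset > 1 GB")

    serde = (
        table.get("StorageDescriptor", {})
        .get("SerdeInfo", {})
        .get("SerializationLibrary", "")
    )
    if serde in CSV_SERDES:
        flags.append("CSV table; consider Parquet")

    if not (table.get("Description") or "").strip():
        flags.append("no description")

    return flags
```

The remaining flags (stale data, null rates, cardinality, join candidates) need profiling results or timestamps that are not part of the table definition.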

## Format Detection

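Map SerDe libraries to human-readable format names. The Parquet and ORC entries are package prefixes rather than full class names, so match them by prefix: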

- `org.apache.hadoop.hive.ql.io.parquet` = Parquet
- `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` = CSV/TSV
- `org.openx.data.jsonserde.JsonSerDe` = JSON
- `org.apache.hadoop.hive.serde2.OpenCSVSerde` = CSV
- `org.apache.hadoop.hive.ql.io.orc` = ORC
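A minimal sketch of this mapping as a lookup. The table mirrors the list above; the `detect_format` function and the `"Unknown"` fallback are assumptions added for illustration. Prefix matching is used because the Parquet and ORC entries are package names, while the full SerDe class names are longer.

```python
from typing import Optional

# (prefix, format name) pairs copied from the checklist.
SERDE_FORMATS = [
    ("org.apache.hadoop.hive.ql.io.parquet", "Parquet"),
    ("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", "CSV/TSV"),
    ("org.openx.data.jsonserde.JsonSerDe", "JSON"),
    ("org.apache.hadoop.hive.serde2.OpenCSVSerde", "CSV"),
    ("org.apache.hadoop.hive.ql.io.orc", "ORC"),
]


def detect_format(serialization_library: Optional[str]) -> str:
    """Return a readable format name for a SerDe library string."""
    if not serialization_library:
        return "Unknown"
    for prefix, name in SERDE_FORMATS:
        if serialization_library.startswith(prefix):
            return name
    # Assumption: anything unlisted is reported rather than guessed.
    return "Unknown"
```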


---
name: exploring-data-catalog
description: >-
  Full inventory and audit of AWS Glue Data Catalog assets across S3 Tables, Redshift-federated,
  and remote Iceberg catalogs. Triggers on: inventory the catalog, audit databases,
  list all tables, catalog overview, data landscape, enumerate catalogs, data inventory,
  search the catalog. Do NOT use for finding specific data (use finding-data-lake-assets),
  running queries (use querying-data-lake), or creating tables (use creating-data-lake-table).
version: 1
argument-hint: '[search-term|catalog-name|database-name|s3://bucket-path|table-name]'
---

Structured inventory and cataloging across your AWS data landscape: Glue Data Catalog with S3 Tables, Redshift-federated, and remote Iceberg catalogs.

## Overview

Maps the data in an AWS account. Starts with the catalog landscape (Glue, S3 Tables, federated), then drills into databases and tables. The skill is read-only and executes no queries.
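The drill-down from databases to tables can be sketched with read-only Glue calls. This is an illustration under stated assumptions, not the skill's actual implementation: `inventory_glue` is a hypothetical helper, `glue_client` is assumed to be a boto3 Glue client (for example `boto3.client("glue", region_name=...)`), and the walk covers only the account's default Glue catalog. S3 Tables and federated catalogs would need additional calls that are not shown.

```python
def inventory_glue(glue_client) -> dict:
    """Return {database name: [table names]} using read-only Glue calls.

    Uses the boto3 paginators for get_databases and get_tables, so large
    catalogs are walked page by page.
    """
    inventory = {}
    database_pages = glue_client.get_paginator("get_databases").paginate()
    for database_page in database_pages:
        for database in database_page["DatabaseList"]:
            db_name = database["Name"]
            inventory[db_name] = []
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=db_name
            )
            for table_page in table_pages:
                inventory[db_name].extend(
                    table["Name"] for table in table_page["TableList"]
                )
    return inventory
```

Taking the client as a parameter keeps region selection with the caller, which fits the requirement to ask for the target region upfront.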

**Constraints for parameter acquisition:**

- You MUST ask for the target AWS region upfront if not provided
- You MUST support a sing

What is this skill?

7-part output order: landscape, executive summary, inventory, unregistered assets, schema, storage, recommendations

Catalog types: Glue, S3 Tables, Redshift-federated, Remote Iceberg with connection status

Column classification: identifier, dimension, metric, temporal, text, boolean, structural

Quality scoring bands: Complete (>99%), Mostly complete (95–99%), Incomplete (80–95%), Sparse (<80%)

Flags S3 Tables not registered in Glue with registration guidance for Athena queryability

7-section ordered output structure

4 quality scoring tiers for column completeness

7 column classification categories

Compatible agents: Claude Code, Codex, Cursor, any compatible agent

Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
