Finding Data Lake Assets

Name: Finding Data Lake Assets
Author: aws

aws/agent-toolkit-for-aws

Resolve vague table names, keywords, column hints, or S3 paths to concrete Glue, S3 Tables, lakehouse, or Redshift assets in a chosen AWS region.

Overview

finding-data-lake-assets is an agent skill, used mainly in the Build phase and also in Operate, that locates Glue, S3, S3 Tables, lakehouse, and Redshift assets from names, keywords, columns, or S3 paths.

Install

npx skills add https://github.com/aws/agent-toolkit-for-aws --skill finding-data-lake-assets

What is this skill?

Resolves table names, keywords, column names, or s3:// paths across Glue, S3, S3 Tables, and Redshift
Optimized for low token usage—fast answers without full catalog audit workflows
Requires target AWS region and confirmation when input is ambiguous before searching
Explicitly excludes full catalog audits, query execution, and table creation (points to sibling skills)
Intended to run via AWS MCP server tools when connected for validation and audit logging
Covers four AWS surfaces: Glue Data Catalog, S3, S3 Tables, and Redshift

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

You know a table nickname, column, or S3 path but not which Glue, lakehouse, or Redshift asset is canonical in your AWS account.

Who is it for?

Indie builders and agents orchestrating AWS data lake work who need quick asset resolution before querying or modeling.

Skip if: you need a full data catalog audit, want to run SQL or analytics queries, or are creating new lake tables. Use the dedicated sibling skills for those instead.

When should I use this skill?

Use it for requests such as "find the table", "where is our data", "which table has", "locate dataset", "search catalog", "Redshift/lakehouse table", or a reverse lookup from an S3 path. It is not meant for full audits, queries, or table creation.

What do I get? / Deliverables

You get concrete catalog entries and asset pointers in the right region so downstream query or pipeline skills can run against the correct dataset.

Resolved catalog entry or asset reference
Clarified match when input was ambiguous
Pointer suitable for follow-on query or pipeline skills

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Build: Integrations & version control

Canonical shelf is Build integrations because pipelines and apps need catalog resolution before queries, transforms, or agent tools can target the right tables. Integrations subphase fits cross-service AWS discovery (Glue Data Catalog, S3, S3 Tables, Redshift) rather than one-off SQL authoring.

Also useful

Operate: Infrastructure & cost

Where it fits

Build: Integrations & version control
Example use: Wire an agent tool that must resolve a user-mentioned table before calling querying-data-lake.

Build: Backend, data & payments
Example use: Confirm which Glue database holds events before adding a new ETL job ARN.

Operate: Infrastructure & cost
Example use: Trace an s3:// prefix back to catalog metadata during an incident or access review.

Grow: Analytics & insights
Example use: Find the canonical warehouse table behind a dashboard metric name a stakeholder cited.

How it compares

Lightweight resolver across Glue/S3/Redshift—not a full data exploration or query runner.

Common Questions / FAQ

Who is finding-data-lake-assets for?

Solo builders and agent workflows on AWS that must turn fuzzy data references into real catalog objects before other toolkit skills run.

When should I use finding-data-lake-assets?

In Build when integrating lake-aware features or agents; in Operate when locating production tables, buckets, or warehouse objects from paths or keywords.

Is finding-data-lake-assets safe to install?

It can trigger AWS discovery APIs—review Security Audits on this page, limit IAM scope, and prefer MCP-sandboxed execution when available.

Workflow Chain

Then invoke: querying-data-lake

SKILL.md

SKILL.md - Finding Data Lake Assets

# Find Data Lake Assets

## Overview

Resolves data lake asset references to concrete catalog entries. Acts as a
resolver for other skills and direct user requests. Covers Glue,
S3, S3 Tables, and Redshift. Optimized for low token usage — return the
answer fast and get out of the way.

**Constraints for parameter acquisition:**

- You MUST accept a single argument: table name, keyword, column name, or S3 path
- You MUST accept the argument as direct input or a pointer to a file containing the spec
- You MUST ask for the target AWS region if not already set
- You MUST confirm ambiguous input before searching (e.g., "Did you mean table X or bucket Y?")
- You MUST respect the user's decision to abort at any step

## Common Tasks

You MUST execute commands using AWS MCP server tools when connected — they
provide validation, sandboxed execution, and audit logging. Fall back to
AWS CLI only if MCP is unavailable. You MUST explain each step before
executing.

### 1. Verify Dependencies

Check for required tools and AWS access before searching.

**Constraints:**

- You MUST verify AWS MCP server tools (`aws___call_aws`) are available; fall back to AWS CLI if not
- You MUST confirm credentials with `aws sts get-caller-identity`
- You MUST inform the user about any missing tools and ask whether to proceed
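
The dependency check above can be sketched as a small helper. This is only an illustration under stated assumptions: `check_dependencies`, `caller_identity_command`, and the `mcp_tool_available` flag are hypothetical names, and how an agent actually detects the `aws___call_aws` tool depends on its runtime.

```python
import shutil
from typing import Dict, List, Optional


def check_dependencies(mcp_tool_available: bool) -> Dict[str, object]:
    """Pick an execution path (MCP first, CLI as fallback) and list missing tools."""
    cli_path: Optional[str] = shutil.which("aws")  # None if the AWS CLI is not on PATH

    if mcp_tool_available:
        runner: Optional[str] = "mcp"
    elif cli_path is not None:
        runner = "cli"
    else:
        runner = None  # nothing usable: the user must decide whether to proceed

    missing: List[str] = []
    if not mcp_tool_available:
        missing.append("aws___call_aws (AWS MCP server tool)")
    if cli_path is None:
        missing.append("aws (AWS CLI)")
    return {"runner": runner, "missing": missing}


def caller_identity_command(region: str) -> List[str]:
    """Build the credential check as an argument list for the CLI fallback."""
    return ["aws", "sts", "get-caller-identity", "--region", region, "--output", "json"]
```

Anything reported in `missing` is what the agent would surface to the user before asking whether to continue.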

### 2. Classify the Request

Determine the mode:

- **Resolve** (most common): User/skill references something specific.
  Signals: possessive/definite articles ("our X table", "the Y
  dataset") imply the asset exists. Goal: find it, return the
  reference, done.
- **Search**: User is exploring. Signals: "find tables with", "what
  has customer_id". Goal: rank candidates, present top matches.

You SHOULD default to Resolve mode when ambiguous.
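
As a rough sketch of this classification, the following treats the two Search phrases quoted above as the only Search signals and falls back to Resolve otherwise. The function name and the regular expression are illustrative assumptions, not part of the skill.

```python
import re

# Search signals taken from the examples in the text; a real agent would
# likely recognize more phrasings than these two.
SEARCH_SIGNALS = re.compile(r"\b(find tables with|what has)\b", re.IGNORECASE)


def classify_request(text: str) -> str:
    """Return "search" for exploratory requests, otherwise "resolve" (the default)."""
    if SEARCH_SIGNALS.search(text):
        return "search"
    # Possessive or definite references ("our X table", "the Y dataset") and
    # anything ambiguous fall through to Resolve mode.
    return "resolve"
```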

### 3. Extract Search Terms

Parse the request into search dimensions:

- **Name terms**: Table or database names mentioned
- **Domain terms**: Business concepts (billing, orders, churn)
- **Column terms**: Specific column names (customer_id, event_type)
- **Location terms**: S3 paths, bucket names, prefixes
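
One possible way to split a request into these four dimensions is shown below. It is a heuristic sketch: the regular expressions, the stop-word list, and the tiny domain vocabulary (taken from the examples above) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

S3_URI = re.compile(r"s3://[^\s\"']+")
# A word directly followed by "table", "database", or "dataset" is treated as a name.
NAME_BEFORE_NOUN = re.compile(r"\b([a-z][a-z0-9_]*)\s+(?:table|database|dataset)\b")
# snake_case tokens with at least one underscore are treated as column names.
SNAKE_CASE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")
STOP_WORDS = {"a", "the", "our", "this", "that", "which", "what"}
DOMAIN_VOCAB = {"billing", "orders", "churn"}  # illustrative only


@dataclass
class SearchTerms:
    name_terms: List[str]
    domain_terms: List[str]
    column_terms: List[str]
    location_terms: List[str]


def extract_terms(text: str) -> SearchTerms:
    """Parse a free-text request into name, domain, column, and location terms."""
    locations = S3_URI.findall(text)
    remainder = S3_URI.sub(" ", text).lower()  # keep paths out of the word matching
    names = [n for n in NAME_BEFORE_NOUN.findall(remainder) if n not in STOP_WORDS]
    columns = [c for c in SNAKE_CASE.findall(remainder) if c not in names]
    domains = [
        w for w in re.findall(r"[a-z]+", remainder)
        if w in DOMAIN_VOCAB and w not in names
    ]
    return SearchTerms(names, domains, columns, locations)
```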

### 4. Layered Search (stop early)

Search sources in order. Stop at the first layer that returns a
high-confidence match. Do NOT search all layers every time.

You MUST track which layers were searched and which were skipped.
Report this in the output (see Step 6).
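
The stop-early behavior and the searched/skipped bookkeeping can be sketched as follows. The layer callables, the `confidence` field, and the 0.8 threshold are assumptions for illustration; the text does not define how confidence is scored.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Layer = Tuple[str, Callable[[str], List[dict]]]


def layered_search(term: str, layers: Sequence[Layer], threshold: float = 0.8) -> Dict[str, list]:
    """Query layers in order and stop at the first one with a high-confidence hit."""
    searched: List[str] = []
    skipped: List[str] = []
    matches: List[dict] = []
    for index, (name, search) in enumerate(layers):
        searched.append(name)
        hits = [h for h in search(term) if h.get("confidence", 0.0) >= threshold]
        if hits:
            matches = hits
            # Every layer after this one is deliberately not queried.
            skipped = [later_name for later_name, _ in layers[index + 1:]]
            break
    return {"matches": matches, "searched": searched, "skipped": skipped}
```

The `searched` and `skipped` lists are what the output step would report back to the caller.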

**Layer 1: Glue Data Catalog** (always start here)

You SHOULD use `SearchTables` as the primary API — it searches table
names, column names, and column comments across the entire catalog in
one call. You MUST NOT loop over databases with `get-tables` unless
you already know the database name. See
[search-strategy.md](references/search-strategy.md) for patterns.

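```bash
# Preferred: one catalog-wide search over table metadata.
aws glue search-tables --search-text "orders"
# Only when the database name is already known: filter its tables by regex.
aws glue get-tables --database-name sales --expression "order.*"
```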
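
For agents that call the API through an SDK rather than the CLI, the same catalog-wide search might look like the sketch below. It assumes a boto3-style Glue client (for example `boto3.client("glue", region_name=...)`) is passed in; `search_glue_tables` and the `max_pages` cap are hypothetical, the cap being one way to keep token usage low.

```python
from typing import Any, Dict, List, Optional


def search_glue_tables(glue_client: Any, text: str, max_pages: int = 3) -> List[Dict[str, Optional[str]]]:
    """Run SearchTables and flatten the results to database, table, and location."""
    results: List[Dict[str, Optional[str]]] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        kwargs: Dict[str, Any] = {"SearchText": text}
        if token:
            kwargs["NextToken"] = token
        page = glue_client.search_tables(**kwargs)
        for table in page.get("TableList", []):
            results.append({
                "database": table.get("DatabaseName"),
                "table": table.get("Name"),
                # Location is absent for tables without a storage descriptor.
                "location": table.get("StorageDescriptor", {}).get("Location"),
            })
        token = page.get("NextToken")
        if not token:
            break  # no further pages
    return results
```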

**Layer 2: S3 Reverse Lookup** (S3 path provided)

When a user provides an S3 path, you SHOULD default to reverse lookup first —
they usually want the Glue table, not the file contents.

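```bash
# Reverse lookup first: search the catalog for a keyword taken from the path.
aws glue search-tables --search-text "<path-keyword>"
# Otherwise, inspect the objects under the prefix directly.
aws s3api list-objects-v2 --bucket <bucket-name> --prefix <prefix>
```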
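
A reverse lookup ultimately compares the user's path with each candidate table's storage location. The sketch below assumes the candidates are dictionaries with a `location` key (a hypothetical shape) and treats a table as a match when one path is a prefix of the other.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def tables_for_path(uri: str, tables: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Return the tables whose location contains, or is contained by, the given path."""
    bucket, prefix = split_s3_uri(uri)
    # Normalize to a trailing slash so "orders" does not match "orders_archive".
    target = f"s3://{bucket}/{prefix}".rstrip("/") + "/"
    matches = []
    for table in tables:
        location = (table.get("location") or "").rstrip("/") + "/"
        if location == "/":
            continue  # table has no storage location
        if target.startswith(location) or location.startswith(target):
            matches.append(table)
    return matches
```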

**Layer 3: Redshift Catalog** (if user mentions Redshift, warehouse, or lakehouse)

```sql
SELECT schema_name, table_name, table_type
FROM svv_all_tables
WHERE table_name ILIKE '%orders%';
```
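
When the keyword comes from user input, it is safer to bind it as a parameter than to splice it into the SQL string. The sketch below builds arguments in the shape the Redshift Data API `ExecuteStatement` call accepts (`Sql` plus named `Parameters`); the helper name is hypothetical, and cluster or workgroup identifiers are left to the caller.

```python
from typing import Any, Dict


def build_redshift_lookup(keyword: str) -> Dict[str, Any]:
    """Build Sql and Parameters for a table-name lookup in svv_all_tables."""
    sql = (
        "SELECT schema_name, table_name, table_type "
        "FROM svv_all_tables "
        "WHERE table_name ILIKE :pattern"
    )
    # The wildcard wrapping happens in the bound value, not in the SQL text.
    return {"Sql": sql, "Parameters": [{"name": "pattern", "value": f"%{keyword}%"}]}
```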

Redshift Spectrum external tables a

