
Finding Data Lake Assets
Resolve vague table names, keywords, column hints, or S3 paths to concrete Glue, S3 Tables, lakehouse, or Redshift assets in a chosen AWS region.
Overview
finding-data-lake-assets is an agent skill, most often used in the Build phase (and also in Operate), that locates Glue, S3, lakehouse, and Redshift tables from names, keywords, columns, or S3 paths.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill finding-data-lake-assets
What is this skill?
- Resolves table names, keywords, column names, or s3:// paths across Glue, S3, S3 Tables, and Redshift
- Optimized for low token usage—fast answers without full catalog audit workflows
- Requires a target AWS region and asks for confirmation before searching when the input is ambiguous
- Explicitly excludes full catalog audits, query execution, and table creation (points to sibling skills)
- Intended to run via AWS MCP server tools when connected for validation and audit logging
- Covers four AWS surfaces: Glue Data Catalog, S3, S3 Tables, and Redshift
Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You know a table nickname, column, or S3 path but not which Glue, lakehouse, or Redshift asset is canonical in your AWS account.
Who is it for?
Indie builders and agents orchestrating AWS data lake work who need quick asset resolution before querying or modeling.
Skip if: you need full data catalog audits, want to run SQL/analytics queries, or are creating new lake tables. Use the dedicated sibling skills instead.
When should I use this skill?
Use it for requests such as "find the table", "where is our data", "which table has", "locate dataset", or "search catalog", for Redshift/lakehouse table lookups, and for reverse lookups from an S3 path. It is not for full audits, queries, or table creation.
What do I get? / Deliverables
You get concrete catalog entries and asset pointers in the right region so downstream query or pipeline skills can run against the correct dataset.
- Resolved catalog entry or asset reference
- Clarified match when input was ambiguous
- Pointer suitable for follow-on query or pipeline skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
The canonical shelf is Build integrations, because pipelines and apps need catalog resolution before queries, transforms, or agent tools can target the right tables. The Integrations subphase fits cross-service AWS discovery (Glue Data Catalog, S3, S3 Tables, Redshift) rather than one-off SQL authoring.
Where it fits
Wire an agent tool that must resolve a user-mentioned table before calling querying-data-lake.
Confirm which Glue database holds events before adding a new ETL job ARN.
Trace an s3:// prefix back to catalog metadata during an incident or access review.
Find the canonical warehouse table behind a dashboard metric name a stakeholder cited.
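For the s3:// tracing scenario, the first step is usually to split the path into a bucket and a prefix, then pick path segments to try as catalog search text. The Python sketch below is illustrative only: the helper names `split_s3_uri` and `path_keywords` are hypothetical and not part of the skill, and skipping Hive-style `key=value` partition segments is an assumption about which segments make useful keywords.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, prefix). Hypothetical helper."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the key prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, prefix


def path_keywords(prefix: str) -> list[str]:
    """Return path segments worth trying as catalog search text.

    Assumption: segments like "dt=2024-01-01" are partition values,
    not table names, so they are skipped.
    """
    return [seg for seg in prefix.split("/") if seg and "=" not in seg]


# Example usage with a made-up bucket and prefix.
bucket, prefix = split_s3_uri("s3://example-bucket/warehouse/events/dt=2024-01-01/")
print(bucket, prefix, path_keywords(prefix))
```

The bucket and prefix map onto the `--bucket` and `--prefix` arguments of a listing call, and each keyword is a candidate for a catalog text search.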
How it compares
Lightweight resolver across Glue/S3/Redshift—not a full data exploration or query runner.
Common Questions / FAQ
Who is finding-data-lake-assets for?
Solo builders and agent workflows on AWS that must turn fuzzy data references into real catalog objects before other toolkit skills run.
When should I use finding-data-lake-assets?
In Build when integrating lake-aware features or agents; in Operate when locating production tables, buckets, or warehouse objects from paths or keywords.
Is finding-data-lake-assets safe to install?
It can trigger AWS discovery APIs—review Security Audits on this page, limit IAM scope, and prefer MCP-sandboxed execution when available.
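As one way to limit IAM scope, the Python sketch below builds a read-only policy document for catalog discovery. It is an assumption, not something the skill ships: the function name `discovery_policy` is hypothetical, the action list is inferred from the Glue and S3 calls the skill documents, and Redshift access is deliberately left out. `aws sts get-caller-identity` needs no explicit permission.

```python
import json


def discovery_policy(bucket: str) -> dict:
    """Build a hypothetical read-only IAM policy for asset discovery."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only Glue Data Catalog lookups.
                "Sid": "GlueCatalogRead",
                "Effect": "Allow",
                "Action": [
                    "glue:SearchTables",
                    "glue:GetTables",
                    "glue:GetTable",
                    "glue:GetDatabases",
                ],
                "Resource": "*",
            },
            {
                # Listing objects requires s3:ListBucket on the bucket ARN.
                "Sid": "ListOneBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


# Example usage with a made-up bucket name.
print(json.dumps(discovery_policy("example-bucket"), indent=2))
```

Narrowing the Glue `Resource` to specific catalog, database, or table ARNs tightens this further where the target databases are known in advance.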
Workflow Chain
Then invoke: querying data lake
SKILL.md
# Find Data Lake Assets

## Overview

Resolves data lake asset references to concrete catalog entries. Acts as a resolver for other skills and direct user requests. Covers Glue, S3, S3 Tables, and Redshift. Optimized for low token usage: return the answer fast and get out of the way.

**Constraints for parameter acquisition:**

- You MUST accept a single argument: table name, keyword, column name, or S3 path
- You MUST accept the argument as direct input or a pointer to a file containing the spec
- You MUST ask for the target AWS region if not already set
- You MUST confirm ambiguous input before searching (e.g., "Did you mean table X or bucket Y?")
- You MUST respect the user's decision to abort at any step

## Common Tasks

You MUST execute commands using AWS MCP server tools when connected; they provide validation, sandboxed execution, and audit logging. Fall back to AWS CLI only if MCP is unavailable. You MUST explain each step before executing.

### 1. Verify Dependencies

Check for required tools and AWS access before searching.

**Constraints:**

- You MUST verify AWS MCP server tools (`aws___call_aws`) are available; fall back to AWS CLI if not
- You MUST confirm credentials with `aws sts get-caller-identity`
- You MUST inform the user about any missing tools and ask whether to proceed

### 2. Classify the Request

Determine the mode:

- **Resolve** (most common): User/skill references something specific. Signals: possessive/definite articles ("our X table", "the Y dataset") imply the asset exists. Goal: find it, return the reference, done.
- **Search**: User is exploring. Signals: "find tables with", "what has customer_id". Goal: rank candidates, present top matches.

You SHOULD default to Resolve mode when ambiguous.

### 3. Extract Search Terms

Parse the request into search dimensions:

- **Name terms**: Table or database names mentioned
- **Domain terms**: Business concepts (billing, orders, churn)
- **Column terms**: Specific column names (customer_id, event_type)
- **Location terms**: S3 paths, bucket names, prefixes

### 4. Layered Search (stop early)

Search sources in order. Stop at the first layer that returns a high-confidence match. Do NOT search all layers every time.

You MUST track which layers were searched and which were skipped. Report this in the output (see Step 6).

**Layer 1: Glue Data Catalog** (always start here)

You SHOULD use `SearchTables` as the primary API. It searches table names, column names, and column comments across the entire catalog in one call. You MUST NOT loop over databases with `get-tables` unless you already know the database name. See [search-strategy.md](references/search-strategy.md) for patterns.

```
aws glue search-tables --search-text "orders"
aws glue get-tables --database-name sales --expression "order.*"
```

**Layer 2: S3 Reverse Lookup** (S3 path provided)

When a user provides an S3 path, you SHOULD default to reverse lookup first, because they usually want the Glue table, not the file contents.

```
aws glue search-tables --search-text "<path-keyword>"
aws s3api list-objects-v2 --bucket <bucket-name> --prefix <prefix>
```

**Layer 3: Redshift Catalog** (if user mentions Redshift, warehouse, or lakehouse)

```sql
SELECT schema_name, table_name, table_type
FROM svv_all_tables
WHERE table_name ILIKE '%orders%';
```

Redshift Spectrum external tables a