
Querying Data Lake
Run reproducible Athena/Presto SQL for profiling tables, sampling rows, and cohort retention on an AWS data lake without guessing schema or null rates.
Overview
Querying Data Lake is an agent skill most often used in Grow (also Build integrations, Operate monitoring) that supplies Athena/Presto SQL for profiling, sampling, null analysis, and cohort retention on AWS data lakes.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill querying-data-lakeWhat is this skill?
- Schema discovery and row-count/date-range profiling via information_schema and aggregates
- Mandatory sample-with-LIMIT-5 pattern before heavy analytical queries
- Null-count and null-percentage analysis per column
- Pre-built cohort retention CTE pattern with month_0 through month_6 buckets
- Presto/Athena SQL idioms (DATE_TRUNC, DATE_ADD) for lake tables
- 4 SQL pattern groups: table profiling, cohort retention, schema discovery, null analysis
Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have tables in the lake but no trusted starting SQL to understand schema quality, sample safely, or compute cohort retention.
Who is it for?
Solo builders shipping analytics on Athena/Presto who want guard-railed SQL patterns instead of one-off agent hallucinations.
Skip if: Teams still defining ingestion, catalog, or partition strategy—fix lake plumbing before leaning on analytical templates.
When should I use this skill?
You need Athena/Presto SQL templates to explore or analyze tables in an AWS data lake.
What do I get? / Deliverables
You get copy-ready Athena/Presto queries for discovery, profiling, and retention metrics you can run or adapt in your warehouse console.
- Profiling SQL snippets
- Cohort retention query template
- Null-analysis SELECT statements
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Grow because the skill’s outputs are analytical queries and cohort views used after you have data in the lake. Analytics subphase matches table profiling, null analysis, and retention SQL—the core deliverables in the readme.
Where it fits
Compute month-over-month retention from user_activity after onboarding events land in the lake.
Generate profiling SQL when an agent first connects to a new Glue database.
Quick row counts and date ranges to sanity-check nightly ETL partitions.
How it compares
Structured SQL pattern library for lake tables, not a BI connector or MCP server that runs queries for you.
Common Questions / FAQ
Who is querying-data-lake for?
Indie builders and small teams using AWS Agent Toolkit workflows who query S3 data lakes with Athena or Presto and need standard profiling and retention SQL.
When should I use querying-data-lake?
During Grow when building cohort or funnel metrics, during Build when wiring agent tooling to your lake, or during Operate when auditing row counts and null rates before trusting production dashboards.
Is querying-data-lake safe to install?
Treat it as documentation-style SQL guidance; review the Security Audits panel on this Prism page and ensure agents only run queries against scopes and tables you authorize.
SKILL.md
READMESKILL.md - Querying Data Lake
# Common Query Patterns (Presto/Athena SQL) ## Table Profiling ```sql -- Schema discovery SELECT column_name, data_type FROM information_schema.columns WHERE table_schema = '<database>' AND table_name = '<table>'; -- Quick row count and date range SELECT COUNT(*) as total_rows, MIN(created_at) as earliest, MAX(created_at) as latest FROM <table>; -- Sample data (always do this before analytical queries) SELECT * FROM <table> LIMIT 5; -- Null analysis SELECT '<column>' as field, COUNT(*) - COUNT(<column>) as null_count, ROUND((COUNT(*) - COUNT(<column>)) * 100.0 / COUNT(*), 2) as null_pct FROM <table>; ``` ## Cohort Retention ```sql WITH cohorts AS ( SELECT user_id, DATE_TRUNC('month', first_activity_date) as cohort_month FROM users ), activity AS ( SELECT user_id, DATE_TRUNC('month', activity_date) as activity_month FROM user_activity ) SELECT c.cohort_month, COUNT(DISTINCT c.user_id) as cohort_size, COUNT(DISTINCT CASE WHEN a.activity_month = c.cohort_month THEN a.user_id END) as month_0, COUNT(DISTINCT CASE WHEN a.activity_month = DATE_ADD('month', 1, c.cohort_month) THEN a.user_id END) as month_1, COUNT(DISTINCT CASE WHEN a.activity_month = DATE_ADD('month', 3, c.cohort_month) THEN a.user_id END) as month_3, COUNT(DISTINCT CASE WHEN a.activity_month = DATE_ADD('month', 6, c.cohort_month) THEN a.user_id END) as month_6 FROM cohorts c LEFT JOIN activity a ON c.user_id = a.user_id GROUP BY c.cohort_month ORDER BY c.cohort_month; ``` ## Funnel Analysis ```sql WITH funnel AS ( SELECT user_id, MAX(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) as step_1_view, MAX(CASE WHEN event = 'signup_start' THEN 1 ELSE 0 END) as step_2_start, MAX(CASE WHEN event = 'signup_complete' THEN 1 ELSE 0 END) as step_3_complete, MAX(CASE WHEN event = 'first_purchase' THEN 1 ELSE 0 END) as step_4_purchase FROM events WHERE event_date >= DATE_ADD('day', -30, CURRENT_DATE) GROUP BY user_id ) SELECT COUNT(*) as total_users, SUM(step_1_view) as viewed, SUM(step_2_start) as started_signup, SUM(step_3_complete) as completed_signup, SUM(step_4_purchase) as purchased, ROUND(100.0 * SUM(step_2_start) / NULLIF(SUM(step_1_view), 0), 1) as view_to_start_pct, ROUND(100.0 * SUM(step_3_complete) / NULLIF(SUM(step_2_start), 0), 1) as start_to_complete_pct, ROUND(100.0 * SUM(step_4_purchase) / NULLIF(SUM(step_3_complete), 0), 1) as complete_to_purchase_pct FROM funnel; ``` ## Deduplication ```sql -- Keep the most recent record per key (Presto/Athena syntax) WITH ranked AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY entity_id ORDER BY updated_at DESC ) as rn FROM source_table ) SELECT * FROM ranked WHERE rn = 1; ``` ## Window Functions ```sql -- Running total SUM(revenue) OVER (ORDER BY event_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total -- 7-day moving average AVG(revenue) OVER (ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_avg_7d -- Period-over-period comparison LAG(value, 1) OVER (PARTITION BY entity ORDER BY event_date) as prev_value -- Percent of total revenue / SUM(revenue) OVER () as pct_of_total revenue / SUM(revenue) OVER (PARTITION BY category) as pct_of_category -- Ranking ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) as rank_in_category ``` ## Period Comparison / Growth When the user asks for "growth", "change", or "comparison" between periods, compute the delta — not raw totals. ```sql WITH quarterly AS ( SELECT category, QUARTER(order_date) as q, SUM(amount) as revenue FROM orders WHERE YEAR(order_date) = 2025 GROUP BY category, QUARTER(order_date) ) SELECT curr.category, prev.revenue as prev_period, curr.revenue as curr_period,