
Ingesting Into Data Lake
Load CSV, JSON, Parquet, or ORC from S3 into a governed data-lake table using Athena INSERT when Glue ETL is overkill.
Overview
Ingesting Into Data Lake is an agent skill for the Build phase that loads S3 files into lake tables via Athena external tables and INSERT INTO.
Install
npx skills add https://github.com/aws/agent-toolkit-for-aws --skill ingesting-into-data-lakeWhat is this skill?
- Fallback path for simple one-time loads when Glue ETL is unavailable or unnecessary
- Step 1: CREATE EXTERNAL TABLE templates for CSV, JSON, Parquet, and ORC on S3
- Step 2: INSERT INTO target catalog table with CAST, DATE_PARSE, and null filters
- References type-transformations.md for casting, dates, booleans, and null handling
- Includes Athena CLI execution pattern for running the load query
- Two-step workflow: CREATE EXTERNAL TABLE then INSERT INTO target
- Four source formats documented: CSV, JSON, Parquet, and ORC
Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have raw files in S3 and a target Iceberg or Hive table, but no quick agent-ready recipe for Athena INSERT without standing up Glue ETL.
Who is it for?
Solo builders on AWS who need a one-off or occasional ingest into Athena/Glue catalog tables from CSV, JSON, Parquet, or ORC.
Skip if: High-volume continuous ingestion, complex multi-hop ETL, or teams standardized on Glue Spark jobs only.
When should I use this skill?
Simple one-time data loads into a data lake when Glue ETL is unavailable or unnecessary.
What do I get? / Deliverables
You run a documented two-step SQL load with typed transforms and optional CLI execution, then drop the temp external table.
- CREATE EXTERNAL TABLE DDL for the source prefix
- INSERT INTO statement with casts and filters into the lake table
- Executed Athena load via console or CLI with validated row counts
Recommended Skills
Journey fit
One-time or simple lake ingestion is Build → Backend work—standing up data paths that power analytics and product features. The skill is SQL-first Athena loading (external tables, INSERT INTO, CLI execution), which is backend data plumbing rather than frontend or launch work.
How it compares
Athena INSERT skill for simple loads—not a replacement for Glue ETL or streaming ingest pipelines.
Common Questions / FAQ
Who is ingesting-into-data-lake for?
Indie and solo builders using AWS who want their coding agent to generate correct Athena DDL and INSERT statements for S3-backed lake tables.
When should I use ingesting-into-data-lake?
Use it during Build → Backend when onboarding a new S3 prefix into your data lake catalog, especially for one-time migrations or small recurring batch loads.
Is ingesting-into-data-lake safe to install?
It is AWS documentation-style guidance; scope IAM so Athena can only reach intended buckets. Review the Security Audits panel on this Prism page before automating loads in CI.
SKILL.md
READMESKILL.md - Ingesting Into Data Lake
# Data Loading via Athena INSERT INTO Fallback approach for simple one-time data loads when Glue ETL is unavailable or unnecessary. ## Step 1: Create External Table for Source Create a temporary external table pointing to source files in S3. ### CSV ```sql CREATE EXTERNAL TABLE temp_source_<timestamp> ( customer_id INT, first_name STRING, last_name STRING, email STRING, signup_date STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://<bucket>/<prefix>/' TBLPROPERTIES ('skip.header.line.count'='1'); ``` ### JSON ```sql CREATE EXTERNAL TABLE temp_source_<timestamp> ( order_id BIGINT, customer_id BIGINT, order_date STRING, total DECIMAL(10,2) ) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://<bucket>/<prefix>/'; ``` ### Parquet / ORC ```sql CREATE EXTERNAL TABLE temp_source_<timestamp> ( event_id BIGINT, event_type STRING, timestamp TIMESTAMP ) STORED AS PARQUET -- or ORC LOCATION 's3://<bucket>/<prefix>/'; ``` ## Step 2: Transform and Insert ```sql INSERT INTO "<catalog>"."<namespace>"."<target_table>" SELECT CAST(customer_id AS BIGINT) AS customer_id, first_name, last_name, email, DATE_PARSE(signup_date, '%Y-%m-%d') AS signup_date FROM temp_source_<timestamp> WHERE customer_id IS NOT NULL ``` For detailed type casting, date parsing, null handling, and boolean conversion patterns, see [type-transformations.md](type-transformations.md). ### Execute via CLI ```bash QUERY_ID=$(aws athena start-query-execution \ --query-string "<INSERT INTO query>" \ --query-execution-context Database=<namespace> \ --result-configuration OutputLocation=s3://<results-bucket>/ \ --region <region> \ --query 'QueryExecutionId' --output text) aws athena get-query-execution --query-execution-id "$QUERY_ID" --region <region> ``` ## Step 3: Validate ```sql -- Row count SELECT COUNT(*) as row_count FROM "<catalog>"."<namespace>"."<target_table>"; -- Spot check SELECT * FROM "<catalog>"."<namespace>"."<target_table>" LIMIT 10; -- Null check on critical columns SELECT SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_ids, COUNT(*) as total FROM "<catalog>"."<namespace>"."<target_table>"; ``` ## Step 4: Clean Up ```sql DROP TABLE IF EXISTS temp_source_<timestamp>; ``` ## Large Datasets If Athena times out (30-minute limit): 1. **Batch by partition**: Load one month/day at a time 2. **Switch to Glue ETL**: Better for datasets > 1GB — handles larger data with more workers, provides monitoring and retries ## Limitations | Limitation | Workaround | |-----------|------------| | No scheduling | Use EventBridge or Step Functions to trigger queries | | Limited transformations | Use Glue ETL for complex PySpark logic | | 30-minute timeout | Batch loads or switch to Glue ETL | # BigQuery Ingest Move data from Google BigQuery into the data lake. Assumes a Glue `BIGQUERY` connection exists. If not, delegate to `connecting-to-data-source`. ## Contents - [Prerequisites](#prerequisites) - [Read Pattern](#read-pattern) - [Incremental Loading](#incremental-loading) - [Partition Decorators](#partition-decorators) - [Type Mapping](#type-mapping) - [Further Reading](#further-reading) ## Prerequisites - Glue connection of type `BIGQUERY` with service account credentials in Secrets Manager - GCP project ID and source table (full form: `project.dataset.table`) - Target table in the data lake - Egress from the Glue subnet to `bigquery.googleapis.com` (public internet or Google Private Service Connect) ## Read Pattern ```python bigquery_df = glueContext.create_dynamic_frame.from_options( connection_type="bigquery", connection_options={ "connectionName": args['connection_name'], "parentProject": args['gcp_project'], "sourceType": "table", "table": "my_dataset.customers" } ).toDF() ``` For custom SQL: ```python connection_options={ "connectionName": args['connection_name'], "paren