Ingesting Into Data Lake

Name: Ingesting Into Data Lake
Author: aws

aws/agent-toolkit-for-aws

Load CSV, JSON, Parquet, or ORC from S3 into a governed data-lake table using Athena INSERT when Glue ETL is overkill.

Overview

Ingesting Into Data Lake is an agent skill for the Build phase that loads S3 files into lake tables via Athena external tables and INSERT INTO.

Install

npx skills add https://github.com/aws/agent-toolkit-for-aws --skill ingesting-into-data-lake

What is this skill?

Fallback path for simple one-time loads when Glue ETL is unavailable or unnecessary
Step 1: CREATE EXTERNAL TABLE templates for CSV, JSON, Parquet, and ORC on S3
Step 2: INSERT INTO target catalog table with CAST, DATE_PARSE, and null filters
References type-transformations.md for casting, dates, booleans, and null handling
Includes Athena CLI execution pattern for running the load query
Two-step workflow: CREATE EXTERNAL TABLE then INSERT INTO target
Four source formats documented: CSV, JSON, Parquet, and ORC

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have raw files in S3 and a target Iceberg or Hive table, but no quick agent-ready recipe for Athena INSERT without standing up Glue ETL.

Who is it for?

Solo builders on AWS who need a one-off or occasional ingest into Athena/Glue catalog tables from CSV, JSON, Parquet, or ORC.

Skip if: High-volume continuous ingestion, complex multi-hop ETL, or teams standardized on Glue Spark jobs only.

When should I use this skill?

Simple one-time data loads into a data lake when Glue ETL is unavailable or unnecessary.

What do I get? / Deliverables

You run a documented two-step SQL load with typed transforms and optional CLI execution, then drop the temp external table.

CREATE EXTERNAL TABLE DDL for the source prefix
INSERT INTO statement with casts and filters into the lake table
Executed Athena load via console or CLI with validated row counts

Recommended Skills

Azure Deploymicrosoft/azure-skills

Azure Deploy is a Microsoft agent skill that executes cloud releases for applications that are already planned and valid…374k installs·1.2k stars

Azure Preparemicrosoft/azure-skills

Azure Prepare is Microsoft's skill for getting applications ready to run on Azure—writing the deployment plan, generatin…374k installs·1.2k stars

Azure Storagemicrosoft/azure-skills

Azure Storage skill helps agents pick the right Azure storage service—Blob for objects, Files for SMB shares, Queues for…374k installs·1.2k stars

Azure Validatemicrosoft/azure-skills

Microsoft-guided preflight validation for Azure deployments including IaC, identity, and service-specific readiness.374k installs·1.2k stars

Appinsights Instrumentationmicrosoft/azure-skills

appinsights-instrumentation is a Microsoft Azure-skills package that walks solo builders through enabling Application In…374k installs·1.2k stars

Azure Resource Lookupmicrosoft/azure-skills

Azure Resource Lookup is a Microsoft agent skill that helps solo builders and small teams answer “what do I have in Azur…373k installs·1.2k stars

Journey fit

Primary fit

BuildBackend, data & payments

One-time or simple lake ingestion is Build → Backend work—standing up data paths that power analytics and product features. The skill is SQL-first Athena loading (external tables, INSERT INTO, CLI execution), which is backend data plumbing rather than frontend or launch work.

Also useful

OperateInfrastructure & cost

How it compares

Athena INSERT skill for simple loads—not a replacement for Glue ETL or streaming ingest pipelines.

Common Questions / FAQ

Who is ingesting-into-data-lake for?

Indie and solo builders using AWS who want their coding agent to generate correct Athena DDL and INSERT statements for S3-backed lake tables.

When should I use ingesting-into-data-lake?

Use it during Build → Backend when onboarding a new S3 prefix into your data lake catalog, especially for one-time migrations or small recurring batch loads.

Is ingesting-into-data-lake safe to install?

It is AWS documentation-style guidance; scope IAM so Athena can only reach intended buckets. Review the Security Audits panel on this Prism page before automating loads in CI.

SKILL.md

READMESKILL.md - Ingesting Into Data Lake

# Data Loading via Athena INSERT INTO

Fallback approach for simple one-time data loads when Glue ETL is unavailable or unnecessary.

## Step 1: Create External Table for Source

Create a temporary external table pointing to source files in S3.

### CSV

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  signup_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<bucket>/<prefix>/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

### JSON

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  order_id BIGINT,
  customer_id BIGINT,
  order_date STRING,
  total DECIMAL(10,2)
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://<bucket>/<prefix>/';
```

### Parquet / ORC

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  event_id BIGINT,
  event_type STRING,
  timestamp TIMESTAMP
)
STORED AS PARQUET  -- or ORC
LOCATION 's3://<bucket>/<prefix>/';
```

## Step 2: Transform and Insert

```sql
INSERT INTO "<catalog>"."<namespace>"."<target_table>"
SELECT
  CAST(customer_id AS BIGINT) AS customer_id,
  first_name,
  last_name,
  email,
  DATE_PARSE(signup_date, '%Y-%m-%d') AS signup_date
FROM temp_source_<timestamp>
WHERE customer_id IS NOT NULL
```

For detailed type casting, date parsing, null handling, and boolean conversion patterns, see [type-transformations.md](type-transformations.md).

### Execute via CLI

```bash
QUERY_ID=$(aws athena start-query-execution \
  --query-string "<INSERT INTO query>" \
  --query-execution-context Database=<namespace> \
  --result-configuration OutputLocation=s3://<results-bucket>/ \
  --region <region> \
  --query 'QueryExecutionId' --output text)

aws athena get-query-execution --query-execution-id "$QUERY_ID" --region <region>
```

## Step 3: Validate

```sql
-- Row count
SELECT COUNT(*) as row_count FROM "<catalog>"."<namespace>"."<target_table>";

-- Spot check
SELECT * FROM "<catalog>"."<namespace>"."<target_table>" LIMIT 10;

-- Null check on critical columns
SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_ids,
  COUNT(*) as total
FROM "<catalog>"."<namespace>"."<target_table>";
```

## Step 4: Clean Up

```sql
DROP TABLE IF EXISTS temp_source_<timestamp>;
```

## Large Datasets

If Athena times out (30-minute limit):

1. **Batch by partition**: Load one month/day at a time
2. **Switch to Glue ETL**: Better for datasets > 1GB — handles larger data with more workers, provides monitoring and retries

## Limitations

| Limitation | Workaround |
|-----------|------------|
| No scheduling | Use EventBridge or Step Functions to trigger queries |
| Limited transformations | Use Glue ETL for complex PySpark logic |
| 30-minute timeout | Batch loads or switch to Glue ETL |


# BigQuery Ingest

Move data from Google BigQuery into the data lake. Assumes a Glue `BIGQUERY` connection exists. If not, delegate to `connecting-to-data-source`.

## Contents

- [Prerequisites](#prerequisites)
- [Read Pattern](#read-pattern)
- [Incremental Loading](#incremental-loading)
- [Partition Decorators](#partition-decorators)
- [Type Mapping](#type-mapping)
- [Further Reading](#further-reading)

## Prerequisites

- Glue connection of type `BIGQUERY` with service account credentials in Secrets Manager
- GCP project ID and source table (full form: `project.dataset.table`)
- Target table in the data lake
- Egress from the Glue subnet to `bigquery.googleapis.com` (public internet or Google Private Service Connect)

## Read Pattern

```python
bigquery_df = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": args['connection_name'],
        "parentProject": args['gcp_project'],
        "sourceType": "table",
        "table": "my_dataset.customers"
    }
).toDF()
```

For custom SQL:

```python
connection_options={
    "connectionName": args['connection_name'],
    "paren

What is this skill?

Fallback path for simple one-time loads when Glue ETL is unavailable or unnecessary

Step 1: CREATE EXTERNAL TABLE templates for CSV, JSON, Parquet, and ORC on S3

Step 2: INSERT INTO target catalog table with CAST, DATE_PARSE, and null filters

References type-transformations.md for casting, dates, booleans, and null handling

Includes Athena CLI execution pattern for running the load query

Two-step workflow: CREATE EXTERNAL TABLE then INSERT INTO target

Four source formats documented: CSV, JSON, Parquet, and ORC

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1k installs on skills.sh; 819 GitHub stars; 2/3 security scanners passed (skills.sh audits).

Journey fit

Primary fit

BuildBackend, data & payments

Also useful

OperateInfrastructure & cost

SKILL.md

READMESKILL.md - Ingesting Into Data Lake

# Data Loading via Athena INSERT INTO

Fallback approach for simple one-time data loads when Glue ETL is unavailable or unnecessary.

## Step 1: Create External Table for Source

Create a temporary external table pointing to source files in S3.

### CSV

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  signup_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<bucket>/<prefix>/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

### JSON

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  order_id BIGINT,
  customer_id BIGINT,
  order_date STRING,
  total DECIMAL(10,2)
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://<bucket>/<prefix>/';
```

### Parquet / ORC

```sql
CREATE EXTERNAL TABLE temp_source_<timestamp> (
  event_id BIGINT,
  event_type STRING,
  timestamp TIMESTAMP
)
STORED AS PARQUET  -- or ORC
LOCATION 's3://<bucket>/<prefix>/';
```

## Step 2: Transform and Insert

```sql
INSERT INTO "<catalog>"."<namespace>"."<target_table>"
SELECT
  CAST(customer_id AS BIGINT) AS customer_id,
  first_name,
  last_name,
  email,
  DATE_PARSE(signup_date, '%Y-%m-%d') AS signup_date
FROM temp_source_<timestamp>
WHERE customer_id IS NOT NULL
```

For detailed type casting, date parsing, null handling, and boolean conversion patterns, see [type-transformations.md](type-transformations.md).

### Execute via CLI

```bash
QUERY_ID=$(aws athena start-query-execution \
  --query-string "<INSERT INTO query>" \
  --query-execution-context Database=<namespace> \
  --result-configuration OutputLocation=s3://<results-bucket>/ \
  --region <region> \
  --query 'QueryExecutionId' --output text)

aws athena get-query-execution --query-execution-id "$QUERY_ID" --region <region>
```

## Step 3: Validate

```sql
-- Row count
SELECT COUNT(*) as row_count FROM "<catalog>"."<namespace>"."<target_table>";

-- Spot check
SELECT * FROM "<catalog>"."<namespace>"."<target_table>" LIMIT 10;

-- Null check on critical columns
SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_ids,
  COUNT(*) as total
FROM "<catalog>"."<namespace>"."<target_table>";
```

## Step 4: Clean Up

```sql
DROP TABLE IF EXISTS temp_source_<timestamp>;
```

## Large Datasets

If Athena times out (30-minute limit):

1. **Batch by partition**: Load one month/day at a time
2. **Switch to Glue ETL**: Better for datasets > 1GB — handles larger data with more workers, provides monitoring and retries

## Limitations

| Limitation | Workaround |
|-----------|------------|
| No scheduling | Use EventBridge or Step Functions to trigger queries |
| Limited transformations | Use Glue ETL for complex PySpark logic |
| 30-minute timeout | Batch loads or switch to Glue ETL |


# BigQuery Ingest

Move data from Google BigQuery into the data lake. Assumes a Glue `BIGQUERY` connection exists. If not, delegate to `connecting-to-data-source`.

## Contents

- [Prerequisites](#prerequisites)
- [Read Pattern](#read-pattern)
- [Incremental Loading](#incremental-loading)
- [Partition Decorators](#partition-decorators)
- [Type Mapping](#type-mapping)
- [Further Reading](#further-reading)

## Prerequisites

- Glue connection of type `BIGQUERY` with service account credentials in Secrets Manager
- GCP project ID and source table (full form: `project.dataset.table`)
- Target table in the data lake
- Egress from the Glue subnet to `bigquery.googleapis.com` (public internet or Google Private Service Connect)

## Read Pattern

```python
bigquery_df = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": args['connection_name'],
        "parentProject": args['gcp_project'],
        "sourceType": "table",
        "table": "my_dataset.customers"
    }
).toDF()
```

For custom SQL:

```python
connection_options={
    "connectionName": args['connection_name'],
    "paren

Overview

Install

What is this skill?

What problem does it solve?

Who is it for?

When should I use this skill?

What do I get? / Deliverables

Recommended Skills

Journey fit

Who is ingesting-into-data-lake for?

When should I use ingesting-into-data-lake?

Is ingesting-into-data-lake safe to install?

SKILL.md

This week for builders

Overview

Install

What is this skill?

What problem does it solve?

Who is it for?

When should I use this skill?

What do I get? / Deliverables

Recommended Skills

Journey fit

Who is ingesting-into-data-lake for?

When should I use ingesting-into-data-lake?

Is ingesting-into-data-lake safe to install?

SKILL.md