Iac Data Engineering Terraform

Name: Iac Data Engineering Terraform
Author: aradotso

aradotso/data-skills

1.5k installs
4 repo stars
Updated July 18, 2026

Provision AWS S3, EC2, and IAM for data pipelines using reusable Terraform patterns and state management.

About

IaC Data Engineering Terraform is an agent skill from ara.so’s Data Skills collection that encodes Infrastructure-as-Code patterns for solo data builders on AWS. It walks through provisioning S3 for lake or staging storage, EC2 for processing workloads, and IAM policies that keep pipeline access explicit, using Terraform as the single declarative interface. It assumes Terraform and the AWS CLI are installed and credentials are configured, matching how indie engineers bootstrap a first pipeline environment without clicking through the console. The skill fits builders who treat infrastructure as versioned code alongside ETL jobs, and it remains relevant when you extend stacks in Operate or redeploy after Validate proves a prototype. It is pattern-oriented rather than a one-click deploy of a named product, so agents adapt modules to your naming and regions while preserving state discipline.

Covers S3 buckets, EC2 processing hosts, and IAM roles with least-privilege patterns
Documents Terraform and AWS CLI install via Homebrew plus aws configure
Emphasizes reproducible environments and Terraform state for data infra
Trigger phrases include S3/EC2 setup, pipeline IaC, and state management
From ara.so Data Skills collection for data-engineering workflows

Iac Data Engineering Terraform by the numbers

1,510 all-time installs (skills.sh)
+2 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #271 of 1,041 Cloud & Infrastructure skills by installs in the Skillselion catalog
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/data-skills --skill iac-data-engineering-terraform

IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.

This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineers using Terraform to provision AWS resources including S3 buckets, EC2 instances, and IAM configurations. It provides reusable patterns for managing data infrastructure declaratively.

What This Project Does

Provisions AWS S3 buckets for data storage
Creates and configures EC2 instances for data processing
Sets up IAM roles and policies with proper permissions
Manages infrastructure state with Terraform
Provides reproducible data engineering environments

Prerequisites

Before using this project, ensure Terraform and the AWS CLI are installed and your AWS credentials are configured:

# Install Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Install AWS CLI
brew install awscli

# Configure AWS credentials
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region, and output format

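Alternatively, supply credentials through environment variables instead of `aws configure` (only one of the two methods is needed; the values shown are placeholders):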

export AWS_ACCESS_KEY_ID=$YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$YOUR_SECRET_KEY
export AWS_DEFAULT_REGION=us-east-1

Project Structure

terraform/
├── main.tf          # Main infrastructure definitions
├── variables.tf     # Input variables
├── outputs.tf       # Output values
└── terraform.tfstate # State file (auto-generated)

Core Terraform Commands

Initialize Terraform

# Initialize the working directory and download providers
terraform -chdir=terraform init

# Validate configuration syntax
terraform -chdir=terraform validate

# Format configuration files
terraform -chdir=terraform fmt

Plan and Apply Infrastructure

# Preview changes without applying
terraform -chdir=terraform plan

# Apply infrastructure changes
terraform -chdir=terraform apply

# Auto-approve without prompts (use carefully)
terraform -chdir=terraform apply -auto-approve

Inspect Infrastructure

# List all resources in state
terraform -chdir=terraform state list

# Show detailed state information
terraform -chdir=terraform show

# Output specific values
terraform -chdir=terraform output

Destroy Infrastructure

# Destroy all managed infrastructure
terraform -chdir=terraform destroy

# Destroy specific resource
terraform -chdir=terraform destroy -target=aws_s3_bucket.data_lake

Key Configuration Patterns

S3 Bucket for Data Storage

# main.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-data-engineering-bucket-${random_id.bucket_suffix.hex}"
  
  tags = {
    Environment = "dev"
    Purpose     = "data-engineering"
    ManagedBy   = "terraform"
  }
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}

# Enable versioning for data protection
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Configure lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
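
A data lake bucket is usually kept private and encrypted in addition to being versioned. The following is a minimal sketch, assuming AWS provider v4 or later (where these settings are separate resources). The resource labels `data_lake_private` and `data_lake_sse` are illustrative and do not come from the original project.

```hcl
# Block every form of public access to the data lake bucket
resource "aws_s3_bucket_public_access_block" "data_lake_private" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects at rest with S3-managed keys (SSE-S3)
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake_sse" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

If the pipeline needs customer-managed keys, `sse_algorithm = "aws:kms"` with a `kms_master_key_id` is the usual alternative.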

EC2 Instance for Data Processing

# main.tf
resource "aws_instance" "data_processor" {
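  # Example ID only: AMI IDs are region-specific and are retired over time,
  # so substitute a current Amazon Linux 2 AMI for your region
  ami           = "ami-0c55b159cbfafe1f0"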
  instance_type = "t3.medium"
  
  key_name = aws_key_pair.data_eng_key.key_name
  
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  
  iam_instance_profile = aws_iam_instance_profile.data_processor_profile.name
  
  user_data = <<-EOF
              #!/bin/bash
              yum update -y
              yum install -y python3 python3-pip
              pip3 install pandas boto3 awscli
              EOF
  
  tags = {
    Name        = "data-processor"
    Environment = "dev"
    ManagedBy   = "terraform"
  }
  
  root_block_device {
    volume_size = 50
    volume_type = "gp3"
  }
}

resource "aws_key_pair" "data_eng_key" {
  key_name   = "data-engineering-key"
  public_key = file("~/.ssh/id_rsa.pub")
}
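
Because AMI IDs differ by region, a hard-coded ID tends to break when the stack moves. One possible approach, sketched here, is to look the image up at plan time with the `aws_ami` data source. The data source label `amazon_linux_2` and the name pattern are assumptions to adjust for the image family you actually want.

```hcl
# Look up the most recent Amazon Linux 2 AMI published by Amazon
# in whichever region the AWS provider is configured for
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

The instance can then use `ami = data.aws_ami.amazon_linux_2.id` in place of the literal ID. Note that `most_recent = true` can cause the instance to be replaced when a newer image is published.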

Security Group Configuration

resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing EC2 instances"
  
  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict in production
  }
  
  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "data-processor-sg"
  }
}
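
The SSH rule above is open to the whole internet. A minimal sketch of one way to tighten it is to take the allowed ranges from a variable and reject the open range at plan time. The variable `ssh_allowed_cidrs` is hypothetical and is not defined in the original variables.tf.

```hcl
# Hypothetical variable: not part of the original project
variable "ssh_allowed_cidrs" {
  description = "CIDR blocks allowed to reach port 22"
  type        = list(string)

  validation {
    # Fail the plan if someone passes the open-to-the-world range
    condition     = !contains(var.ssh_allowed_cidrs, "0.0.0.0/0")
    error_message = "SSH must not be open to 0.0.0.0/0; list specific CIDR blocks."
  }
}
```

With this in place, the ingress block would use `cidr_blocks = var.ssh_allowed_cidrs`, and a value such as `["203.0.113.10/32"]` would be supplied per environment.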

IAM Role for EC2 with S3 Access

resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "s3_access_policy" {
  name = "s3-access-policy"
  role = aws_iam_role.data_processor_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}
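
The inline policy above grants all three actions on both the bucket and its objects. As a sketch of a tighter variant, the `aws_iam_policy_document` data source can split the statement so that `s3:ListBucket` applies only to the bucket ARN and the object actions apply only to object ARNs. The label `s3_access` and the `Sid` values are illustrative.

```hcl
data "aws_iam_policy_document" "s3_access" {
  # Listing is a bucket-level action
  statement {
    sid       = "ListDataLake"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # Reads and writes are object-level actions
  statement {
    sid       = "ReadWriteObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}
```

The role policy would then set `policy = data.aws_iam_policy_document.s3_access.json` instead of the `jsonencode` call.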

Variables and Outputs

Define Variables

# variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
  default     = "data-engineering"
}
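
Variables only take effect once something references them, and the snippets in this document never show a provider block. A minimal sketch of one follows, assuming a separate file named providers.tf. The version constraints are assumptions, not requirements stated by the project.

```hcl
# providers.tf (hypothetical file name)
terraform {
  required_version = ">= 1.5.0" # assumed minimum; adjust to your installation

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed major version
    }
    # Needed for the random_id resource used in the bucket name
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}
```

Resources can use the other variables the same way, for example `instance_type = var.instance_type` on the EC2 instance.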

Configure Outputs

# outputs.tf
output "s3_bucket_name" {
  description = "Name of the created S3 bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_lake.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

output "ec2_private_ip" {
  description = "Private IP of the EC2 instance"
  value       = aws_instance.data_processor.private_ip
}

Remote State Management

For team collaboration, store state in an S3 backend and use a DynamoDB table for state locking:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket-name"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

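Create the backend resources first, in a separate configuration or with local state, because the bucket and lock table must exist before `terraform init` can use them: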

resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-state-bucket-name"
  
  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}
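
State files are overwritten on every apply, so keeping earlier versions helps with recovery from a bad write. A minimal sketch, assuming the `terraform_state` bucket defined above; the resource label is illustrative.

```hcl
# Keep prior versions of the state file so an earlier state can be restored
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```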

Verification Commands

After applying infrastructure:

# Verify S3 buckets
aws s3 ls

# Verify EC2 instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value,Type:InstanceType,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}' \
  --output table

# Check IAM roles
aws iam list-roles --query 'Roles[?contains(RoleName, `data-processor`)].RoleName'

# Inspect Terraform state
terraform -chdir=terraform state list
# `state pull` reads local or remote state, so this also works with the S3 backend
terraform -chdir=terraform state pull | jq -r '.resources[] | [.type, .name] | join(",")'

Common Patterns

Multi-Environment Setup

# environments/dev/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment   = "dev"
  instance_type = "t3.small"
  bucket_prefix = "dev-data"
}

# environments/prod/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment   = "prod"
  instance_type = "t3.large"
  bucket_prefix = "prod-data"
}
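
The environment files above call a shared module that is not shown. The following sketch shows what modules/data-infra might contain, under the assumption that it accepts exactly the three inputs passed above. Everything in it is hypothetical and would be replaced by the resources from the earlier sections.

```hcl
# modules/data-infra/variables.tf (hypothetical)
variable "environment" {
  description = "Environment name, used for tagging"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type for the processing host"
  type        = string
}

variable "bucket_prefix" {
  description = "Prefix for the S3 bucket name"
  type        = string
}

# modules/data-infra/main.tf (hypothetical)
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# The EC2 instance, security group, and IAM resources would follow here,
# with the instance using instance_type = var.instance_type

# modules/data-infra/outputs.tf (hypothetical)
output "s3_bucket_name" {
  value = aws_s3_bucket.data_lake.id
}
```

Each environment directory keeps its own state, so `terraform init` and `terraform apply` are run separately in environments/dev and environments/prod.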

Using terraform.tfvars

# terraform.tfvars
aws_region    = "us-west-2"
environment   = "staging"
instance_type = "t3.medium"
bucket_prefix = "staging-data-lake"

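Apply with variables. Terraform loads `terraform.tfvars` from the working directory automatically, so `-var-file` is only required for files with other names; the path is resolved relative to the `-chdir` directory: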

terraform -chdir=terraform apply -var-file="terraform.tfvars"

Troubleshooting

State Lock Issues

# Force unlock if state is stuck (only when no other Terraform run still holds the lock)
terraform -chdir=terraform force-unlock LOCK_ID

# View current state
terraform -chdir=terraform show

S3 Bucket Name Conflicts

If the bucket name is already taken (S3 bucket names are globally unique):

# Use random suffix
resource "random_id" "bucket_suffix" {
  byte_length = 8
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"
}

Import Existing Resources

# Import existing S3 bucket
terraform -chdir=terraform import aws_s3_bucket.data_lake existing-bucket-name

# Import EC2 instance
terraform -chdir=terraform import aws_instance.data_processor i-1234567890abcdef0
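
On Terraform 1.5 or later, the same import can be declared in configuration so that it appears in `terraform plan` and is reviewed like any other change. A minimal sketch, reusing the placeholder bucket name from the command above:

```hcl
# Requires Terraform 1.5 or later; a matching
# resource "aws_s3_bucket" "data_lake" block must exist in the configuration
import {
  to = aws_s3_bucket.data_lake
  id = "existing-bucket-name"
}
```

After a successful apply, the `import` block can be removed or kept as a record of where the resource came from.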

Debugging Terraform

# Enable detailed logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Disable logging
unset TF_LOG

Refresh State

# Sync state with real infrastructure (replaces the deprecated `terraform refresh`)
terraform -chdir=terraform apply -refresh-only

# Force replacement of a damaged or misconfigured resource
terraform -chdir=terraform apply -replace=aws_instance.data_processor

Best Practices

1. Always use variables for environment-specific values
2. Enable S3 versioning for data protection
3. Use IAM roles instead of access keys for EC2
4. Tag all resources for cost tracking and management
5. Store state remotely for team collaboration
6. Use modules for reusable infrastructure patterns
7. Run `terraform plan` before every apply
8. Never commit .tfstate files or sensitive variables to Git
9. Use `.gitignore` for Terraform files:

# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
# Do not ignore .terraform.lock.hcl: commit it so provider versions stay pinned
terraform.tfvars
*.auto.tfvars
