Data Engineering Data Pipeline

Name: Data Engineering Data Pipeline
Author: sickn33

sickn33/antigravity-awesome-skills

556 installs
44k repo stars
Updated July 27, 2026
sickn33/antigravity-awesome-skills

Data Engineering Data Pipeline is a Claude skill that provides expert guidance for designing, implementing, and optimizing scalable batch and streaming data pipelines with reliability and cost-efficiency constraints.

About

Data Pipeline Architecture is an agent skill that turns you into a specialist for building scalable, reliable, and cost-effective data pipelines. It provides concrete guidance on architecture selection, ingestion patterns, orchestration, transformation, storage layers, quality checks, monitoring, and cost optimization for both batch and real-time streaming use cases. developers and small teams use it when they need expert checklists, best practices, and step-by-step decisions without hiring a dedicated data engineer. The skill follows a structured workflow that starts with assessing sources, volume, latency, and targets then guides pattern selection, component implementation, and production hardening.

Designs ETL/ELT, Lambda, Kappa, and Lakehouse architectures
Implements batch and streaming data ingestion plus workflow orchestration with Airflow or Prefect
Handles data transformation with dbt and Spark and manages Delta Lake/Iceberg with ACID transactions
Builds data quality frameworks using Great Expectations and dbt tests
Monitors pipelines with CloudWatch, Prometheus, and Grafana while optimizing costs via partitioning and lifecycle polici

Data Engineering Data Pipeline by the numbers

556 all-time installs (skills.sh)
+27 installs in the week ending Jun 23, 2026 (Skillselion tracking)
Ranked #419 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill data-engineering-data-pipeline

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/sickn33/antigravity-awesome-skills/data-engineering-data-pipeline.svg)](https://skillselion.com/skills/sickn33/antigravity-awesome-skills/data-engineering-data-pipeline)

Installs	556
repo stars	★ 44k
Security audit	3 / 3 scanners passed
Last updated	July 27, 2026
Repository	sickn33/antigravity-awesome-skills ↗

How do you design scalable batch and streaming pipelines?

Get expert guidance designing, implementing, and optimizing scalable data pipelines for batch and streaming workloads.

Who is it for?

Data engineers architecting batch or streaming pipelines who need scalability, reliability, and cost guidance across tooling choices.

Skip if: Unrelated application frontend work, one-off SQL queries, or tasks with no data movement or orchestration component.

When should I use this skill?

A developer asks to design, implement, or optimize a data pipeline for batch processing, streaming ingestion, or ETL orchestration.

What you get

Pipeline architecture plan, ETL workflow design, reliability checklist, and cost-optimization recommendations.

Pipeline architecture plan
Design checklist

Files

SKILL.mdMarkdownGitHub ↗

Data Pipeline Architecture

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

Use this skill when

Working on data pipeline architecture tasks or workflows
Needing guidance, best practices, or checklists for data pipeline architecture

Do not use this skill when

The task is unrelated to data pipeline architecture
You need a different domain or tool outside this scope

Requirements

$ARGUMENTS

Core Capabilities

Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
Implement batch and streaming data ingestion
Build workflow orchestration with Airflow/Prefect
Transform data using dbt and Spark
Manage Delta Lake/Iceberg storage with ACID transactions
Implement data quality frameworks (Great Expectations, dbt tests)
Monitor pipelines with CloudWatch/Prometheus/Grafana
Optimize costs through partitioning, lifecycle policies, and compute optimization

Instructions

1. Architecture Design

Assess: sources, volume, latency requirements, targets
Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
Design flow: sources → ingestion → processing → storage → serving
Add observability touchpoints

2. Ingestion Implementation

Batch

Incremental loading with watermark columns
Retry logic with exponential backoff
Schema validation and dead letter queue for invalid records
Metadata tracking (_extracted_at, _source)

Streaming

Kafka consumers with exactly-once semantics
Manual offset commits within transactions
Windowing for time-based aggregations
Error handling and replay capability

3. Orchestration

Airflow

Task groups for logical organization
XCom for inter-task communication
SLA monitoring and email alerts
Incremental execution with execution_date
Retry with exponential backoff

Prefect

Task caching for idempotency
Parallel execution with .submit()
Artifacts for visibility
Automatic retries with configurable delays

4. Transformation with dbt

Staging layer: incremental materialization, deduplication, late-arriving data handling
Marts layer: dimensional models, aggregations, business logic
Tests: unique, not_null, relationships, accepted_values, custom data quality tests
Sources: freshness checks, loaded_at_field tracking
Incremental strategy: merge or delete+insert

5. Data Quality Framework

Great Expectations

Table-level: row count, column count
Column-level: uniqueness, nullability, type validation, value sets, ranges
Checkpoints for validation execution
Data docs for documentation
Failure notifications

dbt Tests

Schema tests in YAML
Custom data quality tests with dbt-expectations
Test results tracked in metadata

6. Storage Strategy

Delta Lake

ACID transactions with append/overwrite/merge modes
Upsert with predicate-based matching
Time travel for historical queries
Optimize: compact small files, Z-order clustering
Vacuum to remove old files

Apache Iceberg

Partitioning and sort order optimization
MERGE INTO for upserts
Snapshot isolation and time travel
File compaction with binpack strategy
Snapshot expiration for cleanup

7. Monitoring & Cost Optimization

Monitoring

Track: records processed/failed, data size, execution time, success/failure rates
CloudWatch metrics and custom namespaces
SNS alerts for critical/warning/info events
Data freshness checks
Performance trend analysis

Cost Optimization

Partitioning: date/entity-based, avoid over-partitioning (keep >1GB)
File sizes: 512MB-1GB for Parquet
Lifecycle policies: hot (Standard) → warm (IA) → cold (Glacier)
Compute: spot instances for batch, on-demand for streaming, serverless for adhoc
Query optimization: partition pruning, clustering, predicate pushdown

Example: Minimal Batch Pipeline

# Batch ingestion with validation
from batch_ingestion import BatchDataIngester
from storage.delta_lake_manager import DeltaLakeManager
from data_quality.expectations_suite import DataQualityFramework

ingester = BatchDataIngester(config={})

# Extract with incremental loading
df = ingester.extract_from_database(
    connection_string='postgresql://host:5432/db',
    query='SELECT * FROM orders',
    watermark_column='updated_at',
    last_watermark=last_run_timestamp
)

# Validate
schema = {'required_fields': ['id', 'user_id'], 'dtypes': {'id': 'int64'}}
df = ingester.validate_and_clean(df, schema)

# Data quality checks
dq = DataQualityFramework()
result = dq.validate_dataframe(df, suite_name='orders_suite', data_asset_name='orders')

# Write to Delta Lake
delta_mgr = DeltaLakeManager(storage_path='s3://lake')
delta_mgr.create_or_update_table(
    df=df,
    table_name='orders',
    partition_columns=['order_date'],
    mode='append'
)

# Save failed records
ingester.save_dead_letter_queue('s3://lake/dlq/orders')

Output Deliverables

1. Architecture Documentation

Architecture diagram with data flow
Technology stack with justification
Scalability analysis and growth patterns
Failure modes and recovery strategies

2. Implementation Code

Ingestion: batch/streaming with error handling
Transformation: dbt models (staging → marts) or Spark jobs
Orchestration: Airflow/Prefect DAGs with dependencies
Storage: Delta/Iceberg table management
Data quality: Great Expectations suites and dbt tests

3. Configuration Files

Orchestration: DAG definitions, schedules, retry policies
dbt: models, sources, tests, project config
Infrastructure: Docker Compose, K8s manifests, Terraform
Environment: dev/staging/prod configs

4. Monitoring & Observability

Metrics: execution time, records processed, quality scores
Alerts: failures, performance degradation, data freshness
Dashboards: Grafana/CloudWatch for pipeline health
Logging: structured logs with correlation IDs

5. Operations Guide

Deployment procedures and rollback strategy
Troubleshooting guide for common issues
Scaling guide for increased volume
Cost optimization strategies and savings
Disaster recovery and backup procedures

Success Criteria

Pipeline meets defined SLA (latency, throughput)
Data quality checks pass with >99% success rate
Automatic retry and alerting on failures
Comprehensive monitoring shows health and performance
Documentation enables team maintenance
Cost optimization reduces infrastructure costs by 30-50%
Schema evolution without downtime
End-to-end data lineage tracked

Limitations

Use this skill only when the task clearly matches the scope described above.
Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Ai Research ReproductionOrchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs.164k507

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

FAQ

What workloads does Data Engineering Data Pipeline cover?

Data Engineering Data Pipeline covers scalable, reliable, and cost-effective pipelines for both batch and streaming data processing. The skill provides architecture guidance, best practices, and checklists for pipeline design tasks.

When should Data Engineering Data Pipeline not be used?

Data Engineering Data Pipeline should not be used when the task is unrelated to data pipeline architecture. The skill explicitly defers frontend, unrelated SQL, or non-pipeline application work to other skills.

Is Data Engineering Data Pipeline safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLdatabasesanalyticspipelines