
Data Engineering Data Pipeline
Design and implement batch or streaming data pipelines with orchestration, transforms, lake storage, and quality checks.
Install
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill data-engineering-data-pipeline
What is this skill?
- Architecture patterns: ETL, ELT, Lambda, Kappa, and Lakehouse
- Batch and streaming ingestion with Airflow/Prefect orchestration
- Transforms via dbt and Spark; Delta Lake/Iceberg with ACID semantics
- Data quality with Great Expectations and dbt tests
- Monitoring hooks for CloudWatch, Prometheus, and Grafana plus cost optimization levers
Adoption & trust: 496 installs on skills.sh; 40.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Build is the canonical shelf because the skill centers on architecting and implementing ingestion, transforms, and storage, not on day-two pager rotation alone. The Backend subphase matches ETL/ELT services, workflow engines, and data APIs that power the product’s data layer.
Common Questions / FAQ
Is Data Engineering Data Pipeline safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
SKILL.md - Data Engineering Data Pipeline
# Data Pipeline Architecture

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

## Use this skill when

- Working on data pipeline architecture tasks or workflows
- Needing guidance, best practices, or checklists for data pipeline architecture

## Do not use this skill when

- The task is unrelated to data pipeline architecture
- You need a different domain or tool outside this scope

## Requirements

$ARGUMENTS

## Core Capabilities

- Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
- Implement batch and streaming data ingestion
- Build workflow orchestration with Airflow/Prefect
- Transform data using dbt and Spark
- Manage Delta Lake/Iceberg storage with ACID transactions
- Implement data quality frameworks (Great Expectations, dbt tests)
- Monitor pipelines with CloudWatch/Prometheus/Grafana
- Optimize costs through partitioning, lifecycle policies, and compute optimization

## Instructions

### 1. Architecture Design

- Assess: sources, volume, latency requirements, targets
- Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
- Design flow: sources → ingestion → processing → storage → serving
- Add observability touchpoints

### 2. Ingestion Implementation

**Batch**
- Incremental loading with watermark columns
- Retry logic with exponential backoff
- Schema validation and dead letter queue for invalid records
- Metadata tracking (_extracted_at, _source)

**Streaming**
- Kafka consumers with exactly-once semantics
- Manual offset commits within transactions
- Windowing for time-based aggregations
- Error handling and replay capability

### 3. Orchestration

**Airflow**
- Task groups for logical organization
- XCom for inter-task communication
- SLA monitoring and email alerts
- Incremental execution with execution_date
- Retry with exponential backoff

**Prefect**
- Task caching for idempotency
- Parallel execution with .submit()
- Artifacts for visibility
- Automatic retries with configurable delays

### 4. Transformation with dbt

- Staging layer: incremental materialization, deduplication, late-arriving data handling
- Marts layer: dimensional models, aggregations, business logic
- Tests: unique, not_null, relationships, accepted_values, custom data quality tests
- Sources: freshness checks, loaded_at_field tracking
- Incremental strategy: merge or delete+insert

### 5. Data Quality Framework

**Great Expectations**
- Table-level: row count, column count
- Column-level: uniqueness, nullability, type validation, value sets, ranges
- Checkpoints for validation execution
- Data docs for documentation
- Failure notifications

**dbt Tests**
- Schema tests in YAML
- Custom data quality tests with dbt-expectations
- Test results tracked in metadata

### 6. Storage Strategy

**Delta Lake**
- ACID transactions with append/overwrite/merge modes
- Upsert with predicate-based matching
- Time travel for historical queries
- Optimize: compact small files, Z-order clustering
- Vacuum to remove old files

**Apache Iceberg**
- Partitioning and sort order optimization
- MERGE INTO for upserts
- Snapshot isolation and time travel
- File compaction with binpack strategy
- Snapshot expiration for cleanup

### 7. Monitoring & Cost Optimization

**Monitoring**
- Track: records processed/failed, data size, execution time, success/failure rates
- CloudWatch metrics and custom namespaces
- SNS alerts for critical/warning/info events
- Data freshness checks
- Performance trend analysis

**Cost Optimization**
- Partitioning: date/entity-based, avoid over-partitioning (keep >1GB)
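
The partitioning guideline above (avoid over-partitioning, keep partitions larger than about 1GB) comes down to simple arithmetic on expected data volume. The following is a minimal Python sketch of that check, not part of the skill itself. The function name `choose_date_granularity`, the candidate granularities, and the reading of "1GB" as an average partition size of 1 GiB are all illustrative assumptions.

```python
GIB = 1024 ** 3

# Assumption: the ">1GB" rule of thumb is read as "average partition > 1 GiB".
MIN_PARTITION_BYTES = 1 * GIB

# Hypothetical candidate date granularities, finest first, with the
# approximate number of days of data that one partition would hold.
GRANULARITIES = [
    ("hour", 1 / 24),
    ("day", 1.0),
    ("month", 30.0),
    ("year", 365.0),
]


def choose_date_granularity(bytes_per_day, min_partition_bytes=MIN_PARTITION_BYTES):
    """Return the finest date granularity whose average partition size
    exceeds min_partition_bytes, or "unpartitioned" if none does."""
    for name, days in GRANULARITIES:
        if bytes_per_day * days > min_partition_bytes:
            return name
    return "unpartitioned"


if __name__ == "__main__":
    # Example daily volumes; real numbers would come from pipeline metrics.
    for label, volume in [("50 GiB/day", 50 * GIB), ("5 GiB/day", 5 * GIB),
                          ("0.1 GiB/day", 0.1 * GIB)]:
        print(label, "->", choose_date_granularity(volume))
```

Under these assumptions, a table receiving 50 GiB per day can be partitioned hourly, one receiving 5 GiB per day should stay at daily partitions, and one receiving 0.1 GiB per day is better partitioned by month. Entity-based partitioning can be checked the same way by substituting the expected bytes per entity value.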