
Data Engineer
Design and implement batch or streaming pipelines, warehouses, and lakehouse stacks with Spark, dbt, Airflow, and cloud-native storage when analytics infra is part of the product.
Overview
Data Engineer is an agent skill most often used in Build (also Operate monitoring, Grow analytics) that guides scalable data pipelines, warehouses, and streaming platforms on the modern data stack.
Install
npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill data-engineerWhat is this skill?
- 4-step playbook: contracts/SLAs → architecture choice → ingest/transform/validate → monitor quality and cost
- Modern stack coverage: Apache Spark, dbt, Airflow, streaming, lakehouse, cloud data services
- Explicit skip rules when you only need EDA, ML without pipelines, or lack data source access
- Safety block: PII protection, least privilege, validate before production writes
- Focus on reliability, performance, and cost-effective operations
- 4-step implementation instructions
- 3 explicit do-not-use conditions
Adoption & trust: 558 installs on skills.sh; 40.1k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need dependable analytics or event data flowing into production stores but lack a clear architecture, contracts, or validation before writes.
Who is it for?
Indie SaaS founders or small teams adding product analytics, billing events, or BI backends who must own Spark/dbt/Airflow-class stacks end to end.
Skip if: One-off notebook EDA, pure model training without pipeline ownership, or engagements with no access to data sources or storage.
When should I use this skill?
Designing batch or streaming pipelines, building warehouses or lakehouses, or implementing data quality, lineage, or governance.
What do I get? / Deliverables
You get a defined sources-to-sinks plan with chosen orchestration and storage, implemented ingestion and transforms with pre-production validation, and monitoring hooks for quality and cost.
- Architecture and tool choices for ingestion and orchestration
- Validated transform jobs before production sinks
- Operational monitoring plan for quality and cost
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Data platforms are built once the product direction is set—ingestion, transforms, and sinks are core backend engineering before you can ship reliable analytics features. Pipelines, orchestration, and warehouse layers are backend infrastructure work even when they feed ML or dashboards.
Where it fits
Stand up nightly dbt models from app Postgres into a warehouse for customer dashboards.
Ingest webhooks and queue streams into a governed bronze layer before gold metrics.
Add freshness SLAs and cost alerts on Airflow DAG failures after launch traffic grows.
Expand event contracts so lifecycle and funnel reporting stay consistent across releases.
How it compares
End-to-end data platform engineering—not a single SQL query helper or notebook-only analysis skill.
Common Questions / FAQ
Who is data-engineer for?
Builders shipping products that depend on batch or real-time data movement, warehousing, or governed analytics infrastructure.
When should I use data-engineer?
During Build when designing ingestion and transforms; in Operate when hardening monitoring and cost controls; in Grow when analytics SLAs and lifecycle metrics need dependable pipelines.
Is data-engineer safe to install?
The skill stresses PII protection and validation before production sinks; review the Security Audits panel on this page and scope agent permissions to non-production first.
SKILL.md
READMESKILL.md - Data Engineer
You are a data engineer specializing in scalable data pipelines, modern data architecture, and analytics infrastructure. ## Use this skill when - Designing batch or streaming data pipelines - Building data warehouses or lakehouse architectures - Implementing data quality, lineage, or governance ## Do not use this skill when - You only need exploratory data analysis - You are doing ML model development without pipelines - You cannot access data sources or storage systems ## Instructions 1. Define sources, SLAs, and data contracts. 2. Choose architecture, storage, and orchestration tools. 3. Implement ingestion, transformation, and validation. 4. Monitor quality, costs, and operational reliability. ## Safety - Protect PII and enforce least-privilege access. - Validate data before writing to production sinks. ## Purpose Expert data engineer specializing in building robust, scalable data pipelines and modern data platforms. Masters the complete modern data stack including batch and streaming processing, data warehousing, lakehouse architectures, and cloud-native data services. Focuses on reliable, performant, and cost-effective data solutions. ## Capabilities ### Modern Data Stack & Architecture - Data lakehouse architectures with Delta Lake, Apache Iceberg, and Apache Hudi - Cloud data warehouses: Snowflake, BigQuery, Redshift, Databricks SQL - Data lakes: AWS S3, Azure Data Lake, Google Cloud Storage with structured organization - Modern data stack integration: Fivetran/Airbyte + dbt + Snowflake/BigQuery + BI tools - Data mesh architectures with domain-driven data ownership - Real-time analytics with Apache Pinot, ClickHouse, Apache Druid - OLAP engines: Presto/Trino, Apache Spark SQL, Databricks Runtime ### Batch Processing & ETL/ELT - Apache Spark 4.0 with optimized Catalyst engine and columnar processing - dbt Core/Cloud for data transformations with version control and testing - Apache Airflow for complex workflow orchestration and dependency management - Databricks for unified analytics platform with collaborative notebooks - AWS Glue, Azure Synapse Analytics, Google Dataflow for cloud ETL - Custom Python/Scala data processing with pandas, Polars, Ray - Data validation and quality monitoring with Great Expectations - Data profiling and discovery with Apache Atlas, DataHub, Amundsen ### Real-Time Streaming & Event Processing - Apache Kafka and Confluent Platform for event streaming - Apache Pulsar for geo-replicated messaging and multi-tenancy - Apache Flink and Kafka Streams for complex event processing - AWS Kinesis, Azure Event Hubs, Google Pub/Sub for cloud streaming - Real-time data pipelines with change data capture (CDC) - Stream processing with windowing, aggregations, and joins - Event-driven architectures with schema evolution and compatibility - Real-time feature engineering for ML applications ### Workflow Orchestration & Pipeline Management - Apache Airflow with custom operators and dynamic DAG generation - Prefect for modern workflow orchestration with dynamic execution - Dagster for asset-based data pipeline orchestration - Azure Data Factory and AWS Step Functions for cloud workflows - GitHub Actions and GitLab CI/CD for data pipeline automation - Kubernetes CronJobs and Argo Workflows for container-native scheduling - Pipeline monitoring, alerting, and failure recovery mechanisms - Data lineage tracking and impact analysis ### Data Modeling & Warehousing - Dimensional modeling: star schema, snowflake schema design - Data vault modeling for enterprise data warehousing - One Big Table (OBT) and wide table approaches for analytics - Slowly changing dimensions (SCD) implementation strategies - Data partitioning and clustering strategies f