
Senior Data Engineer
Design ETL/ELT pipelines, data models, and DataOps patterns with Python, SQL, Spark, Airflow, dbt, and Kafka when your product needs reliable analytics infra.
Overview
senior-data-engineer is an agent skill most often used in Build (also Operate, Grow) that guides design and implementation of scalable data pipelines, models, and DataOps on the modern data stack.
Install
npx skills add https://github.com/alirezarezvani/claude-skills --skill senior-data-engineer
What is this skill?
- Trigger phrases for pipeline design, architecture (batch vs streaming, lakehouse), and modeling
- Workflows spanning ingestion, dimensional modeling, SCDs, and data vault patterns
- Data quality, validation, monitoring, and troubleshooting sections
- Modern stack coverage: Python, SQL, Spark, Airflow, dbt, Kafka
- Architecture decision framework for lambda vs kappa and late-arriving data
- Seven major SKILL.md sections including trigger phrases, workflows, architecture framework, and troubleshooting
- Tech stack explicitly lists Python, SQL, Spark, Airflow, dbt, and Kafka
Adoption & trust: 842 installs on skills.sh; 17.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need analytics and event data in production but only have brittle scripts, unclear batch vs streaming choices, and no quality gates on ingested tables.
Who is it for?
Indie SaaS founders and technical solos standing up warehouse/lakehouse pipelines, dbt projects, or Kafka ingestion without a dedicated data team.
Skip if: Simple CRUD apps with no analytics requirements, or teams wanting only a one-line SQL tweak with no pipeline or governance scope.
When should I use this skill?
Designing data architectures, building data pipelines, optimizing data workflows, implementing data governance, or troubleshooting data issues.
What do I get? / Deliverables
You get agent-led pipeline designs, modeling choices, orchestration patterns, and validation steps you can implement with Python, SQL, Spark, Airflow, dbt, and Kafka.
- Pipeline and architecture recommendations with batch/streaming rationale
- Data models (dimensional, SCD, or vault-oriented) and quality check plans
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Data architectures and ingestion are built alongside backend services before analytics can power growth decisions. Backend subphase covers pipeline orchestration, warehousing layers, and serving data to apps and dashboards.
Where it fits
Design batch ingestion from your app database into a warehouse with Airflow and dbt marts.
Troubleshoot late-arriving events and add validation checks when row counts diverge.
Model funnel and retention tables so lifecycle dashboards stay consistent.
How it compares
Broader data-platform methodology than a single dbt-linter or SQL formatter skill.
Common Questions / FAQ
Who is senior-data-engineer for?
Solo builders and small teams responsible for both product backend and analytics infra who need senior-level pipeline and modeling guidance from an agent.
When should I use senior-data-engineer?
At build when designing ingestion and models, at operate when pipelines fail or data drifts, and at grow when lifecycle and product analytics need dependable marts.
Is senior-data-engineer safe to install?
Consult the Security Audits panel on this page. Pipeline skills often imply shell, network, and database access, so scope credentials narrowly and restrict production writes.
SKILL.md
SKILL.md - Senior Data Engineer
# Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

## Table of Contents

1. [Trigger Phrases](#trigger-phrases)
2. [Quick Start](#quick-start)
3. [Workflows](#workflows)
4. [Architecture Decision Framework](#architecture-decision-framework)
5. [Tech Stack](#tech-stack)
6. [Reference Documentation](#reference-documentation)
7. [Troubleshooting](#troubleshooting)

---

## Trigger Phrases

Activate this skill when you see:

**Pipeline Design:**
- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."

**Architecture:**
- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"

**Data Modeling:**
- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"

**Data Quality:**
- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"

**Performance:**
- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"

---

## Quick Start

### Core Tools

```bash
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
```

---

## Workflows

→ See references/workflows.md for details

## Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

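As a rough illustration of how this framework can be encoded, the sketch below walks the batch-vs-streaming decision tree from this section in Python. The `PipelineRequirements` class, the `recommend_stack` function, and their field names are hypothetical and not part of the skill's scripts. Only the three questions, the 1 TB daily threshold, and the leaf labels come from the decision tree itself.

```python
from dataclasses import dataclass

# Threshold taken from the decision tree question "Is data volume > 1TB daily?"
DAILY_VOLUME_THRESHOLD_TB = 1.0


@dataclass(frozen=True)
class PipelineRequirements:
    """Hypothetical container for the three questions the decision tree asks."""

    needs_real_time: bool
    needs_exactly_once: bool = False
    daily_volume_tb: float = 0.0


def recommend_stack(req: PipelineRequirements) -> str:
    """Walk the decision tree and return the label of the leaf that is reached."""
    if req.needs_real_time:
        # Streaming branch: the only follow-up question is delivery semantics.
        if req.needs_exactly_once:
            return "Kafka + Flink/Spark Structured Streaming"
        return "Kafka + consumer groups"
    # Batch branch: the only follow-up question is daily data volume.
    if req.daily_volume_tb > DAILY_VOLUME_THRESHOLD_TB:
        return "Spark/Databricks"
    return "dbt + warehouse compute"


if __name__ == "__main__":
    nightly_reporting = PipelineRequirements(needs_real_time=False, daily_volume_tb=0.2)
    print(recommend_stack(nightly_reporting))
```

A function like this only captures the tree's first-pass answer. The other criteria in the framework, such as cost sensitivity and error handling, still need a human judgment call.
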
### Batch vs Streaming

| Criteria | Batch | Streaming |
|----------|-------|-----------|
| **Latency requirement** | Hours to days | Seconds to minutes |
| **Data volume** | Large historical datasets | Continuous event streams |
| **Processing complexity** | Complex transformations, ML | Simple aggregations, filtering |
| **Cost sensitivity** | More cost-effective | Higher infrastructure cost |
| **Error handling** | Easier to reprocess | Requires careful design |

**Decision Tree:**

```
Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
```

### Lambda vs Kappa Architecture

| Aspect | Lambda | Kappa |
|--------|--------|-------|
| **Complexity** | Two codebases (batch + stream) | Single codebase |
| **Maintenance** | Higher (sync batch/stream logic) | Lower |
| **Reprocessing** | Native batch layer | Replay from source |
| **Use case** | ML training + real-time serving | Pure event-driven |

**When to choose Lambda:**
- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

**When to choose Kappa:**
- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems

### Data Warehouse vs Data Lakehouse

| Feature | Warehouse (Snowflake/BigQuery) |