Enterprise Data Engineering Pipeline Ssis Pyspark

Name: Enterprise Data Engineering Pipeline Ssis Pyspark
Author: aradotso

aradotso/data-skills

1.7k installs
4 repo stars
Updated July 18, 2026

Enterprise Data Engineering Pipeline (SSIS + PySpark) is an end-to-end ELT solution that orchestrates extraction via SSIS, transforms data through SQL Server Integration Services packages, loads into a dimensional wareho

About

Complete enterprise data engineering solution combining SSIS for ETL orchestration, SQL Server with star schema dimensional modeling (fact and dimension tables), Python/Pandas for data quality audits, and PySpark for big data analytics. Ingests raw CSV files (Sales, Products, Customers), transforms via SSIS packages with error handling, loads into a dimensional warehouse, and performs analytics at scale using Spark JDBC connections to SQL Server. Includes incremental load patterns, automated refresh scheduling, and performance optimization for parallel processing across large datasets.

SSIS packages with lookup transformations, data conversion, and error logging for enterprise ETL orchestration
SQL Server star schema with fact_Sales and dimension tables (dim_Customers, dim_Products) plus BI views for revenue and
PySpark JDBC integration for high-volume monthly revenue aggregation, product category analysis, and parallel batch proc
Enterprise data warehousing with SSIS, SQL Server star schema, and PySpark analytics for millions of rows

Enterprise Data Engineering Pipeline Ssis Pyspark by the numbers

1,723 all-time installs (skills.sh)
+4 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #85 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

enterprise-data-engineering-pipeline-ssis-pyspark capabilities & compatibility

SQL Server licensing (Developer Edition free); compute for PySpark cluster (on-premises or cloud).

Capabilities: etl orchestration · data warehousing · incremental load · data quality audit · big data analytics · dimensional modeling · automated refresh · error handling
Works with: sql server
Use cases: data analysis · database
Runs: Local or remote
Pricing: Free

From the docs

What enterprise-data-engineering-pipeline-ssis-pyspark says it does

End-to-end ELT pipeline using SSIS, SQL Server, and PySpark for enterprise data warehousing and analytics

enterprise-data-engineering-pipeline-ssis-pyspark.md

The pipeline ingests raw CSV files (Sales, Products, Customers), transforms them through SSIS, loads into a dimensional model, and performs analytics at scale.

enterprise-data-engineering-pipeline-ssis-pyspark.md

npx skills add https://github.com/aradotso/data-skills --skill enterprise-data-engineering-pipeline-ssis-pyspark

Installs	1.7k
repo stars	★ 4
Security audit	2 / 3 scanners passed
Last updated	July 18, 2026
Repository	aradotso/data-skills ↗

What it does

Enterprise data warehousing with SSIS, SQL Server star schema, and PySpark analytics for millions of rows

Who is it for?

Organizations requiring scalable enterprise data warehousing with Microsoft SQL Server, complex ETL orchestration, and big data analytics on transactional and master data at scale.

Skip if: you need real-time streaming pipelines or graph databases, or your team does not have SQL Server and Java infrastructure available.

When should I use this skill?

Setting up enterprise ETL pipelines with SSIS, designing star schema data warehouses, implementing data quality audits, or processing millions of rows with PySpark analytics.

What you get

Deliver a production-grade data warehouse with automated ETL, validated data quality, incremental load patterns, and PySpark-driven analytics for revenue, customer segmentation, and category performance.

01_Schema_Setup.sql (dimension and fact tables with BI views)
EnterpriseETL.sln (SSIS project with data flow packages)
project_audit.py (null audit, orphan detection, revenue visualization)

By the numbers

2 dimension tables (dim_Customers, dim_Products) and 1 fact table (fact_Sales) in star schema
2 BI views (vw_RevenueByProduct, vw_CustomerLTV) for business intelligence
Processes millions of rows via PySpark parallel batch processing, configured with 4 GB of executor memory and 4 cores per executor

Files

SKILL.md (Markdown) · GitHub ↗

Enterprise Data Engineering Pipeline (SSIS + PySpark)

Skill by ara.so — Data Skills collection.

Overview

This project provides a complete enterprise data engineering solution that combines:

SSIS (SQL Server Integration Services) for ETL orchestration
SQL Server with Star Schema data warehouse design (fact and dimension tables)
Python (Pandas) for data quality audits and visualization
PySpark for big data analytics and aggregation

The pipeline ingests raw CSV files (Sales, Products, Customers), transforms them through SSIS, loads into a dimensional model, and performs analytics at scale.

Architecture Components

1. Source Layer: Raw CSV files containing transactional and master data
2. ETL Layer: SSIS packages handle extraction, transformation, error handling
3. Storage Layer: SQL Server Data Warehouse with Star Schema
4. Analytics Layer: Python/PySpark scripts for business intelligence

Installation & Setup

Prerequisites

# Required software
- SQL Server 2019+ (Developer or Enterprise Edition)
- SQL Server Integration Services (SSIS)
- Visual Studio with SQL Server Data Tools (SSDT)
- Python 3.10+
- Java 8+ (for PySpark)

Python Dependencies

pip install pandas sqlalchemy pyodbc pyspark matplotlib

Database Setup

-- 01_Schema_Setup.sql
-- Create the data warehouse database
CREATE DATABASE EnterpriseDataWarehouse;
GO

USE EnterpriseDataWarehouse;
GO

-- Dimension: Customers
CREATE TABLE dim_Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    Email NVARCHAR(100),
    Region NVARCHAR(50),
    RegistrationDate DATE
);

-- Dimension: Products
CREATE TABLE dim_Products (
    ProductID INT PRIMARY KEY,
    ProductName NVARCHAR(100),
    Category NVARCHAR(50),
    UnitPrice DECIMAL(10, 2)
);

-- Fact: Sales
CREATE TABLE fact_Sales (
    SaleID INT PRIMARY KEY,
    CustomerID INT FOREIGN KEY REFERENCES dim_Customers(CustomerID),
    ProductID INT FOREIGN KEY REFERENCES dim_Products(ProductID),
    Quantity INT,
    SaleDate DATE,
    TotalAmount DECIMAL(10, 2)
);

-- CREATE VIEW must be the first statement in its batch, so end the previous batch here
GO

-- Business Intelligence View: Revenue by Product
CREATE VIEW vw_RevenueByProduct AS
SELECT 
    p.ProductName,
    p.Category,
    SUM(s.TotalAmount) AS TotalRevenue,
    SUM(s.Quantity) AS TotalQuantity
FROM fact_Sales s
INNER JOIN dim_Products p ON s.ProductID = p.ProductID
GROUP BY p.ProductName, p.Category;

-- CREATE VIEW must be the first statement in its batch, so end the previous batch here
GO

-- Business Intelligence View: Customer Lifetime Value
CREATE VIEW vw_CustomerLTV AS
SELECT 
    c.CustomerID,
    c.CustomerName,
    c.Region,
    COUNT(s.SaleID) AS TotalPurchases,
    SUM(s.TotalAmount) AS LifetimeValue
FROM dim_Customers c
LEFT JOIN fact_Sales s ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID, c.CustomerName, c.Region;
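
The behavior of vw_CustomerLTV for customers with no sales is easy to miss, so here is a minimal sketch of it. It uses Python's built-in sqlite3 as a stand-in for SQL Server (an assumption made only so the example runs anywhere), trimmed-down tables, and made-up sample rows. The LEFT JOIN keeps every customer, so a customer without sales gets TotalPurchases = 0 and a NULL LifetimeValue.

```python
import sqlite3

# Stand-in for SQL Server: an in-memory SQLite database with trimmed-down
# versions of the tables above. All sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_Customers (
    CustomerID INTEGER PRIMARY KEY,
    CustomerName TEXT,
    Region TEXT
);
CREATE TABLE fact_Sales (
    SaleID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES dim_Customers(CustomerID),
    TotalAmount REAL
);
CREATE VIEW vw_CustomerLTV AS
SELECT c.CustomerID, c.CustomerName, c.Region,
       COUNT(s.SaleID) AS TotalPurchases,
       SUM(s.TotalAmount) AS LifetimeValue
FROM dim_Customers c
LEFT JOIN fact_Sales s ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerID, c.CustomerName, c.Region;

INSERT INTO dim_Customers VALUES (1, 'Ada', 'North'), (2, 'Grace', 'South');
INSERT INTO fact_Sales VALUES (10, 1, 250.0), (11, 1, 100.0);
""")

# Customer 2 has no sales: the count is 0 and the sum is NULL (None in Python).
ltv_rows = conn.execute(
    "SELECT CustomerID, TotalPurchases, LifetimeValue "
    "FROM vw_CustomerLTV ORDER BY CustomerID"
).fetchall()
print(ltv_rows)
```

Downstream code that bins or sums LifetimeValue may want to decide explicitly how such NULL values are treated.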

SSIS Package Configuration

Creating the SSIS Project

1. Open Visual Studio with SSDT
2. Create new Integration Services Project: EnterpriseETL.sln
3. Add Connection Managers:

Source_FlatFile: Points to CSV directory
Destination_OLEDB: SQL Server connection string

SSIS Package Flow

<!-- Key SSIS Components -->
<!-- Data Flow Task: Load dim_Customers -->
- Flat File Source (Customers.csv)
- Data Conversion (handle Unicode, trim strings)
- Derived Column (add audit columns)
- OLEDB Destination (dim_Customers)

<!-- Data Flow Task: Load dim_Products -->
- Flat File Source (Products.csv)
- Data Conversion (decimal precision for prices)
- OLEDB Destination (dim_Products)

<!-- Data Flow Task: Load fact_Sales -->
- Flat File Source (Sales.csv)
- Lookup Transformation (validate CustomerID, ProductID)
- Derived Column (calculate TotalAmount = Quantity * UnitPrice)
- OLEDB Destination (fact_Sales)
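
The fact_Sales data flow above can be sketched in plain Python to show what the Lookup and Derived Column steps are meant to do. This is an illustration of the logic only, not how SSIS executes it: the `route_sales` function, the in-memory dimension data, and the sample rows are all hypothetical.

```python
from decimal import Decimal

# Hypothetical in-memory stand-ins for the dimension tables and Sales.csv rows.
customer_ids = {1, 2}
unit_prices = {100: Decimal("19.99"), 101: Decimal("5.00")}  # ProductID -> UnitPrice

sales_rows = [
    {"SaleID": 1, "CustomerID": 1, "ProductID": 100, "Quantity": 3},
    {"SaleID": 2, "CustomerID": 9, "ProductID": 100, "Quantity": 1},  # unknown customer
    {"SaleID": 3, "CustomerID": 2, "ProductID": 555, "Quantity": 2},  # unknown product
]

def route_sales(rows):
    """Split rows into (matched, rejected), like a Lookup's match and no-match outputs."""
    matched, rejected = [], []
    for row in rows:
        if row["CustomerID"] not in customer_ids:
            rejected.append({**row, "Reason": "Unknown CustomerID"})
        elif row["ProductID"] not in unit_prices:
            rejected.append({**row, "Reason": "Unknown ProductID"})
        else:
            # Derived Column step: TotalAmount = Quantity * UnitPrice
            total = row["Quantity"] * unit_prices[row["ProductID"]]
            matched.append({**row, "TotalAmount": total})
    return matched, rejected

matched, rejected = route_sales(sales_rows)
print(matched)
print(rejected)
```

In this sketch, rejected rows carry a reason so they could be written to an error log instead of silently dropped.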

Error Handling in SSIS

-- Create error logging table
CREATE TABLE ETL_ErrorLog (
    ErrorID INT IDENTITY(1,1) PRIMARY KEY,
    PackageName NVARCHAR(100),
    TaskName NVARCHAR(100),
    ErrorDescription NVARCHAR(MAX),
    ErrorDate DATETIME DEFAULT GETDATE()
);

Python Analytics

Data Quality Audit Script

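# project_audit.py
import os
from urllib.parse import quote_plus

import pandas as pd
import pyodbc  # not called directly; SQLAlchemy's mssql+pyodbc dialect needs it installed
from sqlalchemy import create_engine
import matplotlib.pyplot as plt

# Database connection
def get_connection():
    # Build the raw ODBC connection string from environment variables
    odbc_str = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        f"SERVER={os.getenv('SQL_SERVER')};"
        f"DATABASE={os.getenv('SQL_DATABASE')};"
        "Trusted_Connection=yes;"
    )
    # URL-encode it so spaces, braces, and semicolons survive URL parsing
    return create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(odbc_str)}")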

# Data Quality Checks
def run_audit():
    engine = get_connection()
    
    # Check 1: Null values in critical columns
    query_nulls = """
    SELECT 
        'dim_Customers' AS TableName,
        SUM(CASE WHEN CustomerName IS NULL THEN 1 ELSE 0 END) AS NullCustomerName,
        SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS NullEmail
    FROM dim_Customers
    UNION ALL
    SELECT 
        'fact_Sales',
        SUM(CASE WHEN CustomerID IS NULL THEN 1 ELSE 0 END),
        SUM(CASE WHEN ProductID IS NULL THEN 1 ELSE 0 END)
    FROM fact_Sales
    """
    df_nulls = pd.read_sql(query_nulls, engine)
    print("Null Value Audit:")
    print(df_nulls)
    
    # Check 2: Orphaned records (referential integrity)
    query_orphans = """
    SELECT COUNT(*) AS OrphanedSales
    FROM fact_Sales s
    WHERE NOT EXISTS (SELECT 1 FROM dim_Customers c WHERE c.CustomerID = s.CustomerID)
       OR NOT EXISTS (SELECT 1 FROM dim_Products p WHERE p.ProductID = s.ProductID)
    """
    df_orphans = pd.read_sql(query_orphans, engine)
    print("\nOrphaned Records:")
    print(df_orphans)
    
    # Check 3: Revenue distribution
    query_revenue = "SELECT * FROM vw_RevenueByProduct ORDER BY TotalRevenue DESC"
    df_revenue = pd.read_sql(query_revenue, engine)
    
    # Visualization
    plt.figure(figsize=(10, 6))
    plt.bar(df_revenue['ProductName'][:10], df_revenue['TotalRevenue'][:10])
    plt.xlabel('Product')
    plt.ylabel('Total Revenue')
    plt.title('Top 10 Products by Revenue')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('revenue_analysis.png')
    print("\nRevenue chart saved to revenue_analysis.png")

if __name__ == "__main__":
    run_audit()
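
To see what the orphan check in `run_audit` reports, here is a small sketch that runs the same query against made-up rows. It uses sqlite3 as a stand-in for SQL Server and deliberately leaves the foreign keys off the fact table so that orphans can exist. With the constraints declared in 01_Schema_Setup.sql enabled, SQL Server would normally reject such rows at insert time, so this check matters most when constraints are disabled or bypassed during loading.

```python
import sqlite3

# In-memory stand-in with invented sample data; no foreign keys on fact_Sales
# so that orphaned rows can be inserted for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_Customers (CustomerID INTEGER PRIMARY KEY);
CREATE TABLE dim_Products (ProductID INTEGER PRIMARY KEY);
CREATE TABLE fact_Sales (
    SaleID INTEGER PRIMARY KEY,
    CustomerID INTEGER,
    ProductID INTEGER
);
INSERT INTO dim_Customers VALUES (1), (2);
INSERT INTO dim_Products VALUES (100);
INSERT INTO fact_Sales VALUES
    (10, 1, 100),   -- valid
    (11, 3, 100),   -- CustomerID 3 does not exist
    (12, 2, 999);   -- ProductID 999 does not exist
""")

# Same query as Check 2 in project_audit.py
orphaned_sales = conn.execute("""
    SELECT COUNT(*) AS OrphanedSales
    FROM fact_Sales s
    WHERE NOT EXISTS (SELECT 1 FROM dim_Customers c WHERE c.CustomerID = s.CustomerID)
       OR NOT EXISTS (SELECT 1 FROM dim_Products p WHERE p.ProductID = s.ProductID)
""").fetchone()[0]
print(orphaned_sales)
```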

Customer Segmentation Analysis

# customer_segmentation.py
import pandas as pd
from sqlalchemy import create_engine
import os

def segment_customers():
    engine = create_engine(
        f"mssql+pyodbc:///?odbc_connect="
        f"DRIVER={{ODBC Driver 17 for SQL Server}};"
        f"SERVER={os.getenv('SQL_SERVER')};"
        f"DATABASE={os.getenv('SQL_DATABASE')};"
        f"Trusted_Connection=yes;"
    )
    
    # Load customer LTV data
    df = pd.read_sql("SELECT * FROM vw_CustomerLTV", engine)
    
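    # Value-tier segmentation on lifetime value (the monetary dimension only)
    df['Segment'] = pd.cut(
        df['LifetimeValue'].fillna(0),  # customers with no sales have a NULL LifetimeValue
        bins=[0, 1000, 5000, float('inf')],
        labels=['Bronze', 'Silver', 'Gold'],
        include_lowest=True  # keep customers whose lifetime value is exactly 0
    )
    
    # Aggregate by segment
    segment_summary = df.groupby('Segment', observed=False).agg({
        'CustomerID': 'count',
        'LifetimeValue': 'sum',
        'TotalPurchases': 'mean'
    }).reset_index()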
    
    print(segment_summary)
    return segment_summary
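
The bin edges passed to `pd.cut` are right-closed by default, which decides where boundary values land. The sketch below reproduces that rule with the standard library only; the `tier` helper and its `include_lowest` flag are hypothetical names that mirror the `pd.cut` parameter of the same name, and `None` stands in for a missing value.

```python
import math
from bisect import bisect_left

TIER_UPPER_BOUNDS = [1000, 5000]   # right-closed edges from bins=[0, 1000, 5000, inf]
TIER_LABELS = ["Bronze", "Silver", "Gold"]

def tier(lifetime_value, include_lowest=False):
    """Return the tier label, or None where the value falls outside every bin."""
    if lifetime_value is None or math.isnan(lifetime_value) or lifetime_value < 0:
        return None
    if lifetime_value == 0 and not include_lowest:
        return None  # 0 lies outside the right-closed interval (0, 1000]
    return TIER_LABELS[bisect_left(TIER_UPPER_BOUNDS, lifetime_value)]

examples = [0, 1000, 1000.01, 5000, 7500]
tiers = [tier(v) for v in examples]
print(tiers)
print(tier(0, include_lowest=True))
```

A customer at exactly 1000 stays in Bronze and one at exactly 5000 stays in Silver; a lifetime value of 0 is only binned when the lowest edge is included.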

PySpark Big Data Processing

High-Volume Sales Aggregation

# pyspark_analytics.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, avg, year, month
import os

# Initialize Spark
spark = SparkSession.builder \
    .appName("EnterpriseSalesAnalytics") \
    .config("spark.jars", "mssql-jdbc-9.4.0.jre8.jar") \
    .getOrCreate()

# JDBC connection properties
jdbc_url = f"jdbc:sqlserver://{os.getenv('SQL_SERVER')}:1433;databaseName={os.getenv('SQL_DATABASE')}"
connection_properties = {
    "user": os.getenv('SQL_USER'),
    "password": os.getenv('SQL_PASSWORD'),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Load data from SQL Server
df_sales = spark.read.jdbc(
    url=jdbc_url,
    table="fact_Sales",
    properties=connection_properties
)

df_customers = spark.read.jdbc(
    url=jdbc_url,
    table="dim_Customers",
    properties=connection_properties
)

df_products = spark.read.jdbc(
    url=jdbc_url,
    table="dim_Products",
    properties=connection_properties
)

# Join and aggregate
df_combined = df_sales \
    .join(df_customers, "CustomerID") \
    .join(df_products, "ProductID")

# Monthly revenue analysis
df_monthly = df_combined \
    .withColumn("Year", year(col("SaleDate"))) \
    .withColumn("Month", month(col("SaleDate"))) \
    .groupBy("Year", "Month", "Region") \
    .agg(
        _sum("TotalAmount").alias("MonthlyRevenue"),
        count("SaleID").alias("TransactionCount"),
        avg("TotalAmount").alias("AvgTransactionValue")
    ) \
    .orderBy("Year", "Month", "Region")

df_monthly.show(20)

# Write results back to SQL Server
df_monthly.write.jdbc(
    url=jdbc_url,
    table="analytics_MonthlyRevenue",
    mode="overwrite",
    properties=connection_properties
)

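# Product performance by category
# countDistinct is the aggregate for distinct counts; Column objects have no .distinct() method
from pyspark.sql.functions import countDistinct

df_category = df_combined \
    .groupBy("Category") \
    .agg(
        _sum("TotalAmount").alias("CategoryRevenue"),
        _sum("Quantity").alias("TotalUnitsSold"),
        countDistinct("CustomerID").alias("UniqueCustomers")
    ) \
    .orderBy(col("CategoryRevenue").desc())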

df_category.show()

spark.stop()

Parallel Processing for Large Datasets

# batch_processing.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder \
    .appName("BatchDataProcessing") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "4") \
    .getOrCreate()

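# Read large CSV files
# Three slashes use the cluster's default namenode; with "hdfs://data/...",
# "data" would be parsed as the namenode host
df_raw = spark.read.csv(
    "hdfs:///data/sales/*.csv",
    header=True,
    inferSchema=True
)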

# Data quality transformations
df_cleaned = df_raw \
    .filter(col("TotalAmount").isNotNull()) \
    .filter(col("Quantity") > 0) \
    .withColumn(
        "IsHighValue",
        when(col("TotalAmount") > 1000, "Yes").otherwise("No")
    ) \
    .dropDuplicates(["SaleID"])

# Partition by date for efficient querying
# Three slashes use the cluster's default namenode (see the read path above)
df_cleaned.write \
    .partitionBy("SaleDate") \
    .mode("overwrite") \
    .parquet("hdfs:///data/processed/sales_cleaned")

print(f"Processed {df_cleaned.count()} records")
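
The cleaning rules in batch_processing.py can be checked on a handful of rows without a Spark cluster. The sketch below restates them in plain Python under stated assumptions: the `clean` function and the sample rows are hypothetical, and it keeps the first row per SaleID, whereas Spark's `dropDuplicates` does not guarantee which duplicate survives.

```python
# Invented sample rows covering each rule in the Spark job above.
raw_rows = [
    {"SaleID": 1, "Quantity": 2, "TotalAmount": 1500.0},
    {"SaleID": 1, "Quantity": 2, "TotalAmount": 1500.0},  # duplicate SaleID
    {"SaleID": 2, "Quantity": 0, "TotalAmount": 40.0},    # non-positive quantity
    {"SaleID": 3, "Quantity": 1, "TotalAmount": None},    # missing amount
    {"SaleID": 4, "Quantity": 5, "TotalAmount": 1000.0},  # exactly 1000 is not "> 1000"
]

def clean(rows):
    """Apply the same filters, flag, and de-duplication as the Spark job."""
    seen, cleaned = set(), []
    for row in rows:
        # Null amounts are dropped; null or non-positive quantities fail "Quantity > 0"
        if row["TotalAmount"] is None or row["Quantity"] is None or row["Quantity"] <= 0:
            continue
        if row["SaleID"] in seen:
            continue  # keep the first occurrence of each SaleID
        seen.add(row["SaleID"])
        flag = "Yes" if row["TotalAmount"] > 1000 else "No"
        cleaned.append({**row, "IsHighValue": flag})
    return cleaned

cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```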

Common Patterns

Incremental ETL (Load Only New Records)

-- Create staging table
CREATE TABLE stg_Sales (
    SaleID INT,
    CustomerID INT,
    ProductID INT,
    Quantity INT,
    SaleDate DATE,
    TotalAmount DECIMAL(10, 2),
    LoadDate DATETIME DEFAULT GETDATE()
);

-- Merge statement for incremental load
MERGE INTO fact_Sales AS target
USING stg_Sales AS source
ON target.SaleID = source.SaleID
WHEN MATCHED THEN
    UPDATE SET
        Quantity = source.Quantity,
        TotalAmount = source.TotalAmount
WHEN NOT MATCHED THEN
    INSERT (SaleID, CustomerID, ProductID, Quantity, SaleDate, TotalAmount)
    VALUES (source.SaleID, source.CustomerID, source.ProductID, 
            source.Quantity, source.SaleDate, source.TotalAmount);
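
The effect of the MERGE statement (update matching SaleIDs, insert the rest) can be tried locally with a small sketch. It uses sqlite3 and SQLite's `INSERT ... ON CONFLICT DO UPDATE` upsert as a stand-in for T-SQL MERGE, which assumes a Python build with SQLite 3.24 or newer; the tables are trimmed down and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_Sales (SaleID INTEGER PRIMARY KEY, Quantity INTEGER, TotalAmount REAL);
CREATE TABLE stg_Sales  (SaleID INTEGER, Quantity INTEGER, TotalAmount REAL);
INSERT INTO fact_Sales VALUES (1, 1, 10.0), (2, 2, 20.0);
INSERT INTO stg_Sales  VALUES (2, 5, 50.0), (3, 1, 15.0);  -- one update, one new row
""")

# Read the staged rows, then upsert each one into the fact table.
staged = conn.execute("SELECT SaleID, Quantity, TotalAmount FROM stg_Sales").fetchall()
conn.executemany(
    """
    INSERT INTO fact_Sales (SaleID, Quantity, TotalAmount) VALUES (?, ?, ?)
    ON CONFLICT(SaleID) DO UPDATE SET
        Quantity = excluded.Quantity,
        TotalAmount = excluded.TotalAmount
    """,
    staged,
)
conn.commit()

fact_rows = conn.execute("SELECT * FROM fact_Sales ORDER BY SaleID").fetchall()
print(fact_rows)
```

SaleID 1 is untouched, SaleID 2 takes the staged values, and SaleID 3 is inserted, which is the same outcome the MERGE above is written to produce.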

Automated Data Refresh

# scheduled_refresh.py
import subprocess
import os
from datetime import datetime

def run_ssis_package():
    """Execute SSIS package via dtexec"""
    package_path = r"C:\SSIS\EnterpriseETL\EnterpriseETL\Package.dtsx"
    
    cmd = [
        "dtexec",
        "/FILE", package_path,
        "/REPORTING", "E"
    ]
    
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    log_file = f"etl_log_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
    with open(log_file, 'w') as f:
        f.write(result.stdout)
        f.write(result.stderr)
    
    return result.returncode == 0

def run_analytics():
    """Execute Python analytics after ETL"""
    subprocess.run(["python", "project_audit.py"])
    subprocess.run(["python", "pyspark_analytics.py"])

if __name__ == "__main__":
    if run_ssis_package():
        print("ETL completed successfully")
        run_analytics()
    else:
        print("ETL failed - check logs")

Troubleshooting

SSIS Connection Issues

Error: "Cannot acquire connection to SQL Server"
Solution:
1. Verify SQL Server service is running
2. Check Windows Authentication vs SQL Authentication
3. Update connection string in Connection Manager
4. Enable TCP/IP protocol in SQL Server Configuration Manager

Unicode/Encoding Errors

Error: "Cannot convert between unicode and non-unicode string data types"
Solution in SSIS:
1. Add a Data Conversion transformation to the data flow
2. Convert DT_STR to DT_WSTR for Unicode columns
3. Set CodePage to 1252 (Windows Latin 1) in Flat File Connection
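
Why the code page matters can be shown in a few lines. In this sketch the byte string is a made-up example of how accented text might be stored in a Windows-1252 CSV file; the same bytes decode cleanly as cp1252 but are not valid UTF-8.

```python
# Hypothetical bytes as they might appear in a Windows-1252 encoded CSV field.
raw = b"Caf\xe9 M\xfcller"

# Decoding with the matching code page recovers the intended text.
text = raw.decode("cp1252")

# The same bytes are rejected by a UTF-8 decoder.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(text, utf8_ok)
```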

PySpark JDBC Driver Not Found

# Download Microsoft JDBC driver
wget https://github.com/microsoft/mssql-jdbc/releases/download/v9.4.0/mssql-jdbc-9.4.0.jre8.jar

# Add to Spark session
spark = SparkSession.builder \
    .config("spark.jars", "/path/to/mssql-jdbc-9.4.0.jre8.jar") \
    .getOrCreate()

Performance Optimization

# Enable broadcast join for small dimension tables
from pyspark.sql.functions import broadcast

df_result = df_sales.join(
    broadcast(df_products),
    "ProductID"
)

# Cache frequently accessed DataFrames
df_sales.cache()
df_sales.count()  # Trigger caching
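
A further option, not part of the original scripts, is to split the JDBC read itself. When `spark.read.jdbc` is called without partitioning arguments, the table is read through a single connection into one partition; passing `column`, `lowerBound`, `upperBound`, and `numPartitions` lets Spark issue several range queries in parallel. The sketch below only builds those keyword arguments so it runs without Spark: `partitioned_jdbc_kwargs` is a hypothetical helper, and the SaleID bounds and partition count are made-up values that would need to come from the real data.

```python
def partitioned_jdbc_kwargs(table, column, lower, upper, num_partitions, properties):
    """Build keyword arguments for spark.read.jdbc(url=jdbc_url, **kwargs)."""
    # Sanity checks of this sketch, not a statement of Spark's own validation rules
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return {
        "table": table,
        "column": column,            # numeric, date, or timestamp column to split on
        "lowerBound": lower,         # bounds set the partition stride; they do not filter rows
        "upperBound": upper,
        "numPartitions": num_partitions,
        "properties": properties,
    }

# Assumed values for illustration only
kwargs = partitioned_jdbc_kwargs(
    table="fact_Sales",
    column="SaleID",
    lower=1,
    upper=1_000_000,
    num_partitions=8,
    properties={"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"},
)
print(kwargs)

# With a live session this would be used as:
#   df_sales = spark.read.jdbc(url=jdbc_url, **kwargs)
```

Each partition opens its own connection, so the partition count is best kept within what the SQL Server instance can comfortably serve.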

Environment Variables

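# .env file (the scripts read these with os.getenv and do not load this file themselves,
# so export the values into the environment before running)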
SQL_SERVER=localhost
SQL_DATABASE=EnterpriseDataWarehouse
SQL_USER=your_username
SQL_PASSWORD=your_password

# For PySpark
SPARK_HOME=/path/to/spark
JAVA_HOME=/path/to/java
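
Because the scripts read their settings with `os.getenv`, the values in a .env file have to reach the process environment somehow. One way, sketched below with the standard library only, is a small loader; `load_env_file` is a hypothetical helper, and the `DEMO_` variable names are used only to keep the example from touching real settings.

```python
import os
import tempfile

def load_env_file(path):
    """Copy KEY=VALUE pairs from a .env-style file into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip()
            os.environ.setdefault(key, value)  # variables already set take precedence
            loaded[key] = os.environ[key]
    return loaded

# Demonstration with a temporary file and invented variable names
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# .env file\n")
        fh.write("DEMO_SQL_SERVER=localhost\n")
        fh.write("DEMO_SQL_DATABASE=EnterpriseDataWarehouse\n")
    loaded = load_env_file(env_path)

print(loaded)
```

Calling such a loader at the top of each script, before the first `os.getenv`, would make the .env file take effect; exporting the variables in the shell achieves the same result.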

Running the Complete Pipeline

# Step 1: Setup database
sqlcmd -S localhost -i 01_Schema_Setup.sql

# Step 2: Run SSIS package (via Visual Studio or dtexec)
dtexec /FILE "EnterpriseETL\Package.dtsx"

# Step 3: Run data quality audit
python project_audit.py

# Step 4: Run PySpark analytics
spark-submit --jars mssql-jdbc-9.4.0.jre8.jar pyspark_analytics.py

This enterprise pipeline provides a complete solution for data warehousing, ETL automation, and big data analytics using industry-standard Microsoft and Apache technologies.

How it compares

Choose this skill when your warehouse runs on Microsoft SSIS and SQL Server, rather than on a cloud-native stack built around Kafka or dbt medallion prototypes.

FAQ

What are the core components of the architecture?

Source Layer (raw CSV files), ETL Layer (SSIS packages with error handling), Storage Layer (SQL Server star schema with fact and dimension tables), and Analytics Layer (Python/Pandas and PySpark scripts).

How does PySpark connect to SQL Server for analytics?

Via JDBC driver (mssql-jdbc-9.4.0.jre8.jar) using connection properties with hostname, database name, user, and password. DataFrames are read from tables, transformed with joins and aggregations, and results written back to analytics tables.

What data quality checks are implemented?

Null value audits on critical columns, orphaned record detection via referential integrity checks, and revenue distribution visualization. The audit script prints the null counts and the number of orphaned records, and saves the revenue chart to revenue_analysis.png.

Is Enterprise Data Engineering Pipeline Ssis Pyspark safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & ML · databases · pipelines · etl