Data Engineering Patterns Fabric Databricks

Name: Data Engineering Patterns Fabric Databricks
Author: aradotso

aradotso/data-skills

1.3k installs
4 repo stars
Updated July 18, 2026
Look up production-ready Microsoft Fabric, Databricks, and PySpark patterns while designing lakehouse pipelines and governance.

About

data-engineering-patterns-fabric-databricks is a reference skill from ara.so’s Data Skills collection that gives solo builders and small data teams a searchable body of patterns for Microsoft Fabric, Azure Databricks, and PySpark. Instead of piecing lakehouse design from scattered docs, you invoke it when you need concrete guidance on pipelines, Delta Lake behavior, cluster tuning, Unity Catalog governance, streaming ingestion, or Fabric warehouse and Power BI integration. The catalog spans on the order of six hundred patterns split across Fabric-focused areas (Data Factory pipelines, lakehouse PySpark, SQL warehouse, architecture) and Databricks-focused areas (compute, workflows, Delta, streaming, SQL/Photon). It supports Build when you are standing up analytics infrastructure, Ship when you are hardening production pipelines, and Operate when you are optimizing cost and reliability. The skill is procedural knowledge for agents: ask pattern-shaped questions and apply answers to your repo or platform config rather than expecting a single generated artifact every time.

600+ field-tested patterns across Microsoft Fabric and Azure Databricks
12-book style organization: Fabric pipelines, lakehouse, warehouse, Power BI; Databricks clusters, Delta, streaming, Uni
Covers Delta Lake optimization, Auto Loader, Structured Streaming, Photon, and cost architecture
Explicit triggers for lakehouse architecture, governance, and production best practices

Data Engineering Patterns Fabric Databricks by the numbers

1,268 all-time installs (skills.sh)
+4 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #233 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/data-skills --skill data-engineering-patterns-fabric-databricks

Files

SKILL.md

Data Engineering Patterns - Fabric & Databricks

Skill by ara.so — Data Skills collection.

This skill provides access to 600+ field-tested data engineering patterns for Microsoft Fabric, Azure Databricks, and PySpark. These patterns cover everything from pipeline design and Delta Lake optimization to Unity Catalog governance and cost architecture.

What This Project Provides

A comprehensive collection of patterns organized into 12 books covering:

Microsoft Fabric (250 patterns):

Pipelines and Data Factory
Lakehouse and PySpark
Warehouse and SQL
Power BI in Fabric
Architecture Patterns

Azure Databricks (350 patterns):

Clusters and Compute
Delta Lake
Workflows and Orchestration
Structured Streaming and Auto Loader
Unity Catalog
Databricks SQL and Photon
Platform and Cost Architecture

PySpark:

88 concepts for production Spark across both platforms

Installation

Clone the repository to access all pattern PDFs:

git clone https://github.com/ssanjaychandra123/data-engineering-patterns.git
cd data-engineering-patterns

Repository Structure

data-engineering-patterns/
├── Fabric Patterns/
│   ├── Fabric Engineering Patterns Book I - Pipelines and Data Factory.pdf
│   ├── Fabric Engineering Patterns Book II - Lakehouse and PySpark.pdf
│   ├── Fabric Engineering Patterns Book III - Warehouse and SQL.pdf
│   ├── Fabric Engineering Patterns Book IV - Power BI in Fabric.pdf
│   └── Fabric Engineering Patterns Book V - Architecture Patterns.pdf
├── Databricks Patterns/
│   ├── Azure Databricks Engineering Patterns Book I - Clusters and Compute.pdf
│   ├── Azure Databricks Engineering Patterns Book II - Delta Lake.pdf
│   ├── Azure Databricks Engineering Patterns Book III - Workflows and Orchestration.pdf
│   ├── Azure Databricks Engineering Patterns Book IV - Structured Streaming and Auto Loader.pdf
│   ├── Azure Databricks Engineering Patterns Book V - Unity Catalog.pdf
│   ├── Azure Databricks Engineering Patterns Book VI - Databricks SQL and Photon.pdf
│   └── Azure Databricks Engineering Patterns Book VII - Platform and Cost Architecture.pdf
└── PySpark/
    └── The PySpark Handbook for Fabric and Databricks.pdf

Key Pattern Categories

Microsoft Fabric Patterns

Pipeline and Data Factory Patterns

Common patterns include:

Incremental data loading strategies
Pipeline retry and error handling
Parameter-driven pipeline design
Activity dependencies and control flow
Copy activity optimization
Metadata-driven frameworks

Example incremental load pattern in Fabric Pipeline:

# Notebook activity in Fabric pipeline

# Get pipeline parameters
watermark = spark.conf.get("pipeline.watermark")
table_name = spark.conf.get("pipeline.tableName")

# Read incremental data
df = spark.read.format("delta") \
    .load(f"abfss://source@storage.dfs.core.windows.net/{table_name}") \
    .filter(f"modified_date > '{watermark}'")

# Write to target
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(f"Tables/{table_name}")

# Return new watermark; keep the previous one when no new rows arrived,
# because max() over an empty result is None
new_watermark = df.agg({"modified_date": "max"}).collect()[0][0]
mssparkutils.notebook.exit(str(new_watermark if new_watermark is not None else watermark))
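The watermark hand-off above can be sketched in plain Python, without Spark. This is a minimal sketch under assumptions: `next_watermark` is a hypothetical helper, not part of Fabric or the pattern books, and it assumes ISO-8601 timestamp strings, which sort chronologically as plain strings.

```python
from typing import Iterable, Optional


def next_watermark(current: str, modified_dates: Iterable[Optional[str]]) -> str:
    """Return the watermark to store after a load (hypothetical helper).

    Assumes ISO-8601 strings, which compare chronologically as strings.
    """
    # Ignore NULL timestamps and anything at or before the current watermark
    newer = [d for d in modified_dates if d is not None and d > current]
    # An empty batch must not reset the watermark, so fall back to the old value
    return max(newer) if newer else current


batch = ["2024-01-15T08:00:00", None, "2024-01-15T09:30:00"]
advanced = next_watermark("2024-01-15T00:00:00", batch)
unchanged = next_watermark("2024-01-15T00:00:00", [])
```

The design point is that an empty or already-processed batch leaves the watermark where it was, so the next run does not reload everything.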

Lakehouse and PySpark Patterns

Key patterns for Fabric Lakehouse:

# Pattern: Upsert (merge) operation in Fabric Lakehouse
from delta.tables import DeltaTable

# Source data
updates_df = spark.read.format("parquet").load("Files/updates/")

# Target Delta table
target_table = DeltaTable.forPath(spark, "Tables/customers")

# Merge logic
target_table.alias("target").merge(
    updates_df.alias("updates"),
    "target.customer_id = updates.customer_id"
).whenMatchedUpdate(set={
    "name": "updates.name",
    "email": "updates.email",
    "updated_at": "updates.updated_at"
}).whenNotMatchedInsert(values={
    "customer_id": "updates.customer_id",
    "name": "updates.name",
    "email": "updates.email",
    "created_at": "updates.created_at",
    "updated_at": "updates.updated_at"
}).execute()

Pattern: Optimize Delta tables in Fabric:

# Optimize with Z-ordering for common query patterns
spark.sql(f"""
    OPTIMIZE lakehouse.customers
    ZORDER BY (customer_id, signup_date)
""")

# Vacuum old files (default 7 days retention)
spark.sql(f"""
    VACUUM lakehouse.customers RETAIN 168 HOURS
""")
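Retention is given in hours, which is easy to get wrong by hand. A small sketch of a statement builder follows; `vacuum_statement` is a hypothetical helper name, and the 7-day floor mirrors Delta Lake's default retention safety check rather than anything specific to the pattern books.

```python
def vacuum_statement(table: str, retention_days: int = 7) -> str:
    """Build a VACUUM statement with the retention expressed in hours."""
    if retention_days < 7:
        # Delta's default safety check rejects retention below 7 days (168 hours)
        raise ValueError("retention below 7 days risks deleting files still in use")
    return f"VACUUM {table} RETAIN {retention_days * 24} HOURS"


stmt = vacuum_statement("lakehouse.customers")
```

The resulting string can be passed to `spark.sql(...)` in a notebook.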

Warehouse and SQL Patterns

Pattern: Create warehouse tables with proper partitioning:

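-- Create partitioned Delta table with Spark SQL (for example, from a Fabric notebook)
-- Note: USING DELTA and PARTITIONED BY are Spark SQL syntax; the Fabric Warehouse T-SQL endpoint does not accept them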
CREATE TABLE dw.fact_sales (
    sale_id BIGINT,
    customer_id BIGINT,
    product_id BIGINT,
    sale_amount DECIMAL(18,2),
    sale_date DATE,
    created_at TIMESTAMP
)
USING DELTA
PARTITIONED BY (sale_date);

-- Insert with partition optimization
INSERT INTO dw.fact_sales
SELECT 
    sale_id,
    customer_id,
    product_id,
    sale_amount,
    CAST(sale_date AS DATE) as sale_date,
    created_at
FROM staging.sales
WHERE sale_date >= CURRENT_DATE - INTERVAL 7 DAYS;

Azure Databricks Patterns

Cluster and Compute Patterns

Pattern: Configure autoscaling cluster for cost optimization:

# Databricks cluster configuration (JSON)
{
  "cluster_name": "production-etl",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
  },
  "azure_attributes": {
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  }
}

Delta Lake Patterns

Pattern: Time travel and versioning:

# Read historical version of Delta table
df_version_10 = spark.read.format("delta") \
    .option("versionAsOf", 10) \
    .load("/mnt/delta/customers")

# Read table as of timestamp
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-15 00:00:00") \
    .load("/mnt/delta/customers")

# Describe history
history_df = spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/customers`")
history_df.select("version", "timestamp", "operation", "operationMetrics").show()

Pattern: Change Data Feed (CDF) for incremental processing:

# Enable CDF on table
spark.sql("""
    ALTER TABLE delta.customers 
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read changes between versions
changes_df = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 10) \
    .option("endingVersion", 20) \
    .table("delta.customers")

# Process different change types
inserts = changes_df.filter("_change_type = 'insert'")
updates = changes_df.filter("_change_type = 'update_postimage'")
deletes = changes_df.filter("_change_type = 'delete'")
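To make the change types concrete, here is a plain-Python sketch of how a consumer might fold change-feed rows into a keyed snapshot. It is an illustration under assumptions, not the Delta implementation: `apply_changes` is a hypothetical helper, and rows are modelled as dictionaries that carry the `_change_type` and `_commit_version` metadata columns.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_changes(state: Dict[int, Row], changes: List[Row]) -> Dict[int, Row]:
    """Apply change-feed rows, ordered by commit version, to a keyed snapshot."""
    result = dict(state)
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        key = row["customer_id"]
        kind = row["_change_type"]
        if kind in ("insert", "update_postimage"):
            # Keep only the business columns, not the change-feed metadata
            result[key] = {k: v for k, v in row.items() if not k.startswith("_")}
        elif kind == "delete":
            result.pop(key, None)
        # update_preimage rows carry the old values and are skipped here
    return result


snapshot = {
    1: {"customer_id": 1, "name": "Ada"},
    2: {"customer_id": 2, "name": "Bob"},
}
feed = [
    {"customer_id": 1, "name": "Ada", "_change_type": "update_preimage", "_commit_version": 11},
    {"customer_id": 1, "name": "Ada L.", "_change_type": "update_postimage", "_commit_version": 11},
    {"customer_id": 2, "name": "Bob", "_change_type": "delete", "_commit_version": 12},
    {"customer_id": 3, "name": "Cy", "_change_type": "insert", "_commit_version": 12},
]
current = apply_changes(snapshot, feed)
```

In Spark the same effect is usually achieved with a MERGE keyed on the business key, using the post-image rows as the source.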

Structured Streaming Patterns

Pattern: Auto Loader with schema evolution:

# Auto Loader with schema inference and evolution
checkpoint_path = "/mnt/checkpoints/raw_files"
target_path = "/mnt/delta/bronze/raw_data"

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", checkpoint_path + "/schema") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/mnt/landing/raw_files/")

# Write to Delta with checkpointing
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True) \
    .start(target_path)

query.awaitTermination()

Pattern: Streaming aggregations with watermarking:

from pyspark.sql.functions import window, col

# Read streaming data
stream_df = spark.readStream.format("delta") \
    .table("events")

# Windowed aggregation with watermark
aggregated = stream_df \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("user_id")
    ) \
    .agg({
        "event_id": "count",
        "amount": "sum"
    })

# Write to Delta table
query = aggregated.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/aggregations") \
    .toTable("event_aggregations")
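The interaction between the 5-minute window and the 10-minute watermark can be sketched in plain Python. This is a deliberately simplified model: Spark advances the watermark once per micro-batch, while this sketch updates it after every event, and the function names and sample events are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple

WINDOW = 300  # 5-minute tumbling window, in seconds
DELAY = 600   # 10-minute watermark delay, in seconds


def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW)


def aggregate(events: Iterable[Tuple[str, int, float]]):
    """events are (user_id, event_time_seconds, amount) in arrival order."""
    totals = defaultdict(lambda: [0, 0.0])  # (window_start, user_id) -> [count, sum]
    dropped = []
    max_seen = None
    for user_id, event_time, amount in events:
        watermark = max_seen - DELAY if max_seen is not None else None
        start = window_start(event_time)
        if watermark is not None and start + WINDOW <= watermark:
            # The window already closed, so the late event is discarded
            dropped.append((user_id, event_time, amount))
        else:
            bucket = totals[(start, user_id)]
            bucket[0] += 1
            bucket[1] += amount
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return {k: tuple(v) for k, v in totals.items()}, dropped


totals, dropped = aggregate([
    ("u1", 0, 5.0),
    ("u1", 120, 2.5),
    ("u2", 1000, 1.0),
    ("u1", 60, 9.0),   # window [0, 300) closed once the watermark reached 400
    ("u1", 450, 4.0),  # window [300, 600) is still open at watermark 400
])
```

A longer delay keeps windows open for more late data at the cost of more state held in memory.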

Unity Catalog Patterns

Pattern: Create governed table with row-level security:

# Create schema with Unity Catalog
# Unity Catalog schemas take MANAGED LOCATION rather than LOCATION
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS main.finance
    COMMENT 'Finance department data'
    MANAGED LOCATION 'abfss://data@storage.dfs.core.windows.net/finance'
""")

# Create managed table
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.finance.transactions (
        transaction_id BIGINT,
        account_id BIGINT,
        amount DECIMAL(18,2),
        region STRING,
        transaction_date DATE
    )
    USING DELTA
    TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

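# Apply row filter for data access control
# Caution: current_user() returns the caller's user name, so the fallback branch only
# matches rows whose region value equals that name; replace it with your own user-to-region mapping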
spark.sql("""
    CREATE FUNCTION main.finance.region_filter(region STRING)
    RETURN IF(
        IS_MEMBER('data_engineers'), 
        TRUE, 
        region = current_user()
    )
""")

spark.sql("""
    ALTER TABLE main.finance.transactions 
    SET ROW FILTER main.finance.region_filter ON (region)
""")

Pattern: Column masking with Unity Catalog:

# Create masking function
spark.sql("""
    CREATE FUNCTION main.finance.mask_ssn(ssn STRING)
    RETURN CASE 
        WHEN IS_MEMBER('finance_managers') THEN ssn
        ELSE CONCAT('XXX-XX-', RIGHT(ssn, 4))
    END
""")

# Apply column mask
spark.sql("""
    ALTER TABLE main.finance.customers 
    ALTER COLUMN ssn 
    SET MASK main.finance.mask_ssn
""")
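The masking rule itself is easy to check outside the platform. The following is a plain-Python mirror of the SQL function for unit-testing the expected output; the `is_finance_manager` flag stands in for the group-membership check and is an assumption of this sketch.

```python
def mask_ssn(ssn: str, is_finance_manager: bool) -> str:
    """Mirror of the SQL mask: show the full value only to finance managers."""
    if is_finance_manager:
        return ssn
    # Keep the last four characters, as RIGHT(ssn, 4) does in the SQL version
    return "XXX-XX-" + ssn[-4:]


masked = mask_ssn("123-45-6789", is_finance_manager=False)
clear = mask_ssn("123-45-6789", is_finance_manager=True)
```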

Workflows and Orchestration Patterns

Pattern: Create parameterized Databricks job:

# In notebook: Get job parameters
dbutils.widgets.text("date", "")
dbutils.widgets.text("environment", "prod")

processing_date = dbutils.widgets.get("date")
env = dbutils.widgets.get("environment")

# Use parameters in processing
df = spark.read.format("delta") \
    .load(f"/mnt/{env}/data") \
    .filter(f"date = '{processing_date}'")

# Process and write results
result_df = df.groupBy("category").count()
result_df.write.format("delta").mode("overwrite") \
    .save(f"/mnt/{env}/results/{processing_date}")

# Return status for orchestration
dbutils.notebook.exit(f"Processed {result_df.count()} records")

Pattern: Job definition with retry logic:

{
  "name": "daily-etl-pipeline",
  "tasks": [
    {
      "task_key": "extract",
      "notebook_task": {
        "notebook_path": "/Workflows/extract",
        "base_parameters": {
          "date": "{{job.start_time.iso_date}}",
          "environment": "prod"
        }
      },
      "existing_cluster_id": "{{cluster_id}}",
      "max_retries": 2,
      "timeout_seconds": 3600
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "extract"}],
      "notebook_task": {
        "notebook_path": "/Workflows/transform",
        "base_parameters": {
          "date": "{{job.start_time.iso_date}}"
        }
      },
      "existing_cluster_id": "{{cluster_id}}",
      "max_retries": 1
    },
    {
      "task_key": "load",
      "depends_on": [{"task_key": "transform"}],
      "notebook_task": {
        "notebook_path": "/Workflows/load"
      },
      "existing_cluster_id": "{{cluster_id}}"
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
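Before submitting a job definition like the one above, it can help to confirm that every `depends_on` entry points at a real task and that the graph has no cycles. A minimal sketch using only the Python standard library follows; `execution_order` is a hypothetical helper and the trimmed `job` dictionary is illustrative.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_order(job: dict) -> list:
    """Return task keys in an order that respects depends_on."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job["tasks"]
    }
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"depends_on references unknown tasks: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the tasks form a cycle
    return list(TopologicalSorter(graph).static_order())


job = {
    "name": "daily-etl-pipeline",
    "tasks": [
        {"task_key": "load", "depends_on": [{"task_key": "transform"}]},
        {"task_key": "extract"},
        {"task_key": "transform", "depends_on": [{"task_key": "extract"}]},
    ],
}
order = execution_order(job)
```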

PySpark Production Patterns

Broadcast Join Pattern

from pyspark.sql.functions import broadcast

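# Small dimension table: it must fit comfortably in each executor's memory
# (Spark only broadcasts automatically below spark.sql.autoBroadcastJoinThreshold, 10 MB by default)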
dim_products = spark.table("dim.products")

# Large fact table
fact_sales = spark.table("fact.sales")

# Use broadcast join to avoid shuffle
result = fact_sales.join(
    broadcast(dim_products),
    fact_sales.product_id == dim_products.product_id,
    "left"
)

Partitioning and Bucketing Pattern

# Write with optimal partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .option("maxRecordsPerFile", 1000000) \
    .save("/mnt/delta/partitioned_data")

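# Create bucketed table for join optimization
# Delta tables do not support bucketBy, so this table is written as Parquet;
# for Delta tables, consider Z-ordering or liquid clustering instead
df.write.format("parquet") \
    .mode("overwrite") \
    .bucketBy(100, "customer_id") \
    .sortBy("transaction_date") \
    .saveAsTable("bucketed_transactions")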

Error Handling Pattern

from pyspark.sql.functions import col, when, lit
from pyspark.sql.utils import AnalysisException

try:
    # Delta enforces the table schema itself, so no read option is needed
    df = spark.read.format("delta").load("/mnt/delta/source")

    # Data quality rules; a NULL amount counts as invalid so that every row
    # lands in exactly one of the two outputs
    bad_amount = col("amount").isNull() | (col("amount") <= 0)
    bad_customer = col("customer_id").isNull()

    valid_df = df.filter(~bad_amount & ~bad_customer)

    invalid_df = df.filter(bad_amount | bad_customer) \
        .withColumn("error_reason",
            when(bad_amount, lit("Invalid amount"))
            .when(bad_customer, lit("Missing customer_id"))
        )

    # Write valid records
    valid_df.write.format("delta").mode("append") \
        .save("/mnt/delta/target")

    # Write invalid records to quarantine
    if invalid_df.count() > 0:
        invalid_df.write.format("delta").mode("append") \
            .save("/mnt/delta/quarantine")

except AnalysisException as e:
    # Raised for a missing path, an unresolved column, or an incompatible schema on write
    print(f"Analysis error: {str(e)}")
    raise
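The same validation rules can be exercised in plain Python, which makes the quarantine logic easy to unit test. This sketch adds one assumption of its own: a missing amount is treated as invalid so that every row lands in exactly one output. `split_records` and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple


def split_records(rows: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split rows into (valid, quarantined), tagging each rejected row with a reason."""
    valid, quarantined = [], []
    for row in rows:
        amount = row.get("amount")
        if amount is None or amount <= 0:
            quarantined.append({**row, "error_reason": "Invalid amount"})
        elif row.get("customer_id") is None:
            quarantined.append({**row, "error_reason": "Missing customer_id"})
        else:
            valid.append(row)
    return valid, quarantined


rows = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": -5.0},
    {"customer_id": None, "amount": 3.0},
    {"customer_id": 4, "amount": None},
]
valid, quarantined = split_records(rows)
```

Checking that the two outputs together contain every input row is a cheap guard against records disappearing silently.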

Performance Optimization Pattern

from pyspark.sql.functions import col, current_timestamp

# Cache frequently accessed data
df_cached = spark.table("dimension.products") \
    .filter(col("is_active") == True) \
    .cache()

# Use persist for complex operations
from pyspark.storagelevel import StorageLevel
df_persisted = large_df.repartition(200, "partition_key") \
    .persist(StorageLevel.MEMORY_AND_DISK)

# Adaptive Query Execution settings
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Dynamic partition pruning
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

Common Use Cases

Medallion Architecture Pattern

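# Bronze layer: Raw data ingestion
# Auto Loader (cloudFiles) is a streaming source, so it needs readStream/writeStream
# The checkpoint and schema paths below are illustrative
from pyspark.sql.functions import current_timestamp

bronze_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze/raw_events/schema") \
    .load("/mnt/landing/") \
    .withColumn("ingestion_time", current_timestamp())

bronze_df.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/bronze/raw_events") \
    .trigger(availableNow=True) \
    .start("/mnt/delta/bronze/raw_events") \
    .awaitTermination()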

# Silver layer: Cleaned and conformed
from pyspark.sql.functions import col, to_timestamp

silver_df = spark.read.format("delta") \
    .load("/mnt/delta/bronze/raw_events") \
    .filter(col("event_type").isNotNull()) \
    .withColumn("event_timestamp", to_timestamp("timestamp")) \
    .dropDuplicates(["event_id"]) \
    .select("event_id", "event_type", "user_id", "event_timestamp", "properties")

silver_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/silver/events")

# Gold layer: Business aggregates
gold_df = spark.read.format("delta") \
    .load("/mnt/delta/silver/events") \
    .groupBy("user_id", "event_type") \
    .agg({
        "event_id": "count",
        "event_timestamp": "max"
    })

gold_df.write.format("delta") \
    .mode("overwrite") \
    .save("/mnt/delta/gold/user_event_summary")

SCD Type 2 Pattern

from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_timestamp, lit

# Source changes
source_df = spark.read.format("parquet").load("/mnt/staging/customers")

# Target dimension
target_table = DeltaTable.forPath(spark, "/mnt/delta/dim_customers")

# Identify new and changed records; keep only the source columns so that
# later references to name and email are unambiguous
current_target = target_table.toDF().filter("is_current = true")

changes = source_df.alias("source") \
    .join(
        current_target.alias("target"),
        col("source.customer_id") == col("target.customer_id"),
        "left"
    ) \
    .filter(
        col("target.customer_id").isNull() |  # New records
        ~col("source.name").eqNullSafe(col("target.name")) |  # Changed records (null-safe)
        ~col("source.email").eqNullSafe(col("target.email"))
    ) \
    .select("source.*")

# Expire old records
target_table.alias("target").merge(
    changes.alias("changes"),
    "target.customer_id = changes.customer_id AND target.is_current = true"
).whenMatchedUpdate(set={
    "is_current": lit(False),
    "end_date": current_timestamp()
}).execute()

# Insert new versions; cast the NULL so end_date has a timestamp type
new_records = changes.select(
    col("customer_id"),
    col("name"),
    col("email"),
    current_timestamp().alias("start_date"),
    lit(None).cast("timestamp").alias("end_date"),
    lit(True).alias("is_current")
)

new_records.write.format("delta").mode("append") \
    .save("/mnt/delta/dim_customers")
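The expire-then-insert logic of SCD Type 2 can be traced on a handful of rows in plain Python. This is a sketch for reasoning about the expected result, not a replacement for the Delta merge; `apply_scd2`, `TRACKED`, and the sample data are hypothetical.

```python
from datetime import datetime
from typing import Dict, List

TRACKED = ("name", "email")  # columns whose changes create a new version


def apply_scd2(dimension: List[Dict], source: List[Dict], now: datetime) -> List[Dict]:
    """Return a new dimension with changed rows expired and new versions appended."""
    result = [dict(row) for row in dimension]  # never mutate the input
    current = {row["customer_id"]: row for row in result if row["is_current"]}
    for src in source:
        existing = current.get(src["customer_id"])
        if existing is not None and all(existing[c] == src[c] for c in TRACKED):
            continue  # unchanged, nothing to do
        if existing is not None:
            # Expire the old version
            existing["is_current"] = False
            existing["end_date"] = now
        # Insert the new version (also covers brand-new customers)
        result.append({
            "customer_id": src["customer_id"],
            **{c: src[c] for c in TRACKED},
            "start_date": now,
            "end_date": None,
            "is_current": True,
        })
    return result


t0 = datetime(2024, 1, 1)
now = datetime(2024, 1, 15)
dimension = [
    {"customer_id": 1, "name": "Ada", "email": "ada@old.example",
     "start_date": t0, "end_date": None, "is_current": True},
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com",
     "start_date": t0, "end_date": None, "is_current": True},
]
source = [
    {"customer_id": 1, "name": "Ada", "email": "ada@new.example"},
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com"},
    {"customer_id": 3, "name": "Cy", "email": "cy@example.com"},
]
result = apply_scd2(dimension, source, now)
```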

Configuration Best Practices

Fabric Configuration

# Set Fabric notebook session configuration
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Access Fabric environment variables
from notebookutils import mssparkutils

# Get secrets from Key Vault
storage_key = mssparkutils.credentials.getSecret(
    "https://keyvault.vault.azure.net/",
    "storage-account-key"
)

# Access workspace identity
workspace_id = mssparkutils.env.getWorkspaceId()

Databricks Configuration

# Access Databricks secrets
storage_account_key = dbutils.secrets.get(
    scope="azure-key-vault",
    key="storage-account-key"
)

# Mount storage with a service principal (OAuth client credentials)
# storage_account and tenant_id are placeholders you must define;
# the "client-secret" key name is an assumed example
dbutils.fs.mount(
    source=f"abfss://data@{storage_account}.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": 
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("azure-sp", "client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("azure-sp", "client-secret"),
        "fs.azure.account.oauth2.client.endpoint": 
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    }
)

# Optimize cluster for specific workload
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
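A fixed value such as 200 shuffle partitions suits some data volumes and not others. One common rule of thumb, sketched here as an assumption rather than a recommendation from the pattern books, is to divide the data volume by a target partition size such as the 128 MB used above. `estimate_partitions` is a hypothetical helper.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # 134217728, the 128 MB used above


def estimate_partitions(total_bytes: int,
                        target_bytes: int = TARGET_PARTITION_BYTES,
                        minimum: int = 1) -> int:
    """Estimate a partition count so each partition is roughly target_bytes."""
    return max(minimum, math.ceil(total_bytes / target_bytes))


partitions_for_50_gib = estimate_partitions(50 * 1024 ** 3)
```

The result could be passed to `spark.conf.set("spark.sql.shuffle.partitions", ...)` as a string; with Adaptive Query Execution enabled, Spark may still coalesce partitions at run time.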

Troubleshooting

Performance Issues

Problem: Slow joins causing job timeouts

from pyspark.sql.functions import col, concat, lit, rand

# Check partition distribution
df.rdd.getNumPartitions()  # Should be 200-2000 for most workloads

# Identify data skew
df.groupBy("partition_key").count().orderBy(col("count").desc()).show()

# Solution: Repartition with salt for skewed keys
df_balanced = df.withColumn("salt", (rand() * 10).cast("int")) \
    .withColumn("salted_key", concat(col("partition_key"), lit("_"), col("salt"))) \
    .repartition(200, "salted_key")
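The effect of salting can be seen on a toy key distribution in plain Python. This sketch uses a deterministic salt (row index modulo the salt count) so the result is reproducible, whereas the Spark snippet uses `rand()`; the key names and counts are made up for illustration.

```python
from collections import Counter

NUM_SALTS = 10


def salted_key(key: str, row_index: int) -> str:
    """Append a salt in [0, NUM_SALTS) to spread one hot key over several groups."""
    return f"{key}_{row_index % NUM_SALTS}"


keys = ["hot"] * 1000 + ["cold"] * 10
before = Counter(keys)
after = Counter(salted_key(k, i) for i, k in enumerate(keys))

largest_before = max(before.values())
largest_after = max(after.values())
```

Salting helps repartitioning and aggregation directly; for a join, the other side has to be expanded with every salt value so that matching rows still meet.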

Problem: Small file problem in Delta tables

# Check file sizes
spark.sql("DESCRIBE DETAIL delta.`/mnt/delta/table`").select("numFiles", "sizeInBytes").show()

# Solution: Compact small files
spark.sql("OPTIMIZE delta.`/mnt/delta/table`")

# For partitioned tables
spark.sql("OPTIMIZE delta.`/mnt/delta/table` WHERE date >= '2024-01-01'")

Schema Evolution Issues

Problem: Schema mismatch errors when appending data

# Enable automatic schema merging
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/mnt/delta/table")

# Or allow schema overwrite
df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/delta/table")

# Check current schema
spark.read.format("delta").load("/mnt/delta/table").printSchema()

Memory Issues

Problem: Out of memory errors during processing

# Solution 1: Increase partition count to reduce partition size
df_repartitioned = df.repartition(400)

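# Solution 2: Number rows within each category (for example, to process them in bounded chunks)
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window_spec = Window.partitionBy("category").orderBy("date")
df_windowed = df.withColumn("row_num", row_number().over(window_spec))

# Solution 3: Adjust the unified memory split between execution and storage
# These are startup settings: put them in the cluster or environment Spark config,
# because spark.conf.set in a running session does not apply them
# spark.memory.fraction 0.6
# spark.memory.storageFraction 0.3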

Streaming Issues

Problem: Checkpoint directory conflicts

# Always use unique checkpoint locations per stream
checkpoint_base = "/mnt/checkpoints"
stream_id = "user_events_stream"

query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", f"{checkpoint_base}/{stream_id}") \
    .start("/mnt/delta/target")

# To restart stream from beginning, delete checkpoint
# dbutils.fs.rm(f"{checkpoint_base}/{stream_id}", True)

Problem: Watermark not advancing

# Ensure event time column is properly formatted
from pyspark.sql.functions import col, to_timestamp

df_with_timestamp = df.withColumn(
    "event_time",
    to_timestamp(col("timestamp_string"), "yyyy-MM-dd HH:mm:ss")
)

# Set appropriate watermark delay
stream_df = df_with_timestamp.withWatermark("event_time", "30 minutes")

Cost Optimization Patterns

Databricks Cost Optimization

# Use cluster pools for faster startup
cluster_config = {
    "instance_pool_id": "pool-abc123",
    "autotermination_minutes": 15,
    "autoscale": {
        "min_workers": 1,
        "max_workers": 10
    }
}

# Use spot instances for non-critical workloads (Azure Databricks)
azure_attributes = {
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "first_on_demand": 1,  # keep the driver on an on-demand VM
    "spot_bid_max_price": -1  # -1 caps the bid at the on-demand price
}

# Optimize table for reduced storage and faster queries
spark.sql("""
    OPTIMIZE prod.sales_transactions
    ZORDER BY (customer_id, transaction_date)
""")

# Remove old versions to reduce storage costs
spark.sql("VACUUM prod.sales_transactions RETAIN 168 HOURS")

Fabric Cost Optimization

# Use on-demand capacity for variable workloads
# Set idle timeout for capacity auto-pause

# Optimize pipeline runs
# - Use copy activity instead of foreach + copy for bulk operations
# - Batch small files before processing
# - Use incremental loads instead of full refreshes

# Compress data at rest
df.write.format("delta") \
    .option("compression", "zstd") \
    .mode("overwrite") \
    .save("Tables/compressed_data")

Resources

Pattern PDFs: All 12 books are available in the repository under Fabric Patterns/, Databricks Patterns/, and PySpark/
Microsoft Fabric Documentation: https://learn.microsoft.com/fabric/
Azure Databricks Documentation: https://learn.microsoft.com/azure/databricks/
Delta Lake Documentation: https://docs.delta.io/
PySpark API Reference: https://spark.apache.org/docs/latest/api/python/

Author

Sanjay Chandra - Enterprise data platform architect and advisor

LinkedIn: https://www.linkedin.com/in/ssanjaychandra/
Website: http://www.ssanjaychandra.com

---

These patterns are compiled from real production implementations across Microsoft Fabric and Azure Databricks platforms. The material is continuously updated as platforms evolve.
