
Fabric Lakehouse
Design Microsoft Fabric lakehouse pipelines with Data Factory activities, Spark notebooks, bronze/silver/gold layers, and Delta maintenance on OneLake.
Overview
fabric-lakehouse is an agent skill most often used in Build (also Operate infra) that documents Microsoft Fabric lakehouse ETL orchestration, Spark ingestion, and Delta maintenance patterns.
Install
npx skills add https://github.com/github/awesome-copilot --skill fabric-lakehouseWhat is this skill?
- Data Factory surface: 180+ connectors, copy activity, Dataflow Gen2, notebook activity, schedules and triggers
- Pipeline activity table: Copy Data, Notebook, Dataflow, Stored Procedure, ForEach, If Condition, Get Metadata, Lakehouse
- Daily_ETL_Pipeline orchestration pattern: metadata gate, per-file bronze copy, silver notebook, gold aggregate, optimize
- Spark best-practice conf flags for V-Order and optimizeWrite on Fabric
- Python examples for CSV/JSON loads under `Files/bronze/` paths on the lakehouse
- 180+ connectors referenced for Microsoft Fabric Data Factory data sources
- Eight pipeline activity types documented in the activity table (Copy Data, Notebook, Dataflow, Stored Procedure, ForEach
Adoption & trust: 8.5k installs on skills.sh; 34.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need a repeatable Fabric lakehouse pipeline but are unsure which Data Factory activities to chain for bronze/silver/gold and how to tune Spark on OneLake.
Who is it for?
Indie builders shipping internal analytics, BI-heavy SaaS, or client lakehouse deliverables on Microsoft Fabric.
Skip if: Teams on AWS-only or Databricks-only stacks with no Fabric tenant, or simple CRUD APIs without a medallion data layer.
When should I use this skill?
User asks about Fabric lakehouse ETL, Data Factory pipelines, Spark reads on OneLake, medallion layers, or Delta table maintenance.
What do I get? / Deliverables
You get a concrete pipeline blueprint, activity matrix, and Spark config snippets your agent can turn into notebooks, copy steps, and maintenance jobs on Fabric.
- Pipeline orchestration outline (activities and dependencies)
- Spark configuration and bronze-layer read examples for agent implementation
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Lakehouse and ETL orchestration are core backend/data-platform build work for solo builders shipping analytics products, even though pipelines also run in Operate. Backend is the right shelf for Spark reads, copy activities, notebook transforms, and medallion layering—not frontend or pure DevOps deploy scripts alone.
Where it fits
Scaffold a Daily_ETL_Pipeline with Get Metadata, ForEach bronze copy, and silver Notebook activities.
Add Lakehouse Maintenance optimize/vacuum after nightly gold aggregation jobs.
Compare Fabric Data Factory connectors against your source systems before committing to OneLake.
How it compares
Procedural Fabric lakehouse playbook—not a generic dbt skill and not an MCP connector to live Fabric APIs.
Common Questions / FAQ
Who is fabric-lakehouse for?
Solo developers and tiny data teams building on Microsoft Fabric who want agents to follow medallion ETL and Data Factory orchestration conventions.
When should I use fabric-lakehouse?
Use it in Build while designing pipelines and Spark notebooks; in Operate when scheduling optimize/vacuum and monitoring daily ETL; or in Validate when scoping whether Fabric fits your data product.
Is fabric-lakehouse safe to install?
It is documentation-only in Prism’s ingest—review the Security Audits panel on this page; actual Fabric access still depends on your workspace credentials and agent permissions.
SKILL.md
READMESKILL.md - Fabric Lakehouse
### Data Factory Integration Microsoft Fabric includes Data Factory for ETL/ELT orchestration: - **180+ connectors** for data sources - **Copy activity** for data movement - **Dataflow Gen2** for transformations - **Notebook activity** for Spark processing - **Scheduling** and triggers ### Pipeline Activities | Activity | Description | |----------|-------------| | Copy Data | Move data between sources and Lakehouse | | Notebook | Execute Spark notebooks | | Dataflow | Run Dataflow Gen2 transformations | | Stored Procedure | Execute SQL procedures | | ForEach | Loop over items | | If Condition | Conditional branching | | Get Metadata | Retrieve file/folder metadata | | Lakehouse Maintenance | Optimize and vacuum Delta tables | ### Orchestration Patterns ``` Pipeline: Daily_ETL_Pipeline ├── Get Metadata (check for new files) ├── ForEach (process each file) │ ├── Copy Data (bronze layer) │ └── Notebook (silver transformation) ├── Notebook (gold aggregation) └── Lakehouse Maintenance (optimize tables) ``` --- ### Spark Configuration (Best Practices) ```python # Enable Fabric optimizations spark.conf.set("spark.sql.parquet.vorder.enabled", "true") spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") ``` ### Reading Data ```python # Read CSV file df = spark.read.format("csv") \ .option("header", "true") \ .option("inferSchema", "true") \ .load("Files/bronze/data.csv") # Read JSON file df = spark.read.format("json").load("Files/bronze/data.json") # Read Parquet file df = spark.read.format("parquet").load("Files/bronze/data.parquet") # Read Delta table df = spark.read.table("my_delta_table") # Read from SQL endpoint df = spark.sql("SELECT * FROM lakehouse.my_table") ``` ### Writing Delta Tables ```python # Write DataFrame as managed Delta table df.write.format("delta") \ .mode("overwrite") \ .saveAsTable("silver_customers") # Write with partitioning df.write.format("delta") \ .mode("overwrite") \ .partitionBy("year", "month") \ .saveAsTable("silver_transactions") # Append to existing table df.write.format("delta") \ .mode("append") \ .saveAsTable("silver_events") ``` ### Delta Table Operations (CRUD) ```python # UPDATE spark.sql(""" UPDATE silver_customers SET status = 'active' WHERE last_login > '2024-01-01' -- Example date, adjust as needed """) # DELETE spark.sql(""" DELETE FROM silver_customers WHERE is_deleted = true """) # MERGE (Upsert) spark.sql(""" MERGE INTO silver_customers AS target USING staging_customers AS source ON target.customer_id = source.customer_id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * """) ``` ### Schema Definition ```python from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DecimalType schema = StructType([ StructField("id", IntegerType(), False), StructField("name", StringType(), True), StructField("email", StringType(), True), StructField("amount", DecimalType(18, 2), True), StructField("created_at", TimestampType(), True) ]) df = spark.read.format("csv") \ .schema(schema) \ .option("header", "true") \ .load("Files/bronze/customers.csv") ``` ### SQL Magic in Notebooks ```sql %%sql -- Query Delta table directly SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total_amount FROM gold_orders GROUP BY customer_id ORDER BY total_amount DESC LIMIT 10 ``` ### V-Order Optimization ```python # Enable V-Order for read optimization spark.conf.set("spark.sql.parquet.vorder.enabled", "true") ``` ### Table Optimization ```sql %%sql -- Optimize table (compact small files) OPTIMIZE silver_transactions -- Optimize with Z-ordering on query columns OPTIMIZE silver_transactions ZORDER BY (customer_id, transaction_date) -- Vacuum old files (default 7 days retention) VACUUM silver_transactions -- Vacuum with custom retention VACUUM silver_transactions RETAIN 168 HOURS ``` #