Spark Authoring Cli

Name: Spark Authoring Cli
Author: microsoft

microsoft/skills-for-fabric

136 installs
886 repo stars
Updated July 23, 2026

About

> **Update Check — ONCE PER SESSION (mandatory)**
> The first time this skill is used in a session, run the **check-updates** skill before proceeding.
> - **GitHub Copilot CLI / VS Code**: invoke the `check-updates` skill.
> - **Claude Code / Cowork / Cursor / Windsurf / Codex**: compare local vs remote package.json version.
> - Skip if the check was already performed earlier in this session.

Spark Authoring Cli by the numbers

136 all-time installs (skills.sh)
Ranked #3,479 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

spark-authoring-cli capabilities & compatibility

Use cases: documentation

SKILL.md

npx skills add https://github.com/microsoft/skills-for-fabric --skill spark-authoring-cli

Installs	136
repo stars	★ 886
Security audit	2 / 3 scanners passed
Last updated	July 23, 2026
Repository	microsoft/skills-for-fabric ↗

How do I apply spark-authoring-cli using the workflow in its SKILL.md?

Who is it for?

Developers following the spark-authoring-cli skill for the tasks it documents.

Skip if: Tasks outside the spark-authoring-cli scope described in SKILL.md.

When should I use this skill?

User mentions spark-authoring-cli or related triggers from the skill description.

What you get

Working spark-authoring-cli setup aligned with the documented patterns and constraints.

Files

SKILL.md · Markdown · GitHub ↗

CRITICAL NOTES

1. To find a workspace's details (including its ID) from its name: list all workspaces, then filter the result with JMESPath.

2. To find an item's details (including its ID) from the workspace ID, item type, and item name: list all items of that type in that workspace, then filter the result with JMESPath.
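
The two lookups above can be sketched as follows. This is a minimal illustration, assuming the list responses wrap results in a `value` array whose entries carry `id`, `displayName`, and (for items) `type`. The helper names are hypothetical; they only build the JMESPath expression that would be passed to `az rest --query`.

```python
def _raw(value: str) -> str:
    """Render a value as a JMESPath raw string literal, escaping single quotes."""
    return "'" + value.replace("'", "\\'") + "'"


def workspace_id_query(workspace_name: str) -> str:
    """JMESPath expression selecting the ID of the workspace with this name."""
    return f"value[?displayName=={_raw(workspace_name)}] | [0].id"


def item_id_query(item_name: str, item_type: str) -> str:
    """JMESPath expression selecting the ID of an item by name and type."""
    return (
        f"value[?displayName=={_raw(item_name)} && type=={_raw(item_type)}]"
        " | [0].id"
    )


if __name__ == "__main__":
    print(workspace_id_query("DataEng-Dev"))
    print(item_id_query("DevLakehouse", "Lakehouse"))
```

The printed expression goes into the `--query` argument of the listing call. If the listing is paginated, the match may sit on a later page, so check the Pagination Pattern reference before relying on a single call.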

Spark Authoring — CLI Skill

This skill covers two complementary areas: (1) managing Fabric Spark artifacts via REST APIs (workspaces, lakehouses, notebooks, jobs, pipelines) and (2) writing code inside Fabric Notebook cells (PySpark, Scala, SparkR, SQL with correct lakehouse access, notebookutils, and Spark configuration). For notebook code authoring fundamentals and shared modules, MUST see SPARK-NOTEBOOK-AUTHORING-CORE.md.

Task	Reference	Notes
RULES — Read these first, follow them always	SKILL.md § RULES	MUST read — 4 rules for this skill
Finding Workspaces and Items in Fabric	COMMON-CLI.md § Finding Workspaces and Items in Fabric	Mandatory — READ link first [needed for finding workspace id by its name or item id by its name, item type, and workspace id]
Fabric Topology & Key Concepts	COMMON-CORE.md § Fabric Topology & Key Concepts
Environment URLs	COMMON-CORE.md § Environment URLs
Authentication & Token Acquisition	COMMON-CORE.md § Authentication & Token Acquisition	Wrong audience = 401; read before any auth issue
Core Control-Plane REST APIs	COMMON-CORE.md § Core Control-Plane REST APIs
Pagination	COMMON-CORE.md § Pagination
Long-Running Operations (LRO)	COMMON-CORE.md § Long-Running Operations (LRO)
Rate Limiting & Throttling	COMMON-CORE.md § Rate Limiting & Throttling
OneLake Data Access	COMMON-CORE.md § OneLake Data Access	Requires `storage.azure.com` token, not Fabric token
Definition Envelope	ITEM-DEFINITIONS-CORE.md § Definition Envelope	Definition payload structure
Per-Item-Type Definitions	ITEM-DEFINITIONS-CORE.md § Per-Item-Type Definitions	Support matrix, decoded content, part paths — REST specs, CLI recipes
Job Execution	COMMON-CORE.md § Job Execution
Capacity Management	COMMON-CORE.md § Capacity Management
Gotchas & Troubleshooting	COMMON-CORE.md § Gotchas & Troubleshooting
Best Practices	COMMON-CORE.md § Best Practices
Tool Selection Rationale	COMMON-CLI.md § Tool Selection Rationale
Authentication Recipes	COMMON-CLI.md § Authentication Recipes	`az login` flows and token acquisition
Fabric Control-Plane API via `az rest`	COMMON-CLI.md § Fabric Control-Plane API via az rest	Always pass `--resource https://api.fabric.microsoft.com` or `az rest` fails
Pagination Pattern	COMMON-CLI.md § Pagination Pattern
Long-Running Operations (LRO) Pattern	COMMON-CLI.md § Long-Running Operations (LRO) Pattern
OneLake Data Access via `curl`	COMMON-CLI.md § OneLake Data Access via curl	Use `curl` not `az rest` (different token audience)
SQL / TDS Data-Plane Access	COMMON-CLI.md § SQL / TDS Data-Plane Access
Job Execution (CLI)	COMMON-CLI.md § Job Execution
Job Scheduling	COMMON-CLI.md § Job Scheduling	URL is `/jobs/{jobType}/schedules`; `endDateTime` required
OneLake Shortcuts	COMMON-CLI.md § OneLake Shortcuts
Capacity Management (CLI)	COMMON-CLI.md § Capacity Management
Composite Recipes	COMMON-CLI.md § Composite Recipes
Gotchas & Troubleshooting (CLI-Specific)	COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific)	`az rest` audience, shell escaping, token expiry
Quick Reference: `az rest` Template	COMMON-CLI.md § Quick Reference: az rest Template
Quick Reference: Token Audience / CLI Tool Matrix	COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix	Which `--resource` + tool for each service
Relationship to SPARK-CONSUMPTION-CORE.md	SPARK-AUTHORING-CORE.md § Relationship to SPARK-CONSUMPTION-CORE.md
Data Engineering Authoring Capability Matrix	SPARK-AUTHORING-CORE.md § Data Engineering Authoring Capability Matrix
Lakehouse Management	SPARK-AUTHORING-CORE.md § Lakehouse Management
Notebook Management	SPARK-AUTHORING-CORE.md § Notebook Management
Notebook Execution & Job Management	SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management
CI/CD & Automation Patterns	SPARK-AUTHORING-CORE.md § CI/CD & Automation Patterns
Infrastructure-as-Code	SPARK-AUTHORING-CORE.md § Infrastructure-as-Code
Performance Optimization & Resource Management	SPARK-AUTHORING-CORE.md § Performance Optimization & Resource Management
Authoring Gotchas and Troubleshooting	SPARK-AUTHORING-CORE.md § Authoring Gotchas and Troubleshooting
Quick Reference: Authoring Decision Guide	SPARK-AUTHORING-CORE.md § Quick Reference: Authoring Decision Guide
Recommended Patterns (Data Engineering)	data-engineering-patterns.md § Recommended patterns
Data Ingestion Principles	data-engineering-patterns.md § Data Ingestion Principles
Transformation Patterns	data-engineering-patterns.md § Transformation Patterns
Delta Lake Best Practices	data-engineering-patterns.md § Delta Lake Best Practices
Quality Assurance Strategies	data-engineering-patterns.md § Quality Assurance Strategies
Recommended Patterns (Development Workflow)	development-workflow.md § Recommended patterns
Notebook Lifecycle	development-workflow.md § Notebook Lifecycle
Parameterization Patterns	development-workflow.md § Parameterization Patterns
Variable Library (notebook + pipeline usage)	development-workflow.md § Method 4: Variable Library	`getLibrary()` + dot notation in notebooks; `libraryVariables` + `@pipeline().libraryVariables` in pipelines
Variable Library Definition	ITEM-DEFINITIONS-CORE.md § VariableLibrary	Definition parts, decoded content, types, pipeline mappings, gotchas
Local Testing Strategy	development-workflow.md § Local Testing Strategy
Debugging Patterns	development-workflow.md § Debugging Patterns
Recommended Patterns (Infrastructure)	infrastructure-orchestration.md § Recommended patterns
Materialized Lake View patterns	materialized-lake-view-patterns.md § Recommended patterns	Spark Lakehouse authoring guidance for MLV design (when to use MLVs, layering patterns)
MLV incremental refresh patterns	mlv-incremental-refresh-patterns.md § IR-friendly syntax guide	Use for refresh-readiness review and safe non-breaking rewrites
Workspace Provisioning Principles	infrastructure-orchestration.md § Workspace Provisioning Principles
Lakehouse Configuration Guidance	infrastructure-orchestration.md § Lakehouse Configuration Guidance
Pipeline Design Patterns	infrastructure-orchestration.md § Pipeline Design Patterns
CI/CD Integration Strategy	infrastructure-orchestration.md § CI/CD Integration Strategy
Notebook API — Which Endpoint to Use	notebook-api-operations.md § Quick Decision	Start here for remote notebook edits — getDefinition vs updateDefinition
Notebook Modification Workflow	notebook-api-operations.md § Workflow	Five-step flow: retrieve, decode, modify, encode, upload
Notebook API Error Reference	notebook-api-operations.md § Error Reference	411, 400 (updateMetadata), 401, 403 explained
Notebook API Gotchas	notebook-api-operations.md § Gotchas	`/result` suffix, empty body, `\n` per-line rule, `format=ipynb`
Default Lakehouse Binding	notebook-api-operations.md § Default Lakehouse Binding	`.ipynb` metadata vs `.py` `# METADATA` block; discover IDs dynamically
Public URL Data Ingestion	notebook-api-operations.md § Public URL Data Ingestion	Use real source URL, stage into `Files/`, then read with Spark
getDefinition (read notebook content)	notebook-api-operations.md § Step 1 — Retrieve Notebook Content	LRO flow, `?format=ipynb`, empty body (`--body '{}'`) requirement
Decode Base64 Notebook Payload	notebook-api-operations.md § Step 2 — Decode the Notebook Content	Extract payload, base64 decode, ipynb JSON structure
Modify Notebook Cells	notebook-api-operations.md § Step 3 — Modify the Notebook Content	Find cell, insert/replace lines, `\n` per-line rule
updateDefinition (write notebook content)	notebook-api-operations.md § Step 4 — Re-encode and Upload	Re-encode, upload, LRO poll, updateMetadata flag pitfall
Verify Notebook Update (Optional)	notebook-api-operations.md § Step 5 — Verify the Update	Skip unless you suspect a silent failure — `Succeeded` from updateDefinition is sufficient (see Rule 2)
Notebook API End-to-End Script	notebook-api-operations.md § Complete End-to-End Script	Full bash: get → decode → modify → encode → update → verify
Quick Start Examples	SKILL.md § Quick Start Examples	Minimal examples for common operations
— Notebook Code Authoring (shared modules) —
Notebook Authoring Core	SPARK-NOTEBOOK-AUTHORING-CORE.md	READ FIRST for notebook code tasks — fundamentals, code gen approach, module index

---

Must/Prefer/Avoid

MUST DO

Check for recent jobs BEFORE creating new notebook runs — Query job instances from last 5 minutes; if recent job exists, monitor it instead of creating duplicate
Capture job instance ID immediately after POST — Store job ID before any other operations to enable proper monitoring
Verify workspace capacity assignment before operations — Workspace must have capacity assigned and active
When user provides a public data URL, follow the Public URL Data Ingestion policy — keep detailed behavior in the linked resource section to avoid drift/duplication
Format notebook cells correctly — Each line in cell source array MUST end with \n to prevent code merging
Use correct Lakehouse Livy session body format — Send a FLAT JSON with name, driverMemory, driverCores, executorMemory, executorCores. Do NOT wrap in {"payload": ...} or send only {"kind": "pyspark"} — that causes HTTP 500. Use valid memory values (28g, 56g, 112g, 224g). See Create Lakehouse Livy Session example below and SPARK-CONSUMPTION-CORE.md.
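
The cell-formatting rule above can be sketched in a few lines. This is an illustration only: `to_cell_source` and `make_code_cell` are hypothetical helper names, and the cell dictionary follows the general ipynb code-cell layout rather than anything Fabric-specific.

```python
def to_cell_source(code: str) -> list:
    """Split code into source lines, each terminated with a newline."""
    return [line + "\n" for line in code.splitlines()]


def make_code_cell(code: str) -> dict:
    """Build an ipynb-style code cell whose source follows the per-line rule."""
    return {
        "cell_type": "code",
        "metadata": {},
        "outputs": [],
        "execution_count": None,
        "source": to_cell_source(code),
    }


if __name__ == "__main__":
    cell = make_code_cell('df = spark.read.table("bronze.orders_raw")\ndf.show()')
    print(cell["source"])
```

Building `source` this way avoids the failure mode the rule describes, where two statements end up joined on one line because the separator was dropped.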

PREFER

Poll job status with proper intervals — 10-30 seconds between polls; timeout after reasonable duration (e.g., 30 minutes)
Check job history when POST response is unreadable — If POST returns "No Content" or unreadable response, query recent jobs (last 1 minute) before retrying
Use Starter Pool for development — Development/testing workloads should use useStarterPool: true
Use Workspace Pool for production — Production workloads need consistent performance with useWorkspacePool: true
Enable lakehouse schemas during creation — Set creationPayload.enableSchemas: true for better table organization
Implement idempotency checks — Prevent duplicate operations by checking existing state first
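
The polling preference above can be sketched as a small loop. This is a minimal sketch under stated assumptions: `get_status` is a hypothetical callable that wraps a GET on the job instance and returns its status string, and the terminal status names are assumed rather than taken from this skill, so confirm them against the Job Execution reference.

```python
import time
from typing import Callable

# Assumed terminal status names; confirm against the Job Execution reference.
TERMINAL_STATES = frozenset({"Completed", "Failed", "Cancelled", "Deduped"})


def poll_job(
    get_status: Callable[[], str],
    interval_s: float = 15.0,
    timeout_s: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status every interval_s seconds until a terminal state or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Stop before sleeping past the deadline instead of overshooting it.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s:.0f}s")
        sleep(interval_s)
```

The loop only ever reads status. It never resubmits the job, which keeps it compatible with the duplicate-prevention guidance in this skill. The `sleep` and `clock` parameters exist so the loop can be exercised without real waiting.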

AVOID

Never retry POST with same parameters — If you have a job ID, only use GET to check status; don't create duplicate job instances
Don't skip capacity verification — Operations will fail if workspace capacity is paused or unassigned
Avoid immediate POST retries on failures — Check for existing/active jobs first to prevent duplicates
Don't create new runs if monitoring existing job — One job at a time; wait for completion before submitting new runs
Don't hardcode workspace/lakehouse IDs — Discover dynamically via item listing or catalog search APIs
Do NOT use Lakehouse Livy sessions to run a Fabric notebook — Lakehouse Livy sessions (the public Livy API) are for ad-hoc interactive Spark code execution. To run a notebook as a job, use the Jobs API (RunNotebook) which creates a Notebook Spark session internally. See SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management

---

RULES — Read these first, follow them always

Rule 1 — Validate prerequisites before operations.

Verify workspace has capacity assigned (see COMMON-CORE.md Create Workspace and Capacity Management) and resource IDs exist before attempting operations.

Rule 2 — Trust updateDefinition success.

A Succeeded poll result from updateDefinition is sufficient confirmation that content and lakehouse bindings persisted. Do NOT call getDefinition after every upload — it is an async LRO that adds significant latency. Only use getDefinition for its intended purpose: reading current notebook content before making modifications.

Rule 3 — Prevent duplicate jobs and monitor execution properly.

Before submitting a new notebook run, ALWAYS check for recent job instances first (last 5 minutes). If a recent job exists, monitor it instead of creating a duplicate. After submission, capture the job instance ID immediately and poll its status; never retry the POST. See SPARK-AUTHORING-CORE.md Job Monitoring for patterns.
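
The pre-submission check in Rule 3 can be sketched as a filter over the job instance list. This is an illustration, not the skill's own implementation: it assumes each instance is a dictionary with `id`, `status`, and a UTC `startTimeUtc` timestamp, and `find_recent_job` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO-like UTC timestamp, tolerating 'Z' and 7-digit fractions."""
    text = timestamp.rstrip("Z")
    if "." in text:
        head, fraction = text.split(".", 1)
        text = f"{head}.{fraction[:6]}"  # datetime keeps microseconds only
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


def find_recent_job(
    instances: list,
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> Optional[dict]:
    """Return the latest instance started within the window, or None."""
    recent = []
    for instance in instances:
        started = instance.get("startTimeUtc")
        if not started:
            continue  # no start time recorded, so it cannot be dated
        started_at = parse_utc(started)
        if now - started_at <= window:
            recent.append((started_at, instance))
    if not recent:
        return None
    return max(recent, key=lambda pair: pair[0])[1]
```

If the function returns an instance, the caller would monitor that instance's ID; only a `None` result would justify a new POST.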

Rule 4 — For notebook code authoring, MUST follow SPARK-NOTEBOOK-AUTHORING-CORE.md.

When writing code inside notebook cells, MUST read SPARK-NOTEBOOK-AUTHORING-CORE.md first — it defines the code generation approach, rules, and a Module Index linking to detailed guides (lakehouse paths, connections, context, orchestration, etc.). Use the Spark-specific resources in this skill (data-engineering-patterns.md, development-workflow.md) for Spark-only implementation details. When the task is about Materialized Lake Views, read materialized-lake-view-patterns.md for authoring/design guidance and mlv-incremental-refresh-patterns.md for refresh-readiness analysis.

---

Quick Start Examples

For detailed patterns, authentication, and comprehensive API usage, see:

COMMON-CORE.md — Fabric REST API patterns, authentication, item discovery
COMMON-CLI.md — az rest usage, environment detection, token acquisition
SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, job execution

Below are minimal quick-start examples. **Always reference the COMMON-* files for production use.**

Create Workspace & Lakehouse

# See COMMON-CORE.md Environment URLs and SPARK-AUTHORING-CORE.md for full patterns
cat > /tmp/body.json << 'EOF'
{"displayName": "DataEng-Dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

cat > /tmp/body.json << 'EOF'
{"displayName": "DevLakehouse", "type": "Lakehouse", "creationPayload": {"enableSchemas": true}}
EOF
lakehouse_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json --query "id" --output tsv)

Organize Lakehouse Tables with Schemas

# See SPARK-AUTHORING-CORE.md Lakehouse Schema Organization for table organization patterns
# Create schemas for medallion architecture
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

Create and Refresh a Materialized Lake View (MLV)

-- See resources/materialized-lake-view-patterns.md for design guidance
-- and resources/mlv-incremental-refresh-patterns.md for refresh-readiness review.

-- Bronze/Silver/Gold schemas in a Lakehouse with schemas enabled
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;

-- A simple Silver MLV with data quality constraints
--
-- Prerequisite for incremental refresh: enable Change Data Feed (CDF) on every
-- source table the MLV reads from. Without CDF, optimal refresh can only choose
-- between no refresh (sources unchanged) and full refresh — never incremental.
-- See resources/mlv-incremental-refresh-patterns.md.
ALTER TABLE bronze.orders_raw SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.orders_clean
(
    CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL) ON MISMATCH DROP
)
AS
SELECT
  order_id,
  customer_id,
  CAST(order_ts AS TIMESTAMP) AS order_ts,
  amount
FROM bronze.orders_raw;

-- Routine refresh is handled by the lakehouse Materialized lake views → Manage
-- schedule/lineage view; don't orchestrate from notebooks. The SQL form below is
-- documented only for forcing a one-time FULL recompute (troubleshooting / after
-- a correction). There is no documented SQL form for triggering incremental refresh.
REFRESH MATERIALIZED LAKE VIEW silver.orders_clean FULL;

Create Lakehouse Livy Session

# See SPARK-CONSUMPTION-CORE.md for Lakehouse Livy session configuration and management
# IMPORTANT: Body MUST be flat JSON with memory/cores — do NOT wrap in {"payload": ...}
cat > /tmp/body.json << 'EOF'
{"name": "dev-session", "driverMemory": "56g", "driverCores": 8, "executorMemory": "56g", "executorCores": 8, "conf": {"spark.dynamicAllocation.enabled": "true", "spark.fabric.pool.name": "Starter Pool"}}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/lakehouses/$lakehouse_id/livyapi/versions/2023-12-01/sessions" \
  --body @/tmp/body.json

Lakehouse Livy Session Body — Common Mistakes

- ❌ {"payload": {"kind": "pyspark"}} → HTTP 500 (wrong wrapper, missing required fields)

- ❌ {"kind": "pyspark"} → HTTP 500 (missing driverMemory, executorMemory, etc.)

- ✅ Flat JSON with name, driverMemory, driverCores, executorMemory, executorCores (and optionally conf with Starter Pool)
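
To make the flat-body rule concrete, the sketch below builds and validates a session body before it is written to a file for `az rest`. It assumes only what this section states (the five required fields and the four valid memory values); `build_livy_session_body` is a hypothetical helper name.

```python
import json
from typing import Optional

# Memory values listed as valid in this skill.
VALID_MEMORY = ("28g", "56g", "112g", "224g")


def build_livy_session_body(
    name: str,
    driver_memory: str = "56g",
    driver_cores: int = 8,
    executor_memory: str = "56g",
    executor_cores: int = 8,
    conf: Optional[dict] = None,
) -> dict:
    """Return a flat Livy session body, rejecting unsupported memory values."""
    for label, value in (("driverMemory", driver_memory), ("executorMemory", executor_memory)):
        if value not in VALID_MEMORY:
            raise ValueError(f"{label}={value!r} is not one of {VALID_MEMORY}")
    body = {
        "name": name,
        "driverMemory": driver_memory,
        "driverCores": driver_cores,
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
    }
    if conf:
        # Spark conf values are sent as strings.
        body["conf"] = {key: str(value) for key, value in conf.items()}
    return body


if __name__ == "__main__":
    body = build_livy_session_body("dev-session", conf={"spark.fabric.pool.name": "Starter Pool"})
    print(json.dumps(body))
```

Because the function returns the fields at the top level, the two failing shapes listed above (a `payload` wrapper or a body containing only `kind`) cannot be produced by accident.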

Spark Performance Configs

For detailed workload-specific configurations, see data-engineering-patterns.md Delta Lake Best Practices.

Quick reference:

# Write-heavy (Bronze): Disable V-Order, enable autoCompact
# Balanced (Silver): Enable V-Order, adaptive execution  
# Read-heavy (Gold): Vectorized reads, optimal parallelism
# See data-engineering-patterns.md for complete config tables

---

Focus: Essential CLI patterns for Spark/data engineering development and notebook code authoring, with intelligent routing to specialized resources. For comprehensive patterns, always reference COMMON-* files and resource documents.

Data Engineering Patterns — Skill Resource

Essential patterns and principles for PySpark data engineering in Microsoft Fabric.

Recommended patterns

Must

1. Always define explicit schemas for production data ingestion: avoid inferSchema=true, which adds overhead and inconsistency
2. Use Delta Lake format for all managed tables: provides ACID guarantees, time travel, and optimized reads
3. Validate data quality at ingestion boundaries: check nulls, data types, and business rules before persisting
4. Add metadata columns to track lineage: ingestion_timestamp, source_system, pipeline_run_id for debugging
5. Handle errors gracefully: wrap ingestion/transformation logic in try-except with proper logging and recovery
6. Use MERGE for upserts: leverage Delta Lake's MERGE INTO for incremental updates based on merge keys
7. Partition large tables: use date or category columns for partition pruning to improve query performance
8. If a real public source URL is provided, ingest from that source: download/copy into lakehouse Files/ first, then load with Spark from lakehouse paths (do not replace with synthetic inline rows)

Prefer

1. Batch processing over streaming unless real-time requirements exist: simpler to debug and monitor
2. Read-optimized writes for analytical workloads: use .coalesce() or .repartition() to right-size output files
3. Window functions over self-joins: more efficient for ranking, running totals, and lag/lead operations
4. Broadcast joins for small dimensions: wrap the small DataFrame in broadcast() (from pyspark.sql.functions) when it fits in memory (<100MB)
5. Columnar operations over row-wise: leverage DataFrame/SQL API instead of UDFs when possible
6. Lazy evaluation mindset: build transformation chains, then execute with an action (.count(), or a write such as .write.saveAsTable())

Avoid

1. Don't use `.collect()` on large DataFrames: it brings all data to the driver and causes OOM errors
2. Don't chain multiple `.count()` calls: each triggers a full scan; cache the DataFrame if needed
3. Don't ignore skew: salting keys or adaptive query execution prevents straggler tasks
4. Don't skip Delta optimization: run OPTIMIZE and VACUUM regularly to prevent the small-file problem
5. Don't hardcode paths or credentials: use parameters and secure configuration patterns
6. Don't mix append and overwrite carelessly: understand partition scope for .mode("overwrite")

---

Data Ingestion Principles

Schema Management

Guide LLM to define explicit schemas with nullable constraints, data type validation, and business context comments.

Note: This section refers to data schemas (DataFrame structure). For lakehouse schemas (databases/namespaces for organizing tables), see SPARK-AUTHORING-CORE.md Lakehouse Schema Organization.

Source Format Handling

CSV/TSV: Explicit schema, header option
Parquet/ORC: Columnar formats with embedded schema
JSON: multiLine option for nested objects
ADLS Gen2: abfss://container@storage.dfs.core.windows.net/path
OneLake: abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse.Lakehouse/Files/path
Public HTTP/HTTPS datasets: Download/copy to lakehouse Files/... first, then spark.read from lakehouse paths for stable runtime behavior
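
The staging step for public datasets can be sketched with the standard library alone. This is a minimal sketch, not the policy in notebook-api-operations.md: `stage_public_file` is a hypothetical helper, and the destination directory is left to the caller. In a notebook with a default lakehouse attached, that directory is assumed to be a local mount such as `/lakehouse/default/Files/raw`.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def stage_public_file(url: str, dest_dir: Path) -> Path:
    """Download url into dest_dir and return the path of the staged file."""
    filename = Path(unquote(urlparse(url).path)).name
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / filename
    # Stream to disk so large files are never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

After staging, the file would be read with `spark.read` from the lakehouse path (for example a relative `Files/raw/...` path) together with an explicit schema, rather than read directly from the remote URL.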

Validation Patterns

Completeness: Filter nulls in required fields
Referential integrity: Join with dimensions, flag orphans
Business rules: Domain-specific checks (amount > 0, date ranges)
Duplicates: dropDuplicates or groupBy to identify

Error Handling Strategy

Try-except blocks with specific exceptions
Contextual logging
Dead letter queues for invalid records
Retry logic for transient failures

---

Transformation Patterns

When to Use Different Operations

Aggregations: For summarization and metrics; combine multiple in single pass with .agg()

Window Functions: For ranking (row_number, rank), running calculations (cumulative sums), and lead/lag comparisons; more efficient than self-joins

Joins: Inner (matching only), Left (dimension lookups), Broadcast (<100MB tables to avoid shuffle)

Example Approaches

Customer Segmentation: Use window functions for lifetime metrics, when().otherwise() for classification, temporal dimensions for recency

Product Analytics: Join with dimensions, aggregate by category, rank with row_number(), compute percentiles

---

Delta Lake Best Practices

MERGE Operations (Upserts)

When to use:

Incremental loads where source sends changed/new records
Slowly changing dimensions (SCD Type 1 or Type 2)
Deduplication scenarios

Guide LLM to generate MERGE with:

.merge(source_df, "target.id = source.id") on unique key
.whenMatchedUpdateAll() to update existing records
.whenNotMatchedInsertAll() to insert new records
Optional: .whenMatchedDelete() for hard deletes based on condition
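
The MERGE guidance above can be sketched with the Delta Lake Python API. This is an illustration under assumptions: `merge_condition` and `upsert` are hypothetical helper names, the `target` and `source` aliases are arbitrary, and `delta.tables` is imported inside the function so the module still loads where delta-spark is not installed.

```python
def merge_condition(keys: list) -> str:
    """Build the join condition between the target and source aliases."""
    if not keys:
        raise ValueError("at least one merge key is required")
    return " AND ".join(f"target.{key} = source.{key}" for key in keys)


def upsert(spark, target_table: str, source_df, keys: list) -> None:
    """Update matching rows and insert new ones in a Delta table."""
    from delta.tables import DeltaTable  # provided by the Spark runtime or delta-spark

    (
        DeltaTable.forName(spark, target_table)
        .alias("target")
        .merge(source_df.alias("source"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

A call such as `upsert(spark, "silver.orders", changes_df, ["order_id"])` would apply one incremental batch. The table and DataFrame names in that call are placeholders.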

Optimization Strategies

Tell LLM to include:

Z-Ordering: OPTIMIZE table_name ZORDER BY (frequently_filtered_column) improves query speed
VACUUM: VACUUM table_name RETAIN 168 HOURS cleans up old file versions after retention period
Partition pruning: Query with partition columns in WHERE clause to skip irrelevant data
File compaction: Run OPTIMIZE to combine small files into right-sized files (128MB-1GB)

Time Travel

Use cases:

Point-in-time queries: spark.read.format("delta").option("versionAsOf", 5).load(path)
Rollback bad writes: Restore to previous version with RESTORE TABLE table_name TO VERSION AS OF 10
Audit trail: Query historical data for compliance, debugging

Spark Session Configurations for Performance

Guide LLM to configure Spark sessions based on workload type:

Write-Heavy Workloads (Bronze Layer - High-Volume Ingestion):

spark.microsoft.delta.parquet.vorder.enabled = false — Disable V-Order for faster writes
spark.databricks.delta.optimizeWrite.binSize = 1073741824 — Target 1GB file size for fewer small files
spark.databricks.delta.autoCompact.enabled = true — Automatic compaction during writes
spark.microsoft.delta.optimize.fast.enabled = true — Fast optimization algorithms
spark.databricks.delta.properties.defaults.enableDeletionVectors = true — Efficient delete tracking
spark.microsoft.delta.targetFileSize.adaptive.enabled = true — Adaptive file sizing
spark.native.enabled = true — Use native execution engine (Velox)
spark.gluten.delta.columnMapping.name.enabled = true — Column mapping for schema evolution

Balanced Workloads (Silver Layer - Mixed Read/Write):

spark.microsoft.delta.parquet.vorder.enabled = true — Enable V-Order for better read performance
spark.databricks.delta.optimizeWrite.enabled = true — Balance write optimization with read efficiency
spark.microsoft.delta.snapshot.driverMode.enabled = true — Faster snapshot reads
spark.sql.adaptive.enabled = true — Adaptive query execution
spark.sql.adaptive.coalescePartitions.enabled = true — Dynamic partition coalescing

Read-Heavy Workloads (Gold Layer - Analytics & Reporting):

spark.microsoft.delta.parquet.vorder.enabled = true — V-Order for maximum read performance
spark.databricks.delta.optimizeWrite.enabled = false — No write optimization overhead
spark.sql.parquet.enableVectorizedReader = true — Vectorized Parquet reads
spark.sql.files.maxPartitionBytes = 134217728 — 128MB partition size for optimal parallelism
spark.sql.adaptive.enabled = true — Optimize query plans based on runtime stats
spark.databricks.delta.stalenessLimit = 0 — Always use latest snapshot

When to apply these configs:

Pass during Livy session creation: "conf": {"spark.config.key": "value"}
Set in notebook first cell before any Spark operations
Configure at workspace level for consistent defaults
Override per-job for specific workload requirements
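
One way to keep these settings reusable is a per-layer lookup, sketched below. The keys and values are a subset copied from the lists above; the `LAYER_CONF` table and the `session_conf` helper are hypothetical names for illustration.

```python
from typing import Optional

# Subset of the per-layer settings listed above, with values as strings.
LAYER_CONF = {
    "bronze": {
        "spark.microsoft.delta.parquet.vorder.enabled": "false",
        "spark.databricks.delta.autoCompact.enabled": "true",
        "spark.databricks.delta.optimizeWrite.binSize": "1073741824",
    },
    "silver": {
        "spark.microsoft.delta.parquet.vorder.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.sql.adaptive.enabled": "true",
    },
    "gold": {
        "spark.microsoft.delta.parquet.vorder.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "false",
        "spark.sql.files.maxPartitionBytes": "134217728",
    },
}


def session_conf(layer: str, overrides: Optional[dict] = None) -> dict:
    """Return a copy of the layer's settings with per-job overrides applied."""
    if layer not in LAYER_CONF:
        raise ValueError(f"unknown layer {layer!r}; expected one of {sorted(LAYER_CONF)}")
    conf = dict(LAYER_CONF[layer])
    conf.update({key: str(value) for key, value in (overrides or {}).items()})
    return conf
```

The returned dictionary can be placed under `"conf"` in a Livy session body, or applied in a notebook's first cell with `spark.conf.set(key, value)` for each entry.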

---

Quality Assurance Strategies

Testing Levels

Guide LLM to implement:

Unit Testing (local Spark):

Test transformation logic with small sample DataFrames
Use pytest fixtures to create test Spark session
Assert row counts, column values, schema correctness
Focus on business logic in isolation

Integration Testing (Fabric API):

Validate workspace/lakehouse creation succeeded
Test notebook deployment via REST API
Verify Livy session creation and code execution
Check end-to-end data flow through bronze → silver → gold

Data Quality Checks (production):

Row count validation: compare source vs target
Schema validation: ensure expected columns exist with correct types
Null checks: flag unexpected nulls in required fields
Range checks: validate numeric values within expected bounds
Freshness checks: ensure data updated within SLA timeframe

Quality Gates

Define when pipelines should fail:

Critical failures: schema mismatch, zero rows ingested, primary key violations
Warnings: elevated null rate, data volume anomaly (>20% change), late arrival
Monitoring: track ingestion lag, transformation duration, error rates over time
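
The gate categories above can be sketched as a single evaluation function. This is a minimal sketch: the 20% volume threshold comes from the list above, while the 5% null-rate threshold, the parameter names, and the function name are assumptions chosen for illustration.

```python
def evaluate_quality_gates(
    rows_ingested: int,
    previous_rows: int,
    schema_matches: bool,
    duplicate_keys: int,
    null_rate: float,
    max_null_rate: float = 0.05,  # assumed threshold
    max_volume_change: float = 0.20,
) -> dict:
    """Classify a pipeline run into critical failures and warnings."""
    critical = []
    warnings = []
    if not schema_matches:
        critical.append("schema mismatch")
    if rows_ingested == 0:
        critical.append("zero rows ingested")
    if duplicate_keys > 0:
        critical.append("primary key violations")
    if null_rate > max_null_rate:
        warnings.append("elevated null rate")
    if previous_rows > 0:
        change = abs(rows_ingested - previous_rows) / previous_rows
        if change > max_volume_change:
            warnings.append("data volume anomaly")
    return {"fail": bool(critical), "critical": critical, "warnings": warnings}
```

A pipeline would stop when `fail` is true and log the warnings otherwise, which keeps the fail and warn decisions in one reviewable place.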

Logging and Observability

Prompt LLM to generate:

Structured logging: JSON-formatted logs with timestamp, severity, context
Metrics emission: log key counts (rows processed, errors, duration) for monitoring
Error context: capture input values, stack traces, environment details for debugging
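
Structured logging as described above can be sketched with the standard `logging` module. The field names (`timestamp`, `severity`, `message`, `context`, `stack_trace`) and the `JsonFormatter` class name are illustrative choices, not a format this skill prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
            # Populated when the caller passes extra={"context": {...}}.
            "context": getattr(record, "context", {}),
        }
        if record.exc_info:
            entry["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("ingestion")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("batch complete", extra={"context": {"rows_processed": 1200, "errors": 0}})
```

Key counts travel in the `context` dictionary, so downstream monitoring can parse them without scraping message text.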

Development Workflow — Skill Resource

Essential workflow patterns for Spark notebook development in Microsoft Fabric.

Recommended patterns

Must

1. Always validate notebook JSON structure before deployment: malformed JSON causes deployment failures
2. Use base64 encoding for notebook content in Fabric API calls: required by REST API specification
3. Test locally first with sample data before deploying to Fabric: catch logic errors early
4. Use parameterized notebooks for reusability across environments: avoid hardcoded values
5. Follow PySpark best practices: proper DataFrame operations, avoid driver memory issues

Prefer

1. Local development workflow: develop in Jupyter locally, validate, then deploy to Fabric
2. Session reuse over creating new sessions: faster iteration during development
3. Incremental development: test small changes before full deployments

Avoid

1. Don't hardcode connection strings or workspace IDs: use parameters and configuration
2. Don't skip local testing: always validate transformation logic before deploying
3. Don't commit secrets to notebooks: use secure parameter passing and Azure Key Vault

---

Notebook Lifecycle

Development Phase

Guide LLM to generate notebooks following:

1. Local development: Create .ipynb file with Jupyter, use local Spark session for testing
2. Cell structure: Organize as Parameters → Setup → Logic → Validation → Cleanup
3. Parameter cell: First code cell should define configurable parameters with defaults
4. Imports cell: Import all dependencies upfront to catch missing packages early
5. Validation cell: Add checks at end to validate output (row counts, schema, sample data)

Deployment Phase

Prompt LLM to generate deployment commands:

1. Convert to JSON: Notebook must be valid JSON with cells array
2. Base64 encode: Content must be base64-encoded for Fabric REST API
3. Create notebook item: POST to /workspaces/{id}/items with type="Notebook"
4. Update definition: POST to /workspaces/{id}/items/{notebookId}/updateDefinition with payload
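
The validate-and-encode steps above can be sketched as follows. The envelope field names (`definition`, `format`, `parts`, `path`, `payload`, `payloadType`) and the default part path are assumptions for illustration. Confirm them against ITEM-DEFINITIONS-CORE.md § Definition Envelope before use. `build_update_definition_body` is a hypothetical helper name.

```python
import base64
import json


def build_update_definition_body(
    notebook: dict,
    part_path: str = "notebook-content.ipynb",  # assumed part path
) -> dict:
    """Wrap an ipynb document in a base64 definition envelope (assumed shape)."""
    if not isinstance(notebook.get("cells"), list):
        raise ValueError("notebook must contain a 'cells' array")
    raw = json.dumps(notebook).encode("utf-8")
    return {
        "definition": {
            "format": "ipynb",
            "parts": [
                {
                    "path": part_path,
                    "payload": base64.b64encode(raw).decode("ascii"),
                    "payloadType": "InlineBase64",
                }
            ],
        }
    }
```

The returned dictionary would be serialized to a file and passed to `az rest --body @file` for the updateDefinition call, followed by the LRO polling pattern.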

Execution Phase

Guide LLM for execution patterns:

1. On-demand execution: Submit a notebook run through the Jobs API (RunNotebook); use Lakehouse Livy sessions only for ad-hoc interactive Spark code, not to run a notebook
2. Pipeline execution: Embed notebook in pipeline activity with parameter overrides
3. Scheduled execution: Create a schedule via Job Scheduling in COMMON-CLI.md
4. Monitoring: Poll the job instance status or pipeline run status to track progress

---

Parameterization Patterns

For parameterization patterns (when to parameterize, parameter injection methods, Variable Library, configuration management), see context-and-params.md. The section below covers Spark-specific development patterns only.

Spark Session Configuration & Runtime

Agents must fetch official docs for details — use the URLs below, not local descriptions.

Topic	Fetch URL	Keywords
%%configure magic command	https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook#spark-session-configuration-magic-command	`%%configure`, `driverMemory`, `executorMemory`, `driverCores`, `executorCores`, `numExecutors`, `defaultLakehouse`, `mountPoints`, `sessionTimeoutInSeconds`, `useStarterPool`, `useWorkspacePool`, `conf`, Variable Library, session restart, `-f` flag
Parameterized %%configure from pipeline	https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook#parameterized-session-configuration-from-a-pipeline	`parameterName`, `defaultValue`, pipeline notebook activity, override %%configure, parameterized session config
Spark compute (pools, node sizes, autoscale)	https://learn.microsoft.com/en-us/fabric/data-engineering/spark-compute	starter pool, custom Spark pool, node size (Small/Medium/Large/XL/XXL), vCores, autoscale, dynamic allocation, capacity units, single-node pool

---

Local Testing Strategy

Setup Local Environment

Prompt LLM to generate setup for:

1. Install PySpark: pip install pyspark delta-spark for local Spark session
2. Install Jupyter: pip install jupyter notebook for interactive development
3. Sample data: Create small CSV/Parquet files locally to simulate Fabric data
4. Mock Fabric paths: Use local file paths during dev, swap to abfss:// for Fabric deployment

Testing Transformation Logic

Guide LLM to test:

Create test DataFrame: Use spark.createDataFrame() with sample data and explicit schema
Run transformation: Execute the notebook's core logic on test data
Assert results: Validate output row count, column values, schema matches expectations
Edge cases: Test with nulls, empty DataFrames, duplicate keys

Local vs Fabric Differences

Make LLM aware of:

Spark session: Local requires explicit creation; Fabric provides pre-configured spark object
OneLake access: Local can't access OneLake; use local files or mounted Azure Storage
Livy API: Only available in Fabric; local testing can't validate Livy-specific features
Lakehouse tables: Local uses Hive metastore; Fabric uses OneLake managed tables

---

Debugging Patterns

Livy Session Debugging

When errors occur in Fabric, guide LLM to:

1. Check session state: GET /livyapi/versions/2023-12-01/sessions/{id} to see if session is idle/busy/error
2. Retrieve session log: GET session log endpoint to see driver/executor logs
3. Statement-level debugging: Execute statements individually to isolate failing code
4. Resource issues: Check if error is memory-related (OOM), timeout, or network connectivity
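The session-state check can be wrapped in a small polling loop. The sketch below takes the HTTP call as an injected function, so `fetch_state` is a hypothetical callable that performs the GET on the session endpoint and returns the `state` string. The state names follow Apache Livy's session states; Fabric may report only a subset.

```python
import time

# Session states as defined by Apache Livy; treated here as unusable end states.
TERMINAL_STATES = {"error", "dead", "killed", "shutting_down", "success"}
READY_STATE = "idle"


def wait_until_idle(fetch_state, timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Poll fetch_state() until the session reports 'idle'.

    Raises RuntimeError if the session reaches a terminal state and
    TimeoutError if it is still starting or busy after timeout_s seconds.
    """
    waited = 0.0
    while True:
        state = fetch_state()
        if state == READY_STATE:
            return state
        if state in TERMINAL_STATES:
            raise RuntimeError(f"Livy session ended in state '{state}'")
        if waited >= timeout_s:
            raise TimeoutError(f"session still '{state}' after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s


# Simulated state sequence standing in for real GET responses.
states = iter(["starting", "starting", "idle"])
result = wait_until_idle(lambda: next(states), sleep=lambda seconds: None)
```

Injecting `fetch_state` and `sleep` keeps the loop testable without a live session.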

Common Error Patterns

Schema mismatch:

Symptom: "Cannot merge incompatible schemas"
Fix: Ensure source DataFrame columns match target table schema exactly
Prevention: Define explicit schemas, validate before write

Path not found:

Symptom: "Path does not exist: abfss://..."
Fix: Verify lakehouse ID, file path, check OneLake permissions
Prevention: Test paths with .ls() or simple read before complex operations

Out of memory:

Symptom: "java.lang.OutOfMemoryError" or driver/executor crashes
Fix: Add .repartition() or .coalesce(), reduce data volume, increase executor memory
Prevention: Avoid .collect(), limit .count() calls, cache judiciously

Livy session timeout:

Symptom: Session in "dead" state or statements not executing
Fix: Recreate session, check network connectivity, verify lakehouse accessibility
Prevention: Use session heartbeat, handle long-running operations with checkpoints

Logging Best Practices

Prompt LLM to add logging:

Progress indicators: Log after each major step (read, transform, write)
Row counts: Log DataFrame counts to track data flow
Timing: Record start/end timestamps for performance analysis
Error context: Log input parameters, DataFrame sample when errors occur
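One way to cover progress, timing, and error context together is a small context manager around each step. This is a sketch using only the standard library; `log_step` and `log_row_count` are hypothetical helper names, and the list used below is a stand-in for a DataFrame read.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("notebook")


@contextmanager
def log_step(name, **context):
    """Log start, duration, and failure context for one notebook step."""
    start = time.perf_counter()
    logger.info("start %s %s", name, context)
    try:
        yield
    except Exception:
        # Include the input parameters so the failure can be reproduced.
        logger.exception("failed %s %s", name, context)
        raise
    else:
        elapsed = time.perf_counter() - start
        logger.info("done %s in %.2fs", name, elapsed)


def log_row_count(name, count):
    """Log a row count that the caller has already computed."""
    logger.info("rows %s=%d", name, count)


logging.basicConfig(level=logging.INFO)
with log_step("read", path="Files/raw/orders"):
    rows = [1, 2, 3]  # stand-in for spark.read...; not a real DataFrame
log_row_count("orders_raw", len(rows))
```

Passing an already-computed count avoids triggering an extra `.count()` action just for logging.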

Incremental Debugging Strategy

Guide LLM to debug systematically:

1. Isolate failure: Comment out sections to identify failing cell
2. Simplify input: Test with small sample (.limit(100)) to reproduce faster
3. Add visibility: Insert .show() and .printSchema() to inspect intermediate state
4. Check assumptions: Validate data types, nulls, distributions match expectations
5. Divide and conquer: Break complex transformations into smaller steps with validation between

Infrastructure & Orchestration — Skill Resource

Essential patterns for workspace provisioning, lakehouse configuration, and pipeline orchestration in Microsoft Fabric.

Recommended patterns

Must

1. Always use version control for infrastructure definitions and deployment scripts: track changes, enable rollback
2. Implement environment isolation: separate dev/test/prod workspaces with distinct configurations
3. Use parameterized configurations: avoid hardcoding workspace IDs, names, or environment-specific values
4. Validate before deployment: check JSON payloads, test API calls with dry-run where possible
5. Implement proper RBAC: assign minimum required permissions to users and service principals

Prefer

1. Declarative infrastructure over imperative scripts: define desired state, let automation converge
2. Automated CI/CD over manual deployments: reduce human error, increase consistency
3. Idempotent operations: scripts should be safe to run multiple times without side effects

Avoid

1. Don't hardcode secrets in scripts or configurations: use Azure Key Vault, environment variables
2. Don't skip testing infrastructure changes: validate in dev before promoting to prod
3. Don't ignore capacity planning: understand SKU requirements, concurrent user limits, data volume
4. Don't create orphaned resources: track workspace-lakehouse-notebook relationships, clean up on teardown

---

Workspace Provisioning Principles

Environment Isolation Strategy

Guide LLM to create workspaces following:

Naming convention: {project}-{environment} (e.g., sales-analytics-dev, sales-analytics-prod)
Separate capacity: Dev uses lower-tier capacity (F2/F4), prod uses dedicated higher-tier (F64+)
Access control: Limit prod access to service principals and specific admins, dev is broader
Data segregation: Dev uses synthetic/sample data, prod uses real data with compliance controls

Workspace Configuration

Prompt LLM to configure:

1. Display name: Human-readable, follows naming convention
2. Description: Document purpose, owning team, environment type
3. Capacity assignment: Assign to appropriate Fabric capacity for the environment
4. Default lakehouse: Create primary lakehouse during workspace setup for consistency

RBAC Patterns

Tell LLM to assign roles appropriately:

Admin: Full control, can manage access — limit to 2-3 people
Member: Can create and edit items and add users at the same or lower role: developers, data engineers
Contributor: Can create and edit items but not manage workspace access: analysts, data scientists
Viewer: Read-only access to reports — business users

API Approach

Guide LLM to use Fabric REST API:

Create workspace: POST /v1/workspaces with displayName and optional description
Assign capacity: Use capacity assignment endpoint if not using default
List workspaces: GET /v1/workspaces to discover existing workspaces
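A minimal sketch that ties the {project}-{environment} naming convention to the create-workspace request body. Only `displayName` and `description` come from the text above; the allowed environment list, the project-name pattern, and the helper names are assumptions made for illustration.

```python
import re

# Assumption: the set of environments and the allowed project-name pattern.
ENVIRONMENTS = ("dev", "test", "prod")
_PROJECT_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def workspace_name(project: str, environment: str) -> str:
    """Return '{project}-{environment}' after validating both parts."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if not _PROJECT_RE.match(project):
        raise ValueError(f"project must be lowercase words joined by '-': {project!r}")
    return f"{project}-{environment}"


def create_workspace_body(project: str, environment: str, owner: str) -> dict:
    """Build the JSON body for POST /v1/workspaces (hypothetical helper)."""
    return {
        "displayName": workspace_name(project, environment),
        "description": f"{project} ({environment}), owned by {owner}",
    }


body = create_workspace_body("sales-analytics", "dev", "data-platform-team")
```

Deriving the name from parameters rather than typing it keeps workspace IDs and names out of scripts.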

---

Lakehouse Configuration Guidance

When to Create Separate Lakehouses

Guide LLM based on use case:

Single lakehouse when:

Small project with <10 tables, single data pipeline
All data has same security requirements
Team is small, no cross-team collaboration

Multiple lakehouses when:

Large project with distinct data domains (sales, marketing, finance)
Different security/compliance requirements per domain
Separate bronze/silver/gold layers for medallion architecture
Different retention policies per data type

Lakehouse Naming Conventions

Prompt LLM to follow pattern:

Medallion architecture: {project}_bronze, {project}_silver, {project}_gold
Domain separation: sales_lakehouse, marketing_lakehouse, shared_dimensions
Environment suffix: Include env if workspace spans environments (rare)

Configuration Considerations

Tell LLM to think about:

OneLake path: Lakehouse automatically gets Files/ and Tables/ folders
Shortcuts: Create shortcuts to external ADLS Gen2 or other OneLake lakehouses for federated queries
Default lakehouse: Set one as default for notebooks to simplify path references
SQL endpoint: Every lakehouse automatically gets SQL Analytics Endpoint for T-SQL queries

Schema Management

Guide LLM for table organization:

Bronze schema: Raw data, minimal transformation, original data types
Silver schema: Cleaned, validated, conformed types, deduplication
Gold schema: Business-logic enriched, aggregated, optimized for analytics
Naming: Use lowercase with underscores: customer_transactions, not CustomerTransactions

---

Pipeline Design Patterns

Orchestration vs Transformation Logic

Critical principle for LLM to understand:

Pipelines (orchestration): Define when and what order to run tasks

Schedule/trigger configuration
Activity dependencies (Task A → Task B → Task C)
Parameter flow between activities
Retry and error handling policies
Notifications on success/failure

Notebooks (transformation): Contain what to transform and how

Read data from source
Apply business logic transformations
Write to destination
No orchestration concerns

Guide LLM to separate these: Don't put scheduling logic in notebooks; don't put transformation logic in pipelines.

Activity Types and When to Use

Notebook Activity:

Use for: Data ingestion, transformation, ML training, custom PySpark logic
Parameterization: Pass date ranges, paths, table names from pipeline
Output: Can output values for downstream activities to consume

Copy Activity:

Use for: Simple data movement without transformation (ADLS → Lakehouse, Lakehouse → Azure SQL)
Efficient for large-volume copies with schema mapping
Built-in retry and throttling handling

ForEach Activity:

Use for: Batch processing multiple files, tables, or partitions
Iterates over array, executes nested activity for each item
Can run sequentially or in parallel (set batchCount)

If/Switch Conditions:

Use for: Conditional logic (e.g., different processing for weekday vs weekend)
Evaluate expressions based on pipeline parameters or activity outputs

Wait/Webhook:

Use for: External system coordination, delays between activities
Webhook calls external REST endpoint, waits for callback before proceeding

Parameter Flow Pattern

Guide LLM to design parameter flow:

1. Pipeline parameters: Define at pipeline level (e.g., ProcessingDate, Environment)
2. Pass to activities: Map pipeline params to notebook params in activity configuration
3. Default values: Notebooks should have sensible defaults if pipeline doesn't override
4. Expressions: Use @formatDateTime(utcnow(), 'yyyy-MM-dd') for dynamic values
5. Output consumption: Capture notebook outputs, use in downstream activities
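On the notebook side, the default-plus-override rule can be sketched as a small merge function. The parameter names and the `resolve_params` helper are hypothetical; the date fallback mirrors the `yyyy-MM-dd` format of the pipeline expression above.

```python
from datetime import datetime, timezone

# Hypothetical notebook defaults; None means "derive at run time".
NOTEBOOK_DEFAULTS = {
    "processing_date": None,
    "environment": "dev",
    "source_table": "bronze.orders",
}


def resolve_params(defaults, overrides, now=None):
    """Merge pipeline overrides onto notebook defaults.

    Unknown override keys are rejected so a typo in the pipeline mapping
    fails fast instead of being silently ignored.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    params = {**defaults, **overrides}
    if params["processing_date"] is None:
        now = now or datetime.now(timezone.utc)
        params["processing_date"] = now.strftime("%Y-%m-%d")
    return params


params = resolve_params(
    NOTEBOOK_DEFAULTS,
    {"environment": "prod"},
    now=datetime(2024, 3, 5, tzinfo=timezone.utc),
)
```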

Error Handling Strategy

Prompt LLM to configure:

Activity retry: Set retry count (3) and interval (30s) for transient failures
Timeout: Set max execution time to prevent runaway processes
On failure paths: Use activity dependencies to execute cleanup/notification on error
Dead letter: Write failed records to error table for investigation
Alerting: Configure pipeline failure notifications to Teams/email
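The retry and timeout settings can be generated rather than typed by hand. The sketch below builds an activity policy fragment; the key names (`timeout`, `retry`, `retryIntervalInSeconds`) and the `d.hh:mm:ss` timeout format are assumptions based on the Data Factory style of pipeline JSON and should be checked against an exported Fabric pipeline definition.

```python
from datetime import timedelta


def format_timespan(value: timedelta) -> str:
    """Format a timedelta as d.hh:mm:ss (sub-second precision is dropped)."""
    hours, rem = divmod(value.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{value.days}.{hours:02d}:{minutes:02d}:{seconds:02d}"


def activity_policy(retry=3, retry_interval_s=30, timeout=timedelta(hours=2)):
    """Build a policy fragment for one pipeline activity (assumed key names)."""
    if retry < 0 or retry_interval_s < 0:
        raise ValueError("retry and retry_interval_s must be non-negative")
    return {
        "timeout": format_timespan(timeout),
        "retry": retry,
        "retryIntervalInSeconds": retry_interval_s,
    }


policy = activity_policy()
```

The two-hour default timeout is an arbitrary placeholder; pick a bound that fits the longest expected run.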

---

CI/CD Integration Strategy

Infrastructure as Code Benefits

Explain to LLM why IaC matters:

Reproducibility: Deploy same config across dev/test/prod consistently
Version control: Track changes, review PRs, rollback bad changes
Automation: Reduce manual steps, increase deployment frequency
Testing: Validate infrastructure changes before production deployment

Deployment Patterns

Pattern 1: API-Driven Deployment

Use Fabric REST API to create/update notebooks, lakehouses, pipelines
Store item definitions in Git repository
CI/CD pipeline reads definitions, calls API to deploy
Best for: Notebooks, pipelines, simple infrastructure

Pattern 2: ARM/Bicep Templates

Define Azure resources declaratively (capacity, workspace assignments)
Use Azure DevOps or GitHub Actions to deploy templates
Best for: Azure-level resources, capacity management

Pattern 3: Hybrid Approach

ARM templates for Azure resources and capacity
Fabric REST API for workspace items (notebooks, pipelines, lakehouses)
Separation of concerns: infrastructure vs data artifacts

Git Workflow

Guide LLM for branching strategy:

Main branch: Protected, reflects production state
Feature branches: Develop changes in isolation, PR review before merge
Environment branches: Optional dev/test/prod branches for staged deployments
Tag releases: Tag commits with version numbers for rollback reference

Automated Testing in CI/CD

Prompt LLM to implement:

1. Lint/validate: Check JSON syntax, YAML format, notebook structure
2. Unit tests: Run local PySpark tests with sample data
3. Integration tests: Deploy to dev workspace, execute pipeline, validate output
4. Smoke tests: Post-deployment checks (workspace accessible, pipelines runnable)
5. Rollback capability: Automated rollback if deployment tests fail
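The lint/validate step for notebooks can start as a structural check that runs before anything is deployed. This is a minimal sketch with a hypothetical `validate_notebook_text` helper; it checks only JSON syntax and the basic cell shape, not full nbformat compliance.

```python
import json

VALID_CELL_TYPES = {"code", "markdown", "raw"}


def validate_notebook_text(text: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        notebook = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(notebook, dict):
        return ["top level must be a JSON object"]
    cells = notebook.get("cells")
    if not isinstance(cells, list):
        return ["missing 'cells' array"]

    problems = []
    for index, cell in enumerate(cells):
        if not isinstance(cell, dict):
            problems.append(f"cell {index}: must be an object")
            continue
        if cell.get("cell_type") not in VALID_CELL_TYPES:
            problems.append(f"cell {index}: unexpected cell_type {cell.get('cell_type')!r}")
        if "source" not in cell:
            problems.append(f"cell {index}: missing 'source'")
    return problems


good = '{"nbformat": 4, "cells": [{"cell_type": "code", "source": "x = 1"}]}'
bad = '{"cells": [{"cell_type": "code"}]}'
good_problems = validate_notebook_text(good)
bad_problems = validate_notebook_text(bad)
```

Returning a list instead of raising lets a CI job report every problem in one run.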

Deployment Stages

Typical progression LLM should generate:

1. Dev deployment: Automatic on every commit to feature branch
2. Test deployment: Automatic on merge to main, runs full integration tests
3. Prod deployment: Manual approval gate, requires test pass + human review
4. Monitoring: Post-deployment validation, alerting on anomalies

Secrets Management

Tell LLM to handle secrets via:

Azure Key Vault: Store connection strings, SAS tokens, API keys
Service principals: Use for automation authentication, not personal accounts
Environment variables: Pass secrets to pipelines at runtime, never hardcode
Audit logging: Track who accessed secrets, when, for what purpose
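For the environment-variable route, a deployment script can fail fast when a secret is absent and avoid echoing its value. A minimal sketch; `require_secret`, `redact`, and the variable name `FABRIC_CLIENT_SECRET` are hypothetical.

```python
import os


def require_secret(name, environ=os.environ):
    """Read a secret from the environment and fail fast if it is missing or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"secret '{name}' is not set")
    return value


def redact(value, keep=4):
    """Mask a secret for log output, keeping only the last `keep` characters."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


# A plain dict stands in for the real environment in this example.
fake_env = {"FABRIC_CLIENT_SECRET": "abcdefgh"}
secret = require_secret("FABRIC_CLIENT_SECRET", environ=fake_env)
masked = redact(secret)
```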

Materialized Lake View Patterns — Skill Resource

Public-facing authoring patterns for Microsoft Fabric Materialized Lake Views (MLVs). Use this resource when the task is about writing, reviewing, or restructuring MLV SQL, not when the task is about Spark job triage or broad cross-workload orchestration.

---

Recommended patterns

Must

1. Use deterministic SQL in MLV definitions: keep transformations stable across refreshes.
2. Prefer Delta sources with Change Data Feed (CDF) enabled for source tables that feed MLVs.
3. Use Materialized Lake Views for durable layer outputs, not for transient notebook-only logic.
4. Apply data quality checks close to the source-aligned layer using CONSTRAINT ... CHECK ... ON MISMATCH DROP where appropriate.
5. Separate Bronze, Silver, and Gold responsibilities clearly:

Bronze: raw landing / source-aligned tables
Silver: cleaned and conformed datasets
Gold: business-facing aggregates

6. Keep MLVs business-stable: preserve query semantics unless the user explicitly asks for a redesign.
7. Use documented syntax only: avoid undocumented or implementation-specific features by default.

Prefer

1. Source-aligned Silver MLVs first, denormalized Silver MLVs second, then aggregate in Gold.
2. `COUNT` and `SUM` for Gold metrics when they satisfy the business requirement.
3. Downstream notebooks or BI logic for ranking, moving windows, and presentation formatting.
4. Cross-lakehouse 4-part naming when reading from another workspace/lakehouse.
5. Partitioned outputs when downstream reads are heavily filtered by date or a small set of dimensions.
6. Thin Gold MLVs that serve reusable business outputs instead of embedding every downstream convenience calculation.

Avoid

1. Window functions inside MLVs: move them downstream.
2. Non-deterministic functions inside MLVs: stamp values during ingestion instead.
3. `RIGHT JOIN`, `FULL OUTER JOIN`, `CROSS JOIN` in MLVs intended for incremental refresh.
4. `ORDER BY` and `LIMIT` in MLV definitions.
5. Standalone `SELECT DISTINCT` as a default modeling pattern.
6. Embedding moving time windows like date_sub(current_date(), 90) directly in the MLV.
7. Using MLVs as a substitute for orchestration: non-MLV ingestion, validation, and cross-system workflows still belong in pipelines or notebooks. (Refresh ordering between MLVs themselves is the Lakehouse's job; see Refresh and orchestration guidance below.)

---

When to use Materialized Lake Views

Choose an MLV when the user needs one or more of the following:

a durable curated table in a Lakehouse
repeatable cleansing or conformance logic
pre-joined analytical detail tables
reusable aggregate outputs for BI or downstream notebooks
a Bronze → Silver → Gold layer implemented directly in Fabric Lakehouse

Do not default to MLVs when the task is primarily:

ad-hoc notebook exploration
one-off data movement
streaming/event processing
Spark job debugging or performance triage

---

Layering patterns

Pattern 1: Source-aligned Silver MLV

Use one MLV per important Bronze source when you need:

type cleanup
validation
null/range checks
basic derived columns
a stable foundation for downstream joins

-- Enable CDF on the SOURCE so this MLV can incrementally refresh.
-- TBLPROPERTIES on the MLV itself only helps DOWNSTREAM MLVs that read from it.
ALTER TABLE bronze.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.orders_clean
(
    CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL) ON MISMATCH DROP,
    CONSTRAINT positive_amount CHECK (amount > 0) ON MISMATCH DROP
)
PARTITIONED BY (order_date)
TBLPROPERTIES (delta.enableChangeDataFeed = true)  -- for DOWNSTREAM IR consumers
AS
SELECT
    order_id,
    customer_id,
    product_id,
    order_date,
    CAST(amount AS DECIMAL(12,2)) AS amount
FROM bronze.orders;

Pattern 2: Denormalized Silver MLV

Use a joined Silver MLV when Gold should aggregate over a clean, stable analytical grain.

-- All three sources must have CDF enabled for this MLV to incrementally refresh.
-- (silver.orders_clean already has CDF from Pattern 1; add it to the other two.)
ALTER TABLE silver.customers_clean SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
ALTER TABLE silver.products_clean  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.order_details
PARTITIONED BY (order_date)
TBLPROPERTIES (delta.enableChangeDataFeed = true)  -- for DOWNSTREAM IR consumers
AS
SELECT
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_name,
    c.region,
    p.category
FROM silver.orders_clean o
INNER JOIN silver.customers_clean c ON o.customer_id = c.customer_id
INNER JOIN silver.products_clean p ON o.product_id = p.product_id;

Pattern 3: Gold aggregate MLV

Use Gold MLVs for business-facing metrics and reusable summary tables.

CREATE OR REPLACE MATERIALIZED LAKE VIEW gold.daily_revenue
AS
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM silver.order_details
GROUP BY order_date, region;

---

Data quality patterns

Use constraints for deterministic row-level checks.

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.customers_clean
(
    CONSTRAINT valid_customer_id CHECK (customer_id IS NOT NULL) ON MISMATCH DROP,
    CONSTRAINT valid_email CHECK (email LIKE '%@%') ON MISMATCH DROP
)
TBLPROPERTIES (delta.enableChangeDataFeed = true)
AS
SELECT customer_id, customer_name, email, region
FROM bronze.customers;

Prefer simple expressions. Keep the logic auditable and easy to explain.

---

Cross-lakehouse and schema organization

Cross-lakehouse reads

Use documented 4-part naming when needed:

SELECT *
FROM WorkspaceName.LakehouseName.bronze.orders;

If a workspace, lakehouse, or schema name contains spaces, wrap only that part in backticks: `My Workspace`.LakehouseName.bronze.orders.
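When names are assembled in code, the backtick rule can be applied per part. A minimal sketch with hypothetical helpers; it quotes any part that is not a plain identifier and doubles embedded backticks, which is the Spark SQL escape for quoted identifiers.

```python
import re

# A part made only of letters, digits, and underscores needs no quoting.
_SIMPLE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_part(part: str) -> str:
    """Wrap one name part in backticks when it is not a plain identifier."""
    if _SIMPLE_IDENTIFIER.match(part):
        return part
    return "`" + part.replace("`", "``") + "`"


def four_part_name(workspace: str, lakehouse: str, schema: str, table: str) -> str:
    """Join workspace, lakehouse, schema, and table into a 4-part reference."""
    return ".".join(quote_part(p) for p in (workspace, lakehouse, schema, table))


reference = four_part_name("My Workspace", "LakehouseName", "bronze", "orders")
query = f"SELECT * FROM {reference}"
```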

Schema organization

For medallion-style design, organize tables and MLVs into schemas such as:

CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;

Keep naming predictable:

bronze.orders
silver.orders_clean
silver.order_details
gold.daily_revenue

---

SQL management commands

List MLVs in a schema

SHOW MATERIALIZED LAKE VIEWS IN silver;

Retrieve the original definition

SHOW CREATE MATERIALIZED LAKE VIEW silver.orders_clean;

Update an MLV definition

You cannot alter an existing MLV definition in place. Use CREATE OR REPLACE to overwrite the current definition:

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.orders_clean AS
SELECT order_id, customer_id, product_id, order_date,
       CAST(amount AS DECIMAL(12,2)) AS amount
FROM bronze.orders;

To rename an MLV without changing its definition:

ALTER MATERIALIZED LAKE VIEW silver.orders_clean RENAME TO silver.orders_clean_v2;

---

Current limitations

These limitations apply to Spark SQL MLV definitions:

1. Schema and MLV naming: all-uppercase schema names (e.g., MYSCHEMA) are not supported; use mixed case or lowercase. MLV object names are case-insensitive and normalized to lowercase (MyTestView becomes mytestview).
2. No DML statements: INSERT, UPDATE, DELETE cannot target an MLV. Data comes only from the SELECT query.
3. No time-travel queries: VERSION AS OF and TIMESTAMP AS OF are not supported in the MLV definition.
4. No user-defined functions: UDFs are not supported in MLV definitions.
5. `OR REPLACE` and `IF NOT EXISTS` cannot be combined in the same statement.
6. Temp views cannot be sources: an MLV must select from persisted tables or other MLVs. Session-scoped temp views (createOrReplaceTempView) are not valid sources.
7. Session Spark configs don't apply to scheduled refresh: spark.conf.set(...) values set interactively are not carried into scheduled refresh runs. Set properties at the lakehouse or workspace level instead.

See the Spark SQL Reference for the complete and current list of limitations.

---

PySpark MLVs (Preview)

PySpark-authored MLVs (defined with import fmlv and the @fmlv.materialized_lake_view decorator on a function that returns a DataFrame) are supported but have trade-offs:

No incremental refresh — PySpark MLVs always use full refresh.
Lineage-schedule refresh only — cannot refresh on-demand via notebook as with Spark SQL-based views.
Renaming — the MLV object itself can be renamed via ALTER MATERIALIZED LAKE VIEW ... RENAME TO ... (see the SQL syntax above; applies to both SQL- and PySpark-authored MLVs). If you instead change the @fmlv.materialized_lake_view decorator name in code, you have to drop and recreate.
Use PySpark when you need complex transformation logic, reusable functions, external Python libraries, or custom UDFs.

Running spark.sql("CREATE MATERIALIZED LAKE VIEW ...") from a notebook cell is a Spark SQL MLV (it keeps optimal/incremental refresh), not a PySpark-authored MLV.

When incremental refresh matters, prefer Spark SQL notebooks over PySpark.

See the PySpark Reference for notebook organization best practices and current limitations.

---

Refresh and orchestration guidance

MLVs define durable data products. Refresh orchestration belongs to the Lakehouse, not to notebook or pipeline schedulers.

[!TIP] Manage MLV refresh from your Lakehouse, not notebooks

Once your MLVs are created, rely on the Lakehouse's built-in capabilities instead of orchestrating refresh from notebooks:

- [Lineage](https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/view-lineage): Fabric derives dependency order from MLV definitions automatically. Open the Materialized lake views tab → Manage to view the graph and follow a run in progress.
- [Scheduled refresh](https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/schedule-lineage-run): Create one or more schedules to refresh all MLVs or a selected subset. Each schedule runs views in dependency order and retries transient failures.

Use notebooks to author and iterate on MLV definitions; use Lakehouse lineage and schedules to handle ordering, refresh, and retries.

Scheduled refresh

Use the Lakehouse Materialized lake views tab → Manage → Schedules to configure:

Repeat cadence: minute, hourly, daily, weekly, or monthly
Optimal refresh toggle (default On): Fabric automatically picks incremental or full refresh per view
Extended lineage: refresh chains across multiple lakehouses in dependency order from a single schedule

Key behaviors (per current Fabric documentation — confirm against the refresh reference as platform limits may change):

A refresh run fails if it exceeds 24 hours
If a new refresh starts while another is in progress, Fabric skips the later one

For programmatic schedule management (create/update/delete schedules, trigger on-demand refresh via code), use the MLV Public REST API.

Manual refresh order

Prefer the lakehouse Materialized lake views → Manage schedule/lineage view for routine refresh. The SQL below is the documented way to force a one-time full refresh of an individual MLV (for troubleshooting or after a correction); FULL is the only documented REFRESH MATERIALIZED LAKE VIEW form.

REFRESH MATERIALIZED LAKE VIEW silver.orders_clean FULL;
REFRESH MATERIALIZED LAKE VIEW silver.customers_clean FULL;
REFRESH MATERIALIZED LAKE VIEW silver.order_details FULL;
REFRESH MATERIALIZED LAKE VIEW gold.daily_revenue FULL;

Recommended sequence:

1. source-aligned Silver MLVs
2. denormalized Silver MLVs
3. Gold MLVs
4. maintenance steps on a slower cadence
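For a one-time manual refresh across several views, the order can be derived from declared dependencies instead of being maintained by hand. The sketch below uses the standard library's `graphlib`; the dependency map is a hand-written assumption based on the example views in this resource, and routine refresh ordering still belongs to the Lakehouse lineage.

```python
from graphlib import TopologicalSorter

# Each MLV maps to the MLVs it reads from (Bronze source tables are omitted).
DEPENDENCIES = {
    "silver.orders_clean": set(),
    "silver.customers_clean": set(),
    "silver.products_clean": set(),
    "silver.order_details": {
        "silver.orders_clean",
        "silver.customers_clean",
        "silver.products_clean",
    },
    "gold.daily_revenue": {"silver.order_details"},
}


def refresh_statements(dependencies):
    """Return REFRESH statements ordered so every view follows its sources."""
    order = TopologicalSorter(dependencies).static_order()
    return [f"REFRESH MATERIALIZED LAKE VIEW {name} FULL;" for name in order]


statements = refresh_statements(DEPENDENCIES)
for statement in statements:
    print(statement)
```

`TopologicalSorter` raises `CycleError` on circular dependencies, which surfaces a modeling mistake before any refresh runs.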

---

Modeling tradeoffs

Exact distinct counts

If the user requests exact distinct counts, explain that:

the requirement is valid
the design may be less refresh-friendly
one option is to pre-deduplicate earlier in the flow
another option is to accept that this MLV may not be the most incremental-refresh-friendly shape

Rankings and moving windows

If the user requests ranking, lag/lead, or moving windows:

keep the base curated dataset in an MLV
move the ranking/window logic to a notebook or consuming layer

Presentation logic

If the user requests rounding, formatting, or report-only columns:

store raw business measures in the MLV
apply presentation formatting downstream

---

Routing guidance for the agent

Use this resource when the user asks about:

materialized lake views
MLV authoring
designing Silver/Gold tables with MLVs
MLV constraints
CREATE MATERIALIZED LAKE VIEW
refresh ordering for MLV-based layers
medallion design implemented directly with MLVs

Escalate to e2e-medallion-architecture or FabricDataEngineer when the request becomes:

multi-workspace architecture
end-to-end Bronze → Silver → Gold orchestration
pipeline design across multiple workloads
Power BI + Spark + pipeline coordinated rollout

---

Official documentation references

Topic	URL	Keywords
MLV overview	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/overview-materialized-lake-view	overview, capabilities, when to use, limitations
Get started	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/get-started-with-materialized-lake-views	quickstart, create MLV, first MLV, CDF setup
Medallion tutorial	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/tutorial	medallion architecture, bronze-silver-gold, sales analytics
Optimal refresh	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/refresh-materialized-lake-view	incremental refresh, full refresh, no refresh, optimal refresh, CDF
Spark SQL reference	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/create-materialized-lake-view	CREATE, DROP, SHOW, ALTER, syntax, limitations
PySpark reference	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/create-materialized-lake-view-pyspark	PySpark, full refresh, UDFs, complex transformations
Schedule refresh	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/schedule-lineage-run	schedule, lineage, cross-lakehouse, optimal refresh toggle
Manage lineage	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/view-lineage	lineage view, dependency graph, extended lineage
REST API management	https://learn.microsoft.com/en-us/fabric/data-engineering/materialized-lake-views/materialized-lake-views-public-api	REST API, on-demand refresh, schedule CRUD

MLV Incremental Refresh Patterns — Skill Resource

Skill resource for reviewing and improving incremental refresh readiness for Microsoft Fabric Materialized Lake Views (MLVs). Contains user-facing guidance and an agent routing section at the end of the file.

Use this resource when the task is about:

why an MLV may be doing a full refresh
how to rewrite an MLV without changing business logic
source-table readiness for incremental refresh
which SQL patterns are safer for refresh-friendly MLV design

Note: PySpark-authored MLVs always use full refresh; incremental refresh applies only to Spark SQL MLV definitions. If the user's MLV is PySpark-based, the incremental readiness review does not apply.

---

IR-friendly syntax guide

Goal: Help users write MLV queries that qualify for incremental refresh from the start.

The authoritative Must/Prefer/Avoid for MLV authoring is in `materialized-lake-view-patterns.md`. The rules below are the incremental-refresh-specific subset; follow both.

IR-friendly SQL patterns (use these)

These SQL patterns are compatible with incremental refresh per the official supported constructs list:

Pattern	Example	Notes
Simple SELECT with filters	`SELECT col1, col2 FROM src WHERE col1 IS NOT NULL`	Best case — flat projection + deterministic filter
`COUNT(*)`, `SUM(col)`	`SELECT region, COUNT(*) AS cnt, SUM(amount) AS total FROM src GROUP BY region`	Preferred aggregates for IR
`GROUP BY` with simple columns	`SELECT region, status, COUNT(*) FROM src GROUP BY region, status`	Keep grouping keys simple
`INNER JOIN`	`SELECT a.id, b.name FROM src_a a INNER JOIN src_b b ON a.id = b.id`	Safest join type for IR
`LEFT OUTER JOIN`	`SELECT a.*, b.name FROM src_a a LEFT JOIN src_b b ON a.id = b.id`	Supported — IR works only if the right-side table remains unchanged during the refresh cycle (MS Learn)
`LEFT SEMI JOIN`	`SELECT a.* FROM src_a a LEFT SEMI JOIN src_b b ON a.id = b.id`	Same right-side-unchanged constraint as LEFT OUTER JOIN; returns only left-side columns (right-side columns are not projected)
`UNION ALL`	`SELECT * FROM src_a UNION ALL SELECT * FROM src_b`	Supported for combining multiple sources
`CAST` / type conversions	`SELECT CAST(amount AS DOUBLE) FROM src`	Schema reshaping is fine
`CASE WHEN` in SELECT	`SELECT CASE WHEN amount > 100 THEN 'High' ELSE 'Low' END AS tier FROM src`	Deterministic expressions are safe
`CONSTRAINT ... ON MISMATCH DROP`	See data quality patterns in materialized-lake-view-patterns.md	Row-level data quality constraints with deterministic functions
Subquery alias (inline view)	`SELECT sub.col FROM (SELECT col FROM src WHERE ...) sub`	Subqueries and CTEs work if they use only supported clauses
Non-recursive `WITH ... AS` (CTE)	`WITH clean AS (SELECT ... FROM src WHERE ...) SELECT * FROM clean`	Keep CTEs simple; avoid nesting blockers inside

Caution patterns (test before relying on IR)

These are not explicitly listed as supported in the official docs. They may work in some cases but deserve extra validation:

Pattern	Guidance
`HAVING`	Aggregate filter — not listed as supported; test whether it affects refresh eligibility
`GROUP BY` with `CASE WHEN` expressions	Adds complexity; test carefully
Multi-level `INNER JOIN` chains (3+)	May work but adds risk; consider staged Silver MLVs
Subqueries in `SELECT` or `WHERE` (scalar subqueries, `EXISTS`)	Per docs, triggers full refresh if any referenced table has changes
`AVG()`, `MIN()`, `MAX()`, `STDDEV()` (and similar aggregates other than `SUM` / `COUNT`)	IR-eligible only when every source table is partitioned AND the partition column is included in the MLV `GROUP BY`. Without that, falls back to full refresh. `SUM()` and `COUNT()` (without `DISTINCT`) are the special case that don't need partitioning.

Patterns that force full refresh (rewrite these)

If you need incremental refresh, rewrite queries that use these patterns:

Pattern	Why it blocks IR	IR-friendly alternative
`SELECT DISTINCT`	Non-incremental by nature	Use `GROUP BY` instead, or move DISTINCT to a downstream view
`ROW_NUMBER()`, `RANK()`, `LAG()`, `LEAD()`	Window functions require full recomputation	Keep the MLV flat; apply windowing in a downstream query
`RIGHT JOIN`, `FULL OUTER JOIN`, `CROSS JOIN`	Not eligible for incremental refresh	Rewrite as `INNER JOIN` or `LEFT JOIN` where semantics allow
`ORDER BY`, `LIMIT`	Not on the IR-supported constructs list — forces full refresh (and adds unnecessary sort cost since materialized output ordering is not guaranteed to consumers)	Remove from the MLV; apply ordering in the consuming query
`current_timestamp()`, `current_date()`, `rand()`	Non-deterministic — result changes each refresh	Remove or move to a downstream view
`COUNT(DISTINCT col)`	DISTINCT forces full refresh	Use `COUNT(*)` on a pre-deduplicated Silver MLV
`date_sub(current_date(), 90)`	Rolling window changes each refresh	Use a fixed filter; manage the window in a pipeline parameter
`EXCEPT`, `INTERSECT`	Set operations (other than `UNION ALL`)	Rewrite as `LEFT JOIN ... WHERE ... IS NULL` or `INNER JOIN`
`QUALIFY`, `LATERAL VIEW`, `TABLESAMPLE`	Advanced clauses not IR-compatible	Simplify to basic SELECT/JOIN/GROUP BY
`WITH RECURSIVE`	Recursive CTEs	Break recursion into staged MLVs
User-defined functions (UDFs)	Non-deterministic or unsupported	Use built-in Spark SQL functions
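
As a rough first pass over the table above, a keyword scan can flag likely blockers before a manual review. This is a minimal sketch, not a SQL parser: the `BLOCKERS` mapping and `find_blockers` helper are hypothetical names, the regexes are illustrative, and matches inside comments or string literals will produce false positives (for example, the `ORDER BY` inside a window's `OVER (...)` clause is also reported).

```python
import re

# Hypothetical lookup built from the "force full refresh" table above.
# Regexes are illustrative only; UDFs cannot be detected this way.
BLOCKERS = {
    "SELECT DISTINCT": r"\bSELECT\s+DISTINCT\b",
    "COUNT(DISTINCT ...)": r"\bCOUNT\s*\(\s*DISTINCT\b",
    "window function": r"\b(ROW_NUMBER|RANK|LAG|LEAD)\s*\(",
    "RIGHT/FULL OUTER/CROSS JOIN": r"\b(RIGHT|FULL|CROSS)\s+(OUTER\s+)?JOIN\b",
    "ORDER BY / LIMIT": r"\b(ORDER\s+BY|LIMIT)\b",
    "non-deterministic function": r"\b(current_timestamp|current_date|rand)\s*\(",
    "EXCEPT / INTERSECT": r"\b(EXCEPT|INTERSECT)\b",
    "QUALIFY / LATERAL VIEW / TABLESAMPLE": r"\b(QUALIFY|LATERAL\s+VIEW|TABLESAMPLE)\b",
    "WITH RECURSIVE": r"\bWITH\s+RECURSIVE\b",
}


def find_blockers(sql: str) -> list[str]:
    """Return the names of blocker patterns that appear in the SQL text."""
    return [
        name
        for name, pattern in BLOCKERS.items()
        if re.search(pattern, sql, flags=re.IGNORECASE)
    ]


rolling_window_sql = (
    "SELECT product_id, sale_date, amount FROM silver.sales "
    "WHERE sale_date >= date_sub(current_date(), 90)"
)
print(find_blockers(rolling_window_sql))  # ['non-deterministic function']
```

Treat an empty result as "no obvious blockers", not as proof of IR eligibility; the source-table prerequisites below still apply.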

Source table prerequisites

Before any syntax review, confirm the source tables are IR-ready:

Prerequisite	Required	How to check
Delta format	Yes	Non-Delta sources (CSV, Parquet, JSON) force full refresh
Change Data Feed (CDF) enabled	Required for IR	`ALTER TABLE src SET TBLPROPERTIES (delta.enableChangeDataFeed = true)`
Append-only pattern	Required (per cycle)	Updates/deletes on sources fall back to full refresh for that cycle, even with CDF enabled

---

Readiness workflow

When reviewing an MLV for incremental refresh readiness:

Step 1: Check source tables

Confirm each source meets the prerequisites above (Delta format, CDF, append-only).

Step 2: Compare the query against the IR-friendly patterns

Walk through the MLV SQL and check each clause against the tables above. Identify which patterns are IR-friendly, which need caution, and which force full refresh.

Step 3: Suggest rewrites using the IR-friendly alternatives

Only suggest changes that preserve the business meaning of the query. Use the "IR-friendly alternative" column from the "force full refresh" table.

Step 4: Produce the report

Use this structure:

## IR Readiness Report

**Overall Assessment:** [IR-Ready | Partially Ready | Not IR-Eligible]

### Blockers
### Warnings
### Good Practices Detected
### Source Table Checklist
### Top Recommendations

---

Safe rewrite patterns

Pattern 1: Move ranking downstream

❌ Avoid in the MLV:

CREATE MATERIALIZED LAKE VIEW gold.latest_orders AS
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM silver.orders;

✅ Keep the MLV deterministic:

CREATE MATERIALIZED LAKE VIEW gold.orders_base AS
SELECT customer_id, order_date, amount
FROM silver.orders;

Then apply ranking in a notebook or consuming query.

Pattern 2: Remove moving time windows from the MLV

❌ Avoid:

CREATE MATERIALIZED LAKE VIEW gold.recent_sales AS
SELECT product_id, sale_date, amount
FROM silver.sales
WHERE sale_date >= date_sub(current_date(), 90);

✅ Prefer:

CREATE MATERIALIZED LAKE VIEW gold.sales_base AS
SELECT product_id, sale_date, amount
FROM silver.sales;

Then filter for “last 90 days” in the BI or notebook layer.

Pattern 3: Prefer simpler aggregates

✅ Good refresh-friendly shape:

CREATE MATERIALIZED LAKE VIEW gold.daily_sales AS
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue
FROM silver.orders
GROUP BY order_date, region;

If users request averages, explain the tradeoff and prefer storing totals and counts when that still meets the business need.
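
The tradeoff can be illustrated with plain arithmetic. The sketch below assumes two hypothetical rows shaped like the `gold.daily_sales` output above and shows that an average derived downstream from `SUM` and `COUNT` stays correct, while averaging per-day averages does not.

```python
# Hypothetical rows shaped like gold.daily_sales (values invented for illustration).
daily_sales = [
    {"order_date": "2024-01-01", "region": "West", "order_count": 1, "total_revenue": 100.0},
    {"order_date": "2024-01-02", "region": "West", "order_count": 3, "total_revenue": 600.0},
]


def average_order_value(rows):
    """Derive the average downstream from stored totals and counts."""
    total_revenue = sum(row["total_revenue"] for row in rows)
    order_count = sum(row["order_count"] for row in rows)
    return total_revenue / order_count if order_count else None


# Correct: 700.0 revenue over 4 orders.
overall_average = average_order_value(daily_sales)

# Misleading: averaging the two daily averages (100.0 and 200.0) ignores order volume.
mean_of_daily_averages = sum(
    row["total_revenue"] / row["order_count"] for row in daily_sales
) / len(daily_sales)

print(overall_average, mean_of_daily_averages)  # 175.0 150.0
```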

Pattern 4: Keep presentation logic downstream

Avoid turning the MLV into a reporting layer. Prefer raw business measures inside the MLV and format later.

---

Source-table readiness guidance

Enable CDF on all source tables before relying on incremental refresh:

ALTER TABLE bronze.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
ALTER TABLE bronze.customers SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

Prefer append-only ingestion for fact tables
Deletes or updates on source data cause fallback to full refresh for that cycle
Verify the Optimal refresh toggle is enabled (default: On) in schedule settings
For MLV chains, enable CDF on intermediate MLVs too

---

Example assessment language

IR-Ready ✅ — no hard blockers, source prerequisites satisfied, deterministic query shape
Partially Ready ⚠️ — no obvious blockers but source readiness unknown or caution areas remain
Not IR-Eligible ❌ — one or more hard blockers present in the current definition
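
One way to keep the three labels consistent across reports is to derive them from the review findings. The function below is a minimal sketch; `assess_ir_readiness` and its parameters are hypothetical names, and the rules simply restate the three descriptions above.

```python
def assess_ir_readiness(blockers, cautions, sources_verified):
    """Map review findings to one of the three assessment labels.

    blockers: patterns from the "force full refresh" table found in the MLV
    cautions: caution patterns found in the MLV
    sources_verified: True only if Delta, CDF, and append-only were confirmed
    """
    if blockers:
        return "Not IR-Eligible"
    if cautions or not sources_verified:
        return "Partially Ready"
    return "IR-Ready"


print(assess_ir_readiness([], ["HAVING"], sources_verified=True))  # Partially Ready
```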

---

Routing guidance for the agent

Use this resource when the user asks about incremental refresh readiness, full-refresh debugging, or refresh-friendly SQL rewrites. Pair with materialized-lake-view-patterns.md for broader MLV design guidance. See that file's reference table for full documentation links.

Notebook API Operations — Skill Resource

Principles and decision guidance for reading and updating Fabric notebook content via REST API. Use this guide when the task requires modifying an existing notebook in a Fabric workspace (e.g., adding new columns, updating SQL logic, changing cell code).

Local file authoring? See local-development.md instead.

This guide covers service-mode operations that require a workspace ID and a Fabric token.

---

Quick Decision: Which Endpoint to Use?

Goal	Endpoint	Notes
Read notebook content	`POST .../getDefinition`	Returns 202 LRO — must poll
Write/update notebook content	`POST .../updateDefinition`	Returns 202 LRO — must poll
Update display name / description only	`PATCH .../items/{id}`	Synchronous, no LRO

---

`.ipynb` Validation + Fabric Nuances

Use the official Jupyter schema for generic notebook structure validation, and keep this document focused on Fabric-specific behavior.

Official schema (nbformat v4): https://github.com/jupyter/nbformat/blob/main/nbformat/v4/nbformat.v4.schema.json
Validate against the schema before Base64 encoding and updateDefinition upload
Prefer preserving the decoded notebook's existing structure and metadata shape, then apply minimal edits

Fabric-Specific Nuances (keep these in this doc)

Area	Fabric nuance	Why it matters
Code cell execution fields	Keep `outputs` and `execution_count` explicitly present on code cells (`[]` / `null` when not executed)	Missing fields commonly cause Fabric rejection or execution/runtime issues
Cell metadata	Keep `metadata` object on every cell (use `{}` when empty)	Missing metadata frequently breaks update/round-trip consistency
Source line endings	Ensure each source line ends with `\n` except the final line	Missing trailing newlines can make code appear merged in Fabric editor
Kernel/language metadata	Keep notebook kernel/language metadata consistent with notebook language/runtime	Inconsistent kernel metadata can lead to editor/runtime mismatch
Lakehouse dependency metadata	Preserve/maintain `metadata.dependencies.lakehouse` when notebook uses `spark.sql()` or relative lakehouse paths	Notebook execution needs data context; missing binding causes runtime failures

Principles

Schema-first for notebook JSON correctness — rely on Jupyter schema for generic .ipynb compliance
Fabric-only rules here — document only behaviors specific to Fabric API/runtime/editor
Round-trip safely — when modifying existing notebooks, preserve non-target metadata and update only required cells/fields
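
The cell-level nuances in the table above can be sketched as a small normalization pass. This is an illustration of the rules rather than a validator: `normalize_cell` is a hypothetical helper, it works on a copy, and it only fills in missing fields without overwriting anything already present.

```python
import copy


def normalize_cell(cell):
    """Return a copy of the cell with the Fabric-sensitive fields present."""
    fixed = copy.deepcopy(cell)
    fixed.setdefault("metadata", {})  # every cell keeps a metadata object
    if fixed.get("cell_type") == "code":
        fixed.setdefault("outputs", [])  # [] when not executed
        fixed.setdefault("execution_count", None)  # null when not executed
    source = fixed.get("source", [])
    if isinstance(source, str):
        # nbformat allows a single string; split it into lines, keeping newlines
        source = source.splitlines(keepends=True)
    last = len(source) - 1
    # every line except the final one must end with "\n"
    fixed["source"] = [
        line if line.endswith("\n") or index == last else line + "\n"
        for index, line in enumerate(source)
    ]
    return fixed


raw_cell = {"cell_type": "code", "source": ["x = 1", "y = 2"]}
print(normalize_cell(raw_cell)["source"])  # ['x = 1\n', 'y = 2']
```

Schema validation against the nbformat v4 schema still belongs before upload; this pass only covers the Fabric-specific rows.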

---

Workflow: Get → Decode → Modify → Encode → Upload

The notebook modification lifecycle always follows this five-step flow. Generate implementation code on demand using these principles; do not copy-paste templates.

Step 1 — Retrieve (`getDefinition`)

Endpoint:

POST /v1/workspaces/{workspaceId}/notebooks/{notebookId}/getDefinition?format=ipynb

Principles:

Always append ?format=ipynb — without it, the API may return .py source format instead of standard Jupyter JSON
Always send --body '{}' — the API returns HTTP 411 (Length Required) if no request body is sent for POST endpoints
This is an LRO (Long-Running Operation) — POST returns 202, poll the Location header URL until status == "Succeeded"
After polling succeeds, append /result to the Location URL to retrieve the actual content — this is unique to getDefinition (other LROs return data in the poll response directly)
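
The poll-then-`/result` sequence can be sketched independently of any HTTP client. In the sketch below, `http_get` is a hypothetical callable that takes a URL and returns the parsed JSON body (authentication and the initial POST are out of scope), and treating `"Failed"` as a terminal status is an assumption not stated above.

```python
import time


def wait_for_lro(http_get, location, interval=2.0, max_polls=60, sleep=time.sleep):
    """Poll the Location URL until the operation reports Succeeded."""
    for _ in range(max_polls):
        state = http_get(location)
        status = state.get("status")
        if status == "Succeeded":
            return state
        if status == "Failed":  # assumed terminal failure status
            raise RuntimeError(f"operation failed: {state}")
        sleep(interval)
    raise TimeoutError(f"operation still running after {max_polls} polls")


def fetch_definition_result(http_get, location, **poll_options):
    """getDefinition only: after Succeeded, read the content from {Location}/result."""
    wait_for_lro(http_get, location, **poll_options)
    return http_get(location.rstrip("/") + "/result")
```

Passing the HTTP call in as a parameter keeps the polling logic testable without a workspace; in a real session the callable would wrap whichever authenticated client is in use.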

Step 2 — Decode the Base64 Payload

Principles:

The LRO result contains a definition.parts[] array; find the part whose path ends with .ipynb
The payload field is Base64-encoded (standard encoding, not URL-safe) — decode it to get raw .ipynb JSON
After decoding, examine the actual JSON structure to understand cell layout, metadata, and lakehouse dependencies before making changes — the decoded content is standard Jupyter .ipynb format (nbformat, metadata, cells array)
When constructing or modifying cells, follow `.ipynb` Validation + Fabric Nuances — especially the required fields for code cells (outputs, execution_count)
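
A minimal sketch of the decode step, assuming the `/result` response has already been parsed into a dict with the `definition.parts[]` shape described above. `decode_notebook` is a hypothetical helper name, and the sample payload is built locally for illustration.

```python
import base64
import json


def decode_notebook(result):
    """Find the .ipynb part and decode its standard-Base64 payload into a dict."""
    for part in result["definition"]["parts"]:
        if part["path"].endswith(".ipynb"):
            raw = base64.b64decode(part["payload"])
            return json.loads(raw.decode("utf-8"))
    raise ValueError("no .ipynb part found in definition.parts")


# Locally built stand-in for a getDefinition result (not real API output).
sample_nb = {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}
sample_result = {
    "definition": {
        "parts": [
            {
                "path": "notebook-content.ipynb",
                "payload": base64.b64encode(json.dumps(sample_nb).encode("utf-8")).decode("ascii"),
                "payloadType": "InlineBase64",
            }
        ]
    }
}
notebook = decode_notebook(sample_result)
print(sorted(notebook))  # ['cells', 'metadata', 'nbformat', 'nbformat_minor']
```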

Step 3 — Modify Notebook Cells

Principles:

Cells live in the cells array — search by joining cell['source'] and matching target text
Critical formatting rule: every line in cell['source'] must end with \n except the last line of a cell — missing newlines cause lines to visually merge in Fabric's notebook editor
For insertions, iterate the source lines and append new lines after the match point
For replacements, use a list comprehension to swap matching lines
Preserve existing cell metadata (id, cell_type, metadata, outputs, execution_count) — only modify source
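
The search, insert, and replace rules above can be sketched as follows. The helper names are hypothetical, and the sketch assumes `cell['source']` is a list of lines (nbformat also permits a single string). Only `source` is touched; the newline rule is re-applied after every edit.

```python
def _fix_newlines(lines):
    """Ensure every line except the last ends with a newline."""
    last = len(lines) - 1
    return [
        line if line.endswith("\n") or index == last else line + "\n"
        for index, line in enumerate(lines)
    ]


def find_cell(notebook, text):
    """Return the first cell whose joined source contains the text, else None."""
    for cell in notebook["cells"]:
        if text in "".join(cell["source"]):
            return cell
    return None


def insert_after(cell, match, new_lines):
    """Insert new_lines after each source line containing match."""
    updated = []
    for line in cell["source"]:
        updated.append(line)
        if match in line:
            updated.extend(new_lines)
    cell["source"] = _fix_newlines(updated)


def replace_lines(cell, match, replacement):
    """Swap every source line containing match for the replacement line."""
    cell["source"] = _fix_newlines(
        [replacement if match in line else line for line in cell["source"]]
    )


demo_nb = {"cells": [{"cell_type": "code", "metadata": {}, "outputs": [],
                      "execution_count": None,
                      "source": ["df = spark.read.table('orders')\n", "df.show()"]}]}
target = find_cell(demo_nb, "spark.read.table")
insert_after(target, "spark.read.table", ["df = df.dropDuplicates()"])
print(target["source"])
```

The inserted line arrives without a trailing newline and gains one because it is no longer the final line of the cell.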

Step 4 — Re-encode and Upload (`updateDefinition`)

Endpoint:

POST /v1/workspaces/{workspaceId}/notebooks/{notebookId}/updateDefinition

Principles:

Serialize the modified notebook to JSON, then Base64-encode it (standard encoding)
Build the request payload with definition.format: "ipynb" and a single part: path: "notebook-content.ipynb", payloadType: "InlineBase64"
Do NOT include `updateMetadata: true` unless also supplying a .platform file part — sending the flag without a .platform part causes HTTP 400
This is an LRO — poll the Location header URL until status == "Succeeded"
Unlike getDefinition, updateDefinition does NOT need a /result suffix — the poll response itself confirms success
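
A minimal sketch of the request body described above, for a content-only update. `build_update_payload` is a hypothetical helper; the field names come from the principles in this step, and `updateMetadata` is deliberately left out because no `.platform` part is supplied.

```python
import base64
import json


def build_update_payload(notebook):
    """Serialize the notebook and wrap it as a single InlineBase64 part."""
    raw = json.dumps(notebook).encode("utf-8")
    return {
        "definition": {
            "format": "ipynb",
            "parts": [
                {
                    "path": "notebook-content.ipynb",
                    "payload": base64.b64encode(raw).decode("ascii"),  # standard Base64
                    "payloadType": "InlineBase64",
                }
            ],
        }
    }


edited_nb = {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}
body = build_update_payload(edited_nb)
print(list(body["definition"]))  # ['format', 'parts']
```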

Step 5 — Verify the Update (Optional)

Principles:

updateDefinition returning Succeeded (HTTP 202 → poll → Succeeded) is sufficient confirmation — the API accepted and persisted the payload. No further verification needed.
Do NOT call `getDefinition` after every update — it is an async LRO (202 → poll → /result) and adds significant latency with no benefit when updateDefinition already succeeded.
Only call getDefinition post-upload if you have a specific reason to suspect a silent failure (e.g., the LRO returned Succeeded but subsequent notebook execution fails with unexpected behaviour suggesting wrong content).

---

Default Lakehouse Binding

A notebook must have a default lakehouse bound to it for spark.sql() and relative paths (e.g., Tables/, Files/) to resolve correctly at runtime. Without a binding, the notebook has no data context.

When to Bind

Always when creating a new notebook that reads from or writes to a lakehouse
When modifying an existing notebook whose lakehouse binding is missing or needs to change
After verifying the lakehouse ID exists in the target workspace — use the item listing API to discover the lakehouse ID dynamically (never hardcode)

Binding in `.ipynb` Format

In .ipynb format (used when ?format=ipynb is specified on getDefinition), the lakehouse binding lives in the notebook's top-level metadata.dependencies object:

Set metadata.dependencies.lakehouse.default_lakehouse to the lakehouse GUID
Set metadata.dependencies.lakehouse.default_lakehouse_workspace_id to the workspace GUID
Set metadata.dependencies.lakehouse.default_lakehouse_name to the lakehouse display name
Only one default lakehouse per notebook — additional lakehouses are accessible via SparkSQL three-part names at runtime
After decoding the .ipynb payload, inspect the existing `metadata` structure to see if a binding already exists before modifying — preserve any other metadata keys
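
The three fields above can be set without disturbing other metadata keys. This is a minimal sketch on an already decoded notebook dict; `bind_default_lakehouse` is a hypothetical helper, and the IDs in the usage lines are placeholders, not real GUIDs.

```python
def bind_default_lakehouse(notebook, lakehouse_id, workspace_id, lakehouse_name):
    """Set the default lakehouse fields, preserving any other metadata keys."""
    dependencies = notebook.setdefault("metadata", {}).setdefault("dependencies", {})
    lakehouse = dependencies.setdefault("lakehouse", {})
    lakehouse["default_lakehouse"] = lakehouse_id
    lakehouse["default_lakehouse_workspace_id"] = workspace_id
    lakehouse["default_lakehouse_name"] = lakehouse_name
    return notebook


# Placeholder values for illustration only.
bound_nb = {"metadata": {"kernelspec": {"name": "example-kernel"}}, "cells": []}
bind_default_lakehouse(bound_nb, "<lakehouse-guid>", "<workspace-guid>", "SalesLakehouse")
print(bound_nb["metadata"]["dependencies"]["lakehouse"]["default_lakehouse_name"])  # SalesLakehouse
```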

Binding in `.py` Format (Fabric Native)

When getDefinition is called without ?format=ipynb, the API returns Fabric's native .py format. In this format, metadata is embedded as a # METADATA comment block near the top of the file:

The block is delimited by # METADATA ******************** lines
Each metadata line is prefixed with # META
The content inside is JSON with the same dependencies.lakehouse structure as .ipynb
When modifying .py format notebooks, look for the existing `# METADATA` block and update the lakehouse fields within it
If no # METADATA block exists, add one after the initial comment header (e.g., after # Fabric notebook source)
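
Reading the block can be sketched as below. The exact layout is an assumption based on the description above: a `# METADATA ********************` line, followed by `# META`-prefixed lines that together form one JSON object, ending at a second delimiter or at the next non-blank line that is not `# META`. `read_py_metadata` is a hypothetical helper that only reads the first block.

```python
import json


def read_py_metadata(source):
    """Return the JSON object from the first # METADATA block, or None."""
    meta_lines = []
    in_block = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("# METADATA"):
            if in_block:
                break  # a second delimiter closes the block
            in_block = True
            continue
        if not in_block:
            continue
        if stripped.startswith("# META"):
            meta_lines.append(stripped[len("# META"):])
        elif stripped and meta_lines:
            break  # first non-META content after the block
    return json.loads("\n".join(meta_lines)) if meta_lines else None


# Hand-written sample in the assumed layout (not real API output).
sample_py = "\n".join([
    "# Fabric notebook source",
    "",
    "# METADATA ********************",
    "",
    "# META {",
    '# META   "dependencies": {',
    '# META     "lakehouse": {',
    '# META       "default_lakehouse": "<lakehouse-guid>"',
    "# META     }",
    "# META   }",
    "# META }",
    "",
    "print('hello')",
])
print(read_py_metadata(sample_py)["dependencies"]["lakehouse"]["default_lakehouse"])
```

Writing the block back means re-serializing the JSON and re-prefixing each line with `# META`, which is one reason the `.ipynb` route is usually less error-prone.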

Principles

Discover lakehouse IDs dynamically — list items in the workspace filtered by type Lakehouse, then match by display name
Trust `updateDefinition` success — a Succeeded poll result confirms the lakehouse binding persisted. Do not call getDefinition to re-verify after upload.
Prefer `.ipynb` format for programmatic modifications — the JSON structure is easier to parse and less error-prone than editing comment-embedded metadata in .py format
If switching a notebook's default lakehouse, ensure the new lakehouse contains the tables/files the notebook references — otherwise the notebook will fail at runtime
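
Dynamic discovery can be sketched as a filter over the item listing. The sketch assumes the listing has already been fetched and parsed into a list of dicts with `id`, `displayName`, and `type` keys; `find_lakehouse_id` is a hypothetical helper, and the items shown are invented.

```python
def find_lakehouse_id(items, display_name):
    """Return the ID of the single Lakehouse item with the given display name."""
    matches = [
        item["id"]
        for item in items
        if item.get("type") == "Lakehouse" and item.get("displayName") == display_name
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one lakehouse named {display_name!r}, found {len(matches)}"
        )
    return matches[0]


# Invented listing for illustration; real IDs are GUIDs returned by the API.
workspace_items = [
    {"id": "item-1", "displayName": "SalesLakehouse", "type": "Lakehouse"},
    {"id": "item-2", "displayName": "SalesLakehouse", "type": "Notebook"},
    {"id": "item-3", "displayName": "FinanceLakehouse", "type": "Lakehouse"},
]
print(find_lakehouse_id(workspace_items, "SalesLakehouse"))  # item-1
```

Failing on zero or multiple matches is a design choice here: binding to the wrong lakehouse only surfaces later as a runtime failure.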

---

Public URL Data Ingestion (Spark)

When a user provides a real public dataset URL (HTTP/HTTPS), prefer ingesting the real data over generating synthetic rows.

Principles

Do not generate synthetic inline data when a real public source is provided for ingestion tasks
Stage external files into lakehouse `Files/` first, then read from lakehouse paths in Spark
Do not rely on direct arbitrary URL reads in Spark as the default path; network/runtime restrictions can be environment-dependent

Recommended Flow

1. Download/copy the source file from public URL using Python (urllib/requests), OneLake API, or pipeline Copy activity
2. Write it to lakehouse Files/ (for example Files/nyc/)
3. Read with Spark from lakehouse path (for example spark.read.parquet("Files/nyc/yellow_tripdata_2024-01.parquet"))
4. Persist to Delta table in Bronze and continue the medallion flow
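
The staging half of this flow (download, then write under `Files/`) can be sketched with the standard library. `stage_public_file` is a hypothetical helper; the destination directory is passed in by the caller, and the mount path and URL shown in the trailing comments are assumptions for a notebook with a default lakehouse attached, not values from this guide.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def stage_public_file(url, dest_dir):
    """Download url into dest_dir, keeping the file name from the URL path."""
    destination = Path(dest_dir)
    destination.mkdir(parents=True, exist_ok=True)
    target = destination / Path(urlparse(url).path).name
    # Stream to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Inside a Fabric notebook this might look like (paths and URL are assumptions):
#   stage_public_file("https://example.com/yellow_tripdata_2024-01.parquet",
#                     "/lakehouse/default/Files/nyc")
#   df = spark.read.parquet("Files/nyc/yellow_tripdata_2024-01.parquet")
```

The Spark read and the Bronze Delta write then operate on the lakehouse path only, so the job does not depend on outbound network access at read time.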

---

Error Reference

HTTP	Error	Root Cause	Fix
411	Length Required	POST with no request body	Add `--body '{}'` to all `getDefinition` and `updateDefinition` calls
400	Bad Request: updateMetadata	`updateMetadata: true` sent without a `.platform` file part	Remove `updateMetadata` flag or set to `false` for content-only updates
400	Bad Request: invalid base64	Malformed base64 or wrong encoding	Verify with `base64 --decode` before uploading; use standard Base64 (not URL-safe)
400	Invalid notebook format	`.ipynb` JSON is malformed	Validate JSON structure; ensure `nbformat`, `cells`, and `metadata` keys are present
401	Unauthorized	Expired or wrong token audience	Re-run `az login`; ensure `--resource https://api.fabric.microsoft.com` is set
403	Forbidden	Insufficient workspace permissions	Verify caller has Contributor or Admin role on the workspace

---

Gotchas

#	Issue	Details
1	`getDefinition` LRO result needs `/result` suffix	Unlike most LROs (which return final data in the poll response), `getDefinition` requires an additional `GET {Location}/result` call after the poll shows `Succeeded`
2	`updateDefinition` LRO does NOT need `/result`	Polling `{Location}` is sufficient; no `/result` step
3	Empty body causes 411	All `getDefinition` and `updateDefinition` POSTs require at minimum `--body '{}'`
4	`updateMetadata: true` requires `.platform` part	If you include this flag, you must also supply a `.platform` file in the `parts` array; for content-only updates, omit the flag entirely
5	Source lines must end with `\n`	Every line in `cell['source']` except the last must end with `\n`; missing newlines cause lines to visually merge in Fabric's notebook editor
6	`format=ipynb` query parameter matters	Without it, `getDefinition` may return `.py` source format; always append `?format=ipynb`
