
Spark Authoring CLI
Author production-grade PySpark lakehouse jobs in Microsoft Fabric with Delta Lake, schemas, and ingestion guardrails.
Overview
Spark Authoring CLI is an agent skill, most often used in the Build phase (and also in Operate), that teaches PySpark lakehouse patterns for Microsoft Fabric: Delta Lake, explicit schemas, MERGE, and ingestion discipline.
Install
npx skills add https://github.com/microsoft/skills-for-fabric --skill spark-authoring-cli
What is this skill?
- Eight must-follow patterns including explicit schemas, Delta Lake ACID tables, and boundary data-quality checks
- Metadata columns for lineage: ingestion_timestamp, source_system, pipeline_run_id
- MERGE INTO upserts, partitioning for pruning, and graceful try-except around ingestion
- Prefer batch over streaming, window functions over self-joins, and real URL ingest via lakehouse Files/ paths
- Discourages inferSchema=true and synthetic inline rows when a public source URL is provided
- 8 must-follow data engineering patterns for Fabric PySpark
- Additional prefer-tier guidance on batch vs streaming and write file sizing
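As a rough illustration of the MERGE INTO upsert pattern in the list above, the sketch below builds a Delta Lake `MERGE INTO` statement from a list of merge keys. The helper name `build_merge_sql`, the table names, and the key columns are hypothetical and not part of the skill. In a Fabric notebook, the resulting string would typically be passed to `spark.sql(...)`.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, merge_keys: Sequence[str]) -> str:
    """Build a Delta Lake MERGE INTO statement that upserts `source` into `target`.

    Hypothetical helper for illustration. Identifiers are interpolated as-is,
    so only pass trusted table and column names.
    """
    if not merge_keys:
        raise ValueError("at least one merge key is required")
    # Match rows on every merge key, e.g. "t.order_id = s.order_id".
    on_clause = " AND ".join(f"t.{key} = s.{key}" for key in merge_keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Example with assumed table and column names.
merge_sql = build_merge_sql("sales_orders", "staged_orders", ["order_id", "order_date"])
print(merge_sql)
```

`UPDATE SET *` and `INSERT *` assume the source and target share the same column names; a real job may need explicit column lists instead.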
Adoption & trust: 68 installs on skills.sh; 427 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Ad-hoc PySpark with inferred schemas and no quality gates creates brittle Fabric pipelines that are hard to debug in production.
Who is it for?
Indie data builders shipping Fabric lakehouse tables and ETL who want checklist-level conventions without relearning Spark ops each sprint.
Skip if: One-off exploratory notebooks where inferSchema and tiny synthetic samples are acceptable and ops guarantees do not matter.
When should I use this skill?
Authoring or reviewing PySpark data engineering notebooks and jobs in Microsoft Fabric lakehouses.
What do I get? / Deliverables
You implement Fabric jobs aligned to the eight must-follow patterns and the preferred performance choices, with traceable metadata and Delta-native upserts.
- Ingestion and transformation code following documented must/prefer patterns
- Delta-managed tables with lineage metadata columns
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Lakehouse pipelines are primary Build work for data-backed products and analytics features. Spark authoring is backend data engineering rather than UI or agent tooling.
Where it fits
Define explicit StructType and Delta writes when onboarding a new SaaS export into the lakehouse.
Add null and business-rule checks at ingestion boundaries before promoting a notebook to scheduled pipeline.
Tune partition columns and MERGE keys after query latency regressions on a large fact table.
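The null and business-rule checks mentioned above could be expressed as SQL predicates and applied with `DataFrame.filter`, which accepts a SQL expression string. This is a minimal sketch, not the skill's own implementation: the helper `build_quality_predicates`, the column names, and the rule are assumptions for illustration.

```python
from typing import Sequence, Tuple


def build_quality_predicates(
    required_columns: Sequence[str], business_rules: Sequence[str]
) -> Tuple[str, str]:
    """Return (valid_predicate, invalid_predicate) as Spark SQL expression strings.

    Hypothetical helper for illustration; inputs are assumed to be trusted.
    """
    # Completeness checks first, then each business rule in parentheses.
    checks = [f"{col} IS NOT NULL" for col in required_columns]
    checks += [f"({rule})" for rule in business_rules]
    if not checks:
        raise ValueError("at least one check is required")
    valid = " AND ".join(checks)
    # coalesce(..., false) treats a NULL result as "not valid", so every row
    # matches exactly one of the two predicates.
    invalid = f"NOT coalesce({valid}, false)"
    return valid, invalid


valid_expr, invalid_expr = build_quality_predicates(
    required_columns=["order_id", "amount"],
    business_rules=["amount > 0"],
)
print(valid_expr)  # order_id IS NOT NULL AND amount IS NOT NULL AND (amount > 0)
print(invalid_expr)
```

In a notebook, `df.filter(valid_expr)` would keep the rows to persist, and `df.filter(invalid_expr)` would select the rows to route to a quarantine table for review.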
How it compares
Pattern playbook for Fabric Spark authoring, not a CLI that deploys workspaces or runs pipelines by itself.
Common Questions / FAQ
Who is spark-authoring-cli for?
Developers and solo builders writing PySpark in Microsoft Fabric who need production ingestion, Delta Lake, and incremental update conventions.
When should I use spark-authoring-cli?
In Build when designing lakehouse ingestion and transforms; in Ship or Operate when revisiting partitioning, MERGE keys, or error recovery before scale-up.
Is spark-authoring-cli safe to install?
It is documentation-style patterns without embedded secrets; confirm Fabric workspace access and review Security Audits on this Prism page before running jobs on production data.
SKILL.md
# Data Engineering Patterns: Skill Resource

Essential patterns and principles for PySpark data engineering in Microsoft Fabric.

## Recommended patterns

### Must

1. **Always define explicit schemas** for production data ingestion: avoid `inferSchema=true`, which adds overhead and inconsistency
2. **Use Delta Lake format** for all managed tables: provides ACID guarantees, time travel, and optimized reads
3. **Validate data quality** at ingestion boundaries: check nulls, data types, and business rules before persisting
4. **Add metadata columns** to track lineage: `ingestion_timestamp`, `source_system`, `pipeline_run_id` for debugging
5. **Handle errors gracefully**: wrap ingestion/transformation logic in try-except with proper logging and recovery
6. **Use MERGE for upserts**: leverage Delta Lake's `MERGE INTO` for incremental updates based on merge keys
7. **Partition large tables**: use date or category columns for partition pruning to improve query performance
8. **If a real public source URL is provided, ingest from that source**: download/copy into lakehouse `Files/` first, then load with Spark from lakehouse paths (do not replace with synthetic inline rows)

### Prefer

1. **Batch processing over streaming** unless real-time requirements exist: simpler to debug and monitor
2. **Read-optimized writes** for analytical workloads: use `.coalesce()` or `.repartition()` to right-size output files
3. **Window functions over self-joins**: more efficient for ranking, running totals, and lag/lead operations
4. **Broadcast joins for small dimensions**: use `.broadcast()` hint when one table fits in memory (<100MB)
5. **Columnar operations over row-wise**: leverage DataFrame/SQL API instead of UDFs when possible
6. **Lazy evaluation mindset**: build transformation chains, then execute with actions (`.write()`, `.count()`)

### Avoid

1. **Don't use `.collect()` on large DataFrames**: brings all data to driver, causes OOM errors
2. **Don't chain multiple `.count()` calls**: each triggers a full scan; cache DataFrame if needed
3. **Don't ignore skew**: salting keys or adaptive query execution prevents straggler tasks
4. **Don't skip Delta optimization**: run `OPTIMIZE` and `VACUUM` regularly to prevent small file problem
5. **Don't hardcode paths or credentials**: use parameters and secure configuration patterns
6. **Don't mix append and overwrite** carelessly: understand partition scope for `.mode("overwrite")`

---

## Data Ingestion Principles

### Schema Management

Guide LLM to define explicit schemas with nullable constraints, data type validation, and business context comments.

> **Note**: This section refers to **data schemas** (DataFrame structure). For **lakehouse schemas** (databases/namespaces for organizing tables), see SPARK-AUTHORING-CORE.md Lakehouse Schema Organization.

### Source Format Handling

- **CSV/TSV**: Explicit schema, header option
- **Parquet/ORC**: Columnar formats with embedded schema
- **JSON**: multiLine option for nested objects
- **ADLS Gen2**: `abfss://container@storage.dfs.core.windows.net/path`
- **OneLake**: `abfss://workspace@onelake.dfs.fabric.microsoft.com/lakehouse.Lakehouse/Files/path`
- **Public HTTP/HTTPS datasets**: Download/copy to lakehouse `Files/...` first, then `spark.read` from lakehouse paths for stable runtime behavior

### Validation Patterns

- **Completeness**: Filter nulls in required fields
- **Referential integrity**: Join with dimensions, flag orphans
- **Business rules**: Domain-specific checks (amount > 0, date ranges)
- **Duplicates**: dropDuplicates or groupBy to identify

### Error Handling Strategy

- Try-except blocks with specific exceptions
- Contextual logging
- Dead letter queues for invalid records
- Retry logic for transient failures

---

## Transformation Patterns

### When to Use Different Operations

**Aggregations**: For summarization and metrics; combine multiple in single pass with `.agg()`

**Window Functions**: For ranking (row_numbe