
Spark Engineer
Tune PySpark partitioning and caching so batch jobs finish without OOMs, shuffle storms, or runaway partition counts on modest clusters.
Overview
spark-engineer is an agent skill for the Build phase that optimizes PySpark partitioning and caching using core-count and partition-size guidelines to improve parallelism and join performance.
Install
npx skills add https://github.com/jeffallan/claude-skills --skill spark-engineerWhat is this skill?
- Partition count heuristics: 2–4 partitions per CPU core and 128–256MB target partition size
- Volume table from under 1GB through 1TB+ with suggested partition counts
- repartition vs coalesce guidance including column-hash and range partitioning for joins
- Co-partitioning rationale to avoid expensive shuffles on keyed joins
- PySpark snippets to inspect getNumPartitions() and resize DataFrames safely
- 2–4 partitions per CPU core heuristic
- Volume guidance table from under 1GB through greater than 1TB
Adoption & trust: 2.2k installs on skills.sh; 9.7k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your Spark jobs shuffle excessively, run out of memory, or take forever because partition counts and key distribution do not match your cluster or data size.
Who is it for?
Developers fixing slow or failing PySpark ETL on defined executor topologies who need concrete partition math, not generic big-data theory.
Skip if: Non-Spark stacks, front-end apps, or teams needing only SQL warehouse tuning without Spark DataFrames.
When should I use this skill?
Tuning Spark DataFrame partition counts, repartition/coalesce strategy, co-partitioned joins, or cache usage when jobs are slow or OOM-prone.
What do I get? / Deliverables
You leave with repartition/coalesce choices, target partition counts by data volume, and column partitioning patterns that cut shuffle cost and stabilize executor memory.
- Recommended partition count and repartition keys
- Before/after partition inspection steps
- Join-oriented co-partitioning plan to reduce shuffle
Recommended Skills
Journey fit
Spark tuning is implementation work on data backends and ETL, which belongs in Build rather than Operate unless you are only reacting to incidents. Backend subphase covers distributed data processing, join shuffle behavior, and executor memory patterns the skill emphasizes.
How it compares
Spark DataFrame partition and cache tuning, not ML experiment tracking or product retention methodology.
Common Questions / FAQ
Who is spark-engineer for?
Indie and solo builders implementing PySpark pipelines who can reach for agents to refactor partition strategy before paying for a bigger cluster.
When should I use spark-engineer?
During Build when designing or debugging backend data jobs—especially before large joins, after OOM errors, or when partition counts look far off relative to cores and gigabytes ingested.
Is spark-engineer safe to install?
It describes data-parallel code patterns; review the Security Audits panel on this page and run changes in staging since repartitioning can be expensive and affect production SLAs.
SKILL.md
READMESKILL.md - Spark Engineer
# Partitioning and Caching --- ## Partitioning Fundamentals ### Why Partitioning Matters - **Parallelism**: Each partition runs on a separate task - **Data locality**: Minimize data movement across network - **Memory efficiency**: Right-sized partitions prevent OOM - **Join performance**: Co-partitioned data avoids shuffle ### Partition Count Guidelines ```python # Rule of thumb: 2-4 partitions per CPU core # For 100 executor cores: 200-400 partitions # Check current partitions print(f"Number of partitions: {df.rdd.getNumPartitions()}") # Recommended formula total_cores = num_executors * cores_per_executor recommended_partitions = total_cores * 2 to 4 # Target partition size: 128MB - 256MB per partition # For 100GB data with 128MB target: ~800 partitions ``` ### Optimal Partition Sizes | Data Volume | Target Partition Size | Partition Count | |-------------|----------------------|-----------------| | < 1GB | 64MB | 8-16 | | 1-10GB | 128MB | 8-80 | | 10-100GB | 128-256MB | 40-800 | | 100GB-1TB | 256MB | 400-4000 | | > 1TB | 256MB | 4000+ | --- ## DataFrame Partitioning ### Repartition (Full Shuffle) ```python from pyspark.sql import functions as F # Repartition to specific number df_repart = df.repartition(200) # Repartition by column(s) - same keys go to same partition df_repart = df.repartition("user_id") df_repart = df.repartition("user_id", "date") # Repartition with count and columns df_repart = df.repartition(100, "user_id") # Range partitioning (for sorted access patterns) df_range = df.repartitionByRange(100, "date") ``` ```scala // Scala repartition val dfRepart = df.repartition(200) val dfByCol = df.repartition($"user_id") val dfRange = df.repartitionByRange(100, $"date") ``` ### Coalesce (No Shuffle) ```python # Reduce partitions without shuffle - efficient! # Use after filtering reduces data significantly df_coalesced = df.coalesce(50) # Common pattern: filter then coalesce df_filtered = df.filter(F.col("active") == True) # If filter reduced data by 80%, reduce partitions too df_optimized = df_filtered.coalesce(40) # From 200 to 40 ``` **When to use:** - `repartition(n)`: Increase partitions, need even distribution, partition by column - `coalesce(n)`: Decrease partitions only (no shuffle benefit) - `repartitionByRange()`: Need sorted partitions for range queries ### Checking Partition Distribution ```python from pyspark.sql import functions as F # Check partition count print(f"Partitions: {df.rdd.getNumPartitions()}") # Check partition sizes (row counts) partition_counts = df.withColumn("partition_id", F.spark_partition_id()) \ .groupBy("partition_id") \ .count() \ .orderBy("partition_id") partition_counts.show() # Get partition statistics stats = partition_counts.agg( F.min("count").alias("min_rows"), F.max("count").alias("max_rows"), F.avg("count").alias("avg_rows"), F.stddev("count").alias("stddev") ) stats.show() # Identify skew: max/avg ratio > 3 indicates skew ``` --- ## Shuffle Partitions ### Configuration ```python # Default shuffle partitions (200) - often suboptimal spark.conf.set("spark.sql.shuffle.partitions", 200) # For small data (<10GB), reduce spark.conf.set("spark.sql.shuffle.partitions", 50) # For large data (>100GB), increase spark.conf.set("spark.sql.shuffle.partitions", 2000) # Adaptive Query Execution (Spark 3.0+) - dynamic partition sizing spark.conf.set("spark.sql.adaptive.enabled", "true") spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "64MB") spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB") ``` ### AQE Automatic Optimization (Spark 3.x) ```python # Enable full AQE suite spark.conf.set("spark.sql.adaptive.enabled", "true") # Auto-coalesce shuffle partitions spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst",