Spark Optimization

Name: Spark Optimization
Author: wshobson

wshobson/agents

8.3k installs
38.3k repo stars
Updated July 22, 2026

spark-optimization is an agent skill that optimizes Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use it when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

About

Production patterns for optimizing Apache Spark jobs, including partitioning strategies, memory management, shuffle optimization, and performance tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

Apache Spark Optimization
Optimizing slow Spark jobs
Tuning memory and executor configuration
Implementing efficient partitioning strategies
Debugging Spark performance issues

Spark Optimization by the numbers

8,322 all-time installs (skills.sh)
+172 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #113 of 4,386 Backend & APIs skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

spark-optimization capabilities & compatibility

Capabilities: apache spark optimization · optimizing slow spark jobs · tuning memory and executor configuration · implementing efficient partitioning strategies · debugging spark performance issues
Use cases: documentation

From the docs

What spark-optimization says it does

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning.

SKILL.md

Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

SKILL.md

Production patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.

SKILL.md

Spark Execution Model: Driver Program → Job (triggered by action) → Stages (separated by shuffles) → Tasks (one per partition)

SKILL.md

npx skills add https://github.com/wshobson/agents --skill spark-optimization

Installs	8.3k
repo stars	★ 38.3k
Security audit	3 / 3 scanners passed
Last updated	July 22, 2026
Repository	wshobson/agents ↗

What problem does spark-optimization solve for developers using this skill?

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

Who is it for?

Developers who need spark-optimization patterns described in the cached skill documentation.

Skip if: the docs are empty or the task is outside the skill's documented scope.

When should I use this skill?

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

What you get

Actionable workflows and conventions from SKILL.md for spark-optimization.

optimized partition configuration
tuned Spark job code

By the numbers

Targets 128MB–256MB optimal partition size for Spark tasks

Files

SKILL.md · Markdown · GitHub ↗

Apache Spark Optimization

Production patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.

When to Use This Skill

Optimizing slow Spark jobs
Tuning memory and executor configuration
Implementing efficient partitioning strategies
Debugging Spark performance issues
Scaling Spark pipelines for large datasets
Reducing shuffle and data skew

Core Concepts

1. Spark Execution Model

Driver Program
    ↓
Job (triggered by action)
    ↓
Stages (separated by shuffles)
    ↓
Tasks (one per partition)
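
As a rough illustration of the hierarchy above, the following pure-Python sketch estimates how many stages and tasks a single action would produce. The `count_stages_and_tasks` helper and the `WIDE` set are hypothetical simplifications, not a Spark API. Real plans also depend on the optimizer, AQE, and join strategy (a broadcast join, for example, does not shuffle).

```python
# Hypothetical, simplified model of Spark's job -> stage -> task breakdown.
# Wide transformations (those that normally need a shuffle) start a new stage.
WIDE = {"groupBy", "join", "repartition", "distinct", "orderBy"}

def count_stages_and_tasks(transformations, input_partitions, shuffle_partitions=200):
    """Return (num_stages, tasks_per_stage) for one action on a linear plan."""
    tasks_per_stage = [input_partitions]  # first stage: one task per input partition
    for op in transformations:
        if op in WIDE:
            # A shuffle boundary closes the current stage and opens a new one
            tasks_per_stage.append(shuffle_partitions)
    return len(tasks_per_stage), tasks_per_stage

stages, tasks = count_stages_and_tasks(
    ["filter", "select", "groupBy"], input_partitions=64
)
print(stages, tasks)  # 2 [64, 200]
```

The filter and select are narrow and stay in the first stage. The groupBy introduces the only shuffle, so the job has two stages.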

2. Key Performance Factors

Factor	Impact	Solution
Shuffle	Network I/O, disk I/O	Minimize wide transformations
Data Skew	Uneven task duration	Salting, broadcast joins
Serialization	CPU overhead	Use Kryo, columnar formats
Memory	GC pressure, spills	Tune executor memory
Partitions	Parallelism	Right-size partitions

Quick Start

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create optimized Spark session
spark = (SparkSession.builder
    .appName("OptimizedJob")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate())

# Read with optimized settings
df = (spark.read
    .format("parquet")
    .option("mergeSchema", "false")
    .load("s3://bucket/data/"))

# Efficient transformations
result = (df
    .filter(F.col("date") >= "2024-01-01")
    .select("id", "amount", "category")
    .groupBy("category")
    .agg(F.sum("amount").alias("total")))

result.write.mode("overwrite").parquet("s3://bucket/output/")

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.

Best Practices

Do's

Enable AQE - Adaptive query execution handles many issues
Use Parquet/Delta - Columnar formats with compression
Broadcast small tables - Avoid shuffle for small joins
Monitor Spark UI - Check for skew, spills, GC
Right-size partitions - 128MB - 256MB per partition

Don'ts

Don't collect large data - Keep data distributed
Don't use UDFs unnecessarily - Use built-in functions
Don't over-cache - Memory is limited
Don't ignore data skew - It dominates job time
Don't use `.count()` for existence - Use .take(1) or .isEmpty()
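
To make the 128MB - 256MB guideline concrete, here is a small worked calculation in plain Python. The `partition_count_range` helper is a hypothetical name introduced for illustration. It assumes the data size is known up front and ignores compression and in-memory expansion.

```python
import math

def partition_count_range(data_size_gb, min_mb=128, max_mb=256):
    """Return (fewest, most) partitions that keep each one within [min_mb, max_mb]."""
    size_mb = data_size_gb * 1024
    fewest = max(math.ceil(size_mb / max_mb), 1)   # largest allowed partitions
    most = max(math.floor(size_mb / min_mb), 1)    # smallest allowed partitions
    return fewest, max(most, fewest)

print(partition_count_range(50))  # (200, 400)
```

For a 50GB dataset, any partition count between 200 and 400 keeps partitions inside the recommended range.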

spark-optimization — detailed patterns and worked examples

Patterns

Pattern 1: Optimal Partitioning

# Calculate optimal partition count
def calculate_partitions(data_size_gb: float, partition_size_mb: int = 128) -> int:
    """
    Optimal partition size: 128MB - 256MB
    Too few: Under-utilization, memory pressure
    Too many: Task scheduling overhead
    """
    return max(int(data_size_gb * 1024 / partition_size_mb), 1)

# Repartition for even distribution
df_repartitioned = df.repartition(200, "partition_key")

# Coalesce to reduce partitions (no shuffle)
df_coalesced = df.coalesce(100)

# Partition pruning with predicate pushdown
df = (spark.read.parquet("s3://bucket/data/")
    .filter(F.col("date") == "2024-01-01"))  # Spark pushes this down

# Write with partitioning for future queries
(df.write
    .partitionBy("year", "month", "day")
    .mode("overwrite")
    .parquet("s3://bucket/partitioned_output/"))
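
The effect of partition pruning can be sketched without a cluster. The example below mimics how a reader could skip whole directories written by `partitionBy("year", "month", "day")`. The file paths and the `prune` helper are hypothetical. Spark's own pruning happens inside the query planner and is not exposed this way.

```python
# Hypothetical directory listing for a dataset written with
# partitionBy("year", "month", "day"); paths are illustrative only.
paths = [
    "year=2023/month=12/day=31/part-0.parquet",
    "year=2024/month=01/day=01/part-0.parquet",
    "year=2024/month=01/day=02/part-0.parquet",
    "year=2024/month=02/day=01/part-0.parquet",
]

def parse_partition_values(path):
    """Extract {"year": "2024", ...} from the key=value directory segments."""
    segments = path.split("/")[:-1]  # drop the file name
    return dict(segment.split("=", 1) for segment in segments)

def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in wanted.items())
    ]

selected = prune(paths, year="2024", month="01")
print(f"scanning {len(selected)} of {len(paths)} files")
```

Only files under matching directories are read, which is why filtering on partition columns is much cheaper than filtering on ordinary columns.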

Pattern 2: Join Optimization

from pyspark.sql import functions as F
from pyspark.sql.types import *

# 1. Broadcast Join - Small table joins
# Best when: One side < 10MB (configurable)
small_df = spark.read.parquet("s3://bucket/small_table/")  # < 10MB
large_df = spark.read.parquet("s3://bucket/large_table/")  # TBs

# Explicit broadcast hint
result = large_df.join(
    F.broadcast(small_df),
    on="key",
    how="left"
)

# 2. Sort-Merge Join - Default for large tables
# Requires shuffle, but handles any size
result = large_df1.join(large_df2, on="key", how="inner")

# 3. Bucket Join - Pre-sorted, no shuffle at join time
# Write bucketed tables
(df.write
    .bucketBy(200, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("bucketed_orders"))

# Join bucketed tables (no shuffle!)
orders = spark.table("bucketed_orders")
customers = spark.table("bucketed_customers")  # Same bucket count
result = orders.join(customers, on="customer_id")

# 4. Skew Join Handling
# Enable AQE skew join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

# Manual salting for severe skew
def salt_join(df_skewed, df_other, key_col, num_salts=10):
    """Add salt to distribute skewed keys"""
    # Add a random salt in [0, num_salts) to the skewed side
    df_salted = df_skewed.withColumn(
        "salt",
        (F.rand() * num_salts).cast("int")
    ).withColumn(
        "salted_key",
        F.concat(F.col(key_col).cast("string"), F.lit("_"), F.col("salt").cast("string"))
    )

    # Replicate the other side once per salt value
    df_exploded = df_other.crossJoin(
        spark.range(num_salts).withColumnRenamed("id", "salt")
    ).withColumn(
        "salted_key",
        F.concat(F.col(key_col).cast("string"), F.lit("_"), F.col("salt").cast("string"))
    ).drop(key_col, "salt")  # avoid duplicate column names after the join

    # Join on salted key, then drop the helper columns
    return (df_salted
        .join(df_exploded, on="salted_key", how="inner")
        .drop("salted_key", "salt"))
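
The reason salting helps can be shown with a small simulation in plain Python. This is a sketch under stated assumptions: `partition_of` uses CRC32 as a deterministic stand-in for hash partitioning (it is not Spark's actual hash function), and the key distribution is made up for the example.

```python
import random
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    """Deterministic stand-in for hash partitioning (not Spark's actual hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_loads(keys, num_partitions):
    """Count rows per partition; partitions with no rows count as 0."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

rng = random.Random(42)
# 9,000 rows share one hot key; 1,000 rows are spread over 100 other keys
keys = ["hot"] * 9000 + [f"k{i % 100}" for i in range(1000)]

# Append a random salt so the hot key becomes num_salts distinct keys
num_salts = 10
salted = [f"{k}_{rng.randrange(num_salts)}" for k in keys]

before = max(partition_loads(keys, 8))
after = max(partition_loads(salted, 8))
print(f"largest partition before salting: {before}, after: {after}")
```

Without salting, every row of the hot key lands in one partition. With salting, those rows spread across several partitions, at the cost of replicating the other side of the join once per salt value.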

Pattern 3: Caching and Persistence

from pyspark import StorageLevel

# Cache when reusing DataFrame multiple times
df = spark.read.parquet("s3://bucket/data/")
df_filtered = df.filter(F.col("status") == "active")

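# Cache with the default storage level (MEMORY_AND_DISK for DataFrames)
df_filtered.cache()

# Or with an explicit storage level
# (PySpark's StorageLevel has no *_SER levels; those exist only in Scala/Java)
df_filtered.persist(StorageLevel.MEMORY_AND_DISK)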

# Force materialization
df_filtered.count()

# Use in multiple actions
agg1 = df_filtered.groupBy("category").count()
agg2 = df_filtered.groupBy("region").sum("amount")

# Unpersist when done
df_filtered.unpersist()

# Storage levels explained:
# MEMORY_ONLY - Fast, but partitions that don't fit are recomputed
# MEMORY_AND_DISK - Spills to disk if needed (recommended)
# MEMORY_ONLY_SER - Serialized, less memory, more CPU (Scala/Java only)
# DISK_ONLY - When memory is tight
# OFF_HEAP - Off-heap memory (requires spark.memory.offHeap.enabled)

# Checkpoint for complex lineage
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")
df_complex = (df
    .join(other_df, "key")
    .groupBy("category")
    .agg(F.sum("amount")))
# checkpoint() returns a new DataFrame; keep the result to use the truncated lineage
df_complex = df_complex.checkpoint()  # Breaks lineage, materializes

Pattern 4: Memory Tuning

# Executor memory configuration
# spark-submit --executor-memory 8g --executor-cores 4

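# Memory breakdown (8GB executor heap, approximate):
# - 300MB is reserved by Spark; the fractions below apply to the remaining ~7.7GB
# - spark.memory.fraction = 0.6 (60% of ~7.7GB = ~4.6GB for execution + storage)
#   - spark.memory.storageFraction = 0.5 (~2.3GB of cache protected from eviction)
#   - Remaining ~2.3GB for execution (shuffles, joins, sorts); either side can borrow free space
# - 40% = ~3.1GB for user data structures and internal metadata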

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")  # For non-JVM memory
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .config("spark.sql.shuffle.partitions", "200")
    # Broadcast tables up to 50MB (default 10MB); every executor holds a full copy
    .config("spark.sql.autoBroadcastJoinThreshold", "50MB")
    # Cap the size of each input partition when reading files
    .config("spark.sql.files.maxPartitionBytes", "128MB")
    .getOrCreate())

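# Monitor memory usage
def print_memory_usage(spark):
    """Print storage memory per executor (uses internal py4j handles; may break between versions)"""
    sc = spark.sparkContext
    # Scala Map[String, (Long, Long)]: executor -> (max storage memory, remaining storage memory)
    status = sc._jsc.sc().getExecutorMemoryStatus()
    # Convert to a java.util.Map so py4j can iterate over it
    status = sc._jvm.scala.collection.JavaConverters.mapAsJavaMapConverter(status).asJava()
    for executor, mem_status in status.items():
        total = mem_status._1() / (1024**3)
        free = mem_status._2() / (1024**3)
        print(f"{executor}: {total:.2f}GB max for caching, {free:.2f}GB remaining")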
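
The memory split can also be worked out numerically. The sketch below follows the unified memory model in Spark's tuning documentation, where `spark.memory.fraction` applies to the JVM heap minus 300MB of reserved memory. The `unified_memory_breakdown` helper is a hypothetical name, and the result is an approximation because the JVM reports slightly less usable heap than the configured value.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying spark.memory.fraction

def unified_memory_breakdown(executor_memory_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate the unified memory regions of one executor, in MB."""
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * memory_fraction     # shared by execution and storage
    storage = unified * storage_fraction   # cached blocks protected from eviction
    return {
        "unified_mb": round(unified, 1),
        "storage_mb": round(storage, 1),
        "execution_mb": round(unified - storage, 1),
        "user_mb": round(usable - unified, 1),
    }

for region, size_mb in unified_memory_breakdown(8 * 1024).items():
    print(f"{region}: {size_mb} MB")
```

The boundary between storage and execution is soft: either side can borrow unused space from the other, so these numbers describe guaranteed minimums rather than hard limits.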

Pattern 5: Shuffle Optimization

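# Reduce shuffle data size
# "auto" is a Databricks-specific value; open-source Spark requires an integer
# and relies on AQE to coalesce small shuffle partitions at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")
# spark.shuffle.compress and spark.shuffle.spill.compress default to "true".
# They are core (non-SQL) settings, so set them at launch rather than with spark.conf.set:
# spark-submit --conf spark.shuffle.compress=true --conf spark.shuffle.spill.compress=true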

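# Two-stage aggregation to spread hot keys
# Spark already does map-side partial aggregation for sum(); grouping by an extra
# column first adds a shuffle, so use this only when a few keys are heavily skewed
df_optimized = (df
    # Partial aggregation on a finer-grained key
    .groupBy("key", "partition_col")
    .agg(F.sum("value").alias("partial_sum"))
    # Then global aggregation
    .groupBy("key")
    .agg(F.sum("partial_sum").alias("total")))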

# Reduce shuffle volume with approximate aggregates
# Exact: shuffles every distinct value
distinct_count = df.select("category").distinct().count()

# Approximate: shuffles only small per-partition sketches (HyperLogLog++)
approx_count = df.select(F.approx_count_distinct("category")).collect()[0][0]

# Use coalesce instead of repartition when reducing partitions
df_reduced = df.coalesce(10)  # No shuffle

# Compression codec for shuffle and other internal data (lz4 is the default)
# Core setting: set at launch, e.g. spark-submit --conf spark.io.compression.codec=lz4

Pattern 6: Data Format Optimization

# Parquet optimizations
(df.write
    .option("compression", "snappy")  # Fast compression
    .option("parquet.block.size", 128 * 1024 * 1024)  # 128MB row groups
    .parquet("s3://bucket/output/"))

# Column pruning - only read needed columns
df = (spark.read.parquet("s3://bucket/data/")
    .select("id", "amount", "date"))  # Spark only reads these columns

# Predicate pushdown - filter at storage level
df = (spark.read.parquet("s3://bucket/partitioned/year=2024/")
    .filter(F.col("status") == "active"))  # Pushed to Parquet reader

# Delta Lake optimizations
(df.write
    .format("delta")
    .option("optimizeWrite", "true")  # Bin-packing
    .option("autoCompact", "true")  # Compact small files
    .mode("overwrite")
    .save("s3://bucket/delta_table/"))

# Z-ordering for multi-dimensional queries
spark.sql("""
    OPTIMIZE delta.`s3://bucket/delta_table/`
    ZORDER BY (customer_id, date)
""")

Pattern 7: Monitoring and Debugging

# Whole-stage code generation (on by default) and Arrow-based pandas conversion
spark.conf.set("spark.sql.codegen.wholeStage", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Explain query plan
df.explain(mode="extended")
# Modes: simple, extended, codegen, cost, formatted

# Show the optimized logical plan with size and row-count statistics, when available
df.explain(mode="cost")

# Monitor task metrics
def analyze_stage_metrics(spark):
    """Print task progress for currently active stages"""
    status_tracker = spark.sparkContext.statusTracker()

    for stage_id in status_tracker.getActiveStageIds():
        stage_info = status_tracker.getStageInfo(stage_id)
        if stage_info is None:  # info may already have been garbage collected
            continue
        print(f"Stage {stage_id}:")
        print(f"  Tasks: {stage_info.numTasks}")
        print(f"  Completed: {stage_info.numCompletedTasks}")
        print(f"  Failed: {stage_info.numFailedTasks}")

# Identify data skew
def check_partition_skew(df):
    """Check for partition skew"""
    partition_counts = (df
        .withColumn("partition_id", F.spark_partition_id())
        .groupBy("partition_id")
        .count()
        .orderBy(F.desc("count")))

    partition_counts.show(20)

    stats = partition_counts.select(
        F.min("count").alias("min"),
        F.max("count").alias("max"),
        F.avg("count").alias("avg"),
        F.stddev("count").alias("stddev")
    ).collect()[0]

    skew_ratio = stats["max"] / stats["avg"]
    print(f"Skew ratio: {skew_ratio:.2f}x (>2x indicates skew)")
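
The skew ratio itself is simple arithmetic, shown here on plain Python lists so it can be tried without a cluster. The row counts are made-up sample data, and the `skew_ratio` helper is a hypothetical name that mirrors the max/avg calculation above.

```python
from statistics import mean

def skew_ratio(partition_row_counts):
    """Return max/mean row count; 1.0 means perfectly even partitions."""
    if not partition_row_counts:
        raise ValueError("need at least one partition")
    avg = mean(partition_row_counts)
    return max(partition_row_counts) / avg if avg else 0.0

balanced = [1000, 1010, 990, 1000]   # made-up row counts per partition
skewed = [100, 120, 90, 3690]

print(f"balanced: {skew_ratio(balanced):.2f}x")  # balanced: 1.01x
print(f"skewed: {skew_ratio(skewed):.2f}x")      # skewed: 3.69x
```

A ratio near 1 means tasks finish at roughly the same time. In the skewed case one task processes almost four times the average, so it sets the duration of the whole stage.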

Configuration Cheat Sheet

# Production configuration template
spark_configs = {
    # Adaptive Query Execution (AQE)
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",

    # Memory
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",

    # Parallelism
    "spark.sql.shuffle.partitions": "200",
    "spark.default.parallelism": "200",

    # Serialization
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.execution.arrow.pyspark.enabled": "true",

    # Compression
    "spark.io.compression.codec": "lz4",
    "spark.shuffle.compress": "true",

    # Broadcast
    "spark.sql.autoBroadcastJoinThreshold": "50MB",

    # File handling
    "spark.sql.files.maxPartitionBytes": "128MB",
    "spark.sql.files.openCostInBytes": "4MB",
}
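
One way to apply a dictionary like this is to pass each entry to `spark-submit` as a `--conf key=value` pair, which also covers core settings that cannot be changed after the session starts. The sketch below only builds the command line. The `to_spark_submit_args` helper is a hypothetical name and `job.py` is a placeholder application file.

```python
import shlex

def to_spark_submit_args(configs):
    """Render a {key: value} mapping as repeated --conf key=value arguments."""
    args = []
    for key, value in sorted(configs.items()):  # sorted for a stable command line
        args.extend(["--conf", f"{key}={value}"])
    return args

# A small subset of the settings above, for illustration
example_configs = {
    "spark.sql.adaptive.enabled": "true",
    "spark.executor.memory": "8g",
}

# "job.py" is a placeholder application file
command = ["spark-submit", *to_spark_submit_args(example_configs), "job.py"]
print(shlex.join(command))
```

Treat the values as starting points: executor memory, partition counts, and broadcast thresholds depend on cluster size and data volume.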

How it compares

Choose spark-optimization when you need Spark-specific partition and shuffle tuning rather than general SQL query optimization.

FAQ

What does spark-optimization do?

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

When should I use spark-optimization?

Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.

Is spark-optimization safe to install?

Review the Security Audits panel on this page before installing in production.
