Polars

Name: Polars
Author: k-dense-ai

k-dense-ai/scientific-agent-skills

922 installs
32k repo stars
Updated July 29, 2026

Polars is a Claude Code skill that teaches Polars lazy evaluation, early filter/select, and expression API patterns so coding agents write fast parallel-friendly Python data pipelines.

About

Polars is a performance guide skill from k-dense-ai/scientific-agent-skills for writing efficient Polars Python code. It prioritizes lazy evaluation with scan_csv and collect so queries benefit from predicate pushdown, projection pushdown, query optimization, and parallel execution planning instead of eager full-file loads. The skill teaches pushing filter and select operations early in pipelines, using the expression API correctly, and avoiding common eager-mode pitfalls that stall large datasets. Developers invoke Polars when agents generate dataframe code that works on samples but will choke on production CSV or Parquet volumes. It complements scientific and analytics workflows where Polars replaces pandas for speed-critical transforms. The readme contrasts bad eager patterns with optimized lazy pipelines using concrete Python examples developers can apply immediately.

Lazy evaluation playbook: `scan_csv` + `collect` with predicate and projection pushdown
Pipeline hygiene: filter and select early before group_by and joins
Anti-pattern coverage: avoid Python UDFs that break Polars parallelization
Expression-API-first patterns for maintainable scientific and analytics code
Performance-focused reference suitable for large CSV/Parquet workloads

Polars by the numbers

922 all-time installs (skills.sh)
+42 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #308 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 29, 2026 (Skillselion catalog sync)

npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill polars


Installs	922
repo stars	★ 32k
Security audit	3 / 3 scanners passed
Last updated	July 29, 2026
Repository	k-dense-ai/scientific-agent-skills ↗

How do you write fast Polars lazy data pipelines?

Teach your coding agent Polars patterns—lazy scans, early filter/select, and expression API—so data pipelines stay fast and parallel-friendly.

Who is it for?

Python developers building Polars ETL or analytics where agents need lazy-mode and expression API guidance.

Skip if: You work exclusively in pandas, Spark, or SQL warehouses and have no Polars in the stack.

When should I use this skill?

When agent-generated Polars code uses eager reads, late filters, or inefficient expressions on large datasets.

What you get

Optimized Polars lazy pipelines with early filters, projection pushdown, and parallel-friendly expression code


Files

SKILL.md (Markdown)

Polars

Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

Quick Start

Installation and Basic Usage

Install the current stable Polars release verified during this refresh:

uv pip install "polars==1.41.2"

Install optional integrations only when needed:

uv pip install "polars[excel,database,fsspec,pandas,numpy]==1.41.2"

Basic DataFrame creation and operations:

import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"]
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)

Core Concepts

Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:

Use pl.col("column_name") to reference columns
Chain methods to build complex transformations
Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)

Example:

# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)

Lazy vs Eager Evaluation

Eager (DataFrame): Operations execute immediately

df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately

Lazy (LazyFrame): Operations build a query plan, optimized before execution

lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Now executes optimized query

When to use lazy:

Working with large datasets
Complex query pipelines
When only some columns/rows are needed
Performance is critical

Benefits of lazy evaluation:

Automatic query optimization
Predicate pushdown
Projection pushdown
Parallel execution

For detailed concepts, load references/core_concepts.md.

Common Operations

Select

Select and manipulate columns:

# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))

Filter

Filter rows by conditions:

# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (comma-separated predicates are combined with AND)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)

With Columns

Add or modify columns while preserving existing ones:

# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase()
)

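# Parallel computation (all columns computed in parallel)
# Each expression needs a distinct output name, otherwise Polars raises a duplicate-column error
df.with_columns(
    value_x10=pl.col("value") * 10,
    value_x100=pl.col("value") * 100,
)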

Group By and Aggregations

Group data and compute aggregations:

# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)

For detailed operation patterns, load references/operations.md.

Aggregations and Window Functions

Aggregation Functions

Common aggregations within group_by context:

pl.len() - count rows
pl.col("x").sum() - sum values
pl.col("x").mean() - average
pl.col("x").min() / pl.col("x").max() - extremes
pl.col("x").first() / pl.col("x").last() - first/last values

Window Functions with `over()`

Apply aggregations while preserving row count:

# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)

Mapping strategies (the mapping_strategy argument of over()):

group_to_rows (default): Preserves original row order
explode: Faster but groups rows together
join: Creates list columns

Data I/O

Supported Formats

Polars supports reading and writing:

CSV, Parquet, JSON, Excel
Databases (via connectors)
Cloud storage (S3, Azure, GCS)
Google BigQuery
Multiple/partitioned files

Common I/O Operations

CSV:

# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()

Parquet (recommended for performance):

df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")

JSON:

df = pl.read_json("file.json")
df.write_json("output.json")

For comprehensive I/O documentation, load references/io_guide.md.

Transformations

Joins

Combine DataFrames:

# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")

Concatenation

Stack DataFrames:

# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")

Pivot and Unpivot

Reshape data:

# Pivot (wide format)
df.pivot(on="product", values="sales", index="date")

# Unpivot (long format)
df.unpivot(index="id", on=["col1", "col2"])

For detailed transformation examples, load references/transformations.md.

Pandas Migration

Polars offers significant performance improvements over pandas with a cleaner API. Key differences:

Conceptual Differences

No index: Polars uses integer positions only
Strict typing: No silent type conversions
Lazy evaluation: Available via LazyFrame
Parallel by default: Operations parallelized automatically

Common Operation Mappings

Operation	Pandas	Polars
Select column	`df["col"]`	`df.select("col")`
Filter	`df[df["col"] > 10]`	`df.filter(pl.col("col") > 10)`
Add column	`df.assign(x=...)`	`df.with_columns(x=...)`
Group by	`df.groupby("col").agg(...)`	`df.group_by("col").agg(...)`
Window	`df.groupby("col").transform(...)`	`df.with_columns(expr.over("col"))`

Key Syntax Patterns

Pandas sequential (slow):

df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100
)

Polars parallel (fast):

df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)

For comprehensive migration guide, load references/pandas_migration.md.

Best Practices

Performance Optimization

1. Use lazy evaluation for large datasets:

   lf = pl.scan_csv("large.csv")  # Don't use read_csv
   result = lf.filter(...).select(...).collect()

2. Avoid Python functions in hot paths:

Stay within expression API for parallelization
Use .map_elements() only when necessary
Prefer native Polars operations

3. Use streaming for very large data:

   lf.collect(engine="streaming")

4. Select only needed columns early:

   # Good: Select columns early
   lf.select("col1", "col2").filter(...)

   # Bad: Filter on all columns first
   lf.filter(...).select("col1", "col2")

5. Use appropriate data types:

Categorical for low-cardinality strings
Appropriate integer sizes (i32 vs i64)
Date types for temporal data

Expression Patterns

Conditional operations:

pl.when(condition).then(value).otherwise(other_value)

Column operations across multiple columns:

df.select(pl.col("^.*_value$") * 2)  # Regex pattern

Null handling:

pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()

For additional best practices and patterns, load references/best_practices.md.

Resources

This skill includes comprehensive reference documentation:

references/

core_concepts.md - Detailed explanations of expressions, lazy evaluation, and type system
operations.md - Comprehensive guide to all common operations with examples
pandas_migration.md - Complete migration guide from pandas to Polars
io_guide.md - Data I/O operations for all supported formats
transformations.md - Joins, concatenation, pivots, and reshaping operations
best_practices.md - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.

Polars Best Practices and Performance Guide

Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.

Performance Optimization

1. Use Lazy Evaluation

Always prefer lazy mode for large datasets:

# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()

Benefits of lazy evaluation:

Predicate pushdown (filter at source)
Projection pushdown (read only needed columns)
Query optimization
Parallel execution planning

2. Filter and Select Early

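Push filters and column selection as early as possible in the pipeline. A filter can move ahead of an aggregation only when it is meant to apply to the input rows; in the pair below, the first pipeline filters the per-category mean while the second filters individual rows, so choose the placement that matches the intended result: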

# Bad: Process all data, then filter and select
result = (
    lf.group_by("category")
    .agg(pl.col("value").mean())
    .join(other, on="category")
    .filter(pl.col("value") > 100)
    .select("category", "value")
)

# Good: Filter and select early
result = (
    lf.select("category", "value")  # Only needed columns
    .filter(pl.col("value") > 100)  # Filter early
    .group_by("category")
    .agg(pl.col("value").mean())
    .join(other.select("category", "other_col"), on="category")
)

3. Avoid Python Functions

Stay within the expression API to maintain parallelization:

# Bad: Python function disables parallelization
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)

# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)

When you must use custom functions:

# If truly needed, be explicit
df = df.with_columns(
    result=pl.col("value").map_elements(
        custom_function,
        return_dtype=pl.Float64,
        skip_nulls=True  # Optimize null handling
    )
)

4. Use Streaming for Very Large Data

Enable streaming for datasets larger than RAM:

# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(engine="streaming")

# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")

5. Optimize Data Types

Choose appropriate data types to reduce memory and improve performance:

# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")

# Good: Specify optimal types
df = pl.read_csv(
    "data.csv",
    schema_overrides={
        "id": pl.UInt32,  # Instead of Int64 if values fit
        "category": pl.Categorical,  # For low-cardinality strings
        "date": pl.Date,  # Instead of String
        "small_int": pl.Int16,  # Instead of Int64
    }
)

Type optimization guidelines:

Use smallest integer type that fits your data
Use Categorical for strings with low cardinality (<50% unique)
Use Date instead of Datetime when time isn't needed
Use Boolean instead of integers for binary flags

6. Parallel Operations

Structure code to maximize parallelization:

# Bad: Each step runs as a separate pass, so the three results are computed one after another
df = (
    df.pipe(operation1)
    .pipe(operation2)
    .pipe(operation3)
)

# Good: Combined operations enable parallelization
df = df.with_columns(
    result1=operation1_expr(),
    result2=operation2_expr(),
    result3=operation3_expr()
)

7. Rechunk After Concatenation

# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])

# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)

Expression Patterns

Conditional Logic

Simple conditions:

df.with_columns(
    status=pl.when(pl.col("age") >= 18)
        .then(pl.lit("adult"))
        .otherwise(pl.lit("minor"))
)

Multiple conditions:

df.with_columns(
    grade=pl.when(pl.col("score") >= 90)
        .then(pl.lit("A"))
        .when(pl.col("score") >= 80)
        .then(pl.lit("B"))
        .when(pl.col("score") >= 70)
        .then(pl.lit("C"))
        .when(pl.col("score") >= 60)
        .then(pl.lit("D"))
        .otherwise(pl.lit("F"))
)

Complex conditions:

df.with_columns(
    category=pl.when(
        (pl.col("revenue") > 1000000) & (pl.col("customers") > 100)
    )
    .then(pl.lit("enterprise"))
    .when(
        (pl.col("revenue") > 100000) | (pl.col("customers") > 50)
    )
    .then(pl.lit("business"))
    .otherwise(pl.lit("starter"))
)

Null Handling

Check for nulls:

df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())

Fill nulls:

# Constant value
df.with_columns(pl.col("value").fill_null(0))

# Forward fill
df.with_columns(pl.col("value").fill_null(strategy="forward"))

# Backward fill
df.with_columns(pl.col("value").fill_null(strategy="backward"))

# Mean
df.with_columns(pl.col("value").fill_null(strategy="mean"))

# Per-group fill
df.with_columns(
    pl.col("value").fill_null(pl.col("value").mean()).over("group")
)

Coalesce (first non-null):

df.with_columns(
    combined=pl.coalesce(["col1", "col2", "col3"])
)

Column Selection Patterns

By name:

df.select("col1", "col2", "col3")

By pattern:

# Regex (pl.col treats a string as a pattern only if it starts with ^ and ends with $)
df.select(pl.col("^sales_.*$"))

# Starts with
df.select(pl.col("^sales.*$"))

# Ends with
df.select(pl.col("^.*_total$"))

# Contains
df.select(pl.col("^.*revenue.*$"))

By type:

import polars.selectors as cs

# All numeric columns
df.select(cs.numeric())

# All string columns
df.select(cs.string())

# Multiple types
df.select(cs.numeric() | cs.boolean())

Exclude columns:

df.select(pl.all().exclude("id", "timestamp"))

Transform multiple columns:

# Apply same operation to multiple columns
df.select(
    pl.col("^sales_.*$") * 1.1  # 10% increase to all sales columns
)

Aggregation Patterns

Multiple aggregations:

df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.col("value").std().alias("std_dev"),
    pl.col("id").count().alias("count"),
    pl.col("id").n_unique().alias("unique_count"),
    pl.col("value").min().alias("minimum"),
    pl.col("value").max().alias("maximum"),
    pl.col("value").quantile(0.5).alias("median"),
    pl.col("value").quantile(0.95).alias("p95")
)

Conditional aggregations:

df.group_by("category").agg(
    # Count high values
    (pl.col("value") > 100).sum().alias("high_count"),

    # Average of filtered values
    pl.col("value").filter(pl.col("active")).mean().alias("active_avg"),

    # Conditional sum
    pl.when(pl.col("status") == "completed")
        .then(pl.col("amount"))
        .otherwise(0)
        .sum()
        .alias("completed_total")
)

Grouped transformations:

df.with_columns(
    # Group statistics
    group_mean=pl.col("value").mean().over("category"),
    group_std=pl.col("value").std().over("category"),

    # Rank within groups
    rank=pl.col("value").rank().over("category"),

    # Percentage of group total
    pct_of_group=(pl.col("value") / pl.col("value").sum().over("category")) * 100
)

Common Pitfalls and Anti-Patterns

Pitfall 1: Row Iteration

# Bad: Never iterate rows
for row in df.iter_rows():
    # Process row
    result = row[0] * 2

# Good: Use vectorized operations
df = df.with_columns(result=pl.col("value") * 2)

Pitfall 2: Modifying in Place

# Bad: pandas-style column assignment is not supported
df["new_col"] = df["old_col"] * 2  # Raises TypeError

# Good: Functional style
df = df.with_columns(new_col=pl.col("old_col") * 2)

Pitfall 3: Not Using Expressions

# Bad: String-based operations
df.select("value * 2")  # Won't work

# Good: Expression-based
df.select(pl.col("value") * 2)

Pitfall 4: Inefficient Joins

# Bad: Join large tables without filtering
result = large_df1.join(large_df2, on="id")

# Good: Filter before joining
result = (
    large_df1.filter(pl.col("active"))
    .join(
        large_df2.filter(pl.col("status") == "valid"),
        on="id"
    )
)

Pitfall 5: Not Specifying Types

# Bad: Let Polars infer everything
df = pl.read_csv("data.csv")

# Good: Specify types for correctness and performance
df = pl.read_csv(
    "data.csv",
    schema_overrides={"id": pl.Int64, "date": pl.Date, "category": pl.Categorical}
)

Pitfall 6: Creating Many Small DataFrames

# Bad: Many operations creating intermediate DataFrames
df1 = df.filter(pl.col("age") > 25)
df2 = df1.select("name", "age")
df3 = df2.sort("age")
result = df3.head(10)

# Good: Chain operations
result = (
    df.filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
)

# Better: Use lazy mode
result = (
    df.lazy()
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .sort("age")
    .head(10)
    .collect()
)

Memory Management

Monitor Memory Usage

# Check DataFrame size
print(f"Estimated size: {df.estimated_size('mb'):.2f} MB")

# Inspect the query plan before executing it
lf = pl.scan_csv("large.csv")
print(lf.explain())  # See query plan

Reduce Memory Footprint

# 1. Use lazy mode
lf = pl.scan_parquet("data.parquet")

# 2. Stream results
result = lf.collect(engine="streaming")

# 3. Select only needed columns
lf = lf.select("col1", "col2")

# 4. Optimize data types
df = df.with_columns(
    pl.col("int_col").cast(pl.Int32),  # Downcast if possible
    pl.col("category").cast(pl.Categorical)  # For low cardinality
)

# 5. Drop columns not needed
df = df.drop("large_text_col", "unused_col")

Testing and Debugging

Inspect Query Plans

lf = pl.scan_csv("data.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# View the optimized query plan (the default)
print(query.explain())

# View the unoptimized plan for comparison
print(query.explain(optimized=False))

Sample Data for Development

# Use n_rows for testing
df = pl.read_csv("large.csv", n_rows=1000)

# Or sample after reading
df_sample = df.sample(n=1000, seed=42)

Validate Schemas

# Check schema
print(df.schema)

# Ensure schema matches expectation
expected_schema = {
    "id": pl.Int64,
    "name": pl.String,
    "date": pl.Date
}

assert df.schema == expected_schema

Profile Performance

import time

# Time operations
start = time.time()
result = lf.collect()
print(f"Execution time: {time.time() - start:.2f}s")

# Compare eager vs lazy
start = time.time()
df_eager = pl.read_csv("data.csv").filter(pl.col("age") > 25)
eager_time = time.time() - start

start = time.time()
df_lazy = pl.scan_csv("data.csv").filter(pl.col("age") > 25).collect()
lazy_time = time.time() - start

print(f"Eager: {eager_time:.2f}s, Lazy: {lazy_time:.2f}s")

File Format Best Practices

Choose the Right Format

Parquet:

Best for: Large datasets, archival, data lakes
Pros: Excellent compression, columnar, fast reads
Cons: Not human-readable

CSV:

Best for: Small datasets, human inspection, legacy systems
Pros: Universal, human-readable
Cons: Slow, large file size, no type preservation

Arrow IPC:

Best for: Inter-process communication, temporary storage
Pros: Fastest, zero-copy, preserves all types
Cons: Less compression than Parquet

File Reading Best Practices

# 1. Use lazy reading
lf = pl.scan_parquet("data.parquet")  # Not read_parquet

# 2. Read multiple files efficiently
lf = pl.scan_parquet("data/*.parquet")  # Parallel reading

# 3. Specify schema when known
lf = pl.scan_csv(
    "data.csv",
    schema_overrides={"id": pl.Int64, "date": pl.Date}
)

# 4. Use predicate pushdown (compare a Date column with a date expression, not a string)
result = lf.filter(pl.col("date") >= pl.date(2023, 1, 1)).collect()

File Writing Best Practices

# 1. Use Parquet for large data
df.write_parquet("output.parquet", compression="zstd")

# 2. Partition large datasets
df.write_parquet("output", partition_by=["year", "month"])

# 3. Use streaming for very large writes
lf.sink_parquet("output.parquet")  # Streaming write

# 4. Optimize compression
df.write_parquet(
    "output.parquet",
    compression="snappy",  # Fast compression
    statistics=True  # Enable predicate pushdown on read
)

Code Organization

Reusable Expressions

# Define reusable expressions
age_group = (
    pl.when(pl.col("age") < 18)
    .then(pl.lit("minor"))
    .when(pl.col("age") < 65)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("senior"))
)

revenue_per_customer = pl.col("revenue") / pl.col("customer_count")

# Use in multiple contexts
df = df.with_columns(
    age_group=age_group,
    rpc=revenue_per_customer
)

# Reuse in filtering
df = df.filter(revenue_per_customer > 100)

Pipeline Functions

def clean_data(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Clean and standardize data."""
    return lf.with_columns(
        pl.col("name").str.to_uppercase(),
        pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"),
        pl.col("amount").fill_null(0)
    )

def add_features(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add computed features."""
    return lf.with_columns(
        month=pl.col("date").dt.month(),
        year=pl.col("date").dt.year(),
        amount_log=pl.col("amount").log()
    )

# Compose pipeline
result = (
    pl.scan_csv("data.csv")
    .pipe(clean_data)
    .pipe(add_features)
    .filter(pl.col("year") == 2023)
    .collect()
)

Documentation

Always document complex expressions and transformations:

# Good: Document intent
df = df.with_columns(
    # Calculate customer lifetime value as total purchases divided by
    # the months between the first and last purchase (30-day months)
    clv=(
        pl.col("total_purchases") /
        ((pl.col("last_purchase_date") - pl.col("first_purchase_date"))
         .dt.total_days() / 30)
    )
)

Version Compatibility

# Check Polars version
import polars as pl
print(pl.__version__)

# Feature availability varies by version
# Document version requirements for production code

Polars Core Concepts

Expressions

Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.

What are Expressions?

An expression describes a transformation on data. It only materializes (executes) within specific contexts:

select() - Select and transform columns
with_columns() - Add or modify columns
filter() - Filter rows
group_by().agg() - Aggregate data

Expression Syntax

Basic column reference:

pl.col("column_name")

Computed expressions:

# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")

# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")

# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)

Expression Contexts

Select context:

df.select(
    "name",  # Simple column name
    pl.col("age"),  # Expression
    (pl.col("age") * 12).alias("age_in_months")  # Computed expression
)

With_columns context:

df.with_columns(
    age_doubled=pl.col("age") * 2,
    name_upper=pl.col("name").str.to_uppercase()
)

Filter context:

df.filter(
    pl.col("age") > 25,
    pl.col("city").is_in(["NY", "LA", "SF"])
)

Group_by context:

df.group_by("department").agg(
    pl.col("salary").mean(),
    pl.col("employee_id").count()
)

Expression Expansion

Apply operations to multiple columns at once:

All columns:

df.select(pl.all() * 2)

Pattern matching:

import polars.selectors as cs

# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)

# All numeric columns
df.select(cs.numeric() + 1)

Exclude patterns:

df.select(pl.all().exclude("id", "name"))

Expression Composition

Expressions can be stored and reused:

# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()

# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)

Data Types

Polars has a strict type system based on Apache Arrow.

Core Data Types

Numeric:

Int8, Int16, Int32, Int64 - Signed integers
UInt8, UInt16, UInt32, UInt64 - Unsigned integers
Float32, Float64 - Floating point numbers

Text:

String - UTF-8 encoded strings (Utf8 is the older alias)
Categorical - Categorized strings (low cardinality)
Enum - Fixed set of string values

Temporal:

Date - Calendar date (no time)
Datetime - Date and time with optional timezone
Time - Time of day
Duration - Time duration/difference

Boolean:

Boolean - True/False values

Nested:

List - Variable-length lists
Array - Fixed-length arrays
Struct - Nested record structures

Other:

Binary - Binary data
Object - Python objects (avoid in production)
Null - Null type

Type Casting

Convert between types explicitly:

# Cast to different type
df.select(
    pl.col("age").cast(pl.Float64),
    pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
    pl.col("id").cast(pl.String)
)

Null Handling

Polars uses consistent null handling across all types:

Check for nulls:

df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())

Fill nulls:

pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")

Drop nulls:

df.drop_nulls()  # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"])  # Drop rows with nulls in specific columns

Categorical Data

Use categorical types for string columns with low cardinality (repeated values):

# Cast to categorical
df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Maintains order information

Lazy vs Eager Evaluation

Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).

Eager Evaluation (DataFrame)

Operations execute immediately:

import polars as pl

# DataFrame operations execute right away
df = pl.read_csv("data.csv")  # Reads file immediately
result = df.filter(pl.col("age") > 25)  # Filters immediately
final = result.select("name", "age")  # Selects immediately

When to use eager:

Small datasets that fit in memory
Interactive exploration in notebooks
Simple one-off operations
Immediate feedback needed

Lazy Evaluation (LazyFrame)

Operations build a query plan, optimized before execution:

import polars as pl

# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv")  # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25)  # Adds to plan
lf3 = lf2.select("name", "age")  # Adds to plan
df = lf3.collect()  # NOW executes optimized plan

When to use lazy:

Large datasets
Complex query pipelines
Only need subset of data
Performance is critical
Streaming required

Query Optimization

Polars automatically optimizes lazy queries:

Predicate Pushdown: Filter operations pushed to data source when possible:

# The age > 25 filter is applied while scanning the CSV, so non-matching rows are not kept in memory
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()

Projection Pushdown: Only read needed columns from data source:

# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()

Query Plan Inspection:

# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain())  # Shows optimized plan

Streaming Mode

Process data larger than memory:

# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(engine="streaming")

Streaming benefits:

Process data larger than RAM
Lower peak memory usage
Chunk-based processing
Automatic memory management

Streaming limitations:

Not all operations support streaming
May be slower for small data
Some operations require materializing entire dataset

Converting Between Eager and Lazy

Eager to Lazy:

df = pl.read_csv("data.csv")
lf = df.lazy()  # Convert to LazyFrame

Lazy to Eager:

lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute and return DataFrame

Memory Format

Polars uses Apache Arrow columnar memory format:

Benefits:

Zero-copy data sharing with other Arrow libraries
Efficient columnar operations
SIMD vectorization
Reduced memory overhead
Fast serialization

Implications:

Data stored column-wise, not row-wise
Column operations very fast
Random row access slower than pandas
Best for analytical workloads

Parallelization

Polars parallelizes operations automatically using Rust's concurrency:

What gets parallelized:

Aggregations within groups
Window functions
Most expression evaluations
File reading (multiple files)
Join operations

What to avoid for parallelization:

Python user-defined functions (UDFs)
Lambda functions in .map_elements()
Sequential .pipe() chains

Best practice:

# Good: Stays in expression API (parallelized)
# Distinct output names avoid a duplicate-column error
df.with_columns(
    value_x10=pl.col("value") * 10,
    value_log=pl.col("value").log(),
    value_sqrt=pl.col("value").sqrt()
)

# Bad: Uses Python function (sequential)
df.with_columns(
    pl.col("value").map_elements(lambda x: x * 10)
)

Strict Type System

Polars enforces strict typing:

No silent conversions:

# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")

# Must cast explicitly
df.with_columns(
    pl.col("int_col").cast(pl.String) + "_suffix"
)

Benefits:

Prevents silent bugs
Predictable behavior
Better performance
Clearer code intent

Integer nulls: Unlike pandas, integer columns can have nulls without converting to float:

# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)

Polars Data I/O Guide

Comprehensive guide to reading and writing data in various formats with Polars.

CSV Files

Reading CSV

Eager mode (loads into memory):

import polars as pl

# Basic read
df = pl.read_csv("data.csv")

# With options
df = pl.read_csv(
    "data.csv",
    separator=",",
    has_header=True,
    columns=["col1", "col2"],  # Select specific columns
    n_rows=1000,  # Read only first 1000 rows
    skip_rows=10,  # Skip first 10 rows
    schema_overrides={"col1": pl.Int64, "col2": pl.String},  # Specify types
    null_values=["NA", "null", ""],  # Define null values
    encoding="utf8",  # Native fast path: "utf8" or "utf8-lossy"
    ignore_errors=False
)

Lazy mode (scans without loading - recommended for large files):

# Scan CSV (builds query plan)
lf = pl.scan_csv("data.csv")

# Apply operations
result = lf.filter(pl.col("age") > 25).select("name", "age")

# Execute and load
df = result.collect()

Writing CSV

# Basic write
df.write_csv("output.csv")

# With options
df.write_csv(
    "output.csv",
    separator=",",
    include_header=True,
    null_value="",  # How to represent nulls
    quote_char='"',
    line_terminator="\n"
)

Multiple CSV Files

Read multiple files:

# Read all CSVs in directory
lf = pl.scan_csv("data/*.csv")

# Read specific files
lf = pl.scan_csv(["file1.csv", "file2.csv", "file3.csv"])

Parquet Files

Parquet is the recommended format for performance and compression.

Reading Parquet

Eager:

df = pl.read_parquet("data.parquet")

# With options
df = pl.read_parquet(
    "data.parquet",
    columns=["col1", "col2"],  # Select specific columns
    n_rows=1000,  # Read first N rows
    parallel="auto"  # Control parallelization
)

Lazy (recommended):

lf = pl.scan_parquet("data.parquet")

# Automatic predicate and projection pushdown
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()

Writing Parquet

# Basic write
df.write_parquet("output.parquet")

# With compression
df.write_parquet(
    "output.parquet",
    compression="snappy",  # Options: "snappy", "gzip", "brotli", "lz4", "zstd"
    statistics=True,  # Write statistics (enables predicate pushdown)
    use_pyarrow=False  # Use Rust writer (faster)
)

Partitioned Parquet (Hive-style)

Write partitioned:

# Write with partitioning
df.write_parquet(
    "output_dir",
    partition_by=["year", "month"]  # Creates directory structure
)
# Creates a directory tree such as output_dir/year=2023/month=1/<file>.parquet

Read partitioned:

lf = pl.scan_parquet("output_dir/**/*.parquet", hive_partitioning=True)

# With hive_partitioning=True, partition columns are parsed from the directory names
result = lf.filter(pl.col("year") == 2023).collect()

JSON Files

Reading JSON

NDJSON (newline-delimited JSON) - recommended:

df = pl.read_ndjson("data.ndjson")

# Lazy
lf = pl.scan_ndjson("data.ndjson")

Standard JSON:

df = pl.read_json("data.json")

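# From JSON string (wrap it in a file-like object; a bare str is treated as a path)
import io
df = pl.read_json(io.StringIO('[{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]'))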

Writing JSON

# Write NDJSON
df.write_ndjson("output.ndjson")

# Write standard JSON
df.write_json("output.json")

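# Pretty printed: write_json() takes no formatting options in Polars 1.x,
# so indent with the standard json module instead
import json
with open("output.json", "w") as f:
    json.dump(df.to_dicts(), f, indent=2, default=str)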

Excel Files

Reading Excel

# Read first sheet
df = pl.read_excel("data.xlsx")

# Specific sheet
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
# Or by index
df = pl.read_excel("data.xlsx", sheet_id=1)  # sheet_id is 1-based; 0 loads every sheet

# With options
df = pl.read_excel(
    "data.xlsx",
    sheet_name="Sheet1",
    columns=["A", "B", "C"],  # Column names (or indices) to keep
    has_header=True,
    read_options={"n_rows": 100, "skip_rows": 5},  # Forwarded to the Excel engine (calamine by default)
)

Writing Excel

# Write to Excel
df.write_excel("output.xlsx")

# Multiple sheets (share one xlsxwriter Workbook)
import xlsxwriter
with xlsxwriter.Workbook("output.xlsx") as wb:
    df1.write_excel(workbook=wb, worksheet="Sheet1")
    df2.write_excel(workbook=wb, worksheet="Sheet2")

Database Connectivity

Read from Database

import polars as pl

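# Read entire table through an open connection (for example a SQLAlchemy engine)
from sqlalchemy import create_engine
engine = create_engine("postgresql://...")
df = pl.read_database("SELECT * FROM users", connection=engine)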

# Using connectorx for better performance
df = pl.read_database_uri(
    "SELECT * FROM users WHERE age > 25",
    uri="postgresql://user:pass@localhost/db"
)

Write to Database

# Using SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/db")
df.write_database("table_name", connection=engine)

# With options
df.write_database(
    "table_name",
    connection=engine,
    if_table_exists="replace",  # or "append", "fail"
)

Common Database Connectors

PostgreSQL:

uri = "postgresql://username:password@localhost:5432/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)

MySQL:

uri = "mysql://username:password@localhost:3306/database"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)

SQLite:

uri = "sqlite:///path/to/database.db"
df = pl.read_database_uri("SELECT * FROM table", uri=uri)

Cloud Storage

AWS S3

# Read from S3
df = pl.read_parquet("s3://bucket/path/to/file.parquet")
lf = pl.scan_parquet("s3://bucket/path/*.parquet")

# Write to S3
df.write_parquet("s3://bucket/path/output.parquet")

# Prefer cloud profiles, IAM roles, or Polars credential providers over
# hardcoding secrets in scripts.
lf = pl.scan_parquet(
    "s3://bucket/file.parquet",
    credential_provider=pl.CredentialProviderAWS(profile_name="analytics"),
)
df = lf.collect()

Azure Blob Storage

# Read from Azure
df = pl.read_parquet("az://container/path/file.parquet")

# Write to Azure
df.write_parquet("az://container/path/output.parquet")

# Prefer managed identity or an Azure SDK credential provider.
from azure.identity import DefaultAzureCredential

df = pl.read_parquet(
    "abfss://container@account.dfs.core.windows.net/path/file.parquet",
    credential_provider=pl.CredentialProviderAzure(
        credential=DefaultAzureCredential()
    ),
)

Google Cloud Storage (GCS)

# Read from GCS
df = pl.read_parquet("gs://bucket/path/file.parquet")

# Write to GCS
df.write_parquet("gs://bucket/path/output.parquet")

# Prefer Application Default Credentials or workload identity configured
# outside the script.
df = pl.read_parquet("gs://bucket/path/file.parquet")

Google BigQuery

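# Read from BigQuery through a SQLAlchemy engine
# (assumes the sqlalchemy-bigquery dialect is installed)
from sqlalchemy import create_engine
df = pl.read_database(
    "SELECT * FROM project.dataset.table",
    connection=create_engine("bigquery://project")
)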

# Or using Google Cloud SDK
from google.cloud import bigquery
client = bigquery.Client()

query = "SELECT * FROM project.dataset.table WHERE date > '2023-01-01'"
df = pl.from_pandas(client.query(query).to_dataframe())

Apache Arrow

IPC/Feather Format

Read:

df = pl.read_ipc("data.arrow")
lf = pl.scan_ipc("data.arrow")

Write:

df.write_ipc("output.arrow")

# Compressed
df.write_ipc("output.arrow", compression="zstd")

Arrow Streaming

# Write streaming format
df.write_ipc_stream("output.arrows", compression="zstd")

# Read streaming
df = pl.read_ipc_stream("output.arrows")

From/To Arrow

import pyarrow as pa

# From Arrow Table
arrow_table = pa.table({"col": [1, 2, 3]})
df = pl.from_arrow(arrow_table)

# To Arrow Table
arrow_table = df.to_arrow()

In-Memory Formats

Python Dictionaries

# From dict
df = pl.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"]
})

# To dict
data_dict = df.to_dict()  # Column-oriented
data_dict = df.to_dict(as_series=False)  # Lists instead of Series

NumPy Arrays

import numpy as np

# From NumPy
arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pl.DataFrame(arr, schema=["col1", "col2"])

# To NumPy
arr = df.to_numpy()

Pandas DataFrames

import pandas as pd

# From Pandas
pd_df = pd.DataFrame({"col": [1, 2, 3]})
pl_df = pl.from_pandas(pd_df)

# To Pandas
pd_df = pl_df.to_pandas()

# Via an Arrow table (zero-copy when possible)
import pyarrow as pa
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))

Lists of Rows

# From list of dicts
data = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
]
df = pl.DataFrame(data)

# To list of dicts
rows = df.to_dicts()

# From list of tuples
data = [("Alice", 25), ("Bob", 30)]
df = pl.DataFrame(data, schema=["name", "age"], orient="row")  # State the orientation explicitly

Streaming Large Files

For datasets larger than memory, use lazy mode with streaming:

# Streaming mode
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("value") > 100).collect(engine="streaming")

# Streaming with multiple files
lf = pl.scan_parquet("data/*.parquet")
result = lf.group_by("category").agg(pl.col("value").sum()).collect(engine="streaming")

Best Practices

Format Selection

Use Parquet when:

Need compression (up to 10x smaller than CSV)
Want fast reads/writes
Need to preserve data types
Working with large datasets
Need predicate pushdown

Use CSV when:

Need human-readable format
Interfacing with legacy systems
Data is small
Need universal compatibility

Use JSON when:

Working with nested/hierarchical data
Need web API compatibility
Data has flexible schema

Use Arrow IPC when:

Need zero-copy data sharing
Fastest serialization required
Working between Arrow-compatible systems

Reading Large Files

# 1. Always use lazy mode
lf = pl.scan_csv("large.csv")  # NOT read_csv

# 2. Filter and select early (pushdown optimization)
result = (
    lf
    .filter(pl.col("date") > "2023-01-01")  # Filter while "date" is still available
    .select("col1", "col2", "col3")  # Then keep only needed columns
    .collect()
)

# 3. Use streaming for very large data
result = lf.filter(...).select(...).collect(engine="streaming")

# 4. Read only needed rows during development
df = pl.read_csv("large.csv", n_rows=10000)  # Sample for testing

Writing Large Files

# 1. Use Parquet with compression
df.write_parquet("output.parquet", compression="zstd")

# 2. Use partitioning for very large datasets
df.write_parquet("output", partition_by=["year", "month"])

# 3. Write streaming
lf = pl.scan_csv("input.csv")
lf.sink_parquet("output.parquet")  # Streaming write

Performance Tips

# 1. Specify dtypes when reading CSV
df = pl.read_csv(
    "data.csv",
    schema_overrides={"id": pl.Int64, "name": pl.String}  # Avoids inference
)

# 2. Use appropriate compression
df.write_parquet("output.parquet", compression="snappy")  # Fast
df.write_parquet("output.parquet", compression="zstd")    # Better compression

# 3. Parallel reading (read_csv is multithreaded by default; n_threads sets the thread count)
df = pl.read_csv("data.csv", n_threads=4)

# 4. Read multiple files in parallel
lf = pl.scan_parquet("data/*.parquet")  # Automatic parallel read

Error Handling

try:
    df = pl.read_csv("data.csv")
except pl.exceptions.ComputeError as e:
    print(f"Error reading CSV: {e}")

# Ignore errors during parsing
df = pl.read_csv("messy.csv", ignore_errors=True)

# Handle missing files
from pathlib import Path
if Path("data.csv").exists():
    df = pl.read_csv("data.csv")
else:
    print("File not found")

Schema Management

# Infer schema from sample
schema = pl.read_csv("data.csv", n_rows=1000).schema

# Use inferred schema for full read
df = pl.read_csv("data.csv", schema=schema)

# Define schema explicitly
schema = {
    "id": pl.Int64,
    "name": pl.String,
    "date": pl.Date,
    "value": pl.Float64
}
df = pl.read_csv("data.csv", schema=schema)

Polars Operations Reference

This reference covers all common Polars operations with comprehensive examples.

Selection Operations

Select Columns

Basic selection:

# Select specific columns
df.select("name", "age", "city")

# Using expressions
df.select(pl.col("name"), pl.col("age"))

Pattern-based selection:

import polars.selectors as cs

# All columns starting with "sales_"
df.select(pl.col("^sales_.*$"))

# All numeric columns
df.select(cs.numeric())

# All columns except specific ones
df.select(pl.all().exclude("id", "timestamp"))

Computed columns:

df.select(
    "name",
    (pl.col("age") * 12).alias("age_in_months"),
    (pl.col("salary") * 1.1).alias("salary_after_raise")
)

With Columns (Add/Modify)

Add new columns or modify existing ones while preserving all other columns:

# Add new columns
df.with_columns(
    age_doubled=pl.col("age") * 2,
    full_name=pl.col("first_name") + " " + pl.col("last_name")
)

# Modify existing columns
df.with_columns(
    pl.col("name").str.to_uppercase().alias("name"),
    pl.col("salary").cast(pl.Float64).alias("salary")
)

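# Multiple operations in parallel (each output needs a unique name)
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
    (pl.col("value") * 1000).alias("value_x1000"),
)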

Filtering Operations

Basic Filtering

# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (AND)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# OR conditions
df.filter(
    (pl.col("age") > 30) | (pl.col("salary") > 100000)
)

# NOT condition
df.filter(~pl.col("active"))
df.filter(pl.col("city") != "NY")

Advanced Filtering

String operations:

# Contains substring
df.filter(pl.col("name").str.contains("John"))

# Starts with
df.filter(pl.col("email").str.starts_with("admin"))

# Regex match
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{3}-\d{4}$"))

Membership checks:

# In list
df.filter(pl.col("city").is_in(["NY", "LA", "SF"]))

# Not in list
df.filter(~pl.col("status").is_in(["inactive", "deleted"]))

Range filters:

# Between values
df.filter(pl.col("age").is_between(25, 35))

# Date range
df.filter(
    pl.col("date") >= pl.date(2023, 1, 1),
    pl.col("date") <= pl.date(2023, 12, 31)
)

Null filtering:

# Filter out nulls
df.filter(pl.col("value").is_not_null())

# Keep only nulls
df.filter(pl.col("value").is_null())

Grouping and Aggregation

Basic Group By

# Group by single column
df.group_by("department").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.len().alias("employee_count")
)

# Group by multiple columns
df.group_by("department", "location").agg(
    pl.col("salary").sum()
)

# Maintain order
df.group_by("category", maintain_order=True).agg(
    pl.col("value").sum()
)

Aggregation Functions

Count and length:

df.group_by("category").agg(
    pl.len().alias("count"),
    pl.col("id").count().alias("non_null_count"),
    pl.col("id").n_unique().alias("unique_count")
)

Statistical aggregations:

df.group_by("group").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.col("value").median().alias("median"),
    pl.col("value").std().alias("std_dev"),
    pl.col("value").var().alias("variance"),
    pl.col("value").min().alias("minimum"),
    pl.col("value").max().alias("maximum"),
    pl.col("value").quantile(0.95).alias("p95")
)

First and last:

df.group_by("user_id").agg(
    pl.col("timestamp").first().alias("first_seen"),
    pl.col("timestamp").last().alias("last_seen"),
    pl.col("event").first().alias("first_event")
)

List aggregation:

# Collect values into lists
df.group_by("category").agg(
    pl.col("item").alias("all_items")  # Creates list column
)

Conditional Aggregations

Filter within aggregations:

df.group_by("department").agg(
    # Count high earners
    (pl.col("salary") > 100000).sum().alias("high_earners"),

    # Average of filtered values
    pl.col("salary").filter(pl.col("bonus") > 0).mean().alias("avg_with_bonus"),

    # Conditional sum
    pl.when(pl.col("active"))
      .then(pl.col("sales"))
      .otherwise(0)
      .sum()
      .alias("active_sales")
)

Multiple Aggregations

Combine multiple aggregations efficiently:

df.group_by("store_id").agg(
    pl.col("transaction_id").count().alias("num_transactions"),
    pl.col("amount").sum().alias("total_sales"),
    pl.col("amount").mean().alias("avg_transaction"),
    pl.col("customer_id").n_unique().alias("unique_customers"),
    pl.col("amount").max().alias("largest_transaction"),
    pl.col("timestamp").min().alias("first_transaction_date"),
    pl.col("timestamp").max().alias("last_transaction_date")
)

Window Functions

Window functions apply aggregations while preserving the original row count.

Basic Window Operations

Group statistics:

# Add group mean to each row
df.with_columns(
    avg_age_by_dept=pl.col("age").mean().over("department")
)

# Multiple group columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)

Ranking:

df.with_columns(
    # Rank within groups
    rank=pl.col("score").rank().over("team"),

    # Dense rank (no gaps)
    dense_rank=pl.col("score").rank(method="dense").over("team"),

    # Row number within each user, ordered by timestamp
    row_num=pl.col("timestamp").rank(method="ordinal").over("user_id")
)

Window Mapping Strategies

group_to_rows (default): Preserves original row order:

df.with_columns(
    group_mean=pl.col("value").mean().over("category", mapping_strategy="group_to_rows")
)

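explode: Faster, but rows come back grouped together rather than in their original order, so sort by the group key first:

df.sort("category").select(
    pl.col("category").over("category", mapping_strategy="explode"),
    pl.col("value").cum_sum().over("category", mapping_strategy="explode").alias("running_total")
)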

join: Creates list columns:

df.with_columns(
    group_values=pl.col("value").over("category", mapping_strategy="join")
)

Rolling Windows

Time-based rolling:

df.with_columns(
    rolling_avg=pl.col("value").rolling_mean_by(
        "date",
        window_size="7d"
    )
)

Row-based rolling:

df.with_columns(
    rolling_sum=pl.col("value").rolling_sum(window_size=3),
    rolling_max=pl.col("value").rolling_max(window_size=5)
)

Cumulative Operations

df.with_columns(
    cumsum=pl.col("value").cum_sum().over("group"),
    cummax=pl.col("value").cum_max().over("group"),
    cummin=pl.col("value").cum_min().over("group"),
    cumprod=pl.col("value").cum_prod().over("group")
)

Shift and Lag/Lead

df.with_columns(
    # Previous value (lag)
    prev_value=pl.col("value").shift(1).over("user_id"),

    # Next value (lead)
    next_value=pl.col("value").shift(-1).over("user_id"),

    # Calculate difference from previous
    diff=pl.col("value") - pl.col("value").shift(1).over("user_id")
)

Sorting

Basic Sorting

# Sort by single column
df.sort("age")

# Sort descending
df.sort("age", descending=True)

# Sort by multiple columns
df.sort("department", "age")

# Mixed sorting order
df.sort(["department", "salary"], descending=[False, True])

Advanced Sorting

Null handling:

# Nulls first
df.sort("value", nulls_last=False)

# Nulls last
df.sort("value", nulls_last=True)

Sort by expression:

# Sort by computed value
df.sort(pl.col("first_name").str.len_chars())

# Sort by multiple expressions
df.sort(
    pl.col("last_name").str.to_lowercase(),
    pl.col("age").abs()
)

Conditional Operations

When/Then/Otherwise

# Basic conditional
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
        .then(pl.lit("adult"))
        .otherwise(pl.lit("minor"))
)

# Multiple conditions
df.with_columns(
    category=pl.when(pl.col("score") >= 90)
        .then(pl.lit("A"))
        .when(pl.col("score") >= 80)
        .then(pl.lit("B"))
        .when(pl.col("score") >= 70)
        .then(pl.lit("C"))
        .otherwise(pl.lit("F"))
)

# Conditional computation
df.with_columns(
    adjusted_price=pl.when(pl.col("is_member"))
        .then(pl.col("price") * 0.9)
        .otherwise(pl.col("price"))
)

String Operations

Common String Methods

df.with_columns(
    # Case conversion
    upper=pl.col("name").str.to_uppercase(),
    lower=pl.col("name").str.to_lowercase(),
    title=pl.col("name").str.to_titlecase(),

    # Trimming
    trimmed=pl.col("text").str.strip_chars(),

    # Substring
    first_3=pl.col("name").str.slice(0, 3),

    # Replace
    cleaned=pl.col("text").str.replace("old", "new"),
    cleaned_all=pl.col("text").str.replace_all("old", "new"),

    # Split
    parts=pl.col("full_name").str.split(" "),

    # Length
    name_length=pl.col("name").str.len_chars()
)

String Filtering

# Contains
df.filter(pl.col("email").str.contains("@gmail.com"))

# Starts/ends with
df.filter(pl.col("name").str.starts_with("A"))
df.filter(pl.col("file").str.ends_with(".csv"))

# Regex matching
df.filter(pl.col("phone").str.contains(r"^\d{3}-\d{4}$"))

Date and Time Operations

Date Parsing

# Parse strings to dates
df.with_columns(
    date=pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d"),
    datetime=pl.col("dt_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
)

Date Components

df.with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
    day=pl.col("date").dt.day(),
    weekday=pl.col("date").dt.weekday(),
    hour=pl.col("datetime").dt.hour(),
    minute=pl.col("datetime").dt.minute()
)

Date Arithmetic

# Add duration
df.with_columns(
    next_week=pl.col("date") + pl.duration(weeks=1),
    next_month=pl.col("date").dt.offset_by("1mo")  # pl.duration() has no months argument
)

# Difference between dates
df.with_columns(
    days_diff=(pl.col("end_date") - pl.col("start_date")).dt.total_days()
)

Date Filtering

# Filter by date range
df.filter(
    pl.col("date").is_between(pl.date(2023, 1, 1), pl.date(2023, 12, 31))
)

# Filter by year
df.filter(pl.col("date").dt.year() == 2023)

# Filter by month
df.filter(pl.col("date").dt.month().is_in([6, 7, 8]))  # Summer months

List Operations

Working with List Columns

# Create list column
df.with_columns(
    items_list=pl.concat_list("item1", "item2", "item3")
)

# List operations
df.with_columns(
    list_len=pl.col("items").list.len(),
    first_item=pl.col("items").list.first(),
    last_item=pl.col("items").list.last(),
    unique_items=pl.col("items").list.unique(),
    sorted_items=pl.col("items").list.sort()
)

# Explode lists to rows
df.explode("items")

# For element-wise list filtering, use Polars' native list-expression
# methods with pl.element(); avoid Python callbacks in hot paths.

Struct Operations

Working with Nested Structures

# Create struct column
df.with_columns(
    address=pl.struct(["street", "city", "zip"])
)

# Access struct fields
df.with_columns(
    city=pl.col("address").struct.field("city")
)

# Unnest struct to columns
df.unnest("address")

Unique and Duplicate Operations

# Get unique rows
df.unique()

# Unique on specific columns
df.unique(subset=["name", "email"])

# Keep first/last duplicate
df.unique(subset=["id"], keep="first")
df.unique(subset=["id"], keep="last")

# Identify duplicates
df.with_columns(
    is_duplicate=pl.col("id").is_duplicated()
)

# Count duplicates
df.group_by("email").agg(
    pl.len().alias("count")
).filter(pl.col("count") > 1)

Sampling

# Random sample
df.sample(n=100)

# Sample fraction
df.sample(fraction=0.1)

# Sample with seed for reproducibility
df.sample(n=100, seed=42)

Column Renaming

# Rename specific columns
df.rename({"old_name": "new_name", "age": "years"})

# Rename with expression
df.select(pl.col("*").name.suffix("_renamed"))
df.select(pl.col("*").name.prefix("data_"))
df.select(pl.col("*").name.to_uppercase())

Pandas to Polars Migration Guide

This guide helps you migrate from pandas to Polars with comprehensive operation mappings and key differences.

Core Conceptual Differences

1. No Index System

Pandas: Uses row-based indexing system

df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")

Polars: Uses integer positions only

df[0, "column"]  # Row position, column name
df[0:5]  # Row slice
# No set_index equivalent - filter or join on the key column instead

2. Memory Format

Pandas: NumPy-backed columns by default (Arrow-backed dtypes are opt-in)

Polars: Columnar Apache Arrow format

Implications:

Polars is faster for column operations
Polars uses less memory
Polars has better data sharing capabilities

3. Parallelization

Pandas: Primarily single-threaded (requires Dask for parallelism)

Polars: Parallel by default using Rust's concurrency

4. Lazy Evaluation

Pandas: Only eager evaluation

Polars: Both eager (DataFrame) and lazy (LazyFrame) with query optimization

5. Type Strictness

Pandas: Allows silent type conversions

Polars: Strict typing, explicit casts required

Example:

# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4]  # dtype: float64

# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]})  # dtype: Int64

Operation Mappings

Data Selection

Operation	Pandas	Polars
Select column	`df["col"]` or `df.col`	`df.select("col")` or `df["col"]`
Select multiple	`df[["a", "b"]]`	`df.select("a", "b")`
Select by position	`df.iloc[:, 0:3]`	`df.select(pl.col(df.columns[0:3]))`
Select by condition	`df[df["age"] > 25]`	`df.filter(pl.col("age") > 25)`

Data Filtering

Operation	Pandas	Polars
Single condition	`df[df["age"] > 25]`	`df.filter(pl.col("age") > 25)`
Multiple conditions	`df[(df["age"] > 25) & (df["city"] == "NY")]`	`df.filter(pl.col("age") > 25, pl.col("city") == "NY")`
Query method	`df.query("age > 25")`	`df.filter(pl.col("age") > 25)`
isin	`df[df["city"].isin(["NY", "LA"])]`	`df.filter(pl.col("city").is_in(["NY", "LA"]))`
isna	`df[df["value"].isna()]`	`df.filter(pl.col("value").is_null())`
notna	`df[df["value"].notna()]`	`df.filter(pl.col("value").is_not_null())`

Adding/Modifying Columns

Operation	Pandas	Polars
Add column	`df["new"] = df["old"] * 2`	`df.with_columns(new=pl.col("old") * 2)`
Multiple columns	`df.assign(a=..., b=...)`	`df.with_columns(a=..., b=...)`
Conditional column	`np.where(condition, a, b)`	`pl.when(condition).then(a).otherwise(b)`

Important difference - Parallel execution:

# Pandas: Sequential (lambda sees previous results)
df.assign(
    a=lambda df_: df_.value * 10,
    b=lambda df_: df_.value * 100
)

# Polars: Parallel (all computed together)
df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100
)

Grouping and Aggregation

Operation	Pandas	Polars
Group by	`df.groupby("col")`	`df.group_by("col")`
Agg single	`df.groupby("col")["val"].mean()`	`df.group_by("col").agg(pl.col("val").mean())`
Agg multiple	`df.groupby("col").agg({"val": ["mean", "sum"]})`	`df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())`
Size	`df.groupby("col").size()`	`df.group_by("col").agg(pl.len())`
Count	`df.groupby("col").count()`	`df.group_by("col").agg(pl.col("*").count())`

Window Functions

Operation	Pandas	Polars
Transform	`df.groupby("col").transform("mean")`	`df.with_columns(pl.col("val").mean().over("col"))`
Rank	`df.groupby("col")["val"].rank()`	`df.with_columns(pl.col("val").rank().over("col"))`
Shift	`df.groupby("col")["val"].shift(1)`	`df.with_columns(pl.col("val").shift(1).over("col"))`
Cumsum	`df.groupby("col")["val"].cumsum()`	`df.with_columns(pl.col("val").cum_sum().over("col"))`

Joins

Operation	Pandas	Polars
Inner join	`df1.merge(df2, on="id")`	`df1.join(df2, on="id", how="inner")`
Left join	`df1.merge(df2, on="id", how="left")`	`df1.join(df2, on="id", how="left")`
Different keys	`df1.merge(df2, left_on="a", right_on="b")`	`df1.join(df2, left_on="a", right_on="b")`

Concatenation

Operation	Pandas	Polars
Vertical	`pd.concat([df1, df2], axis=0)`	`pl.concat([df1, df2], how="vertical")`
Horizontal	`pd.concat([df1, df2], axis=1)`	`pl.concat([df1, df2], how="horizontal")`

Sorting

Operation	Pandas	Polars
Sort by column	`df.sort_values("col")`	`df.sort("col")`
Descending	`df.sort_values("col", ascending=False)`	`df.sort("col", descending=True)`
Multiple columns	`df.sort_values(["a", "b"])`	`df.sort("a", "b")`

Reshaping

Operation	Pandas	Polars
Pivot	`df.pivot(index="a", columns="b", values="c")`	`df.pivot(on="b", values="c", index="a")`
Melt	`df.melt(id_vars="id")`	`df.unpivot(index="id")`

I/O Operations

Operation	Pandas	Polars
Read CSV	`pd.read_csv("file.csv")`	`pl.read_csv("file.csv")` or `pl.scan_csv()`
Write CSV	`df.to_csv("file.csv")`	`df.write_csv("file.csv")`
Read Parquet	`pd.read_parquet("file.parquet")`	`pl.read_parquet("file.parquet")`
Write Parquet	`df.to_parquet("file.parquet")`	`df.write_parquet("file.parquet")`
Read Excel	`pd.read_excel("file.xlsx")`	`pl.read_excel("file.xlsx")`

String Operations

Operation	Pandas	Polars
Upper	`df["col"].str.upper()`	`df.select(pl.col("col").str.to_uppercase())`
Lower	`df["col"].str.lower()`	`df.select(pl.col("col").str.to_lowercase())`
Contains	`df["col"].str.contains("pattern")`	`df.filter(pl.col("col").str.contains("pattern"))`
Replace	`df["col"].str.replace("old", "new")`	`df.select(pl.col("col").str.replace("old", "new"))`
Split	`df["col"].str.split(" ")`	`df.select(pl.col("col").str.split(" "))`

Datetime Operations

Operation	Pandas	Polars
Parse dates	`pd.to_datetime(df["col"])`	`df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))`
Year	`df["date"].dt.year`	`df.select(pl.col("date").dt.year())`
Month	`df["date"].dt.month`	`df.select(pl.col("date").dt.month())`
Day	`df["date"].dt.day`	`df.select(pl.col("date").dt.day())`

Missing Data

Operation	Pandas	Polars
Drop nulls	`df.dropna()`	`df.drop_nulls()`
Fill nulls	`df.fillna(0)`	`df.fill_null(0)`
Check null	`df["col"].isna()`	`df.select(pl.col("col").is_null())`
Forward fill	`df.ffill()`	`df.select(pl.col("col").fill_null(strategy="forward"))`

Other Operations

Operation	Pandas	Polars
Unique values	`df["col"].unique()`	`df["col"].unique()`
Value counts	`df["col"].value_counts()`	`df["col"].value_counts()`
Describe	`df.describe()`	`df.describe()`
Sample	`df.sample(n=100)`	`df.sample(n=100)`
Head	`df.head()`	`df.head()`
Tail	`df.tail()`	`df.tail()`

Common Migration Patterns

Pattern 1: Chained Operations

Pandas:

result = (df
    .assign(new_col=lambda x: x["old_col"] * 2)
    .query("new_col > 10")
    .groupby("category")
    .agg({"value": "sum"})
    .reset_index()
)

Polars:

result = (df
    .with_columns(new_col=pl.col("old_col") * 2)
    .filter(pl.col("new_col") > 10)
    .group_by("category")
    .agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have index

Pattern 2: Apply Functions

Pandas:

# Row-wise apply: avoid translating this literally to Polars, it breaks parallelization
df["result"] = df["value"].apply(lambda x: x * 2)

Polars:

# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)

# If custom function needed
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)

Pattern 3: Conditional Column Creation

Pandas:

df["category"] = np.where(
    df["value"] > 100,
    "high",
    np.where(df["value"] > 50, "medium", "low")
)

Polars:

df = df.with_columns(
    category=pl.when(pl.col("value") > 100)
        .then(pl.lit("high"))
        .when(pl.col("value") > 50)
        .then(pl.lit("medium"))
        .otherwise(pl.lit("low"))
)

Pattern 4: Group Transform

Pandas:

df["group_mean"] = df.groupby("category")["value"].transform("mean")

Polars:

df = df.with_columns(
    group_mean=pl.col("value").mean().over("category")
)

Pattern 5: Multiple Aggregations

Pandas:

result = df.groupby("category").agg({
    "value": ["mean", "sum", "count"],
    "price": ["min", "max"]
})

Polars:

result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count"),
    pl.col("price").min().alias("price_min"),
    pl.col("price").max().alias("price_max")
)

Performance Anti-Patterns to Avoid

Anti-Pattern 1: Sequential Pipe Operations

Bad (disables parallelization):

df = df.pipe(function1).pipe(function2).pipe(function3)

Good (enables parallelization):

df = df.with_columns(
    function1_result(),
    function2_result(),
    function3_result()
)

Anti-Pattern 2: Python Functions in Hot Paths

Bad:

df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2)
)

Good:

df = df.with_columns(result=pl.col("value") * 2)

Anti-Pattern 3: Using Eager Reading for Large Files

Bad:

df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

Good:

lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()

Anti-Pattern 4: Row Iteration

Bad:

for row in df.iter_rows():
    # Process row
    pass

Good:

# Use vectorized operations
df = df.with_columns(
    # Vectorized computation
)

Migration Checklist

When migrating from pandas to Polars:

1. Remove index operations - Use integer positions or group_by
2. Replace apply/map with expressions - Use Polars native operations
3. Update column assignment - Use with_columns() instead of direct assignment
4. Change groupby.transform to .over() - Window functions work differently
5. Update string operations - Use .str.to_uppercase() instead of .str.upper()
6. Add explicit type casts - Polars won't silently convert types
7. Consider lazy evaluation - Use scan_* instead of read_* for large data
8. Update aggregation syntax - More explicit in Polars
9. Remove reset_index calls - Not needed in Polars
10. Update conditional logic - Use when().then().otherwise() pattern

Compatibility Layer

For gradual migration, you can use both libraries:

import pandas as pd
import polars as pl

# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)

# Convert Polars to pandas
pd_df = pl_df.to_pandas()

# Go through Arrow explicitly (zero-copy when dtypes allow; requires pyarrow)
import pyarrow as pa

pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()

When to Stick with Pandas

Consider staying with pandas when:

Working with time series requiring complex index operations
Need extensive ecosystem support (some libraries only support pandas)
Team has no time to learn the Polars expression API (no Rust knowledge is needed to use Polars from Python)
Data is small and performance isn't critical
Using advanced pandas features without Polars equivalents

When to Switch to Polars

Switch to Polars when:

Performance is critical
Working with large datasets (>1GB)
Need lazy evaluation and query optimization
Want better type safety
Need parallel execution by default
Starting a new project

Polars Data Transformations

Comprehensive guide to joins, concatenation, and reshaping operations in Polars.

Joins

Joins combine data from multiple DataFrames based on common columns.

Basic Join Types

Inner Join (intersection):

# Keep only matching rows from both DataFrames
result = df1.join(df2, on="id", how="inner")

Left Join (all left + matches from right):

# Keep all rows from left, add matching rows from right
result = df1.join(df2, on="id", how="left")

Full Join (union):

# Keep all rows from both DataFrames
result = df1.join(df2, on="id", how="full")

Cross Join (Cartesian product):

# Every row from left with every row from right
result = df1.join(df2, how="cross")

Semi Join (filtered left):

# Keep only left rows that have a match in right
result = df1.join(df2, on="id", how="semi")

Anti Join (non-matching left):

# Keep only left rows that DON'T have a match in right
result = df1.join(df2, on="id", how="anti")

Join Syntax Variations

Single column join:

df1.join(df2, on="id")

Multiple columns join:

df1.join(df2, on=["id", "date"])

Different column names:

df1.join(df2, left_on="user_id", right_on="id")

Multiple different columns:

df1.join(
    df2,
    left_on=["user_id", "date"],
    right_on=["id", "timestamp"]
)

Suffix Handling

When both DataFrames have columns with the same name (other than join keys):

# Add suffixes to distinguish columns
result = df1.join(df2, on="id", suffix="_right")

# Results in: value, value_right (if both had "value" column)

Join Examples

Example 1: Customer Orders

customers = pl.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"]
})

orders = pl.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 1],
    "amount": [100, 200, 150]
})

# Inner join - only customers with orders
result = customers.join(orders, on="customer_id", how="inner")

# Left join - all customers, even without orders
result = customers.join(orders, on="customer_id", how="left")

Example 2: Time-series data

prices = pl.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "stock": ["AAPL", "AAPL", "AAPL"],
    "price": [150, 152, 151]
})

volumes = pl.DataFrame({
    "date": ["2023-01-01", "2023-01-02"],
    "stock": ["AAPL", "AAPL"],
    "volume": [1000000, 1100000]
})

result = prices.join(
    volumes,
    on=["date", "stock"],
    how="left"
)

Asof Joins (Nearest Match)

For time-series data, join to nearest timestamp:

# Join to nearest earlier timestamp
quotes = pl.DataFrame({
    "timestamp": [1.0, 2.0, 3.0, 4.0, 5.0],  # floats, so the key dtype matches trades
    "stock": ["A", "A", "A", "A", "A"],
    "quote": [100, 101, 102, 103, 104]
})

trades = pl.DataFrame({
    "timestamp": [1.5, 3.5, 4.2],
    "stock": ["A", "A", "A"],
    "trade": [50, 75, 100]
})

result = trades.join_asof(
    quotes,
    on="timestamp",
    by="stock",
    strategy="backward"  # or "forward", "nearest"
)

Concatenation

Concatenation stacks DataFrames together.

Vertical Concatenation (Stack Rows)

df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})

# Stack rows
result = pl.concat([df1, df2], how="vertical")
# Result: 4 rows, same columns

Handling mismatched schemas:

df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "c": [7, 8]})

# Diagonal concat - fills missing columns with nulls
result = pl.concat([df1, df2], how="diagonal")
# Result: columns a, b, c (with nulls where not present)

Horizontal Concatenation (Stack Columns)

df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"b": [4, 5, 6]})

# Stack columns
result = pl.concat([df1, df2], how="horizontal")
# Result: 3 rows, columns a and b

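Note: Horizontal concat expects the frames to have the same number of rows. Depending on the Polars version, a height mismatch either raises an error or pads the shorter frame with nulls, so check the behavior of the release you run.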

Concatenation Options

# Rechunk after concatenation (better performance for subsequent operations)
result = pl.concat([df1, df2], rechunk=True)

# Parallel execution (only relevant when concatenating LazyFrames; on by default)
result = pl.concat([df1, df2], parallel=True)

Use Cases

Combining data from multiple sources:

# Read multiple files and concatenate
files = ["data_2023.csv", "data_2024.csv", "data_2025.csv"]
dfs = [pl.read_csv(f) for f in files]
combined = pl.concat(dfs, how="vertical")

Adding computed columns:

base = pl.DataFrame({"value": [1, 2, 3]})
computed = pl.DataFrame({"doubled": [2, 4, 6]})
result = pl.concat([base, computed], how="horizontal")

Pivoting (Wide Format)

Convert unique values from one column into multiple columns.

Basic Pivot

df = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

# Pivot: products become columns
pivoted = df.pivot(
    on="product",
    values="sales",
    index="date"
)
# Result:
# date     | A   | B
# 2023-01  | 100 | 150
# 2023-02  | 120 | 160

Pivot with Aggregation

When there are duplicate combinations, aggregate:

df = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-01"],
    "product": ["A", "A", "B"],
    "sales": [100, 110, 150]
})

# Aggregate duplicates
pivoted = df.pivot(
    on="product",
    values="sales",
    index="date",
    aggregate_function="sum"  # or "mean", "max", "min", etc.
)

Multiple Index Columns

df = pl.DataFrame({
    "region": ["North", "North", "South", "South"],
    "date": ["2023-01", "2023-01", "2023-01", "2023-01"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

pivoted = df.pivot(
    on="product",
    values="sales",
    index=["region", "date"]
)

Unpivoting/Melting (Long Format)

Convert multiple columns into rows (opposite of pivot).

Basic Unpivot

df = pl.DataFrame({
    "date": ["2023-01", "2023-02"],
    "product_A": [100, 120],
    "product_B": [150, 160]
})

# Unpivot: convert columns to rows
unpivoted = df.unpivot(
    index="date",
    on=["product_A", "product_B"]
)
# Result (rows are stacked one source column at a time):
# date     | variable   | value
# 2023-01  | product_A  | 100
# 2023-02  | product_A  | 120
# 2023-01  | product_B  | 150
# 2023-02  | product_B  | 160

Custom Column Names

unpivoted = df.unpivot(
    index="date",
    on=["product_A", "product_B"],
    variable_name="product",
    value_name="sales"
)

Unpivot by Pattern

# Unpivot all columns matching pattern
df = pl.DataFrame({
    "id": [1, 2],
    "sales_Q1": [100, 200],
    "sales_Q2": [150, 250],
    "sales_Q3": [120, 220],
    "revenue_Q1": [1000, 2000]
})

# Unpivot all sales columns using a selector
# (requires: import polars.selectors as cs)
unpivoted = df.unpivot(
    index="id",
    on=cs.matches("^sales_.*$")
)

Exploding (Unnesting Lists)

Convert list columns into multiple rows.

Basic Explode

df = pl.DataFrame({
    "id": [1, 2],
    "values": [[1, 2, 3], [4, 5]]
})

# Explode list into rows
exploded = df.explode("values")
# Result:
# id | values
# 1  | 1
# 1  | 2
# 1  | 3
# 2  | 4
# 2  | 5

Multiple Column Explode

df = pl.DataFrame({
    "id": [1, 2],
    "letters": [["a", "b"], ["c", "d"]],
    "numbers": [[1, 2], [3, 4]]
})

# Explode multiple columns (must be same length)
exploded = df.explode("letters", "numbers")

Transposing

Swap rows and columns:

df = pl.DataFrame({
    "metric": ["sales", "costs", "profit"],
    "Q1": [100, 60, 40],
    "Q2": [150, 80, 70]
})

# Transpose
transposed = df.transpose(
    include_header=True,
    header_name="quarter",
    column_names="metric"
)
# Result: quarters as rows, metrics as columns

Reshaping Patterns

Pattern 1: Wide to Long to Wide

# Start wide
wide = pl.DataFrame({
    "id": [1, 2],
    "A": [10, 20],
    "B": [30, 40]
})

# To long
long = wide.unpivot(index="id", on=["A", "B"])

# Back to wide (maybe with transformations)
wide_again = long.pivot(on="variable", values="value", index="id")

Pattern 2: Nested to Flat

# Nested data
df = pl.DataFrame({
    "user": [1, 2],
    "purchases": [
        [{"item": "A", "qty": 2}, {"item": "B", "qty": 1}],
        [{"item": "C", "qty": 3}]
    ]
})

# Explode and unnest
flat = (
    df.explode("purchases")
    .unnest("purchases")
)

Pattern 3: Aggregation to Pivot

# Raw data
sales = pl.DataFrame({
    "date": ["2023-01", "2023-01", "2023-02"],
    "product": ["A", "B", "A"],
    "sales": [100, 150, 120]
})

# Aggregate then pivot
result = (
    sales
    .group_by("date", "product")
    .agg(pl.col("sales").sum())
    .pivot(on="product", values="sales", index="date")
)

Advanced Transformations

Conditional Reshaping

# Pivot only certain values
df.filter(pl.col("year") >= 2020).pivot(...)

# Unpivot only the columns a selector matches
# (requires: import polars.selectors as cs)
df.unpivot(index="id", on=cs.matches("^sales.*$"))

Multi-level Transformations

# Complex reshaping pipeline
result = (
    df
    .unpivot(index="id", on=cs.matches("^Q[0-9]_.*$"))  # cs is polars.selectors
    .with_columns(
        quarter=pl.col("variable").str.extract(r"Q([0-9])", 1),
        metric=pl.col("variable").str.extract(r"Q[0-9]_(.*)", 1)
    )
    .drop("variable")
    .pivot(on="metric", values="value", index=["id", "quarter"])
)

Performance Considerations

Join Performance

# 1. Join on sorted key columns when possible (Polars has no index)
df1_sorted = df1.sort("id")
df2_sorted = df2.sort("id")
result = df1_sorted.join(df2_sorted, on="id")

# 2. Use appropriate join type
# semi/anti are faster than inner+filter
matches = df1.join(df2, on="id", how="semi")  # Better than filtering after inner join

# 3. Filter before joining
df1_filtered = df1.filter(pl.col("active"))
result = df1_filtered.join(df2, on="id")  # Smaller join

Concatenation Performance

# 1. Rechunk after concatenation
result = pl.concat(dfs, rechunk=True)

# 2. Use lazy mode for large concatenations
lf1 = pl.scan_parquet("file1.parquet")
lf2 = pl.scan_parquet("file2.parquet")
result = pl.concat([lf1, lf2]).collect()

Pivot Performance

# 1. Filter before pivoting
pivoted = df.filter(pl.col("year") == 2023).pivot(...)

# 2. Specify aggregate function explicitly
pivoted = df.pivot(..., aggregate_function="first")  # Faster than "sum" if only one value

Common Use Cases

Time Series Alignment

# Align two time series with different timestamps
ts1.join_asof(ts2, on="timestamp", strategy="backward")

Feature Engineering

# Create lag features
df.with_columns(
    pl.col("value").shift(1).over("user_id").alias("prev_value"),
    pl.col("value").shift(2).over("user_id").alias("prev_prev_value")
)

Data Denormalization

# Combine normalized tables
orders.join(customers, on="customer_id").join(products, on="product_id")

Report Generation

# Pivot for reporting
sales.pivot(on="product", values="amount", index="month")

FAQ

Why does the Polars skill prefer lazy mode?

The Polars skill prefers lazy mode because scan_csv with collect enables predicate pushdown, projection pushdown, query optimization, and parallel execution planning. Eager read_csv loads entire files before filtering.

What Polars anti-pattern does the skill correct first?

The Polars skill corrects eager read_csv followed by late filter and select. It teaches pushing filter and column selection to the earliest pipeline stage so Polars optimizes before execution.

Is Polars safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
