
Polars
Write fast, parallel Polars pipelines and avoid eager-mode and Python-UDF pitfalls when building data features or ETL in Python.
Overview
Polars is an agent skill for the Build phase that teaches efficient Polars patterns—lazy scans, early filter/select, and expression-native transforms—for solo builders writing Python data pipelines.
Install
npx skills add https://github.com/davila7/claude-code-templates --skill polars
What is this skill?
- Lazy scan_csv/scan_parquet pipelines with collect() so predicate and projection pushdown run at the source
- Filter and column-select as early as possible before group_by, agg, and joins to shrink work
- Stay on the Polars expression API instead of Python row UDFs to keep parallel execution
- Contrasts eager read_csv chains with optimized lazy query plans for large files
- Covers join/select hygiene so pipelines do not drag unused columns through heavy steps
- Three documented optimization themes: lazy evaluation, early filter/select, and avoiding Python functions
- Lazy mode benefits include predicate pushdown, projection pushdown, query optimization, and parallel execution planning
Adoption & trust: 512 installs on skills.sh; 27.8k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent keeps writing eager Polars or pandas-style pipelines that load huge files, run Python row functions, and stall on joins you could have filtered away earlier.
Who is it for?
Solo builders implementing CSV/Parquet ETL, analytics endpoints, or batch jobs in Python who want agent-generated Polars to match production performance habits.
Skip if: you only need one-off exploratory notebooks on tiny samples and have no plan to optimize or deploy Polars jobs.
When should I use this skill?
When implementing, refactoring, or reviewing Polars code on medium-to-large datasets where performance, memory, and parallel expression usage matter.
What do I get? / Deliverables
You get lazy, pushdown-friendly Polars pipelines that read less data, parallelize on the expression API, and are ready to benchmark or ship as backend or ETL code.
- Lazy Polars pipeline code using scan_* plus filter/select before collect
- Refactored transforms that replace Python UDFs with native expressions where possible
Journey fit
Polars guidance applies while implementing data processing and analytics code in the product backend, before you optimize or ship workloads. Backend is where dataframe transforms, joins, and aggregations live for APIs, jobs, and internal analytics—not landing-page or growth tooling.
How it compares
Use this as Polars-specific performance guardrails for your agent, not as a generic pandas cheat sheet or a hosted query engine integration.
Common Questions / FAQ
Who is polars for?
Polars is for solo and indie builders (and small teams) who ship Python data code—ETL scripts, APIs, or agent tools—and want Claude Code, Cursor, or Codex to follow Polars lazy and expression best practices.
When should I use polars?
Use it during Build when you are writing or reviewing Polars transforms on large files, refactoring eager pipelines, or designing joins and aggregations; it also helps in Ship perf work when you are tightening a slow data job before release.
Is polars safe to install?
Treat it like any third-party agent skill: review the skill source and the Security Audits panel on this Prism page before enabling it in repos that touch production data or secrets.
SKILL.md
# Polars Best Practices and Performance Guide

Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.

## Performance Optimization

### 1. Use Lazy Evaluation

**Always prefer lazy mode for large datasets:**

```python
import polars as pl

# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

**Benefits of lazy evaluation:**

- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning

### 2. Filter and Select Early

Push filters and column selection as early as possible in the pipeline:

```python
# Bad: Process all data, then filter and select
result = (
    lf.group_by("category")
    .agg(pl.col("value").mean())
    .join(other, on="category")
    .filter(pl.col("value") > 100)
    .select("category", "value")
)

# Good: Filter and select early
# Note: filtering rows before the aggregation is not equivalent to filtering
# the aggregated mean afterwards; move a filter early only when it is meant
# to apply to the input rows.
result = (
    lf.select("category", "value")  # Only needed columns
    .filter(pl.col("value") > 100)  # Filter early
    .group_by("category")
    .agg(pl.col("value").mean())
    .join(other.select("category", "other_col"), on="category")
)
```

### 3. Avoid Python Functions

Stay within the expression API to maintain parallelization:

```python
# Bad: Python function disables parallelization
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)

# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```

**When you must use custom functions:**

```python
# If truly needed, be explicit
df = df.with_columns(
    result=pl.col("value").map_elements(
        custom_function,
        return_dtype=pl.Float64,
        skip_nulls=True,  # Do not call the function on null values
    )
)
```

### 4. Use Streaming for Very Large Data

Enable streaming for datasets larger than RAM:

```python
# Streaming mode processes data in chunks
# (newer Polars releases spell this collect(engine="streaming"))
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(streaming=True)

# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```

### 5. Optimize Data Types

Choose appropriate data types to reduce memory and improve performance:

```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")

# Good: Specify optimal types
# (the parameter is named `dtypes` in older Polars releases)
df = pl.read_csv(
    "data.csv",
    schema_overrides={
        "id": pl.UInt32,             # Instead of Int64 if values fit
        "category": pl.Categorical,  # For low-cardinality strings
        "date": pl.Date,             # Instead of String
        "small_int": pl.Int16,       # Instead of Int64
    },
)
```

**Type optimization guidelines:**

- Use the smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags

### 6. Parallel Operations

Structure code to maximize parallelization:

```python
# Bad: Sequential pipe operations disable parallelization
df = (
    df.pipe(operation1)
    .pipe(operation2)
    .pipe(operation3)
)

# Good: Combined operations enable parallelization
df = df.with_columns(
    result1=operation1_expr(),
    result2=operation2_expr(),
    result3=operation3_expr(),
)
```

### 7. Rechunk After Concatenation

```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])

# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```

## Expression Patterns

### Conditional Logic

**Simple conditions:**

```python
# Wrap string values in pl.lit(); bare strings in then()/otherwise()
# are parsed as column names.
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)
```

**Multiple conditions:**

```python
df.with_columns(
    grade=pl.when(pl.col("score") >= 90)
    .