
Polars
Teach your coding agent Polars patterns—lazy scans, early filter/select, and expression API—so data pipelines stay fast and parallel-friendly.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill polars
What is this skill?
- Lazy evaluation playbook: `scan_csv` + `collect` with predicate and projection pushdown
- Pipeline hygiene: filter and select early before group_by and joins
- Anti-pattern coverage: avoid Python UDFs that break Polars parallelization
- Expression-API-first patterns for maintainable scientific and analytics code
- Performance-focused reference suitable for large CSV/Parquet workloads
Adoption & trust: 553 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Common Questions / FAQ
Is Polars safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Polars Best Practices and Performance Guide

Comprehensive guide to writing efficient Polars code and avoiding common pitfalls.

## Performance Optimization

### 1. Use Lazy Evaluation

**Always prefer lazy mode for large datasets:**

```python
import polars as pl

# Bad: Eager mode loads everything immediately
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

# Good: Lazy mode optimizes before execution
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
```

**Benefits of lazy evaluation:**

- Predicate pushdown (filter at source)
- Projection pushdown (read only needed columns)
- Query optimization
- Parallel execution planning

### 2. Filter and Select Early

Push filters and column selection as early as possible in the pipeline:

```python
# `lf` and `other` are LazyFrames that share a "category" column

# Bad: Process all data, then filter and select
result = (
    lf.group_by("category")
    .agg(pl.col("value").mean())
    .join(other, on="category")
    .filter(pl.col("value") > 100)
    .select("category", "value")
)

# Good: Filter and select early
result = (
    lf.select("category", "value")  # Only needed columns
    .filter(pl.col("value") > 100)  # Filter early
    .group_by("category")
    .agg(pl.col("value").mean())
    .join(other.select("category", "other_col"), on="category")
)
```

Note that the two pipelines are not equivalent: the first filters on the aggregated mean, while the second filters individual rows before aggregating. Move a filter ahead of `group_by` only when the condition is meant to apply to row-level values.

### 3. Avoid Python Functions

Stay within the expression API to maintain parallelization:

```python
# Bad: Python function is called once per element and cannot be parallelized
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)

# Good: Use native expressions (parallelized)
df = df.with_columns(result=pl.col("value") * 2)
```

**When you must use custom functions:**

```python
# If truly needed, be explicit
# (`custom_function` is a placeholder for your own Python callable)
df = df.with_columns(
    result=pl.col("value").map_elements(
        custom_function,
        return_dtype=pl.Float64,
        skip_nulls=True  # Do not call the function on null values
    )
)
```

### 4. Use Streaming for Very Large Data

Enable streaming for datasets larger than RAM:

```python
# Streaming mode processes data in chunks
lf = pl.scan_parquet("very_large.parquet")
result = lf.filter(pl.col("value") > 100).collect(engine="streaming")

# Or use sink for direct streaming writes
lf.filter(pl.col("value") > 100).sink_parquet("output.parquet")
```

### 5. Optimize Data Types

Choose appropriate data types to reduce memory and improve performance:

```python
# Bad: Default types may be wasteful
df = pl.read_csv("data.csv")

# Good: Specify optimal types
df = pl.read_csv(
    "data.csv",
    schema_overrides={
        "id": pl.UInt32,             # Instead of Int64 if values fit
        "category": pl.Categorical,  # For low-cardinality strings
        "date": pl.Date,             # Instead of String
        "small_int": pl.Int16,       # Instead of Int64
    }
)
```

**Type optimization guidelines:**

- Use the smallest integer type that fits your data
- Use `Categorical` for strings with low cardinality (<50% unique)
- Use `Date` instead of `Datetime` when time isn't needed
- Use `Boolean` instead of integers for binary flags

### 6. Parallel Operations

Structure code to maximize parallelization:

```python
# Bad: Chained pipe steps run one after another
df = (
    df.pipe(operation1)
    .pipe(operation2)
    .pipe(operation3)
)

# Good: Independent expressions in one context can run in parallel
# (operation1_expr etc. are placeholders for functions returning pl.Expr)
df = df.with_columns(
    result1=operation1_expr(),
    result2=operation2_expr(),
    result3=operation3_expr()
)
```

### 7. Rechunk After Concatenation

```python
# Concatenation can fragment data
combined = pl.concat([df1, df2, df3])

# Rechunk for better performance in subsequent operations
combined = pl.concat([df1, df2, df3], rechunk=True)
```

## Expression Patterns

### Conditional Logic

**Simple conditions:**

```python
df.with_columns(
    status=pl.when(pl.col("age") >= 18)
    .then(pl.lit("adult"))
    .otherwise(pl.lit("minor"))
)
```

**Multiple conditions:**

```python
df.with_columns(
    grade=pl.when(pl