pandas limits
pandas (created by Wes McKinney in 2008) is the dominant Python DataFrame library, but it shows its limits at modern data scales:
- Single-threaded by default: most operations do not use multi-core CPUs
- Eager evaluation: every operation runs immediately, so a chain of operations is never optimised as a whole
- Memory: the whole DataFrame is loaded into RAM, which causes problems when the data is larger than memory
Polars, started in 2020 by the Dutch engineer Ritchie Vink, responds to these limits with a from-scratch implementation in Rust:
- Rust core with underlying Arrow format
- Lazy evaluation: the operation chain is optimised before execution
- Multi-threaded by default
- Streaming: processes datasets larger than RAM
- Query optimiser: predicate pushdown, projection pushdown
MIT licence. Versions 0.14-0.15 (autumn 2022) consolidated production maturity; version 1.0 was released in July 2024.
API
Polars has two modes:
Eager (familiar from pandas):
import polars as pl
df = pl.read_csv("data.csv")
result = df.filter(pl.col("age") > 30).group_by("country").agg(pl.col("salary").mean())
Lazy (optimised):
result = (
    pl.scan_csv("data.csv")
    .filter(pl.col("age") > 30)
    .group_by("country")
    .agg(pl.col("salary").mean())
    .collect()
)
In lazy mode, Polars builds a query plan, optimises it, and executes it only when .collect() is called, so it can avoid reading rows and columns the result does not need.
Performance
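Public benchmarks (TPC-H, db-benchmark) often report Polars running 10-100x faster than pandas on medium to large datasets, with the gap depending on the query and the hardware. On a single node it is competitive with Spark; for distributed workloads, Dask or Ray are the usual choices.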
Interoperability
Polars integrates with:
- pandas: .to_pandas() and pl.from_pandas()
- NumPy
- Arrow (shared data format with Spark, DuckDB, others)
- Parquet, CSV, JSON, Avro
- PyArrow, pyarrow-flight
In the Italian context
Italian data teams have adopted Polars rapidly since 2023 for scenarios where pandas is too slow but Spark is overkill.
