An embedded OLAP
SQLite is the most widely deployed in-process relational database in the world: it ships in every major browser, every smartphone, and countless apps. But it is an OLTP engine, row-oriented and designed for transactions, so heavy analytical queries run inefficiently on it.
DuckDB is its OLAP counterpart: an in-process columnar database created by Mark Raasveldt and Hannes Mühleisen at CWI Amsterdam (the same institute where Python was born) and first released in 2019. Version 0.6, released in November 2022, consolidated its production maturity, and 1.0 followed in June 2024. It is distributed under the MIT licence.
In-process analytics
Like SQLite, it needs no server and no complex installation. A compiled C++ library runs inside the application process:
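import duckdb

# Query Parquet files directly. s3:// paths rely on the httpfs extension
# and on S3 credentials being configured; the bucket name is a placeholder.
result = duckdb.sql("""
    SELECT country, SUM(sales) AS total
    FROM 's3://bucket/data/*.parquet'
    WHERE year = 2023
    GROUP BY country
    ORDER BY total DESC
""").df()  # .df() returns a pandas DataFrame, so pandas must be installed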
DuckDB reads natively:
- Parquet (columnar): streaming reads with predicate pushdown
- CSV: optimised parser
- JSON and Arrow, plus SQLite and Postgres databases (via extensions)
- HTTP(S) remote URLs
- S3, Azure Blob, GCS
Performance
DuckDB is competitive with Spark, Polars, and ClickHouse for single-node queries on datasets of up to hundreds of GB. Its advantage is that there is no distributed deployment overhead. On a laptop with a modern SSD, DuckDB answers analytical queries over gigabytes of Parquet in seconds.
Usage
- Local analytics by data scientists without infrastructure
- Interactive notebooks (Jupyter, Observable)
- Embedded analytics in applications (AI model training, ad-hoc exploration)
- ETL pipeline testing
- Data engineering: a lightweight alternative to running Spark locally
Ecosystem
- Official bindings for Python, R, Node.js, Java, and Go
- PostgreSQL-compatible SQL dialect plus analytical extensions
- Extensions: httpfs, postgres_scanner, parquet, json, spatial
In the Italian context
Adoption among Italian data analytics and ML teams has grown rapidly since 2023.
References: DuckDB (2019+), Mark Raasveldt, Hannes Mühleisen, CWI Amsterdam. MIT licence. 1.0 (June 2024). In-process columnar database. Parquet/CSV/JSON native reading. PostgreSQL-compatible SQL.
