DuckDB: SQLite for analytics, in-process OLAP

DuckDB (2019, Mark Raasveldt and Hannes Mühleisen, CWI Amsterdam): an in-process columnar analytical database with zero dependencies, fast on Parquet, CSV, and JSON. The OLAP equivalent of SQLite.


An embedded OLAP engine

SQLite is the most widespread in-process relational database in the world: it ships in browsers, smartphones, and countless applications. But it is an OLTP engine, row-oriented and designed for transactions. Heavy analytical queries, which typically scan a few columns across many rows, run inefficiently on it.
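To make the row-oriented versus columnar distinction concrete, here is a minimal sketch in plain Python. The records and field names are hypothetical, and the two layouts are a simplified model for illustration, not how SQLite or DuckDB store data internally.

```python
# Hypothetical sales records, used only for illustration.
rows = [
    {"country": "IT", "year": 2023, "sales": 120.0},
    {"country": "DE", "year": 2023, "sales": 200.0},
    {"country": "IT", "year": 2022, "sales": 80.0},
]

# Row-oriented layout: each record is stored together, so an aggregate
# over one field still has to visit every full record.
total_from_rows = sum(record["sales"] for record in rows)

# Column-oriented layout: each field lives in its own contiguous array,
# so the same aggregate only reads the "sales" column.
columns = {
    "country": [record["country"] for record in rows],
    "year": [record["year"] for record in rows],
    "sales": [record["sales"] for record in rows],
}
total_from_columns = sum(columns["sales"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same answer. The difference is how much data has to be read to get it, which is what matters once a table has many columns and millions of rows.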

DuckDB is its OLAP equivalent: an in-process columnar database, created in 2019 by Mark Raasveldt and Hannes Mühleisen at CWI Amsterdam (the same institute where Python was born). Version 0.6, released in November 2022, consolidated its production maturity. Version 1.0 followed in June 2024. MIT licence.

In-process analytics

Like SQLite: no server, no complex installation. A compiled library (C++) integrates into the application process:

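import duckdb

# Direct query on Parquet files, with no import step.
# Reading from S3 assumes the httpfs extension is available
# and that S3 credentials are already configured.
result = duckdb.sql("""
    SELECT country, SUM(sales) AS total
    FROM 's3://bucket/data/*.parquet'
    WHERE year = 2023
    GROUP BY country
    ORDER BY total DESC
""").df()  # .df() returns a pandas DataFrame (requires pandas)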

DuckDB reads natively:

  • Parquet (columnar) — streaming and predicate pushdown
  • CSV — optimised parser
  • JSON, Arrow, SQLite, Postgres (via extension)
  • HTTP(S) remote URLs
  • S3, Azure Blob, GCS

Performance

DuckDB is competitive with Spark, Polars, and ClickHouse for single-node queries on datasets up to hundreds of GB. Its main advantage is that there is no distributed deployment overhead. On a laptop with a modern SSD, DuckDB can run analytical queries over gigabytes of Parquet in seconds.

Usage

  • Local analytics by data scientists without infrastructure
  • Interactive notebooks (Jupyter, Observable)
  • Embedded analytics in applications (AI model training, ad-hoc exploration)
  • ETL pipeline testing
  • Data engineering — lightweight alternative to local Spark

Ecosystem

  • Python, R, Node.js, Java, Go — official bindings
  • PostgreSQL-compatible SQL dialect, plus analytical extensions
  • Extensions: httpfs, postgres_scanner, parquet, json, spatial

In the Italian context

Rapid adoption in Italian data analytics and ML teams from 2023.


References: DuckDB (2019+), Mark Raasveldt, Hannes Mühleisen, CWI Amsterdam. MIT licence. 1.0 (June 2024). In-process columnar database. Parquet/CSV/JSON native reading. PostgreSQL-compatible SQL.
