Data lakes with ACID
Data lakes built on Parquet/ORC files in S3/HDFS have limits: no multi-file ACID transactions, no controlled schema evolution, no time travel, and slow query planning when file listings become huge. Table formats add a metadata layer above the files to solve these issues.
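To make the idea of a metadata layer concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's actual API or file layout: the names ToyTable, Snapshot, commit and snapshot_as_of are hypothetical, and a real table format tracks files through manifest files and a catalog rather than an in-memory list. It only illustrates how a single pointer swap can publish a multi-file change atomically, and how keeping old snapshots makes time travel possible.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: Tuple[str, ...]  # every data file visible in this snapshot


class ToyTable:
    """Toy metadata layer (hypothetical; a simplification, not Iceberg's real API)."""

    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []
        self.current: Optional[Snapshot] = None

    def commit(self, base: Optional[Snapshot], added: Sequence[str],
               removed: Sequence[str] = (), *, timestamp_ms: int) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer got in first.
        if base is not self.current:
            raise RuntimeError("conflict: table changed since base snapshot")
        kept = tuple(f for f in (base.data_files if base else ()) if f not in removed)
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, kept + tuple(added))
        self.snapshots.append(snap)
        self.current = snap  # one pointer swap publishes all file changes together
        return snap

    def snapshot_as_of(self, timestamp_ms: int) -> Snapshot:
        # Time travel: the latest snapshot committed at or before the timestamp.
        older = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not older:
            raise ValueError("no snapshot at or before that time")
        return max(older, key=lambda s: s.timestamp_ms)


table = ToyTable()
s1 = table.commit(None, ["a.parquet", "b.parquet"], timestamp_ms=1_000)
s2 = table.commit(s1, ["c.parquet"], removed=["a.parquet"], timestamp_ms=2_000)

print(table.current.data_files)                # ('b.parquet', 'c.parquet')
print(table.snapshot_as_of(1_500).data_files)  # ('a.parquet', 'b.parquet')
```

A reader that still holds s1 keeps seeing a.parquet and b.parquet, while a writer that tries to commit on top of the stale s1 gets a conflict and has to retry against the current snapshot.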
Apache Iceberg was created by Ryan Blue and Daniel Weeks at Netflix starting in 2017, donated to the Apache Software Foundation in 2018, and graduated to a top-level project (TLP) in May 2020. It is released under the Apache 2.0 licence.
Features
- Snapshot isolation: each commit produces a snapshot; readers see a consistent point in time
- Time travel: queries against historical snapshots (the exact syntax is engine-specific; in Spark SQL, for example, SELECT ... TIMESTAMP AS OF '2024-01-15')
- Schema evolution: add, rename or drop columns with backward compatibility
- Partition evolution: the partitioning strategy can change without rewriting existing data
- Hidden partitioning: Iceberg derives partition values from column values, so queries filter on the source columns rather than on explicit partition columns
- Row-level operations: efficient UPDATE, DELETE, MERGE
- ACID transactions: atomic commits across multiple files
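Two of the features above, hidden partitioning and schema evolution, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Iceberg code: day_transform, files_by_day, plan_scan, read_row and the file names are all hypothetical. It relies on two documented Iceberg ideas: partition values are derived from a source column by a transform such as day, and columns are tracked by numeric field ID rather than by name.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

# --- Hidden partitioning (toy model) ---------------------------------------

def day_transform(ts: datetime) -> str:
    """Derive the partition value from the source timestamp column."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")

# Hypothetical metadata index: partition value -> data files.
files_by_day: Dict[str, List[str]] = {
    "2024-01-14": ["f1.parquet"],
    "2024-01-15": ["f2.parquet", "f3.parquet"],
    "2024-01-16": ["f4.parquet"],
}

def plan_scan(start: datetime, end: datetime) -> List[str]:
    """Prune files for a filter on the timestamp column itself.

    The caller never names a partition column; the planner applies the
    same transform to the filter bounds. ISO dates compare correctly as strings.
    """
    lo, hi = day_transform(start), day_transform(end)
    return [f for day, fs in sorted(files_by_day.items()) if lo <= day <= hi for f in fs]

# --- Schema evolution by field ID (toy model) ------------------------------

schema_v1 = {1: "id", 2: "ts", 3: "amount"}
# Rename field 2 and add field 4; existing data files are not rewritten.
schema_v2 = {1: "id", 2: "event_time", 3: "amount", 4: "currency"}

# A row stored under schema_v1, keyed by field ID rather than column name.
old_row = {1: 42, 2: "2024-01-15T08:00:00Z", 3: 9.5}

def read_row(stored: Dict[int, object], schema: Dict[int, str]) -> Dict[str, Optional[object]]:
    """Project stored values onto the current schema; missing fields read as None."""
    return {name: stored.get(field_id) for field_id, name in schema.items()}

utc = timezone.utc
selected = plan_scan(datetime(2024, 1, 15, 8, tzinfo=utc), datetime(2024, 1, 15, 20, tzinfo=utc))
projected = read_row(old_row, schema_v2)
print(selected)   # ['f2.parquet', 'f3.parquet']
print(projected)  # {'id': 42, 'event_time': '2024-01-15T08:00:00Z', 'amount': 9.5, 'currency': None}
```

Because the lookup goes through field IDs, the renamed column still resolves to the old data and the newly added column simply reads as null for files written before the change.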
Compute engines
Iceberg is decoupled from the compute engine. It is natively supported by:
- Apache Spark
- Trino / Presto
- Apache Flink
- Snowflake (native in 2024)
- AWS Athena, Google BigQuery (external tables)
- DuckDB, Dremio
The three table formats
As of 2024 there are three main lakehouse table formats:
- Delta Lake (Databricks): open source under the Linux Foundation, most closely tied to Spark
- Apache Iceberg: vendor-neutral, multi-engine support
- Apache Hudi: originated at Uber, focused on streaming upserts
Iceberg is emerging as the de facto standard in 2024, with support from all major cloud vendors.
Databricks-Snowflake war
In June 2024 Databricks acquired Tabular (Ryan Blue’s company), strengthening its position in the Iceberg ecosystem. Snowflake announced its own native support for Iceberg. The result is a strategic war for control of the open data layer.
In the Italian context
In Italy, adoption is concentrated in organisations with mature data lakes: banks, telcos, large retailers and research institutes.
References: Apache Iceberg. Ryan Blue, Daniel Weeks, Netflix (2017). Apache TLP (May 2020). Apache 2.0 licence. Alternatives: Delta Lake (Databricks), Apache Hudi (Uber). Tabular acquired by Databricks (June 2024).
