Data lakes with ACID
Data lakes built on Parquet/ORC files in S3/HDFS have limits: no multi-file ACID transactions, no controlled schema evolution, no time travel, and poor query performance when file listings grow very large. Table formats add a metadata layer above the files to solve these issues.
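As a rough illustration of how a metadata layer can make a multi-file write atomic, the sketch below models a table as a single pointer to an immutable file list. A commit builds a new list and swaps the pointer in one step, rejecting the write if another writer got there first. This is a toy in-memory model, not Iceberg's actual API or file layout: `TableMetadata`, `ToyCatalog`, and the file names are hypothetical, and a real catalog would need a genuinely atomic compare-and-swap.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Immutable description of the table at one version (hypothetical model)."""
    version: int
    data_files: Tuple[str, ...]


class ToyCatalog:
    """Holds one pointer to the current metadata; swapping it is the commit."""

    def __init__(self) -> None:
        self._current = TableMetadata(version=0, data_files=())

    def current(self) -> TableMetadata:
        return self._current

    def commit(self, base: TableMetadata, added: List[str]) -> TableMetadata:
        # Optimistic concurrency: reject the commit if another writer
        # moved the pointer after `base` was read.
        if base.version != self._current.version:
            raise RuntimeError("conflict: table changed since base was read")
        new = TableMetadata(base.version + 1, base.data_files + tuple(added))
        # Single pointer swap: all added files become visible together.
        self._current = new
        return new


catalog = ToyCatalog()
base = catalog.current()

# Two files are added in one commit; readers see both or neither.
catalog.commit(base, ["part-0001.parquet", "part-0002.parquet"])
print(catalog.current().data_files)

# A second writer still holding the stale `base` is rejected.
try:
    catalog.commit(base, ["part-0003.parquet"])
except RuntimeError as err:
    print("rejected:", err)
```

The point of the design is that data files are never modified in place: a failed or conflicting writer leaves the pointer untouched, so readers never observe a half-written state.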
Apache Iceberg was created by Ryan Blue and Daniel Weeks at Netflix starting in 2017, donated to the Apache Software Foundation in 2018, and graduated to a Top-Level Project (TLP) in May 2020. It is released under the Apache 2.0 licence.
Features
- Snapshot isolation — each commit produces a snapshot; readers see a consistent point in time
- Time travel — queries on historical snapshots (e.g. SELECT ... TIMESTAMP AS OF '2021-01-15' in Spark SQL; the exact syntax varies by engine)
- Schema evolution — add/rename/drop column with backward compatibility
- Partition evolution — the partition strategy can change without rewriting existing data
- Hidden partitioning — Iceberg derives partition values from column values automatically, so queries do not need to reference partition columns
- Row-level operations — efficient UPDATE, DELETE, MERGE
- ACID transactions — atomic commits across multiple files
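Snapshot isolation and time travel from the list above can be sketched with a small in-memory model. Each commit appends an immutable snapshot, and a scan picks the latest snapshot committed at or before a requested timestamp. This is a conceptual sketch only, not how Iceberg stores data: `Snapshot`, `ToyTable`, and the sample rows are hypothetical, each snapshot here holds the full row set for simplicity, and commits are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Sequence, Tuple

Row = Tuple[int, str]


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table after one commit (hypothetical model)."""
    snapshot_id: int
    committed_at: datetime
    rows: Tuple[Row, ...]


class ToyTable:
    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []

    def commit(self, new_rows: Sequence[Row], committed_at: datetime) -> Snapshot:
        # Each commit produces a new snapshot; earlier ones are never changed.
        previous = self.snapshots[-1].rows if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, committed_at,
                            previous + tuple(new_rows))
        self.snapshots.append(snapshot)
        return snapshot

    def scan(self, as_of: Optional[datetime] = None) -> Tuple[Row, ...]:
        """Rows of the latest snapshot committed at or before `as_of`."""
        visible = [s for s in self.snapshots
                   if as_of is None or s.committed_at <= as_of]
        return visible[-1].rows if visible else ()


table = ToyTable()
table.commit([(1, "alice")], datetime(2021, 1, 10))

# A reader that started here keeps its view even after a later commit.
reader_view = table.scan()
table.commit([(2, "bob")], datetime(2021, 1, 20))

print(reader_view)                                # view pinned before the second commit
print(table.scan(as_of=datetime(2021, 1, 15)))    # time travel to 15 January 2021
print(table.scan())                               # current state
```

Because snapshots are immutable, a time-travel query is just an ordinary read against an older snapshot; no separate history mechanism is needed in this model.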
Compute engines
Iceberg is decoupled from the compute engine. It is natively supported by:
- Apache Spark
- Trino / Presto
- Apache Flink
- Amazon Athena, Google BigQuery (external tables)
- DuckDB, Dremio
The three table formats
Three main lakehouse table formats are establishing themselves:
- Delta Lake (Databricks) — open source under the Linux Foundation, more closely tied to Spark
- Apache Iceberg — neutral, multi-engine support
- Apache Hudi — Uber-originated, streaming upsert focus
Iceberg’s “vendor-neutral” positioning is a key strength versus Delta Lake.
In the Italian context
In Italy, adoption is found in companies with mature data lakes: banks, telcos, large retailers, and research institutes.
References: Apache Iceberg. Ryan Blue, Daniel Weeks, Netflix (2017). Apache TLP (May 2020). Apache 2.0 licence. Alternatives: Delta Lake (Databricks), Apache Hudi (Uber).
