Apache Iceberg: open table format for data lakehouse

Apache Iceberg (2020-2021): a table format providing schema evolution, time travel and ACID transactions over Parquet files in object storage. It is one of the three lakehouse table formats, alongside Delta Lake and Apache Hudi, and graduated to an ASF top-level project in 2020.


Data lakes with ACID

Data lakes built on Parquet/ORC files in S3 or HDFS have limits: no multi-file ACID transactions, no controlled schema evolution, no time travel, and poor query performance when file listings grow huge. Table formats add a metadata layer above the files to solve these issues.
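The idea of a metadata layer above the files can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Iceberg's actual metadata format or API: the names `Snapshot`, `ToyTable`, `commit_append` and `as_of` are hypothetical. Each commit records the complete list of visible files in an immutable snapshot, and the table only moves a single "current" pointer, so several new files become visible together or not at all.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the full set of files visible at one commit."""
    snapshot_id: int
    timestamp_ms: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata (not the real Iceberg layout)."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None  # the single pointer a commit swaps

    def snapshot(self, snapshot_id: int) -> Snapshot:
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

    def current(self) -> Optional[Snapshot]:
        if self.current_id is None:
            return None
        return self.snapshot(self.current_id)

    def commit_append(self, new_files: List[str], timestamp_ms: int) -> Snapshot:
        """Add several files in one commit; they all become visible at once."""
        base = self.current()
        files = (base.data_files if base else ()) + tuple(new_files)
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, files)
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id  # the pointer swap is the commit
        return snap

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Time travel: latest snapshot committed at or before the given time."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return max(candidates, key=lambda s: s.timestamp_ms) if candidates else None
```

A reader that holds on to the result of `current()` keeps seeing the same file list even after later commits, which is the intuition behind snapshot isolation. A real table format also has to make the pointer swap atomic across concurrent writers, which this sketch does not attempt.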

Apache Iceberg was created by Ryan Blue and Daniel Weeks at Netflix starting in 2017, donated to the Apache Software Foundation in 2018, and graduated to a top-level project (TLP) in May 2020. It is released under the Apache 2.0 licence.

Features

  • Snapshot isolation — each commit produces a snapshot; readers see a consistent point in time
  • Time travel — queries on historical snapshots (syntax varies by engine; in Spark SQL: SELECT ... TIMESTAMP AS OF '2024-01-15')
  • Schema evolution — add/rename/drop column with backward compatibility
  • Partition evolution — the partitioning strategy can change without rewriting existing data
  • Hidden partitioning — Iceberg derives partition values from column values (for example, the day of a timestamp), so queries need no explicit partition column
  • Row-level operations — efficient UPDATE, DELETE, MERGE
  • ACID transactions — atomic commits across multiple files
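Hidden partitioning from the list above can be illustrated with a small, self-contained Python sketch. It is a toy model, not Iceberg's API: `day_transform`, `write_rows`, `scan` and the `event_ts` column are hypothetical names chosen for illustration. The partition value is derived from a timestamp column (days since 1970-01-01, in the spirit of a day transform), and a query that filters only on the timestamp can still skip partitions that cannot match.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Derive the partition value: whole days since the Unix epoch."""
    return (ts - EPOCH).days


def write_rows(rows: List[dict]) -> Dict[int, List[dict]]:
    """Group rows by the derived partition value; writers never supply it."""
    partitions: Dict[int, List[dict]] = {}
    for row in rows:
        partitions.setdefault(day_transform(row["event_ts"]), []).append(row)
    return partitions


def scan(
    partitions: Dict[int, List[dict]], start: datetime, end: datetime
) -> Tuple[List[dict], List[int]]:
    """Return rows with start <= event_ts < end and the partitions actually read.

    The filter is expressed on event_ts only; the partition range is
    derived from it, so non-matching days are never opened.
    """
    lo = day_transform(start)
    hi = day_transform(end - timedelta(microseconds=1))  # end is exclusive
    visited = [day for day in partitions if lo <= day <= hi]
    rows = [
        row
        for day in visited
        for row in partitions[day]
        if start <= row["event_ts"] < end
    ]
    return rows, visited
```

For example, writing three events (two on 15 January 2024, one on 16 January) and scanning the half-open range covering 15 January reads a single partition and returns two rows, even though the query never mentions a partition column.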

Compute engines

Iceberg is decoupled from the compute engine. It is natively supported by:

  • Apache Spark
  • Trino / Presto
  • Apache Flink
  • Snowflake (native in 2024)
  • AWS Athena, Google BigQuery (external tables)
  • DuckDB, Dremio

The three table formats

As of 2024, three main lakehouse table formats exist:

  • Delta Lake (Databricks) — open source under the Linux Foundation, more closely tied to Spark
  • Apache Iceberg — neutral, multi-engine support
  • Apache Hudi — Uber-originated, streaming upsert focus

In 2024 Iceberg is emerging as the de facto standard, with support from all major cloud vendors.

Databricks-Snowflake war

In June 2024 Databricks acquired Tabular, the company co-founded by Ryan Blue, strengthening its influence over Iceberg. Snowflake announced its own native support for Iceberg. The result is a strategic contest for control of the open data layer.

In the Italian context

In Italy, adoption is concentrated in organisations with mature data lakes: banks, telcos, large retailers and research institutes.


References: Apache Iceberg. Ryan Blue, Daniel Weeks, Netflix (2017). Apache TLP (May 2020). Apache 2.0 licence. Alternatives: Delta Lake (Databricks), Apache Hudi (Uber). Tabular acquired by Databricks (June 2024).
