Modern Table Formats: Iceberg, Delta Lake, and Hudi

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled with schema changes, updates, and deletes.

To solve these issues, the community created modern table formats that brought ACID transactions, schema evolution, and time travel to Data Lakes. The three leading projects are Apache Iceberg, Delta Lake, and Apache Hudi.

Why Table Formats Matter

Without a table format, a Data Lake is just a collection of files (e.g., Parquet, ORC). Problems include:

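No atomic operations: a failed or concurrent write can leave partial files that readers pick up.
Schema changes (a renamed or retyped column, for example) break existing queries.
No standard way to handle row-level deletes or updates.
No built-in snapshots, so rolling back a bad write is hard.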

Table formats add a metadata layer on top of the files. It records which files make up each version of a table, and that record is what turns raw object storage into a Lakehouse.
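
To make the idea of a metadata layer concrete, here is a minimal in-memory sketch in Python. It is not how any of the three formats is implemented; `ToyTable`, `Snapshot`, and the file names are hypothetical. It only illustrates the shared pattern: each commit publishes a new immutable list of files in one step, and older lists stay available for time travel and rollback.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of the data files that make up one table version."""
    snapshot_id: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table format's metadata layer."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_id: int, add: list, remove: list) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer got in first.
        if base_id != self.current().snapshot_id:
            raise RuntimeError(f"conflict: table changed since snapshot {base_id}")
        kept = [f for f in self.current().files if f not in remove]
        new = Snapshot(base_id + 1, tuple(kept + add))
        # Readers only see the new file list once this single append happens.
        self.snapshots.append(new)
        return new

    def files_as_of(self, snapshot_id: int) -> tuple:
        # Time travel: read an older snapshot instead of the current one.
        return next(s.files for s in self.snapshots if s.snapshot_id == snapshot_id)


table = ToyTable()
table.commit(0, add=["part-0001.parquet"], remove=[])
table.commit(1, add=["part-0002.parquet"], remove=["part-0001.parquet"])
print(table.current().files)   # files in the latest version
print(table.files_as_of(1))    # files as they were at snapshot 1
```

A writer that started from a stale snapshot gets a conflict error instead of silently overwriting newer data, which is the property that plain directories of files cannot offer.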

Apache Iceberg

Origin: Created at Netflix, donated to the Apache Software Foundation.
Goal: Replace Hive tables with a scalable, open table format.
Key Features:
- Full ACID transactions.
- Hidden partitioning (no need to hardcode folder paths).
- Schema evolution without rewriting data.
- Time travel (query past snapshots).
- Strong integration with Trino, Spark, Flink.

Example time travel query in Trino (123456789 stands in for a snapshot ID):

SELECT *
FROM sales FOR VERSION AS OF 123456789;
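
Hidden partitioning is easier to see with a small sketch. The Python below is a toy model under stated assumptions, not Iceberg's implementation: `day_transform` mimics a day-granularity partition transform (days since 1970-01-01), and `add_file`, `plan_scan`, and the manifest layout are hypothetical. The point is that the partition value is derived from a timestamp column and stored in metadata, so a query filters on the timestamp itself and never mentions folder paths.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)
UTC = timezone.utc


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 for a timezone-aware timestamp."""
    return (ts.astimezone(UTC).date() - EPOCH).days


def add_file(manifest: list, path: str, event_times: list) -> None:
    """Record a data file together with its derived partition value."""
    days = {day_transform(t) for t in event_times}
    if len(days) != 1:
        raise ValueError("this sketch expects one partition value per file")
    manifest.append({"path": path, "event_day": days.pop()})


def plan_scan(manifest: list, start: datetime, end: datetime) -> list:
    """Keep only files whose partition day can contain rows in [start, end]."""
    lo, hi = day_transform(start), day_transform(end)
    return [f["path"] for f in manifest if lo <= f["event_day"] <= hi]


manifest = []
add_file(manifest, "f1.parquet", [datetime(2025, 9, 1, 8, tzinfo=UTC)])
add_file(manifest, "f2.parquet", [datetime(2025, 9, 2, 9, tzinfo=UTC),
                                  datetime(2025, 9, 2, 23, tzinfo=UTC)])
add_file(manifest, "f3.parquet", [datetime(2025, 9, 3, 1, tzinfo=UTC)])

# The "query" filters on the timestamp; pruning by day happens behind the scenes.
selected = plan_scan(manifest,
                     datetime(2025, 9, 2, 0, tzinfo=UTC),
                     datetime(2025, 9, 2, 12, tzinfo=UTC))
print(selected)
```

Because the transform lives in table metadata rather than in the directory layout, the partitioning scheme can change later without rewriting the queries that read the table.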

Delta Lake

Origin: Created by Databricks, open-sourced in 2019, and now hosted by the Linux Foundation.
Goal: Bring warehouse reliability to Data Lakes with tight Spark integration.
Key Features:
- ACID transactions using a transaction log (_delta_log).
- Schema enforcement and evolution.
- Time travel by table version or timestamp, reconstructed from the transaction log.
- Optimized for Apache Spark.
Ecosystem: Open source, but Databricks provides enterprise features.

Example time travel query in Spark SQL (reads version 42 of the table):

SELECT * FROM sales VERSION AS OF 42;
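
The way a transaction log yields both ACID commits and time travel can be sketched in a few lines of Python. This is a simplified illustration, not the Delta Lake reader: the `commits` dictionary is a hypothetical stand-in for the JSON commit files under _delta_log, and only `add` and `remove` actions with a `path` field are modeled, while real commits carry more action types and fields.

```python
import json

# Hypothetical, simplified commits keyed by version number. Each value is
# newline-delimited JSON, one action per line.
commits = {
    0: '{"add": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0001.parquet"}}',
    1: '{"remove": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0002.parquet"}}',
    2: '{"add": {"path": "part-0003.parquet"}}',
}


def live_files(commits: dict, as_of_version: int) -> set:
    """Replay commits 0..as_of_version and return the active data files."""
    active = set()
    for version in sorted(commits):
        if version > as_of_version:
            break
        for line in commits[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active


print(sorted(live_files(commits, 0)))  # table as of version 0
print(sorted(live_files(commits, 2)))  # table as of the latest version
```

Reading an older version is just a replay that stops early, which is why a query like the one above can name a version number directly.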

Apache Hudi

Origin: Created at Uber.
Goal: Optimize for streaming ingestion and incremental data processing.
Key Features:
- ACID transactions.
- Two table types: Copy-on-Write (COW) and Merge-on-Read (MOR).
- Built-in support for upserts and deletes.
- Incremental queries for near-real-time analytics.
- Strong integration with Spark and Flink.
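
The trade-off between the two table types can be shown with a toy model. The Python below is a sketch under loud assumptions: `CopyOnWriteTable` and `MergeOnReadTable` are hypothetical classes, dictionaries stand in for base files, and a list stands in for log files. It does not use Hudi itself. Copy-on-Write pays the merge cost at write time, while Merge-on-Read appends cheaply and merges when the table is read or compacted.

```python
class CopyOnWriteTable:
    """Toy COW table: every upsert rewrites the base file."""

    def __init__(self):
        self.base = {}             # record key -> value, standing in for a base file
        self.files_rewritten = 0

    def upsert(self, rows: dict) -> None:
        merged = dict(self.base)
        merged.update(rows)
        self.base = merged         # merge cost is paid on write
        self.files_rewritten += 1

    def read(self) -> dict:
        return dict(self.base)     # reads are a plain scan


class MergeOnReadTable:
    """Toy MOR table: upserts append to a log; reads merge base and log."""

    def __init__(self):
        self.base = {}
        self.log = []              # appended batches, standing in for log files

    def upsert(self, rows: dict) -> None:
        self.log.append(dict(rows))    # cheap append on write

    def read(self) -> dict:
        merged = dict(self.base)
        for batch in self.log:         # merge cost is paid on read
            merged.update(batch)
        return merged

    def compact(self) -> None:
        # Fold the log into the base so later reads are cheap again.
        self.base = self.read()
        self.log = []


updates = [{"a": 1, "b": 2}, {"a": 10}, {"c": 3}]
cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in updates:
    cow.upsert(batch)
    mor.upsert(batch)

print(cow.read(), cow.files_rewritten)
print(mor.read(), len(mor.log))
```

Both tables return the same rows; they differ in when the work happens, which is why write-heavy streaming pipelines tend to favor Merge-on-Read and read-heavy workloads tend to favor Copy-on-Write.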

Example query that approximates an incremental read by filtering on Hudi's commit-time metadata column:

SELECT *
FROM sales
WHERE _hoodie_commit_time > '20250901';
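
A downstream job typically repeats this kind of read with a moving checkpoint. The Python sketch below illustrates that loop on in-memory rows. The `rows` data and `incremental_pull` are hypothetical, and the commit times are assumed to be fixed-width timestamp strings such as yyyyMMddHHmmss, so plain string comparison orders them correctly.

```python
# Hypothetical rows carrying a commit-time column like the one in the query above.
rows = [
    {"_hoodie_commit_time": "20250831120000", "order_id": 1, "amount": 10.0},
    {"_hoodie_commit_time": "20250901093000", "order_id": 2, "amount": 25.0},
    {"_hoodie_commit_time": "20250902151500", "order_id": 1, "amount": 12.5},
]


def incremental_pull(rows: list, checkpoint: str):
    """Return rows committed after `checkpoint`, plus the next checkpoint."""
    changed = [r for r in rows if r["_hoodie_commit_time"] > checkpoint]
    next_checkpoint = max(
        (r["_hoodie_commit_time"] for r in changed), default=checkpoint
    )
    return changed, next_checkpoint


# First pull: everything committed on or after 1 September 2025.
batch, checkpoint = incremental_pull(rows, "20250901")
print([r["order_id"] for r in batch], checkpoint)

# Second pull: nothing new, so the checkpoint does not move.
batch, checkpoint = incremental_pull(rows, checkpoint)
print(batch, checkpoint)
```

Each run processes only the rows that changed since the previous run, which is what keeps incremental pipelines cheap compared with rescanning the whole table.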

Comparing Iceberg, Delta, and Hudi

Feature	Iceberg	Delta Lake	Hudi
Origin	Netflix / Apache	Databricks / Linux Foundation	Uber / Apache
Transaction model	ACID, snapshots	ACID, delta log	ACID, commit log
Best for	Batch + Interactive	Spark-centric BI	Streaming + upserts
Schema evolution	Yes (flexible)	Yes	Yes
Time travel	Yes	Yes	Yes (by commit time)
Engine support	Trino, Spark, Flink	Spark (best), Trino	Spark, Flink

How They Fit in the Lakehouse

Iceberg: Best for open, multi-engine environments (Trino, Spark, Flink).
Delta Lake: Strongest in Databricks/Spark ecosystems.
Hudi: Best fit for real-time ingestion and incremental pipelines.

All three bring warehouse-like reliability to Data Lakes, enabling the Lakehouse model.

Conclusion

Modern table formats are the foundation of the Lakehouse. By adding ACID transactions, schema evolution, and time travel, they turn raw storage (S3, MinIO, HDFS) into reliable analytical platforms.

Iceberg: open and engine-agnostic.
Delta Lake: Spark-first, Databricks-friendly.
Hudi: optimized for streaming and upserts.

No matter which you choose, table formats are the key to bridging the gap between lakes and warehouses.