
Modern Table Formats: Iceberg, Delta Lake, and Hudi

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled with schema changes, updates, and deletes.

To solve these issues, the community created modern table formats that brought ACID transactions, schema evolution, and time travel to Data Lakes. The three leading projects are Apache Iceberg, Delta Lake, and Apache Hudi.


Why Table Formats Matter

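Without a table format, a Data Lake is just a collection of files (e.g., Parquet, ORC). Nothing tracks which files make up a table at a given moment, so schema changes, updates, and deletes are hard to apply reliably.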

Table formats add a metadata layer on top of files, transforming raw object storage into a Lakehouse.
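To make the idea of a metadata layer concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the implementation of any of the three formats: the `ToyTable` and `Snapshot` names and the file names are hypothetical. It only shows the core idea that a table is a sequence of immutable snapshots, each listing the data files that belong to it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the set of data files it contains."""
    snapshot_id: int
    files: frozenset


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's metadata layer."""
    snapshots: list = field(default_factory=list)

    def current_files(self):
        # The latest snapshot defines what readers see right now.
        return self.snapshots[-1].files if self.snapshots else frozenset()

    def commit(self, added=(), removed=()):
        """Publish a new snapshot; earlier snapshots are never modified."""
        files = (self.current_files() - frozenset(removed)) | frozenset(added)
        snap = Snapshot(snapshot_id=len(self.snapshots) + 1, files=files)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_as_of(self, snapshot_id):
        """Time travel: return the file set recorded by an earlier snapshot."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
v1 = table.commit(added=["part-0.parquet", "part-1.parquet"])
# A rewrite (for example, applying a delete) swaps one file for another.
v2 = table.commit(added=["part-2.parquet"], removed=["part-0.parquet"])
```

Because each commit only appends a new snapshot, a reader holding `v1` keeps seeing a consistent set of files while a writer publishes `v2`. Real formats persist this metadata in object storage and add concurrency control on top.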


Apache Iceberg

Example time-travel query in Trino, reading the table as of a given snapshot ID:

SELECT *
FROM sales FOR VERSION AS OF 123456789;

Delta Lake

Example time-travel query in Spark SQL, reading table version 42:

SELECT * FROM sales VERSION AS OF 42;
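A version number works because Delta Lake records every change as an ordered commit in its transaction log, and each commit lists files added to or removed from the table. The sketch below replays a simplified log in Python to find the live files at a given version. The commit contents and the `files_at_version` helper are hypothetical. Real log entries carry more fields, and real readers also use checkpoints rather than replaying from version 0.

```python
import json

# Hypothetical, heavily simplified commits keyed by version number.
# Each string stands in for one JSON action line in a commit file.
commits = {
    0: ['{"add": {"path": "part-0.parquet"}}'],
    1: ['{"add": {"path": "part-1.parquet"}}'],
    2: [
        '{"remove": {"path": "part-0.parquet"}}',
        '{"add": {"path": "part-2.parquet"}}',
    ],
}


def files_at_version(commits, version):
    """Replay add/remove actions in order, up to and including `version`."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


files_v1 = files_at_version(commits, 1)
files_v2 = files_at_version(commits, 2)
```

Here `files_v1` still contains `part-0.parquet`, while `files_v2` does not, which is the difference a `VERSION AS OF` query exposes.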

Apache Hudi

Example query that picks up records committed after a given instant by filtering on Hudi's commit-time metadata column:

SELECT *
FROM sales
WHERE _hoodie_commit_time > '20250901';
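The filter above relies on commit times being fixed-format timestamp strings, so plain string comparison matches chronological order. The Python sketch below illustrates that comparison and how a consumer might advance its checkpoint between runs. It assumes a `yyyyMMddHHmmss` layout for the values. The rows, the `changed_since` helper, and the checkpoint handling are hypothetical, and only the `_hoodie_commit_time` column name comes from the query above.

```python
# Hypothetical rows; only the _hoodie_commit_time column name comes from Hudi.
rows = [
    {"order_id": 1, "_hoodie_commit_time": "20250831235959"},
    {"order_id": 2, "_hoodie_commit_time": "20250901093000"},
    {"order_id": 3, "_hoodie_commit_time": "20250915120000"},
]


def changed_since(rows, checkpoint):
    """Keep rows whose commit time sorts after the checkpoint string."""
    return [r for r in rows if r["_hoodie_commit_time"] > checkpoint]


changed = changed_since(rows, "20250901")
changed_ids = [r["order_id"] for r in changed]

# Remember the newest commit seen, so the next run starts from there.
next_checkpoint = max(r["_hoodie_commit_time"] for r in changed)
```

Note that a short prefix such as `'20250901'` sorts before every full timestamp on that day, so commits made on 1 September itself are included in the result.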

Comparing Iceberg, Delta, and Hudi

| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Origin | Netflix / Apache | Databricks | Uber / Apache |
| Transaction model | ACID, snapshots | ACID, delta log | ACID, commit log |
| Best for | Batch + Interactive | Spark-centric BI | Streaming + upserts |
| Schema evolution | Yes (flexible) | Yes | Yes |
| Time travel | Yes | Yes | Limited |
| Engine support | Trino, Spark, Flink | Spark (best), Trino | Spark, Flink |

How They Fit in the Lakehouse

All three bring warehouse-like reliability to Data Lakes, enabling the Lakehouse model.


Conclusion

Modern table formats are the foundation of the Lakehouse. By adding ACID transactions, schema evolution, and time travel, they turn raw storage (S3, MinIO, HDFS) into reliable analytical platforms.

No matter which you choose, table formats are the key to bridging the gap between lakes and warehouses.

