The History of Hive and Trino: From Hadoop to Lakehouses

The evolution of Big Data architectures is closely tied to two projects born at Facebook: Hive and Trino (originally named Presto). Both emerged from real engineering pain points, but at different times and for different reasons. Their history helps explain how we arrived at today’s Data Lakehouse architectures.


Hive (2008): SQL on Hadoop

In the late 2000s, Facebook’s engineers were drowning in data. Hadoop had become the storage and processing backbone, but writing MapReduce jobs directly was slow and complex. Data analysts, who were more familiar with SQL, couldn’t easily query the data stored in Hadoop.

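The solution was Hive, introduced in 2008. Hive added a SQL-like query language (HiveQL) that was compiled into MapReduce jobs, and a metastore that described files in HDFS as tables.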

Hive democratized access to Big Data inside Facebook. Instead of writing complex MapReduce programs, analysts could run queries like:

SELECT user_id, COUNT(*) 
FROM page_views 
WHERE date = '2008-11-01' 
GROUP BY user_id;
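
To give a sense of what Hive spared analysts from writing, here is a minimal sketch of the same aggregation expressed as explicit map, shuffle, and reduce phases. It is plain Python running in a single process, not Hadoop code and not what Hive actually generates; the sample rows and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the page_views table.
PAGE_VIEWS = [
    {"user_id": 1, "date": "2008-11-01"},
    {"user_id": 2, "date": "2008-11-01"},
    {"user_id": 1, "date": "2008-11-01"},
    {"user_id": 3, "date": "2008-10-31"},
]


def map_phase(rows, target_date):
    """Map: apply the WHERE filter and emit (user_id, 1) for each matching row."""
    for row in rows:
        if row["date"] == target_date:
            yield row["user_id"], 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values per key, which corresponds to COUNT(*) ... GROUP BY user_id."""
    return {key: sum(values) for key, values in groups.items()}


def count_views_per_user(rows, target_date):
    """Run the three phases in sequence for one date."""
    return reduce_phase(shuffle_phase(map_phase(rows, target_date)))


if __name__ == "__main__":
    print(count_views_per_user(PAGE_VIEWS, "2008-11-01"))
```

Even in this toy form, a four-line query turns into three separate functions plus the plumbing between them. On a real cluster, each phase also involves job scheduling and data movement between machines.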

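Problem: Hive was batch-oriented. Each query was compiled into one or more MapReduce jobs, which carried job startup overhead and wrote intermediate results to disk, so queries often took minutes or hours. That made Hive unsuitable for interactive analysis.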


Presto (2012): Interactive SQL at Scale

By 2012, Facebook needed faster analytics. Analysts were frustrated by the latency of Hive, especially when exploring data. The engineering team built a new query engine from scratch: Presto.

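Key innovations in Presto: distributed, in-memory execution across a cluster of workers instead of MapReduce jobs, and connectors that let a single query span multiple data sources.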

This meant that instead of waiting for batch jobs, analysts could now write:

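-- Presto addresses tables as catalog.schema.table;
-- the schema names "web" and "app" are illustrative.
SELECT COUNT(*)
FROM hive.web.page_views p
JOIN mysql.app.users u
  ON p.user_id = u.id;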

And Presto would execute the query across Hadoop and MySQL simultaneously.
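
The idea behind such a federated join can be sketched in a few lines: each source is exposed as a stream of rows, and the engine joins those streams itself instead of pushing the work into either system. The sketch below is a toy, single-process illustration in Python. It is not Presto's actual implementation or connector interface, and the scan functions, the sample rows, and the `hash_join` helper are all hypothetical.

```python
def scan_page_views():
    """Hypothetical stand-in for a connector reading page_views from Hadoop."""
    for user_id in (1, 2, 1, 4):
        yield {"user_id": user_id}


def scan_users():
    """Hypothetical stand-in for a connector reading users from MySQL."""
    for user_id, name in ((1, "ana"), (2, "ben"), (3, "cho")):
        yield {"id": user_id, "name": name}


def hash_join(probe_rows, build_rows, probe_key, build_key):
    """Inner join: index the build side in memory, then stream the probe side past it."""
    index = {}
    for row in build_rows:
        index.setdefault(row[build_key], []).append(row)
    for row in probe_rows:
        for match in index.get(row[probe_key], []):
            yield {**row, **match}


def count_joined_rows():
    """Equivalent of SELECT COUNT(*) over the joined rows."""
    joined = hash_join(scan_page_views(), scan_users(), "user_id", "id")
    return sum(1 for _ in joined)


if __name__ == "__main__":
    print(count_joined_rows())
```

The join logic never needs to know where the rows came from, which is why adding another data source only takes another scan function. A real engine distributes the same work across many machines.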

Presto quickly became Facebook’s main interactive query engine, replacing Hive for most day-to-day analysis.


From Presto to Trino (2019)

The original creators of Presto left Facebook in 2018 and, in early 2019, continued development in a fork called PrestoSQL, which was renamed Trino in December 2020. Meanwhile, the version that stayed with Facebook continued as PrestoDB.

Today, PrestoDB and Trino are developed as separate open-source projects.

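Trino has extended beyond Hadoop: it queries data in object storage such as S3, MinIO, GCS, and ADLS, and works with catalogs such as the Hive Metastore, Glue, Nessie, and REST catalogs.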


Hive vs. Trino in the Lakehouse Era

| Feature | Hive (2008) | Trino (2012 → now) |
| --- | --- | --- |
| Execution model | Batch (MapReduce, later Tez/Spark) | Interactive, in-memory, distributed |
| Latency | Minutes to hours | Seconds |
| Storage target | HDFS | HDFS, S3, MinIO, GCS, ADLS |
| Metadata | Hive Metastore | Hive Metastore, Glue, Nessie, REST catalogs |
| Use case | Batch ETL, long-running queries | Interactive queries, federated analytics, Lakehouse |

Conclusion

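Many modern data platforms no longer rely on HDFS or MapReduce, but the DNA of Hive and Trino lives on: the table metadata model that Hive introduced and the distributed SQL execution that Presto pioneered still underpin queries against S3, MinIO, or Iceberg tables.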

