From HDFS to S3: The Evolution of Data Lakes in the Cloud

For years, HDFS (Hadoop Distributed File System) was the default choice for building data lakes in on-premises and Hadoop-based environments. But as cloud computing gained momentum, a new player took the lead: Amazon S3. Today, S3 is widely recognized as the de facto data lake storage layer in the AWS ecosystem.

How did this shift happen? Let’s explore the evolution from HDFS to S3 in the context of data lakes.

HDFS: The Original Data Lake Backbone

In the early 2010s, the concept of a data lake (a central repository for storing all types of raw data) became popular. HDFS, a distributed file system deployed on-premises or on a Hadoop cluster, was a natural fit.

As Tom White describes in Hadoop: The Definitive Guide, HDFS enabled organizations to build scalable storage platforms for batch-oriented big data processing.

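However, HDFS came with trade-offs: capacity scaled manually by adding nodes, durability depended on software-level replication, and compute and storage were provisioned together at a fixed cost.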

The Rise of Amazon S3

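As cloud adoption grew, AWS offered a new model with Amazon S3: cloud-native object storage, accessed over a REST API, that scales automatically and is billed pay-as-you-go.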

S3 allowed companies to shift away from Hadoop clusters while still storing massive datasets in open formats like Parquet, ORC, and CSV.
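To make the shift concrete, here is a minimal sketch of how a file's location changes when a dataset moves from HDFS to S3 while keeping its directory layout and file format. The `s3a://` scheme is the one used by Hadoop's S3 connector; the function name `hdfs_to_s3a`, the bucket `example-lake`, and the sample paths are hypothetical and only serve as illustration.

```python
from urllib.parse import urlparse


def hdfs_to_s3a(hdfs_uri: str, bucket: str, prefix: str = "") -> str:
    """Map an hdfs:// file URI to an s3a:// URI, keeping the directory layout.

    Hypothetical helper: the bucket and optional prefix are supplied by the
    caller, they are not derived from the HDFS cluster.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {hdfs_uri!r}")
    # The NameNode host:port is dropped; S3 addresses data by bucket and key.
    parts = (prefix.strip("/"), parsed.path.lstrip("/"))
    key = "/".join(part for part in parts if part)
    return f"s3a://{bucket}/{key}"


if __name__ == "__main__":
    source = "hdfs://namenode:8020/warehouse/events/date=2024-01-01/part-0000.parquet"
    print(hdfs_to_s3a(source, bucket="example-lake"))
    print(hdfs_to_s3a(source, bucket="example-lake", prefix="raw"))
```

The Parquet file itself is untouched by such a move; only the storage address changes, which is why engines that read open formats can point at either location.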

Why S3 Became the New Data Lake

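S3 won the data lake battle in the cloud for several key reasons: it scales automatically, offers 99.999999999% durability across Availability Zones, is reachable from SQL engines and Spark as well as its REST API, and is priced pay-as-you-go with tiered storage.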

Today, S3 is often referred to as “the data lake of AWS”—a role that HDFS previously held in the Hadoop world.

Key Takeaways

| Feature | HDFS | Amazon S3 |
| --- | --- | --- |
| Type | Distributed file system | Object storage |
| Deployment | On-prem or Hadoop cluster | Cloud-native |
| Scaling | Manual (add nodes) | Automatic |
| Durability | Software-level replication | 99.999999999% across AZs |
| Data Access | Hadoop tools | REST API, SQL engines, Spark |
| Cost Model | Fixed compute + storage | Pay-as-you-go, tiered storage |
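The durability figure in the table is easier to grasp with a little arithmetic. The sketch below assumes the 99.999999999% value is read as an annual, per-object durability; the lake size of 10 million objects and the function names are hypothetical. `Decimal` is used because ordinary floats cannot represent eleven nines exactly.

```python
from decimal import Decimal

# Durability figure from the comparison table (eleven nines),
# interpreted here as an annual per-object probability of survival.
S3_DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count: int,
                           durability: Decimal = S3_DURABILITY) -> Decimal:
    """Expected number of objects lost per year at the given durability."""
    return Decimal(object_count) * (Decimal(1) - durability)


def years_per_lost_object(object_count: int,
                          durability: Decimal = S3_DURABILITY) -> Decimal:
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_annual_losses(object_count, durability)


if __name__ == "__main__":
    lake_size = 10_000_000  # hypothetical number of objects in the lake
    print("expected losses per year:", expected_annual_losses(lake_size))
    print("years per lost object:", years_per_lost_object(lake_size))
```

Under these assumptions, a lake of 10 million objects would expect to lose about 0.0001 objects per year, or roughly one object every 10,000 years.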

Final Thoughts

While HDFS laid the foundation for modern data lakes, S3 has redefined the model in the cloud era. Its flexibility, scalability, and native integration with cloud services have made it the go-to choice for data lake architecture in AWS.

As organizations continue to move to the cloud, S3 will likely remain the central storage layer for modern, serverless, and AI-driven analytics.

