Skip to content
>GLB_
Go back

How HDFS Achieves Fault Tolerance Through Replication

One of the core strengths of the Hadoop Distributed File System (HDFS) is its fault tolerance. In a world of distributed computing, failures are not rare—they’re expected. HDFS tackles this by using block-level replication to ensure that data is never lost, even when individual nodes fail.

What Is Replication in HDFS?

When a file is stored in HDFS, it’s broken into blocks (default size: 128MB or 256MB). Each block is then replicated across multiple DataNodes. The default replication factor is 3, meaning:

This provides both redundancy and availability.

How It Works

During a Write

  1. A client writes a file to HDFS.
  2. The file is split into blocks.
  3. For each block, HDFS:
    • Chooses three different DataNodes (based on rack awareness).
    • Writes the block to the first DataNode.
    • That node forwards it to the second.
    • The second forwards it to the third.

This is known as pipelined replication.

During a Failure

Let’s say one DataNode crashes:

This ensures the replication factor is quickly restored.

Benefits of Replication

Trade-Offs

However, for large-scale data systems, the trade-off is worth the reliability.

Summary

HDFS uses replication to guard against data loss and enable fault tolerance. By storing multiple copies of each block across different nodes (and racks), HDFS ensures that hardware failures do not result in lost data—just temporary inconvenience.


Share this post:

Previous Post
What’s Behind Amazon S3?
Next Post
Summary: Teaching HDFS Concepts to New Learners