
How Clients Know Where to Read or Write in HDFS

Hadoop Distributed File System (HDFS) is designed to decouple metadata management from actual data storage. But how does a client—like a Spark job or command-line tool—know where to read or write the bytes of a file across a distributed system?

Let’s break down what happens when a client interacts with HDFS.

The Role of the NameNode

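At the heart of HDFS is the NameNode, which is the metadata manager. It knows:

  • Which files and directories exist in the namespace
  • Which blocks make up each file, and in what order
  • Which DataNodes currently hold a replica of each block

But the NameNode does not store the data itself. That job belongs to the DataNodes.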
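To make the metadata idea concrete, here is a minimal sketch of the two lookups a NameNode maintains. ToyNameNode, register_file, get_block_locations, and the dn1..dn4 names are hypothetical and exist only for this illustration; they are not the Hadoop API. The 128 MiB block size matches the default dfs.blocksize in recent Hadoop releases, and the round-robin replica placement is a stand-in for the real rack-aware policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the default dfs.blocksize in recent Hadoop releases


class ToyNameNode:
    """Hypothetical in-memory stand-in for NameNode metadata; not the Hadoop API."""

    def __init__(self):
        self.file_to_blocks = {}      # path -> ordered list of block IDs
        self.block_to_datanodes = {}  # block ID -> DataNodes holding a replica

    def register_file(self, path, size_bytes, datanodes, replication=3):
        """Record metadata for a file. No file bytes are stored here."""
        if replication > len(datanodes):
            raise ValueError("not enough DataNodes for the requested replication")
        num_blocks = math.ceil(size_bytes / BLOCK_SIZE)
        block_ids = []
        for i in range(num_blocks):
            block_id = f"{path}#blk_{i}"
            # Round-robin placement for illustration; real placement is rack-aware.
            replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            self.block_to_datanodes[block_id] = replicas
            block_ids.append(block_id)
        self.file_to_blocks[path] = block_ids
        return block_ids

    def get_block_locations(self, path):
        """Return (block ID, replica DataNodes) pairs in file order."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]


namenode = ToyNameNode()
namenode.register_file("/logs/app.log", 300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
locations = namenode.get_block_locations("/logs/app.log")
for block_id, replicas in locations:
    print(block_id, replicas)
```

A 300 MiB file needs three 128 MiB blocks (the last one only partly full), and every entry in the result is a pointer to where bytes live, never the bytes themselves.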

When a Client Reads a File

  1. The client contacts the NameNode, asking for metadata about the file it wants to read.
  2. The NameNode responds with:
    • A list of blocks
    • The addresses of the DataNodes storing each block
  3. The client then communicates directly with the DataNodes, retrieving the file block-by-block.

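Because the NameNode hands out only metadata and the bytes flow straight from the DataNodes, reads do not bottleneck on a single server, and different blocks can be fetched from different DataNodes in parallel.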
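The read steps above can be sketched with a toy in-memory cluster. Everything here (the datanodes and namenode_metadata dictionaries, get_block_locations, read_block, read_file) is hypothetical and only mirrors the shape of the protocol. One simplification to be aware of: a single HDFS input stream normally reads a file's blocks in order, and the parallelism usually comes from many tasks each reading different blocks. The thread pool below just shows that the blocks are independent.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory cluster: each DataNode maps block ID -> bytes.
datanodes = {
    "dn1": {"blk_0": b"hello "},
    "dn2": {"blk_0": b"hello ", "blk_1": b"world"},
    "dn3": {"blk_1": b"world"},
}

# NameNode-side metadata only: ordered blocks per file, with replica locations.
namenode_metadata = {
    "/greeting.txt": [("blk_0", ["dn1", "dn2"]), ("blk_1", ["dn3", "dn2"])],
}


def get_block_locations(path):
    """Steps 1 and 2: the metadata lookup. It returns no file bytes."""
    return namenode_metadata[path]


def read_block(block_id, replicas):
    """Step 3: fetch one block directly from the first replica that has it."""
    for address in replicas:
        store = datanodes.get(address)
        if store is not None and block_id in store:
            return store[block_id]
    raise IOError(f"no live replica for {block_id}")


def read_file(path):
    locations = get_block_locations(path)
    # Blocks are independent, so they can be fetched concurrently;
    # map() yields results in the same order as the block list.
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunks = pool.map(lambda loc: read_block(*loc), locations)
        return b"".join(chunks)


data = read_file("/greeting.txt")
print(data)
```

If a DataNode disappears from the toy cluster, read_block falls through to the next replica in the list, which is the reason the NameNode returns several addresses per block.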

When a Client Writes a File

  1. The client contacts the NameNode to request permission to create a file.
  2. For each new block, the NameNode responds with the list of DataNodes that should hold its replicas.
  3. The client begins sending data directly to those DataNodes.
  4. HDFS replicates each block (3 copies by default, controlled by dfs.replication), using a pipeline mechanism:
    • The client sends data to DataNode A
    • A forwards it to B
    • B forwards it to C
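The pipeline in step 4 can be sketched as a chain in which each node stores the data and passes it on. ToyDataNode, receive, and write_block are hypothetical names for this illustration, not Hadoop classes. The sketch also simplifies in two ways: it sends a whole block in one call, whereas HDFS streams a block as a sequence of small packets, and it collects acknowledgements as a plain list on the way back.

```python
class ToyDataNode:
    """Hypothetical DataNode: stores a block, then forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        # Store the local replica first.
        self.blocks[block_id] = data
        # Forward to the next node in the pipeline, if there is one.
        if downstream:
            acks = downstream[0].receive(block_id, data, downstream[1:])
        else:
            acks = []
        # Report this node plus every node after it that stored the block.
        return [self.name] + acks


def write_block(block_id, data, pipeline):
    """The client sends data only to the first DataNode in the pipeline."""
    return pipeline[0].receive(block_id, data, pipeline[1:])


a, b, c = ToyDataNode("A"), ToyDataNode("B"), ToyDataNode("C")
acks = write_block("blk_0", b"payload", [a, b, c])
print(acks)
```

The point of the chain is that the client uploads each byte once: the cost of producing the second and third replicas is carried by DataNode-to-DataNode links rather than by the client's connection.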

Important Behaviors

Summary

HDFS clients learn where to read or write by consulting the NameNode for metadata. After that, all data transfer happens directly between the client and the DataNodes, allowing HDFS to scale massively while keeping the NameNode lean and fast.

