How Clients Know Where to Read or Write in HDFS

Hadoop Distributed File System (HDFS) is designed to decouple metadata management from actual data storage. But how does a client—like a Spark job or command-line tool—know where to read or write the bytes of a file across a distributed system?

Let’s break down what happens when a client interacts with HDFS.

The Role of the NameNode

At the heart of HDFS is the NameNode, which is the metadata manager. It knows:

- The list of all files and directories
- The size of each file
- The mapping of file blocks to DataNodes
- Block replication information

But the NameNode does not store the data itself—that job belongs to the DataNodes.
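To make the shape of that metadata concrete, here is a minimal sketch of what the NameNode tracks per file. The class names `BlockInfo` and `FileEntry` and the helper `split_into_blocks` are hypothetical, invented for illustration; they are not HDFS classes. The 128 MiB block size is an assumption based on the common default, and a real cluster may be configured differently.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed block size: 128 MiB is a common default, but it is configurable.
BLOCK_SIZE = 128 * 1024 * 1024


@dataclass
class BlockInfo:
    block_id: int
    length: int            # bytes stored in this block
    datanodes: List[str]   # hosts that hold a replica of this block


@dataclass
class FileEntry:
    path: str
    size: int
    replication: int
    blocks: List[BlockInfo] = field(default_factory=list)


def split_into_blocks(size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the length of each block for a file of `size` bytes."""
    full, rest = divmod(size, block_size)
    return [block_size] * full + ([rest] if rest else [])


# A 300 MiB file needs two full blocks plus one partial block.
lengths = split_into_blocks(300 * 1024 * 1024)
entry = FileEntry(
    path="/data/example.bin",
    size=sum(lengths),
    replication=3,
    blocks=[
        BlockInfo(block_id=i, length=n, datanodes=["dn1", "dn2", "dn3"])
        for i, n in enumerate(lengths)
    ],
)
print(len(entry.blocks), [n // (1024 * 1024) for n in lengths])
```

Note that nothing in this structure contains file contents: it only records which blocks exist and where their replicas live.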

When a Client Reads a File

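The client contacts the NameNode and asks for the block locations of the file it wants to read.
The NameNode responds with:
- The list of blocks that make up the file
- The addresses of the DataNodes storing each block, ordered by network distance from the client
The client then connects directly to the DataNodes and retrieves the file block by block. If a DataNode is unreachable, the client falls back to another replica of that block.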

Because the NameNode serves only metadata, file data never passes through it, and many clients can read different blocks from different DataNodes in parallel.
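The read path can be sketched as a toy simulation. `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names made up for this example, not part of any HDFS client library; the point is only to show one metadata lookup followed by direct block fetches, with a fallback to another replica when a DataNode is missing.

```python
class ToyNameNode:
    """Hypothetical stand-in: holds only metadata, never file bytes."""

    def __init__(self):
        # path -> [(block_id, [datanode names, closest first])]
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Hypothetical stand-in: holds block contents keyed by block id."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, live_datanodes, path):
    data = bytearray()
    # Step 1: a single metadata call to the NameNode.
    for block_id, hosts in namenode.get_block_locations(path):
        # Step 2: fetch each block straight from a DataNode,
        # trying the replicas in the order the NameNode listed them.
        for host in hosts:
            node = live_datanodes.get(host)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no reachable replica for block {block_id}")
    return bytes(data)


namenode = ToyNameNode()
namenode.block_map["/demo.txt"] = [(1, ["dn1", "dn2"]), (2, ["dn3", "dn2"])]

# dn3 is deliberately missing here to simulate an unreachable DataNode.
live_datanodes = {
    "dn1": ToyDataNode({1: b"hello "}),
    "dn2": ToyDataNode({1: b"hello ", 2: b"world"}),
}

content = read_file(namenode, live_datanodes, "/demo.txt")
print(content)
```

In this sketch the second block is served by `dn2` because `dn3` is not reachable, which mirrors how a client moves on to the next replica in the list.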

When a Client Writes a File

The client contacts the NameNode to request permission to create a file.
For each new block, the NameNode responds with a list of DataNodes that will store its replicas.
The client begins sending data directly to the first of those DataNodes.
HDFS replicates each block (3 copies by default) using a pipeline mechanism:
- The client sends data to DataNode A
- A forwards it to B
- B forwards it to C
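The pipeline above can be sketched as a chain of nodes, each storing a packet and forwarding it downstream. The names `ToyDataNode` and `build_pipeline` are hypothetical and exist only for this illustration. The sketch also assumes that acknowledgments travel back in reverse order (C, then B, then A), so the client sees an ack only after every replica has the packet.

```python
class ToyDataNode:
    """Hypothetical pipeline stage: stores a packet, then forwards it."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.packets = []

    def receive(self, packet):
        self.packets.append(packet)
        if self.downstream is None:
            # Last node in the pipeline: start the acknowledgment chain.
            return [self.name]
        # Forward first, then append our own ack on the way back upstream.
        return self.downstream.receive(packet) + [self.name]


def build_pipeline(names):
    """Link the nodes so that names[0] is the head the client writes to."""
    node = None
    for name in reversed(names):
        node = ToyDataNode(name, downstream=node)
    return node


head = build_pipeline(["A", "B", "C"])

# The client only ever talks to the head of the pipeline.
acks = [head.receive(packet) for packet in (b"pkt-0", b"pkt-1")]
print(acks)
```

The client's outbound bandwidth is spent once per packet rather than once per replica, which is the main reason for chaining the DataNodes instead of having the client write three copies itself.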

Important Behaviors

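Clients never send file data to or read file data from the NameNode; they contact it only for metadata and directions.
Data locality is considered: with the default placement policy, the first replica goes on the writer's own node when the client runs on a DataNode, and the remaining replicas go to a different rack for fault tolerance. For reads, the client is directed to the closest replica (e.g., same node or same rack).
Clients track their own write progress, block by block: data is streamed in packets, acknowledged by the pipeline, and a new block is requested from the NameNode once the current one is full.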
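Replica placement is easier to picture with a small example. The sketch below follows the default policy as it is usually described: first replica on the writer's node if that node is a DataNode, second replica on a different rack, third replica on another node in the second replica's rack. The function `choose_targets` and the `topology` dictionary are hypothetical, and the sketch picks nodes in sorted order to stay deterministic, whereas the real policy chooses randomly and also weighs node load and free space.

```python
def choose_targets(writer, topology, replication=3):
    """Pick DataNodes for one block. `topology` maps node name -> rack name."""
    nodes = sorted(topology)  # deterministic order; an assumption of this sketch

    # 1st replica: the writer's own node when it is a DataNode, else any node.
    targets = [writer if writer in topology else nodes[0]]

    def pick(predicate):
        """Append the first unused node that satisfies `predicate`."""
        for node in nodes:
            if node not in targets and predicate(node):
                targets.append(node)
                return True
        return False

    if len(targets) < replication:
        # 2nd replica: a node on a different rack than the first.
        first_rack = topology[targets[0]]
        pick(lambda n: topology[n] != first_rack)

    if len(targets) == 2 and len(targets) < replication:
        # 3rd replica: a different node on the same rack as the second.
        second_rack = topology[targets[1]]
        pick(lambda n: topology[n] == second_rack)

    # Any remaining replicas: whatever unused nodes are left.
    while len(targets) < replication and pick(lambda n: True):
        pass
    return targets


topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

from_datanode = choose_targets("dn2", topology)
from_outside = choose_targets("edge-host", topology)
print(from_datanode, from_outside)
```

With this layout, a writer running on `dn2` keeps one copy locally and places the other two on `rack2`, so losing either rack still leaves at least one replica.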

Summary

HDFS clients learn where to read or write by consulting the NameNode for metadata. After that, all data transfer happens directly between the client and the DataNodes, so data throughput scales with the number of DataNodes while the NameNode handles only lightweight metadata requests.