The Architecture of HDFS: NameNode, DataNodes, and Metadata

HDFS (Hadoop Distributed File System) was built to store large datasets reliably across clusters of commodity hardware and to provide high-throughput access to them. To do this, HDFS uses a master/worker architecture (also called master/slave) with two main types of nodes: a NameNode and many DataNodes.

1. The NameNode (Master)

The NameNode is the brain of HDFS. It manages:

Metadata: Keeps track of the filesystem namespace (file names, directories, permissions).
Block mapping: Records which blocks make up each file and which DataNodes hold a replica of each block.
Cluster status: Monitors the health of DataNodes through regular heartbeats.

The NameNode does not store file contents itself, only metadata about them.

Example:

If you store a 300MB file with a 128MB block size, the NameNode will know:

The file is divided into 3 blocks: two full 128MB blocks and one 44MB block.
The first replica of Block 1 is on DataNode A, of Block 2 on B, and of Block 3 on C.
Each block has 3 replicas in total (e.g., Block 1 on A, D, and F).
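
The block count in this example is plain ceiling division. A minimal sketch of the arithmetic, where `split_into_blocks` is an illustrative helper and not part of any HDFS API:

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # [128, 128, 44]
```

The last block is only as large as the data it holds, so the 44MB tail does not consume a full 128MB on disk.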

2. The DataNodes (Workers)

DataNodes are responsible for:

Storing the actual data blocks.
Serving read/write requests from HDFS clients.
Sending heartbeats and block reports to the NameNode.

DataNodes have no knowledge of files or directories. Each one knows only that it holds blocks identified by block IDs.
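
Heartbeat-based health monitoring can be sketched as a table of last-seen timestamps on the NameNode side. This is a toy model, not HDFS code: the `HeartbeatMonitor` class, its method names, and the 30-second timeout are all assumptions chosen for illustration.

```python
class HeartbeatMonitor:
    """Toy NameNode-side tracker of DataNode liveness (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its latest heartbeat

    def record_heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose latest heartbeat is older than the timeout."""
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.record_heartbeat("A", now=0)
monitor.record_heartbeat("B", now=0)
monitor.record_heartbeat("A", now=25)
print(monitor.dead_nodes(now=40))  # ['B']
```

Once a node is considered dead, the NameNode can schedule new replicas of the blocks it held on other DataNodes.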

3. The Client

The client interacts with both:

The NameNode, to get metadata (e.g., block locations).
The DataNodes, to read/write file blocks directly.

Because file data never passes through the NameNode, this design keeps the NameNode from becoming a bottleneck and allows high-throughput data transfer.
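
The two-step read path can be sketched with in-memory stand-ins. The `NameNode` and `DataNode` classes, the `read_file` function, and the block IDs below are hypothetical and do not mirror the real Hadoop client API; they only show that metadata comes from one place and bytes from another.

```python
class NameNode:
    """Toy metadata server: maps each path to its ordered block list."""

    def __init__(self):
        # path -> [(block_id, [ids of DataNodes holding a replica]), ...]
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Toy block store: knows block IDs, not file names."""

    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Fetch metadata from the NameNode, then block data from DataNodes."""
    data = bytearray()
    for block_id, locations in namenode.get_block_locations(path):
        # Try each replica holder in turn until one can serve the block.
        for node_id in locations:
            node = datanodes.get(node_id)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no reachable replica for {block_id}")
    return bytes(data)


namenode = NameNode()
namenode.block_map["/logs/app.log"] = [("blk_1", ["A", "B"]), ("blk_2", ["B", "A"])]

datanodes = {"A": DataNode(), "B": DataNode()}
for node in datanodes.values():
    node.blocks.update({"blk_1": b"hello ", "blk_2": b"world"})

print(read_file("/logs/app.log", namenode, datanodes))  # b'hello world'
```

In this sketch the NameNode answers one small lookup per file, while the bulk bytes flow only between the client and the DataNodes.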

4. Block-Based Storage

Files in HDFS are split into large blocks (128MB by default in Hadoop 2 and later; 256MB is a common override). These blocks are:

Stored independently.
Distributed across multiple DataNodes.
Replicated (default replication factor is 3) for fault tolerance.
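
To illustrate why replication gives fault tolerance, here is a simplified placement sketch. It spreads replicas round-robin over a list of DataNodes; this is an assumption made for brevity and not HDFS's actual placement policy, which also takes rack topology into account. The `place_replicas` helper is hypothetical.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


placement = place_replicas(3, ["A", "B", "C", "D", "E"])
for block, nodes in placement.items():
    print(block, nodes)
# 0 ['A', 'B', 'C']
# 1 ['B', 'C', 'D']
# 2 ['C', 'D', 'E']

# Losing one DataNode still leaves every block with two replicas.
survivors = {b: [n for n in nodes if n != "C"] for b, nodes in placement.items()}
print(min(len(nodes) for nodes in survivors.values()))  # 2
```

The cost of this safety is raw capacity: with a replication factor of 3, a 300MB file occupies about 900MB of disk across the cluster.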

5. How Metadata Is Stored

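The NameNode keeps metadata in memory for fast access and persists the filesystem namespace to disk in two forms:

A namespace image (fsimage): a point-in-time snapshot of the filesystem namespace.
An edit log (edits): a transaction log of the changes made since that snapshot.

On restart, the NameNode loads the fsimage and replays the edit log on top of it to restore the namespace. Block locations are not persisted; the NameNode rebuilds them from the block reports that DataNodes send.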
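
The snapshot-plus-log recovery can be sketched as follows. The dictionary snapshot, the tuple-shaped log records, and the operation names are hypothetical stand-ins; the real fsimage and edit log are binary formats with many more operation types.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "mkdir":
        namespace[path] = {"type": "dir"}
    elif op == "create":
        namespace[path] = {"type": "file"}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def restore(fsimage, edits):
    """Rebuild the namespace: start from the snapshot, replay edits in order."""
    namespace = copy.deepcopy(fsimage)
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/data": {"type": "dir"}, "/data/a.txt": {"type": "file"}}
edits = [
    ("create", "/data/b.txt"),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt"),
]

print(sorted(restore(fsimage, edits)))  # ['/data', '/data/c.txt']
```

Appending to a log is cheap compared with rewriting the whole snapshot on every change, which is why the two are kept separate and merged only periodically or at startup.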

Summary of HDFS Architecture

Component	Role
NameNode	Stores metadata and coordinates the cluster
DataNodes	Store actual file data (blocks)
Client	Reads/writes data by talking to both

This architecture allows HDFS to scale horizontally and handle very large volumes of data reliably, even in the face of hardware failures.