How HDFS Tracks Block Size and File Boundaries

When dealing with massive files, the Hadoop Distributed File System (HDFS) doesn’t store them as a single unit. Instead, it splits them into large, fixed-size blocks. But how does it know where each block starts and ends? Let’s dive into how HDFS tracks block size and file boundaries behind the scenes.

Fixed Block Size

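Each file in HDFS is split into blocks of a fixed size: 128MB by default in Hadoop 2 and later (64MB in Hadoop 1), with 256MB a common choice on larger clusters. The cluster-wide default comes from the dfs.blocksize setting, and a client can override it for an individual file when it creates that file. Once a file is written, its block size does not change. Every block except the last is exactly that size; the last block holds whatever bytes remain.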

A 3KB file → stored in one block that holds only 3KB. HDFS does not pad blocks, so the file takes about 3KB of disk per replica, not 128MB.
A 600MB file → stored in 5 blocks (4 × 128MB + 1 × 88MB).
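
The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration: the function name block_lengths is made up for this article and is not part of any HDFS API, and MB here means 1024 × 1024 bytes.

```python
MB = 1024 * 1024  # HDFS block sizes are powers of two, so 1MB = 1,048,576 bytes here


def block_lengths(file_size, block_size=128 * MB):
    """Return the length in bytes of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes; it is not padded.
        lengths.append(remainder)
    return lengths


print([n // MB for n in block_lengths(600 * MB)])  # [128, 128, 128, 128, 88]
print(block_lengths(3 * 1024))                     # [3072]
```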

How HDFS Knows the Boundaries

HDFS doesn’t interpret the content. It simply:

Counts bytes from the beginning of the file.
After reaching the block size limit (e.g., 128MB), it starts a new block.
This continues until the entire file is chunked into blocks.

Important: HDFS doesn’t care whether it’s breaking a sentence, a word, or a paragraph; it just splits by byte count.
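
To see what splitting by byte count looks like, here is a toy sketch that cuts a small CSV payload into 10-byte "blocks". The tiny block size and the helper name split_into_blocks are assumptions chosen so the effect is visible; real HDFS blocks are far larger and are written by DataNodes, not by a list comprehension.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut data into consecutive chunks of block_size bytes (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


text = b"alpha,1\nbravo,2\ncharlie,3\n"
blocks = split_into_blocks(text, 10)

for block in blocks:
    print(block)
# b'alpha,1\nbr'
# b'avo,2\nchar'
# b'lie,3\n'

# Concatenating the blocks in order gives back the original bytes.
print(b"".join(blocks) == text)  # True
```

Note how "bravo" and "charlie" are each cut in the middle: the split points depend only on the byte count.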

How the Metadata Tracks Boundaries

The NameNode’s metadata includes:

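The ordered list of blocks that make up each file
The exact number of bytes in each block
The offset of each block within the file, which follows from the lengths of the blocks before it (e.g., Block 0 = bytes 0–134217727)
The DataNodes that hold a replica of each block, which the NameNode learns from the block reports DataNodes send it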

This allows HDFS clients to reassemble the file byte-for-byte during reading.
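
As a rough model of how this metadata answers "which block holds byte N?", consider the sketch below. The BlockInfo class, its field names, and the DataNode names (dn1, dn2, ...) are hypothetical and exist only for this illustration; they are not the NameNode’s actual data structures.

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class BlockInfo:
    block_id: str    # hypothetical identifier
    num_bytes: int   # exact number of bytes in this block
    datanodes: list  # hypothetical names of nodes holding a replica


def locate(blocks, position):
    """Map a byte offset in the file to (block index, offset inside that block)."""
    # Block start offsets are the running totals of the preceding block lengths.
    starts = [0] + list(accumulate(b.num_bytes for b in blocks))
    if not 0 <= position < starts[-1]:
        raise ValueError("position is outside the file")
    index = bisect_right(starts, position) - 1
    return index, position - starts[index]


MB = 1024 * 1024
blocks = [BlockInfo(f"blk_{i}", 128 * MB, ["dn1", "dn2", "dn3"]) for i in range(4)]
blocks.append(BlockInfo("blk_4", 88 * MB, ["dn2", "dn3", "dn4"]))

index, offset = locate(blocks, 134217728)  # first byte after Block 0
print(index, offset, blocks[index].datanodes)  # 1 0 ['dn1', 'dn2', 'dn3']
```

Only the block lengths are needed to find the right block; the offsets fall out of the running totals.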

No Knowledge of Content Structure

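HDFS does not record where records begin or end, and it does not understand data formats or their delimiters (like line breaks or commas). To HDFS, a delimiter is just another byte in the file. Content interpretation (e.g., parsing JSON or CSV) is handled at the application level using tools like: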

Spark
Hive
MapReduce

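These tools handle records that straddle a block boundary. With line-oriented text input, for example, a reader whose portion of the file starts mid-file skips the partial first line, and the reader of the preceding portion reads past its own end to finish that line, so each record is processed exactly once.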
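
The idea can be modeled in plain Python. The sketch below is a simplified illustration, not Hadoop’s actual reader code: it assumes newline-terminated records, works on an in-memory bytes object instead of a remote file, and uses the rule that a record belongs to the byte range containing its first byte. The function name read_records is made up for this example.

```python
def read_records(data: bytes, start: int, end: int) -> list:
    """Return the newline-terminated records whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Look one byte back, then discard everything through the next newline.
        # If byte start-1 is itself a newline, a record begins exactly at start
        # and nothing is lost; otherwise the partial line belongs to the
        # previous range and is skipped here.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        # Read to the end of the record even if that goes past `end`.
        stop = len(data) if newline == -1 else newline + 1
        records.append(data[pos:stop])
        pos = stop
    return records


text = b"alpha,1\nbravo,2\ncharlie,3\n"
for start in range(0, len(text), 10):  # pretend each block is 10 bytes
    print(start, read_records(text, start, start + 10))
# 0 [b'alpha,1\n', b'bravo,2\n']
# 10 [b'charlie,3\n']
# 20 []
```

Each record comes out exactly once, even though two of the three records cross a 10-byte boundary.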

Summary

HDFS tracks block size and file boundaries purely by counting bytes, not by interpreting content. Each file’s blocks and their exact lengths are recorded in the NameNode’s metadata, so clients can reassemble the file byte-for-byte from blocks spread across DataNodes.