
How HDFS Tracks Block Size and File Boundaries

When dealing with massive files, Hadoop Distributed File System (HDFS) doesn’t read or store them as a whole. Instead, it splits them into large, fixed-size blocks. But how does it know where each block starts and ends? Let’s dive into how HDFS tracks block size and file boundaries behind the scenes.

Fixed Block Size

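Each file in HDFS is split into fixed-size blocks: 128 MB by default in current Hadoop releases, with 256 MB also common. Every block of a file is the same size except the last, which holds only the remaining bytes. The administrator sets the cluster-wide default (the dfs.blocksize property), and a client can override it for an individual file when the file is created. Once the file is written, its block size does not change.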
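To make the arithmetic concrete, here is a minimal sketch in plain Python, not HDFS code. The function name `block_layout` and the 300 MB example file are assumptions chosen for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the assumed block size for this example


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (number_of_blocks, size_of_last_block) for a file of file_size bytes."""
    if file_size == 0:
        return 0, 0  # an empty file has no blocks
    count = -(-file_size // block_size)  # ceiling division
    last = file_size - (count - 1) * block_size
    return count, last


# A 300 MB file needs two full 128 MB blocks plus one 44 MB block.
print(block_layout(300 * 1024 * 1024))  # (3, 46137344)
```

The last block occupies only as many bytes as it actually holds, so a small file does not consume a full 128 MB of storage.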

How HDFS Knows the Boundaries

HDFS doesn’t interpret the content. It simply:

  1. Counts bytes from the beginning of the file.
  2. After reaching the block size limit (e.g., 128MB), it starts a new block.
  3. This continues until the entire file is chunked into blocks.

Important: HDFS doesn’t care whether it’s breaking a sentence, a word, or a paragraph — it just splits by byte count.
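The byte-counting behavior can be imitated with a toy example. This is only a sketch: the helper `split_into_blocks` is hypothetical, and the 10-byte block size is far smaller than any real HDFS setting so that the cuts are easy to see.

```python
def split_into_blocks(data, block_size):
    """Cut data into consecutive chunks of block_size bytes; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


text = b"alpha,1\nbravo,2\ncharlie,3\n"  # three CSV-style records, 26 bytes
for block in split_into_blocks(text, 10):
    print(block)
# b'alpha,1\nbr'
# b'avo,2\nchar'
# b'lie,3\n'
```

Both cuts fall in the middle of a record ("bravo" and "charlie"), which is exactly what splitting purely by byte count produces.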

How the Metadata Tracks Boundaries

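For each file, the NameNode's metadata records the file's length, the ordered list of blocks that make up the file, and the length in bytes of each block.

Because the order and lengths are known, HDFS clients can reassemble the file byte-for-byte during reading.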
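As a rough illustration of why an ordered list of block lengths is enough to find any byte, the sketch below maps a file offset to a block index and an offset inside that block. The list `block_lengths` is a hypothetical in-memory stand-in for the metadata, and `locate` is an invented helper, not part of any HDFS API.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(block_lengths, offset):
    """Map a file offset to (block_index, offset_within_block)."""
    ends = list(accumulate(block_lengths))  # cumulative end offset of each block
    file_length = ends[-1] if ends else 0
    if not 0 <= offset < file_length:
        raise ValueError("offset outside file")
    index = bisect_right(ends, offset)  # first block whose end lies past offset
    block_start = ends[index] - block_lengths[index]
    return index, offset - block_start


block_lengths = [10, 10, 6]  # toy sizes; real blocks would be e.g. 128 MB
print(locate(block_lengths, 0))   # (0, 0)
print(locate(block_lengths, 10))  # (1, 0)
print(locate(block_lengths, 25))  # (2, 5)
```

Reading the whole file is then a matter of fetching the blocks in list order and concatenating their bytes.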

No Knowledge of Content Structure

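HDFS does not store delimiters or understand data formats (like line breaks or commas). Content interpretation (e.g., parsing JSON, CSV, etc.) is handled at the application level by processing frameworks such as MapReduce and Spark.

When a block starts or ends mid-record, the readers in these frameworks typically skip the partial record at the start of their portion and read past the end of it to finish the last record, so every record is processed exactly once.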
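One common convention for line-oriented data can be sketched as follows. This is a simplified illustration in plain Python, not the code of any framework: the function `lines_for_split` and the toy 10-byte splits are assumptions. Every reader except the first discards its first line, and every reader keeps going past its end offset until the line it started is finished.

```python
def lines_for_split(data, start, end):
    """Return the complete lines that the reader of the split [start, end) owns."""
    pos = start
    if start != 0:
        # Discard the first line: the previous split's reader finishes it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    # Read every line that begins at or before `end`, even if it ends later.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


text = b"alpha,1\nbravo,2\ncharlie,3\n"
for start in range(0, len(text), 10):
    print(start, lines_for_split(text, start, min(start + 10, len(text))))
# 0 [b'alpha,1\n', b'bravo,2\n']
# 10 [b'charlie,3\n']
# 20 []
```

Each record appears in exactly one split's output, even though two of the three records straddle a boundary.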

Summary

HDFS tracks block size and file boundaries purely by counting bytes, not by interpreting content. Each file’s size and its blocks are recorded in the NameNode’s metadata, ensuring precise reassembly of the data across distributed systems.

