What is HDFS in Hadoop? | Hadoop Distributed File System (HDFS)

(HDFS full form: Hadoop Distributed File System)

Do read: Introduction to Big Data, Introduction to Hadoop

Why is there a need for the Hadoop Distributed File System (HDFS)?

A file system helps you navigate the data held in your storage. Without a file system, the stored information would be one large chunk of data. For example, if you had to go to chapter 3 of a book, it would be a lot easier with a well-indexed book that has page numbers, whereas doing the same thing without an index or page numbers would be very difficult.

Major functions of a file system

  • Control how data is stored and retrieved
  • Maintain metadata about the files and folders
  • Enforce permissions and security
  • Manage storage space efficiently

Why another file system?

To support truly parallel computation, we have to divide the data into blocks and store them on different nodes, and to prevent data loss we need to replicate each block across several nodes.

With a conventional file system, each machine would store some part of the data, but no machine would know what is stored on the others, so there would be no way to manage data redundancy. A layer above these machines that is connected to each of them and manages them as a whole solves this problem.

HDFS takes care of all the complexities of a distributed file system. The figure below will help you understand the concept better.

[Figure: HDFS (Hadoop Distributed File System)]

Now, when you upload a file, it is automatically split into fixed-size blocks of 128 MB (64 MB in earlier versions) and distributed among the different machines.

HDFS also takes care of replicating data among the different nodes. By default it keeps three copies of each block. If a data node has a problem and loses its data, the data is still available from the other copies, and HDFS makes a new copy to restore the count. So, with the default setting, HDFS tries to keep three copies of your data at all times. It also keeps track of all the data blocks and their node assignments, so it always knows where your data is and how to reconstruct a file from its blocks.
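The re-replication behaviour described above can be illustrated with a toy model. This is only a sketch: the function `choose_replicas` and the node names `dn1` to `dn4` are made up for illustration, and the real HDFS placement policy is more involved (it also takes racks into account, for example).

```python
def choose_replicas(live_nodes, current_holders=(), replication=3):
    """Toy model: return the nodes that should hold a block.

    Copies on nodes that are still alive are kept, and missing copies
    are placed on other live nodes until `replication` copies exist
    (or no more live nodes are available).
    """
    surviving = [n for n in current_holders if n in live_nodes]
    # Sorting is only there to make this toy example deterministic.
    spare = [n for n in sorted(live_nodes) if n not in surviving]
    missing = max(0, replication - len(surviving))
    return surviving + spare[:missing]


datanodes = {"dn1", "dn2", "dn3", "dn4"}   # hypothetical node names
holders = choose_replicas(datanodes)        # initial placement of one block

datanodes.discard("dn2")                    # dn2 fails and its copy is lost
holders_after = choose_replicas(datanodes, holders)

print(holders)        # ['dn1', 'dn2', 'dn3']
print(holders_after)  # ['dn1', 'dn3', 'dn4']
```

After `dn2` fails, the block is still readable from `dn1` and `dn3`, and a new copy on `dn4` brings the count back to three.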

Suppose you have 700 MB of data. It will be divided into five blocks of 128 MB (640 MB in total) and one block of the remaining 60 MB, and when you request the file, HDFS will automatically give you back the complete 700 MB file.
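The arithmetic in this example can be checked with a few lines of Python. The helper name `split_into_blocks` is hypothetical and only mirrors the calculation; it is not how HDFS itself is implemented.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


blocks = split_into_blocks(700 * MB)
print([size // MB for size in blocks])  # [128, 128, 128, 128, 128, 60]
```

Note that the last block is smaller than the block size: it only takes up as much space as the data that is left over.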

Advantages of HDFS:

  • Supports distributed processing
    • Blocks of data are stored on different nodes, so they can be processed in parallel
  • Handles failures
    • HDFS was designed with the assumption that hardware can fail, so it keeps replicas to maintain data integrity
  • Scalability
    • Storage capacity can be increased by adding more data nodes to the cluster
  • Cost effective
    • Hadoop runs on commodity hardware

HDFS Architecture

As we already know, a file is distributed among different nodes. These nodes, which hold the actual data, are called data nodes.

How does Hadoop know where the data is stored?

Alongside the data nodes we have the Namenode, which holds the metadata for the data stored on the data nodes. The metadata includes where each block is stored and where its replicas are stored.
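As a rough illustration of the kind of metadata involved, here is a minimal sketch of a file-to-blocks and block-to-locations mapping. The class `NameNodeMetadata`, its methods, the path and the block and node names are all hypothetical; the real Namenode keeps much more information than this.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    """Toy model of Namenode metadata (not the real HDFS data structures)."""
    file_blocks: dict = field(default_factory=dict)      # path -> ordered block ids
    block_locations: dict = field(default_factory=dict)  # block id -> data nodes

    def add_file(self, path, blocks):
        """Record a file given as a list of (block_id, [data nodes]) in file order."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Return (block_id, [data nodes]) pairs in the order needed to rebuild the file."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


meta = NameNodeMetadata()
meta.add_file("/logs/app.log", [
    ("blk_1", ["dn1", "dn2", "dn3"]),
    ("blk_2", ["dn2", "dn3", "dn4"]),
])
print(meta.locate("/logs/app.log"))
```

A client that wants to read the file would ask for this list and then fetch each block from one of the data nodes that holds it.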

[Figure: HDFS Datanode and Namenode]

Although Hadoop runs on commodity hardware, it is not advisable to run the Namenode on commodity hardware, because if the Namenode fails the whole system goes down.

This problem is addressed by having Secondary and Stand-by Namenodes.

Active Namenode: the Namenode that is currently serving the cluster and storing the metadata.

Secondary Namenode: despite its name, it is not a backup Namenode. It periodically collects the metadata from the active Namenode and creates checkpoints of it.

Stand-by Namenode: the backup Namenode, which takes the place of the active Namenode if it fails.
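To make the Secondary Namenode's role a little more concrete: in Hadoop it periodically merges the Namenode's log of recent changes (the edit log) into a snapshot of the namespace (the fsimage), which is called checkpointing. The sketch below is a toy version of that merge. The edit format, the function names and the paths are assumptions made for illustration, not the real on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to the namespace snapshot (toy edit format)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = list(edit[2])  # edit[2] is the file's list of block ids
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Return a new snapshot with every logged edit applied; inputs are not modified."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged


fsimage = {"/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("delete", "/a.txt"),
]
new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/b.txt': ['blk_2', 'blk_3']}
```

Because the merged snapshot already contains the logged changes, the edit log can start again from empty, which keeps a Namenode restart from having to replay a very long log.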

[Figure: HDFS Secondary Namenode]

HDFS Federation

We already know that the namenode is very important for HDFS to work. One problem we face is that it can become a bottleneck. Suppose one namenode has 1,000 datanodes under it and handles them well. If the number of datanodes now increases to 5,000, all of them still look up to that single namenode for metadata, and it is possible that the namenode becomes overloaded or even fails.

[Figure: HDFS Federation]

Multiple Namenode/Namespaces

[Figure: HDFS multiple namespaces]

To scale horizontally we use multiple namenodes that are federated, that is, independent of each other. The namenodes use the datanodes as common storage for their blocks. Every datanode registers with all the namenodes in the cluster. A block pool is the collection of blocks that belong to a single namenode.
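The relationship between namenodes, datanodes and block pools can be sketched as a small data model. Everything here is hypothetical and simplified: the class names, the namespaces `/user` and `/logs`, and the pool ids `BP-1` and `BP-2` are made up for illustration.

```python
class NameNode:
    """Toy namenode: owns one namespace and one block pool."""
    def __init__(self, namespace, pool_id):
        self.namespace = namespace
        self.pool_id = pool_id
        self.datanodes = []  # datanodes that have registered with this namenode


class DataNode:
    """Toy datanode: stores blocks for every block pool in the cluster."""
    def __init__(self, name):
        self.name = name
        self.block_pools = {}  # block pool id -> set of block ids stored here

    def register(self, namenode):
        self.block_pools.setdefault(namenode.pool_id, set())
        namenode.datanodes.append(self)

    def store(self, pool_id, block_id):
        self.block_pools[pool_id].add(block_id)


namenodes = [NameNode("/user", "BP-1"), NameNode("/logs", "BP-2")]
datanodes = [DataNode("dn1"), DataNode("dn2")]

# Every datanode registers with all the namenodes in the cluster.
for dn in datanodes:
    for nn in namenodes:
        dn.register(nn)

datanodes[0].store("BP-1", "blk_1")  # a block belonging to the /user namespace
datanodes[0].store("BP-2", "blk_7")  # a block belonging to the /logs namespace
print(sorted(datanodes[0].block_pools))  # ['BP-1', 'BP-2']
```

The same datanode holds blocks from both pools, while each namenode only manages the blocks of its own pool.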
