Introduction to Hadoop
What is Hadoop?
Hadoop is open-source software, written in Java and maintained by Apache, that is used to analyze large data sets over a cluster of computers. These clusters are built from simple commodity computers that are used for distributed computing as well as distributed storage.
Why do we need Hadoop?
Let’s take an example:
Ox and the Load
Suppose we have an ox that can carry a load of 3 wooden logs, but after some time the load increases to 12 wooden logs.
So now how do we solve this problem?
Do we make our ox stronger, or do we get a second ox so the two can share the load of 12 logs? This simple example captures why Hadoop was needed: instead of making one machine stronger, share the work across many.
Before Hadoop, the traditional approach was to have a central server that connected to a database and returned data to the user. This worked well enough for modest data, but with rapidly growing data volumes, both scaling up (buying bigger hardware) and scaling out proved very expensive and inefficient with this architecture.
History of Hadoop
Around 2003-2004, Google was facing problems handling the huge amount of data it was collecting daily. Instead of the inefficient and expensive approach of scaling up, they developed MapReduce, a programming model that divides a large task among many machines connected over a network. Rather than relying on more powerful hardware, these machines ran on commodity hardware and were coordinated by a central system.
In 2005, Doug Cutting and Mike Cafarella were working on a search-engine project called Nutch, which included an open implementation of Google's MapReduce and distributed file storage. These parts were later factored out to form Hadoop. Hadoop got its name from the toy elephant that Doug Cutting's son used to play with.
Hadoop runs on clusters of commodity computers, using the MapReduce model to analyze large amounts of data and a distributed file system to store it.
Hadoop consists of four main components:
Hadoop Common: The common libraries and utility files that the other Hadoop modules need to run.
Hadoop Distributed File System (HDFS): HDFS is the distributed file system which allows us to store data over a cluster of commodity machines.
Hadoop MapReduce: As we read above, MapReduce is used for large-scale data processing.
Hadoop YARN: YARN stands for Yet Another Resource Negotiator. Introduced in Hadoop 2.0, it separates job scheduling and resource management from MapReduce and handles those duties across the cluster.
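To make the MapReduce model above concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases for a word count, written in plain Java with no Hadoop dependency. The class and method names (`WordCountSketch`, `map`, `shuffle`, `reduce`) are illustrative, not Hadoop's actual API; in a real Hadoop job these phases run on different machines across the cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the MapReduce idea on a single machine.
// Hadoop distributes these phases across a cluster; this sketch
// only shows the map -> shuffle -> reduce flow for a word count.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the counts collected for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data is big", "hadoop handles big data");
        Map<String, Integer> counts = reduce(shuffle(map(input)));
        System.out.println(counts.get("big"));  // 3
        System.out.println(counts.get("data")); // 2
    }
}
```

Because each map call touches only its own input lines and each reduce call touches only one key's values, the phases can run in parallel on many machines, which is exactly what Hadoop exploits.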