Introduction to Big Data
Big Data is an ambiguous, relative term; its constraints vary with the situation. What we can say for certain is that it is not ordinary data: it cannot be processed using regular methods.
How is Big Data relative?
As the name says, it is data and it is big. Let's have a look at some examples:
Suppose you want to email a file to a friend and the file exceeds the attachment limit. Conventional email cannot be used to send files that large.
Or suppose you are storing the log files of a website's users in an Excel sheet. Even though Excel can now store over a million rows (1,048,576), the sheet will eventually reach its limit and analyzing the logs there will no longer be possible. Web giants like Google, Facebook and Twitter generate terabytes or even petabytes of data daily; using this data is impossible unless we take a smarter approach.
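To make the log file example concrete, here is a minimal sketch of one "smarter approach": reading the log one line at a time instead of loading it all into a sheet. The "user_id,url" log format and the function name are assumptions made for illustration, not something a real web server is guaranteed to produce.

```python
from collections import Counter

EXCEL_MAX_ROWS = 1_048_576  # row limit of an Excel worksheet, as quoted above

def count_hits_per_user(lines):
    """Count log lines per user while holding only one line in memory at a time."""
    hits = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed (hypothetical) format: "user_id,url"
        user_id, _, _url = line.partition(",")
        hits[user_id] += 1
    return hits

# A small in-memory stand-in for a log; a file object can be passed the same way:
#   with open("access.log") as f:
#       hits = count_hits_per_user(f)
sample_log = ["alice,/home", "bob,/cart", "alice,/checkout"]
hits = count_hits_per_user(sample_log)
print(hits.most_common(1))  # [('alice', 2)]
```

Because the function only ever looks at the current line, the number of rows in the log no longer matters; only the number of distinct users has to fit in memory.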
Another way to define Big Data is to look at its most common characteristics. Three V's define Big Data more precisely: Volume, Velocity and Variety.
Now let's discuss them individually.
The term volume is self-explanatory: the data is so large that it cannot be handled using traditional methods. As explained earlier, the data generated by Internet giants like Facebook and Google is huge.
Never mind basic operations on this data; simply storing this much data is a problem in itself.
Velocity means that the data is coming in very fast. In conventional scientific research it could take months to gather data, weeks to analyze it and years to publish the research. This has changed a great deal: Twitter gets more than 6,000 tweets per second, which adds up to more than 500 million tweets per day and about 200 billion tweets per year.
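The daily and yearly figures follow from the per-second rate by simple arithmetic. A quick check, taking the 6,000 tweets per second quoted above as the only input and assuming a 365-day year:

```python
TWEETS_PER_SECOND = 6_000          # rate quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

tweets_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY
tweets_per_year = tweets_per_day * 365  # assumes a 365-day year

print(f"{tweets_per_day:,} tweets per day")    # 518,400,000 tweets per day
print(f"{tweets_per_year:,} tweets per year")  # 189,216,000,000 tweets per year
```

So "more than 500 million per day" and "about 200 billion per year" are both consistent with the per-second rate.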
This constant influx of data, better known as streaming data, presents special challenges for analysis because the data set itself is a moving target. Analyzing it can be daunting if you are accustomed to analyzing static data.
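One way to picture the difference from static data: with a stream you usually cannot wait for "all the data" before computing something, so the statistic has to be updated as each value arrives. Below is a minimal sketch of a running average; the class name and the sample values are made up for illustration.

```python
class RunningMean:
    """Keeps an average up to date without storing the values seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

stream = [10, 20, 30, 40]  # stand-in for values arriving over time
rm = RunningMean()
for value in stream:
    rm.update(value)

print(rm.mean)  # 25.0
```

The same idea (update a small summary, discard the raw value) is what makes analysis possible when the data never stops arriving.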
Now we get to the third aspect of Big Data, variety. What we mean here is not just rows and columns or formatted data in spreadsheets, for instance; we can also have a great deal of unstructured data like books, blog posts and tweets. It is often estimated that about 80% of enterprise data is unstructured, and unstructured data can also include photos, audio and video. If we are dealing with NoSQL we may have graphs, hierarchical structures or documents: any set of data that may not fit into a rows-and-columns structure.
As you can see, analysis of such uneven data is challenging. Studies show that variety is the major factor leading companies to Big Data solutions; in fact, variety was mentioned over four times as often as data volume.
Do we need all three V's (Volume, Velocity, Variety) to have Big Data?
It may be true that if we have all three V's at once then we have Big Data, but any one of them alone can be too much for our standard approach to data. What Big Data really means is that we cannot use our standard approach to work with the data.
Where is Big Data used and how is it important for us?
Big Data for Consumers
Believe it or not, Big Data has been around us for a long time. We have been using it and seeing it, yet it has remained unknown to us. Some examples of Big Data in consumer industries are as follows:
Ever used Siri on Apple's iPhone or iPad? When you ask Siri for the temperature, it automatically knows what you mean and where you are. It can also search for nearby restaurants and tell you whether a table is available.
Another example is the related items you see on your favorite e-commerce websites like Amazon or Flipkart. These companies use your historical data, your latest searches and related searches from others to suggest similar items that you are more likely to buy. This can be done using Apache Mahout, which is used for building recommendation engines.
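To give a feel for how "people who bought this also bought that" suggestions can work, here is a tiny sketch in plain Python. It is not Mahout and does not use Mahout's API; it simply counts which items show up together in past purchases, and the baskets and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    co = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co.setdefault(a, Counter())[b] += 1
            co.setdefault(b, Counter())[a] += 1
    return co

def recommend(co, history, top_n=3):
    """Suggest items that most often co-occur with what the user already has."""
    owned = set(history)
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical purchase histories
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "headphones"],
    ["laptop", "mouse"],
]
co = build_cooccurrence(baskets)
print(recommend(co, ["phone"]))   # case and charger rank above headphones
print(recommend(co, ["laptop"]))  # ['mouse']
```

A real recommendation engine works on far more data and uses better similarity measures, but the principle of learning from other users' behavior is the same.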
Big Data for Business
For the business world, Big Data is revolutionizing the way people do commerce. Some examples where Big Data has proven to be particularly useful are:
Everyone must have seen this: every time you search for a term on Google or any other search engine, you start getting ads relevant to that search term. Nowadays you even get recommended pages on Facebook related to the terms you searched for. If you searched for Big Data, for example, you would get recommendations for Big Data related Facebook pages.
Predictive Marketing: Here companies predict what you are going to need in the near future. For example, if you bought a new house and this data is somehow available to other companies, they can predict that you may need new furniture, curtains, etc.
Fraud Detection: If any anomalies are found in the pattern by which you generally buy items with your card, the credit/debit card company can contact you and the card can be blocked. This use of Big Data comes in handy if someone steals your credit or debit card. 😛
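As a rough illustration of the fraud detection idea, the sketch below flags a purchase whose amount is far from a cardholder's usual spending. Real card companies use much richer signals (location, merchant, timing); the threshold of three standard deviations, the function name and the amounts are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_suspicious(past_amounts, new_amount, threshold=3.0):
    """Flag a purchase that deviates strongly from past spending.

    Requires at least two past amounts, because stdev() needs two data points.
    """
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        # All past purchases were identical: anything different stands out
        return new_amount != mu
    z_score = abs(new_amount - mu) / sigma
    return z_score > threshold

# Hypothetical purchase history for one card
history = [42.0, 38.5, 51.0, 45.2, 40.3, 47.8]

print(is_suspicious(history, 44.0))   # False: a typical purchase
print(is_suspicious(history, 900.0))  # True: far outside the usual range
```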
Big Data for Research
Big Data is used extensively for research purposes. Many research organizations use Big Data to detect epidemics and to search for cures for many diseases. NASA uses data sent by the Kepler space telescope to find exoplanets, that is, planets outside our Solar System.
Data Science Venn Diagram
Data Science involves a combination of three different skills: Statistics, Domain Knowledge and Coding. For statistics, a person should be good at maths, since even a small mathematical error can cause a problem. Besides statistics, a person should have good domain knowledge of the field in which they want to apply Data Science. For example, if you want to apply Data Science in marketing, you should have a good idea of how marketing works. Last but not least, we should know about coding and algorithms and how they work.
Statistics + Domain Knowledge = Traditional Research
It is quite obvious that if we know statistics and have good domain knowledge we can analyze data traditionally. This method is not bad, but it is clearly not suitable for Big Data.
Domain Knowledge + Coding = Danger Zone
This is a very sparse area, where a person has knowledge of both coding and a specific domain but not of statistics.
Coding + Statistics = Machine Learning
Machine learning is where an algorithm or program updates itself, or evolves. The most common example is the spam filter in email. The program behind it updates with each new piece of information it gets from its users, and its accuracy increases the more data it gets. The machine is learning by itself without having any domain knowledge.
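The spam filter example can be sketched in a few lines. The version below is a toy word-counting (naive Bayes style) filter, not the algorithm any particular email provider uses; the class name, the smoothing choice and the training messages are assumptions for illustration. The point to notice is that the filter has no built-in rules: its answers change only because users label messages.

```python
import math
from collections import Counter

class TinySpamFilter:
    """Toy filter that learns word frequencies from user-labelled messages."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # Each labelled message updates the counts the filter relies on
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_msgs = sum(self.message_counts.values())
        # Add-one smoothing so unseen words or labels never give log(0)
        score = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocab + 1))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")

spam_filter = TinySpamFilter()
print(spam_filter.is_spam("free money prize"))  # False: nothing learned yet

# Hypothetical messages labelled by users
spam_filter.learn("win free money now", "spam")
spam_filter.learn("free prize click now", "spam")
spam_filter.learn("meeting at noon tomorrow", "ham")
spam_filter.learn("lunch tomorrow with the team", "ham")

print(spam_filter.is_spam("free money prize"))       # True
print(spam_filter.is_spam("team meeting tomorrow"))  # False
```

The same message is classified differently before and after training, which is exactly the "updates itself" behavior described above.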