What is Hadoop?

Looking back at how data was handled just a few years ago, the task was fairly easy because the amount of data we worked with was limited. A single processor and a single storage unit were enough to handle it. Data was mostly structured and lived in a database, and SQL queries could comb through giant tables with many rows and columns.

As the years went by and data generation increased, volumes grew higher and new formats emerged. Hence, multiple processors were used to process the data in order to save time.

However, a single storage unit became the bottleneck, generating network overhead. The solution was to give each processor its own storage, which made data access easier. This approach is known as parallel processing with distributed storage: various computers run the processes against various storage units.

Big Data and its Challenges

Big data refers to amounts of data so massive that they cannot be stored, processed, or analyzed using traditional approaches.

The main elements of big data are:

  • Volume - The massive amount of data generated every second
  • Velocity - The speed at which data is generated, collected, and analyzed
  • Variety - The different types of data: structured, semi-structured, and unstructured
  • Value - The ability to turn data into useful insights for your business
  • Veracity - Trustworthiness in terms of quality and accuracy

The main challenges of big data, and the solution to each, are listed below:

  • Single central storage → Distributed storage
  • Serial processing (one input, one processor, one output) → Parallel processing (multiple inputs, multiple processors, one output)
  • Inability to process unstructured data → Ability to process every type of data

Hadoop and its Components

Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is the most commonly used software for handling big data. Hadoop has two core components:

  1. Hadoop HDFS - The Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
  2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.

Hadoop HDFS

Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and data node. While there is only one name node, there can be multiple data nodes.
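
To make this concrete, here is a minimal sketch of an HDFS client written against Hadoop's Java FileSystem API. The file path is hypothetical, and the sketch assumes a core-site.xml on the classpath whose fs.defaultFS points at the name node; note that the client never has to know which data nodes end up holding the blocks.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; assumed to point at the name node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // hypothetical path, for illustration only
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```

Behind the scenes, the name node resolves the path to block locations on the data nodes, which is exactly the division of labor described below.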

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte. If you needed to buy 100 of these enterprise servers, the cost would approach a million dollars.

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on your data nodes. However, the name node is always an enterprise server.

Features of HDFS

  • Provides distributed storage
  • Can be implemented on commodity hardware
  • Provides data security
  • Highly fault tolerant - If one machine goes down, copies of its data are already available on other machines

Master and slave nodes

Master and slave nodes form the HDFS cluster. The name node is called the master and the data nodes are called the slaves.

The name node manages the data nodes. It also stores the metadata - for example, which blocks make up each file and which data nodes hold them.

The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node; these heartbeats report the status of the data node.
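
A client can ask the name node what those heartbeats have told it. Here is a small sketch, assuming an HDFS cluster reachable through the configuration on the classpath; the DistributedFileSystem cast and the reported fields are part of the standard HDFS client API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The name node knows each data node's state from its heartbeats.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.printf("%s: %d of %d bytes free%n",
                        node.getHostName(), node.getRemaining(), node.getCapacity());
            }
        }
        fs.close();
    }
}
```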

Consider 30 TB of data being loaded into the cluster. The name node distributes it across the data nodes, and each block of that data is replicated among them, so the same block ends up on several data nodes.

By default, the data is replicated three times. This is done so that if a commodity machine fails, you can replace it with a new machine while the same data remains available from the other copies.
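
The replication factor is just a setting, and it can be read or changed per file. Below is a hedged sketch using the same FileSystem API as above; /tmp/hello.txt is the hypothetical file from the earlier example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication defaults to 3; clusters may raise or lower it.
        System.out.println("default replication: " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt"); // hypothetical file from the earlier sketch
        FileStatus status = fs.getFileStatus(path);
        System.out.println("current replication: " + status.getReplication());

        // Ask the name node to keep five copies of this file instead of three.
        fs.setReplication(path, (short) 5);
        fs.close();
    }
}
```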

Hadoop MapReduce

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.

Rather than moving the data to the code, Hadoop ships the code to the data. This code is usually very small in comparison to the data itself: sending a few kilobytes of code is enough to trigger heavy-duty processing on the machines that already hold the data.

The input dataset is first split into chunks. In this example, the input has three lines of text - "bus car train", "ship ship train", "bus ship car". The dataset is split into three chunks, one per line, and the chunks are processed in parallel.

In the map phase, each word is emitted as a key with a value of 1. From the first chunk, for instance, we get (bus, 1), (car, 1), and (train, 1).

These key-value pairs are then shuffled and sorted by key, so all the values for the same word end up together. In the reduce phase, the values for each key are summed and the final output is obtained: (bus, 2), (car, 2), (ship, 3), (train, 2).
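
This word count is the canonical MapReduce example, and the classic Hadoop Java version is short enough to read in full. The sketch below uses the standard org.apache.hadoop.mapreduce API; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // e.g. (bus, 1), (car, 1), (train, 1)
            }
        }
    }

    // Reduce phase: sum the 1s for each word after the shuffle and sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // e.g. (ship, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Registering the reducer as a combiner is a common optimization: each mapper pre-aggregates its own (word, 1) pairs before the shuffle, which reduces network traffic.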

Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.

  • Hadoop YARN acts like an OS for Hadoop. It runs on top of HDFS.
  • It is responsible for managing cluster resources to make sure you don't overload one machine.
  • It performs job scheduling to make sure jobs run in the right place.

Suppose that a client machine wants to run a query or fetch some code for data analysis. The job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management.

In the node section, each node has its own node manager, which manages the node and monitors its resource usage. Containers hold a collection of physical resources, such as RAM, CPU, or hard drives. Whenever a job request comes in, the application master asks the resource manager for containers; the node managers then launch those containers and report resource usage back to the resource manager.
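
A client can talk to the resource manager directly through the YarnClient API, for example to list the applications it is currently tracking. Here is a minimal sketch, assuming a yarn-site.xml on the classpath that points at the resource manager.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        // Connects to the resource manager named in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Ask the resource manager for every application it knows about.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```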

Use Case of Hadoop

Hadoop has a great many use cases. In this case study, we will discuss how Hadoop can combat fraudulent activity.

Let us look at the case of Zions Bancorporation. The Zions security team's main challenge was combating fraudulent activity, and the problem was that their RDBMS-based setup could not store and analyze the huge amounts of data involved.

In other words, they could analyze only small amounts of data, and with a flood of customers coming in, there was a great deal they could not keep track of.

They began to use parallel processing, but the data was unstructured, so analyzing it that way was not possible. They had not only a huge amount of data that would not fit into their databases, but also data in formats those databases could not handle.

Hadoop enabled the Zions team to pull all that massive data together and store it in one place. It also became possible to process and analyze the huge amounts of unstructured data they had. Analysis became more time-efficient, and in-depth analysis of various data formats became easier. Zions' team could now detect everything from malware and spear-phishing attempts to account takeovers.

In Conclusion

We have seen how Hadoop helps banks protect their clients' money, and ultimately their own wealth and reputation. But the benefits of Hadoop go well beyond this, and it can be good for nearly any business.

If you want to learn more about Hadoop, Simplilearn's Big Data Hadoop Training Course is an online, instructor-led Hadoop training that will help you master Big Data and Hadoop ecosystem tools such as HDFS, YARN, MapReduce, Hive, Impala, Pig, HBase, Spark, Flume, Sqoop, and Hadoop frameworks, along with other concepts of the Big Data processing life cycle. The course will also prepare you for Cloudera's CCA175 Big Data certification.

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies. Based in San Francisco, California, and Bangalore, India, Simplilearn has helped more than 500,000 students, professionals and companies across 200 countries get trained, upskilled, and acquire certifications.
