The data management and architecture field is constantly evolving and has reached an unprecedented level of sophistication. Globally, more than 2.5 quintillion bytes of data are created every day, and 90 percent of all the data in the world was generated in the last two years (Forbes). Data is the fuel for machine learning and meaningful insights across industries, so organizations are getting serious about how they collect, curate, and manage information.
This article will help you learn more about the vast world of Big Data and the challenges that come with it. And in case you think Big Data and its challenges are not a big deal, here are some facts that will help you reconsider:
- About 300 billion emails get exchanged every day (Campaign Monitor)
- 400 hours of video are uploaded to YouTube every minute (Brandwatch)
- Worldwide retail eCommerce accounts for more than $4 trillion in sales (Shopify)
- Google receives more than 63,000 search queries every minute (SEO Tribunal)
- By 2025, real-time data will account for more than a quarter of all data (IDC)
Before we jump into the challenges of Big Data, let’s start with the five ‘V’s of Big Data.
The Five ‘V’s of Big Data
Big Data is simply a catchall term used to describe data too large and complex to store in traditional databases. The “five ‘V’s” of Big Data are:
- Volume – The amount of data generated
- Velocity - The speed at which data is generated, collected and analyzed
- Variety - The different types of structured, semi-structured and unstructured data
- Value - The ability to turn data into useful insights
- Veracity - Trustworthiness in terms of quality and accuracy
What Does Facebook Do with Its Big Data?
Facebook collects vast volumes of user data (in the range of petabytes, or 1 million gigabytes) in the form of comments, likes, interests, friends, and demographics. Facebook uses this information in a variety of ways:
- To create personalized and relevant news feeds and sponsored ads
- For photo tag suggestions
- Flashbacks of photos and posts with the most engagement
- Safety check-ins during crises or disasters
Next up, let us look at a Big Data case study, understand its nuances, and then examine some of the challenges of Big Data.
Big Data Case Study
As the number of Internet users grew throughout the last decade, Google was challenged with how to store so much user data on its traditional servers. With thousands of search queries submitted every second, the retrieval process consumed hundreds of megabytes and billions of CPU cycles. Google needed an extensive, distributed, highly fault-tolerant file system to store and process the queries. In response, Google developed the Google File System (GFS).
The GFS architecture consists of one master and multiple chunk servers (slave machines). The master holds the metadata, and the chunk servers store the actual data in a distributed fashion. Whenever a client application wants to read data, it first contacts the master, which responds with the metadata. The client then uses this metadata to send read/write requests directly to the chunk servers, which return the data (a simplified sketch of this read path follows the list below).
The files are divided into fixed-size chunks and distributed across the chunk servers or slave machines. Features of the chunk servers include:
- Each chunk holds 64 MB of data (HDFS, Hadoop's equivalent, uses 128 MB blocks from version 2 onwards)
- By default, each chunk is replicated three times across different chunk servers
- If any chunk server crashes, the data remains available on the other chunk servers holding its replicas
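To make the read path above concrete, here is a minimal, purely illustrative Java sketch of a GFS-style lookup: the client asks the master for metadata (which chunks make up a file and where they live), then fetches the bytes from the chunk servers. The class names, in-memory maps, and chunk layout are assumptions for illustration only, not Google's actual implementation.

```java
import java.util.*;

// Illustrative-only sketch of a GFS-style read path: the client asks a
// master for chunk locations (metadata), then fetches the bytes from the
// chunk servers. Names and data structures are invented for clarity.
public class GfsReadSketch {

    // Master: stores only metadata (file -> chunk handles -> server ids).
    static class Master {
        final Map<String, List<String>> fileToChunks = new HashMap<>();
        final Map<String, String> chunkToServer = new HashMap<>();

        List<String[]> locate(String file) {           // each entry: [chunkHandle, serverId]
            List<String[]> locations = new ArrayList<>();
            for (String chunk : fileToChunks.getOrDefault(file, List.of())) {
                locations.add(new String[] { chunk, chunkToServer.get(chunk) });
            }
            return locations;
        }
    }

    // Chunk server: stores the actual chunk bytes.
    static class ChunkServer {
        final Map<String, byte[]> chunks = new HashMap<>();
        byte[] read(String chunkHandle) { return chunks.get(chunkHandle); }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Map<String, ChunkServer> servers =
                Map.of("cs-1", new ChunkServer(), "cs-2", new ChunkServer());

        // Pretend "logs.txt" was split into two chunks stored on two servers.
        servers.get("cs-1").chunks.put("chunk-001", "first 64 MB ... ".getBytes());
        servers.get("cs-2").chunks.put("chunk-002", "second 64 MB ...".getBytes());
        master.fileToChunks.put("logs.txt", List.of("chunk-001", "chunk-002"));
        master.chunkToServer.put("chunk-001", "cs-1");
        master.chunkToServer.put("chunk-002", "cs-2");

        // Client read: metadata from the master, data from the chunk servers.
        StringBuilder contents = new StringBuilder();
        for (String[] loc : master.locate("logs.txt")) {
            contents.append(new String(servers.get(loc[1]).read(loc[0])));
        }
        System.out.println(contents);
    }
}
```

In the real system (and in HDFS), the client caches this metadata and streams chunk data directly from the chunk servers, so the master handles only metadata and never becomes a data bottleneck.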
Next, let us take a look at the challenges of Big Data and their possible solutions.
Challenges of Big Data
Storage
With vast amounts of data generated daily, storage is the greatest challenge, especially when the data arrives in many different formats that legacy systems were never designed to handle. Unstructured data, in particular, cannot be stored in traditional relational databases.
Processing
Processing Big Data refers to reading, transforming, extracting, and formatting useful information from raw data. Getting information in and out of systems in unified formats continues to present difficulties.
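As a toy example of what a "unified format" means in practice, here is a short, illustrative Java sketch (the record layouts and field names are invented for this example) that normalizes two differently structured raw log lines into a single common shape before any downstream analysis:

```java
import java.util.List;

// Illustrative only: normalize raw records that arrive in different
// layouts into one unified structure. The formats and fields are invented.
public class UnifyFormats {

    record Event(String user, String action, long timestamp) {}

    // Format A: "user,action,timestamp" (CSV-style)
    static Event fromCsv(String line) {
        String[] parts = line.split(",");
        return new Event(parts[0], parts[1], Long.parseLong(parts[2]));
    }

    // Format B: "timestamp|action|user" (pipe-delimited, different field order)
    static Event fromPipe(String line) {
        String[] parts = line.split("\\|");
        return new Event(parts[2], parts[1], Long.parseLong(parts[0]));
    }

    public static void main(String[] args) {
        List<Event> unified = List.of(
                fromCsv("alice,login,1700000000"),
                fromPipe("1700000360|logout|alice"));
        unified.forEach(System.out::println);  // same shape, ready for downstream processing
    }
}
```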
Security
Security is a major concern for organizations. Unencrypted information is at risk of theft or damage by cyber-criminals, so data security professionals must balance providing access to data with maintaining strict security protocols.
And now that we know the challenges of Big Data, let’s take a look at the solutions too!
Hadoop as a Solution
Hadoop, an open-source framework for storing data and running applications on clusters of commodity hardware, consists of two main components:
Hadoop HDFS
Hadoop Distributed File System (HDFS) is the storage unit of Hadoop. It is a fault-tolerant, reliable, and scalable storage layer of the Hadoop cluster. Designed to run on commodity, low-cost hardware, HDFS provides access to data distributed across the many machines of a Hadoop cluster. HDFS has a default block size of 128 MB from Hadoop version 2 onwards, which can be increased based on requirements.
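As a quick sketch of how applications talk to HDFS, the snippet below uses Hadoop's Java FileSystem API to write a small file and read it back. The NameNode address, paths, and the block-size override are placeholders to adapt to your own cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: write a small file into HDFS and read it back.
// The NameNode URI and paths below are placeholders for illustration.
public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address
        conf.set("dfs.blocksize", "268435456");           // optional: raise block size to 256 MB

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: HDFS splits the stream into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```

The same operations are also available from the command line through the hdfs dfs shell.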
Hadoop MapReduce
Hadoop MapReduce allows the user to perform distributed parallel processing on large volumes of data quickly and efficiently.
Hadoop Ecosystem
Hadoop also provides Big Data security, offering end-to-end encryption to protect data at rest within the Hadoop cluster and in transit across networks. Each processing layer has multiple processes running on different machines within a cluster. The components of the Hadoop ecosystem, while evolving every day, include:
- Sqoop - For ingestion of structured data from a Relational Database Management System (RDBMS) into HDFS (and export back)
- Flume - For ingestion of streaming or unstructured data directly into HDFS or a data warehouse system (such as Hive)
- Hive - A data warehouse system on top of HDFS in which users can write SQL queries to process data
- HCatalog - Enables the user to store data in any format and structure
- Oozie - A workflow manager used to schedule jobs on the Hadoop cluster
- Apache ZooKeeper - A centralized service of the Hadoop ecosystem, responsible for coordinating large clusters of machines
- Pig - A language allowing concise scripting to analyze and query datasets stored in HDFS
- Apache Drill - Supports data-intensive distributed applications for interactive analysis of large-scale datasets
- Mahout - For machine learning
Test your understanding of Hadoop concepts like HDFS, MapReduce, Sqoop, and more with the Big Data and Hadoop Developer Practice Test.
MapReduce Algorithm
Hadoop MapReduce is among the oldest and most mature processing frameworks. Google introduced the MapReduce programming model in 2004 to process and generate large datasets across clusters of servers. Developers use MapReduce to process data in two phases, illustrated by the WordCount sketch after the list below:
- Map Phase - Applies a function or computation to every input record and emits intermediate key-value pairs; the framework then sorts and shuffles these pairs so that all values belonging to the same key are grouped together
- Reduce Phase - Aggregates the grouped values for each key, discarding unneeded data and retaining the necessary information to produce the final output
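The sketch below follows the classic Hadoop WordCount pattern to show both phases end to end: the mapper emits a (word, 1) pair for every token, the framework sorts and shuffles those pairs by key, and the reducer sums the counts for each word. The input and output HDFS paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework has already grouped values by key,
    // so each call receives one word plus all of its counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this would typically be submitted with hadoop jar wordcount.jar WordCount followed by the input and output paths; note that the output directory must not already exist.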
Conclusion
Now that you have understood the five ‘V’s of Big Data, walked through a Big Data case study, and seen the challenges of Big Data along with some of their solutions, it’s time to scale up your knowledge and become industry ready. Most organizations are making use of Big Data to draw insights and support strategic business decisions. Simplilearn's Big Data Engineer Master's Program and the Big Data Hadoop Training Course will help you master Big Data and Hadoop ecosystem tools such as HDFS, YARN, MapReduce, Hive, Impala, Pig, HBase, Spark, Flume, Sqoop, and Hadoop frameworks, and learn the critical concepts of data processing too!