Big Data Tutorial

The data management and architecture field is evolving constantly and has reached an unprecedented level of sophistication. Globally, more than 2.5 quintillion bytes of data are created every day, and roughly 90 percent of the world’s data was generated in just the last few years (Forbes). Because data is the fuel for machine learning and meaningful insights across industries, organizations are getting serious about how they collect, curate, and manage information.

Don’t think Big Data is a big deal? Consider these facts:

  • About 300 billion emails get exchanged every day (Campaign Monitor)
  • 400 hours of video are uploaded to YouTube every minute (Brandwatch)
  • Worldwide retail eCommerce accounts for more than $4 trillion in sales (Shopify)
  • Google receives more than 63,000 search queries every minute (SEO Tribunal)
  • By 2025, real-time data will account for more than a quarter of all data (IDC)

The Five ‘V’s of Big Data

Big Data is simply a catchall term used to describe data too large and complex to store in traditional databases. The “five ‘V’s” of big data are:

  • Volume – The amount of data generated
  • Velocity - The speed at which data is generated, collected and analyzed
  • Variety - The different types of structured, semi-structured and unstructured data
  • Value - The ability to turn data into useful insights
  • Veracity - Trustworthiness in terms of quality and accuracy 

What Does Facebook Do with Its Big Data?

Facebook collects vast volumes of user data (in the range of petabytes; one petabyte is a million gigabytes) in the form of comments, likes, interests, friends, and demographics. Facebook uses this information in a variety of ways:

  • To create personalized and relevant news feeds and sponsored ads
  • For photo tag suggestions
  • Flashbacks of photos and posts with the most engagement
  • Safety check-ins during crises or disasters

Big Data Case Study

As the number of Internet users grew throughout the last decade, Google was challenged with how to store so much user data on its traditional servers. With thousands of search queries submitted every second, the retrieval process was consuming hundreds of megabytes of bandwidth and billions of CPU cycles. Google needed an extensive, distributed, highly fault-tolerant file system to store and process the queries. In response, Google developed the Google File System (GFS).

The GFS architecture consists of one master and multiple chunk servers (slave machines). The master holds the metadata, while the chunk servers store the actual data in a distributed fashion. Whenever a client wants to read data through the API, it contacts the master, which responds with the metadata. The client then uses that metadata to send read/write requests directly to the chunk servers, which return the data. A toy sketch of this read path follows the list below.

The files are divided into fixed-size chunks and distributed across the chunk servers or slave machines. Features of the chunk servers include:

  • Each chunk holds 64 MB of data (the equivalent HDFS block size is 128 MB from Hadoop version 2 onwards)
  • By default, each chunk is replicated three times across different chunk servers
  • If a chunk server crashes, replicas of its chunks remain available on the other chunk servers
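
To make the read path concrete, here is a minimal, self-contained Python sketch. It is illustrative only: the Master, ChunkServer, and Client classes and their methods are invented for this example and are not part of GFS or Hadoop. The point it shows is that the master serves only metadata, while the chunk data is fetched directly from the chunk servers.

```python
# Toy sketch of the GFS-style read path described above (illustrative only).

class Master:
    """Holds metadata only: which chunks make up a file and where they live."""
    def __init__(self):
        self.metadata = {}  # filename -> list of (chunk_id, [chunk_server_name, ...])

    def lookup(self, filename):
        return self.metadata[filename]


class ChunkServer:
    """Stores the actual chunk data."""
    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> bytes

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]


class Client:
    def __init__(self, master, chunk_servers):
        self.master = master
        self.chunk_servers = chunk_servers

    def read(self, filename):
        # Step 1: ask the master for metadata (chunk ids and their locations).
        data = b""
        for chunk_id, locations in self.master.lookup(filename):
            # Step 2: fetch each chunk directly from one of its chunk servers;
            # the master never serves file data itself.
            data += self.chunk_servers[locations[0]].read_chunk(chunk_id)
        return data


# Wire up a tiny "cluster": one master, two chunk servers, one file in two chunks.
master = Master()
servers = {"cs1": ChunkServer("cs1"), "cs2": ChunkServer("cs2")}
servers["cs1"].chunks["f1-c0"] = b"hello "
servers["cs2"].chunks["f1-c1"] = b"world"
master.metadata["/logs/f1"] = [("f1-c0", ["cs1"]), ("f1-c1", ["cs2"])]

client = Client(master, servers)
print(client.read("/logs/f1"))  # b'hello world'
```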

Challenges of Big Data

Storage

With vast amounts of data generated daily, the greatest challenge is storing it within legacy systems, especially when the data arrives in many different formats. Unstructured data cannot be stored in traditional databases.

Processing

Processing big data refers to reading, transforming, extracting, and formatting useful information from raw data. The input and output of information in unified formats continue to present difficulties.

Security

Security is a big concern for organizations. Non-encrypted information is at risk of theft or damage by cyber-criminals. Therefore, data security professionals must balance access to data against maintaining strict security protocols.
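
As a small, generic illustration of why encryption at rest matters, the Python snippet below encrypts and decrypts a record with the third-party cryptography package. This is an assumption made for the example only; it is not how Hadoop secures data (Hadoop relies on mechanisms such as HDFS transparent encryption and Kerberos authentication), but it shows that stolen ciphertext is useless without the key.

```python
# Generic encryption-at-rest illustration (not Hadoop-specific).
# Assumes the third-party "cryptography" package: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, keep keys in a KMS, never in code
cipher = Fernet(key)

record = b"user_id=42,email=jane@example.com"
encrypted = cipher.encrypt(record)   # this is what would be written to disk
decrypted = cipher.decrypt(encrypted)

assert decrypted == record
print(encrypted)                     # unreadable without the key
```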

Hadoop as a Solution

Hadoop, an open-source framework for storing data and running applications on clusters of commodity hardware, consists of two main components:

Hadoop HDFS

Hadoop Distributed File System (HDFS) is the storage unit of Hadoop. It is a fault-tolerant, reliable, and scalable storage layer of the Hadoop cluster. Designed to run on low-cost commodity hardware, HDFS distributes data across multiple machines in the cluster and makes it accessible to applications running on any of them. HDFS has a default block size of 128 MB from Hadoop version 2 onwards, which can be increased based on requirements.
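
As a rough illustration of how block size and replication interact, the short Python sketch below splits a hypothetical 600 MB file into 128 MB blocks and assigns three replicas of each block to DataNodes in round-robin fashion. This is a toy model, not HDFS code: the function and node names are made up, and real HDFS placement is rack-aware.

```python
import math

# Toy model of HDFS block splitting and replication (illustrative only).
BLOCK_SIZE_MB = 128      # HDFS default block size from Hadoop 2 onwards
REPLICATION = 3          # default replication factor


def plan_blocks(file_size_mb, datanodes):
    """Return a block -> DataNode placement plan for a file of the given size."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    plan = {}
    for block_id in range(num_blocks):
        # Simple round-robin placement of the replicas (real HDFS is rack-aware).
        plan[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                          for r in range(REPLICATION)]
    return plan


placement = plan_blocks(file_size_mb=600, datanodes=["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(f"block {block}: replicas on {nodes}")   # 5 blocks of up to 128 MB each
```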

Hadoop MapReduce

Hadoop MapReduce allows the user to perform distributed parallel processing on large volumes of data quickly and efficiently.

Hadoop Ecosystem

Hadoop provides Big Data security through end-to-end encryption that protects data at rest within the Hadoop cluster and in transit across networks. Each processing layer has multiple processes running on different machines within a cluster. The components of the Hadoop ecosystem, while evolving every day, include:

  • Sqoop

    For ingestion of structured data from a Relational Database Management System (RDBMS) into the HDFS (and export back).
  • Flume

    For ingestion of streaming or unstructured data directly into HDFS or a data warehouse system (such as Hive)
  • Hive

    A data warehouse system on top of HDFS in which users can write SQL queries to process data
  • HCatalog

    A table and storage management layer that enables users to store data in any format and structure
  • Oozie

    A workflow manager used to schedule jobs on the Hadoop cluster
  • Apache Zookeeper

    A centralized service of the Hadoop ecosystem, responsible for coordinating large clusters of machines
  • Pig

    A language allowing concise scripting to analyze and query datasets stored in HDFS
  • Apache Drill

    Supports data-intensive distributed applications for interactive analysis of large-scale datasets
  • Mahout

    For machine learning

MapReduce Algorithm

Hadoop MapReduce is among the oldest and most mature processing frameworks. Google introduced the MapReduce programming model in 2004 to process and generate large datasets in parallel across clusters of servers. Developers use MapReduce to process data in two phases:

  • Map Phase

    Applies a user-defined function to every input record and emits intermediate key-value pairs. The framework then sorts and shuffles these pairs so that all values belonging to the same key are grouped together.
  • Reduce Phase

    Aggregates the grouped values for each key, discarding unneeded data and retaining the necessary information to produce the final output (see the word-count sketch after this list).
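
The classic word-count example makes the two phases concrete. The pure-Python sketch below only illustrates the programming model; production Hadoop MapReduce jobs are typically written in Java or submitted through Hadoop Streaming, and the function names here are made up for the example.

```python
from collections import defaultdict

def map_phase(records):
    """Map: apply a function to every input record, emitting (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    """Framework step between the phases: group all values by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key into the final result."""
    for key, values in grouped:
        yield (key, sum(values))

docs = ["big data is big", "data needs processing"]
counts = dict(reduce_phase(shuffle_and_sort(map_phase(docs))))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}
```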

Conclusion

Most organizations are making use of big data to draw insights and support strategic business decisions. Simplilearn's Big Data Hadoop Training Course will help you master Big Data and Hadoop Ecosystem tools such as HDFS, YARN, MapReduce, Hive, Impala, Pig, HBase, Spark, Flume, Sqoop, and Hadoop Frameworks. You will learn the critical concepts of data processing. Consider this course to prepare for Cloudera’s CCA175 Big Data certification.

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.
