Apache Cassandra Lesson 1: Overview of Big Data and NoSQL Database

1.1 Overview of Big Data and NoSQL Database

Hello and welcome to the first lesson of the Apache Cassandra™ course offered by Simplilearn. This lesson will provide an overview of Big Data and NoSQL databases.

1.2 Course Map

The Apache Cassandra™ course by Simplilearn is divided into eight lessons, as listed.
• Lesson 0—Course Overview
• Lesson 1—Overview of Big Data and NoSQL Database
• Lesson 2—Introduction to Cassandra
• Lesson 3—Cassandra Architecture
• Lesson 4—Cassandra Installation and Configuration
• Lesson 5—Cassandra Data Model
• Lesson 6—Cassandra Interfaces
• Lesson 7—Cassandra Advanced Architecture and Cluster Management
• Lesson 8—Hadoop Ecosystem around Cassandra
This is the first lesson, ‘Overview of Big Data and NoSQL Database.’

1.3 Objectives

After completing this lesson, you will be able to describe the 3 Vs of big data. You will also be able to discuss some use cases of big data. Further, you will be able to explain Apache Hadoop and the concept of NoSQL. Finally, you will be able to describe various types of NoSQL databases.

1.4 The 3 Vs of Big Data

Big Data has three main characteristics: volume, velocity, and variety. Volume denotes the huge scaling of data, ranging from terabytes to zettabytes. Velocity accounts for the streaming of data and the movement of large volumes of data at high speed. Variety refers to managing the complexity of data in different structures, ranging from relational data to logs and raw text. In addition, there are other Vs of big data; however, they are not as popular. These are veracity, visualization, and value. Veracity refers to the truthfulness of data, visualization refers to the presentation of data in a graphical format, and value refers to the value an organization derives from using big data.

1.5 Volume

The term “volume” refers to data volume, which is the size of digital data. The impact of the Internet and social media has resulted in an explosion of digital data. Data has grown from gigabytes to terabytes, petabytes, exabytes, and zettabytes. As illustrated on the image, in 2008, the total data on the Internet was eight exabytes. It exploded to 150 exabytes by 2011 and reached 670 exabytes in 2013. It is expected to exceed seven zettabytes in the next 10 years.

1.6 Data Sizes-Terms

The table contains various terms used to represent data sizes. The size, along with the power and description of each term, is given. It is recommended that you spend some time going through the contents of the table for a better understanding. The newer terms added to address big data sizes are exabyte, zettabyte, and yottabyte. Typically, the term big data refers to data sizes of terabytes or more.
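Since the transcript does not reproduce the slide's table, here is a small Python sketch encoding the standard decimal definitions of these size terms; the loop is purely illustrative, and any descriptions on the original slide are omitted.

```python
# Standard data-size terms, expressed as decimal powers of ten.
SIZE_TERMS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,   # typical lower bound for "big data"
    "petabyte": 10**15,
    "exabyte": 10**18,    # newer terms introduced for big data sizes
    "zettabyte": 10**21,
    "yottabyte": 10**24,
}

for term, size in SIZE_TERMS.items():
    exponent = len(str(size)) - 1  # recover the power of ten
    print(f"1 {term} = 10^{exponent} bytes")
```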

1.7 Velocity

The term “velocity” refers to the speed at which the data grows. People use devices that create data constantly. Data is created from different sources, such as desktops, laptops, mobile phones, tablets, and sensors. Due to the increase in the global customer base and in transactions and interactions with customers, the data created within an organization is growing along with external data. As illustrated on the image, there are many contributors to this data growth, such as the web, online billing systems, ERP implementations, machine data, network elements, and social media. Growth in the revenues of organizations indicates growth in data.

1.8 Variety

The term “variety” refers to different types of data. Data includes text, images, audio, video, XML, and HTML. There are three types of data, as the sketch below illustrates:
• Structured data, where the data is represented in a tabular format; for example, MySQL databases.
• Semi-structured data, where the data does not have a formal data model; for example, XML files.
• Unstructured data, where there is no predefined data model and everything is defined at run time; for example, text files.
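To make the three varieties concrete, here is a minimal Python sketch using only the standard library; the employee records and the log line are invented examples.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: tabular rows with a fixed schema, as in a MySQL table.
structured = list(csv.DictReader(io.StringIO("id,name\n1,Asha\n2,Ravi\n")))

# Semi-structured: tagged, self-describing fields without a rigid
# tabular schema (XML, matching the lesson's example).
doc = ET.fromstring("<employee><id>1</id><name>Asha</name></employee>")
print(doc.find("name").text)  # Asha

# Unstructured: raw text with no predefined model; any structure is
# imposed at run time, for example by splitting the text into words.
unstructured = "Asha logged in at 09:14 and opened the quarterly report."
words = unstructured.split()
```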

1.9 Data Evolution

Digital data has evolved over 30 years, starting with unstructured data. Initially, data was created as plain text documents. Next, files and spreadsheets were created, which increased the usage of digital computers. The introduction of relational databases revolutionized structured data, as many organizations used them to create large amounts of structured data. Next, the data expanded to data warehouses and Storage Area Networks, or SANs, to handle large volumes of structured data. Then, the concept of metadata was introduced to describe structured and semi-structured data.

1.10 Features of Big Data

Big data has the following features: It is extremely fragmented due to the variety of data. It does not provide decisions directly; however, it can be used to make decisions. Big data does not include only unstructured data; it also includes structured data that extends and complements unstructured data. However, big data is not a substitute for structured data. Since most of the information on the Internet is available to anyone, it can be misused by malicious actors. Generally, big data is wide; you may have hundreds of fields in each line of your data. It is also dynamic, as gigabytes and terabytes of data are created every day. Big data can be both internal, generated within the organization, and external, generated on social media.

1.11 Big Data-Use Cases

Every industry has some use for big data. Some of the big data use cases are as follows:
• In the retail sector, big data is used extensively for affinity detection and market analysis.
• Credit card companies detect fraudulent purchases quickly so that they can alert customers.
• While giving loans, banks examine the private and public data of a customer to minimize risk.
• In medical diagnostics, doctors can diagnose a patient’s illness based on symptoms rather than intuition.
• Digital marketers process huge amounts of customer data to find effective marketing channels.
• Insurance companies use big data to minimize insurance risks. An individual’s driving data can be captured automatically and sent to the insurance company to calculate premiums for risky drivers.
• Manufacturing units and oil rigs have sensors that generate gigabits of data every day, which are analyzed to reduce the risk of equipment failures.
• Advertisers use demographic data to identify the target audience.
• Terabytes and petabytes of data are analyzed in the field of genetics to design new models.
• Power grids analyze large amounts of historical and weather forecasting data to forecast power consumption.

1.12 Big Data Analytics

With the advent of big data analytics, you can use complete sets of data instead of sample data to conduct an analysis. As represented on the image, in the traditional analytics method, analysts take a representative data sample to perform analysis and draw conclusions. Using big data analytics, the entire dataset can be used. Big data analytics helps you find associations in data, predict future outcomes, and perform prescriptive analysis. The outcome of prescriptive analysis is a definitive answer, not a probable one. Further, using big data for analysis helps in making data-driven decisions instead of decisions based on intuition. It also helps organizations increase their safety standards, reduce maintenance costs, and prevent failures.

1.13 Traditional Technology vs. Big Data Technology

Traditional technology can be compared with big data technology in the following ways:
• Traditional technology has a limit on scalability, whereas big data technology is highly scalable.
• Traditional technology uses highly parallel processors on a single machine, whereas big data technology uses distributed processing across multiple machines.
• In traditional technology, the processors may be distributed but the data stays in one place, whereas in big data technology, the data is distributed across multiple machines.
• Traditional technology depends on high-end, expensive hardware, costing more than $40,000 per terabyte, whereas big data technology leverages commodity hardware that may cost less than $5,000 per terabyte.
• Traditional technology uses centralized storage technologies, such as SAN, whereas big data technology uses distributed storage with data redundancy.

1.14 Apache Hadoop

Apache Hadoop is the most popular framework for big data processing. Hadoop has two core components: the Hadoop Distributed File System, or HDFS, and MapReduce. Hadoop uses HDFS to distribute the data across multiple machines, and it uses MapReduce to distribute the processing across those machines. Further, Hadoop moves the processing to where the data is, instead of moving the data to the processing. The concept of distribution performed by HDFS and MapReduce is illustrated on the image. First, HDFS divides the data into multiple sets, such as Data 1, Data 2, and Data 3, and distributes these datasets to multiple machines, such as CPU 1, CPU 2, and CPU 3. Next, MapReduce assigns a processing task to each machine. In the end, the processing is done by the CPU of each machine on the data assigned to that machine.

1.15 HDFS

HDFS is the storage component of Hadoop. It stores each file as blocks, with a default block size of 64 megabytes, which is much larger than the block size on Windows, typically 1 KB or 4 KB. HDFS is a Write-Once-Read-Many, or WORM, file system. Blocks are replicated across nodes in the cluster, and HDFS creates three replicas of each block by default. The image illustrates this concept with an example. Suppose you store a 320 MB file in HDFS. It will be divided into five blocks, each of size 64 megabytes, as 64 multiplied by five is 320. Each block is then replicated to make three copies, resulting in a total of 15 blocks. If there are five nodes in the cluster, these blocks are distributed across the five nodes so that no two replicas of the same block are on the same node.
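The block arithmetic in this example is simple enough to sketch in a few lines of Python; the defaults below match the lesson's figures, and the helper function is illustrative, not an actual HDFS API.

```python
import math

BLOCK_SIZE_MB = 64       # HDFS default block size cited in the lesson
REPLICATION_FACTOR = 3   # HDFS default number of replicas

def hdfs_block_counts(file_size_mb: int) -> tuple[int, int]:
    """Return (blocks, total blocks including replicas) for a file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION_FACTOR

blocks, total = hdfs_block_counts(320)
print(blocks, total)  # 5 blocks, 15 blocks in total across the cluster
```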

1.16 MapReduce

MapReduce is the processing framework of Hadoop. It provides highly fault-tolerant, distributed processing of the data distributed by HDFS. MapReduce consists of two types of tasks: mappers and reducers. Mappers are tasks that run in parallel on different nodes of the cluster to process the data blocks; in programming terms, a mapper reads its input and emits key-value pairs. After the map tasks complete, their results are gathered and aggregated by the reduce tasks, which consolidate and summarize the results. Each mapper runs on the machine to which its data block is assigned; this preference for data locality follows the principle of taking the processing to the data.
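To show the data flow the lesson describes, here is a toy, single-process word-count sketch in Python; real Hadoop runs mappers and reducers on many nodes, so this only mimics the map and reduce phases, and the function names are illustrative.

```python
from collections import defaultdict

def mapper(block: str):
    """Map phase: emit a (word, 1) key-value pair for every word."""
    for word in block.split():
        yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: consolidate the pairs by summing counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Two "data blocks" standing in for blocks stored on different nodes.
blocks = ["big data needs big clusters", "data drives decisions"]
pairs = [pair for block in blocks for pair in mapper(block)]
print(reducer(pairs))  # {'big': 2, 'data': 2, 'needs': 1, ...}
```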

1.17 NoSQL Databases

NoSQL is the common term used for all databases that do not follow the traditional Relational Database Management System, or RDBMS, principles. In NoSQL databases, the overhead of the ACID principles is reduced. ACID stands for Atomicity, Consistency, Isolation, and Durability; this is a set of properties that guarantees the reliable processing of database transactions, and these properties are guaranteed by most RDBMS. In NoSQL databases, the process of normalization is not mandatory. With big data, it is difficult to follow RDBMS principles and normalization, so NoSQL databases prefer denormalized data. Due to the transactional requirements of RDBMS, relational databases are not able to handle terabytes and petabytes of data; a NoSQL database is used to overcome these limitations of transactional databases. The image represents people uploading data to and reading data from a transactional database.
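As a minimal sketch of the denormalization trade-off mentioned above, with invented example records rather than any particular database's layout:

```python
# Normalized, RDBMS-style: users and orders live in separate tables
# and are joined on user_id at query time.
users = {1: {"name": "Asha"}}
orders = [{"order_id": 100, "user_id": 1, "item": "laptop"}]

# Denormalized, NoSQL-style: the user's name is duplicated into each
# order record, so a read needs no join; storage and update cost are
# traded for fast, single-lookup reads at scale.
orders_denormalized = [
    {"order_id": 100, "user_name": "Asha", "item": "laptop"},
]
print(orders_denormalized[0]["user_name"])  # Asha, with no join needed
```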

1.18 Brewer’s CAP Principle

Eric Brewer, a computer scientist, proposed the Consistency, Availability, and Partition Tolerance, or CAP, principle in 1999. CAP is the basis for many NoSQL databases. In a distributed system, consistency means that all nodes in the cluster view the same data at the same time. Availability means that a response is guaranteed for every request received; the response can indicate either that the request was successful or that it failed. Partition tolerance means the system continues to operate despite arbitrary message loss or failure of part of the system. Further, Brewer stated that it is not possible to guarantee all three aspects simultaneously in a distributed system. Therefore, most NoSQL databases compromise on one of the three aspects to provide better performance.

1.19 Approaches to NoSQL Databases-Types

There are four main types of NoSQL approaches: Graph Database, Document Database, Key Value Stores, and Column Stores. Let us discuss each of these one by one.

First is the Graph Database. A graph database helps in representing graphical data in the nodes-and-edges format. Some of its features are as follows:
• Graph databases can handle millions of nodes and edges.
• They can perform efficient depth-first and breadth-first searches on the graph data, as well as other graph searches and traversal algorithms.
Neo4J and FlockDB are common graph databases. The image depicts a graph with four nodes in a Neo4J database.

Second is the Document Database. A document database helps in storing and processing a huge number of documents. In a document database, you can store millions of documents and process their fields. For example, you can store your employees’ details and their resumes as documents, and search for a potential employee using fields like the phone number. MongoDB and CouchDB are popular document databases. The image depicts the fields of a document stored in MongoDB.

Third are the Key Value Stores. These store the data in key and value formats, where each piece of data is identified by a key and has associated values. Key value stores can store billions of records efficiently and provide fast writes as well as fast key-based searches. Cassandra and Redis are popular key value stores. The image depicts the keys and values stored in a Cassandra database. You will learn more about key value stores when we discuss Cassandra in the upcoming lessons; a minimal sketch of the key value model also appears after this section.

Last is the Column Stores, also called column-oriented databases. Column Stores organize data in groups of columns and are efficient in data storage and retrieval based on keys. Some of their features are as follows:
• They normally maintain a version of the data along with each value.
• HBase, a column store in the Hadoop ecosystem, runs on top of HDFS to store and process terabytes and petabytes of data efficiently.
HBase and Hypertable are the most common column stores. The image illustrates how HBase stores the data. Data is organized by column families, and each column family can have one or more columns. For each column, along with the data value, a version number indicating the time of the data update is also stored. The column-based data is stored along with the key.

Note that NoSQL databases cannot replace general-purpose databases. Although they provide better performance and scalability, they compromise on aspects like ease of use and full SQL query support.
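Here is the minimal in-memory sketch of the key value model referenced above, in Python; the dictionary-backed store and the put and get helpers are illustrative only, not Cassandra's actual API or storage engine.

```python
# A toy key value store: each row key maps to a set of named values.
store: dict[str, dict[str, str]] = {}

def put(key: str, column: str, value: str) -> None:
    """Write one named value under a row key."""
    store.setdefault(key, {})[column] = value

def get(key: str) -> dict[str, str] | None:
    """Fast lookup of all values stored under a key."""
    return store.get(key)

put("user:42", "name", "Asha")
put("user:42", "city", "Pune")
print(get("user:42"))  # {'name': 'Asha', 'city': 'Pune'}
```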

1.20 Quiz

Following is the quiz section to check your understanding of the lesson. Select the correct answer and click Submit to see the feedback.

1.21 Summary

Let us summarize what we have learned in this lesson. Big data is mainly characterized by variety, velocity, and volume. With the advent of big data analytics, complete sets of data can be used to conduct an analysis instead of sample data. Apache Hadoop is the most popular framework for big data processing. It has two core components: HDFS and MapReduce. HDFS is the storage component of Hadoop, and MapReduce is its processing framework. NoSQL is the common term used for all databases that do not follow traditional RDBMS principles. NoSQL design is guided by Brewer’s CAP principle: Brewer stated that it is not possible to guarantee the Consistency, Availability, and Partition Tolerance aspects simultaneously in a distributed system.

1.22 Conclusion

This concludes the lesson on the overview of Big Data and NoSQL databases. The next lesson will provide an introduction to Cassandra.
