Apache Ecosystem around Cassandra Tutorial

8.1 Apache Ecosystem around Cassandra

Hello and welcome to the eighth lesson of the Apache Cassandra™ course offered by Simplilearn. This lesson will cover the Apache Ecosystem components around Cassandra.

8.2 Course Map

The Apache Cassandra™ course by Simplilearn is divided into eight lessons, as listed:
• Lesson 0—Course Overview
• Lesson 1—Overview of Big Data and NoSQL Database
• Lesson 2—Introduction to Cassandra
• Lesson 3—Cassandra Architecture
• Lesson 4—Cassandra Installation and Configuration
• Lesson 5—Cassandra Data Model
• Lesson 6—Cassandra Interfaces
• Lesson 7—Cassandra Advanced Architecture and Cluster Management
• Lesson 8—Apache Ecosystem around Cassandra
This is the last lesson of the course, ‘Apache Ecosystem around Cassandra.’

8.3 Objectives

After completing this lesson, you will be able to describe Apache Storm, explain Apache Kafka, and discuss the real-time analytics platform tools. You will also be able to describe Apache Spark. Finally, you will be able to discuss Spark and Scala.

8.4 Stream

In computing, a Stream represents a continuous sequence of bytes of data. It is produced by one program and consumed by another, in first-in-first-out order. A Stream can be bounded, where it has limited data, or unbounded, where it has unlimited data. In an unbounded Stream, the producer continues to produce data and the consumer continues to consume it, for as long as data arrives. The given Linux command is an example of a Stream: cat logfile | wc -l In this command, ‘cat logfile’ (Read as: cat log file) produces a Stream that is consumed by ‘wc -l’ (Read as: w see hyphen el) to display the number of lines in the file.
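The producer-consumer relationship above can be sketched in plain Python (a conceptual illustration only, with hypothetical names): a generator plays the role of ‘cat logfile’ and a counting loop plays the role of ‘wc -l’, with items handed over in first-in-first-out order.

```python
def produce_lines():
    # Producer: yields items one at a time, like 'cat logfile'.
    for line in ["INFO start", "ERROR disk full", "INFO done"]:
        yield line

def count_lines(stream):
    # Consumer: reads the stream in FIFO order, like 'wc -l'.
    total = 0
    for _ in stream:
        total += 1
    return total

print(count_lines(produce_lines()))  # prints 3
```

An unbounded stream would simply be a producer that never stops yielding, with the consumer running for as long as data arrives.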

8.5 Apache Storm

Storm is a real-time stream processing system. It is an open-source product and part of Apache projects. Storm provides fast and reliable processing. It processes unbounded streams and interfaces with queues, such as Kafka, to get the data at the input end. Further, Storm can store data in Cassandra.

8.6 Storm Architecture

Storm has a Master-Slave architecture. The process of this architecture runs as follows: There is a master server called ‘Nimbus’ that runs on a single node. There are slave services called ‘Supervisors’ that run on each worker node. Supervisors start one or more worker processes called ‘Workers’ that process the input in parallel. Further, worker processes store the output in a file system or database. Note that Storm uses ZooKeeper for distributed process coordination.

8.7 Storm Architecture (contd.)

The given image illustrates the Storm architecture with one master node ‘Nimbus’ and five worker nodes. The ZooKeeper cluster, which is a separate Apache open-source product, acts as the coordinator of the distributed processes. Further, the image shows the data being stored into an external system. It also shows the Supervisor and Workers processes running on each worker node. Finally, the image shows the worker processes storing their output individually to a file system or database.

8.8 Storm Data Model

Storm data model consists of Tuples and Streams. Tuple refers to an ordered list of named values, similar to a database row. For example, in the stock market data, if the schema is (ticker, year, value, status), some tuples can be as follows:
ABC, 2011, 20, GOOD
ABC, 2012, 30, GOOD
ABC, 2012, 32, BAD
XYZ, 2011, 25, GOOD
Further, a stream in Storm is an unbounded sequence of tuples. For example, if the above tuples are stored in the stocks.txt file, the following command on Linux produces a stream that can be consumed by Storm: cat stocks.txt If a process continuously stores data into stocks.txt, it becomes an unbounded stream.
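The tuple-and-stream idea can be modeled in a few lines of Python (a sketch for intuition only; Storm itself is a JVM system and does not use these names): a named tuple holds the ordered, named values, and a generator stands in for the stream.

```python
from collections import namedtuple

# A tuple is an ordered list of named values, similar to a database row.
StockTuple = namedtuple("StockTuple", ["ticker", "year", "value", "status"])

def stock_stream():
    # A stream is a sequence of tuples; an unbounded stream would
    # simply never stop yielding.
    yield StockTuple("ABC", 2011, 20, "GOOD")
    yield StockTuple("ABC", 2012, 30, "GOOD")
    yield StockTuple("ABC", 2012, 32, "BAD")
    yield StockTuple("XYZ", 2011, 25, "GOOD")

for t in stock_stream():
    print(t.ticker, t.status)
```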

8.9 Storm Topology

A Storm topology is built from two types of components that transform the input stream: Spouts and Bolts. Spouts read external data to produce streams. They emit a stream of tuples and send them to bolts for further processing. Bolts process the tuples from input streams and produce output tuples. Inputs to a bolt may come from spouts or from other bolts. The given image illustrates a Storm topology with one spout and five bolts. This system receives input from an external data source and writes to two separate outputs. Spout 1 receives input from the data source and sends the tuples to three bolts: Bolt 1, Bolt 2, and Bolt 3. The output from Bolts 1 and 2 is consumed by Bolt 4. The output from Bolt 3 is consumed by Bolt 5. Further, Bolt 4 sends its output to Output 1 and Bolt 5 sends its output to Output 2.

8.10 Storm Topology - Example

The Storm topology is described using an example of the following case study: Problem: The stock market data, which is continuously sent by an external system, should be processed in the following way—records with a GOOD status are inserted into a database and those with a BAD status are written to an Error Log. Storm solution: As illustrated in the diagram, this Storm topology has one spout and two bolts, Bolt 1 and Bolt 2. The spout gets the data from the external system, divides it into GOOD and BAD tuples, and distributes them between the two bolts: tuples with a GOOD status are sent to Bolt 1 and those with a BAD status are sent to Bolt 2. Further, Bolt 1 saves the tuples to the Cassandra database and Bolt 2 saves the tuples to an Error Log file.
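The routing logic of this case study can be sketched as follows (a conceptual Python simulation with hypothetical names; a real solution would use the Storm topology API, with Bolt 1 writing to Cassandra and Bolt 2 to a log file):

```python
good_rows = []   # stands in for Bolt 1's output (Cassandra table)
error_log = []   # stands in for Bolt 2's output (Error Log file)

def spout(rows):
    # The spout reads external data and routes each tuple by status.
    for row in rows:
        if row["status"] == "GOOD":
            bolt1(row)
        else:
            bolt2(row)

def bolt1(row):
    good_rows.append(row)  # would insert the tuple into Cassandra

def bolt2(row):
    error_log.append(row)  # would append the tuple to the Error Log

spout([
    {"ticker": "ABC", "year": 2011, "value": 20, "status": "GOOD"},
    {"ticker": "ABC", "year": 2012, "value": 32, "status": "BAD"},
    {"ticker": "XYZ", "year": 2011, "value": 25, "status": "GOOD"},
])
print(len(good_rows), len(error_log))  # prints 2 1
```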

8.11 Apache Kafka

Apache Kafka is a high-performance, real-time, distributed, and partitioned messaging system. It is also an open-source product and part of Apache projects. Kafka is highly fault-tolerant; it can process millions of messages per second and deliver them to many receivers.

8.12 Kafka Data Model

Kafka data model consists of Messages and Topics. Any piece of information is called ‘Message’ in Kafka. For example, lines in a log file, rows of stock market data, error messages from a system. Further, Messages are grouped into categories called ‘Topics.’ For example, LogMessage and StockMessage.

8.13 Kafka Data Model (contd.)

In Kafka, the external processes that publish messages into a topic are called ‘Producers’ and the external processes that get messages from a topic are called ‘Consumers.’ ‘Brokers’ are the processes within Kafka that process messages. ‘Kafka Cluster’ refers to a set of Brokers. The given image illustrates a Kafka cluster with three brokers. They receive the messages from two producers and send the messages to two consumers.

8.14 Kafka Architecture

Kafka architecture consists of brokers that take messages from producers and add them to a partition of the topic. Brokers also send messages to consumers from the partitions. A topic is divided into multiple partitions. Messages are added to a partition at one end and consumed in the same order. Each partition acts as a message queue. Consumers are divided into consumer groups, and each message is delivered to one consumer in each consumer group.
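The role of a partition as a FIFO message queue can be sketched like this (an illustrative Python simulation with hypothetical names; real Kafka brokers persist partitions on disk and use their own partitioning strategy):

```python
from collections import deque

# A topic split into two partitions, each acting as a FIFO queue.
topic = {0: deque(), 1: deque()}

def publish(message, key):
    # The broker assigns each message to a partition; deriving the
    # partition from the key keeps messages with the same key in order.
    partition = sum(key.encode()) % len(topic)
    topic[partition].append(message)

def consume(partition):
    # Messages leave a partition in the order they were added.
    return topic[partition].popleft() if topic[partition] else None

publish("m1", "stocks")
publish("m2", "stocks")
p = sum("stocks".encode()) % len(topic)
print(consume(p), consume(p))  # prints m1 m2 (FIFO order)
```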

8.15 Kafka Architecture (contd.)

The given image illustrates the Kafka architecture. There are two producers sending messages to the broker cluster. There are two partitions, Partition 1 and Partition 2. Brokers add messages to these partitions, and messages leave each partition in the order of insertion. Further, the broker cluster sends the messages to two consumer groups, Group 1 and Group 2.

8.16 Kafka-Queue System

In the Queue system of messages in Kafka, each message has to be consumed by one consumer only. This can be accomplished by making sure that all the consumers belong to the same group. With this setting, each consumer will get the next message from the message queue, and no two consumers will get the same message. The given image illustrates the Kafka setup for the Queue system. The three consumers, Consumer 1, Consumer 2, and Consumer 3 are set up in the same Consumer Group. Therefore, the messages are sent to one of these consumers and there is no duplication of messages. This setup works like a queue system.

8.17 Kafka-Publish-Subscribe System

The Publish-Subscribe system of messages in Kafka is where one system broadcasts the messages and each message is received by all the consumers. This can be accomplished in Kafka by making sure each consumer belongs to a separate group, and setting up only one consumer in each group. The diagram illustrates the Kafka setup for the Publish-Subscribe system. The three consumers, Consumer 1, Consumer 2, and Consumer 3, are set up in separate consumer groups, Group 1, Group 2, and Group 3. Each message is sent to all three consumer groups and is therefore received by all three consumers. Note that Kafka architecture supports both the Queue and the Publish-Subscribe system of messages.
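The delivery rule behind both setups is the same: each message goes to exactly one consumer in each consumer group. The following Python sketch (illustrative names, round-robin assignment assumed within a group) shows how one group gives queue behavior and one-consumer-per-group gives publish-subscribe behavior:

```python
from itertools import cycle

def deliver(messages, groups):
    # groups maps a group name to its list of consumers. Each message is
    # delivered to exactly one consumer in each group (round-robin here).
    received = {g: {c: [] for c in consumers} for g, consumers in groups.items()}
    pickers = {g: cycle(consumers) for g, consumers in groups.items()}
    for msg in messages:
        for g in groups:
            received[g][next(pickers[g])].append(msg)
    return received

# Queue system: one group, three consumers -> no message is duplicated.
queue = deliver(["m1", "m2", "m3"], {"G1": ["c1", "c2", "c3"]})

# Publish-Subscribe: three groups of one consumer -> every consumer
# receives every message.
pubsub = deliver(["m1", "m2"], {"G1": ["c1"], "G2": ["c2"], "G3": ["c3"]})
```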

8.18 Real-Time Data Analysis Platform

Kafka, Storm, and Cassandra together form a high-performance real-time big data analytics platform. In this platform, Kafka receives each line of data as a message and forwards it to Storm. Storm parallelizes the data and initiates multiple bolts to insert data into Cassandra. Further, each bolt inserts data into one or more tables in Cassandra. Millions of records can be processed per second with this architecture. Storm provides special Kafka spouts to get data from Kafka. The Cassandra Java interface can be used to insert records into Cassandra.

8.19 Real-Time Data Analysis Platform (contd.)

The given image illustrates the real-time data analysis platform. The Kafka Cluster receives messages from an external data source and sends them to Kafka Spout within Storm. Further, the Kafka Spout parallelizes the messages into multiple bolts. These bolts insert the data into Cassandra, thus providing a real-time data analytics platform.

8.20 Apache Spark

Apache Spark is an open-source project, which is considered the next generation of MapReduce. Apache Spark provides transformations on distributed data that go beyond map and reduce. It is said to be 10 to 100 times faster than Hadoop MapReduce. It is suitable for batch as well as real-time processing. Further, it also provides Spark-SQL (Read as: spark sequel) as an SQL (Read as: sequel) interface to big data. It provides built-in libraries for Machine Learning and Graph Processing. Note: Machine Learning consists of programs that can learn based on the data, without being explicitly programmed. Graph Processing consists of algorithms to process the nodes and edges of a graph.

8.21 Apache Spark Architecture

Similar to Storm, Spark has a Master-Slave architecture, with Spark Driver as the master and tasks scheduled on each worker node. A cache, also called block cache, is maintained on each worker node. Task scheduling is done based on the data locality and availability of data in cache. The given image illustrates the Spark architecture with one Master and five Worker Nodes. Data is maintained as blocks on each worker node and a cache is present in each worker node. Tasks are run on the worker nodes in parallel. They make use of the cache to process data faster.

8.22 Resilient Distributed Datasets

Resilient Distributed Datasets or RDDs are the units of data in Spark. Like files in HDFS, RDDs cannot be changed after creation. Further, RDDs are distributed across the cluster and stored in memory and on disk, and they can use HDFS for storage. RDDs are called ‘resilient’ as they can be rebuilt automatically upon failure. The given image illustrates an RDD being distributed across four data nodes, Data 1, Data 2, Data 3, and Data 4.
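The idea behind ‘resilient’ can be sketched in Python (a conceptual simulation, not Spark's actual implementation): an RDD remembers its lineage, that is, the parent data plus the transformation used to build it, so a lost copy can be recomputed instead of being restored from a replica.

```python
class SketchRDD:
    # Hypothetical miniature of an RDD: immutable data plus lineage.
    def __init__(self, parent, transform):
        self.parent = parent        # source data this RDD was built from
        self.transform = transform  # transformation applied to the parent
        self.cache = None           # in-memory copy, may be lost

    def compute(self):
        if self.cache is None:      # rebuild from lineage on demand
            self.cache = [self.transform(x) for x in self.parent]
        return self.cache

rdd = SketchRDD([1, 2, 3], lambda x: x * 10)
print(rdd.compute())  # [10, 20, 30]
rdd.cache = None      # simulate losing the in-memory copy on failure
print(rdd.compute())  # rebuilt from lineage: [10, 20, 30]
```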

8.23 Operations in Spark

There are two types of operations in Spark, Transformation and Action. A transformation takes one RDD and produces another RDD. Built-in transformations include filter, map, limit, and group. You can chain transformations, for example applying a filter followed by a map. The given image illustrates the transformation of RDD1 to RDD2. An action takes one RDD and produces a number or a program object. Actions include count, collect, reduce, and save. For example, the action ‘count’ gives you the number of items in an RDD. The given image illustrates an action that takes RDD1 and gives a set of numbers. Chaining of transformations allows Spark to iterate over the data sets in a single job, instead of using multiple MapReduce jobs.
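The transformation-then-action pattern can be sketched with Python generators (an analogy only, not the Spark API): the generators stay lazy, so nothing runs until the final action, loosely mirroring how Spark defers work until an action such as count is called.

```python
data = ["ERROR disk", "INFO ok", "ERROR net", "INFO done"]

errors = (line for line in data if "ERROR" in line)  # transformation: filter
upper = (line.upper() for line in errors)            # transformation: map
count = sum(1 for _ in upper)                        # action: count runs the chain

print(count)  # prints 2
```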

8.24 Spark and Scala

Spark supports multiple interfaces including Java, Python, and Scala. Scala is an object-functional programming language, well-suited for Spark. Scala is a higher-level language than Java. Everything created in Scala is considered an object, and functions can be defined on the objects. In Scala, types are inferred from the context, so explicit type definitions for objects are not required. Further, Scala inter-operates with Java, as you can use any Java classes in Scala. It is also called ‘Java without semicolons.’ The given image illustrates the standard “Hello, World!” program written in Scala. It defines an object HelloWorld that contains the main function. The main function just prints the line “Hello World!” (Read as: hello world). Once this object is defined, you can call this object's function with HelloWorld.main. The output of this program will be the printout of the line “Hello World!”. Notice that unlike in Java, semicolons are not required to end the lines here.

8.25 Spark - Example

Following is an example to find the number of error messages in a log file. The example also illustrates Transformations and Actions in Spark. The Spark program written in Scala is shown:
scala> val logFile = sc.textFile("hdfs://localhost:9000/user/ec2-user/logfile")
scala> val logInfo = logFile.filter(line => line.contains("ERROR"))
scala> logInfo.count()
res6: Long = 92
First, it loads a log file from HDFS into a Spark variable called logFile, which becomes an RDD in Spark. Next, it applies a filter, keeping each line containing the string ‘ERROR’ in another variable called logInfo. This is a Transformation in Spark, where the RDD logFile is taken and the RDD logInfo is produced. Finally, it applies the ‘count’ action on the RDD logInfo, which returns 92 as the number of lines with error messages in the file. This is known as an Action in Spark.

8.26 Quiz

A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.

8.27 Summary

Let us summarize the topics covered in this lesson:
• Storm, Kafka, and Spark are open-source products in the Apache ecosystem.
• Apache Storm is a real-time stream processing system, and consists of spouts and bolts.
• Apache Kafka is a high-performance, real-time, distributed, and partitioned messaging system.
• Kafka architecture consists of brokers that take messages from producers and add them to a partition of the topic.
• Brokers also provide the messages to the consumers from the partitions.
• Kafka architecture supports both the Queue and the Publish-Subscribe system of messages.
• Apache Spark is an open-source project, which is considered the next generation of MapReduce.
• Spark has a Master-Slave architecture, with Spark Driver as the Master and tasks scheduled on each worker node.
• Scala is an object-functional programming language, well-suited for Spark.

8.28 Thank you

With this, we have come to the end of this course. Thank you and happy learning!
