Apache Ecosystem around Cassandra Tutorial

Apache Ecosystem around Cassandra

Welcome to the eighth lesson ‘Apache Ecosystem around Cassandra’ of the Apache Cassandra Certification Course. This lesson will cover the Apache Ecosystem components around Cassandra.

Let us begin with the objectives of this lesson.

Objectives

After completing this lesson on Apache Ecosystem around Cassandra, you will be able to:

  • Describe Apache Storm

  • Explain Apache Kafka

  • Discuss the real-time analytics platform tools

  • Describe Apache Spark

  • Discuss Spark and Scala

Let us learn about Stream in the next section.


Stream

In computing, a stream is a continuous sequence of data, typically bytes, produced by one program and consumed by another in first-in, first-out order.

A stream can be bounded, meaning it carries a limited amount of data, or unbounded, meaning the producer keeps producing data and the consumer keeps consuming it for as long as data arrives.

The given Linux command is an example of a stream: cat logfile | wc -l

In this command, ‘cat logfile’ produces a stream that is consumed by ‘wc -l’, which displays the number of lines in the file.

Let us learn about Apache Storm in the next section.

Apache Storm

Storm is a real-time stream processing system. It is an open-source product and part of Apache projects. Storm provides fast and reliable processing.

It processes unbounded streams and interfaces with queues, such as Kafka, to receive input data. Further, Storm can store its output in Cassandra.

Let us learn about Storm Architecture in the next section.

Storm Architecture

Storm has a Master-Slave architecture, which works as follows:

There is a master service called ‘Nimbus’ that runs on a single node, and there are slave services called ‘Supervisors’ that run on each worker node.

Supervisors start one or more worker processes, called ‘Workers’, that run in parallel to process the input. Worker processes store their output in a file system or database.

Note that Storm uses ZooKeeper for distributed process coordination.

The given image illustrates the Storm architecture with one master node, ‘Nimbus’, and five worker nodes. The ZooKeeper cluster, which is a separate Apache open-source product, acts as the coordinator of the distributed processes.

Further, the image shows the data being stored in an external system. It also shows the Supervisor and Workers processes running on each worker node. Finally, the image shows the worker processes storing their output individually to a file system or database.
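As a rough sketch of how a client hands work to this architecture, the following Scala snippet assembles a topology and submits it to Nimbus, which distributes the tasks to the Supervisors. The topology name and worker count are illustrative, StockSpout is a hypothetical spout class (spouts are introduced in the next sections), and Storm 1.0+ package names are assumed:

import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.topology.TopologyBuilder

val builder = new TopologyBuilder
builder.setSpout("stock-spout", new StockSpout)   // StockSpout is a hypothetical spout class
val conf = new Config()
conf.setNumWorkers(4)                             // worker processes spread across the Supervisors
// The client hands the topology to Nimbus, which schedules its tasks on the Supervisors:
StormSubmitter.submitTopology("stock-topology", conf, builder.createTopology())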

Let us learn about Storm Data Model in the next section.

Storm Data Model

The Storm data model consists of Tuples and Streams. A Tuple is an ordered list of named values, similar to a database row.

For example, in the stock market data, if the schema is (ticker, year, value, status), some tuples can be as follows:

ABC, 2011, 20, GOOD

ABC, 2012, 30, GOOD

ABC, 2012, 32, BAD

XYZ, 2011, 25, GOOD

Further, a stream in Storm is an unbounded sequence of tuples. For example, if the above tuples are stored in the stocks.txt file, the following command on Linux produces a stream that can be consumed by Storm: cat stocks.txt

If a process continuously stores data into stocks.txt, it becomes an unbounded stream.
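In Storm's Java API, which can be used from Scala, a tuple's schema is declared with Fields and a tuple is emitted as Values. A minimal sketch of the stock schema above (the variable names are illustrative):

import org.apache.storm.tuple.{Fields, Values}

val schema = new Fields("ticker", "year", "value", "status")   // named fields, like columns
val tuple  = new Values("ABC", Integer.valueOf(2011), Integer.valueOf(20), "GOOD")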

Let us learn about Storm Topology in the next section.

Storm Topology

Storm provides two types of transformations on the input stream:

Spouts:

Spouts read data from external sources and produce streams of tuples, which they send to bolts for further processing.

Bolts:

Bolts process the tuples from input streams and produce output tuples. Inputs to a bolt may come from spouts or from other bolts.

The given image illustrates a Storm topology with one spout and five bolts.

This system receives input from an external data source and writes to two separate outputs.

Spout 1 receives input from the data source and sends the tuples to three bolts, Bolt 1, Bolt 2, and Bolt 3. The output from Bolts 1 and 2 is consumed by Bolt 4. The output from Bolt 3 is consumed by Bolt 5.

Further, Bolt 4 sends the output to output 1, and Bolt 5 sends the output to output 2.

In the next section, let us look into the Storm Topology example.

Storm Topology - Example

Storm topology is illustrated using the following case study:

Problem: Stock market data, continuously sent by an external system, should be processed as follows: records with a GOOD status are inserted into a database, and records with a BAD status are written to an Error Log.

Storm solution: As illustrated in the following diagram, this Storm topology will have one spout and two bolts, Bolt 1 and Bolt 2.

The spout gets the data from the external system and splits the tuples by status: tuples with a GOOD status are sent to Bolt 1, and tuples with a BAD status are sent to Bolt 2.

Further, Bolt 1 saves the tuples to the Cassandra database, and Bolt 2 saves the tuples to an Error Log file.
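A sketch of this wiring with Storm's TopologyBuilder follows. StockSpout, CassandraWriterBolt, and ErrorLogBolt are hypothetical class names standing in for the spout and bolts described above, and the spout is assumed to emit on two named streams, "good" and "bad":

import org.apache.storm.topology.TopologyBuilder

val builder = new TopologyBuilder
builder.setSpout("stock-spout", new StockSpout)   // reads stock data from the external system

// GOOD tuples flow to the Cassandra bolt, BAD tuples to the error-log bolt:
builder.setBolt("cassandra-bolt", new CassandraWriterBolt)
  .shuffleGrouping("stock-spout", "good")
builder.setBolt("errorlog-bolt", new ErrorLogBolt)
  .shuffleGrouping("stock-spout", "bad")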

In the next section, let us look into Apache Kafka.

Apache Kafka

Apache Kafka is a high-performance, real-time, distributed, and partitioned messaging system. It is also an open-source product and part of the Apache projects.

Kafka is highly fault-tolerant; it can process millions of messages per second and deliver them to many receivers.

In the next section, let us look into the Kafka Data Model.


Kafka Data Model

Kafka data model consists of Messages and Topics.

Messages

Any piece of information sent through Kafka is called a ‘Message.’

For example: lines in a log file, rows of stock market data, or error messages from a system.

Topics

Further, Messages are grouped into categories called ‘Topics.’

For example, LogMessage and StockMessage.

In Kafka, the external processes that publish messages into a topic are called ‘Producers,’ and the external processes that get messages from a topic are called ‘Consumers.’

‘Brokers’ are the processes within Kafka that process messages.

‘Kafka Cluster’ refers to a set of Brokers.

The given image illustrates a Kafka cluster with three brokers that receive messages from two producers and send them to two consumers.
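A minimal producer sketch in Scala, using the Kafka Java client, is shown below. The broker address localhost:9092 is an assumption, and the topic name StockMessage is taken from the example above:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one stock row as a message; the key determines the partition:
producer.send(new ProducerRecord[String, String]("StockMessage", "ABC", "ABC,2011,20,GOOD"))
producer.close()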

In the next section, let us look into the Kafka Architecture.

Kafka Architecture

Kafka architecture consists of brokers that take messages from producers and append them to a partition of a topic. Brokers also serve messages to consumers from the partitions.

Points to remember for this type of architecture are:

  • A topic is divided into multiple partitions.

  • Messages are added to the partitions at one end and consumed in the same order.

  • Each partition acts as a message queue.

  • Consumers are divided into consumer groups.

  • Each message is delivered to one consumer in each consumer group.

The given image illustrates the Kafka architecture.

There are two producers sending messages to the broker cluster. The topic has two partitions, Partition 1 and Partition 2. Brokers add messages to these partitions, and messages are taken out in the order of insertion.

Further, the broker cluster sends the messages to two consumer groups, Group 1 and Group 2.
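As an illustration, a topic with two partitions like the one in the image could be created programmatically with Kafka's AdminClient (a sketch; the broker address is an assumption):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")

val admin = AdminClient.create(props)
// Create the StockMessage topic with 2 partitions and a replication factor of 1:
admin.createTopics(Collections.singleton(new NewTopic("StockMessage", 2, 1.toShort)))
admin.close()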

In the next section, let us look into the Kafka Queue System.

Kafka-Queue System

In the Queue system of messages in Kafka:

  • Each message has to be consumed by one consumer only.

  • This can be accomplished by making sure that all the consumers belong to the same group.

  • With this setting, each consumer will get the next message from the message queue.

  • No two consumers will get the same message.

The given image illustrates the Kafka setup for the Queue system.

The three consumers, Consumer 1, Consumer 2, and Consumer 3 are set up in the same Consumer Group. Therefore, the messages are sent to one of these consumers, and there is no duplication of messages. This setup works like a queue system.
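A minimal consumer sketch in Scala is shown below. Queue semantics come from all consumer instances sharing the same group.id; the broker address, group name, and topic are assumptions carried over from the producer sketch:

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._   // Scala 2.13+; use scala.collection.JavaConverters on older versions
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "stock-group")   // every consumer instance uses this same group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(List("StockMessage").asJava)

// Each message is delivered to only one consumer in the group:
val records = consumer.poll(Duration.ofSeconds(1))
records.asScala.foreach(r => println(s"${r.key}: ${r.value}"))
consumer.close()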

In the next section, let us look into the Kafka Publish Subscribe System.

Kafka-Publish-Subscribe System

In the Publish-Subscribe system of messages in Kafka:

  • One system broadcasts the messages, and each message is received by all the consumers.

  • This can be accomplished in Kafka by making sure each consumer belongs to a separate group.

  • Each group is set up with only one consumer.

The diagram illustrates the Kafka setup for the Publish-Subscribe system.

The three consumers Consumer 1, Consumer 2, and Consumer 3 are set up in separate consumer groups, Group 1, Group 2, and Group 3.

Each message is sent to all three consumer groups and is therefore received by all three consumers. Note that the Kafka architecture supports both the Queue and Publish-Subscribe models of messaging.
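In code, the only change from the queue sketch above is the group.id: each consumer gets a group of its own (the group names here are illustrative):

props.put("group.id", "group1")   // consumer 1: its own group, so it receives every message
// props.put("group.id", "group2")   // consumer 2, running in a separate process
// props.put("group.id", "group3")   // consumer 3, running in a separate process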

In the next section, let us look into the real-time data analysis platform.

Real-Time Data Analysis Platform

Kafka, Storm, and Cassandra together form a high-performance real-time big data analytics platform.

In this platform, Kafka receives each line of data as a message and forwards it to Storm.

Storm parallelizes the data and initiates multiple bolts to insert data into Cassandra. Further, each bolt inserts data into one or more tables in Cassandra.

Millions of records can be processed per second with this architecture. Storm provides a special Kafka spout to get data from Kafka, and the Cassandra Java driver can be used to insert records into Cassandra.

The given image illustrates the real-time data analysis platform.

The Kafka Cluster receives messages from an external data source and sends them to Kafka Spout within Storm.

Further, the Kafka Spout parallelizes the messages into multiple bolts. These bolts insert the data into Cassandra, thus providing a real-time data analytics platform.
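A sketch of this pipeline using Storm's storm-kafka-client module is shown below. CassandraWriterBolt is again a hypothetical bolt that uses the Cassandra Java driver internally; the broker address and topic name are assumptions:

import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.kafka.spout.{KafkaSpout, KafkaSpoutConfig}

// The Kafka spout subscribes to the StockMessage topic:
val spoutConfig = KafkaSpoutConfig.builder("localhost:9092", "StockMessage").build()

val builder = new TopologyBuilder
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig))
// Four parallel bolt instances insert the rows into Cassandra:
builder.setBolt("cassandra-bolt", new CassandraWriterBolt, 4)
  .shuffleGrouping("kafka-spout")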

The next section focuses on Apache Spark.

Apache Spark

Apache Spark is an open-source project that is considered the next generation of MapReduce. Spark can transform distributed data using operations beyond map and reduce, and it is said to be 10 to 100 times faster than Hadoop MapReduce. It is suitable for batch as well as real-time processing.

Further, it provides Spark SQL, a SQL interface to big data, and built-in libraries for Machine Learning and Graph Processing.

Note: Machine Learning consists of programs that can learn based on the data, without being explicitly programmed.

Graph Processing consists of algorithms to process the nodes and edges of a graph.

In the next section, let us look into the Apache Spark Architecture.

Apache Spark Architecture

Similar to Storm, Spark has a Master-Slave architecture, with the Spark Driver as the master and tasks scheduled on each worker node. A cache, also called the block cache, is maintained on each worker node. Tasks are scheduled based on data locality and the availability of data in the cache.

The given image illustrates the Spark architecture with one Master and five Worker Nodes.

Data is maintained as blocks on each worker node, and a cache is present in each worker node. Tasks are run on the worker nodes in parallel. They make use of the cache to process data faster.

Let us look into the resilient distributed datasets in the next section.

Resilient Distributed Datasets

Resilient Distributed Datasets, or RDDs, are the units of data in Spark. Like files in HDFS, RDDs are immutable: they cannot be changed after creation.

Further, RDDs are distributed across the cluster and stored in memory and on disk, and they can use HDFS for storage. RDDs are called ‘resilient’ as they can be rebuilt automatically upon failure.
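A minimal sketch in the Spark shell (sc is the SparkContext the shell provides):

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))   // distribute a local collection across the cluster
rdd.cache()                                    // keep it in memory for reuse
println(rdd.count())                           // 5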

In the next section, let us look into the operations in Spark.

Operations in Spark

There are two types of operations in Spark: Transformations and Actions.

A Transformation takes one RDD and produces another RDD. Built-in transformations include filter, map, flatMap, and groupBy. You can chain transformations, applying a filter followed by a map, for example.

The given image illustrates the transformation of RDD1 to RDD2.

An Action takes one RDD and produces a number or a program object. Actions include count, collect, reduce, and saveAsTextFile. For example, the action count gives you the number of items in an RDD.

The given image illustrates an action that takes RDD1 and gives a set of numbers.

Chaining of transformations allows Spark to iterate over the data sets, instead of using multiple MapReduce jobs.
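A sketch of such chaining in the Spark shell, using the stock data from earlier (the HDFS path is an assumption modeled on the example in the next section):

val stocks = sc.textFile("hdfs://localhost:9000/user/ec2-user/stocks.txt")
val goodValues = stocks
  .filter(line => line.endsWith("GOOD"))        // transformation: keep rows with GOOD status
  .map(line => line.split(",")(2).trim.toInt)   // transformation: extract the value column
val total = goodValues.reduce(_ + _)            // action: sum the values in a single pass over the data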

In the next section, let us focus on Spark and Scala.


Spark and Scala

Spark supports multiple interfaces including Java, Python, and Scala. Scala is an object-functional programming language, well-suited for Spark.

Scala is a higher-level language than Java. Everything created in Scala is an object, and functions can be defined on objects. In Scala, types are inferred from context, so explicit type definitions are often not required.

Further, Scala inter-operates with Java, as you can use any Java classes in Scala. It is also called ‘Java without semicolons.’

The given image illustrates the standard “Hello, World!” program written in Scala.
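A minimal version of the program described by the image:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, World!")   // no semicolon needed to end the line
  }
}

HelloWorld.main(Array())   // prints: Hello, World!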

It defines an object HelloWorld that contains the main function. The main function prints the line “Hello, World!”. Once this object is defined, you can call its main function with HelloWorld.main.

The output of this program is the printout of the line “Hello, World!”. Notice that, unlike in Java, semicolons are not required to end the lines.

Let us see the example of Spark in the next section.

Spark - Example

Following is an example to find the number of error messages in a log file. The example also illustrates Transformations and Actions in Spark.

The Spark program written in Scala is shown:

scala> val logFile = sc.textFile("hdfs://localhost:9000/user/ec2-user/logfile")

scala> val logInfo = logFile.filter(line => line.contains("ERROR"))

scala> logInfo.count()

res6: Long = 92

First, it loads a log file from HDFS into a Spark variable called logFile, which becomes an RDD in Spark.

Next, it applies a filter, keeping the lines that contain the string ‘ERROR’ and storing the result in another variable called logInfo. This is a Transformation in Spark: the RDD logFile is taken, and the RDD logInfo is produced.

Finally, it applies the ‘count’ action on the RDD logInfo, which returns 92, the number of lines with error messages in the file. This is an Action in Spark.

Summary

Let us summarize the topics covered in this lesson:

  • Storm, Kafka, and Spark are all open-source products in the Apache ecosystem.

  • Apache Storm is a real-time stream processing system and consists of spouts and bolts.

  • Apache Kafka is a high-performance, real-time, distributed, and partitioned messaging system.

  • Kafka architecture consists of brokers that take messages from producers and append them to a partition of a topic.

  • Brokers also provide the messages to the consumers from the partitions.

  • Kafka architecture supports both the Queue and Publish-Subscribe models of messaging.

  • Apache Spark is an open-source project, which is considered the next generation of MapReduce.

  • Spark has a Master-Slave architecture, with Spark Driver as the Master and tasks scheduled on each worker node.

  • Scala is an object-functional programming language, well-suited for Spark.

Conclusion

With this, we have come to the end of the Apache Cassandra Tutorial.

