Introduction to Storm Tutorial

2.1 Introduction to Storm

Hello and welcome to the second lesson of the Apache Storm course offered by Simplilearn. This lesson will provide you with an introduction to Storm, its data model, architecture, and components. Let us start by exploring the objectives of this lesson.

2.2 Objectives

By the end of this lesson, you will be able to describe the concept of Storm and explain streaming, describe the features and use cases for Storm, discuss the Storm data model, describe Storm architecture and its components, and explain the different types of topologies. Now, let’s get started with understanding the concept of Storm.

2.3 Apache Storm

Storm is a real-time stream processing system. It is open source and an Apache project. It processes big data and provides fast, reliable processing. It can ingest high-volume, high-velocity data, and it is highly parallelizable, scalable, and fault-tolerant. Moving on, let us explore the uses of Storm.

2.4 Uses of Storm

Storm provides a computation system that can be used for real-time analytics, machine learning, and unbounded stream processing. It can take continuously produced messages and output them to multiple systems. Next, let us understand what a stream is.

2.5 What is a Stream

In computing, a stream represents a continuous sequence of bytes of data. It is produced by one program and consumed by another, in 'First In, First Out' (FIFO) order. That means if the values 1, 2, 3, 4, 5 are produced by one program, another program consumes them in the order 1, 2, 3, 4, 5 only. A stream can be bounded or unbounded. Bounded means that the data is limited. Unbounded means that there is no limit: the producer keeps producing data as long as it runs, and the consumer keeps consuming that data. A Linux pipe is an example of a stream. The Linux command is:
cat logfile | wc -l
In this command, cat logfile produces a stream that is consumed by wc -l to display the number of lines in the file. Next, we will explore industry use cases for Storm.

2.6 Industry Use Cases for Storm

Many industries can use Storm for real-time big data processing. For example:
1. Credit card companies can use it for fraud detection on swipe
2. Investment banks can use it for trade pattern analysis in real time
3. Retail stores can use it for dynamic pricing
4. Transportation providers can use it for route suggestions based on traffic data
5. Healthcare providers can use it for monitoring ICU sensors
6. Telecom organizations can use it for processing switch data
Now, let us look at the Storm data model.

2.7 Storm Data Model

The Storm data model consists of tuples and streams. A tuple is an ordered list of named values, similar to a database row. Each field in the tuple has a data type that can be dynamic. A field can be of any data type, such as string, integer, float, double, boolean, or byte array. User-defined data types are also allowed in tuples. For example, for stock market data with the schema ticker, year, value, and status, some tuples could be:
ABC, 2011, 20, GOOD
ABC, 2012, 30, GOOD
ABC, 2012, 32, BAD
XYZ, 2011, 25, GOOD
A stream in Storm is an unbounded sequence of tuples. For example, if the above tuples are stored in a file stocks.txt, then the command cat stocks.txt produces a stream. If a process is continuously appending data to stocks.txt, it becomes an unbounded stream. Now, let us look at the Storm architecture.
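As a concrete illustration, here is a minimal, hedged sketch (not from the lesson) of how that schema and those tuples could be represented with Storm's Fields and Values helper classes; it assumes the Storm 2.x org.apache.storm.tuple package, and the class name StockTupleExample is purely illustrative.

```java
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Arrays;
import java.util.List;

public class StockTupleExample {
    public static void main(String[] args) {
        // Schema: ticker, year, value, status
        Fields schema = new Fields("ticker", "year", "value", "status");

        // A few tuples conforming to the schema; each field can have a different type.
        List<Values> tuples = Arrays.asList(
                new Values("ABC", 2011, 20, "GOOD"),
                new Values("ABC", 2012, 30, "GOOD"),
                new Values("XYZ", 2011, 25, "GOOD"));

        System.out.println(schema.toList() + " -> " + tuples);
    }
}
```

In a real topology, a spout or bolt would declare such a schema in its declareOutputFields method and emit Values instances that match it.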

2.8 Storm Architecture

Storm has a master-slave architecture. A master service called Nimbus runs on a single node called the master node. Slave services called supervisors run on each worker node. Supervisors start one or more worker processes, called workers, that run in parallel to process the input. Worker processes store their output to a file system or database. Storm uses ZooKeeper for distributed process coordination. The diagram shows the Storm architecture with one master node and five worker nodes. The Nimbus process runs on the master node. One supervisor process runs on each worker node, and multiple worker processes run on each worker node. The workers get their input from a file system or database and store their output to a file system or database as well.

2.9 Storm Processes

A ZooKeeper cluster is used for coordinating the master, supervisor, and worker processes. Moving on, let us look at a sample program.

2.10 Sample Program

A log processing program takes each line from a log file, filters the messages based on the log type, and outputs the log type.
Input: a log file containing error, warning, and informational messages. This is a growing file that continuously receives new lines of log messages.
Output: the type of each message (ERROR, WARNING, or INFO).
Let us continue with the sample program. This program contains a single spout and a single bolt. The spout opens the file, reads each line, and outputs the entire line as a tuple. The bolt reads each tuple from the spout, checks whether the tuple contains the string ERROR, WARNING, or INFO, and outputs only ERROR, WARNING, or INFO. The diagram shows the outline of the spout and the bolt. The spout is named LineSpout. It has a loop that reads each line of input and outputs the entire line; the emit function is used to output the line as a stream of tuples. The bolt is named LogTypeBolt. It takes a tuple as input. If the line contains the string ERROR, it outputs the string ERROR. If the line contains the string WARNING, it outputs the string WARNING. Similarly, if the line contains the string INFO, it outputs the string INFO.
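The outline above can be turned into code. The following is a hedged sketch assuming the Storm 2.x Java API; the log file path and details such as error handling are illustrative assumptions, not part of the lesson.

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;

// Spout: reads each line of the log file and emits the entire line as a tuple.
public class LineSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Illustrative path; in practice this would come from configuration.
            this.reader = new BufferedReader(new FileReader("/var/log/app.log"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line));   // output the whole line as a tuple
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}

// Bolt: checks each line for ERROR, WARNING, or INFO and emits only the log type.
class LogTypeBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String line = input.getStringByField("line");
        if (line.contains("ERROR")) {
            collector.emit(new Values("ERROR"));
        } else if (line.contains("WARNING")) {
            collector.emit(new Values("WARNING"));
        } else if (line.contains("INFO")) {
            collector.emit(new Values("INFO"));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("logType"));
    }
}
```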

2.11 Storm Components

Storm provides two types of components that process the input stream: spouts and bolts. Spouts process external data to produce streams of tuples. Spouts produce tuples and send them to bolts. Bolts process the tuples from input streams and produce output tuples. Input streams to a bolt may come from spouts or from other bolts. The diagram shows a simple topology consisting of one spout and two bolts. The spout gets the data from an external data source and produces a stream of tuples. The first bolt takes the output tuples from the spout and processes them to produce another set of tuples. The second bolt takes the output tuples from bolt 1 and writes them to an output stream. Now, we will understand the functioning of the Storm spout.

2.12 Storm Spout

A spout is the component of Storm that ingests data and creates the stream of tuples for processing by the bolts. A spout creates a stream of tuples from the input and automatically serializes the output data. It can get data from other systems such as Kafka, Twitter, and RabbitMQ. Spout implementations are available for popular message producers such as Kafka and Twitter. A single spout can produce multiple output streams, and each output stream can be consumed by one or more bolts. The diagram shows a Twitter spout that gets Twitter posts from a Twitter feed and converts them into a stream of tuples. It also shows a Kafka spout that gets messages from a Kafka server and produces a stream of message tuples. Next, let us look at how a Storm bolt functions.
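As a brief aside, wiring one of these ready-made spout implementations into a topology might look like the following hedged sketch; it assumes the storm-kafka-client module's KafkaSpout and KafkaSpoutConfig classes, and the broker address and topic name are placeholders.

```java
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaSpoutWiring {
    public static void main(String[] args) {
        // Placeholder broker and topic; each Kafka message becomes one tuple downstream.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka-broker:9092", "tweets").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);
    }
}
```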

2.13 Storm Bolt

A Storm bolt processes tuples from spouts and outputs them to external systems or other bolts. The processing logic in a bolt can include filters, joins, and aggregations. Filter examples include processing only records with STATUS = GOOD, or only records with volume > 100. Aggregation examples include calculating the sum of sale amounts or the maximum stock value. A bolt can process any number of input streams. Input data is deserialized and output data is serialized. Streams are processed as tuples; bolts run in parallel and can be distributed across machines in the Storm cluster. The diagram shows four bolts running in parallel. Bolt 1 produces output that is received by both bolt 3 and bolt 4. Bolt 2 produces output that is received by both bolt 3 and bolt 4. Bolt 3 stores its output to a Cassandra database, whereas bolt 4 stores its output to Hadoop storage. Next, we will explore the functioning of a Storm topology.
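Before moving on, here is a minimal, hedged sketch of the STATUS = GOOD filter mentioned above, assuming the Storm 2.x Java API and the stock schema from earlier; the class name is illustrative.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class GoodStatusFilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Filter: pass a tuple downstream only when its status field is GOOD.
        if ("GOOD".equals(input.getStringByField("status"))) {
            collector.emit(new Values(input.getStringByField("ticker"),
                                      input.getIntegerByField("year"),
                                      input.getIntegerByField("value"),
                                      input.getStringByField("status")));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ticker", "year", "value", "status"));
    }
}
```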

2.14 Storm Topology

A group of spouts and bolts running in a Storm cluster forms the Storm topology. Spouts and bolts run in parallel, and there can be multiple spouts and bolts. The topology determines how the output of a spout is connected to the input of bolts, and how the output of a bolt is connected to the input of other bolts. The diagram shows a Storm topology with one spout and five bolts. The output of spout 1 is processed by three bolts: bolt 1, bolt 2, and bolt 3. Bolt 4 gets its input from bolt 2 and bolt 3. Bolt 5 gets its input from bolt 1. This represents the Storm topology. The input to spout 1 comes from an external data source. The output from bolt 4 goes to output 1 and the output from bolt 5 goes to output 2. Let us understand this better with the help of an example.

2.15 Storm Example

Let us illustrate Storm with an example.
• Problem: Stock market data continuously sent by an external system should be processed so that records with GOOD status are inserted into a database, whereas records with BAD status are written to an error log.
• Storm solution: The topology has one spout and two bolts. The spout gets the data from the external system and converts it into a stream of tuples. These tuples are processed by two bolts: those with status GOOD are processed by bolt 1, and those with status BAD are processed by bolt 2. Bolt 1 saves the tuples to a Cassandra database, and bolt 2 saves the tuples to an error log file.
The diagram shows the Storm topology for this solution. There is one spout that gets the input from the external data source. Bolt 1 processes the tuples from the spout and stores the tuples with GOOD status to Cassandra. Bolt 2 processes the tuples from the spout and stores the tuples with BAD status to the error log. Next, let us understand serialization and deserialization.

2.16 Serialization-Deserialization

Serialization is the process of converting data structures or objects into a platform-independent stream of bytes. It is used to store data on disk or in memory and to transmit data over a network. The purpose of serialization is to make data readable by other programs. For example, the object (name: char(10), id: integer) may hold the data {'John', 101}. This can be serialized as the bytes of the string "John" followed by a null character and then 0x65, the hexadecimal representation of 101. To read the data, programs have to reverse the serialization process; this is called deserialization. Serialization-deserialization is also known as SerDe. Now we will go through the steps involved in submitting a job to Storm.
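As a plain-Java illustration (not specific to Storm), the (name, id) example above could be serialized and deserialized along the lines of this hedged sketch; the byte format and field order are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerDeExample {
    public static void main(String[] args) throws IOException {
        // Serialization: write the fields as a platform-independent byte sequence.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("John");   // name
        out.writeInt(101);      // id
        out.flush();

        // Deserialization: the reader reverses the process in the same field order.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        String name = in.readUTF();
        int id = in.readInt();
        System.out.println(name + ", " + id);   // prints: John, 101
    }
}
```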

2.17 Submitting a Job to Storm

A job is submitted to the Nimbus process. To submit a job, you need to:
1. Create the spout and bolt functions
2. Build a topology using the spouts and bolts
3. Configure parallelism
4. Submit the topology to Nimbus
The diagram shows a Java program fragment for submitting a job to Storm. It first gets a Storm topology builder object. Next, it sets the spout and bolt for the topology with the setSpout and setBolt methods. It also connects the output of the spout to the input of the bolt using the shuffleGrouping method; shuffleGrouping is a type of input grouping that we will discuss in a subsequent lesson. After setting the spout and bolt, the program sets the number of workers for the job to 2, which is the number of worker processes that will run in parallel. Finally, using the submitTopology method, the topology is submitted to the Nimbus process. Now, finally, we will look at the types of topologies.
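The fragment described above might look like the following hedged reconstruction, assuming the Storm 2.x API; it reuses the LineSpout and LogTypeBolt sketched in the sample program section, and the topology name "log-topology" is illustrative.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitLogTopology {
    public static void main(String[] args) throws Exception {
        // 1. and 2. Build the topology from the spout and bolt.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("line-spout", new LineSpout());
        builder.setBolt("log-type-bolt", new LogTypeBolt())
               .shuffleGrouping("line-spout");   // connect spout output to bolt input

        // 3. Configure parallelism: two worker processes will run this job.
        Config conf = new Config();
        conf.setNumWorkers(2);

        // 4. Submit the topology to the Nimbus process.
        StormSubmitter.submitTopology("log-topology", conf, builder.createTopology());
    }
}
```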

2.18 Types of Topologies

Storm supports different types of topologies:
1. Simple topology: It consists of simple spouts and bolts. The output of the spouts is sent to the bolts.
2. Transactional topology: It guarantees that each tuple is processed exactly once, even if the system goes down in the middle; this is an abstraction over the simple topology provided by the Storm libraries. This topology has been deprecated and replaced by the Trident topology in current versions of Storm.
3. Distributed RPC (DRPC): This topology provides libraries to parallelize any computation using Storm.
4. Trident topology: It is a higher-level abstraction over Storm, that is, libraries that run on top of the Storm libraries to support additional functionality. Trident provides spouts and bolts with much higher-level functionality, such as transaction processing and batch processing.
We have come to the end of this lesson. Now, let's do a small quiz to test your knowledge.

2.20 Quiz

A few questions will be presented on the following screens. Select the correct option and click submit to see the feedback.

2.21 Summary

Let us summarize the topics covered in this lesson.
1. Storm is used for processing streams of data.
2. The Storm data model consists of tuples and streams.
3. Storm consists of spouts and bolts.
4. Spouts create tuples that are processed by bolts.
5. Spouts and bolts together form the Storm topology.
6. Storm follows a master-slave architecture.
7. The master process is called Nimbus and the slave processes are called supervisors.
8. Data processing is done by the workers, which are monitored by the supervisors.

2.22 Conclusion

This concludes the lesson: Introduction to Storm. In the next lesson, we will look at Storm installation and configuration.
