Apache Storm is a real-time stream processing system, and in this Apache Storm tutorial, you will learn all about it, its data model, architecture, and components.

Features of Apache Storm

Following are the features of Apache Storm.

  • It is an open source and a part of Apache projects.
  • It helps to process big data.
  • It is a fast and reliable processing system.
  • It can ingest high volume and high-velocity data.
  • It is highly parallelizable, scalable, and fault-tolerant.

In the next section of apache storm tutorial, we will discuss Uses of Storm.

Uses of Storm

Storm provides the computation system that can be used for real-time analytics, machine learning, and unbounded stream processing. It can take continuously produced messages and can output to multiple systems.  

In the next section of apache storm tutorial, let us understand what a stream is.

What is a Stream

In computing, a stream represents a continuous sequence of bytes of data. It is produced by one program and consumed by another. It is consumed in the ‘First In First Out’ (or FIFO) sequence. That means if 12345 is produced by one program, another program consumes it in the order 12345 only. It can be bounded or unbounded.

Bounded means that the data is limited. Unbounded means that there is no limit and the producer will keep producing the data as long as it runs and the consumer will keep consuming the data. A Linux pipe is an example of a stream.

The Linux command is cat logfile | wc -l (cat space logfile vertical bar wc hyphen l)

In this command, cat logfile command produces a stream that is consumed by wc -l command to display the number of lines in the file.

Big Data Engineer Master's Program

Master All the Big Data Skill You Need TodayEnroll Now
Big Data Engineer Master's Program

Industry Use Cases for STORM

Many industries can use Storm for real-time big data processing such as:

  1. Credit card companies can use it for fraud detection on swipe.
  2. Investment banks can use it for trade pattern analysis in real time.
  3. Retail stores can use it for dynamic pricing.
  4. Transportation providers can use it for route suggestions based on traffic data.
  5. Healthcare providers can use it for the monitoring of ICU sensors.
  6. Telecom organizations can use it for processing switch data.

In the next section of apache storm tutorial, let us look at the Storm data model.

STORM Data Model

Storm data model consists of tuples and streams.

  • Tuple

A tuple is an ordered list of named values similar to a database row. Each field in the tuple has a data type that can be dynamic. The field can be of any data type such as a string, integer, float, double, boolean or byte array. User-defined data types are also allowed in tuples.

For example, for the stock market data, if the schema is in the ticker, year, value, and status format, then some tuples can be ABC, 2011, 20, GOOD ABC, 2012, 30, GOOD ABC, 2012, 32, BAD XYZ, 2011, 25, GOOD.

  • Stream

A stream of Storm is an unbounded sequence of tuples.

For example, if the above tuples are stored in a file stocks.txt format, then the command cat stocks.txt produces a stream. If the process is continuously putting data into stocks.txt format, then it becomes an unbounded stream.

Storm Architecture

Storm has a master-slave architecture. There is a master server called Nimbus running on a single node called master node. There are slave services called supervisors that are running on each worker node. Supervisors start one or more worker processes called workers that run in parallel to process the input.

Worker processes store the output to a file system or database. Storm uses Zookeeper for distributed process coordination.

The diagram shows the Storm architecture with one master node and five worker nodes. The Nimbus process is running on the master node. There is one supervisor process running on each worker node. There are multiple worker processes running on each worker node. The workers get the input from the file system or database and store the output also to a file system or database.

apache-storm-architecture_1.

In the next section of Apache Storm tutorial, we will discuss storm processes.

Storm Processes

A Zookeeper cluster is used for coordinating the master, supervisor and worker processes. Moving on, let us explore the Nimbus process, which is the master daemon of Storm cluster.

  • Runs on only one node called the master node
  • Assigns and distributes the tasks to the worker nodes
  • Monitors the tasks
  • Reassigns tasks on node failure.

Next comes the supervisor process which is the worker daemon of Storm cluster.

  • Runs on each worker node of the cluster
  • Runs each task as a separate process called worker process
  • Communicates with Nimbus daemon using zookeeper
  • Number of worker processes for each task can be configured

Next is the worker process, that does the actual work of Storm.

  • Runs on any worker node of the cluster
  • Started and monitored by the supervisory process
  • Runs either spout or bolt tasks
  • Number of worker processes for each task can be configured

Sample Program

A Log processing program takes each line from the log file and filters the messages based on the log type and outputs the log type.

  • Input: A log file containing error, warning, and informational messages. This is a growing file getting continuous lines of log messages.
  • Output: Output type of message (ERROR or WARNING or INFO) Let us continue with the sample program.

This program given below contains a single spout and a single bolt.

The spout does the following:

Opens the file, reads each line and outputs the entire line as a tuple.

The bolt does the following: Reads each tuple from the spout and checks if the tuple contains the string ERROR or WARNING or INFO. Outputs only ERROR or WARNING or INFO.

LineSpout {

foreach line = readLine(logfile) {

emit(line)

}

LogTypeBolt(tuple) {

if(tuple contains “ERROR”) emit(“ERROR”);

if(tuple contains(“WARNING”) emit (“WARNING”);

if(tuple contains “INFO”) emit(“INFO”);

}

The spout is named LineSpout. It has a loop to read each line of input and outputs the entire line. The emit function is used to output the line as a stream of tuples. The bolt is named LogTypeBolt. It takes the tuple as input. If the line contains the string ERROR, then it outputs the string ERROR. If the line contains the string WARNING, then it outputs the string WARNING.

Similarly, if the line contains the string INFO, then it outputs the string INFO.

Next, let us explore the Storm Components.

Storm Components

Storm provides two types of components that process the input stream, spouts, and bolts. Spouts process external data to produce streams of tuples. Spouts produce tuples and send them to bolts. Bolts process the tuples from input streams and produce some output tuples. Input streams to bolt may come from spouts or from another bolt.

The diagram shows a Storm cluster consisting of one spout and two bolts. The spout gets the data from an external data source and produces a stream of tuples. The first bolt takes the output tuples from the spout and processes them to produce another set of tuples. The second bolt takes the output tuples from bolt 1 and stores them into an output stream.

apache-storm-architecture_2

Now, we will understand the functioning of Storm spout.

Storm Spout

Spout is a component of Storm that ingests the data and creates the stream of tuples for processing by the bolts. A spout can create a stream of tuples from the input, and it automatically serializes the output data. It can get data from other queuing systems like Kafka, Twitter, RabitMQ, etc. Spout implementations are available for popular message producers such as Kafka and Twitter. A single spout can produce multiple streams of output. Each stream output can be consumed by one or more bolts.

The diagram shows a twitter spout that gets twitter posts from a twitter feed and converts them into a stream of tuples. It also shows a Kafka spout that gets messages from Kafka server and produces a tuple of messages.

apache-storm-architecture_3.

Next, let us look at how Storm bolt functions.

Storm Bolt

Storm Bolt processes the tuples from spouts and outputs to external systems or other bolts. The processing logic in a bolt can include filters, joins, and aggregation.

Filter data examples include processing only records with STATUS = GOOD, processing only records with volume > 100, etc.

Aggregation examples include calculating the sum of sale amount, calculating the max stock value, etc. A bolt can process any number of input streams. Input data is deserialized, and the output data is serialized. Streams are treated as tuples; bolts run in parallel and can distribute across machines in the Storm cluster.

The diagram shows four bolts running in parallel. Bolt 1 produces the output that is received by both bolt 3 and bolt 4. Bolt 2 produces the output that is received by both bolt 3 and bolt 4. Bolt 3 stores the output to a Cassandra database whereas bolt 4 stores the output to Hadoop storage.

Next, we will explore the functioning of Storm topology.

Big Data Hadoop and Spark Developer Course (FREE)

Learn Big Data Basics from Top Experts - for FREEEnroll Now
Big Data Hadoop and Spark Developer Course (FREE)

Storm Topology

A group of spouts and bolts running in a Storm cluster form the Strom Topology. Spouts and bolts run in parallel. There can be multiple spouts and bolts. Topology determines how the output of a spout is connected to the input of bolts and how the output of a bolt is connected to the input of other bolts.

The diagram shows a Storm Topology with one spout and five bolts. The output of spout 1 is processed by three bolts: bolt 1, bolt 2 and bolt 3. Bolt 4 gets the output from bolt 2 and bolt 3. Bolt 5 gets the input from Bolt 5. This represents the storm topology. The input to spout 1 is coming from an external data source. The output from bolt 4 goes to Output 1, and the output from bolt 5 goes to output 2.

apache-storm-architecture_4

Let us understand this in a better way with the help of an example in the next section of apache storm tutorial.

Storm Example

Let us illustrate storm with an example.

Problem: The stock market data which is continuously sent by an external system should be processed, so that data with GOOD status is inserted into a database whereas data that are with BAD status is written to an error log.

STORM Solution: This will have one spout and two bolts in the topology. Spout will get the data from the external system and convert into a stream of tuples. These tuples will be processed by two bolts. Those with Status GOOD will be processed by bolt1. Those with status BAD will be processed by bolt2. Bolt1 will save the tuples to Cassandra database. Bolt2 will save the tuples to an error log file.

The diagram shows the Storm topology for the above solution. There is one spout that gets the input from an external data source. There is bolt 1 that processes the tuples from the spout and stores the tuples with GOOD status to Cassandra. There is bolt 2 that processes the tuples from the spout and stores the tuples with BAD status to an error log.

apache-storm-architecture_5.

Next, let us understand serialization and deserialization.

Serialization and Deserialization

Serialization is the process to convert data structures or objects into a platform-independent stream of bytes. It is used to store data on disk or memory and to transmit data over the network. The purpose of serialization is to make data readable by other programs.

For example, The object ( name: char(10), id : integer) may have data {‘John’, 101}. This can be serialized as “John\00x65”. This represents that the string John is followed by null character and then the hexadecimal representation of 101.

To read the data, programs have to reverse the serialization process. This is called deserialization. Serialization-Deserialization is also known as SerDe (abbreviation of Serialization-Deserialization).

Now we will go through the steps involved in submitting a job to Storm.

Submitting a Job to Storm

A job is submitted to the Nimbus process. To submit a job, you need to:

  1. Create spout and bolt functions
  2. Build topology using spout and bolts
  3. Configure parallelism
  4. Submit topology to Nimbus

builder = new TopologyBuilder();

builder.setSpout("spout", LineSpout());

builder.setBolt(“bolt", LogTypeBolt()).shuffleGrouping("spout");

conf.setNumWorkers(2); // use two worker processes

StormSubmitter.submitTopology(builder.createTopology, conf);

The diagram shows a Java program fragment for submitting a job to Storm. It first gets a Storm topology builder object. Next, it sets the spout and bolts for the topology with the setSpout and setBolt methods. It also sets the connection from the output of spout to bolt using the shuffleGrouping method.

shuffleGrouping is a type of grouping of input that we will discuss in a subsequent lesson. After setting the spout and bolt, the program sets the number of workers for the job to 2; which represents the number of workers that will run in parallel. Finally, using the submitTopology method, the topology is submitted to Nimbus process.

Now, finally, we will look at the Types of topologies.

Types of Topologies

Storm supports different types of topologies:

  • Simple topology: It consists of simple spouts and bolts. The output of spouts is sent to bolts.
  • Transactional topology: It guarantees processing of tuples only once. This is an abstraction of the above simple topology. The Storm libraries ensure that a tuple is processed exactly once even if the system goes down in the middle. This topology has become deprecated and replaced by trident topology in the current version of Storm.
  • Distributed RPC: Distributed RPC is also called DRPC. This topology provides libraries to parallelize any computation using Storm.
  • Trident topology: It is a higher level abstraction over Storm. That means it has libraries that run on top of Storm libraries to support additional functionality. Trident provides spouts and bolts that provide much higher functionality such as transaction processing and batch processing.

Summary

Here’s the summary of the Apache Storm tutorial:

  • Storm is used for processing streams of data.
  • Storm data model consists of tuples and streams.
  • Storm consists of spouts and bolts.
  • Spouts create tuples that are processed by bolts.
  • Spouts and bolts together form the Storm topology.
  • Storm follows a master-slave architecture.
  • The master process is called Nimbus, and the slave processes are called supervisors.
  • Data processing is done by workers that are monitored by supervisors.
Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop Certification Training course and get certified today.

Next Step to Success

Now that we have covered one everything about Apache storm, your next step should be mastering Big Data ecosystem in general. If you have just stepped into the world of Big Data, Simpliearn’s Apache Spark and Scala Certification should be your next stop. This certification course will help you gain most  in-demand Apache Spark skills and develop a competitive advantage for a career as a Spark Developer. On the other hand, if you are already proficient in Big Data ecosystem, becoming a Big Data engineer might be your career goal. And we have just the right course to help you reach there. Explore Simplilearn’s Big Data Engineer Master’s Program, and enroll in it right away, to get a step closer to your ambition. What are you waiting for? Start right away!

About the Author

SimplilearnSimplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.

View More
  • Disclaimer
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.