Apache Spark Tutorial - Introduction

Welcome to the first chapter of the Apache Spark and Scala tutorial (part of the Apache Spark and Scala course). This chapter will explain the need, features, and benefits of Spark. It will also compare Spark with the traditional Hadoop Ecosystem.

Let us explore the objectives of this lesson in the next section.


After completing this lesson, you will be able to:

  • Describe the limitations of MapReduce in Hadoop

  • Compare batch vs. real-time analytics

  • Describe the applications of stream processing and in-memory processing

  • Explain the features and benefits of Spark

  • Explain how to install Spark in standalone mode

  • Compare Spark vs. the Hadoop Ecosystem

In the next section of this Apache Spark tutorial, we will begin with the evolution of distributed systems.

Evolution of Distributed Systems

The evolution of distributed systems went through several stages.

In the early days, computers were expensive and bulky. They were accessible only to professional users in the research labs of industry and universities, and jobs took a long time to execute. Various new concepts were introduced to increase CPU power and utilization, which enabled multiprogramming, automated job scheduling, and the processing of long-running jobs.

After that, LAN and WAN were introduced. LAN allowed local computers within a campus or building to exchange information at rates of about 10, 100, or 1000 Mbps, while WAN allowed far-apart computers to exchange information at rates ranging from about 56 Kbps up to 2, 34, 155, or 620 Mbps.

Then came distributed computing, which allowed geographically dispersed computers on a network to work together as if they were in a single environment. You may find it in different implementations; in some of them, the computing entities only pass messages to each other.

In the next section of this Apache Spark tutorial, we will discuss the need for new-generation distributed systems.


The Need for New Generation Distributed Systems

We need a new generation of distributed systems. Let’s understand why. Nowadays, it is rare to find an organization that depends only on centralized computing.

That said, many organizations still keep a tight hold on their internal data centers and avoid all but absolutely required data distribution, sometimes because of their heavy investments in the infrastructure. Even so, data centralization is becoming less popular for various reasons.
One of those reasons is a variety of client devices. The number and variety of these devices are increasing each year, leading to a complex array of endpoints to be served.  

Another reason is Social, Mobile, and Embedded Technology, as the amount and variety of the collected data is increasing exponentially.

Landscape transformation and latency reduction are also causing data centralization to decrease. With a few exceptions like High-Frequency Trading (HFT), in which physically locating servers in a single location can lower latency, leveraging distributed computing technology with parallel processing techniques transforms the landscape and reduces latency.
We will discuss the limitations of MapReduce in Hadoop in the next section of this Apache Spark tutorial.

Limitations of MapReduce in Hadoop

MapReduce as used in Hadoop is unsuitable for many workloads; in particular, it is not a good choice for real-time processing. Because it is batch-oriented, it is executed as periodic jobs that take time to process the data and return results. A job takes minutes to complete, depending mainly on the amount of data and the number of nodes in the cluster.

MapReduce is also cumbersome for trivial operations such as filters and joins. To implement them, you need to rewrite them as MapReduce jobs, which becomes complex because of the key-value pattern that mapper and reducer code must follow.

In addition, MapReduce does not work well when large amounts of data must move across the network. It works on the data-locality principle and hence performs well on the node where the data actually resides, but it is not a good option when processing requires a lot of data to be shuffled over the network: copying the data takes a long time and may cause bandwidth issues.

We will continue discussing the limitations of MapReduce in Hadoop in the next section of this Apache Spark tutorial.

Limitations of MapReduce in Hadoop (contd.)

MapReduce is also unsuitable for OLTP workloads, which include a large number of short transactions. Because it runs on a batch-oriented framework, it cannot deliver the second or sub-second latency that such workloads require.

Another limitation involves the NameNode, which keeps about 600 bytes of metadata per file, as estimated by Yahoo. This means that with too many files, the NameNode’s memory can become a bottleneck.
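To see why this matters, here is a back-of-the-envelope calculation in Scala. The per-file figure comes from the Yahoo estimate quoted above; the file count is a made-up example, not a measured number:

```scala
// Rough arithmetic behind the NameNode limitation: at roughly 600 bytes of
// metadata per file (the Yahoo estimate quoted above), the file count
// translates directly into NameNode memory. The file count below is a
// hypothetical example.
val bytesPerFile = 600L
val files = 100000000L // one hundred million small files

val metadataBytes = files * bytesPerFile
val metadataGB = metadataBytes / 1000000000L // decimal gigabytes

// With this many small files, tens of gigabytes of metadata must fit in the
// NameNode's heap.
```

At that scale, metadata alone reaches about 60 GB, which illustrates why workloads with very many small files strain the NameNode.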

Additionally, MapReduce is unfit for processing graphs. Graphs represent structures used to explore relationships between various points, for example, finding common friends on social media sites like Facebook. Hadoop offers the Apache Giraph library for such cases; however, running it on top of MapReduce adds complexity.

Another important limitation is its unsuitability for the iterative execution of programs. Some use cases, like K-means clustering, need such execution, where data is processed again and again to refine the results. Because MapReduce execution is stateless, every iteration starts from scratch, rereading its input each time.
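As a concrete illustration of iterative execution, here is a minimal one-dimensional k-means sketch on a plain Scala collection (the points and starting centroids are made up). Each pass consumes the previous pass's centroids; this is exactly the state that a stateless MapReduce job would have to reread from disk on every iteration, while Spark can keep the working set in memory:

```scala
// A minimal 1-D k-means sketch on a local collection. Each pass refines the
// centroids using the previous pass's result -- the kind of repeated,
// state-carrying computation that stateless MapReduce restarts from disk
// each time. Data and starting centroids are made up for illustration.
val points = Seq(1.0, 1.5, 2.0, 10.0, 10.5, 11.0)

def step(centroids: Seq[Double]): Seq[Double] =
  points
    .groupBy(p => centroids.minBy(c => math.abs(c - p))) // assign to nearest centroid
    .values
    .map(cluster => cluster.sum / cluster.size)          // recompute each centroid
    .toSeq
    .sorted

// Iterate until the centroids stop moving between passes.
val finalCentroids =
  Iterator.iterate(Seq(0.0, 5.0))(step)
    .sliding(2)
    .collectFirst { case Seq(prev, next) if prev == next => next }
    .get
```

Here the loop converges to one centroid per natural cluster; the point is only that every iteration depends on the output of the previous one.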

In the next section of this Apache Spark tutorial, we will discuss batch vs. real-time processing.

Batch vs. Real-Time Processing

Let us compare batch and real-time processing for enterprise use cases.

In batch processing, a large amount of data or transactions is processed in a single run over a period of time. The associated jobs generally run entirely without manual intervention, and the entire data set is pre-selected and fed in using command-line parameters and scripts. It is typically used to execute multiple operations, handle heavy data loads, generate reports, and run offline data workflows. An example is generating daily or hourly reports to support decision making.

Real-time processing, on the other hand, takes place instantaneously upon data entry or command receipt, and it must execute within stringent response-time constraints. An example is fraud detection.

Note that the Hadoop ecosystem has different subsystems, such as Pregel, Giraph, S4, and Drill, for different business use cases. It would be better to have just one processing framework to solve all these use cases.

In the next section of this Apache Spark tutorial, we will discuss applications of stream processing.

Application of Stream Processing

Stream processing fits well for applications showing three characteristics.  

Let’s first talk about compute intensity, which is defined as the number of arithmetic operations per global memory or Input/Output reference. Today, in various signal processing applications, this intensity is well above 50:1, and it keeps increasing with the complexity of algorithms.
The next characteristic is data parallelism, which exists in a kernel when the same function is applied to all records of an input stream and multiple records can be processed simultaneously, without any record waiting for the results of previous ones.
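A small sketch can make data parallelism concrete. This is plain Scala, not a stream processing engine: a made-up per-record "kernel" is applied to each sample independently, and `Future.traverse` runs the per-record work concurrently on a thread pool:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Data parallelism: the same function is applied to each record of an input
// stream, and records are processed independently -- no record waits on the
// result of a previous one. The "kernel" and samples are made up.
val samples = Vector(1, 2, 3, 4, 5)

def kernel(x: Int): Int = x * x // independent per-record computation

val processed = Await.result(
  Future.traverse(samples)(x => Future(kernel(x))),
  10.seconds
)
// The result order matches the input order, even though the per-record work
// may have run in parallel.
```

Because no record's computation reads another record's result, the runtime is free to schedule them in any order or all at once, which is precisely what makes such kernels stream-friendly.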

Data locality is the third characteristic. It is a particular type of temporal locality common in media and signal processing applications, in which data is produced once, read once or twice, and then never read again.

The stream processing programming model captures this locality directly through the intermediate streams passed between kernels and the data kept local within kernel functions.

In the next section of this Apache Spark tutorial, we will discuss applications of in-memory processing.

Application of In-Memory Processing

With the advent of column-centric databases, similar information can be stored together, allowing data to be stored with greater compression and efficiency.

They also permit storing larger amounts of data in the same space, which reduces the amount of memory needed to perform a query and increases processing speed.

In an in-memory database, the entire information is loaded into memory, eliminating the need for indexes, aggregates, optimized databases, star schemas, and cubes.

With in-memory tools, compression algorithms can be applied that shrink the in-memory footprint, even below the size the data occupies on disk.
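The reason column-centric storage compresses so well is that similar values sit next to each other. As a rough illustration, assuming a hypothetical column of country codes, a simple run-length encoding collapses each run of repeated values to a (value, count) pair:

```scala
// Column-centric storage keeps similar values together, which is what makes
// simple compression schemes effective. This sketch run-length encodes a
// made-up column of repeated values; row-interleaved data would not exhibit
// such runs.
val countryColumn = Seq("DE", "DE", "DE", "US", "US", "IN")

// Run-length encoding: collapse each run of equal values to (value, count).
def rle[A](xs: Seq[A]): Seq[(A, Int)] =
  xs.foldLeft(List.empty[(A, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend current run
    case (acc, x)                      => (x, 1) :: acc      // start a new run
  }.reverse

val encoded = rle(countryColumn)
```

Six stored values become three pairs here; real columnar formats use more sophisticated schemes, but the underlying idea is the same.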

Querying data loaded into memory is different from caching, and it helps avoid performance bottlenecks and slow database access. Caching is a popular method for speeding up query performance, in which caches are predefined subsets of very specifically organized data.

With in-memory tools, analysis can be flexible in size, and the data can be accessed within seconds by concurrent users with excellent analytics potential, because the data lies entirely in memory. In theoretical terms, this yields data access that is 10,000 to 1,000,000 times faster than disk access.

In addition, it reduces the performance-tuning work needed from IT staff and thereby provides faster data access for end users.

With in-memory processing, several vendors also make it possible to access visually rich dashboards and existing data sources. This allows end users and business analysts to create customized queries and reports without extensive expertise or training.

In the next section of this Apache Spark tutorial, we will begin with an introduction to Apache Spark.

Introduction to Apache Spark

Apache Spark is an open-source cluster computing framework that was initially developed at UC Berkeley in the AMPLab.

Compared to Hadoop’s disk-based, two-stage MapReduce, Spark provides up to 100 times faster performance for certain applications by using in-memory primitives.

This makes it well suited to machine learning algorithms, as it allows programs to load data into a cluster’s memory and query it repeatedly.

A Spark project contains various components, such as Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX.

In the next section of this Apache Spark tutorial, we will discuss the components of a Spark project.

Components of a Spark Project

Following are the components of a Spark Project.

Spark Core and RDDs

The first component, Spark Core and RDDs, is the foundation of the entire project. It provides basic Input/Output functionality, distributed task dispatching, and scheduling. RDDs are the basic programming abstraction: a collection of data that is logically partitioned across machines.

RDDs can be created by applying coarse-grained transformations on existing RDDs or by referencing external datasets. Examples of these transformations are map, filter, join, and reduce.

The RDD abstraction is exposed much like in-process, local collections through a language-integrated API in Python, Java, and Scala. As a result, programming complexity is reduced, because the way applications manipulate RDDs is similar to manipulating local collections of data.
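This similarity can be seen with a short Scala sketch on a plain local collection (the input lines are made up). With a real SparkContext you could run essentially the same chain against `sc.parallelize(lines)`, since the RDD operators keep the same names and shapes:

```scala
// Transforming an RDD looks like transforming a local collection. This sketch
// uses a plain Scala collection so it runs without Spark; against an RDD the
// operators (filter, map, reduce) have the same names. Input data is made up.
val lines = Seq("spark core", "spark sql", "hadoop mapreduce")

val sparkLineLengths =
  lines
    .filter(_.contains("spark")) // keep only records mentioning "spark"
    .map(_.length)               // transform each record to its length

val totalLength = sparkLineLengths.reduce(_ + _) // aggregate the results
```

On an RDD, `filter` and `map` would be lazy transformations and `reduce` an action that triggers the computation; the code a developer writes, however, reads the same either way.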

Spark SQL 

Spark SQL is a component that lies on top of Spark Core. It introduces SchemaRDD, a new data abstraction that supports semi-structured and structured data. This abstraction can be manipulated in Java, Scala, and Python through a domain-specific language provided by Spark SQL. In addition, Spark SQL supports SQL through ODBC/JDBC server and command-line interfaces.

Spark Streaming 

The next component, Spark Streaming, leverages the fast scheduling capability of Spark Core for streaming analytics: it ingests data in small batches and performs RDD transformations on them. With this design, the same application code written for batch analytics can be used for streaming analytics on a single engine.
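The micro-batch idea can be sketched in a few lines of plain Scala (this is an analogy, not Spark Streaming itself; the event values are made up). One aggregation function is applied both to the whole dataset, batch-style, and to micro-batches of three records, showing that the two styles share the same code:

```scala
// Spark Streaming ingests data in small batches and runs the same
// transformations on each batch that a batch job runs on a full dataset.
// This toy sketch applies one function both ways. The events are made up.
val events = (1 to 10).toVector

def sumOfEvens(batch: Seq[Int]): Int = batch.filter(_ % 2 == 0).sum

val batchResult = sumOfEvens(events) // classic batch job over all the data

val streamingResult =
  events
    .grouped(3)      // micro-batches, as a streaming engine would form them
    .map(sumOfEvens) // identical application code applied per micro-batch
    .sum             // combine the per-batch results
```

Because `sumOfEvens` is written once and reused in both paths, there is a single piece of application logic to test and maintain, which is the design benefit the text describes.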

Machine Learning Library

The Machine Learning Library (MLlib) lies on top of Spark and is a distributed machine learning framework. With its memory-based architecture, it is nine times faster than the Hadoop disk-based version of Apache Mahout, and it performs even better than Vowpal Wabbit. It implements various common statistical and machine learning algorithms.


GraphX

The last component, GraphX, also lies on top of Spark and is a distributed graph processing framework. For graph computation, it provides an API and an optimized runtime for the Pregel abstraction, which the API can also model.

In the next section of this Apache Spark tutorial, we will discuss the history of Spark.

History of Spark

As discussed, Spark was started at the UC Berkeley AMPLab by Matei Zaharia in 2009. It was open sourced in 2010 under a BSD license. The project was then donated to the Apache Software Foundation, and the license was changed to Apache 2.0 in 2013.

In February 2014, Spark became an Apache Top-Level Project. In November of the same year, the engineering team at Databricks used it to set a world record in large-scale sorting. Databricks now provides commercial support and certification for Spark.

At present, Spark exists as a next-generation real-time and batch processing framework. In the next section of this Apache Spark tutorial, we will discuss language flexibility in Spark.

Language Flexibility in Spark

We have already discussed that Spark delivers strong performance, which gives developers a markedly better experience. Spark is preferred over MapReduce mainly for its performance advantages and versatility.

Apart from this, another critical advantage is its development experience. Language flexibility is an important benefit that we will discuss here: Spark supports various development languages such as Java, Scala, and Python, and will likely support R as well.

In addition, Spark offers the capability to define functions inline. With the temporary exception of Java, a common element of these languages is that they provide methods for expressing operations using lambda functions and closures.

Using closures, you can define functions inline with the application’s core logic, which helps create easy-to-comprehend code and preserve application flow.
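A short Scala sketch shows what "inline with the core logic" means in practice. The lambda passed to `filter` is defined right where it is used and closes over the local `threshold` value (all values here are made up for illustration):

```scala
// The inline function passed to filter is a closure: it captures the local
// `threshold` value from the surrounding scope, so the filtering rule lives
// right inside the application flow. Values are made up for illustration.
val threshold = 100 // captured by the closure below
val responseTimesMs = Seq(35, 250, 80, 410, 90)

val slowRequests = responseTimesMs.filter(t => t > threshold) // inline lambda

val slowCount = slowRequests.size
```

Contrast this with classic MapReduce, where the same rule would live in a separate Mapper class, away from the flow of the program that uses it.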
We will discuss the Spark execution architecture in the next section of this Apache Spark tutorial.

Spark Execution Architecture

The components of Spark Execution Architecture are listed below.  

Spark-submit script

The spark-submit script is used to launch applications on a Spark cluster. It can use all of the cluster managers supported by Spark through a uniform interface, so you do not have to configure your application specially for each one.

Spark applications

These applications run as independent sets of processes on a Spark cluster, coordinated by the SparkContext object in the driver program, which is your main program.

Cluster Managers

SparkContext can connect to different cluster managers, which are of three types:

  • Standalone,

  • Apache Mesos, and

  • Hadoop YARN

A standalone cluster manager is a simple one that makes setting up a cluster easy. Apache Mesos is a general cluster manager that is also capable of running service applications and MapReduce. On the other hand, Hadoop YARN is the resource manager in Hadoop 2.

Spark’s EC2 launch scripts

Spark’s EC2 launch scripts make launching a standalone cluster easy on Amazon EC2.

In the next section of this Apache Spark tutorial, we will discuss automatic parallelization of complex flows.

Automatic Parallelization of Complex Flows

Let’s now talk about the next feature of Spark, automatic parallelization of complex flows.
With MapReduce, it is your task to parallelize the sequence of jobs in a complex pipeline, and a workflow scheduler such as Apache Oozie is generally required to construct this sequence carefully.

Using Spark, the series of individual tasks is expressed as a single program flow. This flow is lazily evaluated so that the system has a complete picture of the execution graph. This approach lets the core scheduler correctly map the dependencies between the application’s stages, and the flow of operators is parallelized automatically without any intervention.

This capability also lets the engine apply certain optimizations with little extra burden. The diagram above gives an example of such a job and shows how this parallelization works.
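Scala's lazy views behave analogously to this lazy evaluation and can serve as a small illustration (this is plain Scala, not Spark itself). The side-effect counter shows that defining the transformations does no work; only forcing the result runs the whole flow, at which point the system has seen the complete pipeline:

```scala
// Spark evaluates the operator flow lazily: transformations only describe the
// computation, and nothing runs until a result is demanded, giving the
// scheduler the whole execution graph up front. Scala views are used here as
// an analogy. The counter proves no work happens at definition time.
var applied = 0

val pipeline = (1 to 5).view
  .map { x => applied += 1; x * 10 } // a "transformation": recorded, not run
  .filter(_ > 20)                    // another step added to the plan

val definedEagerly = applied // still 0: the plan exists, the work does not

val result = pipeline.toList // forcing the view runs the whole flow at once
```

In Spark the same principle means the scheduler sees `map` and `filter` together before anything executes, so it can fuse and parallelize them as one stage.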

We will continue discussing automatic parallelization in the next section of this Apache Spark tutorial.

Automatic Parallelization of Complex Flows-Important Points

An important point about this structure is that every application gets its own executor processes, which run tasks in multiple threads and stay alive for the duration of the entire application. While isolating applications this way has benefits on both the scheduling and executor sides, it also means that data cannot be shared across Spark applications without writing it to an external storage system.

Another feature is that Spark is agnostic to the underlying cluster manager. As long as Spark can acquire executor processes and these can communicate with each other, it is relatively easy to run Spark even on a cluster manager that also supports other applications, such as YARN or Mesos.

Note that the driver program must listen for and accept connections from its executors at all times; in other words, it must be network-addressable from the worker nodes. Because the driver schedules the tasks on the cluster, it should run close to the worker nodes, preferably on the same local network.

If you want to send requests to the cluster remotely, it is better to open an RPC connection to the driver and have it submit operations from nearby than to run the driver far away from the worker nodes.
In the next section of this Apache Spark tutorial, we will discuss APIs that match user goals.

APIs That Match User Goals

As a developer, when you work with MapReduce, you are generally forced to combine basic operations into custom Mapper/Reducer jobs, because there is no built-in feature to streamline this process. Therefore, some developers turn to higher-level APIs for writing their MapReduce jobs, provided by frameworks such as Cascading and Apache Crunch.

Spark, on the other hand, provides a powerful and ever-growing library of operators. There are over 80 operators available in Spark. While a few of them provide operations equivalent to MapReduce operations, the others are higher level and let you write much more concisely.

Note that many high-level operators are also available in scripting frameworks such as Apache Pig, but Spark lets you access them in the context of a full programming language. As a result, you can use functions, classes, and control statements, as in a typical programming environment.

In the next section of this Apache Spark tutorial, we will discuss Spark as a unified platform for big data apps.

Apache Spark-A Unified Platform of Big Data Apps

When it comes to speed, Spark extends the MapReduce model to support computations like stream processing and interactive queries. Speed is critical when processing large datasets, as it means the difference between waiting hours or minutes for results and exploring the data interactively.

Spark supports running computations in memory, and its engine is more efficient than MapReduce even for complex applications running on disk. These capabilities account for Spark’s speed.

Spark covers various workloads that used to require separate distributed systems, including streaming, iterative algorithms, and batch applications. Because these workloads are supported on the same engine, it is easy to combine different processing types, as production data analysis pipelines normally require, and the burden of managing separate tools is reduced.
Spark can create distributed datasets from any file stored in the Hadoop Distributed File System (HDFS) or any other supported storage system. Note that Spark does not require Hadoop; it simply supports storage systems that implement the Hadoop APIs. It supports SequenceFiles, Parquet, Avro, text files, and all other Hadoop Input/Output formats.

Now, why does unification matter? Unification not only lets developers learn just one platform but also allows users to take their apps everywhere.

The graphic above shows the apps and systems that can be combined in Spark. 
In the next section, we will discuss more benefits of Apache Spark.

More Benefits of Apache Spark

A Spark project includes various closely integrated components for distributing, scheduling, and monitoring applications with many computational tasks across a computing cluster of worker machines.

Spark’s core engine is fast and general purpose. As a result, it powers various higher-level components specialized for different workloads, such as machine learning or SQL, and these components interoperate closely.

Another important benefit of this tight integration is the ability to create applications that easily combine different processing models; for example, an application can use machine learning to categorize data in real time as it is ingested from streaming sources.

Additionally, analysts can query the resulting data via SQL, while data scientists and engineers can access the same data through the Python shell for ad hoc analysis and in standalone batch applications.

For all this, the IT team needs to maintain just one system.
In the next section of this Apache Spark tutorial, we will discuss the different modes of running Spark.

Running Spark in Different Modes

The different deployment modes of Spark are as follows. Standalone mode is a simple one that can be launched manually, by using launch scripts, or by starting a master and workers; it is usually used for development and testing.

Spark can also run on hardware clusters managed by Mesos. Running Spark in this mode brings advantages such as scalable partitioning among different Spark instances and dynamic partitioning between Spark and other frameworks.

Running Spark on YARN gives you all the parallel processing capability and other benefits of a Hadoop cluster, while running Spark on EC2 gives you the benefits of Amazon’s cloud infrastructure.

Overview of Spark on a Cluster

The steps listed below depict how Spark applications run on a cluster.

These applications run independently as separate sets of processes on a cluster, coordinated by the SparkContext object in your main program:

  • For running on a cluster, the SparkContext can connect to different types of cluster managers (discussed earlier), which allocate resources across applications.

  • Once connected, Spark acquires executors on nodes in the cluster. Executors are the processes that run computations and store data for applications.

  • Spark then sends the application code to the executors.

  • Finally, SparkContext sends tasks for the executors to run.
In the next section of this Apache Spark tutorial, we will discuss the tasks of Spark on a cluster.

Tasks of Spark on a Cluster

One of the tasks performed on a Spark cluster is submitting applications of any type, which is done using the spark-submit script.

Another task is monitoring. Every driver program has a web-based UI, usually on port 4040, that shows information about storage usage, executors, and running tasks. To access it, open the driver node’s address on port 4040 in a browser.

You can also schedule jobs on a cluster as Spark provides control over resource allocation both across and within applications.

Companies Using Spark-Use Cases

Companies like NTTDATA, Yahoo, GROUPON, NASA, Nokia, and more are using Spark for creating applications for different use cases.

These use cases include stream processing of network machine data, performing Big Data analytics for subscriber personalization and profiling in the telecommunications domain, and running the Big Content platform, a B2B content asset management service that provides an aggregated and searchable source of public-domain media, live news feeds, and archived content.

A few more use cases are building data intelligence and eCommerce solutions in the retail industry and analyzing and visualizing patterns in large-scale recordings of brain activity.

In the next section of this Apache Spark tutorial, we will discuss the Hadoop Ecosystem vs. Apache Spark.

Hadoop Ecosystem vs. Apache Spark

The Hadoop Ecosystem allows storing large files across multiple machines. It uses MapReduce for batch analytics, which is straightforward because it is distributed in nature.

In Hadoop, third-party support is also available; for example, with Talend ETL tools, various batch-oriented workflows can be designed. In addition, it supports Pig and Hive queries, so non-Java developers can prepare batch workflows using SQL-like scripts.

Apache Spark, on the other hand, supports both real-time and batch processing.

We will continue our discussion on Hadoop Ecosystem vs Apache Spark.

Hadoop Ecosystem vs. Apache Spark (contd.)

You can perform every type of data processing using Spark that you execute in Hadoop. 

Batch Processing: For batch processing, Spark batch can be used over Hadoop MapReduce.

Structured Data Analysis: For Structured Data Analysis, Spark SQL can be used using SQL.

Machine Learning Analysis: For Machine Learning Analysis, Machine Learning Library can be used for clustering, recommendation, and classification.

Interactive SQL Analysis: For interactive SQL analysis, Spark SQL can be used instead of Stinger, Tez, or Impala.

Real-time Streaming Data Analysis: For real-time streaming data analysis, Spark Streaming can be used instead of a specialized library like Storm.


Let us summarize the topics covered in this lesson:

  • Data centralization is becoming less popular because of various reasons such as a variety of client devices.

  • MapReduce in Hadoop has many limitations; for example, it is unsuitable for real-time processing.

  • Apache Spark is an open-source cluster computing framework.

  • The components of a Spark project are Spark Core and RDDs, Spark SQL, Spark Streaming, MLlib, and GraphX.

  • Spark is popular for its performance benefits over MapReduce. Another important benefit is language flexibility.

  • The components of the Spark execution architecture are the spark-submit script, Spark applications, SparkContext, cluster managers, and EC2 launch scripts.

  • The different advantages of Spark are speed, combination, unification, and Hadoop support.

  • The different deployment modes of Spark are standalone, on Mesos, on YARN, and on EC2.

  • Companies like NTTDATA, Yahoo, GROUPON, NASA, Nokia, and more are using Spark for creating applications for different use cases.

  • You can perform every type of data processing using Spark that you execute in Hadoop.


With this, we come to the end of the 1st chapter “Introduction to Spark” of the Apache Spark and Scala course.

The next chapter is Introduction to Programming in Scala.
