Apache Spark Architecture: A Detailed Guide

Apache Spark is an open-source cluster-computing framework that has been widely adopted across the Big Data industry. It is commonly cited as running up to 100 times faster than Hadoop MapReduce when data is processed in memory, and about ten times faster when it is processed on disk. Spark has a well-defined layered architecture in which the components and layers are loosely coupled.

Spark Features

Apache Spark has several features which make the framework highly recognised in the industry.

Speed: As noted above, Spark is considerably faster than Hadoop for large-scale data processing. It achieves this speed in part through controlled partitioning.
Real-Time: The framework offers real-time computation with low latency thanks to in-memory processing.
Powerful Caching: Spark has a simple programming layer that offers powerful caching and disk persistence capabilities.

Working Of The Apache Spark Architecture

The Apache Spark architecture has a driver program that runs the main program of an application and creates the SparkContext, which contains all the basic functionality. The Spark Driver also includes several other components, such as the block manager and the backend scheduler, which help translate user-written code into jobs that are then executed on the cluster.

The Spark Driver and SparkContext together oversee the execution of a job. The job is split into several smaller tasks that are distributed to the worker nodes. Executors on those nodes run the tasks and return the results to the SparkContext.
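The flow described above (split a job into tasks, hand them to workers, collect the results) can be sketched in plain Python. This is only a toy model and does not use Spark: the function names, the thread pool standing in for executors, and the sum-of-squares workload are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the input into roughly equal partitions, one per task."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """Executor side: the work done on one partition (a partial sum of squares)."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4, num_executors=2):
    """Toy driver: split the job into tasks, run them on workers, combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool is a hypothetical stand-in for executors on worker nodes.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    # The driver combines the per-task results into the final answer.
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)  # 385
```

In a real cluster the partitions live on different machines and the tasks are serialized and shipped to them, but the division of labour between the driver and the executors follows the same shape.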

Spark Architecture Applications

Some high-level components are part of this architecture as well. Let us look at them in more detail.

The Spark Driver: As the name suggests, the Spark driver sits in the driver's seat of the application. It controls the execution of the application and maintains all of the state of the applications running on the Spark cluster. It must interface with the cluster manager to obtain physical resources and launch executors.
The Spark Executors: Executors carry out all the tasks that the Spark driver assigns. An executor's core responsibility is to take the tasks assigned to it, run them, and report their success or failure state and results back to the driver. Every application has its own executor processes.
The Cluster Manager: The cluster manager maintains the cluster of machines that run Spark applications. It has its own "driver" and "worker" abstractions. Unlike their Spark counterparts, these are tied to physical machines rather than to processes.

Modes Of Execution

The execution mode determines where the resources described above are physically located when the application runs. There are three main modes of execution.

Cluster Mode: Cluster mode is one of the most common ways of running Spark applications. The user submits a pre-compiled JAR, Python script, or R script to the cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means the cluster manager is responsible for maintaining all processes related to the Spark application.
Client Mode: Client mode is nearly the same as cluster mode, with one difference: the Spark driver stays on the client machine from which the application was submitted. The client machine therefore maintains the Spark driver process, whereas the cluster manager maintains the executor processes.
Local Mode: In local mode, the entire Spark application runs on a single machine, and parallelism is achieved through threads on that machine. This makes it an easy way to test applications and experiment during local development. It is not recommended for running production applications.
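As a rough illustration of how the three modes differ at submission time, the sketch below builds spark-submit argument lists in Python. The helper function, the file name my_app.py, and the defaults (YARN as the cluster manager, four local threads) are assumptions for this example; only the --master and --deploy-mode options themselves come from spark-submit.

```python
def spark_submit_command(app, mode, master="yarn", local_threads=4):
    """Build a spark-submit argument list for one of the three execution modes."""
    if mode == "local":
        # Local mode: the driver and the executors are threads on a single machine.
        return ["spark-submit", "--master", f"local[{local_threads}]", app]
    if mode in ("cluster", "client"):
        # --deploy-mode decides where the driver runs:
        # inside the cluster, or on the machine that submitted the application.
        return ["spark-submit", "--master", master, "--deploy-mode", mode, app]
    raise ValueError(f"unknown execution mode: {mode}")


# my_app.py is a hypothetical application file.
for mode in ("cluster", "client", "local"):
    print(" ".join(spark_submit_command("my_app.py", mode)))
```

The only thing that changes between cluster and client mode is the value passed to --deploy-mode, which mirrors the single difference described above.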

Two Main Abstractions of Apache Spark

The architecture of Apache Spark is made up of two main abstraction layers:

Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is Spark's fundamental data structure and a vital tool for working with data. It acts as an interface to immutable data: once created, an RDD cannot be changed. If something goes wrong, for example when a node fails, Spark can recompute the lost data.
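The two properties described above, immutability and the ability to recompute lost data, can be illustrated with a small pure-Python model. ToyRDD is a hypothetical class written for this article, not Spark's RDD API; it only mimics the idea that each dataset remembers how it was derived from its parent.

```python
class ToyRDD:
    """A tiny, immutable stand-in for an RDD that remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        # Original data is stored as a tuple so it is never modified in place.
        self._source = tuple(source) if source is not None else None
        self._parent = parent        # the dataset this one was derived from
        self._transform = transform  # function applied to the parent's records

    def map(self, fn):
        # Transformations never change data in place; they return a new dataset.
        return ToyRDD(parent=self, transform=lambda records: [fn(r) for r in records])

    def filter(self, predicate):
        return ToyRDD(parent=self, transform=lambda records: [r for r in records if predicate(r)])

    def collect(self):
        # Walk the lineage back to the source and replay every transformation.
        if self._parent is None:
            return list(self._source)
        return self._transform(self._parent.collect())


numbers = ToyRDD(source=[1, 2, 3, 4, 5, 6])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
# Pretend the computed result was lost with a failed node: the lineage rebuilds it.
del first
recovered = even_squares.collect()
print(recovered)  # [4, 16, 36]
```

Because every derived dataset keeps a reference to its parent and the function that produced it, nothing needs to be copied for safety: a lost result is simply recomputed from the source.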

Directed Acyclic Graph

For each job, the driver converts the program into a Directed Acyclic Graph (DAG), which is a series of connections between nodes. The Apache Spark ecosystem comprises several parts, such as the Spark Core API, Spark SQL, Spark Streaming for real-time processing, MLlib, and GraphX. You can also use the Spark shell to read and explore large amounts of data interactively.
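To make the idea of a DAG concrete, the following sketch models a job as a dependency graph and derives a valid execution order with Python's standard graphlib module. The step names are invented for illustration, and this is not how Spark's own scheduler is implemented; it only shows why an acyclic graph always yields an order in which every step runs after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# The step names are made up for illustration.
job_dag = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}

# static_order() yields the steps so that dependencies always come first.
# It raises graphlib.CycleError if the graph contains a cycle.
execution_order = list(TopologicalSorter(job_dag).static_order())
print(execution_order)
```

Several orders are valid for this graph (the two read steps are independent of each other), which is exactly what lets a scheduler run independent steps in parallel.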

Cluster Managers in Spark Architecture

Spark ships with its own standalone cluster manager, which makes it simple to set up a cluster. A Spark Standalone cluster has only two independent parts: the resource manager (the master) and the workers. In standalone mode, each application by default runs a single executor on each worker node. Execution begins when a client connects to the standalone master and requests resources.

Here the client acts as the application master and requests resources from the resource manager. The cluster manager's web UI provides detailed information about the cluster and its jobs.

Hadoop YARN (Yet Another Resource Negotiator)

YARN is the improved resource manager introduced in Hadoop 2.0, and the Hadoop ecosystem relies on it to manage resources. It comprises the two elements listed below:

Resource Manager: It manages the distribution of system resources across all applications. It contains a Scheduler and an Application Manager; the Scheduler allocates resources to applications.
Node Manager: Each job or application requires one or more containers, and the Node Manager oversees the consumption of these containers. It consists of an Application Manager and a Container Manager. Each MapReduce task runs in its own container. The Node Manager monitors container and resource utilization and reports this information to the Resource Manager.
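The division of work between the two elements can be sketched with a toy model in Python. The classes below are hypothetical and heavily simplified: they are not YARN's API, and the placement policy (pick the node with the most free memory) is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy node manager: tracks the containers running on one machine."""
    name: str
    capacity_mb: int
    containers: list = field(default_factory=list)

    def used_mb(self):
        return sum(size for _, size in self.containers)

    def free_mb(self):
        return self.capacity_mb - self.used_mb()


class ResourceManager:
    """Toy resource manager: its scheduler places container requests on nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Made-up policy for this sketch: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb())
        if node.free_mb() < memory_mb:
            return None  # no node can host the container right now
        node.containers.append((app_id, memory_mb))
        return node.name

    def report(self):
        # Node managers report their usage back to the resource manager.
        return {n.name: n.used_mb() for n in self.nodes}


rm = ResourceManager([NodeManager("node-1", 4096), NodeManager("node-2", 2048)])
requests = [("app-1", 1024), ("app-1", 1024), ("app-2", 2048),
            ("app-2", 2048), ("app-3", 1024)]
placements = [rm.allocate(app_id, mb) for app_id, mb in requests]
print(placements)   # ['node-1', 'node-1', 'node-1', 'node-2', None]
print(rm.report())  # {'node-1': 4096, 'node-2': 2048}
```

The last request is refused because both nodes are full, which is the situation in which a real scheduler would queue the application until containers are released.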

Apache Mesos

Apache Mesos is a general-purpose cluster manager that can also run Hadoop MapReduce and service applications. Using dynamic resource sharing and isolation, it helps create and manage application clusters, and it facilitates the deployment and administration of programs in large-scale cluster environments.

The Mesos framework consists of three elements:

Mesos Master: A cluster runs multiple Mesos Masters by design, which provides fault tolerance.
Mesos Slave: It is an instance that supplies the cluster with resources. A Mesos Slave assigns resources only when a Mesos Master gives it a task.
Mesos Frameworks: Mesos Frameworks allow applications to request resources from the cluster so that they can complete their jobs.

Kubernetes

Kubernetes is an open-source system for deploying, scaling, and managing containerized applications. There is also a separate, third-party project that supports Nomad as a cluster manager; it is not part of the Spark project.

Conclusion

Understanding the Apache Spark architecture makes it easier to build big data applications. The architecture is accessible and modular, which is highly advantageous for cluster computing and big data technology. Spark computes results quickly and is often used for batch processing.

Spark's distinctive abstractions, such as Datasets and DataFrames, help users optimize their code. Its SQL engine and fast execution are two of its most important features. It is a strong fit for the many industries that deal with massive data, and it makes complex computations easier to complete.

FAQs

1. What is the Spark architecture?

The Spark architecture is the design of an open-source framework that processes large volumes of structured, semi-structured, and unstructured data for analysis. The processed data can then be used further within Apache Spark.

2. What are the four elements of the Spark framework?

The four major elements of the Spark architecture that enable smooth processing and execution of tasks are the Spark driver, the executors, the cluster manager, and the worker nodes.

3. What are the layers of Spark?

The layers of Spark include Spark Core, Spark Streaming, Spark SQL, SparkR, etc.

4. What is Spark and what are its features?

Spark is an open-source framework that focuses on machine learning, interactive queries, and real-time workloads. Its main strength, and a strong reason to use it, is its consistently high performance. It is also much faster than its counterparts and can process large-scale data easily.

5. Why is Spark faster than Hadoop?

Spark is generally faster than Hadoop MapReduce. One main reason is that it keeps intermediate data in memory (RAM) instead of writing all of it to disk and reading it back. Hadoop MapReduce, by contrast, persists data to disk between steps and processes it in separate batches.

6. Why is Parquet best for Spark?

Parquet is a good fit for Spark for several reasons. Queries typically run faster against it than against other common file formats such as JSON and Avro. It also generally takes up less disk space than Avro and JSON.
