With the huge amount of data being generated, data processing frameworks like Apache Spark have become the need of the hour. And in this tutorial, we will help you master one of the most essential elements of Spark, that is, parallel processing.
Let us begin by understanding what a spark cluster is in the next section of the Spark parallelize tutorial.
A Spark Application on Cluster is explained below.
A user can submit a Spark job using Spark-submit. Once the Spark job is submitted, Sparkcontext driver program will be opened which will then go to Cluster Master Node.
From Cluster Master Node, Containers will be opened in Worker Nodes. Followed by Containers, Executors will open up in Worker Nodes, and these Executors will start interacting with Sparkcontext Driver.
RDDs on Spark Cluster
In RDD, Resilient Distributed Datasets, data is partitioned across Worker Nodes by Spark. You can also control the number of partitions created.
Let us understand the partitioning from a single file in the next section of the Spark parallelize tutorial.
File Partitioning: Single Files
Partitions are based on the size of the file. You can also specify the minimum number of partitions required as textFile(file,minPartitions). By default, there will be two partitions when running on a spark cluster. More the number of partitions, the more the parallelization.
File Partitioning: Multiple Files
Using command sc.textFile(“mydir/*”), each file becomes at least one partition. File-based operations can be done per partition, for example parsing XML.
The next command is sc.wholeTextFiles (“mydir”). This command is used for partitioning multiple small files and also creating a key-value pair RRD, where Key stands for file name and value stands for file contents. Most RDD operations work on each element of an RDD and the other few work on each partition. Some of the commands that are used for partition are:
- foreachPartition- It is used for calling a function for each partition.
- mapPartitions - It is used to create a new RDD by executing a function on each partition in the current RDD.
- mapPartitionsWithIndex - This is the same as mapPartitions, but this includes an index of the partitions.
Note: Functions for partition operations take iterators. We will look at an example for one of the RDDs for better understanding.
Let us understand foreachPartition with an example, in the next section of the Spark parallelize tutorial.
In the example below, we have created a function printFirstLine which will calculate the first line for each partition.
Let’s assume we already have an RDD created, which is named myrdd. We can pass the printFirstLine created function into foreachPartition to calculate the first line for each partition.
Now that you have understood the commands used for partitioning, let's understand HDFS and data locality with an example in the next section of the Spark parallelize tutorial.
HDFS and Data Locality
In the diagram, you can notice multiple data nodes.
Now, using hdfs dfs -put mydata, you can push the mydata file to HDFS. Let’s assume that it is saved in the HDFS disk in three blocks. When the data is saved in the HDFS disk, you can start programming in Spark. Once you start programming, Spark context will be available, which will open up executors in datanodes.
Using sc.textfile, you can push mydata in the Spark executor. Since this is a transformation step, RDD will remain blank. Using action trigger execution, tasks on executors load data from blocks into partitions. Then the data is distributed across executors until an action returns a value to the driver.
Parallel Operations on Partitions
RDD operations are executed in parallel on each partition. Tasks are executed on the Worker Nodes where the data is stored.
Some operations preserve partitioning, such as map, flatMap, filter, distinct, and so on. Some operations repartition, such as reduceByKey, sortByKey, join, groupByKey, and so on.
Let us learn about the operations in stages.
Operations in Stages
Operations that can run on the same partition are executed in stages. Tasks within a stage are pipelined together. Developers should be aware of operational stages to improve performance. Listed are some of the Spark terminologies:
- Job: Job a set of tasks executed as a result of an action.
- Stage: Stage a set of tasks in a job that can be executed in parallel.
- Task: Task is an individual unit of work sent to one executor.
- Application: Application can contain any number of jobs managed by a single driver.
How Does Spark Calculate Stages
Spark constructs a Directed Acyclic Graph or DAG of RDD dependencies. These dependencies are of two types:
In Narrow dependencies, each partition in the child RDD depends on just one partition of the parent RDD. No shuffle is required between executors. Nodes, where the RDDs are created, can be collapsed into a single stage
Example: map, filter,
Union Wide or Shuffle Dependencies
In Union Wide or shuffle dependencies, multiple child partitions depend on each partition in the parent RDD. Wide dependencies define a new stage.
Example: reduceByKey, join, groupByKey Let’s go through the process of controlling the level of Parallelism.
“Wide” operations such as reduceByKey partition result in RDDs. The more the number of partitions, the more are the parallel tasks. Spark cluster will be under-utilized if there are too few partitions.
You can control the number of partitions by optional numPartitionsparameter in the function call. We can see the Spark application UI from localhost: 4040. We can notice all the Spark jobs in this UI.
Now let's summarize what we've learned in this Spark parallelize tutorial.
- RDDs are stored in the memory of Spark executor
- Java Virtual Machines, JVMs Data is split into partitions.
- Each partition in a separate executor RDD operations are executed on partitions in parallel
- Operations that depend on the same partition are pipelined together in stages.
- Operations that depend on multiple partitions are executed in separate stages.
With this we come to an end of the Spark parallelize tutorial, and by now you should know that mastering Apache Spark is only going to help you succeed in your data career. And Simplilearn’s Caltech Post Graduate Program in Data Science should help you just do that. This program will help you gain expertise in concepts of the Hadoop framework, Big Data tools, and methodologies.