Apache Spark is an open-source data processing framework for handling large volumes of data from multiple sources. Spark is used for machine learning, data analytics, and graph-parallel processing on single-node machines or clusters. 

Owing to its lightning-fast processing speed, scalability, and programmability, Spark has become one of the most widely used distributed processing frameworks for scalable Big Data computing. 

Thousands of companies, including tech giants like Apple, Facebook, IBM, and Microsoft, use Apache Spark. Spark installation is simple and can be done in a variety of ways. Spark provides native bindings for several programming languages, including Java, Scala, Python, and R. 

This guide provides a step-by-step tutorial for installing Apache Spark. 

Steps in Apache Spark Installation

Prerequisites:

  • A system running Windows 10
  • A user account with administrator privileges (needed for installing software, modifying file permissions, and modifying the system path)
  • Command Prompt or PowerShell
  • A tool like 7-Zip that can extract .tar files

Note: the shell commands in this guide follow Unix-style syntax; on Windows 10 you can run them in an environment such as Windows Subsystem for Linux (WSL) or Git Bash.

Step 1: Verifying Java Installation

To install Apache Spark on Windows, you need Java 8 or later installed on your system. 

Try this command to verify the Java version:

$ java -version 

If your system already has Java installed, you’ll see output similar to the following:

java version "1.7.0_71" 

Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you do not have Java installed, download it from https://java.com/en/download/ and install it on your system before proceeding to the next step. 
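Some tools also expect the JAVA_HOME environment variable to point at the Java installation directory. As a minimal sketch, assuming a typical OpenJDK 8 install path (adjust the path to match your actual installation):

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 

$ export PATH=$PATH:$JAVA_HOME/bin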

Step 2: Verifying Scala Installation 

To work with Apache Spark in Scala, you need the Scala language installed on your system. Verify the Scala installation with the following command:

$ scala -version

If you already have Scala installed, you will see a response like the following:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

However, if you do not have Scala installed, proceed to the next step to download and install it.

Step 3: Downloading Scala

Download Scala from the link http://www.scala-lang.org/download/. 

In this tutorial, we use Scala version 2.11.6. 

Once the download is complete, you will find the Scala tar file in the Downloads folder. 

Step 4: Installing Scala

Steps to follow for Scala installation:

  • Extract the Scala tar File –

Use the command below to extract the Scala tar file:

$ tar xvf scala-2.11.6.tgz

  • Move Scala Software Files – 

To move the Scala software files to the respective directory (/usr/local/scala), use the following commands:

$ su - 

Password: 

# cd /home/Hadoop/Downloads/ 

# mv scala-2.11.6 /usr/local/scala 

# exit

  • Set PATH for Scala – 

The command to set PATH for Scala is:

$ export PATH=$PATH:/usr/local/scala/bin
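Note that an export entered at the prompt lasts only for the current shell session. To make the change permanent, you can append the line to your ~/.bashrc file, for example:

$ echo 'export PATH=$PATH:/usr/local/scala/bin' >> ~/.bashrc 

$ source ~/.bashrc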

  • Verifying Scala Installation

After installation, verify Scala with the following command:

$ scala -version

You should see the following output:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL


Step 5: Downloading Apache Spark

Open a browser and navigate to the link https://spark.apache.org/downloads.html

For this tutorial, we use the spark-1.3.1-bin-hadoop2.6 package.

Under the ‘Download Apache Spark’ heading, choose from the two drop-down menus: 

  • In the ‘Choose a Spark release’ drop-down menu, select 1.3.1.
  • In the ‘Choose a package type’ drop-down menu, select Pre-built for Apache Hadoop 2.6. 

Click the spark-1.3.1-bin-hadoop2.6.tgz link to download Spark. After the download is complete, you will find the Spark tar file in the Downloads folder.

You can verify the integrity of the downloaded Spark file by checking its checksum. This step ensures you’re working with unaltered, uncorrupted software. 
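For example, you can compute the checksum locally and compare it with the value published next to the download link on the Apache Spark site (the exact algorithm, such as SHA-512, depends on what the release page provides):

$ sha512sum spark-1.3.1-bin-hadoop2.6.tgz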

Step 6: Installing Spark

Here are the steps required to install Apache Spark:

  • Extracting Spark tar File –

The command for extracting the Spark tar file is:

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

  • Moving Spark Software Files to the Desired Location – 

Use the following commands to move the Spark software files to the respective directory (/usr/local/spark): 

$ su - 

Password:  

# cd /home/Hadoop/Downloads/ 

# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark 

# exit

  • Setting Up the Environment for Spark – 

Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable. 

export PATH=$PATH:/usr/local/spark/bin

The command for sourcing the ~/.bashrc file is:

$ source ~/.bashrc
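Many setups also define a SPARK_HOME variable pointing at the installation directory; this is a common convention rather than a required step. A minimal sketch of the equivalent ~/.bashrc entries:

export SPARK_HOME=/usr/local/spark 

export PATH=$PATH:$SPARK_HOME/bin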


Step 7: Verifying the Spark Installation

Open the Spark shell using the following command:

$ spark-shell

If Spark is installed successfully, the system displays many lines indicating the status of the application. A Java pop-up may appear on the screen; select ‘Allow access’ to continue. 

Then, the Spark logo will appear, and the prompt will display the Scala shell.  

You should see a display similar to the following (the exact version details will match your installation):

Spark assembly has been built with Hive, including Datanucleus jars on the classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Hadoop); users with modify permissions: Set(Hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) 
Type in expressions to have them evaluated. 
Spark context available as sc 
scala>
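Before opening the Web UI, you can run a quick computation in the shell to confirm that the Spark context works. A minimal sanity check (5050 is the sum of the integers 1 through 100):

scala> sc.parallelize(1 to 100).reduce(_ + _)

res0: Int = 5050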

Next, open a web browser and navigate to http://localhost:4040/. You can replace localhost with the name of your system. 

An Apache Spark shell Web UI will be displayed on the screen.

You can exit Spark and close the Scala shell by pressing Ctrl-D in the command prompt window. 

Test Spark

As a test case, we will launch the Spark shell and use Scala to read the contents of a file. For this example, we created a file named abctest containing some text. The complete session is shown after the list below.  

  • Open the command prompt window, navigate to the folder that contains the file you want to use, and launch the Spark shell. 
  • Set a variable in the Spark context to the file name. Do not forget to include the file extension, if the file has one.
    val x = sc.textFile("abctest")
  • The output shows that an RDD has been created. You can view the file contents using the command:
    x.take(11).foreach(println)
    This command instructs Spark to print the first 11 lines from the file abctest. You can derive another value y from this file (value x) with a map transformation. 
  • Reverse the characters of each line with this command:
    val y = x.map(_.reverse)
  • The system creates a child RDD corresponding to the first one. Next, specify the number of lines you want to print from the value y:
    y.take(11).foreach(println)
    The output prints the first 11 lines of the abctest file with each line’s characters in reverse order.
  • Exit the shell using Ctrl-D. 
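Putting the steps above together, here is the complete session as a sketch. It assumes a plain-text file named abctest in the directory from which you launched spark-shell:

val x = sc.textFile("abctest")      // read the file into an RDD of lines
x.take(11).foreach(println)         // print the first 11 lines
val y = x.map(_.reverse)            // child RDD with each line's characters reversed
y.take(11).foreach(println)         // print the first 11 lines, reversed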

Final Words

Now that you have learned the detailed steps of Apache Spark installation on Windows, get started running a Spark instance in your Windows environment. 

Interested in taking a deep dive into what Spark is, its features, how to use Spark, and other Big Data platforms? Join Simplilearn’s Big Data Engineering Course. Learn with the world’s no. 1 online bootcamp and master Apache Spark and Scala to obtain job-ready skills. Sign up today to get started!  
