A Quick Start-up Apache Spark Guide for Newbies

‘Lightning-fast cluster computing’ – that’s the slogan of Apache Spark, one of the world’s most popular big data processing frameworks. It has witnessed rapid growth in the last few years, with companies like eBay, Yahoo, Facebook, Airbnb, and Netflix adopting the framework for their big data needs.

But What Is Apache Spark All About?

With the rise of the Internet-of-Things (IoT) and social media’s ubiquitous use, there has been a spike in data volumes. According to a Gartner estimate, there are around 6.4 billion devices plugged into the Internet, generating about 2.5 exabytes of data every day.

Big data techniques and tools help companies manage all this data, ranging from our bank transactions to our activity on social networks like Facebook and Twitter.

And then there are questions to be answered about this data: how can we detect fraud in our bank transactions, which advertisement on Facebook gets the most clicks, and so on.

To answer these questions, large volumes of data have to be processed quickly, and this is where Spark enters the picture.


Spark 2.0

With the release of Spark 2.0 in the summer of 2016, the framework is becoming more mature. It has reached a point where tech junkies are not the only people who are aware of the Spark phenomenon – business leaders are waking up to its potential. IBM has made a significant commitment to it and calls it ‘potentially the most significant open source project of the next decade.’ The success of Spark in projects like personalized DNA analysis also contributes to the belief that it works well in real-life projects.

So why do businesses invest in Spark? 

Spark – a Timeline of Its Evolution

Spark started life in 2009 at the University of California, Berkeley, as a project by Matei Zaharia. Matei created Spark while working on his Ph.D. at Berkeley’s AMPLab, an institute that researches big data analytics. He is currently the Chief Technologist of Databricks, a company that helps clients with cloud-based big data processing using Spark.

Spark was open-sourced in 2010 and was donated to the Apache Software Foundation in 2013. It is now a top-level Apache project and the largest open source project in data processing.

The Spark Ecosystem

  • Spark Core

    It provides the base functionality for the components on top of it, like scheduling and monitoring of jobs in a cluster and handling faults. Though Spark is developed in the Scala programming language, it also provides an API for other languages like R, SQL, Python, and Java. These languages, especially R and Python, are the most popular in data science, which is one of the reasons Spark is so popular.
  • Spark SQL + DataFrames

    This module provides a structured data processing interface via SQL, a language used for communicating with a database. It also provides the DataFrame format, which is used to structure data into columns and rows.
  • Streaming

    In some applications, we need the result of data processing within a specific time limit, after which it becomes useless, for instance when detecting fraud in a credit card transaction. This module can be used in situations where we need real-time performance.
  • ML/MLlib

    Machine Learning (ML) is everywhere today, from recommendations on sites like Netflix and LinkedIn to advanced technologies like self-driving cars. This module provides state-of-the-art algorithms that learn from data and build models that can make predictions.
  • GraphX

    A module that can handle graph-structured data at scale. One can think of visualizing Facebook relations or analyzing flight data.
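
To make the SQL and DataFrame idea a little more concrete, here is a minimal sketch in plain Python. It does not use Spark at all: Python's built-in sqlite3 module stands in for Spark SQL so that the example runs without a Spark installation. The table name, column names, and click data are all invented for illustration. It answers the kind of question mentioned earlier, namely which advertisement gets the most clicks.

```python
import sqlite3

# Hypothetical click log: (ad_id, clicked) rows, invented for illustration.
rows = [
    ("ad_a", 1), ("ad_b", 0), ("ad_a", 1),
    ("ad_c", 1), ("ad_b", 1), ("ad_a", 0),
]

# An in-memory SQLite table stands in for a Spark DataFrame here.
# This is NOT Spark's API, only the same rows-and-columns idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (ad_id TEXT, clicked INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)

# SQL query: total clicks per advertisement, most-clicked first.
clicks_per_ad = conn.execute(
    "SELECT ad_id, SUM(clicked) AS total_clicks "
    "FROM clicks GROUP BY ad_id "
    "ORDER BY total_clicks DESC, ad_id"
).fetchall()
conn.close()

print(clicks_per_ad)
```

In Spark the same question would be asked of a DataFrame spread over a cluster, but the shape of the query (group by a column, aggregate, sort) stays the same.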

Why Is Spark so Popular?

There are several reasons why Spark is so popular and why there is a massive demand for people with Spark skills:

  • Speed

    The main benefit is the fast processing of big datasets. Spark is fast because it processes the data in memory and because it uses cluster technology. This means that a particular task can be divided into sub-tasks, and these sub-tasks are computed on different hosts in a cluster. This clustering technology is not new, as it’s also used in Hadoop MapReduce, sometimes called the predecessor of Spark. While Hadoop MapReduce writes intermediate results to disk, Spark keeps them in memory, which is a lot faster. For some in-memory workloads, Spark has been reported to run up to 100 times as fast as Hadoop MapReduce. A couple of years ago, it broke the record for sorting a petabyte. It is also possible to use Spark on your local machine without using a cluster. I used it this way in my last project, where I had to predict click-through rates for a marketing company. Loading the client’s data and building the prediction model took about 15 minutes at first. When I added Spark, which can utilize all cores on my machine, it took only 3 minutes.
  • Ease of Use

    It provides support for the main languages used in data processing like Java, Scala, R, and Python. The documentation is pretty good, and it’s relatively easy to create a simple application in your preferred language.  It also provides a way to use it interactively, which is handy to experiment with before you write your program.
  • Supports Many Use Cases

    It is a complete framework that supports multiple use cases. Ranging from Machine Learning to Stream processing and Graph processing, Spark has quite a lot of functionality available that gets you up and running quickly.
  • Integration with Other Technologies

    Spark can run on and integrate with different technologies, such as the Hadoop Distributed File System (HDFS), YARN, and Amazon Web Services (AWS). AWS, which has supported Spark for some time, has the advantage that you don’t have to set up and maintain a cluster yourself, saving you valuable time.
  • Community

    The Spark community is active and growing. It is the largest open source community in big data with over 1000 contributors from 250+ organizations.
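
The idea behind the Speed point above, dividing a task into sub-tasks that are computed separately and then combined, can be sketched in plain Python. This is not Spark code and does not reflect Spark's API. A thread pool stands in for the hosts of a cluster, and the function names (split, sum_of_squares, parallel_sum_of_squares) are made up for this illustration. Python threads do not give a real speed-up for CPU-bound work, so treat this purely as a picture of the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Divide the data into at most n_parts chunks (the sub-tasks)."""
    if not data:
        return []
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done for one sub-task, entirely in memory."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Run each sub-task on a worker, then combine the partial results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


numbers = list(range(1, 1001))
total = parallel_sum_of_squares(numbers)
print(total)
```

Spark applies the same split, compute, and combine pattern, but distributes the chunks over the cores of one machine or the hosts of a cluster and keeps the intermediate results in memory.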

From my own experience as a data scientist, I know that Spark has gained in popularity across verticals. There has been a steep rise in Spark-related projects over the last few years. I have seen requests from clients, for example, to “migrate our current solution in Python/R to Spark to improve performance.”

Getting Started with Learning Spark

If you’re new to this field and would like to learn more about Apache Spark, online courses are your best bet. Simplilearn’s Apache Spark certification training course covers Scala programming, Spark streaming, machine learning, and shell scripting with 30 demos, an industry project, and 32 hours of live instructor-led training.

I hope this article has given you an idea about Apache Spark and its use. Please let me know about any comments or questions you might have.

About the Author

Ger Inberg

Ger Inberg is a freelance data scientist with a background in software development. He is currently helping clients in the field of machine learning and data visualization.
