A Quick Start-up Apache Spark Guide for Newbies

‘Lightning-fast cluster computing’ is the slogan of Apache Spark, one of the world’s most popular big data processing frameworks. It has grown rapidly in the last few years, with companies like eBay, Yahoo, Facebook, Airbnb and Netflix adopting the framework for their big data needs.

But what is Apache Spark all about?

With the rise of the Internet of Things (IoT) and the ubiquitous use of social media, there has been a spike in data volumes. According to a Gartner estimate, there are around 6.4 billion devices connected to the Internet, generating around 2.5 exabytes of data every day.

Big data techniques and tools help companies manage all this data, ranging from our bank transactions to our activity on social networks like Facebook and Twitter.

And then there are questions to be answered about this data: how can we detect fraud in our bank transactions, which advertisement on Facebook gets the most clicks, and so on.

To answer these questions, large volumes of data have to be processed quickly, and this is where Spark enters the picture.

Spark 2.0

With the release of Spark 2.0 in the summer of 2016, the framework is becoming more mature. It has reached a point where tech junkies are not the only people who are aware of the Spark phenomenon; business leaders are waking up to its potential too. IBM has made a major commitment to it and calls it ‘potentially the most significant open source project of the next decade’. The success of Spark in projects like personalized DNA analysis also contributes to the belief that it works well in real-life projects.

So why do businesses invest in Spark? 

Spark – A Timeline of its Evolution

Spark started life in 2009 at the University of California, Berkeley, as a project by Matei Zaharia. Matei created Spark while working on his PhD at Berkeley’s AMPLab, a lab that conducts research into big data analytics. He is currently the Chief Technologist of Databricks, a company that helps clients with cloud-based big data processing using Spark.

Spark was open-sourced in 2010 and was donated to the Apache Software Foundation in 2013. It is now a top-level Apache project and the largest open source project in data processing.

The Spark Ecosystem

  • Spark core: it provides the base functionality for the components built on top of it, such as scheduling and monitoring jobs in a cluster and handling faults. Though Spark is developed in the Scala programming language, it also provides APIs for other languages like R, SQL, Python and Java. These languages, especially R and Python, are among the most popular languages in data science, which is one of the reasons Spark is so popular.
  • Spark SQL + DataFrames: this is a module that provides a structured data processing interface via SQL, a language used for communicating with databases. It also provides the DataFrame format, which is used to structure data into columns and rows.
  • Streaming: in some applications we need the result of a data processing job within a certain time limit, failing which it becomes useless. Detecting fraud in a credit card transaction is one example. This module can be used in situations where we need real-time performance.
  • ML/MLlib: Machine Learning (ML) is everywhere today, from recommendations on sites like Netflix and LinkedIn to advanced technologies like self-driving cars. This module provides state-of-the-art algorithms that learn from data and build models with it, so that predictions can be made.
  • GraphX: a module that can handle graph-structured data at scale. One can think of visualizing Facebook relations or analyzing flight data.

Why Is Spark so Popular?

There are several reasons why Spark is so popular and why there is a huge demand for people with Spark skills:

  • Speed: the main benefit is fast processing of big datasets. Spark is fast because it processes data in memory and runs on a cluster. This means that a task can be divided into sub-tasks, which are computed on different hosts in the cluster. Cluster computing is not new: it is also used in Hadoop MapReduce, sometimes called the predecessor of Spark. But while Hadoop MapReduce writes intermediate results to disk between steps, Spark keeps them in memory where it can, which is a lot faster. Spark can be up to 100 times as fast as Hadoop MapReduce for in-memory workloads. A couple of years ago, it broke the record for sorting a petabyte. It is also possible to use Spark on your local machine without a cluster. I used it this way in my last project, where I had to predict click-through rates for a marketing company. Loading the client’s data and building the prediction model took about 15 minutes at first. After I added Spark, which can use all the cores on my machine, it took only 3 minutes.
  • Ease of use: it provides support for the main languages used in data processing, like Java, Scala, R and Python. The documentation is pretty good and it’s fairly easy to create a simple application in your favorite language. It also provides a way to use it interactively, which is handy for experimenting before you write your program.
  • Supports many use cases: it is a complete framework that supports multiple use cases. From machine learning to stream processing and graph processing, Spark has a lot of functionality available that gets you up and running quickly.
  • Integration with other technologies: Spark works with different storage and cluster technologies like the Hadoop Distributed File System (HDFS), YARN and Amazon Web Services (AWS). AWS, which has been supporting Spark for some time now, has the advantage that you don’t have to set up and maintain a cluster yourself, saving you valuable time.
  • Community: the Spark community is active and growing. It is the largest open source community in big data with over 1000 contributors from 250+ organizations.
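
To make the idea behind the "Speed" point a bit more concrete, here is a minimal sketch in plain Python of dividing one task into sub-tasks and combining the partial results. It is not Spark code and does not use Spark's API: it uses only the Python standard library, the function names and sample data are made up for this illustration, and the worker threads stand in for the hosts of a cluster. Python threads also won't give a real speed-up for this kind of work, so read it as a picture of the structure, not as a benchmark.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """Sub-task: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(items, n_chunks):
    """Divide the input into roughly equal chunks, one per sub-task."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_word_count(lines, n_workers=4):
    """Run the sub-tasks on a pool of workers and merge the partial results."""
    chunks = split_into_chunks(lines, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker handles one chunk; the partial counts are merged here.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


# Hypothetical sample data, made up for this illustration.
sample_lines = [
    "spark is fast",
    "spark runs in memory",
    "hadoop mapreduce runs on disk",
    "spark runs on a cluster",
]

word_counts = parallel_word_count(sample_lines)
```

Spark follows the same split, compute and combine pattern, but it distributes the chunks over the machines in a cluster (or the cores of a single machine) and keeps intermediate results in memory where it can.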

From my own experience as a data scientist, I know that Spark has gained in popularity across verticals. There has been a steep rise in Spark-related projects over the last few years. For example, I have seen requests from clients to “migrate our current solution in Python/R to Spark to improve performance”.

Getting Started with Learning Spark

If you’re new to this field and would like to learn more about Apache Spark, online courses are your best bet. Simplilearn’s Apache Spark certification training course covers Scala programming, Spark streaming, machine learning, and shell scripting with 30 demos, an industry project, and 32 hours of live instructor-led training.

I hope this article has given you an idea of what Apache Spark is and how it is used. Please let me know if you have any comments or questions.

I wish you a sparkly day!

About the Author

Ger Inberg

Ger Inberg is a freelance data scientist with a background in software development. He is currently helping clients in the field of machine learning and data visualization.
