Spark SQL Tutorial

Welcome to the sixteenth lesson, “Spark SQL,” of the Big Data Hadoop Tutorial, which is part of the ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn. This lesson focuses on Spark SQL.

Let us look at the objectives of this lesson in the next section.

Objectives

After completing this lesson, you will be able to:

  • Explain the importance and features of Spark SQL

  • Describe the methods to convert RDDs to DataFrames

  • Explain a few concepts of Spark SQL

  • Explain how to load existing data into a DataFrame

  • Explain how to convert from DataFrames to Pair RDDs.


What is Spark SQL?

Spark SQL is a module for structured data processing, which is built on top of core Apache Spark.

Apache Spark SQL provides the following:

  • DataFrame API: A library for working with data as tables; it defines DataFrames containing rows and columns

  • Catalyst Optimizer: An extensible optimization framework

  • A SQL engine and a command-line interface

Importance of Spark SQL

Spark SQL is an Apache Spark module used for structured data processing, which:

  • Acts as a distributed SQL query engine

  • Provides DataFrames for programming abstraction

  • Allows you to query structured data inside Spark programs

  • Can be used from languages such as Scala, Java, R, and Python.

Features of Spark SQL

The features of Apache Spark SQL are:

Hive Compatibility: Compatible with the existing Hive queries, UDFs, and data.

SQL Queries Support: Mixes SQL queries with Spark programs.

Components Support: Includes a cost-based optimizer, code generations, and columnar storage.

Spark Engine Inclusion: Allows it to scale to multi-hour queries and thousands of nodes.

Comprehensive: Does not require a different engine for the historical data.

Let us now understand what SQLContext is.

SQLContext

The SQLContext class, or any of its descendants, acts as the entry point into all Spark SQL functionality:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

How can you get the benefit of a superset of the basic SQLContext functionality? This is achieved by building a HiveContext, which lets you:

  • Write queries using the more complete HiveQL parser.

  • Access Hive UDFs and read data from Hive tables.

Here are some of the important points on SQLContext:

  • HiveContext is packaged separately to avoid the dependencies on Hive in the default Spark build.

  • You can also use the spark.sql.dialect option to select the specific variant of SQL used for parsing queries; use the SET key=value command in SQL or the setConf method on an SQLContext.

  • “sql” is the only dialect available for an SQLContext; for a HiveContext, the default is “hiveql.”
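As a sketch of the points above (assuming the Spark 1.x API, where HiveContext lives in the spark-hive module and must be on the classpath), creating a HiveContext looks like this; the table name src is only illustrative:

// Requires the spark-hive dependency (Spark 1.x API).
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc: SparkContext // an existing SparkContext
val hiveContext = new HiveContext(sc)

// Select the SQL dialect explicitly, if needed:
hiveContext.setConf("spark.sql.dialect", "hiveql")

// HiveQL queries and Hive UDFs are now available (src is a hypothetical Hive table):
val result = hiveContext.sql("SELECT key, value FROM src LIMIT 10")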

On an SQLContext, the sql function allows applications to run SQL queries programmatically and returns the result as a DataFrame.

val df = sqlContext.sql("SELECT * FROM table")

Let's now understand what DataFrames are.

DataFrames

DataFrames represent a distributed collection of data, in which data is organized into columns that are named.

To construct a DataFrame, use sources such as Hive tables, structured data files, existing RDDs, and external databases.

To convert a DataFrame to an RDD, call the rdd method, which returns the DataFrame content as an RDD of Rows.

In prior versions of the Spark SQL API, the DataFrame was called SchemaRDD; it has since been renamed to DataFrame.
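The rdd method also makes it easy to go one step further, from a DataFrame to a pair RDD, one of this lesson's objectives. A minimal sketch, assuming a DataFrame df with name and age columns as in the JSON examples later in this lesson (age is inferred as Long when read from JSON):

// Convert a DataFrame to an RDD of Rows, then to a pair RDD keyed by name.
// Assumes df has the schema (name: String, age: Long).
val rowRDD = df.rdd
val pairRDD = rowRDD.map(row => (row.getAs[String]("name"), row.getAs[Long]("age")))

The resulting pairRDD supports the usual pair RDD operations, such as reduceByKey and join.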

Creating DataFrames

DataFrames can be created:

  • From an existing structured data source such as Hive or JSON

  • From an existing RDD

  • By performing an operation or query on another DataFrame

  • By programmatically defining a schema.
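One of these routes, deriving a DataFrame from an operation or query on another DataFrame, can be sketched in a couple of lines (assuming an existing DataFrame df with an age column and an SQLContext, as in the examples that follow):

// A new DataFrame produced by an operation on another DataFrame:
val adults = df.filter(df("age") >= 18)

// ...or by a query, once df is registered as a temp table:
df.registerTempTable("customers")
val adultsBySql = sqlContext.sql("SELECT * FROM customers WHERE age >= 18")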

Let’s understand DataFrame with an example.

DataFrame: Example

In this example, you create an SQLContext from an existing SparkContext, read a JSON file, and display its contents as a DataFrame, as shown below.

val sc: SparkContext // An existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("examples/src/main/resources/customers.json")
// Display the content of the DataFrame to stdout:
df.show()

Using DataFrame Operations

DataFrames provide a domain-specific language that can be used for structured data manipulation in Java, Scala, and Python. Let’s see an example.

val sc: SparkContext // An existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create the DataFrame:
val df = sqlContext.read.json("examples/src/main/resources/customers.json")
// Show the content of the DataFrame:
df.show()
// Print the schema in a tree format:
df.printSchema()
// Select only the "name" column:
df.select("name").show()

In the above example, you create a SparkContext, read a JSON file into a DataFrame, display its contents, print its schema, and show only the name column.

In the example below, you select each person's name alongside their age incremented by one, filter for the people older than 21, and count the people in each age group.

// Select everybody, but increment the age by 1:
df.select(df("name"), df("age") + 1).show()
// Select people older than 21:
df.filter(df("age") > 21).show()
// Count people by age:
df.groupBy("age").count().show()

Interoperating with RDDs

To convert existing RDDs into DataFrames, Spark SQL supports two methods:

  • Reflection-based method: Infers the schema of an RDD that contains specific types of objects. Works well when you already know the schema while writing the Spark application.

  • Programmatic method: Enables you to build a schema and apply it to an existing RDD. Allows building DataFrames when you do not know the columns and their types until runtime.

Let’s understand each method in detail.

Reflection Based method

In the reflection-based approach, the Spark SQL Scala interface automatically converts an RDD containing case classes to a DataFrame.

The case class:

  • Defines the table schema, where the argument names of the case class are read using reflection and become the column names.

  • Can be nested and can contain complex types such as sequences or arrays.

The Scala interface implicitly converts the resulting RDD to a DataFrame, which you can register as a table and use in subsequent SQL statements.

Example Using the Reflection Based Approach

In the next example, you create an RDD of Person objects and register it as a table.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table:
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")
// SQL statements can be run by using the sql method provided by sqlContext:
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// By ordinal:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// By field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]:
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)

SQL statements run through the sql method provided by the SQLContext; columns of the resulting Rows can then be accessed by ordinal, by field name, or several at a time via getValuesMap.

Using the Programmatic Approach

This method is used when you cannot define case classes ahead of time; for example, when the structure of records is encoded in a string or in a text dataset that must be parsed.

To create a DataFrame using the programmatic approach, follow these steps:

  • Use the existing RDD to create an RDD of Rows.

  • Create the schema, represented by a StructType, that matches the structure of the Rows.

  • Apply the schema to the RDD of Rows using the createDataFrame method.

In the next example, sc is an existing SparkContext. You create an RDD of text lines, encode the schema in a string, and then generate the schema from that string.

// sc is an existing SparkContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD:
val people = sc.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string:
val schemaString = "name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Generate the schema based on the string of schema:
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows:
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD of Rows:
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")

This converts the records of the people RDD to Rows and applies the schema to the RDD, producing a DataFrame that is registered as the temp table people.
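To round out the example, the registered table can be queried exactly as in the reflection-based approach. A minimal sketch, assuming the people temp table registered above:

// Query the temp table registered above; the result comes back as a DataFrame:
val results = sqlContext.sql("SELECT name FROM people")
// Row columns can be accessed by ordinal or by field name:
results.map(t => "Name: " + t(0)).collect().foreach(println)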

Let’s now compare Spark SQL with Impala.

Spark SQL vs. Impala

Spark SQL:

  • Spark SQL is built on Spark, a general-purpose processing engine.

  • It provides convenient SQL-like access to structured data in a Spark application.

  • Use Spark SQL for ETL and for providing access to structured data required by a Spark application.

Impala:

  • Impala is a specialized SQL engine.

  • Impala’s query performance is much better than Spark SQL’s.

  • Impala is much more mature than Spark SQL and provides robust security via Sentry.

  • Impala is better for interactive queries and data analysis.

Summary

Let's summarize what we learned in this lesson.

  • Spark SQL is a Spark module used for structured data processing.

  • DataFrames represent a distributed collection of data, in which data is organized into columns that are named.

  • DataFrames provide a domain-specific language that can be used for structured data manipulation in Java, Scala, and Python.

  • To convert existing RDDs into DataFrames, Spark SQL supports two methods: Reflection Based and Programmatic.

  • Spark SQL is used for ETL and provides access to structured data required by an Apache Spark application.


Conclusion

This concludes the lesson on “Spark SQL.” This also brings us to the end of this Big Data Hadoop Tutorial.
