A Spark DataFrame is an integrated data structure with an accessible API that makes distributed processing of large datasets easier. DataFrames are available from general-purpose programming languages like Java, Python, and Scala.
In this tutorial you will learn what a PySpark DataFrame is, what its features are, and how to create DataFrames, including one built from a COVID-19 dataset.
What Is Pyspark DataFrame?
PySpark DataFrames are data organized into tables with rows and columns. In this two-dimensional structure, every column holds the values of a specific variable, and each row contains one value from each column. Column names cannot be omitted, row labels must be unique, the stored data can be of character, numeric, or factor types, and every column must contain the same number of data items.
Why Use Pyspark DataFrame?
- PySpark DataFrames are very useful for machine learning tasks because they can consolidate a lot of data.
- They are simple to evaluate and manipulate, and they are a fundamental data structure in Spark.
- DataFrame in Spark can handle petabytes of data.
- It has API support for languages like Python, R, Scala, and Java.
- They are frequently used as the data source for data visualization and can be utilized to hold tabular data.
- In comparison to RDDs, customized memory management lowers overhead and boosts performance.
Now that you have an idea of why DataFrames are used, let's look at some of the important features of the PySpark DataFrame and what makes it different.
Pyspark DataFrame Features
DataFrames in PySpark are distributed data collections arranged into rows and columns, with a name and a type for each column. They are comparable to conventional database tables in that they are structured and concise.
So, the next feature of the data frame we are going to look at is lazy evaluation.
Spark is written in Scala, a language that supports lazy evaluation, and Spark's execution model is lazy by default: no operation on an RDD, DataFrame, or Dataset is actually computed until an action is invoked.
The next feature we will cover is the immutability of python dataframes.
Immutable storage includes DataFrames, Datasets, and resilient distributed datasets (RDDs). The word "immutability" means "inability to change" when used with an object. Unlike ordinary mutable Python structures, these DataFrames are immutable and provide less flexibility when manipulating rows and columns in place; transformations instead produce new DataFrames.
Now that we have covered the features of Python DataFrames, let us go through how to use DataFrames in PySpark.
How to Use Dataframes in Pyspark?
In Python more generally, the pandas package, which offers tools for studying databases and other tabular datasets, also allows for creating data frames.
In Python, DataFrames are a fundamental type of data structure. They are frequently used as the data source for visualization and can be utilized to hold tabular data. A two-dimensional table with labeled columns and rows is known as a dataframe. Every row shows an individual instance of the DataFrame's column type, and the columns can be of a variety of types. DataFrames offer a method for quickly accessing, combining, transforming, and visualizing data.
How to Create DataFrames?
Let's go ahead and create some data frames using the following functions, step by step.
1. How to create a database using Row()?
To create a student database using the Row function, write student = Row(...) and list the elements inside the row: first name, last name, email, age, and roll number.
2. How to add data to the student database?
To add data to the student database, we fill individual data based on the variables in the database, as shown below. Each row indicates a single entry in the database.
3. How to create a database for departments using Row()?
In the student databases, all entries are in the same format, having a first name, last name, email, and so on. To create some department data, we will use the row function, so department 1 equals row. Then inside the brackets, we will have its id and name.
4. What if you want to see the values of student 2? What will you do?
Now suppose you want to look at the values of student 2. We will use the print command.
So you can see here the values of row student 2: the first name is Cassey; the last name is not specified, so it has been printed as a null value; then come her email, firstname.lastname@example.org, her age, 22, and her roll number, 14526.
5. How to create instances for the department and student databases?
To create separate instances, we use the Row function with specific arguments, as shown below.
You can see here I have created some instances which show us the students each department consists of. We can also see details of a particular student from a department using the print command.
6. What if you want to see the roll number of departmentwithstudent3? What will you do?
We get the roll number of student 4, at index position 1 in Department 3, which is 13536.
7. How to create a dataframe?
To create the data frame, we create an array of sequences of instances for our data frame.
Here department 1 consists of students 1 and 2, department 2 consists of students 3 and 4, and department 3 consists of students 4 and 5.
8. How to create a spark context?
After this, we can create our DataFrame using the Spark context (or, in modern PySpark, a Spark session).
9. How to display values in a dataframe?
We can display the values stored in our data frame using the display function.
We have the department structure, which consists of two strings, id and name, and the student array, which consists of three strings (first name, last name, and email) and two integer values (age and roll number).
Use Case of the DataFrame
We have the dataset of COVID-19, which is in the CSV format. We will use this data set to create a data frame and look at some of its major functions.
- How to import the spark session from pyspark SQL?
- How to create a data frame by executing the following command using the Spark session?
- How do we use the spark command to read the CSV file and convert it into our data frame, which we named covid_df?
You can see I have provided a path to the CSV file. You can do this by uploading it on Colab. You can find the uploading option on the left side of the page.
- How to upload the covid dataset into the covid_df dataframe?
We can right-click on the file and copy the path into our spark read command.
Otherwise, if you are doing it in the pyspark shell, you can directly copy the file's path from the local directory.
We have used a comma as the separator, and as you can see, I have set header=True; otherwise, the data frame would take the first row of the file as data rather than as column names.
Now after successful execution of the command, our data frame is created.
Finally, we can try out some major functions of the data frame using the following commands.
So now let's have a look at our data frame using the show() command.
Here as you can see, only the top 20 rows are displayed.
- What if we want to know the total number of records in our dataframe?
We can do this using the count function.
So here, as you can see, it shows the total number of records in our data frame, which is 859.
- What if you want to have a look at the columns?
You can do it manually, scrolling across the data frame displayed by the show command, but there is another way of doing it: the columns function.
The columns function will list all the columns present in our data frame.
So as you can see, all the columns in our data frame have been listed below.
So we can just count how many columns we have here.
- What if there were too many columns to count manually?
We can always check the total number of columns by applying Python's len() function to the columns list. The len() function gives the number of columns.
- What if you want to know the structure of the data frame, such as the names of all columns with their data types?
The printSchema() function allows us to go through the detailed structure of our data frame. It specifies each column with its data type.
So as you can see, we have all our columns listed with their particular data types, and here nullable is set as true, which means they accept null values as input.
The describe() function will provide summary details of the specified column, such as the count of records and the min and max values.
As you can see, we used the describe function on the column username, so it gives us the count, or the total number of records in that particular column, along with its other summary statistics.
The select() function will select one or more columns specified in the command and give all the records in those specified columns.
The filter() command will show only records which satisfy the condition provided in the command.
We can also count the number of records that satisfy the condition by using the count() function in place of show() in the above command.
The filter function can be applied to more than one condition.
The orderBy() function is used to arrange the records in our data frame in ascending or descending order.
Using SQL Queries on a Dataframe
1. How to create a temporary table from our data frame?
2. How to use the spark.sql() command with show() to display the table?
We can also see only a specific column using the spark.sql("SELECT column_name FROM table_name").show() command.
Want to begin your career as a Data Engineer? Check out the Data Engineer Certification Course and get certified.
In this tutorial on "PySpark DataFrames," we covered the importance and features of DataFrames in Python. We also learned how to create DataFrames using Google Colab and performed a small demonstration of the PySpark library. Now the question is: what are the best PySpark technology courses you can take to boost your career? Simplilearn's Big Data Engineer Master's Course will help you kickstart your career as a Big Data Engineer.
Let us know if you have any questions or need clarification on any part of this 'What is PySpark DataFrames?’ tutorial in the comment section below. Our team of experts will be pleased to help you.