Before you start data analysis or run your data through a machine learning algorithm, you must clean your data and make sure it is in a suitable form. Further, it is essential to know any recurring patterns and significant correlations that might be present in your data. The process of getting to know your data in depth is called Exploratory Data Analysis.
Exploratory Data Analysis is an integral part of working with data. In this tutorial titled ‘All the ins and outs of exploratory data analysis,’ you will explore how to perform exploratory data analysis on different data types.
What Is Exploratory Data Analysis?
Exploratory Data Analysis is a data analytics process to understand the data in depth and learn the different data characteristics, often with visual means. This allows you to get a better feel of your data and find useful patterns in it.
Figure 1: Exploratory Data Analysis
It is crucial to understand your data in depth before you perform data analysis or run the data through an algorithm. You need to know the patterns in your data and determine which variables are important and which do not play a significant role in the output. Further, some variables may be correlated with other variables. You also need to recognize errors in your data.
All of this can be done with Exploratory Data Analysis. It helps you gather insights, make better sense of the data, and remove irregularities and unnecessary values from it. Exploratory Data Analysis also:
- Helps you prepare your dataset for analysis.
- Allows a machine learning model to make better predictions on your dataset.
- Gives you more accurate results.
- Helps you choose a better machine learning model.
Figure 2: Exploratory Data Analysis uses
Steps Involved in Exploratory Data Analysis
1. Data Collection
Data collection is an essential part of exploratory data analysis. It refers to the process of finding and loading data into your system. Good, reliable data can be found on various public sites or bought from private organizations. Some reliable sites for data collection are Kaggle, GitHub, the UCI Machine Learning Repository, etc.
The data depicted below represents the housing dataset that is available on Kaggle. It contains information on houses and the price that they were sold for.
Figure 3: Housing dataset
2. Data Cleaning
Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results. Some common data cleaning steps are:
- Removing missing values, outliers, and unnecessary rows/columns.
- Re-indexing and reformatting your data.
Now, it’s time to clean the housing dataset. First, check the number of missing values in each column and the percentage of that column’s values they account for.
Figure 4: Finding Missing Values
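The check described above can be sketched as follows. This is a minimal example that assumes pandas and uses a small made-up data frame in place of the real housing data; the name df_train and all values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
df_train = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Number of missing values per column, largest first.
total = df_train.isnull().sum().sort_values(ascending=False)

# Share of each column's rows that are missing (0.8 means 80%).
percent = (df_train.isnull().sum() / len(df_train)).sort_values(ascending=False)

# Put both measures side by side; rows are aligned on the column name.
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing_data)
```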
Next, drop the columns that are missing more than 15% of their data. Some variables, like 'PoolQC', 'MiscFeature', and 'Alley', are missing a significant chunk of their values, so they are unlikely to be useful for the analysis.
Figure 5: Dropping Missing Values
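One possible way to apply the 15% rule is sketched below. It assumes pandas and a made-up data frame; the threshold comes from the text, while the names df_train and df_clean and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
df_train = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = df_train.isnull().mean()

# Keep only columns where at most 15% of the values are missing.
cols_to_drop = missing_share[missing_share > 0.15].index
df_clean = df_train.drop(columns=cols_to_drop)

print(list(df_clean.columns))
```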
Your final dataset after cleaning looks as shown below. You now have only 63 columns of importance.
Figure 6: Final Dataset
3. Univariate Analysis
In Univariate Analysis, you analyze data of just one variable. A variable in your dataset refers to a single feature/column. You can do this either with graphical means or with non-graphical means, by computing summary statistics of the data. Some visual methods include:
- Histograms: Bar plots in which the frequency of values in each range is represented with rectangular bars.
- Box plots: Plots in which a box spans the first to the third quartile with a line at the median, and whiskers show the spread of the remaining values.
Let's make a histogram out of our SalePrice column.
Figure 7: Data Distribution in our Dataset
From the above graph, you can say that the distribution deviates from the normal distribution and is positively skewed. Now, find the skewness and kurtosis of the distribution.
Figure 8: Skewness and Kurtosis in your data
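A minimal sketch of this step, assuming pandas and a made-up, right-skewed series in place of the real SalePrice column:

```python
import pandas as pd

# Made-up prices with a long right tail, standing in for SalePrice.
sale_price = pd.Series(
    [100, 120, 130, 150, 160, 180, 200, 250, 400, 755], name="SalePrice"
)

# Positive skewness means the right tail is longer than the left one.
skewness = sale_price.skew()

# pandas reports excess kurtosis, which is 0 for a normal distribution.
kurtosis = sale_price.kurt()

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurtosis:.3f}")
```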
To understand exactly which values are outliers, you need to establish a threshold. To do this, you have to standardize the data so that it has a mean of 0 and a standard deviation of 1.
Figure 9: Standardising data
The above figure shows that the lower range values fall in a similar range and are not too far from 0. Meanwhile, all the higher range values are far from 0. You cannot consider all of them outliers, but you have to be careful with the last two values, which are above 7.
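The standardization step described above can be sketched as follows. This is an illustration with made-up values, not the tutorial's exact code; it assumes pandas and computes the z-scores by hand using the population standard deviation.

```python
import pandas as pd

# Made-up prices standing in for the SalePrice column.
sale_price = pd.Series(
    [100, 120, 130, 150, 160, 180, 200, 250, 400, 755], name="SalePrice"
)

# Standardize: subtract the mean and divide by the standard deviation,
# so the result has mean 0 and standard deviation 1.
scaled = (sale_price - sale_price.mean()) / sale_price.std(ddof=0)

# Inspect both ends of the standardized distribution.
ordered = scaled.sort_values()
low_range = ordered.head(3)
high_range = ordered.tail(3)

print("Lowest standardized values:\n", low_range)
print("Highest standardized values:\n", high_range)
```

Values that sit many standard deviations away from 0 are candidates for outliers, but where to put the threshold remains a judgment call.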
4. Bivariate Analysis
Here, you use two variables and compare them. This way, you can find how one feature affects the other. It is done with scatter plots, which plot individual data points, or correlation matrices, which show the correlation in hues. You can also use boxplots.
Let's plot a scatter plot of the above-ground living area and the sale price. Here, you can see that most of the values follow the same trend and are concentrated around one point, except for two isolated values at the very top. These are probably the data points with standardized values above 7.
Figure 10: Scatterplot
Now, delete the last two values as they are outliers.
Figure 11: Deleting Outliers
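One way to delete two such points is sketched below. It assumes pandas, a made-up data frame, and a living-area column named GrLivArea (an assumed column name); here the two rows with the largest living area are treated as the outliers, which is an illustrative choice only.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
df_train = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "GrLivArea": [1710, 1262, 1786, 4676, 5642, 2198],
    "SalePrice": [208500, 181500, 223500, 184750, 160000, 250000],
})

# Find the index labels of the two rows with the largest living area.
outlier_idx = df_train.sort_values("GrLivArea", ascending=False).head(2).index

# Drop those rows by label; the remaining rows keep their original index.
df_train = df_train.drop(index=outlier_idx)

print(df_train)
```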
Now, plot a scatter plot of the basement area vs. the sale price and see their relationship. Again, you can see that the greater the basement area, the higher the sale price.
Figure 12: Scatterplot
Moving ahead, plot a boxplot of the Sales Price with Overall Quality. The overall quality feature is categorical here. It falls in the range of 1 to 10. Here, you can see the increase in sales price as the quality increases. The rise looks a bit like an exponential curve.
Figure 13: Boxplot
Market Analysis With Exploratory Data Analysis
Now, perform Exploratory Data Analysis on market analysis data. You start by importing all necessary modules.
Figure 14: Importing necessary modules
Then, you read in the data as a pandas data frame.
Figure 15: Market Analysis Data
You can see here that the dataset is not formatted correctly. The actual column names appear within the first two rows of data, while the header row holds arbitrary values.
To fix the misplaced header, import your data again and skip the first two rows. This makes sure that your column names are populated correctly.
Figure 16: Importing Market Analysis Data
The dataset is imported correctly now. The column names are in the correct row, and you’ve dropped the arbitrary data.
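A minimal sketch of this import, assuming pandas. An in-memory CSV with two made-up junk lines stands in for the real file, and the column names other than 'jobedu' are assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for the survey file: two junk lines, then the real header.
raw_csv = io.StringIO(
    "banking marketing,,,\n"
    "customer details,,job and education,outcome\n"
    "customerid,age,jobedu,response\n"
    '1,58,"management,tertiary",no\n'
    '2,44,"technician,secondary",yes\n'
)

# skiprows=2 ignores the first two lines, so the third line becomes the header.
data = pd.read_csv(raw_csv, skiprows=2)

print(data.head())
```

With a file on disk, you would pass its path to read_csv instead of the StringIO object.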
The above data was collected through a survey. It contains information about the survey takers, such as their occupation, salary, whether they have taken a loan, and age. You will use exploratory data analysis to find patterns in this data and find correlations between columns. You will also perform basic data cleaning steps.
The next step is data cleaning. Drop the customer id column, as it is just the row number indexed from 1. Also, split the ‘jobedu’ column into two: one column for the job and one for the education field. After splitting, you can drop the ‘jobedu’ column, as it is no longer needed.
Figure 17: Cleaning Market Analysis Data
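These cleaning steps could look roughly like this. The sketch assumes pandas, a made-up data frame, and that the id column is named customerid and that 'jobedu' stores values as "job,education"; both are assumptions.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
data = pd.DataFrame({
    "customerid": [1, 2, 3],
    "age": [58, 44, 33],
    "jobedu": [
        "management,tertiary",
        "technician,secondary",
        "entrepreneur,secondary",
    ],
})

# The id column only repeats the row number, so drop it.
data = data.drop(columns="customerid")

# Split "job,education" at the first comma into two separate columns.
parts = data["jobedu"].str.split(",", n=1, expand=True)
data["job"] = parts[0]
data["education"] = parts[1]

# The combined column is no longer needed.
data = data.drop(columns="jobedu")

print(data)
```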
This is what the dataset looks like now.
Figure 18: Market Analysis Data
The data has some missing values in its columns. There are three major categories of missing values:
- MCAR (Missing completely at random): These values are missing purely at random; whether a value is missing does not depend on any other values in the data.
- MAR (Missing at random): Whether a value is missing depends on other observed features, but not on the missing value itself.
- MNAR (Missing not at random): Whether a value is missing depends on the missing value itself, so there is a specific reason behind its absence.
Let’s check the columns which have missing values.
Figure 19: Missing values
There is nothing you can do about the missing age values. So, drop all rows which do not have the age values.
Figure 20: Missing age values
Now, coming to the month column, you can fill in the missing values with the most commonly occurring month. Compute the mode of the month column to get the most common value, and fill in the missing values using the fillna function.
Figure 21: Filling in missing month values
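A minimal sketch of filling with the mode, assuming pandas and a made-up month column:

```python
import numpy as np
import pandas as pd

# Toy month column with two missing entries; the values are made up.
data = pd.DataFrame({"month": ["may", "may", np.nan, "jun", np.nan, "may"]})

# mode() returns a Series (there can be ties), so take the first entry.
month_mode = data["month"].mode()[0]

# Replace every missing month with the most common one.
data["month"] = data["month"].fillna(month_mode)

print(data["month"].value_counts())
```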
Check to see the number of missing values left in your data.
Figure 22: Missing values
Finally, only the response column has missing values. You cannot do anything about these values. If the user hasn't filled in the response, you cannot auto-generate them. So you drop these values.
Figure 23: Dropping Missing response values
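Dropping rows that lack a value in a given column, as done here for the age and response columns, can be sketched with dropna. The data frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
data = pd.DataFrame({
    "age": [58, np.nan, 33, 47],
    "response": ["no", "yes", np.nan, "no"],
})

# Drop rows where age is missing, then rows where response is missing.
data = data.dropna(subset=["age"])
data = data.dropna(subset=["response"])

# Confirm that no missing values remain.
print(data.isnull().sum())
```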
Finally, you can see that the data is clean. You can now start finding the outliers in the data.
There are two types of outliers:
- Univariate outliers: Univariate outliers are the data points whose values lie outside the expected range of values. Here, only a single variable is being considered.
- Multivariate outliers: These outliers depend on the relationship between two variables. A value may lie within the expected range when you look at one variable alone, but when you plot it against another variable, the point may lie far from the expected value.
Now, consider the different jobs that you have data on. Plotting the job column as a bar graph, in ascending order of the number of people who work in each job, tells you the most popular jobs in the market. To make the values comparable, normalize the counts so that each bar shows the share of people in that job.
Figure 24: Plotting the number of people performing a certain job
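The normalized counts behind such a bar graph can be computed as sketched below, assuming pandas and a made-up job column. The plotting call is left as a comment because it additionally requires matplotlib.

```python
import pandas as pd

# Toy job column; the values are made up for illustration.
data = pd.DataFrame({
    "job": [
        "blue-collar", "management", "blue-collar", "technician",
        "management", "blue-collar", "student",
    ]
})

# normalize=True returns shares that sum to 1 instead of raw counts.
job_share = data["job"].value_counts(normalize=True).sort_values()

print(job_share)

# With matplotlib installed, job_share.plot.barh() draws the bar graph.
```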
Moving on, plot a pie chart to compare the education qualifications of the people in the survey. Almost half of the people have only a secondary school education and one-fourth have a tertiary education.
Figure 25: Plotting the education qualification of people
Bivariate analysis is of three main types:
1. Numeric-Numeric Analysis
When both the variables being compared have numeric data, the analysis is said to be Numeric-Numeric Analysis. To compare two numeric columns, you can use scatter plots, pair plots, and correlation matrices.
A scatter plot represents every data point in the graph. It shows how the data of one column varies with the corresponding data points in another column. Plot two scatter plots: one of individuals' salaries against their bank balances, and one of balance against age.
Figure 26: Plotting a scatter plot of Salary vs. Balance
By looking at the above plot, it can be said that regardless of the salary of individuals, the average bank balance ranges from 0 to 25,000. The majority of the people have a bank balance below 40k.
Figure 27: Plotting a scatter plot of Balance vs Age
From the above graph, you can conclude that the average balance of people, regardless of age, is around 25,000. The average balance therefore stays roughly the same irrespective of both age and salary.
Pair plots are used to compare multiple variables at the same time. They plot a scatter plot of all input variables against each other. This helps save space and lets us compare various variables at the same time. Let's plot the pair plot for salary, balance, and age.
Figure 28: Plotting a pairplot
The below figures show the pair plots for salary, balance, and age. Each variable is plotted against the others on both the x and y-axis.
Figure 29: Pairplots of salary, balance, and age
A correlation matrix is used to see the correlation between different variables. How strongly two variables are correlated is measured by the correlation coefficient, which ranges from -1 to 1. The below table shows the correlation between salary, age, and balance. Correlation tells you how closely two variables move together, which helps you anticipate how a change in one variable is associated with a change in the others. It does not by itself show that one variable causes the other to change.
Figure 30: Correlation matrix between salary, balance, and age
The above matrix tells us that balance has a high correlation coefficient with both age and salary, so these variables tend to move together. Age and salary have a lower correlation coefficient with each other.
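A correlation matrix like the one discussed above can be computed with the corr method. The sketch below assumes pandas and uses made-up numbers, so its coefficients will not match the figure.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
data = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000],
    "balance": [100, 1500, 900, 3000, 2500],
    "age": [25, 52, 31, 45, 38],
})

# Pairwise Pearson correlation between the three numeric columns.
corr = data[["salary", "balance", "age"]].corr()

print(corr.round(2))
```

The matrix is symmetric and has 1 on its diagonal, since every variable is perfectly correlated with itself.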
2. Numeric-Categorical Analysis
When one variable is of numeric type and another is a categorical variable, then you perform numeric-categorical analysis.
You can use the groupby function to arrange the data into similar groups. Rows that have the same value in a particular column are placed in the same group. This way, you can summarize a numeric column for each category. For example, you can group by a column and find the mean of each group.
Figure 31: Groupby of response with respect to salary
The above values tell you the average salary of the people who have responded with yes and no in the response column.
You can also find the median, or middle value, of the salary of the people who responded with yes and no in the survey.
Figure 32: Median of groupby of response with respect to salary
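Both groupby summaries can be sketched as follows, assuming pandas and made-up salaries:

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
data = pd.DataFrame({
    "response": ["no", "no", "no", "yes", "yes", "yes"],
    "salary": [20000, 60000, 70000, 50000, 60000, 100000],
})

# Average salary within each response group.
mean_salary = data.groupby("response")["salary"].mean()

# Median (middle) salary within each response group.
median_salary = data.groupby("response")["salary"].median()

print(mean_salary)
print(median_salary)
```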
You can also plot the box plot of response vs salary. A boxplot will show you the range of values that fall under a certain category.
Figure 33: Boxplot of response with respect to salary
The above plot tells you that the salaries of people who said no on the survey range from 20k to 70k with a median of 60k, while the salaries of people who said yes range from 50k to 100k, also with a median of 60k.
3. Categorical-Categorical Analysis
When both the variables contain categorical data, you perform categorical-categorical analysis. First, convert the categorical response column into a numerical column with 1 corresponding to a positive response and 0 corresponding to a negative response.
Figure 34: Changing categorical to numerical values
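One way to do this conversion is sketched below. It assumes pandas and NumPy, a made-up response column, and a new column named response_flag, which is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy response column; the values are made up for illustration.
data = pd.DataFrame({"response": ["yes", "no", "no", "yes"]})

# 1 for a positive response, 0 for a negative one.
data["response_flag"] = np.where(data["response"] == "yes", 1, 0)

print(data)
```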
Now, plot the response rate against the marital status of people. The below figure shows, for each marital status, the mean of the numerical response column, which is the share of people who responded with yes to the survey.
Figure 35: Response rate by marital status
Also, plot the mean response rate with respect to loan status.
Figure 36: Response rate by loan status
You can conclude that people who have taken a loan are more likely to respond with a no on the survey.
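The response rates per category can be computed as sketched below. The example assumes pandas, a made-up data frame, and hypothetical column names marital, loan, and response_flag (1 for yes, 0 for no); its numbers are illustrative and do not reproduce the figures.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
data = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "single", "married"],
    "loan": ["yes", "no", "no", "yes", "no", "yes"],
    "response_flag": [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the response rate.
rate_by_marital = data.groupby("marital")["response_flag"].mean()
rate_by_loan = data.groupby("loan")["response_flag"].mean()

print(rate_by_marital)
print(rate_by_loan)
```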
In this Exploratory Data Analysis Tutorial, you first understood the meaning and importance of exploratory data analysis. You then saw the various steps involved in performing Exploratory Data Analysis and finally, you used market analysis data to perform all the steps involved in exploratory data analysis on different types of data.
We hope this helped you understand how Exploratory Data Analysis works.