Exploratory Data Analysis

Exploratory Data Analysis (EDA) examines and visualizes data to understand its main characteristics, identify patterns, spot anomalies, and test hypotheses. It helps summarize the data and uncover insights before applying more advanced data analysis techniques.

What Is Exploratory Data Analysis?

Exploratory Data Analysis is a data analytics process that aims to understand the data in depth and learn its different characteristics, often using visual means. This allows one to get a better feel for the data and find useful patterns.

Figure 1: Exploratory Data Analysis

It is crucial to understand your data in depth before you perform data analysis and run it through an algorithm. You need to know the patterns in your data and determine which variables are important and which do not play a significant role in the output. Further, some variables may be correlated with other variables. You also need to recognize errors in your data.

Exploratory data analysis can do all of this. It helps you gather insights, get a better sense of the data, and remove irregularities and unnecessary values. In particular, it:

  • Helps you prepare your dataset for analysis.
  • Allows a machine learning model to make better predictions on your dataset.
  • Gives you more accurate results.
  • Helps you choose a better machine learning model.

Figure 2: Exploratory Data Analysis uses

Steps Involved in Exploratory Data Analysis

1. Understand the Data

Familiarize yourself with the data set, understand the domain, and identify the objectives of the analysis.

2. Data Collection

Collect the required data from various sources such as databases, web scraping, or APIs.

3. Data Cleaning

  • Handle missing values: Impute or remove missing data.
  • Remove duplicates: Ensure there are no duplicate records.
  • Correct data types: Convert data types to appropriate formats.
  • Fix errors: Address any inconsistencies or errors in the data.

4. Data Transformation

  • Normalize or standardize the data if necessary.
  • Create new features through feature engineering.
  • Aggregate or disaggregate data based on analysis needs.

5. Data Integration

Integrate data from various sources to create a complete data set.

6. Data Exploration

  • Univariate Analysis: Analyze individual variables using summary statistics and visualizations (e.g., histograms, box plots).
  • Bivariate Analysis: Analyze the relationship between two variables with scatter plots, correlation coefficients, and cross-tabulations.
  • Multivariate Analysis: Investigate interactions between multiple variables using pair plots and correlation matrices.

7. Data Visualization

Visualize data distributions and relationships using visual tools such as bar charts, line charts, scatter plots, heatmaps, and box plots.

8. Descriptive Statistics

Calculate central tendency measures (mean, median, mode) and dispersion measures (range, variance, standard deviation).
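
For instance, with pandas these summaries take only a few lines. This is a generic illustration on made-up numbers, not tied to any particular dataset:

    import pandas as pd

    values = pd.Series([23, 45, 12, 67, 34, 45, 29])
    print(values.mean(), values.median(), values.mode()[0])          # central tendency
    print(values.max() - values.min(), values.var(), values.std())   # range, variance, std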

9. Identify Patterns and Outliers

Detect patterns, trends, and outliers in the data using visualizations and statistical methods.

10. Hypothesis Testing

Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to validate assumptions or relationships in the data.
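
As a small illustration with SciPy (hypothetical numbers; scipy.stats also offers chi-square and many other tests):

    from scipy import stats

    # Two hypothetical samples, e.g. a metric measured in two groups
    group_a = [52, 48, 61, 55, 50, 58]
    group_b = [45, 49, 42, 47, 44, 50]

    # Two-sample t-test: is the difference in group means statistically significant?
    result = stats.ttest_ind(group_a, group_b, equal_var=False)
    print(result.statistic, result.pvalue)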

11. Data Summarization

Summarize findings with descriptive statistics, visualizations, and key insights.

12. Documentation and Reporting

  • Document the EDA process, findings, and insights in a clear, structured manner.
  • Create reports and presentations to convey results to stakeholders.

13. Iterate and Refine

Continuously refine the analysis based on feedback and additional questions during the process.

Importance of Exploratory Data Analysis in Data Science

Exploratory Data Analysis is a critical step in the data science process. It is the foundation for understanding and interpreting complex data sets. EDA helps data scientists identify patterns, spot anomalies, test hypotheses, and check assumptions through various statistical and graphical techniques. By thoroughly exploring the data, practitioners can uncover underlying structures, detect outliers, and determine the relationships between variables, all of which is essential for developing accurate predictive models.

Furthermore, Exploratory Data Analysis allows the identification of data quality issues, such as missing values or errors, which can be addressed before proceeding to more advanced analysis. This preliminary analysis enhances the reliability and accuracy of the subsequent modeling and ensures that the insights derived are valid and actionable. EDA allows data scientists to make informed decisions and derive meaningful insights that drive business strategies and solutions.

Types of Exploratory Data Analysis (EDA)

1. Univariate Analysis

  • Definition: Focuses on analyzing a single variable at a time.
  • Purpose: To understand the variable's distribution, central tendency, and spread.
  • Techniques:
    • Descriptive statistics (mean, median, mode, variance, standard deviation).
    • Visualizations (histograms, box plots, bar charts, pie charts).

2. Bivariate Analysis

  • Definition: Examines the relationship between two variables.
  • Purpose: To understand how one variable affects or is associated with another.
  • Techniques:
    • Scatter plots.
    • Correlation coefficients (Pearson, Spearman).
    • Cross-tabulations and contingency tables.
    • Visualizations (line plots, scatter plots, pair plots).

3. Multivariate Analysis

  • Definition: Investigates interactions between three or more variables.
  • Purpose: To understand the complex relationships and interactions in the data.
  • Techniques:
    • Multivariate plots (pair plots, parallel coordinates plots).
    • Dimensionality reduction techniques (PCA, t-SNE).
    • Cluster analysis.
    • Heatmaps and correlation matrices.

4. Descriptive Statistics

  • Definition: Summarizes the main features of a data set.
  • Purpose: To provide a quick overview of the data.
  • Techniques:
    • Measures of central tendency (mean, median, mode).
    • Measures of dispersion (range, variance, standard deviation).
    • Frequency distributions.

5. Graphical Analysis

  • Definition: Uses visual tools to explore data.
  • Purpose: To identify patterns, trends, and data anomalies through visualization.
  • Techniques:
    • Charts (bar charts, histograms, pie charts).
    • Plots (scatter plots, line plots, box plots).
    • Advanced visualizations (heatmaps, violin plots, pair plots).

6. Dimensionality Reduction

  • Definition: Reduces the number of variables under consideration.
  • Purpose: To simplify models, reduce computation time, and mitigate the curse of dimensionality.
  • Techniques:
    • Principal Component Analysis (PCA).
    • t-Distributed Stochastic Neighbor Embedding (t-SNE).
    • Linear Discriminant Analysis (LDA).

Exploratory Data Analysis Tools

Using the following tools for exploratory data analysis, data scientists can effectively gain deeper insights and prepare data for advanced analytics and modeling.

1. Python Libraries

  • Pandas: Provides data structures and functions needed to manipulate structured data seamlessly.
    • Use: Data cleaning, manipulation, and summary statistics.
  • NumPy: Supports large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.
    • Use: Numerical computations and data manipulation.
  • Matplotlib: A plotting library that produces static, animated, and interactive visualizations.
    • Use: Basic plots like line charts, scatter plots, and bar charts.
  • Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
    • Use: Advanced visualizations like heatmaps, violin plots, and pair plots.
  • SciPy: Builds on NumPy and provides many higher-level scientific algorithms.
    • Use: Statistical tests and scientific computations.
  • Plotly: A graphing library that makes interactive, publication-quality graphs online.
    • Use: Interactive and dynamic visualizations.

2. R Libraries

  • ggplot2: A framework for creating graphics using the principles of the Grammar of Graphics.
    • Use: Complex and multi-layered visualizations.
  • dplyr: A set of tools for data manipulation, offering consistent verbs to address common data manipulation tasks.
    • Use: Data manipulation and summarization.
  • tidyr: Provides functions to help you organize your data in a tidy way.
    • Use: Data cleaning and tidying.
  • shiny: An R package that makes it easy to build interactive web apps straight from R.
    • Use: Interactive data analysis applications.
  • plotly: Also available in R for creating interactive visualizations.
    • Use: Interactive visualizations.

3. Integrated Development Environments (IDEs)

  • Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
    • Use: Combining code execution, rich text, and visualizations.
  • RStudio: An integrated development environment for R that offers tools for writing and debugging code, building software, and analyzing data.
    • Use: R development and analysis.

4. Data Visualization Tools

  • Tableau: A top data visualization tool that facilitates the creation of diverse charts and dashboards.
    • Use: Interactive and shareable dashboards.
  • Power BI: A Microsoft business analytics service offering interactive visualizations and business intelligence features.
    • Use: Interactive reports and dashboards.

5. Statistical Analysis Tools

  • SPSS: A comprehensive statistics package from IBM.
    • Use: Complex statistical data analysis.
  • SAS: A software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics.
    • Use: Statistical analysis and data management.

6. Data Cleaning Tools

  • OpenRefine: A powerful tool for cleaning messy data, transforming formats, and enhancing it with web services and external data.
    • Use: Data cleaning and transformation.
  • SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and query relational databases.
    • Use: Data extraction, transformation, and basic analysis.

Market Analysis With Exploratory Data Analysis

Now, perform Exploratory Data Analysis on market analysis data. You start by importing all necessary modules.

Figure 3: Importing necessary modules
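
Roughly, the imports look like this, assuming the standard pandas, NumPy, Matplotlib, and Seaborn stack used throughout this walkthrough (the exact set in the original notebook may differ):

    # Standard EDA stack: data handling, numerics, and plotting
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns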

Then, you read in the data as a pandas data frame.

Figure 4: Market Analysis Data
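
A sketch of the first read, assuming the survey data sits in a CSV file; the file name marketing_analysis.csv is illustrative rather than taken from the original:

    # First attempt: read the raw CSV as-is
    df = pd.read_csv('marketing_analysis.csv')
    df.head()   # the first two rows hold arbitrary values, not the real column names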

The dataset is not formatted correctly. The first two rows do not contain the actual column names, just arbitrary values.

Importing Data

When importing your data, skip the first two rows so that these arbitrary rows are ignored. This ensures that your column names are populated correctly.

Figure 5: Importing Market Analysis Data
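
Using the same illustrative file name, the corrected import might look like this:

    # Skip the two junk rows so the real header row supplies the column names
    df = pd.read_csv('marketing_analysis.csv', skiprows=2)
    df.head()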

The dataset is imported correctly now. The column names are in the correct row, and you’ve dropped the arbitrary data.

The above data was collected through a survey. Information about the survey takers, such as their occupation, salary, whether they have taken a loan, and age, is given. You will use exploratory data analysis to find patterns in this data and correlations between columns. You will also perform basic data-cleaning steps.

Data Cleaning

The next step is data cleaning. Let us drop the customer ID column, as it only holds the row numbers starting from 1. Also, split the ‘jobedu’ column into two: one for the job and one for the education field. After splitting, you can drop the ‘jobedu’ column, as it is no longer needed.

Figure 6: Cleaning Market Analysis Data
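
A sketch of these cleaning steps. The column names customerid and jobedu, and the assumption that jobedu stores comma-separated "job,education" pairs, are inferred from the description above rather than shown in the original:

    # Drop the running row-number column
    df = df.drop('customerid', axis=1)

    # Split 'jobedu' (e.g. "management,tertiary") into two columns,
    # then drop the original combined column
    df[['job', 'education']] = df['jobedu'].str.split(',', expand=True)
    df = df.drop('jobedu', axis=1)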

This is what the dataset looks like now.

Figure 7: Market Analysis Data

Missing Values

The data has some missing values in its columns. There are three major categories of missing values:

  1. MCAR (Missing completely at random): The values are missing entirely at random, and their absence does not depend on any other values in the data.
  2. MAR (Missing at random): The missingness depends on other observed variables in the data.
  3. MNAR (Missing not at random): The values are missing for a specific reason, often related to the missing value itself.

Let’s check the columns which have missing values.                     

Figure 8: Missing values
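
A typical way to count the missing values per column (a sketch of what the figure shows):

    # Number of missing values in each column
    df.isnull().sum()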

There is no meaningful way to impute the missing age values, so drop all rows that are missing an age value.

Figure 9: Missing age values
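
One way to drop those rows, assuming the column is named age:

    # Keep only the rows where age is present
    df = df.dropna(subset=['age'])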

Now, in the month column, you can fill in the missing values with the most commonly occurring month. Take the mode of the month column to find the most common value, then fill the missing entries with it using fillna.

Figure 10: Filling in missing month values
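
A sketch, assuming the column is named month:

    # Most frequently occurring month
    month_mode = df['month'].mode()[0]

    # Fill the missing month values with that mode
    df['month'] = df['month'].fillna(month_mode)
    df['month'].isnull().sum()   # should now be 0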

Check to see the number of missing values left in your data.

Figure 11: Missing values

Finally, only the response column has missing values. You cannot change these values: if a person did not fill in a response, you cannot auto-generate one, so drop these rows.

Figure 12: Dropping Missing response values
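
A sketch, assuming the column is named response:

    # A survey response cannot be invented, so drop rows where it is missing
    df = df.dropna(subset=['response'])
    df.isnull().sum()   # no missing values should remain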

Finally, the data is clean. You can now start finding the outliers.

Handling Outliers

There are two types of outliers in data:

  1. Univariate outliers: Data points whose values lie outside the expected range when a single variable is considered on its own.
  2. Multivariate outliers: Outliers that depend on the relationship between two or more variables. A data point may lie within the expected range for each variable individually, yet fall far from the expected values when one variable is plotted against another.

Univariate Analysis

Now, consider the different jobs on which you have data. Plotting the job column as a bar graph, ordered by the number of people who work in each job, shows the most popular jobs in the market. Normalize the counts so they are expressed as proportions and are directly comparable.

Figure 13: Plotting the number of people performing a certain job
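
A sketch of this plot, using normalized value counts so the bars show proportions (the column name job is assumed from the prose):

    # Share of respondents in each job, smallest to largest
    df['job'].value_counts(normalize=True).sort_values().plot(kind='barh')
    plt.xlabel('Proportion of respondents')
    plt.ylabel('Job')
    plt.show()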

Moving on, plot a pie chart to compare the education qualifications of the people in the survey. Almost half of the people have only secondary school education, and one-fourth have a tertiary education.

Figure 14: Plotting the education qualification of people
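
A sketch of the pie chart, assuming the column is named education:

    # Share of each education level among respondents
    df['education'].value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%')
    plt.ylabel('')
    plt.title('Education qualification of respondents')
    plt.show()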

Bivariate Analysis

Bivariate analysis is of three main types:

1. Numeric-Numeric Analysis

When both variables being compared are numeric, the analysis is called a numeric-numeric analysis. You can use scatter plots, pair plots, and correlation matrices to compare two numeric columns.

Scatter Plot

A scatter plot represents every data point on the graph. It shows how the data in one column varies with the corresponding data points in another column. For example, plot a scatter plot of salary against bank balance, and another of balance against age.

Figure 15: Plotting a scatter plot of Salary vs. Balance 
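
A sketch of the salary vs. balance scatter plot (column names assumed from the prose):

    # Salary vs. bank balance for each individual
    plt.scatter(df['salary'], df['balance'], alpha=0.3)
    plt.xlabel('Salary')
    plt.ylabel('Balance')
    plt.show()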

Looking at the above plot, you can say that regardless of an individual's salary, the bank balance largely falls in the range of 0 to 25,000, and the majority of people have a bank balance below 40k.

Figure 16: Plotting a scatter plot of Balance vs Age
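
And similarly for balance against age:

    # Bank balance vs. age
    plt.scatter(df['age'], df['balance'], alpha=0.3)
    plt.xlabel('Age')
    plt.ylabel('Balance')
    plt.show()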

From the above graph, you can conclude that the average balance of people, regardless of age, is around 25,000. In other words, the average balance is roughly the same irrespective of both age and salary.

Pair Plot

Pair plots are used to compare multiple variables simultaneously. They plot a scatter plot of every input variable against every other variable, which saves space and lets you compare several variables at once. Let's plot the pair plot for salary, balance, and age.

Figure 17: Plotting a pairplot
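
With Seaborn, the pair plot is a single call (column names assumed as before):

    # Scatter plots of salary, balance, and age against each other
    sns.pairplot(df[['salary', 'balance', 'age']])
    plt.show()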

The figures below show the pair plots for salary, balance, and age. Each variable is plotted against the others on both the x- and y-axes.

Figure 18: Pairplots of salary, balance, and age

Correlation Matrix

A correlation matrix is used to see the correlation between different variables. The correlation coefficient measures how strongly two variables are related. The table below shows the correlation between salary, age, and balance. Correlation tells you how changes in one variable are associated with changes in another, which helps you judge how the variables move together.

Figure 19: Correlation matrix between salary, balance, and age
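
A sketch of the correlation matrix, shown here as a Seaborn heatmap (the original may display it as a plain table):

    # Pairwise correlation coefficients, visualized as a heatmap
    corr = df[['salary', 'balance', 'age']].corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.show()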

The above matrix tells us that balance has a relatively high correlation with both age and salary, while age and salary have a lower correlation coefficient with each other.

2. Numeric-Categorical Analysis

When one variable is of numeric type, and another is a categorical variable, you perform numeric-categorical analysis.

You can use the groupby function to arrange the data into groups: rows that have the same value in a particular column are grouped together. This lets you see how a numeric column behaves for each category. You can also compute aggregate statistics, such as the mean, for each group.

Figure 20: Groupby of response with respect to salary
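
A sketch of this groupby, assuming the columns are named response and salary:

    # Average salary of respondents who answered yes vs. no
    df.groupby('response')['salary'].mean()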

The above values tell you the average salary of the people who have responded yes or no in the response column.

You can also find the median salary, i.e., the middle value, of the people who responded with yes and no in the survey.

Figure 21: Median of groupby of response with respect to salary
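
The median works the same way, just with a different aggregation:

    # Median salary for each response group
    df.groupby('response')['salary'].median()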

You can also plot a box plot of salary for each response. A box plot shows you the range of values that fall under each category.

Figure 22: Boxplot of response with respect to salary
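
A sketch of the box plot using Seaborn:

    # Salary distribution for each response category
    sns.boxplot(x='response', y='salary', data=df)
    plt.show()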

The above plot tells you that the salary of people who said no in the survey ranges between 20k and 70k with a median of 60k, while the salary of people who replied yes ranges between 50k and 100k, also with a median of 60k.

3. Categorical-Categorical Analysis

When both variables contain categorical data, you perform a categorical-categorical analysis. First, convert the categorical response column into a numerical column, with 1 corresponding to a positive response and 0 corresponding to a negative response.

Figure 23: Changing categorical to numerical values
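
A sketch of the encoding, assuming the responses are stored as the strings 'yes' and 'no'; the new column name response_flag is illustrative:

    # Encode the response column as 1 (yes) / 0 (no)
    df['response_flag'] = df['response'].map({'yes': 1, 'no': 0})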

Now, plot the response rate against marital status. The figure below shows the mean response rate, that is, the fraction of people who responded yes to the survey, for each marital status.

Figure 24: Response rate by marital status
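
A sketch of the plot, assuming the marital status column is named marital:

    # Mean response rate for each marital status
    df.groupby('marital')['response_flag'].mean().plot(kind='bar')
    plt.ylabel('Response rate')
    plt.show()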

Also, plot the mean response rate for people with and without a loan.

Figure 25: Response rate by loan status
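
And similarly for the loan column (name assumed):

    # Mean response rate for people with and without a loan
    df.groupby('loan')['response_flag'].mean().plot(kind='bar')
    plt.ylabel('Response rate')
    plt.show()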

You can conclude that people who have taken a loan are likelier to respond with a no on the survey.

Conclusion

Exploratory Data Analysis provides valuable insights through data exploration, cleaning, and visualization. By understanding the fundamental steps of EDA and applying them to market analysis, professionals can make data-driven decisions and uncover hidden trends. Mastering EDA techniques is essential for anyone looking to excel in data science.

Develop your skills further and become an expert in Exploratory Data Analysis with Simplilearn's Data Scientist program. This course covers all foundational concepts and advanced data science techniques, empowering you to transform data into actionable insights. Start your journey today and unlock new career opportunities.

FAQs

1. What Are the Benefits of EDA?

Exploratory Data Analysis helps identify patterns, detect outliers, understand relationships between variables, and improve data quality, leading to more accurate and reliable models.

2. How Does EDA Differ From Data Cleaning?

Exploratory Data Analysis involves analyzing and visualizing data to understand its characteristics, while data cleaning focuses on correcting errors, handling missing values, and ensuring data consistency.

3. Can EDA Be Performed on Any Type of Data?

Yes, Exploratory Data Analysis can be performed on any type of data, including structured, unstructured, and semi-structured data, though the techniques and tools may vary.

4. What Are Some Common Visualizations Used in EDA?

Common visualizations in Exploratory Data Analysis include histograms, scatter plots, box plots, bar charts, line charts, heatmaps, and pair plots.

5. What Should Be Done if Outliers Are Found During EDA?

Investigate the cause of outliers to determine if they are errors, natural variations, or significant insights. Based on the context of the analysis, decide whether to retain, transform, or remove them.

About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
