While working with data, it can be difficult to truly understand your data when it’s just in tabular form. To understand what exactly our data conveys, and to better clean it and select suitable models for it, we need to visualize it or represent it in pictorial form. This helps expose patterns, correlations, and trends that cannot be obtained when data is in a table or CSV file.
The process of finding trends and correlations in our data by representing it pictorially is called data visualization. To perform data visualization in Python, we can use various data visualization modules such as Matplotlib, Seaborn, and Plotly. In this article, The Complete Guide to Data Visualization in Python, we will discuss how to work with some of these modules and cover the following topics in detail.
- What is Data Visualization?
- Data Visualization in Python
- Matplotlib and Seaborn
- Line Charts
- Bar Graphs
- Histograms
- Scatter Plots
- Heat Maps
What is Data Visualization?
Data visualization is a field in data analysis that deals with visual representation of data. It graphically plots data and is an effective way to communicate inferences from data.
Using data visualization, we can get a visual summary of our data. With pictures, maps and graphs, the human mind has an easier time processing and understanding any given data. Data visualization plays a significant role in the representation of both small and large data sets, but it is especially useful when we have large data sets, in which it is impossible to see all of our data, let alone process and understand it manually.
Data Visualization in Python
Python offers several plotting libraries, including Matplotlib, Seaborn, and many other data visualization packages, each with different features for creating informative, customized, and appealing plots that present data in a simple and effective way.
Figure 1: Data visualization
Matplotlib and Seaborn
Matplotlib and Seaborn are Python libraries that are used for data visualization. They have built-in modules for plotting different graphs. While Matplotlib is a general-purpose plotting library whose graphs can also be embedded into applications, Seaborn is primarily used for statistical graphs.
But when should we use which of the two? Let’s understand this with the help of a comparative analysis. The table below provides a comparison between Python’s two well-known visualization packages, Matplotlib and Seaborn.
| Matplotlib | Seaborn |
| --- | --- |
| It is used for basic graph plotting like line charts, bar graphs, etc. | It is mainly used for statistics visualization and can perform complex visualizations with fewer commands. |
| It mainly works with datasets and arrays. | It works with entire datasets. |
| Matplotlib works efficiently with data arrays and data frames. It treats the axes and figures as objects. | Seaborn is considerably more organized and functional than Matplotlib and treats the entire dataset as a single unit. |
| Matplotlib is more customizable and pairs well with Pandas and NumPy for exploratory data analysis. | Seaborn has more built-in themes and is mainly used for statistical analysis. |
Table 1: Matplotlib vs Seaborn
Line Charts
A line chart is a graph that represents information as a series of data points connected by straight line segments. Each data point, or marker, is plotted and joined to the next one, which makes changes over time easy to follow.
Let's consider the apple yield (tons per hectare) in Kanto. Let's plot a line graph using this data and see how the yield of apples changes over time. We start by importing Matplotlib and Seaborn.
Figure 2: Importing necessary modules
We are using random data points to represent the yield of apples.
Figure 3: Plotting apple yield
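The import and the first plot might look like the sketch below. The yield values are made-up numbers standing in for the article's data. Matplotlib alone is enough for this step; Seaborn is imported the same way, with `import seaborn as sns`, when it is needed.

```python
import matplotlib.pyplot as plt

# Made-up apple yields (tons per hectare) for six consecutive years.
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# With a single sequence, plt.plot uses the index (0, 1, 2, ...) as x values.
lines = plt.plot(yield_apples)

# In a script, call plt.show() to display the figure.
```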
To better understand the graph and its purpose, we can add the x-axis values too.
Figure 4: Axis values
Let's add labels to the axes so that we can show what each axis represents.
Figure 5: Axis with labels
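Adding x-axis values and axis labels could be sketched as follows. The years and yields are hypothetical values chosen for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical years paired with made-up apple yields (tons per hectare).
years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# Passing two sequences plots the second against the first.
plt.plot(years, yield_apples)

# Label both axes so the reader knows what each one represents.
plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")

ax = plt.gca()  # the Axes that the calls above drew on
```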
To plot multiple datasets on the same graph, just use the plt.plot function once for each dataset. Let's use this to compare the yields of apples vs. oranges on the same graph.
Figure 6: Plotting multiple graphs
We can add a legend which tells us what each line in our graph means. To understand what we are plotting, we can add a title to our graph.
Figure 7: Adding a legend and title
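A minimal sketch of two lines on one graph, with a legend and a title, is shown below. The yields for both crops and the title text are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical years and made-up yields (tons per hectare) for two crops.
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# One plt.plot call per dataset; both lines land on the same axes.
plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")

# The title says what the chart shows; the legend names each line
# in the order the lines were plotted.
plt.title("Crop Yields in Kanto")
plt.legend(["Apples", "Oranges"])

ax = plt.gca()
```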
To show each data point on our graph, we can highlight them with markers using the marker argument. Many different marker shapes like a circle, cross, square, diamond, etc. are provided by Matplotlib.
Figure 8: Using markers
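As an illustration of the marker argument, the sketch below uses a circle for one line and a cross for the other. The data is made up.

```python
import matplotlib.pyplot as plt

# Hypothetical years and made-up yields (tons per hectare).
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# marker="o" draws a circle at each data point, marker="x" draws a cross.
plt.plot(years, apples, marker="o")
plt.plot(years, oranges, marker="x")
plt.legend(["Apples", "Oranges"])

ax = plt.gca()
```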
You can use the plt.figure function with its figsize argument to change the size of the figure.
Figure 9: Changing graph size
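A short sketch of changing the figure size follows. The width and height are given in inches, and the 12 by 6 size here is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# Hypothetical years and made-up apple yields (tons per hectare).
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# Create the figure first; figsize is (width, height) in inches.
fig = plt.figure(figsize=(12, 6))

# Subsequent pyplot calls draw on the figure created above.
plt.plot(years, apples, marker="o")
plt.title("Yield of Apples (tons per hectare)")
```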
An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. These can be applied globally using the sns.set_style function.
Figure 10: Using Seaborn
We can also use the darkgrid option to change the background color to a darker shade.
Figure 11: Using darkgrid in Seaborn
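The two styles can be compared with a sketch like the one below, which assumes made-up yield data. A style set with sns.set_style applies to axes created after the call, so each figure is created after its style is chosen.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical years and made-up apple yields (tons per hectare).
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# "whitegrid": white background with grid lines.
sns.set_style("whitegrid")
fig_white, ax_white = plt.subplots()
ax_white.plot(years, apples, marker="o")
ax_white.set_title("whitegrid")

# "darkgrid": shaded background with grid lines.
sns.set_style("darkgrid")
fig_dark, ax_dark = plt.subplots()
ax_dark.plot(years, apples, marker="o")
ax_dark.set_title("darkgrid")
```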
Bar Graphs
When you have categorical data, you can represent it with a bar graph. A bar graph draws one bar per category, with the categories along the x-axis and the height of each bar showing the value on the y-axis.
Figure 12: Plotting Bar graphs
We can also stack bars on top of each other. Let's plot the data for apples and oranges.
Figure 13: Plotting stacked bar graphs
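A plain bar graph and a stacked one could be sketched as follows, again with made-up yields. Stacking relies on the bottom argument of plt.bar, which sets where each bar starts.

```python
import matplotlib.pyplot as plt

# Hypothetical years and made-up yields (tons per hectare).
years = [2010, 2011, 2012, 2013, 2014, 2015]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Simple bar graph: one bar per year.
plt.bar(years, apples)

# Stacked bar graph on a new figure: the orange bars start where
# the apple bars end because bottom=apples.
plt.figure()
bars_apples = plt.bar(years, apples)
bars_oranges = plt.bar(years, oranges, bottom=apples)
plt.legend(["Apples", "Oranges"])
```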
Let’s use the tips dataset in Seaborn next. The dataset consists of:
- Information about the sex (gender)
- Time of day
- Total bill
- Tips given by customers visiting the restaurant for a week
Figure 14: Tips dataset
We can draw a bar chart to visualize how the average bill amount varies across different days of the week. We can do this by computing the day-wise averages and then using plt.bar. The Seaborn library also provides a barplot function that can automatically compute averages.
Figure 15: Plotting averages of each bar
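Both approaches to plotting day-wise averages are sketched below. To keep the example self-contained, it builds a small made-up table with the same column names as the tips dataset (day, sex, total_bill); the real data would normally come from `sns.load_dataset("tips")`, which needs a network connection.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up stand-in for the tips dataset (not the real values).
tips_df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"] * 2,
    "sex": ["Male"] * 8 + ["Female"] * 8,
    "total_bill": [17.0, 19.5, 16.0, 18.0, 20.5, 22.0, 21.0, 23.5,
                   16.5, 17.5, 14.0, 15.0, 19.0, 20.0, 19.5, 21.5],
})

# Option 1: compute the day-wise averages ourselves, then use plt.bar.
day_means = tips_df.groupby("day", sort=False)["total_bill"].mean()
plt.bar(day_means.index, day_means.values)

# Option 2: let sns.barplot compute the averages; it also draws
# error bars that indicate the uncertainty of each average.
plt.figure()
ax = sns.barplot(x="day", y="total_bill", data=tips_df)
```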
If you want to compare bar plots side-by-side, you can use the hue argument. The bars are then grouped and colored by the third feature specified in this argument.
Figure 16: Plotting multiple bar graphs
You can make the bars horizontal by switching the axes.
Figure 17: Plotting horizontal bar graphs
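Grouped and horizontal bars could look like the sketch below. It uses a small made-up table with the tips dataset's column names rather than the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up stand-in for the tips dataset (not the real values).
tips_df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"] * 2,
    "sex": ["Male"] * 8 + ["Female"] * 8,
    "total_bill": [17.0, 19.5, 16.0, 18.0, 20.5, 22.0, 21.0, 23.5,
                   16.5, 17.5, 14.0, 15.0, 19.0, 20.0, 19.5, 21.5],
})

# hue="sex" draws one bar per sex within each day, side by side.
ax_grouped = sns.barplot(x="day", y="total_bill", hue="sex", data=tips_df)

# Swapping x and y makes the bars horizontal.
plt.figure()
ax_horizontal = sns.barplot(x="total_bill", y="day", hue="sex", data=tips_df)
```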
Histograms
A histogram represents the distribution of a variable by dividing its range of values into intervals (bins) and drawing one bar per bin. The bins run along the x-axis, and the height of each bar shows how many observations fall into that bin. Let's use the ‘Iris’ dataset, which contains information about flowers, to plot histograms.
Figure 18: Iris dataset
Now, let’s plot a histogram using the hist() function.
Figure 19: Plotting histograms
We can control the number or size of bins too.
Figure 20: Changing number of bins
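A sketch of a basic histogram and of the bins argument follows. The sepal widths are synthetic random numbers standing in for the sepal_width column of the Iris dataset, so the shapes will differ from the real data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the 150 sepal widths in the Iris dataset.
rng = np.random.default_rng(0)
sepal_width = rng.normal(loc=3.0, scale=0.4, size=150)

# By default plt.hist splits the data range into 10 equal bins.
counts, edges, _ = plt.hist(sepal_width)

# An integer bins argument changes the number of bins.
plt.figure()
counts5, edges5, _ = plt.hist(sepal_width, bins=5)
```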
We can also set the number and size of bins explicitly by generating the bin edges with NumPy.
Figure 21: Changing number and size of bins
We can create bins of unequal size too.
Figure 22: Bins of unequal size
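Explicit bin edges, equal or unequal, can be passed as a sequence, as in this sketch. The data is synthetic and the edge values are arbitrary choices; observations outside the outermost edges are simply not counted.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the 150 sepal widths in the Iris dataset.
rng = np.random.default_rng(0)
sepal_width = rng.normal(loc=3.0, scale=0.4, size=150)

# Equal bins: edges at 2.0, 2.25, ..., 4.75 generated with np.arange.
equal_edges = np.arange(2, 5, 0.25)
_, edges_equal, _ = plt.hist(sepal_width, bins=equal_edges)

# Unequal bins: a list of edges defines bins of different widths.
plt.figure()
_, edges_unequal, _ = plt.hist(sepal_width, bins=[1, 3, 4, 4.5])
```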
Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'. Let's draw separate histograms for each species of flowers.
Figure 23: Multiple histograms
Multiple histograms can be stacked on top of one another by setting the stacked parameter to True.
Figure 24: Stacking histograms
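Overlapping and stacked histograms might be written as below. The three arrays are synthetic stand-ins for the sepal widths of each species, and the alpha value of 0.4 is an arbitrary opacity.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the sepal widths of each species (50 each).
rng = np.random.default_rng(0)
setosa = rng.normal(3.4, 0.35, 50)
versicolor = rng.normal(2.8, 0.30, 50)
virginica = rng.normal(3.0, 0.30, 50)

# Shared bin edges so the histograms line up.
bins = np.arange(1, 6, 0.25)

# Overlapping histograms: alpha < 1 keeps the bars behind visible.
plt.hist(setosa, bins=bins, alpha=0.4, label="Setosa")
plt.hist(versicolor, bins=bins, alpha=0.4, label="Versicolor")
plt.legend()

# Stacked histograms: pass all datasets at once with stacked=True.
plt.figure()
counts, edges, _ = plt.hist(
    [setosa, versicolor, virginica],
    bins=bins,
    stacked=True,
    label=["Setosa", "Versicolor", "Virginica"],
)
plt.legend()
ax_stacked = plt.gca()
```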
Scatter Plots
A scatter plot visualizes the relationship between two variables by drawing each observation as a point at its (x, y) coordinates. The points are not joined, so clusters and trends in the data become visible, and a further categorical variable can be represented by giving each category a different color. Let's use the ‘Iris’ dataset to draw a scatter plot.
Figure 25: Iris Dataset
First, let’s see how many different species of flowers we have.
Figure 26: Unique flower species
Let’s try plotting the data with the help of a line chart.
Figure 27: Plotting line chart
This is not very informative. We cannot figure out the relationship between different data points. A scatter plot, which draws each observation as a separate point, is a better fit.
Figure 28: Scatter plot
This is much better. But we still cannot differentiate different data points belonging to different categories. We can color the dots using the flower species as a hue.
Figure 29: Scatter plot with multiple colors
Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like plt.figure and plt.title to modify the figure.
Figure 30: Changing dimensions of scatter plot
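Coloring by species and resizing the figure can be combined as in the following sketch. The data is a synthetic stand-in with the Iris column names, and the point size s=100 and the title text are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the Iris dataset (20 rows per species).
rng = np.random.default_rng(0)
flowers_df = pd.DataFrame({
    "sepal_length": np.concatenate([
        rng.normal(5.0, 0.35, 20),
        rng.normal(5.9, 0.50, 20),
        rng.normal(6.6, 0.60, 20),
    ]).round(1),
    "sepal_width": np.concatenate([
        rng.normal(3.4, 0.35, 20),
        rng.normal(2.8, 0.30, 20),
        rng.normal(3.0, 0.30, 20),
    ]).round(1),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 20),
})

# Matplotlib calls set up the figure; Seaborn then draws on it.
plt.figure(figsize=(12, 6))
plt.title("Sepal Dimensions")

# hue="species" gives each species its own color; s sets the point size.
ax = sns.scatterplot(
    x="sepal_length",
    y="sepal_width",
    hue="species",
    s=100,
    data=flowers_df,
)
```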
Heat Maps
Heatmaps are used to see changes in behavior or gradual changes in data. A heatmap uses different colors to represent different values, so the way the hue and intensity of the colors vary tells us how the phenomenon varies. Let's use a heatmap to visualize monthly passenger footfall at an airport over 12 years, using the flights dataset in Seaborn.
Figure 31: Flights dataset
The above dataset, flights_df, shows us the monthly footfall at an airport for each year from 1949 to 1960. The values represent the number of passengers (in thousands) that passed through the airport. Let’s use a heatmap to visualize this data.
Figure 32: Plotting heatmap
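A heatmap needs the data as a grid, with months as rows and years as columns. The sketch below builds a small illustrative table in long format (one row per year and month, with placeholder passenger counts) and pivots it into that grid; the name long_df is hypothetical, and the real dataset is normally loaded with `sns.load_dataset("flights")`.

```python
import pandas as pd
import seaborn as sns

# Small illustrative table in long format; the counts are placeholders
# for the real flights dataset.
months = ["Jan", "Apr", "Jul", "Oct"]
long_df = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4 + [1951] * 4,
    # An ordered categorical keeps the months in calendar order.
    "month": pd.Categorical(months * 3, categories=months, ordered=True),
    "passengers": [112, 129, 148, 119, 115, 135, 170, 133, 145, 163, 199, 162],
})

# Pivot to a grid: one row per month, one column per year.
flights_df = long_df.pivot(index="month", columns="year", values="passengers")

# Each cell is colored according to its value.
ax = sns.heatmap(flights_df)
```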
The brighter the color, the higher the footfall at the airport. By looking at the graph, we can infer that:
- The annual footfall for any given year is highest around July and August.
- The footfall grows annually. Any month in a year will have a higher footfall when compared to the previous years.
Let's display the actual values in our heatmap and change the color palette to blue.
Figure 33: Plotting heatmap with values
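Annotated cells and a blue palette could be sketched as follows, using the annot, fmt, and cmap arguments of sns.heatmap. The grid holds placeholder counts for a few months and years rather than the full flights dataset.

```python
import pandas as pd
import seaborn as sns

# Placeholder grid: rows are months, columns are years.
flights_df = pd.DataFrame(
    {
        1949: [112, 129, 148, 119],
        1950: [115, 135, 170, 133],
        1951: [145, 163, 199, 162],
    },
    index=["Jan", "Apr", "Jul", "Oct"],
)

# annot=True writes each value into its cell, fmt="d" formats the
# values as integers, and cmap="Blues" switches to a blue palette.
ax = sns.heatmap(flights_df, annot=True, fmt="d", cmap="Blues")
```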
In this article, The Complete Guide to Data Visualization in Python, we gave an overview of data visualization in Python and discussed how to create line charts, bar graphs, histograms, scatter plots, and heat maps using data visualization packages offered by Python, namely Matplotlib and Seaborn.
Python offers multiple other visualization packages that can be used to create different types of visualizations, not just graphs and plots. It is, therefore, also important to understand the challenges and advantages of the different libraries and how to use them to their full potential.