Today, data analysis has become a major function in businesses, research, meteorological departments, and many other fields. The information extracted from datasets helps people make meaningful decisions, publish research papers, predict the weather, and much more. This Spotify data analysis project tutorial will teach you to perform exploratory data analysis using Python on music-related datasets. Spotify is the world's largest audio streaming platform, with various features that include sharing songs freely and viewing the lyrics while a song plays. You will learn to analyze, visualize, and draw insights with Python libraries and functions.
We will perform the Spotify Data Analysis using the Jupyter notebook. To perform data analysis, we need to download the Spotify dataset.
The datasets are downloaded from Kaggle. You can visit the mentioned links and download your own copies of the datasets.
After downloading the dataset, we will launch the Jupyter notebook and install the following libraries: pandas, numpy, matplotlib and seaborn.
- Import the Following Libraries.
Now we will import the dataset as a csv file with the help of the read_csv function. I have stored the dataset in the Spotify datasets folder. Let's import and view the first five rows using the head() function.
Here we have stored the dataset under the variable name df_tracks.
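A minimal sketch of this import step follows. To stay runnable without the download, it reads a small made-up CSV from memory; in the notebook you would pass the path to your downloaded file instead. The column names and the tracks.csv path mentioned in the comment are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded file so the sketch runs anywhere.
# In the notebook, pass your own path instead, e.g. pd.read_csv("tracks.csv")
# (hypothetical file name; use wherever you saved the dataset).
csv_file = io.StringIO(
    "name,artists,popularity,duration_ms,release_date\n"
    "Song A,Artist 1,12,210000,1985-06-01\n"
    "Song B,Artist 2,95,185000,2020-03-20\n"
    "Song C,Artist 3,40,240000,1999-11-05\n"
    "Song D,Artist 4,0,150000,1931-01-01\n"
    "Song E,Artist 5,67,200000,2010-08-14\n"
    "Song F,Artist 6,91,195000,2019-02-02\n"
)

df_tracks = pd.read_csv(csv_file)

# head() returns the first five rows by default.
first_rows = df_tracks.head()
print(first_rows)
```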
Output:
When you download a dataset from an open repository, there is a chance that it contains null values, so it is better to check for them beforehand.
- Find Null Values Present in the Dataset.
We can check for null values with the help of the isnull() function in the pandas library.
In this line of code, we pass the dataframe to the isnull() function and use the sum() function to count the null values in each column of the dataset.
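The check described above can be sketched as follows. The frame is made up for illustration, so its counts will not match the ones reported for the real dataset.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing values.
df_tracks = pd.DataFrame({
    "name": ["Song A", None, "Song C", None],
    "popularity": [12, 95, 40, 0],
    "duration_ms": [210000, 185000, np.nan, 150000],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
null_counts = pd.isnull(df_tracks).sum()
print(null_counts)
```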
Output:
Here we can see all the columns in the dataset, and we find that the song name column has 71 null values.
- We Will Now Identify the Total Number of Rows and Columns in the Dataset and Check the Data Type and Memory Usage.
We will perform this action with the help of the info() method.
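As a small sketch of this step, assuming a made-up three-row frame in place of the real dataset:

```python
import pandas as pd

# Small made-up frame standing in for the tracks dataset.
df_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "popularity": [12, 95, 40],
    "energy": [0.31, 0.82, 0.57],
})

# info() prints the number of rows, each column's non-null count and dtype,
# and the approximate memory usage. It prints directly and returns None.
df_tracks.info()
```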
Output:
Now let’s move ahead and perform our crucial analysis in this project.
- Find Ten Least Popular Songs in the Spotify Dataset.
To get a list of least popular songs, we’ll sort the popularity column in ascending order using the sort_values() function.
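One way this sort might look, sketched on a made-up frame of twelve songs (the column names are assumed to match the dataset):

```python
import pandas as pd

# Twelve made-up songs with arbitrary popularity scores.
df_tracks = pd.DataFrame({
    "name": [f"Song {i}" for i in range(12)],
    "popularity": [55, 3, 78, 0, 91, 12, 64, 27, 8, 99, 41, 1],
})

# Sort from lowest to highest popularity and keep the first ten rows.
least_popular = df_tracks.sort_values("popularity", ascending=True).head(10)
print(least_popular)
```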
Output:
Descriptive Statistics
Let’s see some descriptive statistics for numerical variables present in our dataset.
We will use the describe() function and the transpose() function.
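A minimal sketch of these two calls on a made-up frame. Transposing puts one numeric column per row, which is easier to read when the dataset has many columns.

```python
import pandas as pd

# Small made-up frame; "name" is non-numeric and is skipped by describe().
df_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "popularity": [12, 95, 40, 0],
    "duration_ms": [210000, 185000, 240000, 150000],
})

# describe() summarizes the numeric columns (count, mean, std, quartiles);
# transpose() flips the table so each variable becomes a row.
stats = df_tracks.describe().transpose()
print(stats)
```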
Output:
- Top Ten Popular Songs With Popularity More Than 90.
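A sketch of one way to do this, assuming a boolean filter on the popularity column followed by a descending sort; the data is made up.

```python
import pandas as pd

# Made-up songs; only three have popularity strictly above 90.
df_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "popularity": [95, 88, 92, 90, 100, 15],
})

# Keep rows with popularity above 90, then list the highest ones first.
most_popular = (
    df_tracks[df_tracks["popularity"] > 90]
    .sort_values("popularity", ascending=False)
    .head(10)
)
print(most_popular)
```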
Output:
- Make the Release Date Column the Index Column.
We will perform this action with the help of the set_index function.
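This step can be sketched as below. Converting the index to datetime afterwards is an assumption added here because it makes later year-based analysis easier; the made-up dates all share one format, whereas real data may mix formats and need extra handling.

```python
import pandas as pd

# Small made-up frame with release dates stored as strings.
df_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "release_date": ["1985-06-01", "2020-03-20", "1999-11-05"],
})

# Move release_date out of the columns and into the index.
df_tracks = df_tracks.set_index("release_date")

# Optional (assumption): parse the index so that .year and similar
# attributes become available.
df_tracks.index = pd.to_datetime(df_tracks.index)
print(df_tracks.head())
```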
Output:
- Find the Name of the Artist Present in the 18th Row of the Dataset.
We can pick out any specific row from the dataset by position with the help of the integer-location indexer, iloc[].
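A small sketch of positional lookup on a made-up frame. Note that iloc[] positions are zero-based, so position 18 is the nineteenth row when counting from one; adjust the number to match the row you mean.

```python
import pandas as pd

# Twenty made-up rows so that position 18 exists.
df_tracks = pd.DataFrame({
    "name": [f"Song {i}" for i in range(20)],
    "artists": [f"Artist {i}" for i in range(20)],
})

# Select the "artists" column, then the row at zero-based position 18.
artist = df_tracks["artists"].iloc[18]
print(artist)
```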
Output:
Here we got the artist named Victor Boucher, who was present in the 18th row.
- Convert the Duration of the Songs From Milliseconds to Seconds.
We will convert the duration of the songs from milliseconds to seconds and verify it by printing the headings of the dataset to check whether the duration is converted into seconds.
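A minimal sketch of the conversion, assuming the source column is named duration_ms and the new column is called duration (both names are assumptions):

```python
import pandas as pd

# Small made-up frame with durations in milliseconds.
df_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "duration_ms": [210000, 185400, 240600],
})

# Divide by 1000 to get seconds, round to the nearest whole second,
# then drop the original millisecond column.
df_tracks["duration"] = (df_tracks["duration_ms"] / 1000).round().astype(int)
df_tracks = df_tracks.drop(columns="duration_ms")

# Check the result by viewing the first rows.
print(df_tracks.head())
```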
Output:
- Correlation Map
Now we will create our first visualization, a correlation map. First we will drop three unwanted columns (key, mode, and explicit) and apply the Pearson correlation method.
We will set the figure size for the correlation map to (14,6). We will use the heatmap() function to create the correlation map and set annot=True, which writes the data value in each cell. We will set fmt=".1g"; this is the format string applied to the annotation values. Here cmap stands for the color map. You can search for the seaborn color palettes and choose any color map from the documentation if you wish.
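The steps above can be sketched as follows. The data is randomly generated for illustration, and the coolwarm color map, the vmin/vmax/center settings, and the title are choices made for this sketch rather than the article's own settings.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data (not the real Spotify values).
rng = np.random.default_rng(0)
n = 200
energy = rng.uniform(0, 1, n)
df_tracks = pd.DataFrame({
    "energy": energy,
    "loudness": -60 + 55 * energy + rng.normal(0, 3, n),
    "acousticness": np.clip(1 - energy + rng.normal(0, 0.1, n), 0, 1),
    "key": rng.integers(0, 12, n),
    "mode": rng.integers(0, 2, n),
    "explicit": rng.integers(0, 2, n),
})

# Drop the three unwanted columns and compute Pearson correlations.
corr_df = df_tracks.drop(["key", "mode", "explicit"], axis=1).corr(method="pearson")

plt.figure(figsize=(14, 6))
# annot=True writes each value in its cell; fmt controls how it is printed.
heatmap = sns.heatmap(
    corr_df, annot=True, fmt=".1g", vmin=-1, vmax=1, center=0, cmap="coolwarm"
)
heatmap.set_title("Correlation heatmap between variables")
```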
Output:
After running the code, we get our correlation map. On the right side, you can see a scale ranging from -1 to +1. Values near -1 denote a strong negative correlation, values near 0 denote little or no linear correlation, and values near +1 denote a strong positive correlation.
- Let’s Move Ahead and Sample Only 4 Percent of the Whole Dataset.
This line of code gives us a 4 percent sample of the whole dataset, which is 2346 rows.
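A sketch of random sampling with the pandas sample() method on a made-up frame of 1000 rows. The frac value is the sampling fraction and the fixed random_state is an assumption added to make the sketch reproducible; set frac to whatever share of the data you want.

```python
import numpy as np
import pandas as pd

# Made-up frame of 1000 rows standing in for the full dataset.
df_tracks = pd.DataFrame({"popularity": np.arange(1000) % 101})

# Draw a random sample without replacement; frac is the share of rows kept.
sample_df = df_tracks.sample(frac=0.04, random_state=42)
print(len(sample_df))
```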
Output:
- Create a Regression Plot Between Loudness and Energy. Let’s Plot It in the Form of a Regression Line.
We will use the regplot() function present in the seaborn library to draw the regression plot.
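A minimal sketch of such a plot, using randomly generated data in which loudness rises with energy. The figure size, color, and title are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for the sampled dataset.
rng = np.random.default_rng(1)
energy = rng.uniform(0, 1, 150)
sample_df = pd.DataFrame({
    "energy": energy,
    "loudness": -40 + 35 * energy + rng.normal(0, 3, 150),
})

fig, ax = plt.subplots(figsize=(10, 6))
# regplot() draws the scatter points and fits a linear regression line.
sns.regplot(data=sample_df, x="energy", y="loudness", color="c", ax=ax)
ax.set_title("Loudness vs Energy")
```

The same call with x and y swapped for other column names, such as acousticness and popularity, produces the other regression plots in this article.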
Output:
The result is plotted. There is a very high positive correlation between loudness and energy, and the data points (the songs) all trend in one direction. As the energy of a track increases, its loudness increases, and as the loudness decreases, the energy decreases as well.
Similarly, we can plot another regression plot between popularity and acousticness.
- Create a Regression Plot Between Popularity and Acousticness in the Form of a Regression Line.
Output:
Here, we can see that the blue regression line slopes downward, which indicates that as the acousticness of a song increases, its popularity decreases, and vice versa.
Now, we will use the seaborn library and the lineplot() function.
- Plot a Line Graph to Show the Duration of the Songs for Each Year.
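One possible sketch of this plot, assuming the release date is a datetime index and the duration column holds seconds, as set up in the earlier steps. Averaging the duration per year before plotting is a choice made for this sketch; the data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up tracks indexed by release date, with duration in seconds.
df_tracks = pd.DataFrame(
    {"duration": [170, 190, 200, 240, 260]},
    index=pd.to_datetime(
        ["1925-01-01", "1925-06-01", "1965-03-01", "2005-07-01", "2005-09-01"]
    ),
)

# Extract the year from the index and average the duration within each year.
df_tracks["year"] = df_tracks.index.year
yearly = df_tracks.groupby("year", as_index=False)["duration"].mean()

fig, ax = plt.subplots(figsize=(18, 7))
sns.lineplot(data=yearly, x="year", y="duration", ax=ax)
ax.set_title("Year vs Duration")
```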
Output:
We got the line plot. On the X-axis, we have the years and on the Y-axis, we have the duration. Here, we can see the songs from the 1920s to 1960s were of shorter duration. After 1960, the duration of the songs started increasing until 2010. From 2010 onwards, the duration again started declining.
Data Analysis Based on Genres of the Songs
Let’s now import the genre dataset using the pandas read_csv function.
Output:
Here, we got our dataset.
- Plot Duration of the Songs w.r.t. different Genres using a horizontal barplot.
Here we will use the barplot function present in the seaborn library.
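A minimal sketch of a horizontal bar plot. The genre and duration_ms column names are assumptions about the genre dataset, and the values are made up. Putting the categorical column on the y-axis is what makes the bars horizontal.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up genre frame; column names are assumed, values are illustrative.
df_genre = pd.DataFrame({
    "genre": ["Genre A", "Genre B", "Genre C"],
    "duration_ms": [310000, 225000, 140000],
})

fig, ax = plt.subplots(figsize=(10, 6))
# Numeric values on x and categories on y give horizontal bars.
sns.barplot(data=df_genre, x="duration_ms", y="genre", ax=ax)
ax.set_xlabel("Duration (ms)")
ax.set_ylabel("Genre")
ax.set_title("Duration of songs in different genres")
```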
Output:
Here, we have the genres on the Y-axis and the duration in milliseconds on the X-axis. Analyzing the data, we find that the classical and world genres have longer songs, while children's music has shorter songs.
- Find top five genres by Popularity and plot a barplot for the same.
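A sketch of one way to rank genres, assuming one row per track with genre and popularity columns. Averaging popularity within each genre is an assumption of this sketch and may differ from the ranking method used for the article's result; the data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up tracks: two per genre, with arbitrary popularity scores.
df_genre = pd.DataFrame({
    "genre": ["Genre A", "Genre A", "Genre B", "Genre B", "Genre C", "Genre C",
              "Genre D", "Genre D", "Genre E", "Genre E", "Genre F", "Genre F"],
    "popularity": [80, 90, 88, 92, 78, 82, 75, 77, 70, 74, 10, 20],
})

# Average popularity per genre, then keep the five highest averages.
top5 = (
    df_genre.groupby("genre")["popularity"].mean().nlargest(5).reset_index()
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=top5, x="popularity", y="genre", ax=ax)
ax.set_title("Top 5 Genres by Popularity")
```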
Output:
Here we get our top five genres by popularity: Dance, Pop, Rap, Hip-Hop, and Reggaeton.
Conclusion
Today, businesses hire data analysts to analyze their collected data and use the extracted information to know more about their consumers. We can easily analyze the data and draw useful insights with various Python libraries and functions.
From this article, we learned to analyze music data, created interesting visualizations, found correlations, and extracted useful insights using the Spotify dataset. Check out Simplilearn's Data Analytics Certification Program in partnership with Purdue University and in collaboration with IBM. This program provides a hands-on approach with case studies and industry-aligned projects to bring the relevant concepts to life. You will get broad exposure to key technologies and skills currently used in data analytics.
If you have any questions or inputs for our editorial team regarding this “The Best Spotify Data Analysis Project You Need to Know” article, do share them in the comments section below. Our team will review them and help solve them for you very soon!
Happy learning!