Everything You Need to Know About Feature Selection

The input variables that we give to our machine learning models are called features. Each column in our dataset constitutes a feature. To train an optimal model, we need to make sure that we use only the essential features. If we have too many features, the model can pick up unimportant patterns and learn from noise. The method of choosing the important features of our data is called Feature Selection.

This article covers what you need to know to get started with feature selection. The topics covered are:

  • Why Feature Selection?
  • What is Feature Selection?
  • Feature Selection Models
  • How to choose a Feature Selection Model?
  • Feature Selection with Python

Why Feature Selection? 

Machine learning models follow a simple rule: whatever goes in, comes out. If we put garbage into our model, we can expect the output to be garbage too. In this case, garbage refers to noise in our data.

To train a model, we collect enormous quantities of data to help the machine learn better. Usually, a good portion of the data collected is noise, and some of the columns of our dataset might not contribute significantly to the performance of our model. Further, having a lot of data can slow down both training and prediction. The model may also learn from this irrelevant data and become inaccurate.


Feature selection is what separates good data scientists from the rest. Given the same model and computational facilities, why do some people win in competitions with faster and more accurate models? The answer is Feature Selection. Apart from choosing the right model for our data, we need to choose the right data to put in our model. 

Consider a table that contains information on old cars, used by a model that decides which cars must be crushed for spare parts.

Figure 1: Old cars dataset

In the above table, the model of the car, the year of manufacture, and the miles it has traveled are all relevant to deciding whether the car is old enough to be crushed. The name of the previous owner, however, has no bearing on that decision. Worse, it can lead the algorithm to find spurious patterns between names and the other features. Hence, we can drop that column.

Figure 2: Dropping columns for feature selection
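As a minimal sketch of this step, the snippet below drops the owner column with pandas. The table contents and the column names ("model", "year", "miles", "owner") are made up for illustration and are not the article's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the old-cars table; all values are made up.
cars = pd.DataFrame({
    "model": ["Sedan A", "Hatchback B", "Coupe C"],
    "year": [1998, 2005, 2001],
    "miles": [210000, 120000, 180000],
    "owner": ["Alice", "Bob", "Carol"],
})

# The previous owner's name says nothing about the car's condition,
# so we exclude it from the features passed to the model.
features = cars.drop(columns=["owner"])
print(list(features.columns))
```

Note that drop returns a new DataFrame, so the original table stays available if we change our mind later.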

What is Feature Selection?

Feature Selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise in the data.

It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. We do this by including or excluding features without changing them. It helps in cutting down the noise in our data and reducing the size of our input data.

Figure 3: Feature Selection

Feature Selection Models

Feature selection models are of two types:

  1. Supervised Models: Supervised feature selection refers to methods that use the output label class for feature selection. They use the target variable to identify the features that can increase the efficiency of the model.
  2. Unsupervised Models: Unsupervised feature selection refers to methods that do not need the output label class for feature selection. We use them for unlabelled data.

Figure 4: Feature Selection Models

We can further divide the supervised models into three types:

1. Filter Method: In this method, features are dropped based on their statistical relationship to the output. We score each feature, for example by checking how strongly it is positively or negatively correlated with the output labels, and drop the weakly related features. E.g., Information Gain, Chi-Square Test, Fisher’s Score.

Figure 5: Filter Method flowchart
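The idea can be sketched with a simple correlation filter. This is only an illustration on synthetic data: the column names, the random data, and the 0.3 cut-off are all assumptions, and a real project would choose the scoring function to match its variable types.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: only "signal" actually drives the target.
df = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
target = 2.0 * df["signal"] + rng.normal(scale=0.5, size=n)

# Score every feature by its absolute Pearson correlation with the target.
scores = df.corrwith(target).abs()

# Keep the features whose score clears an arbitrary, hypothetical threshold.
threshold = 0.3
selected = scores[scores > threshold].index.tolist()
print(selected)
```

No model is trained at any point, which is what makes filter methods cheap compared with the other two families.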

2. Wrapper Method: We train a model on a subset of the features. Based on the performance of the model, we add or remove features and train the model again. The subsets are formed using a greedy approach, so candidate combinations are evaluated step by step rather than by exhaustively trying every possible combination. E.g., Forward Selection, Backward Elimination.

Figure 6: Wrapper Method Flowchart
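Forward selection, one of the wrapper methods named above, can be sketched as follows. The data is synthetic, the model is plain least squares fitted with NumPy, and the stopping rule (require at least a 10% drop in held-out error) is an arbitrary choice made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
# Assumed ground truth for the demo: only columns 0 and 3 matter.
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

# Hold out the last 100 rows to score each candidate subset.
split = 200
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

def validation_mse(columns):
    """Fit least squares on the chosen columns and return the held-out MSE."""
    A_train = np.column_stack([np.ones(split), X_train[:, columns]])
    A_val = np.column_stack([np.ones(n - split), X_val[:, columns]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((y_val - A_val @ coef) ** 2))

selected, remaining = [], list(range(X.shape[1]))
# Baseline: predict the training mean for every validation row.
best_mse = float(np.mean((y_val - y_train.mean()) ** 2))

while remaining:
    # Greedy step: try adding each remaining feature and keep the best one.
    trial = {j: validation_mse(selected + [j]) for j in remaining}
    j_best = min(trial, key=trial.get)
    # Stop when the best addition no longer improves the score by 10%.
    if trial[j_best] > 0.90 * best_mse:
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_mse = trial[j_best]

print(sorted(selected))
```

Every candidate subset costs one model fit, which is why wrapper methods tend to be more expensive than filter methods.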

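3. Intrinsic Method: This method, also known as the embedded method, combines the qualities of both the Filter and Wrapper methods to create the best subset.

Figure 7: Intrinsic Model Flowchart

Here, feature selection happens as part of the model's own iterative training process, which keeps the computation cost low. E.g., Lasso regression, which can shrink the coefficients of unhelpful features to exactly zero. Ridge regression is often listed alongside it, although it only shrinks coefficients and does not remove features outright.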
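To make the Lasso example concrete, here is a small hand-rolled coordinate descent sketch in NumPy. It is meant only to show how the L1 penalty zeroes out coefficients during training; the data, the penalty strength alpha=0.1, and the function names are assumptions, and in practice a library implementation would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 4))
# Assumed ground truth for the demo: columns 2 and 3 are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Standardize the features and center the target so no intercept is needed.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return 0 if it lies within +/- alpha."""
    return np.sign(value) * max(abs(value) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1 one coordinate at a time."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            w[j] = soft_threshold(rho, alpha) / z
    return w

w = lasso_coordinate_descent(X, y, alpha=0.1)
# Features whose coefficient survived the penalty are the selected ones.
selected = [j for j in range(X.shape[1]) if w[j] != 0.0]
print(selected)
```

The selection is a by-product of fitting the model once, rather than a separate scoring pass (filter) or a loop of repeated fits (wrapper).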

How to Choose a Feature Selection Model?

How do we know which feature selection model suits our problem? The process is relatively simple: the choice depends on the types of the input and output variables.


Variables are of two main types:

  • Numerical Variables: These include integers and floating-point values.
  • Categorical Variables: These include labels, strings, boolean variables, etc.

Based on whether we have numerical or categorical variables as inputs and outputs, we can choose our feature selection model as follows:

Input Variable | Output Variable | Feature Selection Model
Numerical      | Numerical       | Pearson’s correlation coefficient; Spearman’s rank coefficient
Numerical      | Categorical     | ANOVA correlation coefficient (linear); Kendall’s rank coefficient (nonlinear)
Categorical    | Numerical       | ANOVA correlation coefficient (linear); Kendall’s rank coefficient (nonlinear)
Categorical    | Categorical     | Chi-Squared test (contingency tables); Mutual Information

Table 1: Feature Selection Model lookup
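For the numerical input and numerical output case, the difference between the two coefficients can be sketched on synthetic data. The exponential relationship below is chosen purely for illustration. Spearman's coefficient is computed here as Pearson's coefficient on the ranks, which is its definition and avoids any extra dependency.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = pd.Series(rng.uniform(0.0, 5.0, size=100), name="x")
# A monotonic but strongly nonlinear relationship, chosen for the demo.
y = np.exp(x)

# Pearson measures linear association between the raw values.
pearson = x.corr(y)  # method="pearson" is the default

# Spearman is Pearson applied to the ranks, so it captures any
# monotonic relationship, linear or not.
spearman = x.rank().corr(y.rank())

print(round(pearson, 3), round(spearman, 3))
```

A feature that looks only moderately useful under Pearson's coefficient may still rank very highly under Spearman's, so it is worth checking both when the relationship might be nonlinear.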


Feature Selection With Python

Let’s get hands-on experience in feature selection by working on the Kobe Bryant Dataset which analyses shots taken by Kobe from different areas of the court to determine which ones will go into the basket. 

The dataset is as shown:

Figure 8: Kobe Bryant Dataset

As we can see, the dataset has 25 different columns. We will not need all of them. 

We begin by loading the necessary modules.

Figure 9: Importing modules
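The original code for this step survives only as a figure, so the following is a sketch of what loading and a first inspection might look like. The file name in the comment is hypothetical, and a tiny made-up frame stands in for the real dataset so that the snippet runs on its own.

```python
import pandas as pd

# With the real data this would be something like pd.read_csv("data.csv");
# the file name is hypothetical, so a tiny made-up frame is built instead.
raw = pd.DataFrame({
    "loc_x": [0, 30, -40],
    "loc_y": [0, 40, 30],
    "shot_distance": [0, 5, 5],
    "team_name": ["Team A"] * 3,
})

# First checks after loading: size, column types, and missing values.
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())
```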

First, let's check out the loc_x and loc_y columns. They probably represent longitude and latitude. 

Figure 10: Plotting the latitude and longitude columns in our dataset

The figure is as shown:

Figure 11: Plotting Latitude and Longitude

From the above figures, we can see that they resemble the two ‘D’s on a basketball court. Instead of keeping the raw x and y coordinates, we can change them into polar form and work with the distance [‘dist’] and angle [‘angle’] of each shot.

Figure 12: Changing Latitude and Longitude into polar form
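A minimal sketch of the polar conversion is shown below. The coordinates are made up, and the origin is assumed to be the basket. The original code is not recoverable from the figure, so this is one reasonable way to do it rather than the article's exact implementation.

```python
import numpy as np
import pandas as pd

# Made-up court coordinates standing in for the loc_x / loc_y columns.
raw = pd.DataFrame({"loc_x": [0, 30, -40, 0], "loc_y": [0, 40, 30, 50]})

# Radius: straight-line distance from the origin (assumed to be the basket).
raw["dist"] = np.hypot(raw["loc_x"], raw["loc_y"])

# Angle in radians; arctan2 handles loc_x == 0 without dividing by zero.
raw["angle"] = np.arctan2(raw["loc_y"], raw["loc_x"])

print(raw)
```

Using arctan2 rather than arctan(loc_y / loc_x) avoids special-casing shots taken from directly in front of the basket, where loc_x is zero.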

We can combine the minutes and seconds columns into a single column for time. 

Figure 13: Combining two columns
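Combining the two time columns could look like the sketch below. The column names minutes_remaining, seconds_remaining and remaining_time are assumptions, since the text only refers to the minutes and seconds columns, and the values are made up.

```python
import pandas as pd

# Column names are assumptions; the text only mentions "minutes" and "seconds".
raw = pd.DataFrame({
    "minutes_remaining": [10, 0, 5],
    "seconds_remaining": [27, 59, 0],
})

# Express the time left in the period as a single number of seconds.
raw["remaining_time"] = raw["minutes_remaining"] * 60 + raw["seconds_remaining"]

# The two source columns are now redundant.
raw = raw.drop(columns=["minutes_remaining", "seconds_remaining"])
print(raw)
```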

Let’s look at the unique values in the ‘team_id’ and ‘team_name’ columns:

Figure 14: Unique values in ‘team_id’ and ‘team_name’
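A quick way to run this check is sketched here on made-up rows. The extra period column and all values are illustrative only.

```python
import pandas as pd

# Made-up rows; only the column names team_id and team_name come from the text.
raw = pd.DataFrame({
    "team_id": [1] * 4,
    "team_name": ["Team A"] * 4,
    "period": [1, 2, 2, 4],
})

print(raw["team_id"].unique())
print(raw["team_name"].unique())

# A column with a single distinct value cannot help separate outcomes.
constant_columns = [col for col in raw.columns if raw[col].nunique() == 1]
print(constant_columns)
```

Scanning every column with nunique is a cheap first pass on any new dataset, not just this one.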

Each of these columns contains only one value, so both can be dropped. Let’s take a look at the ‘match_up’ and ‘opponent’ columns:

Figure 15: ‘match_up’ and ‘opponent’ columns
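One way to verify this kind of overlap is sketched below. The rows are made up, and the "TEAM @ OPP" / "TEAM vs. OPP" layout of the match_up values is an assumption, since the actual values are visible only in the figure.

```python
import pandas as pd

# Made-up rows; the "TEAM @ OPP" / "TEAM vs. OPP" layout is an assumption.
raw = pd.DataFrame({
    "match_up": ["LAL @ POR", "LAL vs. UTA", "LAL @ SAS"],
    "opponent": ["POR", "UTA", "SAS"],
})

# Take the last whitespace-separated token of match_up and compare it to opponent.
last_token = raw["match_up"].str.split().str[-1]
redundant = (last_token == raw["opponent"]).all()
print(redundant)
```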

Again, they contain the same information. Let’s plot the values of ‘dist’ and ‘shot_distance’ columns on the same graph to see how they differ:

Figure 16: Plotting ‘dist’ and ‘shot_distance’ columns
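Besides plotting, a numeric check can confirm that two columns carry the same information. In the sketch below the data is synthetic, and shot_distance is assumed to be dist expressed on a coarser scale. That assumption is made only so the example has something to compare.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Made-up data: shot_distance is assumed to be dist on a coarser scale.
dist = rng.uniform(0.0, 300.0, size=500)
raw = pd.DataFrame({"dist": dist, "shot_distance": np.floor(dist / 10.0)})

# A correlation close to 1 suggests one column can stand in for the other.
corr = raw["dist"].corr(raw["shot_distance"])
print(round(corr, 3))
```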

Again, they contain exactly the same information. Let’s take a look at columns shot_zone_area, shot_zone_basic and shot_zone_range.

Figure 17: Plotting the different shot zones columns

The figure below shows the plots:

Figure 18: Different shot zones

We can see that they contain the different parts of the court from where the shots were taken. This information already exists in the angle and dist columns. 

Now, let’s drop all the useless columns.

Figure 19: Dropping Columns
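The drop itself might be sketched as follows, on a made-up frame. The exact list of dropped columns is not recoverable from the figure. In particular, which column of each redundant pair is kept ('opponent' rather than 'match_up', 'dist' rather than 'shot_distance') is an assumption made for this example.

```python
import pandas as pd

# Made-up frame using column names mentioned in the walkthrough.
raw = pd.DataFrame({
    "team_id": [1, 1], "team_name": ["A", "A"],
    "loc_x": [10, -20], "loc_y": [30, 40],
    "match_up": ["A @ B", "A vs. C"], "opponent": ["B", "C"],
    "shot_distance": [3, 4],
    "dist": [31.6, 44.7], "angle": [1.25, 2.03],
})

# Columns judged redundant; the choice within each pair is an assumption.
drops = ["team_id", "team_name", "loc_x", "loc_y", "match_up", "shot_distance"]

# errors="ignore" skips any listed name that is not present in the frame.
clean = raw.drop(columns=drops, errors="ignore")
print(list(clean.columns))
```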

After merging columns and removing useless columns, we get a dataset that contains only 11 important columns.

Figure 20: Final Dataset


Conclusion

In this article, we saw how important it is to select the best features for our machine learning model. We then looked at what feature selection is and at some feature selection models. Next, we covered a simple way to choose the right feature selection model based on the types of the input and output variables. Finally, we saw how to implement feature selection in Python with a demo. If you are looking to learn more about feature selection and related fundamentals of Python, Simplilearn’s Python Certification Course would be ideal for you. This Python certification course covers the fundamentals of Python, including data operations, conditional statements, shell scripting, and Django, and prepares you for a rewarding career as a professional Python programmer.

Was this article on feature selection useful to you? Do you have any doubts or questions for us? Mention them in this article's comments section, and we'll have our experts answer them for you at the earliest!

About the Author

Kartik Menon

Kartik is an experienced content strategist and an accomplished technology marketing specialist passionate about designing engaging user experiences with integrated marketing and communication solutions.
