This is the ‘Data Preprocessing’ tutorial, which is part of the Machine Learning course offered by Simplilearn. We will learn Data Preprocessing, Feature Scaling, and Feature Engineering in detail in this tutorial.
Let’s look at the objectives of the Data Preprocessing tutorial.
A quick overview of Data Preparation in Machine Learning is given below.
The process of preparing data for a Machine Learning algorithm comprises data selection, data preprocessing, and data transformation.
Data Selection involves choosing data that is sufficient in volume, representative of the problem, and of acceptable quality, and then picking a sample of the right size; each of these aspects is discussed below.
Let’s understand Data Preprocessing in detail below.
After the data has been selected, it needs to be preprocessed using the steps described below.
Data cleaning at this stage involves filtering the data based on the following factors:
Insufficient Data
The amount of data required for ML algorithms can vary from thousands to millions, depending upon the complexity of the problem and the chosen algorithm.
Non-Representative Data
The sample selected must be an exact representation of the entire data, as non-representative data might train an algorithm such that it won't generalize well on new test data.
Substandard Data
Outliers, errors, and noise can be eliminated to get a better fit of the model. Missing features, such as age missing for 10% of the audience, may either be ignored completely or be filled in with an assumed average value.
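As a hedged sketch of both options, the snippet below uses pandas and scikit-learn's SimpleImputer; the column names and values are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data: 'age' is missing for some records
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, np.nan, 29],
    "income": [40000, 52000, 61000, 58000, 45000, 39000],
})

# Option 1: ignore records with missing values completely
dropped = df.dropna()

# Option 2: assume the column mean for the missing values
imputer = SimpleImputer(strategy="mean")
df["age"] = imputer.fit_transform(df[["age"]]).ravel()
```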
Selecting the right size of the sample is a key step in data preparation. Samples that are too large or too small might give skewed results.
Sampling Noise
Smaller samples cause sampling noise since they get trained on non-representative data. For example, checking voter sentiment from a very small subset of voters.
Sampling Bias
Larger samples work well as long as there is no sampling bias, that is, when the right data is picked. For example, sampling bias would occur when checking voter sentiment only for the technically sound subset of voters, while ignoring others.
The selected and preprocessed data is transformed using one or more of the following methods:
Let us look at the types of data below.
Data that is not marked and requires real-time unsupervised learning is categorized as unlabeled data.
Data provided to test a hypothesis created via prior learning is known as test data.
Typically, 20% of the labeled data is reserved for testing.
The validation dataset is used to retest the hypothesis (in case the algorithm got overfitted even to the test data due to multiple attempts at testing).
The illustration given below depicts how total available labeled data may be segregated into the training dataset, test dataset, and validation dataset.
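One common way to produce such a split, assuming an 80/20 train/test partition with a further 25% of the training portion held out for validation, is to call scikit-learn's train_test_split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny synthetic labeled dataset, just for illustration
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Hold out 20% as the test set, then split 25% of the remainder off as validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
# Final split: 60% training, 20% validation, 20% test
```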
The transformation stage in the data preparation process includes an important step known as Feature Engineering.
Feature Engineering refers to selecting and extracting the right features from the data that are relevant to the task and model in consideration.
The place of feature engineering in the machine learning workflow is shown below:
| Technique | Description |
| --- | --- |
| Feature Selection | Most useful and relevant features are selected from the available data |
| Feature Extraction | Existing features are combined to develop more useful ones |
| Feature Addition | New features are created by gathering new data |
| Feature Filtering | Irrelevant features are filtered out to make the modeling step easier |
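As a minimal illustration of feature selection, scikit-learn's SelectKBest keeps only the k features that score highest against the target (shown here on the built-in Iris data; the choice of k=2 is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)          # 150 samples, 4 input features
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)  # keep the 2 highest-scoring features
print(X.shape, "->", X_selected.shape)     # (150, 4) -> (150, 2)
```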
Feature scaling is an important step in the data transformation stage of the data preparation process.
Feature Scaling is a method used in Machine Learning for standardization of independent variables of data features.
Let’s understand the importance of Feature Scaling below.
There are two types of Feature Scaling: Standardization and Normalization.
Let us understand the Standardization technique below.
To standardize the jth feature, you subtract the sample mean μj from every training sample and divide by the standard deviation σj:

x′j = (xj − μj) / σj

Here, xj is a vector consisting of the jth feature values of all n training samples.
Given below is a sample NumPy snippet that uses NumPy's mean and standard deviation functions to standardize the features of a sample data set X (x0, x1, ...):
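The snippet below is a minimal sketch of that approach; the sample array is illustrative:

```python
import numpy as np

# Illustrative sample data set X: rows are samples, columns are features x0, x1
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardize each feature: subtract its mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```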
The ML library scikit-learn implements a class for standardization called StandardScaler, as demonstrated here:
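A short sketch of how StandardScaler is typically used (reusing the illustrative array from above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has zero mean and unit variance
print(X_std)
```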
In most cases, normalization refers to the rescaling of data features between 0 and 1, which is a special case of Min-Max scaling.
To normalize a feature, subtract the feature's minimum value from each instance and divide by the spread between its maximum and minimum values:

x′ = (x − xmin) / (xmax − xmin)

In effect, this measures the relative distance of each instance from the minimum value of that feature.
The ML library scikit-learn has a MinMaxScaler class for normalization.
The difference between standardization and normalization can be seen on a small sample dataset with values from 1 to 5, as computed below:
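A minimal sketch that computes both scalings for these values with scikit-learn (StandardScaler divides by the population standard deviation, MinMaxScaler rescales to the [0, 1] range):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(values).ravel()
normalized = MinMaxScaler().fit_transform(values).ravel()

for raw, s, n in zip(values.ravel(), standardized, normalized):
    print(f"value={raw:.0f}  standardized={s:+.3f}  normalized={n:.2f}")
# standardized: -1.414, -0.707, +0.000, +0.707, +1.414
# normalized:    0.00,   0.25,   0.50,   0.75,   1.00
```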
Given below are some popular datasets used in Machine Learning.
The Iris flower dataset is one of the most popular datasets available online and is widely used to train and test various ML algorithms.
The Modified National Institute of Standards and Technology (MNIST) dataset is another popular dataset used with ML algorithms.
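Both datasets can be loaded through scikit-learn; a minimal sketch is shown below (fetching MNIST downloads the data on first use, so it needs an internet connection):

```python
from sklearn.datasets import load_iris, fetch_openml

# Iris: 150 flower samples, 4 numeric features, 3 classes
iris = load_iris()
print(iris.data.shape, list(iris.target_names))

# MNIST: 70,000 handwritten-digit images of 28x28 pixels (downloaded on first use)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
print(mnist.data.shape)  # (70000, 784)
```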
As the amount of data in the world grows, the size of the datasets available for ML development also grows.
Let’s look at some aspects of Dimensionality Reduction below.
Example: Car attributes might contain the maximum speed in two units, kilometers per hour and miles per hour. One of these can be safely discarded in order to reduce the dimensions and simplify the data.
Some key aspects of Dimensionality Reduction are mentioned below.
Let’s look at some aspects of Principal Component Analysis below.
Before the PCA algorithm is developed, you need to preprocess the data to normalize its mean and variance.
How do you find the axis of variation u on which most of the data lies?
You get the direction of maximum variance by computing the principal eigenvector of the covariance matrix of the preprocessed data.
Given below are the applications of PCA.
PCA can eliminate noise or noncritical aspects of the data set to reduce complexity. Also, during image processing or comparison, image compression can be done with PCA, eliminating noise such as lighting variations in face images.
It is used to map high dimensional data to lower dimensions. For example, instead of having to deal with multiple car types (dimensions), we can cluster them into fewer types.
It reduces data dimensions before running a supervised learning program and saves on computations as well as reduces overfitting.
For example, after applying PCA to a 3D dataset, one may find that only two dimensions (shown as red and green in the illustration) carry most of the variance; the third (blue) dimension has limited variance and is therefore eliminated.
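A minimal sketch of this kind of reduction with scikit-learn's PCA class, using synthetic 3-D data in place of the illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D data: two dimensions carry most of the variance, the third very little
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 0.1])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # project onto the top two principal axes
print(pca.explained_variance_ratio_)   # nearly all variance is in two components
print(X_reduced.shape)                 # (200, 2)
```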
Let us go through what you have learned so far in this Data Preprocessing tutorial.
This concludes the “Data Preprocessing” tutorial. In the next lesson, we will learn about the “Math Refresher”.