Machine learning isn’t an easy thing. Alright, so that’s an understatement. Artificial Intelligence and machine learning represent a major leap forward in getting computers to think like humans, but both concepts are challenging to master. Fortunately, the payoff is worth the effort.
Today we’re tackling the process of dimensionality reduction in machine learning, including principal component analysis (PCA). We will cover its definition, why it’s important, how to do it, and provide you with a relatable example to clarify the concept.
Once you’re done, you will have a solid grasp of dimensionality reduction, something that could come in handy during an interview. You will also know how to answer deep learning interview questions or machine learning interview questions with greater confidence and accuracy.
What Is Dimensionality Reduction?
Before we give a clear definition of dimensionality reduction, we first need to understand dimensionality. If you have too many input variables, machine learning algorithm performance may degrade. Suppose you use rows and columns, like those commonly found on a spreadsheet, to represent your ML data. In that case, the columns become input variables (also called features) fed to a model predicting the target variable.
Additionally, we can treat the data columns as dimensions on an n-dimensional feature space, while the data rows are points located on the space. This process is known as interpreting a data set geometrically.
Unfortunately, if many dimensions reside in the feature space, that results in a large volume of space. Consequently, the points in the space and rows of data may represent only a tiny, non-representative sample. This imbalance can negatively affect machine learning algorithm performance. This condition is known as “the curse of dimensionality.” The bottom line: a data set with a vast number of input features complicates the predictive modeling task, putting performance and accuracy at risk.
Here’s an example to help visualize the problem. Assume you walked in a straight line for 50 yards, and somewhere along that line, you dropped a quarter. You will probably find it fast. But now, let’s say your search area covers a square 50 yards by 50 yards. Now your search could take days! And we’re not done yet. Make that search area a cube that’s 50 by 50 by 50 yards, and you may want to say “goodbye” to that quarter! The more dimensions involved, the more complex and time-consuming the search becomes.
How do we lift the curse of dimensionality? By reducing the number of input features, thereby reducing the number of dimensions in the feature space. Hence, “dimensionality reduction.”
To make a long story short, dimensionality reduction means reducing your feature set’s dimension.
Why Dimensionality Reduction is Important
Dimensionality reduction brings many advantages to your machine learning data, including:
- Fewer features mean less complexity
- You will need less storage space because there is less data
- Fewer features require less computation time
- Model accuracy improves because there is less misleading data
- Algorithms train faster on less data
- Reducing the data set’s feature dimensions makes the data easier to visualize
- It removes noise and redundant features
Benefits Of Dimensionality Reduction
Dimensionality reduction is helpful for AI engineers and data professionals who work with enormous datasets, visualize data, and analyze complex data.
- It aids in data compression, resulting in less storage space being required.
- It speeds up the calculation.
- It also aids in removing any extraneous features.
Disadvantages Of Dimensionality Reduction
- Some data is lost during the dimensionality reduction process, which can affect how well subsequent training algorithms perform.
- It may require a lot of processing power.
- Interpreting the transformed features can be challenging.
- It makes the independent variables harder to understand.
Dimensionality Reduction In Predictive Modeling
A simple email classification problem, where we must determine whether or not an email is spam, can illustrate dimensionality reduction. Such a problem can involve a wide range of features: whether the email uses a template, its content, whether it has a generic subject line, and so on.
Some of these features, however, may overlap. In another case, a classification problem that depends on both rainfall and humidity can be reduced to a single underlying feature, because the two are strongly correlated. As a result, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas 2-D and 1-D problems can both be mapped to a simple two-dimensional space. Picture a 3-D feature space split into two 2-D feature spaces; if those two feature spaces are later found to be correlated, the number of features can be reduced even further.
Dimensionality Reduction Methods and Approaches
So now that we’ve established how much dimensionality reduction benefits machine learning, what’s the best way to do it? Below are the principal approaches you can take, each subdivided into several methods. These approaches and methods are collectively known as dimensionality reduction algorithms.
The first approach, feature selection, is a means of choosing the optimal, relevant features from the input data set and removing irrelevant features.
- Filter methods. This method filters down the data set into a relevant subset.
- Wrapper methods. This method uses the machine learning model to evaluate the performance of features fed into it. The performance determines whether it’s better to keep or remove the features to improve the model’s accuracy. This method is more accurate than filtering but is also more complex.
- Embedded methods. The embedded process checks the machine learning model’s various training iterations and evaluates each feature’s importance.
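As a concrete illustration of the filter approach, here is a minimal sketch using scikit-learn’s `SelectKBest`; the library, the synthetic dataset, and the choice of k are illustrative assumptions, not prescribed above.

```python
# A filter-style feature selection sketch: score each feature against the
# target and keep only the top-scoring subset (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 100 samples, 10 features, only 3 of which are actually informative
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print(X.shape)          # (100, 10)
print(X_reduced.shape)  # (100, 3)
```

A filter like this never trains a model, which is why it is fast but less accurate than the wrapper methods described above.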
The second approach, feature extraction, transforms a space with many dimensions into a space with fewer dimensions. This process is useful for retaining as much of the information as possible while using fewer resources during processing. Here are three of the more common extraction techniques.
- Linear discriminant analysis. LDA is commonly used for dimensionality reduction on labeled, continuous data. LDA projects the data onto a lower-dimensional space in the directions that best separate the classes, keeping examples of the same class close together.
- Kernel PCA. This process is a nonlinear extension of PCA that works for more complicated structures that cannot be represented in a linear subspace in an easy or appropriate manner. KPCA uses the “kernel trick” to construct nonlinear mappings.
- Quadratic discriminant analysis. QDA is a variant of LDA that models each class with its own covariance matrix, allowing quadratic rather than strictly linear boundaries between classes. Like LDA, it keeps examples from the same class close together while separating examples from different classes.
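Two of the extraction techniques above can be sketched with scikit-learn on a synthetic labeled dataset; the library, dataset, and parameter choices here are illustrative assumptions.

```python
# Feature extraction sketches: supervised LDA and nonlinear Kernel PCA.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import KernelPCA

X, y = make_classification(n_samples=200, n_features=8, n_classes=3,
                           n_informative=4, random_state=0)

# LDA: supervised projection; at most (n_classes - 1) output dimensions
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Kernel PCA: unsupervised, nonlinear projection via the RBF kernel trick
kpca = KernelPCA(n_components=2, kernel="rbf")
X_kpca = kpca.fit_transform(X)

print(X_lda.shape, X_kpca.shape)  # (200, 2) (200, 2)
```

Note the difference in supervision: LDA needs the class labels `y`, while Kernel PCA only looks at `X`.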
Dimensionality Reduction Techniques
Here are some techniques machine learning professionals use.
Principal Component Analysis.
Principal component analysis, or PCA, is a technique for reducing the number of dimensions in big data sets by condensing a large collection of variables into a smaller set that retains most of the large set's information.
Machine learning algorithms can analyze smaller sets of information far more quickly and efficiently because there are fewer unnecessary variables to evaluate. However, some accuracy is inevitably lost as a data set’s variables are reduced; the idea behind dimensionality reduction is to trade a little accuracy for simplicity. In short, PCA seeks to retain as much information as practical while minimizing the number of variables in a data set.
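The accuracy-for-simplicity trade-off above can be made explicit in code. This sketch uses scikit-learn’s `PCA` on the standard digits dataset; passing a fraction as `n_components` asks PCA to keep just enough components to retain that share of the variance (the library and dataset are my illustrative choices).

```python
# PCA sketch: keep enough components to retain ~95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel features per image

pca = PCA(n_components=0.95)          # fraction => variance to retain
X_reduced = pca.fit_transform(X)

print(X.shape[1], "->", X_reduced.shape[1])       # far fewer than 64
print(round(pca.explained_variance_ratio_.sum(), 3))
```

Lowering the fraction shrinks the data further but gives up more information, which is exactly the trade-off described above.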
Backward Feature Elimination.
Backward elimination helps the model perform better by starting with all the features and removing the least significant one at a time. We repeat this until removing a feature yields no improvement.
- Start with all of the model's variables.
- Drop the least valuable variable (based, for example, on the smallest loss in model accuracy), and repeat until a stopping criterion is met.
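The steps above can be sketched with scikit-learn’s recursive feature elimination (`RFE`), which repeatedly drops the least important feature according to a fitted model; the estimator and the target number of features are illustrative assumptions.

```python
# Backward elimination sketch: start with 12 features, drop one per step
# until 4 remain, ranked by a logistic regression's coefficients.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=12,
                           n_informative=4, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4, step=1)
rfe.fit(X, y)

print(rfe.support_.sum())   # 4 features kept
```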
Forward Feature Selection.
The forward selection approach is an iterative procedure that begins with no features in the dataset. At each iteration, a feature is introduced to try to improve the model’s performance. Features that increase performance are kept; features that do not improve the results are discarded. The procedure continues until the model’s improvement stalls.
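This iterative procedure maps directly onto scikit-learn’s `SequentialFeatureSelector` with `direction="forward"`; the estimator, dataset, and target size below are illustrative assumptions.

```python
# Forward selection sketch: start from zero features and greedily add
# whichever feature improves cross-validated accuracy the most.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=10,
                           n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward", cv=3)
sfs.fit(X, y)
X_selected = sfs.transform(X)
print(X_selected.shape)  # (150, 3)
```

Switching `direction` to `"backward"` gives the backward elimination strategy from the previous section instead.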
Missing Value Ratio.
Consider receiving a dataset. What comes first? Naturally, you would want to explore the data before building a model. As you explore, you discover that your dataset has missing values. What next? You will look for the cause of these missing values before trying to impute them or removing the affected variables entirely.
What if there are too many missing values — say, more than 50 percent? Should the variable be deleted, or should the missing values be imputed? Since the variable won’t contain much data, we would lean toward dropping it. This isn’t a given, though. In practice, we set a threshold, and if any variable’s proportion of missing data exceeds that threshold, we drop the variable.
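The threshold rule above is a one-liner with pandas; the tiny DataFrame and the 50% threshold below are invented purely for illustration.

```python
# Missing value ratio sketch: drop any column whose fraction of missing
# values exceeds a chosen threshold.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],                # 20% missing
    "income": [np.nan, np.nan, np.nan, 52000, 61000],  # 60% missing
    "score":  [0.7, 0.8, 0.6, np.nan, 0.9],            # 20% missing
})

threshold = 0.5
missing_ratio = df.isna().mean()            # per-column fraction missing
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]

print(list(df_reduced.columns))  # ['age', 'score']
```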
Low Variance Filter.
Like the Missing Value Ratio technique, the Low Variance Filter works with a threshold. In this case, however, it tests data columns: the method calculates the variance of each variable and drops all data columns whose variance falls below the threshold, since features with very low variance carry little information about the target variable.
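scikit-learn ships this exact filter as `VarianceThreshold`; the synthetic columns and the threshold value below are illustrative assumptions.

```python
# Low variance filter sketch: constant and near-constant columns fall
# below the variance threshold and are dropped.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),              # variance ~1      -> kept
    np.full(100, 3.0),                 # variance 0       -> dropped
    rng.normal(scale=0.01, size=100),  # variance ~1e-4   -> dropped
])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (100, 1)
```

Note that variance is scale-dependent, so in practice the threshold only makes sense after the features have been put on comparable scales.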
High Correlation Filter.
This method applies when two variables carry much of the same information, which can degrade the model. We identify pairs of variables with high correlation and use the Variance Inflation Factor (VIF) to decide which to keep; variables with a higher value (VIF > 5) can be removed.
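The VIF of a column can be computed by regressing it on all the other columns: VIF = 1 / (1 - R²). This sketch does that with plain NumPy and scikit-learn on an invented dataset where the first two columns are nearly duplicates.

```python
# High correlation filter sketch: flag columns with VIF > 5.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly duplicates x1
x3 = rng.normal(size=200)                   # independent
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
flagged = [j for j, v in enumerate(vifs) if v > 5]
print(flagged)  # [0, 1] — the two collinear columns
```

In practice you would then drop one column from each flagged pair, not both.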
Decision Trees.
Decision trees are a popular supervised learning algorithm that splits data into homogeneous sets based on input variables. This approach handles problems such as data outliers and missing values, and it helps identify significant variables.
Random Forest.
This method is similar to the decision tree strategy. However, in this case, we generate a large set of trees (hence “forest”) against the target variable, then use each attribute’s usage statistics across the trees to find the most informative subset of features.
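One common way to realize this is with a random forest’s impurity-based feature importances; the scikit-learn estimator, synthetic dataset, and “top 3” cutoff below are illustrative assumptions.

```python
# Random forest sketch: rank features by importance and keep the top few.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Indices of the 3 most important features, by mean impurity decrease
top = np.argsort(forest.feature_importances_)[::-1][:3]
X_reduced = X[:, top]
print(X_reduced.shape)  # (300, 3)
```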
Factor Analysis.
Let's say we have two variables: education and income. Since people with higher education levels also tend to have much higher incomes, there may be a strong correlation between these variables.
The factor analysis technique groups variables by their correlations: all variables in one group will correlate strongly with one another but only weakly with the variables in other groups. Each group is referred to as a factor. These factors are few compared to the data's original dimensions, but they are hard to observe directly.
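This sketch builds data that is genuinely driven by two hidden factors and recovers a two-factor representation with scikit-learn’s `FactorAnalysis`; all numbers here are invented for illustration.

```python
# Factor analysis sketch: six observed variables explained by two latent
# factors plus a little independent noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))            # 2 hidden factors
loadings = rng.normal(size=(2, 6))            # map factors -> 6 variables
X = latent @ loadings + rng.normal(scale=0.1, size=(500, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)
print(X_factors.shape)  # (500, 2)
```

Unlike PCA, factor analysis explicitly models per-variable noise, which is why it suits the “correlated groups of variables” picture described above.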
Dimensionality Reduction Example
Here is an example of dimensionality reduction using the PCA method mentioned earlier. You want to classify a database full of emails into “not spam” and “spam.” To do this, you build a mathematical representation of every email as a bag-of-words vector. Each position in this vector corresponds to a word from a vocabulary. For any single email, each entry in the bag-of-words vector is the number of times the corresponding word appears in the email (with zero meaning it doesn’t appear at all).
Now let’s say you’ve constructed a bag-of-words from each email, giving you a sample of bag-of-words vectors, x1…xm. However, not all your vector’s dimensions (words) are useful for the spam/not spam classification. For instance, words like “credit,” “bargain,” “offer,” and “sale” would be better candidates for spam classification than “sky,” “shoe,” or “fish.” This is where PCA comes in.
You should construct the covariance matrix of your sample — a d-by-d matrix, where d is the number of word dimensions — and compute its eigenvectors and eigenvalues. Then sort the eigenvalues in decreasing order and choose the top p. By applying PCA, you project your vectors onto the eigenvectors corresponding to those top p eigenvalues. Your output data is now a projection of the original data onto p eigenvectors, so the dimension of the projected data has been reduced to p.
After you have computed your bag-of-words vector’s low-dimensional PCA projections, you can use the projection with various classification algorithms to classify the emails instead of using the original emails. Projections are smaller than the original data, so things move along faster.
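The covariance/eigenvector steps above can be sketched with plain NumPy on a tiny made-up bag-of-words matrix (8 “emails” over a 5-word vocabulary; the counts are random and purely illustrative).

```python
# PCA from scratch on a toy bag-of-words matrix, following the steps in
# the text: center, covariance, eigen-decomposition, project onto top p.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(8, 5)).astype(float)  # fake word counts

# Center the data and form the d-by-d covariance matrix (d = vocab size)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)                        # shape (5, 5)

# Eigen-decompose; sort eigenvalues in decreasing order, keep the top p
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
p = 2
top_vecs = eigvecs[:, order[:p]]                      # shape (5, 2)

# Project the emails onto the top-p eigenvectors
X_proj = Xc @ top_vecs
print(X_proj.shape)  # (8, 2)
```

The resulting `X_proj` rows are the low-dimensional projections you would hand to a classifier in place of the original vectors.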
Learn About Artificial Intelligence
There’s a lot to learn about Artificial Intelligence, especially if you want a career in the field. Fortunately, Simplilearn has the resources to help bring you up to speed. The Artificial Intelligence Course, held in collaboration with IBM, features exclusive IBM hackathons, masterclasses, and “ask me anything sessions.” This AI certification training helps you master key concepts such as Data Science with Python, machine learning, deep learning, and NLP. You will become AI job-ready with live sessions, practical labs, and projects.
Simplilearn also has other data science career-related resources, such as data science interview questions to help you brush up on the best answers for that challenging aspect of the application process.
Glassdoor reports that AI engineers in the United States earn an annual average of USD 117,044. According to Payscale, AI engineers in India make a yearly average of ₹1,551,046.
So, if you’re looking for a cutting-edge career that both challenges and rewards you, give the world of Artificial Intelligence a chance. When you do, let Simplilearn be your partner in helping you achieve your new career goals!