Machine learning isn’t an easy thing. Alright, so that’s an understatement. Artificial Intelligence and machine learning represent a major leap forward in getting computers to think like humans, but both concepts are challenging to master. Fortunately, the payoff is worth the effort.
Today we’re tackling the process of dimensionality reduction in machine learning, including techniques such as principal component analysis. We will define it, explain why it’s important, show how to do it, and walk through a relatable example to clarify the concept.
Once you’re done, you will have a solid grasp of dimensionality reduction, something that could come in handy during an interview. You will also know how to answer deep learning interview questions or machine learning interview questions with greater confidence and accuracy.
What is Dimensionality Reduction
Before we give a clear definition of dimensionality reduction, we first need to understand dimensionality. If you have too many input variables, machine learning algorithm performance may degrade. Suppose you use rows and columns, like those commonly found on a spreadsheet, to represent your ML data. In that case, the columns become input variables (also called features) fed to a model predicting the target variable.
Additionally, we can treat the data columns as dimensions of an n-dimensional feature space, while the data rows are points located in that space. This process is known as interpreting a data set geometrically.
Unfortunately, if many dimensions reside in the feature space, that results in a large volume of space. Consequently, the points in the space (the rows of data) may represent only a tiny, non-representative sample. This sparsity can negatively affect machine learning algorithm performance. This condition is known as “the curse of dimensionality.” The bottom line: a data set with a vast number of input features complicates the predictive modeling task, putting performance and accuracy at risk.
Here’s an example to help visualize the problem. Assume you walked in a straight line for 50 yards, and somewhere along that line, you dropped a quarter. You will probably find it fast. But now, let’s say your search area covers a square 50 yards by 50 yards. Now your search will take days! But we’re not done yet. Now, make that search area a cube that’s 50 by 50 by 50 yards. You may want to say “goodbye” to that quarter! The more dimensions involved, the larger the space and the longer the search takes.
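To put numbers on the quarter example, here is a minimal sketch that computes the size of the search region for the same 50-yard extent in one, two, and three dimensions. The function name `search_space_size` is made up for illustration.

```python
# Size of the search region for a 50-yard extent in 1, 2, and 3 dimensions.
SIDE_YARDS = 50

def search_space_size(dimensions: int, side: int = SIDE_YARDS) -> int:
    """Return side ** dimensions: the length, area, or volume to search."""
    return side ** dimensions

sizes = {d: search_space_size(d) for d in (1, 2, 3)}
for d, size in sizes.items():
    print(f"{d} dimension(s): {size:,} units to search")
```

Each added dimension multiplies the region by another factor of 50, while the number of quarters (data points) stays the same.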
How do we lift the curse of dimensionality? By reducing the number of input features, thereby reducing the number of dimensions in the feature space. Hence, “dimensionality reduction.”
To make a long story short, dimensionality reduction means reducing your feature set’s dimension.
Why Dimensionality Reduction is Important
Dimensionality reduction brings many advantages to your machine learning data, including:
- Fewer features mean less complexity
- You will need less storage space because you have less data
- Fewer features require less computation time
- Model accuracy can improve because there is less misleading data
- Algorithms train faster thanks to less data
- Reducing the data set’s feature dimensions helps visualize the data faster
- It removes noise and redundant features
Dimensionality Reduction Methods and Approaches
So now that we’ve established how much dimensionality reduction benefits machine learning, what’s the best method of doing it? We have listed the two principal approaches you can take, feature selection and feature extraction, each subdivided further into specific methods. These approaches and methods are also known as Dimensionality Reduction Algorithms.
Feature selection is a means of selecting the input data set's optimal, relevant features and removing irrelevant features.
- Filter methods. This method filters down the data set into a relevant subset.
- Wrapper methods. This method uses the machine learning model to evaluate the performance of features fed into it. The performance determines whether it’s better to keep or remove the features to improve the model’s accuracy. This method is more accurate than filtering but is also more complex.
- Embedded methods. The embedded process checks the machine learning model’s various training iterations and evaluates each feature’s importance.
Feature extraction transforms the space containing too many dimensions into a space with fewer dimensions. This process is useful for retaining most of the information while using fewer resources during information processing. Here are three of the more common extraction techniques.
- Principal component analysis. PCA is commonly used for dimensionality reduction in continuous data. PCA rotates and projects the data along the directions of greatest variance. The directions with maximum variance are designated the principal components.
- Kernel PCA. This process is a nonlinear extension of PCA that works for more complicated structures that cannot be represented easily or appropriately in a linear subspace. KPCA uses the “kernel trick” to construct nonlinear mappings.
- Linear discriminant analysis. LDA projects data in a way that maximizes class separability. The projection puts examples from the same class close together and places examples from different classes farther apart.
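The class-separating projection in the last bullet can be sketched with Fisher's linear discriminant for two classes. This is a minimal sketch, assuming 2-D points and made-up toy data; the helper names `fisher_direction` and `project` are hypothetical, and a real project would more likely use a library implementation.

```python
from statistics import mean

def fisher_direction(class_a, class_b):
    """Return the 2-D direction w = Sw^-1 (mean_a - mean_b) for two classes."""
    def centroid(points):
        return [mean(p[0] for p in points), mean(p[1] for p in points)]

    def scatter(points, mu):
        # Entries of the 2x2 within-class scatter matrix for one class.
        sxx = sum((p[0] - mu[0]) ** 2 for p in points)
        syy = sum((p[1] - mu[1]) ** 2 for p in points)
        sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
        return sxx, sxy, syy

    mu_a, mu_b = centroid(class_a), centroid(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy  # assumed non-zero for this toy data
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Multiply the inverse of [[sxx, sxy], [sxy, syy]] by the mean difference.
    return [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]

def project(point, w):
    """Reduce a 2-D point to a single number along direction w."""
    return point[0] * w[0] + point[1] * w[1]

# Toy data (made up): two classes of 2-D points.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (2.5, 2.0)]
class_b = [(6.0, 5.0), (7.0, 7.5), (8.0, 6.0), (7.5, 8.0)]

w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
print(proj_a, proj_b)
```

After projection, each point is a single number, and on this toy data the two classes occupy non-overlapping ranges on that one axis.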
Dimensionality Reduction Techniques
Here are some techniques machine learning professionals use.
Principal Component Analysis.
PCA extracts a new set of variables from an existing, more extensive set. The new set is called “principal components.”
Backward Feature Elimination.
This five-step technique finds the optimal number of features for a machine learning algorithm. It starts with a model trained on all features, removes features one at a time, and keeps the smallest feature set whose performance stays within the maximum tolerable error rate.
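The elimination loop can be sketched as follows. This is a minimal sketch under stated assumptions: `score` stands in for "train the model and measure its performance," and the `WEIGHTS` table and `toy_accuracy` function are invented stand-ins so the example runs without a real model.

```python
# Hypothetical per-feature contributions; a real run would train a model instead.
WEIGHTS = {"credit": 0.40, "offer": 0.35, "sky": 0.01, "fish": 0.0}

def toy_accuracy(subset):
    """Stand-in for model performance on a feature subset (assumption)."""
    return sum(WEIGHTS[f] for f in subset)

def backward_elimination(features, score, max_drop):
    """Greedily drop features while the score stays within max_drop of the baseline."""
    selected = list(features)
    baseline = score(selected)
    while len(selected) > 1:
        # Try removing each feature in turn and keep the least harmful removal.
        candidates = [[f for f in selected if f != removed] for removed in selected]
        best = max(candidates, key=score)
        if baseline - score(best) > max_drop:
            break  # any further removal costs more than we tolerate
        selected = best
    return selected

kept = backward_elimination(WEIGHTS, toy_accuracy, max_drop=0.02)
print(kept)
```

With the toy numbers above, the loop discards "fish" and "sky" and stops before removing either of the two informative words.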
Forward Feature Selection.
This technique follows the inverse of the backward feature elimination process. Thus, we don't eliminate the feature. Instead, we find the best features that produce the highest increase in the model’s performance.
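Here is a minimal sketch of the forward direction. As an assumption for illustration, `toy_score` replaces real model training: each word belongs to an invented "signal group," and only the strongest word in a group adds to the score, so redundant words bring no gain.

```python
# Hypothetical signal groups and strengths; not taken from any real data set.
SIGNALS = {
    "credit": ("money", 0.40),
    "offer": ("promo", 0.35),
    "bargain": ("promo", 0.30),  # redundant with "offer" in this toy setup
    "sky": ("none", 0.0),
}

def toy_score(subset):
    """Stand-in for model performance: best weight per signal group, summed."""
    best_per_group = {}
    for feature in subset:
        group, weight = SIGNALS[feature]
        best_per_group[group] = max(best_per_group.get(group, 0.0), weight)
    return sum(best_per_group.values())

def forward_selection(features, score, min_gain):
    """Add the feature with the largest score increase until gains fall below min_gain."""
    selected, remaining = [], list(features)
    while remaining:
        current = score(selected)
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) - current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_selection(SIGNALS, toy_score, min_gain=0.01)
print(chosen)
```

The loop picks "credit," then "offer," and stops because neither "bargain" nor "sky" improves the score any further.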
Missing Value Ratio.
This technique sets a threshold level for missing values. If the proportion of missing values in a variable exceeds the threshold, the variable is dropped.
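A minimal sketch of this filter, assuming the data is a dictionary of columns and that missing entries are stored as `None`. The column names and the 0.5 threshold are made up for illustration.

```python
def drop_sparse_columns(table, threshold):
    """Keep only columns whose share of missing (None) values is within the threshold."""
    kept = {}
    for name, values in table.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio <= threshold:
            kept[name] = values
    return kept

table = {
    "age": [34, None, 29, 41],               # 25% missing
    "income": [None, None, None, 52000],     # 75% missing
    "city": ["Oslo", "Lima", "Pune", "Rome"],  # 0% missing
}

reduced = drop_sparse_columns(table, threshold=0.5)
print(list(reduced))
```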
Low Variance Filter.
Like the Missing Value Ratio technique, the Low Variance Filter works with a threshold. In this case, however, the method calculates the variance of each data column. All data columns with variances falling below the threshold are dropped, since a feature that barely changes carries little information about the target variable.
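A minimal sketch of the filter using only the standard library. The column names and the 0.01 threshold are invented for illustration. Note that variance depends on each column's units, so in practice columns are usually put on a comparable scale before a single threshold is applied.

```python
from statistics import pvariance

def drop_low_variance(table, threshold):
    """Keep only columns whose population variance is at or above the threshold."""
    return {
        name: values
        for name, values in table.items()
        if pvariance(values) >= threshold
    }

table = {
    "constant": [1, 1, 1, 1],                  # variance 0
    "nearly_constant": [1.0, 1.0, 1.0, 1.1],   # variance about 0.0019
    "spread": [0.0, 2.0, 4.0, 6.0],            # variance 5
}

reduced = drop_low_variance(table, threshold=0.01)
print(list(reduced))
```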
High Correlation Filter.
This method applies when two variables carry nearly the same information, which can degrade the model. In this method, we identify the variables with high correlation and use the Variance Inflation Factor (VIF) to choose one. As a rule of thumb, you can remove variables with a high value (VIF > 5).
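Here is a minimal sketch for a single pair of variables. In general, the VIF of a variable comes from regressing it on all the other variables; when there is only one other variable, that reduces to 1 / (1 - r²), where r is the Pearson correlation, and that special case is what this sketch computes. The data and the helper names `pearson` and `pairwise_vif` are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def pairwise_vif(x, y):
    """VIF of x when y is the only other predictor: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

# Toy columns: the same height in two units (rounded), plus an unrelated count.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
weekly_emails = [12, 3, 9, 15, 4]

print(pairwise_vif(height_cm, height_in) > 5)      # redundant pair
print(pairwise_vif(height_cm, weekly_emails) > 5)  # unrelated pair
```

On this toy data, the two height columns far exceed the VIF > 5 cutoff, so one of them can be dropped, while the unrelated column stays close to 1.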
Decision Trees.
Decision trees are a popular supervised learning algorithm that splits data into homogeneous sets based on input variables. This approach copes with outliers and missing values, and it helps identify significant variables.
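One way a tree identifies significant variables is by measuring how much a split on each variable purifies the resulting sets. The following is a minimal sketch, assuming binary features, binary labels, and Gini impurity as the split criterion (the criterion is an assumption, not something stated above). The feature names and data are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly homogeneous)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_gain(feature, labels):
    """Impurity reduction from splitting the rows on a 0/1 feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

is_spam = [1, 1, 1, 0, 0, 0]
contains_offer = [1, 1, 1, 0, 0, 0]  # lines up perfectly with the label
contains_sky = [1, 0, 1, 0, 1, 0]    # only weakly related to the label

gain_offer = gini_gain(contains_offer, is_spam)
gain_sky = gini_gain(contains_sky, is_spam)
print(gain_offer, gain_sky)
```

Features with a consistently low gain are candidates for removal.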
Random Forest.
This method is like the decision tree strategy. However, in this case, we generate a large set of trees (hence "forest") against the target variable. Then we find feature subsets with the help of each attribute’s usage statistics.
Factor Analysis.
This method places highly correlated variables into their own group, with each group representing a single underlying factor or construct.
Dimensionality Reduction Example
Here is an example of dimensionality reduction using the PCA method mentioned earlier. You want to classify a database full of emails into “not spam” and “spam.” To do this, you build a mathematical representation of every email as a bag-of-words vector. Each position in this vector corresponds to a word from a fixed vocabulary. For any single email, each entry in the bag-of-words vector is the number of times the corresponding word appears in the email (with a zero meaning it doesn’t appear at all).
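A minimal sketch of building one such vector. The five-word vocabulary, the sample email, and the simple letters-only tokenizer are all assumptions for illustration; a real vocabulary would hold thousands of words.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["credit", "offer", "sale", "sky", "fish"]
email = "Special offer: credit sale today, the biggest sale offer"

vector = bag_of_words(email, vocabulary)
print(vector)  # [1, 2, 2, 0, 0]
```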
Now let’s say you’ve constructed a bag-of-words from each email, giving you a sample of bag-of-words vectors, x1…xm. However, not all your vector’s dimensions (words) are useful for the spam/not spam classification. For instance, words like “credit,” “bargain,” “offer,” and “sale” would be better candidates for spam classification than “sky,” “shoe,” or “fish.” This is where PCA comes in.
Construct an n-by-n covariance matrix from your sample, where n is the number of words in the vocabulary, and compute its eigenvectors and eigenvalues. Then sort the eigenvalues in decreasing order and choose the top p. Applying PCA to your vector sample means projecting each vector onto the eigenvectors corresponding to the top p eigenvalues. Your output data is now a projection of the original data onto p eigenvectors. Thus, the projected data dimension has been reduced to p.
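The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the six count vectors over a four-word vocabulary are made up, and power iteration with deflation is swapped in for a proper eigen-solver so the example needs no third-party library. It returns the top eigenpairs in decreasing order directly, which stands in for the sort-and-select step. In practice you would more likely reach for something like `numpy.linalg.eigh` or scikit-learn's `PCA`.

```python
import math
import random

def covariance_matrix(samples):
    """Return the covariance matrix (one row and column per word) and the centered data."""
    m, n = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    return cov, centered

def top_eigenpairs(matrix, p, iterations=500, seed=0):
    """Top-p (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(p):
        vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate: subtract the component just found so the next pass finds the next one.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return pairs

# Toy counts for the words [credit, offer, sky, fish]: three spam emails, then three not spam.
samples = [
    [3, 2, 0, 0], [2, 3, 0, 1], [4, 3, 1, 0],
    [0, 0, 2, 3], [0, 1, 3, 2], [1, 0, 2, 2],
]

p = 2
cov, centered = covariance_matrix(samples)
pairs = top_eigenpairs(cov, p)
eigenvalues = [value for value, _vec in pairs]
# Project every centered email onto the top-p eigenvectors: 4 dimensions become 2.
projections = [[sum(a * b for a, b in zip(row, vec)) for _value, vec in pairs]
               for row in centered]
print(eigenvalues)
print(projections)
```

On this toy data, the first projected coordinate alone has one sign for the spam emails and the opposite sign for the others, which is the kind of structure a downstream classifier can exploit.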
After you have computed the low-dimensional PCA projections of your bag-of-words vectors, you can feed the projections to various classification algorithms to classify the emails instead of using the original vectors. Projections are smaller than the original data, so things move along faster.
Learn About Artificial Intelligence
There’s a lot to learn about Artificial Intelligence, especially if you want a career in the field. Fortunately, Simplilearn has the resources to help bring you up to speed. The Artificial Intelligence Course, held in collaboration with IBM, features exclusive IBM hackathons, masterclasses, and “ask me anything sessions.” This AI certification training helps you master key concepts such as Data Science with Python, machine learning, deep learning, and NLP. You will become AI job-ready with live sessions, practical labs, and projects.
Simplilearn also has other data science career-related resources, such as data science interview questions to help you brush up on the best answers for that challenging aspect of the application process.
So, if you’re looking for a cutting-edge career that both challenges and rewards you, give the world of Artificial Intelligence a chance. When you do, let Simplilearn be your partner in helping you achieve your new career goals!