Machine learning models trained on high-dimensional data often overfit, which reduces their ability to generalize beyond the training examples. It is therefore often useful to apply a dimensionality reduction technique before building a model. In this article, we’ll learn about PCA in Machine Learning with a use case demonstration in Python.

## What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a popular unsupervised learning technique for reducing the dimensionality of large data sets. It increases interpretability while minimizing information loss. It helps to find the most significant directions of variation in a dataset and makes the data easy to plot in 2D and 3D. PCA does this by finding a sequence of linear combinations of the variables, each capturing as much of the remaining variance as possible.

In the above figure, we have several points plotted on a 2-D plane. There are two principal components. PC1 is the primary principal component that explains the maximum variance in the data. PC2 is another principal component that is orthogonal to PC1.

## What is a Principal Component?

A principal component is a straight line (a direction in the feature space) that captures as much of the variance of the data as possible. Each principal component has a direction and a magnitude. Projecting the data orthogonally (perpendicularly) onto the principal components maps it into a lower-dimensional space.

Now that you have understood the basics of PCA, let’s look at the next topic on PCA in Machine Learning.

## Dimensionality

The term "dimensionality" describes the number of features or variables in a dataset. With high-dimensional data, that is, datasets with many variables, it can be difficult to visualize and interpret the relationships between variables. Dimensionality reduction methods like PCA reduce the number of variables in the dataset while preserving the most important information. To accomplish this, PCA converts the original variables into a new set of variables called principal components, which are linear combinations of the original variables. The dataset's reduced dimensionality depends on how many principal components are kept. The objective of PCA is to select a small number of principal components that account for most of the variation in the data. By reducing the dimensionality of the dataset, PCA can help to streamline data analysis, enhance visualization, and make it simpler to spot trends and relationships between variables.

The mathematical representation of dimensionality reduction in the context of PCA is as follows:

Given a dataset with n observations and p variables represented by the n x p data matrix X, the goal of PCA is to transform the original variables into a new set of k variables called principal components that capture the most significant variation in the data. The principal components are defined as linear combinations of the original variables given by:

PC_1 = a_11 * x_1 + a_12 * x_2 + ... + a_1p * x_p

PC_2 = a_21 * x_1 + a_22 * x_2 + ... + a_2p * x_p

...

PC_k = a_k1 * x_1 + a_k2 * x_2 + ... + a_kp * x_p

where a_ij is the loading or weight of variable x_j on principal component PC_i, and x_j is the jth variable in the data matrix X. The principal components are ordered such that the first component PC_1 captures the most significant variation in the data, the second component PC_2 captures the second most significant variation, and so on. The number of principal components used in the analysis, k, determines the reduced dimensionality of the dataset.
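
To make the linear-combination formula concrete, here is a minimal sketch that computes the principal component scores for a single observation. The loading values and the observation are made up for illustration (they are not derived from any dataset), and the helper name `principal_component_scores` is our own.

```python
# Hypothetical loadings for k = 2 principal components over p = 2 variables.
loadings = [
    [0.6, 0.8],   # a_11, a_12 -> weights used for PC_1
    [-0.8, 0.6],  # a_21, a_22 -> weights used for PC_2
]

# One made-up observation (x_1, x_2).
x = [2.0, 1.0]


def principal_component_scores(loadings, x):
    """Return [PC_1, ..., PC_k], where PC_i = a_i1 * x_1 + ... + a_ip * x_p."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in loadings]


scores = principal_component_scores(loadings, x)
print(scores)  # approximately [2.0, -1.0]
```

Each score is simply a weighted sum of the original variables, which is all that "linear combination" means here.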

## Correlation

Correlation is a statistical measure that expresses the direction and strength of the linear relationship between two variables. In PCA, the relationships between all pairs of variables in the dataset are summarized in the covariance matrix, a square matrix whose diagonal elements are the variances of the individual variables and whose off-diagonal elements are the covariances between pairs of variables. The correlation coefficient is a standardized version of the covariance that ranges from -1 to 1, which makes the strength and direction of the linear relationship comparable across variables measured in different units.

A correlation coefficient of 0 denotes no linear relationship between the two variables, while correlation coefficients of 1 and -1 denote perfect positive and perfect negative correlation, respectively. The principal components in PCA are linear combinations of the original variables that maximize the variance they explain. When the variables are standardized, the covariance matrix equals the correlation matrix, so the principal components are calculated from the correlation matrix.

In the framework of PCA, correlation is mathematically represented as follows:

Given a dataset with p variables (x_1, x_2, ..., x_p), the correlation matrix C is a p x p symmetric matrix with the following elements:

C_ij = cov(x_i, x_j) / (sd(x_i) * sd(x_j))

where sd(x_i) is the standard deviation of variable x_i, sd(x_j) is the standard deviation of variable x_j, and cov(x_i, x_j) is the covariance between variables x_i and x_j.

If X is the n x p data matrix after each variable has been standardized, the correlation matrix C can also be written as follows in matrix notation:

C = X^T X / (n - 1)
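
As a quick check, the sketch below computes the correlation matrix in two ways with NumPy: element by element (covariance divided by the product of the standard deviations) and in matrix form from standardized data. The small dataset is made up for illustration, and the sample convention (dividing by n - 1) is an assumption of this example.

```python
import numpy as np

# Made-up dataset: n = 5 observations (rows), p = 3 variables (columns).
X = np.array([
    [2.0, 4.1, 10.0],
    [3.0, 6.2, 8.0],
    [4.0, 7.9, 7.5],
    [5.0, 10.1, 5.0],
    [6.0, 12.0, 4.0],
])
n = X.shape[0]

# Element-wise definition: covariance divided by the product of standard deviations.
cov = np.cov(X, rowvar=False)      # sample covariance (divides by n - 1)
sd = X.std(axis=0, ddof=1)         # sample standard deviations
corr_elementwise = cov / np.outer(sd, sd)

# Matrix form: standardize each column, then compute Z^T Z / (n - 1).
Z = (X - X.mean(axis=0)) / sd
corr_matrix_form = Z.T @ Z / (n - 1)

# Both routes should agree with NumPy's built-in correlation function.
print(np.allclose(corr_elementwise, corr_matrix_form))
print(np.allclose(corr_elementwise, np.corrcoef(X, rowvar=False)))
```

The diagonal of the result is all ones, because every variable is perfectly correlated with itself.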

## Orthogonal

In the context of the PCA algorithm, the term "orthogonality" refers to the fact that the principal components are constructed to be orthogonal to one another. This means that the principal components are uncorrelated with one another and carry no redundant information.

Orthogonality in PCA is mathematically expressed as follows: each principal component is built to maximize the variance explained by it while adhering to the requirement that it be orthogonal to all other principal components. The principal components are computed as linear combinations of the original variables. Thus, each principal component is guaranteed to capture a unique and non-redundant part of the variation in the data.

The orthogonality constraint is expressed as:

a_i1 * a_j1 + a_i2 * a_j2 + ... + a_ip * a_jp = 0

for all i and j such that i ≠ j. This means that the dot product between any two loading vectors for different principal components is zero, indicating that the principal components are orthogonal to each other.
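
The constraint is easy to verify numerically. The following sketch checks it for two loading vectors; the vectors are hypothetical values picked for illustration, and `dot` is a small helper of our own.

```python
import math


def dot(u, v):
    """Dot product a_i1 * a_j1 + ... + a_ip * a_jp of two loading vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical loading vectors for PC_1 and PC_2 (p = 2 variables).
a_1 = [0.6, 0.8]
a_2 = [-0.8, 0.6]

# Orthogonality: the dot product of different loading vectors is zero.
print(dot(a_1, a_2))

# Loading vectors are usually also scaled to unit length.
print(math.sqrt(dot(a_1, a_1)))
```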

## Eigen Vectors

The principal components of the data are calculated using eigenvectors. The eigenvectors of the data's covariance matrix represent the directions in which the data varies the most. These eigenvectors then define the new coordinate system in which the data is represented.

Mathematically, the eigenvectors of the covariance matrix C are denoted v_1, v_2, ..., v_p, and the associated eigenvalues are denoted λ_1, λ_2, ..., λ_p. The eigenvectors are calculated in such a way that the equation shown below holds:

C v_i = λ_i v_i

This means that multiplying the eigenvector v_i by the covariance matrix C returns the same vector scaled by the associated eigenvalue λ_i: the direction of v_i is unchanged, and only its length changes.
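
The relationship can be checked with NumPy, as in the minimal sketch below. The 2 x 2 matrix is a made-up covariance matrix, and `np.linalg.eigh` is used because it is designed for symmetric matrices.

```python
import numpy as np

# A made-up covariance matrix (symmetric).
C = np.array([
    [2.0, 1.0],
    [1.0, 2.0],
])

# eigh returns the eigenvalues in ascending order and the matching
# unit-length eigenvectors as the columns of `eigenvectors`.
eigenvalues, eigenvectors = np.linalg.eigh(C)

for i in range(len(eigenvalues)):
    v_i = eigenvectors[:, i]
    lam_i = eigenvalues[i]
    # C v_i and lambda_i * v_i agree up to floating-point error.
    print(np.allclose(C @ v_i, lam_i * v_i))  # prints True for each pair
```

For this matrix, the eigenvalues are 1 and 3, so the direction associated with the eigenvalue 3 would be the first principal component.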

## Covariance Matrix

The covariance matrix is central to how the PCA algorithm computes the principal components of the data. It is a p x p matrix that measures the pairwise covariances between the variables in the data.

Given a data matrix X of n observations of p variables, with each variable centered to have a mean of 0, the covariance matrix C is defined as follows:

C = (1/(n-1)) * X^T X

where X^T represents the transpose of X. Dividing by n - 1 gives the sample covariance; dividing by n instead gives the population form. The covariances between the variables are represented by the off-diagonal elements of C, whereas the variances of the variables are represented by the diagonal elements of C.
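
Here is a small sketch of that computation with NumPy. The data values are made up, and the example assumes the sample convention of dividing by n - 1 (passing `ddof=0` to `np.cov` would give the divide-by-n form instead).

```python
import numpy as np

# Made-up data matrix: n = 4 observations, p = 2 variables.
X = np.array([
    [2.0, 4.0],
    [4.0, 3.0],
    [6.0, 8.0],
    [8.0, 9.0],
])
n = X.shape[0]

# Center each column so that every variable has a mean of 0.
X_centered = X - X.mean(axis=0)

# Covariance matrix from the matrix product of the centered data.
C = X_centered.T @ X_centered / (n - 1)
print(C)

# Diagonal elements are variances; off-diagonal elements are covariances.
print(np.allclose(C, np.cov(X, rowvar=False)))
```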

## Steps for PCA Algorithm

- Standardize the data: PCA is sensitive to the scale of the variables, so the first step is to standardize the data to ensure that all variables have a mean of 0 and a standard deviation of 1.
- Calculate the covariance matrix: The next step is to calculate the covariance matrix of the standardized data. This matrix shows how each variable is related to every other variable in the dataset.
- Calculate the eigenvectors and eigenvalues: The eigenvectors and eigenvalues of the covariance matrix are then calculated. The eigenvectors represent the directions in which the data varies the most, while the eigenvalues represent the amount of variation along each eigenvector.
- Choose the principal components: The principal components are the eigenvectors with the highest eigenvalues. These components represent the directions in which the data varies the most and are used to transform the original data into a lower-dimensional space.
- Transform the data: The final step is to transform the original data into the lower-dimensional space defined by the principal components.
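
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `pca`, the synthetic data, and the choice of sample statistics (`ddof=1`) are all assumptions of this example, and it assumes that no column is constant (which would cause a division by zero when standardizing).

```python
import numpy as np


def pca(X, k):
    """Project X (n observations x p variables) onto its first k principal components."""
    # Step 1: standardize each variable to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized data (p x p).
    C = np.cov(Z, rowvar=False)

    # Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Step 4: sort by eigenvalue, largest first, and keep the top k eigenvectors.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:k]]

    # Step 5: transform the standardized data into the k-dimensional space.
    scores = Z @ components
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return scores, components, explained_ratio


# Synthetic data: 50 observations of 4 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 2.0 * X[:, 0]

scores, components, explained_ratio = pca(X, k=2)
print(scores.shape)  # (50, 2)
print(explained_ratio)
```

In practice, a library implementation such as scikit-learn's `PCA` is usually preferable; the sketch is only meant to mirror the steps listed above.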

## Applications of PCA in Machine Learning

- PCA is used to visualize multidimensional data.
- It is used to reduce the number of dimensions in healthcare data.
- PCA can help compress an image by keeping only its most important components.
- It can be used in finance to analyze stock data and forecast returns.
- PCA helps to find patterns in the high-dimensional datasets.

## Advantages of PCA

In terms of data analysis, PCA has a number of benefits, including:

- Dimensionality reduction: By determining the most crucial features or components, PCA reduces the dimensionality of the data, which is one of its primary benefits. This can be helpful when the initial data contains a lot of variables and is therefore challenging to visualize or analyze.
- Feature Extraction: PCA can also be used to derive new features or elements from the original data that might be more insightful or understandable than the original features. This is particularly helpful when the initial features are correlated or noisy.
- Data visualization: By projecting the data onto the first few principal components, PCA can be used to visualize high-dimensional data in two or three dimensions. This can aid in locating data patterns or clusters that may not have been visible in the initial high-dimensional space.
- Noise Reduction: By locating the underlying signal or pattern in the data, PCA can also be used to lessen the impacts of noise or measurement errors in the data.
- Multicollinearity: When two or more variables are strongly correlated, there is multicollinearity in the data, which PCA can handle. PCA can lessen the impacts of multicollinearity on the analysis by identifying the most crucial features or components.

## Disadvantages of PCA

- Interpretability: Although principal component analysis (PCA) is effective at reducing the dimensionality of data and spotting patterns, the resulting principal components are not always simple to understand or describe in terms of the original features.
- Information loss: PCA involves choosing a subset of the most crucial features or components in order to reduce the dimensionality of the data. While this can be helpful for streamlining the data and lowering noise, if crucial features are not included in the components chosen, information loss may also result.
- Outliers: PCA is sensitive to outliers in the data, which can significantly affect the resulting principal components. Outliers can distort the covariance matrix, which makes it harder to identify the most important directions of variation.
- Scaling: PCA assumes that the data is centered and scaled, which can be a drawback in some circumstances. If the data is not scaled properly, the resulting principal components might not correctly reflect the underlying patterns in the data.
- Computational complexity: For big datasets, it may be costly to compute the eigenvectors and eigenvalues of the covariance matrix. This may restrict PCA's ability to scale and make it impractical for some uses.
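
The scaling issue in particular is easy to demonstrate. In the sketch below, two independent synthetic features live on very different scales; the feature names, their distributions, and the helper `first_component` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two independent made-up features on very different scales.
height_m = rng.normal(1.7, 0.1, n)        # metres, standard deviation about 0.1
income = rng.normal(50_000, 10_000, n)    # currency units, standard deviation about 10,000
X = np.column_stack([height_m, income])


def first_component(data):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    C = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    return eigenvectors[:, -1]  # eigh sorts eigenvalues in ascending order


# Without scaling, the large-variance feature dominates the first component.
pc1_raw = first_component(X)
print(np.abs(pc1_raw))

# After standardizing, both features carry equal weight.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc1_scaled = first_component(Z)
print(np.abs(pc1_scaled))
```

On the raw data, the first component points almost entirely along `income` simply because of its units, not because it is more informative.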

## Uses of PCA

PCA is a widely used technique in data analysis and has a variety of applications, including:

- Data compression: PCA can be used to reduce the dimensionality of high-dimensional datasets, making them easier to store and analyze.
- Feature extraction: PCA can be used to identify the most important features in a dataset, which can be used to build predictive models.
- Visualization: PCA can be used to visualize high-dimensional data in two or three dimensions, making it easier to understand and interpret.
- Data pre-processing: PCA can be used as a pre-processing step for other machine learning algorithms, such as clustering and classification.
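
As an illustration of the data compression use case, the sketch below stores each observation with 2 numbers instead of 5 and then reconstructs an approximation of the original data. The synthetic data (two hidden factors plus a little noise) and all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 100 observations of 5 variables driven by 2 hidden factors plus noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# Center the data and find the principal directions.
mean = X.mean(axis=0)
X_centered = X - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:2]]          # 5 x 2 matrix of the top two directions

# Compress: each observation is now described by 2 numbers instead of 5.
compressed = X_centered @ W

# Reconstruct an approximation of the original data from the compressed form.
reconstructed = compressed @ W.T + mean

relative_error = np.linalg.norm(X - reconstructed) / np.linalg.norm(X_centered)
print(compressed.shape)  # (100, 2)
print(relative_error)
```

The reconstruction error corresponds to the variance along the discarded components, so it stays small when the leading components explain most of the variation.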

## How Does Principal Component Analysis Work?

### 1. Normalize the Data

Standardize the data before performing PCA. This will ensure that each feature has a mean = 0 and variance = 1.

### 2. Build the Covariance Matrix

Construct a square matrix that expresses the covariance between every pair of features in a multidimensional dataset. For standardized data, this is the same as the correlation matrix.

### 3. Find the Eigenvectors and Eigenvalues

Calculate the eigenvectors (unit vectors) and eigenvalues of the covariance matrix. An eigenvalue is the scalar by which its eigenvector is scaled when that eigenvector is multiplied by the covariance matrix.

### 4. Sort the Eigenvectors in Highest to Lowest Order and Select the Number of Principal Components
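
One common way to select the number of components (an assumption of this sketch, not the only option) is to keep the smallest number whose cumulative share of the total variance reaches a chosen threshold. The eigenvalues, the 90% threshold, and the helper name `choose_num_components` below are hypothetical.

```python
def choose_num_components(eigenvalues, threshold):
    """Return the smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    sorted_values = sorted(eigenvalues, reverse=True)  # highest to lowest
    total = sum(sorted_values)
    cumulative = 0.0
    for k, value in enumerate(sorted_values, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k
    return len(sorted_values)


# Hypothetical eigenvalues, in no particular order.
eigenvalues = [0.4, 4.0, 1.6, 2.0]

# Shares after sorting are 50%, 25%, 20%, and 5%, so three components pass 90%.
k = choose_num_components(eigenvalues, threshold=0.90)
print(k)  # 3
```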

Now that you have understood how PCA in Machine Learning works, let’s perform a hands-on demo on PCA with Python.

## PCA Demo: Classify the Type of Wine

### 1. Import the Necessary Libraries

### 2. Load the Wine Dataset and Display the First Five Rows

### 3. Display the Summary Statistics for Independent Variables

### 4. Boxplot to Check the Output Labels

From the above box plots, you can see that some features classify the wine labels clearly, such as Alkalinity, Total Phenols, or Flavonoids.

### 5. Class Separation of Wine Using 2 Features

### 6. Plot the Correlation Matrix

### 7. Normalize the Data for PCA

### 8. Describe the Scaled Data

### 9. Import the PCA Module and Plot the Variance Ratio

From the above graph, we’ll consider the first two principal components as they together explain nearly 56% of the variance.

### 10. Transform the Scaled Data and Put It in a Dataframe

### 11. Visualize the Wine Classes Using the First Two Principal Components

By applying PCA to the wine dataset, you can transform the data so that most of the variation in the variables is captured by a small number of principal components. It is easier to distinguish the wine classes by inspecting these principal components than by looking at the raw data.
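
The core of the demo can be sketched with scikit-learn as follows. This is not the original listing: it assumes the demo's data matches the classic wine dataset that ships with scikit-learn (`load_wine`), and it leaves out the plotting and DataFrame steps to stay short.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the wine data: 178 samples, 13 numeric features, 3 wine classes.
wine = load_wine()
X, y = wine.data, wine.target

# Standardize the features so that each has mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA and keep the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                          # (178, 2)
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # roughly 0.55

# Average position of each wine class along the first two components.
for label in np.unique(y):
    print(wine.target_names[label], X_pca[y == label].mean(axis=0))
```

Plotting the two columns of `X_pca` colored by `y` would give the class visualization described in step 11.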

## Conclusion

Principal component analysis is a widely used unsupervised learning method for dimensionality reduction. In this article, you looked at what PCA is, its applications, and how it works. Finally, we performed a hands-on demonstration on classifying wine type by using the first two principal components.

Do you have any questions related to this article on PCA in Machine Learning? If yes, then please feel free to put them in the comments section. Our team will be happy to answer your queries.
