Covariance and correlation are two important concepts in the mathematics behind data science and machine learning. One of the most commonly asked data science interview questions is the difference between the two and how to decide when to use each. The definitions and formulas below will help you fully understand covariance vs correlation. In this article, we will cover the following topics:
- What is a covariance matrix?
- What is a correlation matrix?
- Methods of calculating correlation
- Applications of correlation matrix
- Applications of covariance matrix
- Difference between covariance and correlation
- Similarities between covariance and correlation
What Is a Covariance Matrix?
A covariance matrix is used to study the direction of the linear relationship between variables. Suppose we have two variables X and Y; the covariance between them is written cov(X, Y). If E(X) and E(Y) are the expected values of the variables, the covariance can be represented as:

cov(X, Y) = E[(X − E(X))(Y − E(Y))]

For a sample of N paired observations, this is estimated as:

cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (N − 1)
Here are some plots that illustrate what the covariance between two variables looks like in different directions.
Fig: Covariance values and their graphs
The covariance of two variables can lie anywhere between -∞ and +∞. A negative value indicates a negative relationship, whereas a positive value indicates a positive relationship between the variables. A covariance of zero indicates that there is no linear relationship between the variables.
When the unit of measurement is changed for one or both of the two variables, the covariance value changes. However, the strength of the relationship itself does not change.
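As a quick illustration of the covariance formula above, here is a minimal sketch using NumPy (the toy data and variable names are invented for this example):

```python
import numpy as np

# Two toy samples that tend to move together
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

# Manual sample covariance: sum of products of deviations, divided by N - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the full covariance matrix; entry [0, 1] is cov(X, Y)
cov_np = np.cov(x, y)[0, 1]

print(cov_manual, cov_np)  # both positive: X and Y move in the same direction
```

Note that rescaling a variable (say, converting y to different units with `y * 100`) multiplies the covariance by the same factor, which is exactly the scale sensitivity described above.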
To better understand the difference between covariance and correlation, let us first understand what a correlation matrix is.
What Is a Correlation Matrix?
A correlation matrix is used to study the strength of the relationship between two variables. It shows not only the direction of the relationship but also how strong it is. The correlation formula can be represented as:

corr(X, Y) = cov(X, Y) / (σ(X) × σ(Y))

where:

σ(X) = standard deviation of X
σ(Y) = standard deviation of Y
When the two variables move in the same direction, they are positively correlated. On the contrary, when the variables move in the opposite direction, they are negatively correlated.
Fig: Positive relationship
Fig: Negative relationship
The correlation value of two variables ranges from -1 to +1. A value close to +1 indicates a strong positive correlation, and a value close to -1 indicates a strong negative correlation.
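The -1 to +1 bounds are easy to see on perfectly linear toy data (a sketch with NumPy; the arrays are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A perfect increasing linear relationship gives correlation exactly +1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# A perfect decreasing linear relationship gives correlation exactly -1
r_neg = np.corrcoef(x, -3 * x)[0, 1]

print(r_pos, r_neg)
```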
Correlation may occur in three forms:
- Simple correlation: a single number that describes the degree of the relationship between two variables.
- Partial correlation: the correlation between two variables after the effect of one or more other variables is removed.
- Multiple correlation: a statistical technique that predicts the value of one variable based on two or more other variables.
Next in our learning of the covariance vs correlation differences, let us learn the method of calculating correlation.
Methods of Calculating Correlation
There are a number of methods to calculate the correlation coefficient. Here are some of the most common:
Coefficient of Correlation
This is the most common method of determining the correlation coefficient of two variables, often called Pearson's correlation coefficient. It is obtained by dividing the covariance of the two variables by the product of their standard deviations.
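That "covariance divided by the product of standard deviations" definition can be checked directly against NumPy's built-in (a sketch; the sample arrays are invented):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Pearson's r: covariance over the product of standard deviations.
# ddof=1 gives the sample versions; the N-1 factors cancel in the ratio.
r_manual = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's direct computation for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```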
Rank Correlation Coefficient
A rank correlation coefficient (such as Spearman's) measures the degree of similarity between the rankings of two variables and can be used to assess the significance of the relationship between them. It measures the extent to which the relationship is monotonic: as one variable increases, does the other consistently increase (or consistently decrease)? Spearman's coefficient can be written as:

ρ = 1 − (6 × ΣD²) / (N(N² − 1))

where:

ρ = rank correlation coefficient
D = difference between paired ranks
N = number of items ranked
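A minimal sketch of Spearman's formula in NumPy (the data are invented and have no tied values, which the simple formula assumes):

```python
import numpy as np

x = np.array([35, 23, 47, 17, 10, 43, 9, 6, 28])
y = np.array([30, 33, 45, 23, 8, 49, 12, 4, 31])

# Rank each series (1 = smallest); argsort of argsort yields 0-based ranks
rank_x = x.argsort().argsort() + 1
rank_y = y.argsort().argsort() + 1

d = rank_x - rank_y          # differences between paired ranks
n = len(x)                   # number of items ranked

# Spearman's formula (valid when there are no tied ranks)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho)
```

With no ties, this is identical to Pearson's coefficient computed on the ranks, which is a handy way to sanity-check an implementation.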
Coefficient of Concurrent Deviations
The coefficient of concurrent deviations is used when you want to study correlation in a very casual manner and there is not much need for precision. It considers only the direction of change (up or down) from one observation to the next:

rc = ±√(±(2c − n) / n)

where:

rc = coefficient of concurrent deviations
c = number of concurrent deviations (pairs of deviations with the same sign)
n = number of pairs of deviations

The sign inside and outside the square root is chosen to match the sign of (2c − n), so that the quantity under the root is non-negative.
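A rough sketch of the method in plain Python (the series are invented, and equal consecutive values are not handled, to keep the example short):

```python
import math

x = [100, 105, 102, 110, 115, 112, 120]
y = [50, 54, 53, 60, 66, 64, 70]

# Deviations: +1 if the series rose from the previous value, -1 if it fell
dev_x = [1 if b > a else -1 for a, b in zip(x, x[1:])]
dev_y = [1 if b > a else -1 for a, b in zip(y, y[1:])]

n = len(dev_x)                                      # pairs of deviations
c = sum(dx == dy for dx, dy in zip(dev_x, dev_y))   # concurrent deviations

# r_c = +/- sqrt(+/- (2c - n) / n); both signs follow the sign of (2c - n)
val = (2 * c - n) / n
r_c = math.copysign(math.sqrt(abs(val)), val)
print(r_c)
```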
We will continue our learning of the covariance vs correlation differences with these applications of the correlation matrix.
Applications of Correlation Matrix
There are three main applications of a correlation matrix:
To Summarize Large Amounts of Data
When there are large amounts of data, the goal is to see patterns. A correlation matrix summarizes the data so you can spot such patterns and see which variables are highly correlated with each other.
To Input Into Other Analyses
Another common application of a correlation matrix is to use it as an input for other analyses, such as exploratory factor analysis, confirmatory factor analysis, linear regression, and structural equation models.
To Use as a Diagnostic for Other Analyses
A correlation matrix also serves as a diagnostic check for other analyses. For example, in linear regression, high correlation between predictor variables (multicollinearity) suggests that the coefficient estimates will be unreliable.
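Here is a small sketch of that diagnostic use, with synthetic data built so that two predictors are nearly collinear (all names and values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three predictors; x3 is almost a copy of x1, so the two are nearly collinear
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.01, size=200)

# Rows are variables, so this yields a 3x3 correlation matrix
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))

# A near-1 off-diagonal entry (x1 vs x3 here) flags multicollinearity,
# which makes linear-regression coefficient estimates unstable.
```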
We will next look at the applications of the covariance matrix in our learning of the covariance vs correlation differences.
Applications of Covariance Matrix
Covariance matrix is very helpful as an input to other analyses. The most common ones are:
Cholesky Decomposition
Cholesky decomposition is used for simulating systems with multiple correlated variables. Because a covariance matrix is positive semi-definite (and positive definite as long as no variable is an exact linear combination of the others), it admits a Cholesky decomposition: the covariance matrix is factored into the product of a lower triangular matrix and its transpose.
Fig: Cholesky decomposition
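A minimal sketch of simulating correlated variables this way, assuming a hand-picked positive-definite target covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target covariance matrix (must be positive definite for np.linalg.cholesky)
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])

# L is lower triangular, with cov == L @ L.T
L = np.linalg.cholesky(cov)

# Transform independent standard-normal draws into correlated draws
z = rng.standard_normal((2, 100_000))
samples = L @ z

print(np.round(np.cov(samples), 2))  # empirical covariance, close to the target
```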
Principal Component Analysis
Principal component analysis (PCA) is used to reduce the dimensionality of large data sets. To perform PCA, an eigendecomposition is carried out on the covariance matrix of the centered data; the eigenvectors with the largest eigenvalues give the principal components.
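A short sketch of PCA via eigendecomposition of the covariance matrix, on synthetic data whose variance is deliberately concentrated along one axis (the data and dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# 300 points in 3-D with very unequal variance per dimension
X = rng.normal(size=(300, 3)) * np.array([5.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Keep the top-2 components (last two columns, reversed) and project
components = eigvecs[:, ::-1][:, :2]
reduced = Xc @ components

print(reduced.shape)
```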
Difference: Covariance vs Correlation
Although both correlation and covariance matrices are used to measure relationships, there is a significant difference between the two concepts. Here are some differences between covariance vs correlation:
| Basis for comparison | Covariance | Correlation |
|---|---|---|
| Meaning | A measure of the direction of the linear relationship between two variables | A scaled version of covariance that measures both direction and strength |
| Values | Lie between -∞ and +∞ | Lie between -1 and +1 |
| Change in scale | Affects the covariance value | Does not affect correlation |
Similarities: Covariance vs Correlation
Correlation and covariance both measure only the linear relationship between two variables. This means that when the correlation coefficient is zero, the covariance is also zero. Both measures are also unaffected by a change in location (adding a constant to a variable).
However, when it comes to choosing between covariance and correlation to measure the relationship between variables, correlation is preferred, because it is not affected by a change in scale.
Here’s What You Can Do Next
A strong understanding of mathematical concepts is fundamental to building a successful career in data science. It ensures that you can help an organization solve problems quickly, regardless of the industry that you are in. Simplilearn’s Post Graduate Program in Data Science and the Data Scientist Master’s program in collaboration with IBM will help you accelerate your career in data science and take it to the next level. This course will introduce you to integrated blended learning of key technologies including data science with R, Python, Hadoop, Spark and lots more. It also includes real-life, industry-based projects on different domains to help you master the concepts of Data Science and Big Data.