Math Refresher - Machine Learning

This ‘Math Refresher’ tutorial is part of the Machine Learning course offered by Simplilearn. In this tutorial, we will review the mathematical fundamentals required for machine learning.


Let us quickly go through the objectives of this Math refresher tutorial.

  • Explain the concepts of Linear Algebra
  • Describe eigenvalues, eigenvectors, and eigendecomposition
  • Define differential and integral calculus
  • Explain the concepts of probability and statistics

Introduction to Linear Algebra

Linear algebra is a branch of mathematics that deals with the study of vectors and linear functions and equations.

Linear Equations

The main purpose of linear algebra is to find systematic methods for solving systems of linear equations.

A linear equation of n variables is of the form:

a_1 x_1 + a_2 x_2 + ... + a_n x_n = b

where x_1, x_2, ..., x_n are the unknown quantities to be found, a_1, ..., a_n are the coefficients (given numbers), and b is the constant term.

A linear equation does not involve any products, inverses, or roots of variables. All variables occur only to the first power and not as arguments for trigonometric, logarithmic, or exponential functions.

System of Linear Equations

A system of linear equations is a finite collection of linear equations. A linear system of m equations in n variables has the form:

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1

a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2

...

a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = b_m

As in the case of a single linear equation, a linear system can have infinitely many solutions, exactly one solution, or no solution at all. A linear system that has at least one solution is called consistent, and one with no solution is called inconsistent.

Solving Linear Systems of Equations

Solving a linear system of equations is a long and tedious process. The concept of the matrix was introduced to simplify the computations involved in this process. A matrix contains the essential information of a linear system in a rectangular array.


A matrix of size m × n is a rectangular array of the form:

a_11  a_12  ...  a_1n
a_21  a_22  ...  a_2n
...   ...   ...  ...
a_m1  a_m2  ...  a_mn

where the a_ij are the entries of the matrix, m represents the number of rows, and n represents the number of columns.
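To illustrate how a matrix carries the information of a linear system, here is a minimal Python sketch that solves a small system by row operations on the augmented matrix. The function name solve_linear_system and the example system are illustrative assumptions, not part of the course material, and the sketch assumes the system has exactly one solution.

```python
from fractions import Fraction

def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination (assumes a unique solution)."""
    n = len(A)
    # Build the augmented matrix [A | b] using exact rational arithmetic.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot entry becomes 1.
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    # The last column now holds the solution.
    return [row[-1] for row in M]

# Hypothetical system:  x1 + 2*x2 = 5  and  3*x1 + 4*x2 = 11
solution = solve_linear_system([[1, 2], [3, 4]], [5, 11])
print([int(v) for v in solution])  # [1, 2], that is, x1 = 1 and x2 = 2
```

Exact fractions are used here only to keep the arithmetic free of rounding error; numerical libraries typically use floating-point routines instead.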

Forms of Matrix

If n = m, that is, the number of columns equals the number of rows, the matrix is called a square matrix. An entry of the form a_ii is said to be on the main diagonal.

A square matrix A is called a diagonal matrix if a_ij = 0 whenever i ≠ j.

Matrix Operations

Let’s look into some of the Matrix Operations below.


In matrix addition, the corresponding elements of the two matrices are added. Two matrices can be added only if they have the same number of rows and columns. Addition is commutative, that is, A + B = B + A.


In matrix subtraction, the corresponding elements of the two matrices are subtracted. Two matrices can be subtracted only if they have the same number of rows and columns. Subtraction is not commutative: in general, A - B ≠ B - A.
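As a minimal sketch of element-wise addition and subtraction, the following Python snippet uses two hypothetical 2 × 2 matrices A and B. The values and the helper names mat_add and mat_sub are illustrative assumptions.

```python
def same_shape(A, B):
    """Return True if A and B have the same number of rows and columns."""
    return len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))

def mat_add(A, B):
    """Add two matrices of the same size element by element."""
    if not same_shape(A, B):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    """Subtract B from A element by element."""
    if not same_shape(A, B):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hypothetical example matrices
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(mat_add(A, B))  # [[6, 8], [10, 12]]
print(mat_sub(A, B))  # [[-4, -4], [-4, -4]]
print(mat_sub(B, A))  # [[4, 4], [4, 4]]
```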


In matrix multiplication, the entry in row i and column j of the product AB is obtained by multiplying the elements of row i of A with the corresponding elements of column j of B and adding the results. The matrix product AB is defined only when the number of columns in A is equal to the number of rows in B. BA is defined only when the number of columns in B is equal to the number of rows in A. AB is not always equal to BA.
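The row-by-column rule can be sketched in a few lines of Python. The matrices below are hypothetical values chosen for illustration, and mat_mul is an assumed helper name.

```python
def mat_mul(A, B):
    """Multiply A (m x n) by B (n x p): entry (i, j) is row i of A dotted with column j of B."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical example matrices
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

AB = mat_mul(A, B)
BA = mat_mul(B, A)
print(AB)  # [[19, 22], [43, 50]]
print(BA)  # [[23, 34], [31, 46]]
```

For these two matrices AB and BA differ, which shows that the order of the factors matters.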


A transpose is a matrix formed by turning all the rows of a given matrix into columns and vice versa. The transpose of matrix A is denoted as A^T. If A has m rows and n columns, A^T has n rows and m columns.
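A short Python sketch of the transpose, using a hypothetical 2 × 3 matrix (the values are assumptions for illustration):

```python
def transpose(A):
    """Return the transpose of A: row i of A becomes column i of the result."""
    return [list(column) for column in zip(*A)]

# Hypothetical 2 x 3 matrix
A = [[1, 2, 3],
     [4, 5, 6]]

A_T = transpose(A)
print(A_T)  # [[1, 4], [2, 5], [3, 6]]
```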


Now that you know about matrix addition, subtraction, and multiplication, is the division of matrices possible? There is no matrix division, but there is an analogous concept called the inverse.


An n-by-n square matrix A is called invertible (also nonsingular or nondegenerate) if there exists an n-by-n square matrix B such that

AB = BA = I_n

where I_n denotes the n-by-n identity matrix and the multiplication used is ordinary matrix multiplication. In this case, the matrix B is uniquely determined by A and is called the inverse of A, denoted by A^-1.
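For a 2 × 2 matrix the inverse can be written down directly from the determinant. The sketch below uses a hypothetical matrix A and the assumed helper names inverse_2x2 and mat_mul, and checks that multiplying A by its inverse in either order gives the identity matrix.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Invert a 2 x 2 matrix using the closed-form determinant formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (not invertible)")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def mat_mul(A, B):
    """Ordinary matrix multiplication (rows of A times columns of B)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical invertible matrix (determinant 4*6 - 7*2 = 10)
A = [[4, 7], [2, 6]]
A_inv = inverse_2x2(A)
identity = [[1, 0], [0, 1]]

print(mat_mul(A, A_inv) == identity)  # True
print(mat_mul(A_inv, A) == identity)  # True
```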


Special Matrix Types

The following are some special types of matrices.

  • Diagonal matrix: a matrix D is diagonal if and only if D_i,j = 0 for all i ≠ j
  • Symmetric matrix: a matrix A for which A = A^T
  • Identity matrix: denoted as I_n, such that I_n A = A for any matrix A with n rows

An array with more than two axes is called a tensor. Example: a tensor A might have 3 axes, so the value at coordinates (i, j, k) is A_i,j,k.

What is a Vector?

A vector (v) is an object with both magnitude (length) and direction. It is drawn as an arrow starting from the origin (0,0), and its length is denoted by ||v||.

Properties of Vectors

  • A unit vector is a vector with unit norm (unit length), that is, ||x||_2 = 1
  • A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors are non-zero, this means that they are at a 90-degree angle to each other.
  • Vectors that are orthogonal to each other and have unit norm are called orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal. In this case A^T A = A A^T = I. Also, for an orthogonal matrix, A^-1 = A^T.
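A 2D rotation matrix is a familiar example of an orthogonal matrix. The following sketch, which assumes a rotation angle of 30 degrees chosen only for illustration, checks numerically that A^T A is the identity matrix.

```python
import math

# Hypothetical example: rotation by 30 degrees
theta = math.radians(30)
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Transpose: rows become columns.
A_T = [list(column) for column in zip(*A)]

# Product A^T A, computed entry by entry.
product = [[sum(A_T[i][k] * A[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]

# Compare with the identity matrix, allowing for floating-point rounding.
is_identity = all(
    math.isclose(product[i][j], 1.0 if i == j else 0.0, abs_tol=1e-12)
    for i in range(2) for j in range(2)
)
print(is_identity)  # True
```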


The operation of adding two or more vectors together into a vector sum is referred to as vector addition. For two vectors u and v, the vector sum is:

u + v = z

Vector subtraction is the process of subtracting two or more vectors to get a vector difference. For two vectors u and v, the vector difference is:

u - v = z
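Vector addition and subtraction work component by component. A minimal Python sketch with two hypothetical 2D vectors (the values are assumptions for illustration):

```python
# Hypothetical 2D vectors
u = [1, 2]
v = [3, -1]

# Add and subtract matching components.
vector_sum = [a + b for a, b in zip(u, v)]
vector_difference = [a - b for a, b in zip(u, v)]

print(vector_sum)         # [4, 1]
print(vector_difference)  # [-2, 3]
```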


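Vector multiplication refers to a technique for multiplying two vectors. The form used here is the dot product (scalar product), which multiplies the corresponding components of u and v and adds the results, giving a single number z:

u · v = z

For two-dimensional vectors u = (u_1, u_2) and v = (v_1, v_2):

u · v = u_1 v_1 + u_2 v_2 = ∑ u_i v_i

This can be shown to equal: u · v = ‖u‖ ‖v‖ cos θ, where θ is the angle between u and v.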
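The relationship between the dot product and the angle between two vectors can be checked with a short Python sketch. The vectors are hypothetical values, and dot and norm are assumed helper names.

```python
import math

def dot(u, v):
    """Dot product: sum of the products of matching components."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

# Hypothetical vectors: one along the x-axis, one along the diagonal
u = [1, 0]
v = [1, 1]

# Rearranging u . v = |u| |v| cos(theta) gives the angle between the vectors.
cos_theta = dot(u, v) / (norm(u) * norm(v))
angle_degrees = math.degrees(math.acos(cos_theta))

print(dot(u, v))                # 1
print(round(angle_degrees, 6))  # 45.0
```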


In machine learning, the size of a vector is measured using a function called a norm. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x.

Example: the Lp norm of a vector x, where p ≥ 1, is given by ||x||_p = (∑ |x_i|^p)^(1/p).


  • The most popular norm is the L2 norm with p = 2, also called the Euclidean norm.
  • It is simply the Euclidean distance from the origin to the point identified by x.
  • The 2 in the L2 norm is frequently omitted, that is, ||x||_2 is just written as ||x||
  • The size of a vector is often measured as the squared L2 norm, which equals x^T x. This is convenient because its derivative with respect to each element x_i depends only on that element, whereas the derivative of the L2 norm depends on the entire vector.
  • The L1 norm is commonly used when the difference between zero and non-zero elements is very important. This is because the L1 norm grows at the same rate everywhere, including near the origin, whereas the squared L2 norm increases very slowly near the origin.
  • Every time an element of x moves away from 0 by ϵ, the L1 norm increases by ϵ.
  • The max norm (L∞ norm) is the absolute value of the element with the largest magnitude, that is, ||x||_∞ = max |x_i|
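The norms listed above can be compared on a single vector. This is a minimal sketch using a hypothetical vector x = (3, -4); the helper names lp_norm and max_norm are assumptions.

```python
def lp_norm(x, p):
    """Lp norm for p >= 1: (sum of |x_i|^p) raised to the power 1/p."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def max_norm(x):
    """Max norm: largest absolute value among the elements."""
    return max(abs(v) for v in x)

# Hypothetical vector
x = [3, -4]

l1 = lp_norm(x, 1)                # |3| + |-4|
l2 = lp_norm(x, 2)                # sqrt(3^2 + 4^2)
squared_l2 = sum(v * v for v in x)  # same value as x^T x
linf = max_norm(x)

print(l1, l2, squared_l2, linf)  # 7.0 5.0 25 4
```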

Eigenvector and Eigenvalue

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

Av = λv

where the scalar λ is the eigenvalue corresponding to this eigenvector.
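The defining equation Av = λv can be verified directly for a small matrix. In the sketch below, the symmetric matrix A and its eigenpairs are hypothetical values chosen so that the arithmetic is easy to follow; mat_vec is an assumed helper name.

```python
def mat_vec(A, v):
    """Multiply matrix A by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Hypothetical symmetric matrix with eigenvalues 3 and 1
A = [[2, 1],
     [1, 2]]

eigenpairs = [
    (3, [1, 1]),    # A stretches the direction (1, 1) by a factor of 3
    (1, [1, -1]),   # A leaves the direction (1, -1) unchanged
]

for eigenvalue, v in eigenpairs:
    left = mat_vec(A, v)                # A v
    right = [eigenvalue * x for x in v]  # lambda v
    print(left, right, left == right)
```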

Effect of Eigenvectors and Eigenvalues

Here, matrix A has two orthonormal eigenvectors, v(1) with eigenvalue λ1 and v(2) with eigenvalue λ2. (Left) Plot of the set of all unit vectors u ∈ R^2 as a unit circle. (Right) Plot of the set of all points Au. By observing the way A distorts the unit circle, you can see that it scales space in the direction v(i) by λi.


Integers can be broken into their prime factors to understand them, for example, 12 = 2 x 2 x 3. From this, useful properties can be derived, for example, the number is not divisible by 5 and is divisible by 2 and 3. Similarly, matrices can be decomposed. This will help you discover information about the matrix.

If A has a set of linearly independent eigenvectors v_1, v_2, … forming the columns of a matrix V, and the corresponding eigenvalues λ_1, λ_2, … are collected in a vector λ, then the eigendecomposition of A is given by:

A = V diag(λ) V^-1
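As a sketch of the eigendecomposition, the following Python snippet rebuilds a matrix from its eigenvectors and eigenvalues. The matrix V, the eigenvalues 3 and 1, and the hand-computed inverse V_inv are hypothetical values chosen for illustration.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Ordinary matrix multiplication (rows of A times columns of B)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

half = Fraction(1, 2)

# Columns of V are the eigenvectors (1, 1) and (1, -1).
V = [[1, 1],
     [1, -1]]
# diag(lambda): the eigenvalues 3 and 1 placed on the main diagonal.
diag_lambda = [[3, 0],
               [0, 1]]
# Inverse of V, computed by hand for this 2 x 2 case.
V_inv = [[half, half],
         [half, -half]]

A = mat_mul(mat_mul(V, diag_lambda), V_inv)
print([[int(x) for x in row] for row in A])  # [[2, 1], [1, 2]]
```

The result is the matrix whose eigenvectors and eigenvalues were used as input, so the decomposition recovers A exactly in this example.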

What is Calculus?

Calculus is the study of change. It provides a framework for modeling systems in which there is change, and ways to make predictions from such models.

Differential Calculus

Differential calculus is a part of the calculus that deals with the study of the rates at which quantities change.

Let x and y be two real numbers such that y is a function of x, that is, y = f(x). If f(x) is the equation of a straight line (linear equation), then the equation is represented as y = mx + b, where m is the slope determined by the following equation:

m = Δy/Δx = (y_2 - y_1) / (x_2 - x_1)

For a straight line, Δy/Δx is the same everywhere. For a general curve, the derivative dy/dx is the limit of Δy/Δx as Δx approaches 0, and it gives the rate of change of y per unit change in x at a point. The slope of a curve changes at various points of the graph. The derivative at a point represents the slope of an imaginary straight line drawn through a very small segment of the graph around that point.
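The idea of a slope over a very small segment can be sketched numerically. The snippet below computes the slope of a hypothetical line y = 2x + 1 from two points, and approximates the derivative of the hypothetical curve y = x^2 with a central difference. The function names and the step size h are assumptions for illustration.

```python
def line(x):
    """Hypothetical straight line y = 2x + 1."""
    return 2 * x + 1

def square(x):
    """Hypothetical curve y = x^2, whose exact derivative is 2x."""
    return x * x

def derivative(f, x, h=1e-5):
    """Approximate dy/dx at x by the slope over the small interval [x - h, x + h]."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Slope of the straight line from two points: (y2 - y1) / (x2 - x1)
slope = (line(5) - line(1)) / (5 - 1)
print(slope)  # 2.0

# The slope of the curve depends on where it is measured.
print(round(derivative(square, 1.0), 6))  # 2.0
print(round(derivative(square, 3.0), 6))  # 6.0
```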

Integral Calculus

Integral Calculus assigns numbers to functions to describe displacement, area, volume, and other concepts that arise by combining infinitesimal data.

Given a function f of a real variable x and an interval [a, b] of the real line, the definite integral is defined informally as the signed area of the region in the XY-plane that is bounded by the graph of f, the x-axis, and the vertical lines x = a and x = b.

Integration is the inverse operation of differentiation, and vice versa.
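The definite integral as an area can be approximated by summing thin trapezoids. This minimal sketch integrates the hypothetical function f(x) = x^2 over [0, 3] and compares the result with the value obtained from its antiderivative x^3/3. The function name and the number of slices are assumptions.

```python
def trapezoid_integral(f, a, b, n=1000):
    """Approximate the definite integral of f over [a, b] using n trapezoids."""
    width = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * width)
    return total * width

def f(x):
    """Hypothetical integrand f(x) = x^2."""
    return x * x

def antiderivative(x):
    """F(x) = x^3 / 3, whose derivative is f(x)."""
    return x ** 3 / 3

area = trapezoid_integral(f, 0.0, 3.0)
exact = antiderivative(3.0) - antiderivative(0.0)

print(round(area, 3), exact)  # 9.0 9.0
```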

Probability Theory

Probability is the measure of the likelihood of an event’s occurrence.

Example: The chance of getting heads on a coin toss is ½ or 50%

The probability of any specific event is between 0 and 1 (inclusive), that is, 0 <= P(x) <= 1. The probabilities of all possible outcomes add up to 1. For a continuous distribution with density p over x, this means that ∫p(x)dx = 1 (integral of p over all values of x).

Conditional Probability

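The conditional probability that the random variable y takes the value y, given that the random variable x takes the value x, is:

P(y = y | x = x) = P(y = y, x = x) / P(x = x)

It is defined only when P(x = x) > 0. Conditional probability is the basis of Bayes' rule.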


The Bayes model defines the probability of event A occurring, given that event B has occurred:

P(A | B) = P(A ∩ B) / P(B)

where:

P(A) = probability of event A

P(B) = probability of event B

P(A ∩ B) = probability of both events happening

Consider the example of tossing two coins:

P(Coin1-H) = 2/4

P(Coin2-H) = 2/4

P(Coin1-H ∩ Coin2-H) = 1/4

P(Coin1-H | Coin2-H) = (1/4)/(2/4) = 1/2 = 50% (probability of Coin1-H, given Coin2-H)
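The coin numbers above can be reproduced by listing all four outcomes of two coin tosses and counting. This is a minimal sketch assuming two fair coins; the helper name probability is an assumption.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

def probability(event):
    """Fraction of outcomes for which the event is true."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

p_coin1_h = probability(lambda o: o[0] == "H")
p_coin2_h = probability(lambda o: o[1] == "H")
p_both_h = probability(lambda o: o[0] == "H" and o[1] == "H")

# Conditional probability: P(Coin1-H | Coin2-H) = P(both) / P(Coin2-H)
p_coin1_given_coin2 = p_both_h / p_coin2_h

print(p_coin1_h, p_coin2_h, p_both_h, p_coin1_given_coin2)  # 1/2 1/2 1/4 1/2
```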


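Events A and B are statistically independent if:

P(A and B) = P(A) × P(B)

In the coin example, 1/4 = (2/4) × (2/4), so the two coin tosses are independent.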

AI with Bayes Model: Example

Calculating the chance of developing diabetes (D), given the consumption of fast food (F).

Observed data used in the calculation: P(F) = 20% and P(D and F) = 5%

Chance of diabetes, given fast food (conditional probability): P(D | F) = P(D and F) / P(F) = 5% / 20% = ¼ = 25%

Analysis: According to this data, if you eat fast food, you have a 25% chance of developing diabetes.


Chain Rule of Probability

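The joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable. It can be represented as:

P(x_1, ..., x_n) = P(x_1) × P(x_2 | x_1) × ... × P(x_n | x_1, ..., x_(n-1))

The variables can be taken in any order. For example: P(a, b, c) = P(a | b, c) * P(b | c) * P(c)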
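The chain rule can be checked on a small joint distribution. The sketch below builds a hypothetical joint distribution over three binary variables a, b, and c (the weights 1 to 8 are arbitrary assumptions) and confirms that P(a | b, c) * P(b | c) * P(c) reproduces P(a, b, c).

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution: outcomes (a, b, c) with arbitrary weights 1..8
weights = dict(zip(product([0, 1], repeat=3), range(1, 9)))
total = sum(weights.values())
joint = {outcome: Fraction(w, total) for outcome, w in weights.items()}

NAMES = ("a", "b", "c")

def prob(**fixed):
    """Probability that the named variables take the given values (others summed out)."""
    return sum(
        p for outcome, p in joint.items()
        if all(outcome[NAMES.index(name)] == value for name, value in fixed.items())
    )

a, b, c = 1, 0, 1
p_c = prob(c=c)
p_b_given_c = prob(b=b, c=c) / p_c
p_a_given_bc = prob(a=a, b=b, c=c) / prob(b=b, c=c)

chain = p_a_given_bc * p_b_given_c * p_c
print(chain, joint[(a, b, c)])  # 1/6 1/6
```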

Standard Deviation

Standard deviation is a quantity that expresses how much the members of a group differ from the mean value of the group.

  • Its symbol is σ (the Greek letter sigma). If the data points are further from the mean, there is a higher deviation within the data set.
  • For example, a volatile stock has a high standard deviation, while the deviation of a stable blue-chip stock is usually rather low.

Standard deviation is used more often than variance because the unit in which it is measured is the same as that of mean, a measure of central tendency.


Variance (σ^2) refers to the spread of the data set, that is, how far the numbers are from the mean.

  • Variance is particularly useful when calculating the probability of future events or performance.
  • Example: India has a high variation in environmental temperature, from very hot to very cold weather. That is, India has a high variance of weather conditions.

Notice that variance is just the square of standard deviation.
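The relationship between variance and standard deviation can be seen with Python's standard statistics module. The data set below is a hypothetical example, and the population versions of both measures are used here as an assumption.

```python
import statistics

# Hypothetical data set with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
variance = statistics.pvariance(data)  # population variance (sigma squared)
std_dev = statistics.pstdev(data)      # population standard deviation (sigma)

print(mean, variance, std_dev)  # 5 4 2.0
```

The standard deviation is expressed in the same unit as the data and the mean, while the variance is expressed in squared units.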


Covariance is the measure of how two random variables change together. It is used to calculate the correlation between variables.

  • A positive covariance indicates that the two variables tend to move upward and downward in value at the same time. An inverse or negative covariance means that the variables move counter to each other: when one rises, the other falls.
  • Example: Purchasing stocks with a negative covariance is a way to reduce risk in a portfolio (diversification)
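The sign of the covariance can be illustrated with a short Python sketch. The three series below are hypothetical values, and the covariance helper uses the sample formula (dividing by n - 1), which is an assumption about the convention.

```python
def covariance(x, y):
    """Sample covariance: average product of deviations from the means (n - 1 divisor)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical series
x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 6, 8, 10]   # moves in the same direction as x
falls_with_x = [10, 8, 6, 4, 2]   # moves counter to x

print(covariance(x, rises_with_x))  # 5.0
print(covariance(x, falls_with_x))  # -5.0
```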

Logistic Sigmoid

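The logistic sigmoid, σ(x) = 1 / (1 + e^(-x)), is a useful function that follows an S-shaped curve. Its output lies between 0 and 1. It saturates when the input is very large (approaching 1) or very small (approaching 0), meaning that the curve becomes flat there.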
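A minimal Python sketch of the logistic sigmoid, 1 / (1 + e^(-x)), shows the saturation at both ends. The two-branch form is a design choice made here to avoid overflow for large negative inputs; it is an assumption of this sketch, not part of the course material.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)

for value in (-10, -1, 0, 1, 10):
    print(value, round(sigmoid(value), 5))
```

At 0 the function returns 0.5, and for inputs such as -10 and 10 the output is already very close to 0 and 1 respectively.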

Gaussian Distribution

Let us look at the Gaussian distribution in detail.


Data can be distributed in various ways. The distribution where the data tends to be around a central value with lack of bias or minimal bias toward the left or right is called Gaussian distribution, also known as normal distribution.

In the absence of prior knowledge, the normal distribution is often a good assumption in machine learning.


Types of Gaussian Distribution

We will look at the types of Gaussian Distribution below.


A univariate Gaussian distribution is a distribution over a single variable.
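The bell shape of the univariate Gaussian comes from its density function, which depends on the mean μ and the standard deviation σ. The sketch below evaluates the standard formula for the density and checks numerically that the area under the curve is 1. The function name gaussian_pdf, the default parameters, and the integration range are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The density is highest at the mean and symmetric around it.
peak = gaussian_pdf(0.0)
left, right = gaussian_pdf(-1.5), gaussian_pdf(1.5)

# Approximate the total area under the curve with thin rectangles over [-8, 8].
step = 0.001
area = sum(gaussian_pdf(-8.0 + i * step) * step for i in range(16000))

print(round(peak, 4), left == right, round(area, 6))  # 0.3989 True 1.0
```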


Multivariate normal distribution is the generalization of the univariate normal distribution to multiple variables.

Multivariate Gaussian distribution over two variables x1 and x2

Key Takeaways

Let us quickly go through the topics learned in this Machine Learning tutorial.

  • Linear algebra is a branch of mathematics that deals with the study of vectors and linear functions and equations.
  • A matrix of size m × n is a rectangular array.
  • A vector (v) is an object with both magnitude (length) and direction. It is represented by an arrow on a graph.
  • An eigenvector of a square matrix A is a non-zero vector v that changes only in scale when multiplied by A.
  • Differential calculus studies the incremental rate of change of a dependent variable y with respect to x. Integral calculus deals with the summation of a function f(x) over x.
  • Probability is the chance of something happening.
  • Standard deviation and variance indicate the spread of a data distribution around its mean.


This concludes the “Math Refresher” tutorial. The next lesson is “Regression.”
