A One-Stop Guide to Statistics for Machine Learning

Statistics is a core component of data analytics and machine learning. It helps you analyze and visualize data to find patterns that are not obvious at first glance. If you are interested in machine learning and want to build a career in it, learning statistics alongside programming is a good first step. In this article, you will learn the core statistical concepts used in machine learning.

What Is Statistics?

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and visualizing empirical data. Its two major areas are descriptive statistics and inferential statistics. Descriptive statistics describe the properties of sample and population data (what has happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make predictions (what you can expect).

Use of Statistics in Machine Learning

In machine learning, statistics is used for:

  • Asking questions about the data
  • Cleaning and preprocessing the data
  • Selecting the right features
  • Model evaluation
  • Model prediction

With this basic understanding, it’s time to dive deep into learning all the crucial concepts related to statistics for machine learning.

Population and Sample

Population:

In statistics, the population comprises all observations (data points) about the subject under study.

An example of a population is the full set of voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.

Sample:

In statistics, a sample is a subset of the population. It is a small portion of the total observed population.

An example of a sample is the group of first-time voters surveyed for an opinion poll.
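The relationship between a population and a sample can be sketched in a few lines of Python. This is only an illustration: the voter ages are randomly generated, hypothetical data, and the sample is assumed to be a simple random sample drawn without replacement.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 eligible voters (made-up data).
population = [rng.randint(18, 90) for _ in range(10_000)]

# A simple random sample of 200 voters, drawn without replacement.
sample = rng.sample(population, k=200)

population_mean = mean(population)
sample_mean = mean(sample)

print(f"population size: {len(population)}, mean age: {population_mean:.2f}")
print(f"sample size:     {len(sample)}, mean age: {sample_mean:.2f}")
```

The sample mean will usually be close to the population mean, but not identical. Inferential statistics is about quantifying that gap.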

Measures of Central Tendency

Measures of central tendency summarize the distribution of data with a single representative value. The mean, median, and mode are the three measures of central tendency.

Mean:

The arithmetic mean is the average of all the data points.

If there are n observations and x_i is the i-th observation, then the mean μ is:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Consider the data frame below that has the names of seven employees and their salaries.

[Image: data frame listing seven employees and their salaries]

To find the mean (average) salary of the employees, you can use the mean() function in Python.

[Image: mean salary computed with mean()]
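As a minimal sketch of the calculation, the snippet below uses the mean() function from Python's standard statistics module. The salary figures are hypothetical stand-ins for the seven employees, since the original data frame is not reproduced here.

```python
from statistics import mean

# Hypothetical salaries for seven employees (not the article's actual data).
salaries = [40000, 52000, 48000, 61000, 52000, 75000, 57000]

# Arithmetic mean by hand: sum of the observations divided by their count.
manual_mean = sum(salaries) / len(salaries)

# The same result using the standard library.
mean_salary = mean(salaries)

print(manual_mean)   # 55000.0
print(mean_salary)   # 55000
```

If the salaries live in a data frame column, the column's own mean() method should give the same value.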

Median:

The median is the middle value that divides the data into two equal halves once the data is sorted in ascending order.

If the total number of data points (n) is odd, the median is the value at position (n+1)/2.

When the total number of observations (n) is even, the median is the average value of observations at n/2 and (n+2)/2 positions.

The median() function in Python can help you find the median value of a column. From the data frame above, you can find the median salary as:

[Image: median salary computed with median()]
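The two position rules above can be checked against the median() function from Python's statistics module. This is a sketch on hypothetical salary data, with one list of odd length and one of even length.

```python
from statistics import median

# Hypothetical salaries (not the article's actual data).
odd_data = sorted([40000, 52000, 48000, 61000, 52000, 75000, 57000])  # n = 7
even_data = sorted([40000, 48000, 52000, 57000, 61000, 75000])        # n = 6

# Odd n: the value at 1-based position (n + 1) / 2.
n = len(odd_data)
odd_median = odd_data[(n + 1) // 2 - 1]

# Even n: the average of the values at 1-based positions n / 2 and (n + 2) / 2.
m = len(even_data)
even_median = (even_data[m // 2 - 1] + even_data[(m + 2) // 2 - 1]) / 2

print(odd_median, median(odd_data))    # 52000 52000
print(even_median, median(even_data))  # 54500.0 54500.0
```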

Mode:

The mode is the observation (value) that occurs most frequently in the data set. A dataset can have more than one mode.

Given below are the heights of students (in cm) in a class:

155, 157, 160, 159, 162, 160, 161, 165, 160, 158

Mode = 160 cm.

The mode salary from the data frame can be calculated as:

[Image: mode of the salary column]
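A short sketch using the student heights listed above shows how the mode can be computed with Python's statistics module. The second list is a hypothetical example added only to illustrate a dataset with two modes.

```python
from statistics import mode, multimode

# Heights of students (in cm) from the example above.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]
most_common_height = mode(heights_cm)
print(most_common_height)  # 160

# Hypothetical data with two equally frequent values.
bimodal = [1, 1, 2, 2, 3]
all_modes = multimode(bimodal)  # returns every mode, in order of first appearance
print(all_modes)  # [1, 2]
```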

Variance and Standard Deviation

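Variance measures how far the data points spread out from the mean. It is the average of the squared deviations from the mean. For n observations x_i with mean μ:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

When the data is a sample rather than the whole population, the sum of squared deviations is usually divided by n − 1 instead of n.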

Consider the dataset below.

[Image: employee data frame, including a Grade column]

To calculate the variance of the Grade column, use the following:

[Image: variance of the Grade column]
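As a rough sketch, the variance can be computed by hand and checked against Python's statistics module. The grades below are hypothetical, because the original data frame is not available here. The snippet shows both the population version (divide by n) and the sample version (divide by n − 1).

```python
from statistics import pvariance, variance

# Hypothetical grades (not the article's actual data).
grades = [60, 70, 80, 90, 100]

n = len(grades)
mean_grade = sum(grades) / n

# Squared deviation of each grade from the mean.
squared_devs = [(g - mean_grade) ** 2 for g in grades]

population_var = sum(squared_devs) / n        # divide by n
sample_var = sum(squared_devs) / (n - 1)      # divide by n - 1

print(population_var, pvariance(grades))  # both equal 200
print(sample_var, variance(grades))       # both equal 250
```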

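Standard deviation in statistics is the square root of the variance, so it is expressed in the same units as the data. Variance and standard deviation are measures of spread: they indicate how well the mean represents the data. The smaller they are, the more tightly the data clusters around the mean.

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$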

You can find the standard deviation using the std() function in Python.

[Image: standard deviation of the Grade column computed with std()]
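The sketch below computes the standard deviation with Python's statistics module on hypothetical grades. If the std() call shown in the article is the pandas method, note that it divides by n − 1 by default, which corresponds to stdev() here rather than pstdev().

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev

# Hypothetical grades (not the article's actual data).
grades = [60, 70, 80, 90, 100]

population_std = pstdev(grades)  # square root of the variance with divisor n
sample_std = stdev(grades)       # square root of the variance with divisor n - 1

# The standard deviation is the square root of the variance.
std_from_variance = sqrt(pvariance(grades))

print(f"population std: {population_std:.3f}")
print(f"sample std:     {sample_std:.3f}")
```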

Range and Interquartile Range

Range:

The range in statistics is the difference between the maximum and the minimum values of the dataset.

Range = maximum value − minimum value

Interquartile Range (IQR):

The IQR is the distance between the 1st quartile (Q1) and the 3rd quartile (Q3), so it measures the spread of the middle 50% of the data.

IQR = Q3 − Q1
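Both measures can be sketched with the student heights used earlier in this article. The quartiles come from statistics.quantiles() with its default settings. Other tools use slightly different quartile conventions, so their Q1 and Q3 values may differ a little.

```python
from statistics import quantiles

# Heights of students (in cm) from the mode example.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]

# Range: maximum minus minimum.
value_range = max(heights_cm) - min(heights_cm)  # 165 - 155 = 10

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(heights_cm, n=4)
iqr = q3 - q1

print(f"range: {value_range}")
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
```

Because the IQR ignores the lowest and highest quarters of the data, it is much less sensitive to outliers than the range.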

Skewness and Kurtosis

Skewness:

Skewness measures the asymmetry of a distribution. A distribution is symmetrical when the proportion of data at an equal distance on either side of the mean (or median) is equal. If the tail of the distribution extends to the right, it is right-skewed (positive skew), and if the tail extends to the left, it is left-skewed (negative skew).

[Image: illustration of skewness]

Kurtosis:

Kurtosis in statistics is used to check whether the tails of a given distribution contain extreme values. It describes how heavy the tails of a probability distribution are, usually in comparison with a normal distribution.

[Images: skewness and kurtosis examples, including the Salary, Hours, and Grade columns]
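To make the two measures concrete, here is a minimal sketch that computes them from their moment-based definitions. The helper functions skewness() and excess_kurtosis() and the data are hypothetical and written only for this illustration. Library functions may apply bias corrections and therefore return slightly different numbers.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by the variance squared, minus 3
    so that a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]
left_skewed = [-10, -3, -2, -2, -1]

print(skewness(symmetric))     # 0.0 for symmetric data
print(skewness(right_skewed))  # positive: tail extends to the right
print(skewness(left_skewed))   # negative: tail extends to the left
print(excess_kurtosis(symmetric))  # negative: lighter tails than a normal curve
```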

Now, it’s time to discuss a very popular distribution in statistics for machine learning, i.e., Gaussian Distribution.

Gaussian Distribution

In statistics and probability, the Gaussian (normal) distribution is a widely used continuous probability distribution for a real-valued random variable. It is characterized by two parameters: the mean μ and the standard deviation σ. Many natural phenomena approximately follow a normal distribution, such as the heights of people and IQ scores.

[Image: Gaussian distribution curve]

Properties of Gaussian Distribution:

  • The mean, median, and mode are the same
  • It has a symmetrical bell shape
  • About 68% of the data lies within 1 standard deviation of the mean
  • About 95% of the data lies within 2 standard deviations of the mean
  • About 99.7% of the data lies within 3 standard deviations of the mean

[Images: code and plot of a Gaussian distribution]
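The properties listed above can be checked numerically with NormalDist from Python's statistics module. This is a sketch under assumed parameters: the mean of 170 cm and standard deviation of 10 cm are hypothetical values chosen for illustration, and share_within() is a helper written for this example.

```python
from statistics import NormalDist

# Hypothetical parameters: heights with mean 170 cm and std 10 cm.
MU, SIGMA = 170.0, 10.0
heights = NormalDist(mu=MU, sigma=SIGMA)

def share_within(k):
    """Probability of falling within k standard deviations of the mean."""
    return heights.cdf(MU + k * SIGMA) - heights.cdf(MU - k * SIGMA)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")

# The density is symmetric around the mean and peaks there.
print(heights.pdf(MU - 5), heights.pdf(MU + 5))
print(heights.pdf(MU) > heights.pdf(MU + SIGMA))
```

The printed shares come out at roughly 0.68, 0.95, and 0.997, whatever values you pick for the mean and standard deviation.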

Central Limit Theorem

According to the central limit theorem, given a population with mean μ and standard deviation σ, if you repeatedly take sufficiently large random samples from the population, the distribution of the sample means will be approximately normal, irrespective of the shape of the original population distribution.

Rule of thumb: a sample size of 30 or more is commonly treated as large enough for this approximation to hold, although heavily skewed populations may need larger samples.

[Image: illustration of the central limit theorem]
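A small simulation can make the theorem tangible. The sketch below assumes a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen only for illustration) and draws 2,000 samples of size 30. The sample means should cluster around the population mean with a spread close to σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 30     # observations per sample
NUM_SAMPLES = 2000   # number of samples drawn

# Population: exponential with rate 1 (mean 1, std 1), far from normal.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = mean(sample_means)
spread_of_means = stdev(sample_means)
expected_spread = 1.0 / sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

# Share of sample means within one standard deviation of their mean;
# for a normal distribution this is about 68%.
share_within_one = sum(
    abs(m - mean_of_means) <= spread_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"std of sample means:  {spread_of_means:.3f} (expected about {expected_spread:.3f})")
print(f"share within 1 std:   {share_within_one:.3f}")
```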

Now, you will learn a very critical concept in statistics for machine learning, i.e., Hypothesis testing.  

Hypothesis Testing

Hypothesis testing is a statistical analysis to make decisions using experimental data. It allows you to statistically back up some findings you have made in looking at the data. In hypothesis testing, you make a claim and the claim is usually about population parameters such as mean, median, standard deviation, etc.

  • The null hypothesis (H0) is the default assumption made for a statistical test, typically that there is no effect or no difference.
  • The alternative hypothesis (H1) contradicts the null hypothesis. It is the claim supported when the data provide enough evidence against H0 at the chosen level of significance.

Hypothesis testing lets you decide to either reject or retain a null hypothesis.

Example:

    H0: The average BMI of boys and girls in a class is the same.

    H1: The average BMI of boys and girls in a class is not the same.

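To determine whether a finding is statistically significant, you need to interpret the p-value: the probability of observing a result at least as extreme as the one in your data, assuming the null hypothesis is true. It is common to compare the p-value to a threshold value called the significance level.

The significance level is often set to 5% (0.05).

If the p-value > 0.05: fail to reject (retain) the null hypothesis. This does not prove that the null hypothesis is true.

If the p-value ≤ 0.05: reject the null hypothesis.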

Some popular hypothesis tests are:

  • Chi-square test
  • T-test
  • Z-test
  • Analysis of Variance (ANOVA)
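To tie the BMI example to the p-value rule, here is a rough sketch of a two-sample Z-test built with the standard library only. The BMI values are hypothetical, and two_sample_z_test() is a helper written for this illustration. With samples this small, a T-test would normally be the better choice. The Z-test is used here only because it keeps the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(a, b):
    """Two-sided Z-test for a difference in means.

    Uses the sample variances as estimates of the population variances,
    which is only a reasonable approximation for large samples.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical BMI measurements for boys and girls in a class.
boys_bmi = [18.5, 19.2, 20.1, 21.0, 19.8, 20.5, 18.9, 21.3, 19.5, 20.2]
girls_bmi = [18.8, 19.0, 20.4, 20.9, 19.6, 20.8, 19.1, 21.0, 19.9, 20.0]

ALPHA = 0.05  # significance level
z_stat, p_value = two_sample_z_test(boys_bmi, girls_bmi)

print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")
if p_value <= ALPHA:
    print("Reject H0: the average BMIs differ.")
else:
    print("Fail to reject H0: no significant difference in average BMI.")
```

With this made-up data the two group means are almost identical, so the p-value is far above 0.05 and the null hypothesis is retained.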

Conclusion

Statistics is a core component of machine learning. It helps you draw meaningful conclusions by analyzing raw data. In this article on Statistics for Machine Learning, you covered all the critical concepts that are widely used to make sense of data. 

About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
