A One-Stop Guide to Statistics for Machine Learning

Statistics is a core component of data analytics and machine learning. It helps you analyze and visualize data to find unseen patterns. If you are interested in machine learning and want to grow your career in it, then learning statistics along with programming should be the first step. In this article, you will learn the core concepts of statistics for machine learning.

This article covers the following topics in statistics for machine learning:

  • What is Statistics?
  • Use of Statistics in Machine Learning
  • Population and Sample
  • Measures of Central Tendency
  • Variance and Standard Deviation
  • Range and Interquartile Range
  • Skewness and Kurtosis
  • Gaussian Distribution
  • Central Limit Theorem
  • Hypothesis Testing

What Is Statistics?

Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and visualizing empirical data. Descriptive statistics and inferential statistics are the two major areas of statistics. Descriptive statistics describe the properties of sample and population data (what has happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make predictions (what you can expect).

Use of Statistics in Machine Learning

In machine learning, statistics is used for:

  • Asking questions about the data
  • Cleaning and preprocessing the data
  • Selecting the right features
  • Model evaluation
  • Model prediction

With this basic understanding, it’s time to dive deep into learning all the crucial concepts related to statistics for machine learning.

Population and Sample

Population:

In statistics, the population comprises all observations (data points) about the subject under study.

An example of a population is all the eligible voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.

Sample:

In statistics, a sample is a subset of the population. It is a small portion of the total observed population.

An example of a sample is the group of first-time voters surveyed for an opinion poll.
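The relationship between a population and a sample can be sketched in a few lines of Python. This is only an illustration: the population of voter ages is randomly generated (hypothetical, not real election data), and the sample size of 200 is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 10,000 eligible voters.
population = [rng.randint(18, 90) for _ in range(10_000)]

# A sample is a subset of the population, drawn here without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"Population size: {len(population)}, mean age: {population_mean:.2f}")
print(f"Sample size:     {len(sample)}, mean age: {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. Quantifying that gap is the job of inferential statistics.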

Measures of Central Tendency

Measures of central tendency summarize a data set with a single representative value. The mean, median, and mode are the three measures of central tendency.

Mean:

The arithmetic mean is the average of all the data points.

If there are n observations and x_i is the ith observation, then the mean is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Consider the data frame below that has the names of seven employees and their salaries.

[Figure: employee data frame with names and salaries]

To find the mean, or average, salary of the employees, you can use the mean() function in Python.

[Figure: code and output for the mean salary]
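Since the original table is not reproduced here, the following sketch uses a made-up data frame of seven employees. It assumes the pandas library; the names, salaries, and column labels are hypothetical.

```python
import pandas as pd

# Hypothetical data: the article's original table is not available.
employees = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Farah", "Gus"],
    "Salary": [40000, 45000, 50000, 50000, 55000, 60000, 85000],
})

# Mean = sum of all salaries / number of employees = 385000 / 7
mean_salary = employees["Salary"].mean()
print(mean_salary)  # 55000.0
```

Note that the single high salary (85000) pulls the mean above most of the individual salaries, which is why the mean is sensitive to outliers.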

Median:

The median is the middle value that divides the data into two equal halves once the data is sorted in ascending order.

If the total number of data points (n) is odd, the median is the value at position (n+1)/2.

When the total number of observations (n) is even, the median is the average value of observations at n/2 and (n+2)/2 positions.

The median() function in Python can help you find the median value of a column. From the above data frame, you can find the median salary as:

[Figure: code and output for the median salary]
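The two position rules above can be written out directly. The function name `median_by_position` and the data are illustrative assumptions; the result is compared with `statistics.median` from the Python standard library.

```python
import statistics

def median_by_position(values):
    """Median using the 1-based position rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: value at position (n + 1) / 2 (subtract 1 for 0-based indexing).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values at positions n / 2 and (n + 2) / 2.
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

odd_data = [40000, 45000, 50000, 50000, 55000, 60000, 85000]  # hypothetical salaries
even_data = [3, 1, 4, 2]

print(median_by_position(odd_data))   # 50000
print(median_by_position(even_data))  # 2.5
print(statistics.median(odd_data), statistics.median(even_data))
```

With pandas, the equivalent call on a column would be the `median()` method of that column.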

Mode:

The mode is the observation (value) that occurs most frequently in the data set. A data set can have more than one mode.

Given below are the heights of students (in cm) in a class:

155, 157, 160, 159, 162, 160, 161, 165, 160, 158

Mode = 160 cm.

The mode salary from the data frame can be calculated as:

[Figure: code and output for the mode salary]
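As a quick check of the heights example, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8+), which returns every value that ties for the highest count. The second list is a made-up example of a data set with two modes.

```python
import statistics

# Heights of students (in cm) from the example above.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]

modes = statistics.multimode(heights_cm)
print(modes)  # [160]

# A data set can have more than one mode.
two_modes = statistics.multimode([1, 1, 2, 2, 3])
print(two_modes)  # [1, 2]
```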

Variance and Standard Deviation

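Variance measures how far the data points spread out from the mean. For n observations x_i with mean \bar{x}, the variance is:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

When the data is a sample rather than the whole population, the sum is usually divided by n - 1 instead of n.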

Consider the dataset below.

[Figure: employee data frame]

To calculate the variance of the Grade column, use the following:

[Figure: code and output for the variance of Grade]

Standard deviation is the square root of the variance. Both variance and standard deviation are measures of fit: they indicate how well the mean represents the data. For n observations x_i with mean \bar{x}:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

You can find the standard deviation using the std() function in Python.

[Figure: code and output for the standard deviation of Grade]
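The following sketch computes both the population and the sample versions of variance and standard deviation with the Python standard library. The grade values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math
import statistics

# Hypothetical grades; mean = 40 / 8 = 5, sum of squared deviations = 32.
grades = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(grades)  # 32 / 8 (divide by n)
sample_variance = statistics.variance(grades)       # 32 / 7 (divide by n - 1)

population_std = statistics.pstdev(grades)  # square root of the population variance
sample_std = statistics.stdev(grades)       # square root of the sample variance

print(population_variance, population_std)
print(round(sample_variance, 4), round(sample_std, 4))
```

Be aware that libraries differ in their defaults: the pandas `var()` and `std()` methods divide by n - 1 by default, while NumPy's `var` and `std` divide by n unless `ddof=1` is passed.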

Range and Interquartile Range

Range:

The range in statistics is the difference between the maximum and the minimum value of the dataset.

$$\text{Range} = x_{\max} - x_{\min}$$

Interquartile Range (IQR):

The IQR is the distance between the 1st quartile (Q1) and the 3rd quartile (Q3), so it covers the middle 50% of the data.

$$\text{IQR} = Q_3 - Q_1$$
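Both measures can be computed for the student heights used in the mode example. This sketch relies on `statistics.quantiles` from the standard library with `method="inclusive"`; other quartile conventions exist and can give slightly different Q1 and Q3 values, so treat the choice of method as an assumption.

```python
import statistics

# Heights of students (in cm) from the mode example.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]

# Range: maximum minus minimum.
data_range = max(heights_cm) - min(heights_cm)

# Quartiles: n=4 splits the data into four parts and returns [Q1, Q2, Q3].
q1, q2, q3 = statistics.quantiles(heights_cm, n=4, method="inclusive")
iqr = q3 - q1

print(f"Range = {data_range}")                      # Range = 10
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")         # Q1 = 158.25, Q3 = 160.75, IQR = 2.5
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so a single extreme value has little effect on it.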

Skewness and Kurtosis

Skewness:

Skewness measures the asymmetry of a distribution. A distribution is symmetrical when the proportion of data at an equal distance from the mean (or median) is equal on both sides. If the tail extends further to the right, the distribution is right-skewed (positive skew), and if the tail extends further to the left, it is left-skewed (negative skew).

[Figure: skewness of a distribution]

Kurtosis:

Kurtosis measures how heavy the tails of a distribution are, that is, how prone the distribution is to extreme values compared with a normal distribution.

[Figures: skewness and kurtosis illustration; skewness of the Salary, Hours, and Grade columns]
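One common way to quantify both ideas is with central moments: skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3, where mk is the kth central moment. The sketch below implements these plain moment-based definitions. The function names and data are illustrative, and library functions (for example in pandas) may apply a small-sample bias correction, so their results can differ slightly.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]    # mirror image

print(skewness(symmetric))      # 0.0
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative
print(excess_kurtosis(symmetric))
```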

Now, it’s time to discuss a very popular distribution in statistics for machine learning: the Gaussian distribution.


Gaussian Distribution

In statistics and probability, the Gaussian (normal) distribution is a widely used continuous probability distribution for real-valued random variables. It is characterized by two parameters: the mean μ and the standard deviation σ. Many natural phenomena approximately follow a normal distribution, such as the heights of people and IQ scores.

[Figure: Gaussian distribution]

Properties of Gaussian Distribution:

  • The mean, median, and mode are the same
  • It has a symmetrical bell shape
  • About 68% of the data lies within 1 standard deviation of the mean
  • About 95% of the data lies within 2 standard deviations of the mean
  • About 99.7% of the data lies within 3 standard deviations of the mean

[Figure: code for generating and plotting a Gaussian distribution, with the resulting plot]
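The 68-95-99.7 percentages listed above can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `prob_within` and the example parameters (mean 100, standard deviation 15) are assumptions for illustration.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {prob_within(k):.4f}")

# The proportions do not depend on the particular mean and standard deviation.
print(f"{prob_within(1, mu=100, sigma=15):.4f}")
```

The printed values are approximately 0.6827, 0.9545, and 0.9973, which round to the familiar 68%, 95%, and 99.7%.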

Central Limit Theorem

According to the central limit theorem, given a population with mean μ and standard deviation σ, if you take sufficiently large random samples from the population, then the distribution of the sample means will be approximately normal, irrespective of the original population distribution. The sample means are centered on μ and have a standard deviation of about σ/√n, where n is the sample size.

Rule of Thumb: A sample size of at least 30 is usually considered large enough for this approximation to work well.

[Figure: central limit theorem]
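A small simulation can make the theorem concrete. The sketch below draws repeated samples of size 30 from an exponential population (strongly right-skewed, with mean 1 and standard deviation 1) and looks at the distribution of the sample means. The population, seed, and number of samples are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable
SAMPLE_SIZE = 30
N_SAMPLES = 2000

def draw_sample_mean(size):
    """Mean of one random sample from an exponential population (mean 1, std 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(size)])

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.pstdev(sample_means)
expected_std = 1.0 / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(f"Mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"Std of sample means:  {std_of_means:.3f} (sigma / sqrt(n) = {expected_std:.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the underlying population is far from normal.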

Now, you will learn a very critical concept in statistics for machine learning: hypothesis testing.

Hypothesis Testing

Hypothesis testing is a statistical analysis to make decisions using experimental data. It allows you to statistically back up some findings you have made in looking at the data. In hypothesis testing, you make a claim and the claim is usually about population parameters such as mean, median, standard deviation, etc.

  • The assumption made for a statistical test is called the null hypothesis (H0).
  • The alternative hypothesis (H1) contradicts the null hypothesis: it states that the assumption does not hold at some level of significance.

Hypothesis testing lets you decide to either reject or retain a null hypothesis.

Example:

  • H0: The average BMI of boys and girls in a class is the same.
  • H1: The average BMI of boys and girls in a class is not the same.

To determine whether a finding is statistically significant, you need to interpret the p-value. It is common to compare the p-value to a threshold value called the significance level.

The significance level is often set to 5% (0.05).

If the p-value > 0.05: fail to reject (retain) the null hypothesis.

If the p-value ≤ 0.05: reject the null hypothesis.

Some popular hypothesis tests are:

  • Chi-square test
  • T-test
  • Z-test
  • Analysis of Variance (ANOVA)
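As an illustration of the workflow, here is a minimal sketch of a two-sample Z-test applied to the BMI example, using only the Python standard library. Everything specific in it is an assumption: the BMI values are made up, the population standard deviations are treated as known (which a Z-test requires), and the function names `two_sample_z_test` and `decide` are hypothetical. For a small class with unknown standard deviations, a T-test would usually be the more appropriate choice.

```python
import math
from statistics import NormalDist, fmean

def two_sample_z_test(a, b, sigma_a, sigma_b):
    """Return (z, two-sided p-value) for H0: both population means are equal."""
    standard_error = math.sqrt(sigma_a**2 / len(a) + sigma_b**2 / len(b))
    z = (fmean(a) - fmean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical BMI measurements; sigma values are assumed known.
boys_bmi = [21.5, 22.1, 20.8, 23.0, 21.9, 22.4, 20.5, 21.7]
girls_bmi = [20.9, 21.8, 21.2, 22.5, 20.7, 21.4, 22.0, 21.1]

z, p = two_sample_z_test(boys_bmi, girls_bmi, sigma_a=1.0, sigma_b=1.0)
print(f"z = {z:.3f}, p-value = {p:.3f}, decision: {decide(p)}")
```

The decision step is the same for the other tests in the list: compute a test statistic, convert it to a p-value, and compare the p-value with the chosen significance level.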

Conclusion

Statistics is a core component of machine learning. It helps you draw meaningful conclusions by analyzing raw data. In this article on Statistics for Machine Learning, you covered all the critical concepts that are widely used to make sense of data. 


About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
