Statistics is a core component of data analytics and machine learning. It helps you analyze and visualize data to find hidden patterns. If you are interested in machine learning and want to grow your career in it, then learning statistics along with programming should be the first step. In this article, you will learn the core statistical concepts used in machine learning.
What Is Statistics?
Statistics is a branch of mathematics that deals with collecting, analyzing, interpreting, and visualizing empirical data. Descriptive statistics and inferential statistics are the two major areas of statistics. Descriptive statistics describe the properties of sample and population data (what has happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make predictions (what you can expect).
Use of Statistics in Machine Learning
- Asking questions about the data
- Cleaning and preprocessing the data
- Selecting the right features
- Model evaluation
- Model prediction
With this basic understanding, it’s time to dive deep into learning all the crucial concepts related to statistics for machine learning.
Population and Sample
In statistics, the population comprises all observations (data points) about the subject under study.
An example of a population is all the voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.
In statistics, a sample is a subset of the population. It is a small portion of the total observed population.
An example of a sample is the first-time voters analyzed for an opinion poll.
Measures of Central Tendency
Measures of central tendency describe the center of a distribution of data using a single value. The mean, median, and mode are the three measures of central tendency.
The arithmetic mean is the average of all the data points.
If there are n observations and x_i is the i-th observation, then the mean is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Consider the data frame below that has the names of seven employees and their salaries.
To find the mean or the average salary of the employees, you can use the mean() function in Python.
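The data frame itself is not reproduced in this article, so the following is only a minimal sketch. It assumes seven hypothetical employees and salaries (the names and values are made up for illustration) and uses Python's standard-library statistics module. With a pandas data frame, the equivalent would be calling mean() on the salary column.

```python
from statistics import mean

# Hypothetical data: seven employees and their salaries.
# These are NOT the values from the article's original data frame.
salaries = {
    "Asha": 42000,
    "Bilal": 51000,
    "Chen": 38000,
    "Divya": 60000,
    "Emeka": 45000,
    "Farah": 51000,
    "Gita": 49000,
}

# Arithmetic mean: sum of all salaries divided by the number of employees.
salary_values = list(salaries.values())
mean_salary = mean(salary_values)

print(f"Mean salary: {mean_salary}")
```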
The median is the middle value that divides the data into two equal parts once the data is sorted in ascending order.
If the total number of data points (n) is odd, the median is the value at position (n+1)/2.
When the total number of observations (n) is even, the median is the average value of observations at n/2 and (n+2)/2 positions.
The median() function in Python can help you find the median value of a column. From the above data frame, you can find the median salary as:
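As a rough illustration of the two position rules above, the sketch below implements them by hand and compares the result with statistics.median() from the standard library. The salaries are hypothetical, and median_by_position is a helper name invented for this example.

```python
from statistics import median

def median_by_position(values):
    """Median using the 1-based position rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values at positions n / 2 and (n + 2) / 2.
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

# Hypothetical salaries (not the article's original data).
salaries = [42000, 51000, 38000, 60000, 45000, 51000, 49000]

odd_median = median_by_position(salaries)        # n = 7
even_median = median_by_position(salaries[:6])   # n = 6

print(odd_median, median(salaries))
print(even_median, median(salaries[:6]))
```

On a pandas data frame, calling median() on the salary column should give the same value as the first line of output.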
The mode is the observation (value) that occurs most frequently in the data set. There can be more than one mode in a dataset.
Given below are the heights of students (in cm) in a class:
155, 157, 160, 159, 162, 160, 161, 165, 160, 158
Mode = 160 cm.
The mode salary from the data frame can be calculated as:
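Here is a small sketch using statistics.multimode() from the standard library (Python 3.8+), which returns every value tied for the highest frequency. The heights are the ones listed above; the salaries and the last list are hypothetical, added only for illustration.

```python
from statistics import multimode

# Heights of students (in cm), as listed in the article.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]
height_modes = multimode(heights_cm)

# Hypothetical salaries (not the article's original data frame).
salaries = [42000, 51000, 38000, 60000, 45000, 51000, 49000]
salary_modes = multimode(salaries)

# A made-up dataset with two modes, to show that ties are all reported.
two_modes = multimode([1, 1, 2, 2, 3])

print(height_modes, salary_modes, two_modes)
```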
Variance and Standard Deviation
Variance measures how far the data points spread out from the mean. It is the average of the squared deviations from the mean.
Consider the below dataset.
To calculate the variance of the Grade, use the following:
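The dataset is not reproduced here, so this sketch assumes a hypothetical list of grades. It computes both the population variance (divide by n) and the sample variance (divide by n - 1) with the standard-library statistics module. Note that pandas' var() uses the sample version by default.

```python
from statistics import mean, pvariance, variance

# Hypothetical grades (not the article's original dataset).
grades = [70, 80, 90, 60, 100]

# Manual calculation: sum of squared deviations from the mean.
grade_mean = mean(grades)
squared_deviations = [(g - grade_mean) ** 2 for g in grades]
manual_population_var = sum(squared_deviations) / len(grades)
manual_sample_var = sum(squared_deviations) / (len(grades) - 1)

# Library equivalents.
population_var = pvariance(grades)  # divides by n
sample_var = variance(grades)       # divides by n - 1

print(population_var, sample_var)
```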
Standard deviation is the square root of the variance. Both measure how spread out the data is around the mean, and therefore how well the mean represents the data.
You can find the standard deviation using the std() function in Python.
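As a hedged sketch on hypothetical grades, the standard-library functions stdev() and pstdev() give the sample and population standard deviation. pandas' std() corresponds to the sample version by default. The check at the end confirms that squaring the standard deviation recovers the variance.

```python
import math
from statistics import pstdev, stdev, variance

# Hypothetical grades (not the article's original dataset).
grades = [70, 80, 90, 60, 100]

sample_sd = stdev(grades)       # square root of the sample variance
population_sd = pstdev(grades)  # square root of the population variance

# Squaring the standard deviation should give back the variance.
matches_variance = math.isclose(sample_sd ** 2, variance(grades))

print(round(sample_sd, 2), round(population_sd, 2), matches_variance)
```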
Range and Interquartile Range
The Range in statistics is the difference between the maximum and the minimum value of the dataset.
The interquartile range (IQR) is the distance between the 1st quartile (Q1) and the 3rd quartile (Q3): IQR = Q3 - Q1.
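The following sketch computes both measures for the student heights used earlier. It relies on statistics.quantiles() with its default "exclusive" method. Other tools use different quartile conventions, so the exact Q1 and Q3 values may differ slightly between libraries.

```python
from statistics import quantiles

# Heights of students (in cm), reused from the mode example.
heights_cm = [155, 157, 160, 159, 162, 160, 161, 165, 160, 158]

# Range: maximum minus minimum.
data_range = max(heights_cm) - min(heights_cm)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(heights_cm, n=4)
iqr = q3 - q1

print(f"Range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Because the IQR ignores the lowest and highest quarters of the data, it is less sensitive to outliers than the range.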
Skewness and Kurtosis
Skewness measures the asymmetry of a distribution. A distribution is symmetrical when the proportion of data at an equal distance on either side of the mean (or median) is equal. If the tail extends further to the right, the distribution is right-skewed (positively skewed), and if the tail extends further to the left, it is left-skewed (negatively skewed).
Kurtosis describes how heavy the tails of a distribution are, that is, how prone the distribution is to producing extreme values. Together with skewness, it characterizes the shape of a probability distribution.
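To make the two ideas concrete, here is a minimal sketch that computes moment-based skewness and excess kurtosis by hand. The datasets and function names are hypothetical. These are the simple (biased) moment estimators, so library functions that apply a bias correction may return somewhat different numbers.

```python
from statistics import fmean

def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    mu = fmean(values)
    return fmean([(x - mu) ** order for x in values])

def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3.

    Subtracting 3 makes a normal distribution score 0.
    """
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Made-up datasets for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 15]       # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image: tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric), excess_kurtosis(right_skewed))
```

A positive skewness indicates a right tail, a negative one a left tail, and a larger excess kurtosis indicates heavier tails.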
Now, it’s time to discuss a very popular distribution in statistics for machine learning, i.e., Gaussian Distribution.
In statistics and probability, the Gaussian (normal) distribution is a widely used continuous probability distribution for a real-valued random variable. It is characterized by two parameters: the mean μ and the standard deviation σ. Many natural phenomena approximately follow a normal distribution, such as the heights of people and IQ scores.
Properties of Gaussian Distribution:
- The mean, median, and mode are the same
- It has a symmetrical bell shape
- About 68% of the data lies within 1 standard deviation of the mean
- About 95% of the data lies within 2 standard deviations of the mean
- About 99.7% of the data lies within 3 standard deviations of the mean
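The percentages in the list above can be checked numerically. The sketch below uses statistics.NormalDist from the standard library (Python 3.8+). The mean of 100 and standard deviation of 15 are assumed values chosen for illustration; the coverage does not depend on them.

```python
from statistics import NormalDist

# Assumed parameters for illustration (an IQ-like scale).
dist = NormalDist(mu=100, sigma=15)

def share_within(distribution, k):
    """Probability of falling within k standard deviations of the mean."""
    upper = distribution.mean + k * distribution.stdev
    lower = distribution.mean - k * distribution.stdev
    return distribution.cdf(upper) - distribution.cdf(lower)

coverage = {k: share_within(dist, k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"Within {k} standard deviation(s): {share:.4f}")
```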
Central Limit Theorem
According to the central limit theorem, given a population with mean μ and standard deviation σ, if you repeatedly take large random samples of size n from the population, then the distribution of the sample means will be approximately normal, with mean μ and standard deviation σ/√n, irrespective of the shape of the original population distribution (provided its variance is finite).
Rule of thumb: A sample size of 30 or more is usually considered large enough for this approximation to work well, although heavily skewed populations may need larger samples.
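A quick simulation can illustrate the theorem. This sketch assumes an exponential population with mean 1 and standard deviation 1, chosen here only because it is clearly non-normal. It draws many samples of size 30 and looks at the distribution of their means. The seed, sample count, and variable names are arbitrary choices for the example.

```python
import random
from statistics import fmean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # n, following the rule of thumb
NUM_SAMPLES = 5000   # how many sample means to collect

def draw_sample_mean():
    """Mean of one random sample from an exponential population (mean 1, sd 1)."""
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return fmean(sample)

sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = fmean(sample_means)       # should be close to the population mean, 1
spread_of_means = pstdev(sample_means)    # should be close to sigma / sqrt(n)
expected_spread = 1.0 / SAMPLE_SIZE ** 0.5

print(f"Mean of sample means: {mean_of_means:.3f}")
print(f"Std of sample means:  {spread_of_means:.3f} (theory: {expected_spread:.3f})")
```

Plotting a histogram of sample_means would show a roughly bell-shaped curve even though the population itself is strongly right-skewed.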
Now, you will learn a very critical concept in statistics for machine learning, i.e., Hypothesis testing.
Hypothesis testing is a statistical analysis to make decisions using experimental data. It allows you to statistically back up some findings you have made in looking at the data. In hypothesis testing, you make a claim and the claim is usually about population parameters such as mean, median, standard deviation, etc.
- The assumption made for a statistical test is called the null hypothesis (H0).
- The alternative hypothesis (H1) contradicts the null hypothesis, stating that its assumption does not hold at the chosen level of significance.
Hypothesis testing lets you decide to either reject or retain a null hypothesis.
Example: H0: The average BMI of boys and girls in a class is the same
H1: The average BMI of boys and girls in a class is not the same
To determine whether a finding is statistically significant, you need to interpret the p-value. It is common to compare the p-value to a threshold value called the significance level.
The level of significance is often set to 5% or 0.05.
If the p-value ≥ 0.05 - Fail to reject (retain) the null hypothesis.
If the p-value < 0.05 - Reject the null hypothesis.
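To show the decision rule in action on the BMI example, here is a hedged sketch. The article does not name a specific test for this example, so the code uses a permutation test, chosen here only because it needs nothing beyond the standard library. The BMI values, the function name, and the seed are all hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for H0: both groups have the same mean.

    Repeatedly shuffles the pooled data and counts how often the shuffled
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / n_permutations

# Hypothetical BMI values for boys and girls in a class.
boys_bmi = [21.5, 22.1, 19.8, 23.4, 20.9, 22.7, 21.0, 24.1]
girls_bmi = [20.2, 21.8, 19.5, 22.0, 20.7, 21.1, 19.9, 22.5]

alpha = 0.05  # significance level
p_value = permutation_p_value(boys_bmi, girls_bmi)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"p-value = {p_value:.4f}: {decision}")
```

In practice, a parametric test from a statistics library is a common alternative when its assumptions are reasonable for the data.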
Some popular hypothesis tests are:
- Chi-square test
- Analysis of Variance (ANOVA)
Statistics is a core component of machine learning. It helps you draw meaningful conclusions by analyzing raw data. In this article on Statistics for Machine Learning, you covered all the critical concepts that are widely used to make sense of data.