The Central Limit Theorem (CLT) is a pillar of statistics and machine learning, and it lies at the heart of hypothesis testing. In this tutorial, you will learn what the CLT says and where it is applied.
What is the Central Limit Theorem?
The CLT states that if you repeatedly draw sufficiently large random samples from a population with finite variance, the distribution of the sample means will be approximately normal, regardless of the shape of the population distribution. The mean of those sample means will be roughly equal to the population mean.
Suppose there are 15 sections in Class X, and each section has 50 students. The task is to calculate the average marks of the students in Class X.
The standard approach is to calculate the average directly:
- Collect the marks of all the students in Class X
- Add all the marks
- Divide the total marks by the total number of students
But what if the data set is extremely large? Then this is not a good approach: collecting and adding the marks of every student becomes tedious and time-consuming. Let's take a look at an alternative approach.
- To begin, select groups of students from the class at random. Each group is referred to as a sample. Create several samples, each with 30 students.
- Calculate each sample's individual mean.
- Calculate the average of these sample means.
- This value gives the approximate average marks of the students in Class X.
- The histogram of the sample means will resemble a bell curve, that is, a normal distribution.
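The sampling approach above can be sketched in Python. This is a minimal illustration, not real data: the marks are randomly generated integers between 0 and 100, and the names `population` and `sample_means`, as well as the choice of 200 samples, are assumptions made for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical data: 15 sections x 50 students = 750 marks between 0 and 100.
population = [random.randint(0, 100) for _ in range(15 * 50)]
population_mean = statistics.fmean(population)

# Draw several random samples of 30 students each (no student appears
# twice within one sample) and record each sample's mean.
num_samples = 200  # assumed number of samples
sample_size = 30
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

# The average of the sample means approximates the population mean.
mean_of_sample_means = statistics.fmean(sample_means)

print(f"Population mean:      {population_mean:.2f}")
print(f"Mean of sample means: {mean_of_sample_means:.2f}")
```

The two printed values should be close to each other, even though each sample covers only 30 of the 750 students.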
Significance of Central Limit Theorem
The CLT has several applications. Look at the places where you can use it.
- Political/election polling is a great example of how you can use CLT. These polls are used to estimate the number of people who support a specific candidate. You may have seen these results with confidence intervals on news channels. The CLT aids in this calculation.
- The CLT is used in census-style surveys to estimate population characteristics such as family income, electricity consumption, individual salaries, and so on.
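As an illustration of the polling case, the sketch below computes an approximate 95% confidence interval for a candidate's support using the normal approximation that the CLT justifies. The poll numbers (520 supporters out of 1,000 respondents) are hypothetical, and the variable names are chosen only for this example.

```python
import math

# Hypothetical poll: 520 of 1,000 randomly chosen respondents
# say they support a candidate.
respondents = 1000
supporters = 520

p_hat = supporters / respondents  # sample proportion

# Standard error of a sample proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / respondents)

z = 1.96  # critical value for an approximate 95% confidence level
margin_of_error = z * standard_error
lower = p_hat - margin_of_error
upper = p_hat + margin_of_error

print(f"Estimated support: {p_hat:.1%}")
print(f"Approximate 95% CI: {lower:.1%} to {upper:.1%}")
```

With these made-up numbers the margin of error is about three percentage points, which is the kind of figure usually quoted alongside poll results.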
Assumptions Behind the Central Limit Theorem
Before moving on, it is important to understand the assumptions behind the CLT:
- The data must adhere to the randomization rule. It needs to be sampled at random.
- The samples should be unrelated to one another. One sample should not impact the others.
- When taking samples without replacement, the sample size should not exceed 10% of the population.
- The sample size should be sufficiently large. When the population is symmetric, a sample size of 30 is generally considered reasonable; a skewed population calls for a larger sample.
Why n ≥ 30 Samples?
A sample size of 30 is a common rule of thumb for seeing the effect of the CLT. If the population distribution is close to the normal distribution, a smaller sample size is enough for the sample means to look normal. On the other hand, if the population distribution is highly skewed, you will need a larger sample size before the distribution of sample means looks normal.
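The effect of sample size on a skewed population can be checked with a small simulation. The sketch below is one possible illustration: it assumes an exponential population (chosen here only because it is strongly right-skewed) and measures how skewed the distribution of sample means still is for sample sizes of 5, 30, and 200. The helper names `skewness` and `sample_mean_skewness` are made up for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Return the skewness (third standardized moment) of the values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


def sample_mean_skewness(sample_size, num_samples=5000):
    """Skewness of the means of many samples from an exponential population."""
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]
    return skewness(means)


# A skewness near 0 indicates a symmetric, bell-like shape.
skew_by_size = {n: sample_mean_skewness(n) for n in (5, 30, 200)}
for n, skew in skew_by_size.items():
    print(f"sample size {n:>3}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward zero as the sample size grows, which is why a heavily skewed population needs more than 30 observations per sample.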
Mean and Standard Deviation of the Sample
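The mean of the sample mean is equal to the population mean:
$$\mu_{\bar{x}} = \mu$$
The standard deviation of the sample mean (also called the standard error) is the population standard deviation divided by the square root of the sample size n:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$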
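As a worked example, take a fair six-sided die and a sample of 100 rolls. The sketch below uses the standard results that the sample mean has the same expected value as the population and a standard deviation equal to the population standard deviation divided by the square root of the sample size. The variable names are chosen for this example only.

```python
import math
import statistics

faces = [1, 2, 3, 4, 5, 6]  # outcomes of a fair die, all equally likely

mu = statistics.fmean(faces)      # population mean: 3.5
sigma = statistics.pstdev(faces)  # population std dev: sqrt(35/12), about 1.708

n = 100  # sample size (number of rolls per sample)

mean_of_sample_mean = mu                   # stays at 3.5
std_of_sample_mean = sigma / math.sqrt(n)  # about 0.171

print(f"Population mean: {mu}")
print(f"Population standard deviation: {sigma:.3f}")
print(f"Standard deviation of the sample mean (n={n}): {std_of_sample_mean:.3f}")
```

So the means of 100-roll samples are expected to cluster around 3.5, with most of them falling within a few tenths of that value.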
That’s the concept and theory behind the CLT. Now, turn to Python to see how the CLT works in practice.
Implementation of Central Limit Theorem in Python
You can understand the working of the CLT with an example involving the rolling of a die.
A die has a different number on each side, ranging from 1 to 6. Each number has a one-in-six chance of appearing on a roll. Because every outcome is equally likely, the distribution of the numbers that come up from a die roll is uniform.
You will use the randint() function to generate the random numbers ranging from 1 to 6.
The example will generate and print a sample of 100 die rolls along with its mean.
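A minimal sketch of this step is shown below. The text names only `randint()`, so this version assumes the standard library's `random.randint`, which includes both endpoints. If you use NumPy's `numpy.random.randint` instead, note that its upper bound is exclusive, so the call would need 7 as the upper limit.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulate 100 rolls of a fair die; random.randint(1, 6) includes 1 and 6.
rolls = [random.randint(1, 6) for _ in range(100)]
sample_mean = statistics.fmean(rolls)

print(rolls)
print(f"Sample mean: {sample_mean:.2f}")
```

A single sample mean will rarely be exactly 3.5, but it should land close to it.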
You will then repeat the process 1000 times, which gives you 1000 sample means. According to the CLT, the distribution of these sample means will be approximately Gaussian. The example below shows the resulting distribution of sample means.
The following graph shows the distribution of sample means.
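One way to reproduce this experiment without a plotting library is sketched below. It assumes the standard library's `random.randint` and prints a rough text histogram in place of a chart. The helper name `roll_sample_mean` and the bin width of 0.1 are choices made for this example.

```python
import random
import statistics
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable


def roll_sample_mean(num_rolls=100):
    """Roll a fair die num_rolls times and return the mean of the rolls."""
    return statistics.fmean([random.randint(1, 6) for _ in range(num_rolls)])


# Repeat the 100-roll experiment 1000 times and keep each sample mean.
sample_means = [roll_sample_mean() for _ in range(1000)]

print(f"Mean of sample means: {statistics.fmean(sample_means):.3f}")
print(f"Std dev of sample means: {statistics.pstdev(sample_means):.3f}")

# Text histogram: round each mean to one decimal place and count the
# means per bin. Each '#' stands for 5 sample means (rounded down).
bins = Counter(round(m, 1) for m in sample_means)
for value in sorted(bins):
    print(f"{value:.1f} | {'#' * (bins[value] // 5)}")
```

The bars should be tallest near 3.5 and taper off on both sides, giving the bell shape the CLT predicts even though a single die roll is uniformly distributed.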
The central limit theorem is a crucial concept in statistics and, by extension, data science. It is also important to learn about measures of central tendency such as the mean, median, and mode, and about measures of spread such as the standard deviation.