The data you analyze can follow many different distributions, and the graphs of those probability distributions often take distinct, recognizable shapes. Recognizing these shapes helps you identify characteristics of your data and choose the calculations that apply to it.
What is Normal Distribution?
A normal distribution is a continuous probability distribution whose probability density function is a symmetrical bell curve. Simply put, it describes a variable whose values are concentrated around one central point, with fewer and fewer values tapering off symmetrically towards the two opposite ends.
In this definition of a normal distribution, you will explore the following terms:
- Continuous Probability Distribution: A probability distribution where the random variable, X, can take any value within a range, e.g., amount of rainfall. You might record the rainfall received at a certain time as 9 inches, but this is not an exact value. The actual value could be 9.001234 inches or any of infinitely many nearby numbers. Because the probability of any single exact value is effectively zero, you work with ranges of values instead of individual points.
- Probability Density Function: A function that describes how likely a continuous random variable is to fall near each value. The area under its curve between two values gives the probability that the variable falls in that range.
A normal distribution is centered on the mean: most of the data lies near the mean, and the density decreases as you move away from the center. The resulting curve is symmetrical about the mean and forms a bell shape. Consider the graph below, which shows the probability distribution of heights in a class:
Figure 1: Normal Distribution
From the above graph, you can see that the distribution is centered on the mean, or average, of all heights, and that most of the data lies close to it. As you move away from the mean, the probability density decreases. This kind of curve is called a bell curve, and it is the characteristic shape of a normal distribution.
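The bell shape described above can be checked numerically. The following is a minimal sketch, assuming a hypothetical class with a mean height of 165 cm and a standard deviation of 10 cm (these numbers are illustrative and are not taken from the figure). The function name `normal_pdf` is also just a name chosen for this example.

```python
import math

def normal_pdf(x, mean, std_dev):
    """Density of a normal distribution with the given mean and standard deviation at x."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2 * math.pi))

# Hypothetical class (assumed values): mean height 165 cm, standard deviation 10 cm.
MEAN_CM, STD_CM = 165.0, 10.0

for height in (145, 155, 165, 175, 185):
    # The density is highest at the mean and equal at equal distances either side of it.
    print(height, round(normal_pdf(height, MEAN_CM, STD_CM), 5))
```

The printed density peaks at 165 cm and is the same at 155 cm and 175 cm, which is the symmetry the bell curve shows.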
What is Standard Deviation?
The standard deviation is a measure of how spread out your data is, that is, how far the values typically lie from the mean.
You calculate it by subtracting the mean from each data point, squaring those differences, and taking their average; this average is called the variance. The square root of the variance gives you the standard deviation.
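Written as formulas, the calculation looks like this. The symbols are a notational assumption for this sketch: N is the number of data points, x_i is each data point, μ is the mean, σ² is the variance, and σ is the standard deviation. This is the form that averages over all N points, matching the steps described here.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
```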
Just as the mean tells you where the data is centered, the standard deviation tells you how narrow or wide your bell curve is. Consider the example of income in rural and urban areas.
Figure 2: Standard Deviation
In rural areas, say a farming village, most people work in the same profession: farming. They all earn more or less the same, with the zamindar earning the most. Most people here earn close to the average income, as seen by the high peak at the mean. There is not much deviation in the data. Hence, the curve is relatively narrow.
In a city, the population is larger, and people do many different jobs that pay at very different levels. Some people might be businessmen, while others might not even have a fixed income. This leads to more variation in the data, and hence, the curve is more spread out, which means it has a higher standard deviation.
Now, understand standard deviation with the help of an example.
Consider the example of heights of dogs given below:
Figure 3: Dog Heights
You first find the mean, or the average of all these values by adding them all up and dividing the resulting sum by the number of data points.
Figure 4: Mean Height
This means that on average, a dog is 394 mm tall. Now, subtract the mean from each data point.
Figure 5: Difference between heights and mean
The negative values imply that the value lies below the mean and positive values tell you that the data point lies above the mean. A 0 value means that the data point is the same as the mean. Now, let's square each value and find their average to get the variance.
Figure 6: Standard deviation in dog heights data
Finding the square root of the variance gives you your standard deviation. In this case, it is 147 mm. This tells you that a typical dog's height differs from the mean by roughly 147 mm. Relative to a mean of 394 mm, that is a sizeable spread, so the curve for this data would be fairly wide rather than tall and narrow.
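The steps of this example can be sketched in Python. The heights below are assumed values, chosen because they reproduce the 394 mm mean and 147 mm standard deviation quoted in the text; they are not read from the original figure.

```python
import math

# Assumed heights in mm, chosen to match the 394 mm mean and 147 mm
# standard deviation quoted in the text; not read from the original figure.
heights_mm = [600, 470, 170, 430, 300]

mean = sum(heights_mm) / len(heights_mm)
deviations = [h - mean for h in heights_mm]                    # data point minus mean
variance = sum(d ** 2 for d in deviations) / len(heights_mm)   # average squared deviation
std_dev = math.sqrt(variance)

print(mean)            # 394.0
print(deviations)      # [206.0, 76.0, -224.0, 36.0, -94.0]
print(variance)        # 21704.0
print(round(std_dev))  # 147
```

The negative deviations belong to the dogs shorter than the mean, and squaring removes the sign before the values are averaged.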
What is Standard Normal Distribution?
A Standard Normal Distribution is a normal distribution with a mean of 0 and a standard deviation of 1. This means that the curve is centered at 0 and each unit on the x-axis corresponds to one standard deviation.
The mean and standard deviation of a normal distribution are not fixed. They can take on any value. However, when you standardize a normal distribution, the mean becomes 0 and the standard deviation becomes 1, and these values are the same for every standard normal distribution. Consider the example given below of weights of students in a class:
Figure 7: Standard Normal Distribution
The graph gives the actual weights of the students along the x-axis, and you can see that the marked points are 5 units apart. The mean is 50, so you can take this as the 0 point. The rest of the points are equally spaced and, on standardizing, differ by 1, so you can rewrite the scale to be centered on 0 and increase in steps of 1. Points above the mean fall on positive values, and points below the mean fall on negative values.
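The rescaling can be sketched as follows. The mean of 50 and the spacing of 5 come from the text; treating that spacing as one standard deviation is inferred from the statement that the standardized points differ by 1, and the range of tick values from 35 to 65 is an assumption made for illustration.

```python
# Mean and spacing are taken from the text; one standard deviation is
# inferred to be 5, and the tick range 35..65 is assumed for illustration.
MEAN_WEIGHT = 50
STD_WEIGHT = 5

weights = list(range(35, 70, 5))  # 35, 40, ..., 65
standardized = [(w - MEAN_WEIGHT) / STD_WEIGHT for w in weights]

for w, z in zip(weights, standardized):
    print(w, z)
```

Each weight maps to a whole number between -3.0 and 3.0, with the mean of 50 landing on 0.0.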
When you standardize your data, calculating the probabilities in your graph becomes easier. You can also easily compare different graphs with one another, as they all have the same scale. Some features of a Standard Normal Distribution are given below:
Figure 8: Characteristics of Standard Normal Distribution
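To illustrate why probabilities become easier to calculate on the standard scale, here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. It computes the probability that a value lies within 1, 2, or 3 standard deviations of the mean. The helper name `prob_within` is chosen for this example only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a standard normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The results are roughly 0.68, 0.95, and 0.997, and they hold for any normal distribution once it has been standardized.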
What is Z-Score?
The z-score tells you how far a data point is from the mean. You calculate it using the mean and standard deviation, so you can also say that the z-score is how many standard deviations above or below the mean the data point is.
The z-score is used to standardize your normal distribution. Using the z-score, you can express each data point in terms of the mean and standard deviation, effectively converting the graph into a rescaled version. So, with the mean and standard deviation, you can plot all points on your graph.
The z-score is given by:
Figure 9: Z-Score
Let us represent each data point by ‘x’, then the formula for z-score becomes:
Figure 10: Z-Score formula
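In standard notation, with x as the data point, μ as the mean, and σ as the standard deviation (the symbols μ and σ are a notational assumption here), the z-score is usually written as:

```latex
z = \frac{x - \mu}{\sigma}
```

A positive z means the data point lies above the mean, a negative z means it lies below, and z = 0 means it equals the mean.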
Now, understand the z-score with the help of an example. A summary of the daily travel time of a person commuting from work is given below. The values are in minutes. Calculate the Mean, Standard Deviation, and Z-Score.
Figure 11: Commuting time
The mean is the average of all values:
Figure 12: Mean Commuting time
Now, subtract the mean from each data point and find the variance and standard deviation.
Figure 13: Differenced Commuting Time
Figure 14: Variance in Commuting Time
Figure 15: Standard Deviation in Commuting Time
The z-score tells you where a data point falls relative to the other points: how far from the mean it is, in steps of your standard deviation. Now, calculate the z-score for each point:
Figure 16: Z-Score of Commuting Time
The negative values tell you that the point lies below the mean, and positive values imply that the point is above the mean. Multiplying each z-score by the standard deviation gives back the difference between the data point and the mean.
Overall, each value has now been standardized. You can plot a new graph with the mean at the center, at 0.
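The whole procedure (mean, variance, standard deviation, then z-scores) can be sketched in a few lines of Python. The commuting times below are hypothetical values assumed for illustration; they are not the values from the original figure.

```python
import math

# Hypothetical commuting times in minutes (assumed for illustration;
# not the values from the original figure).
commute_minutes = [22, 26, 30, 34, 38, 30]

mean = sum(commute_minutes) / len(commute_minutes)
variance = sum((x - mean) ** 2 for x in commute_minutes) / len(commute_minutes)
std_dev = math.sqrt(variance)

# z-score: distance from the mean in units of the standard deviation
z_scores = [(x - mean) / std_dev for x in commute_minutes]

for x, z in zip(commute_minutes, z_scores):
    # z * std_dev recovers the difference between the data point and the mean
    print(x, round(z, 2), round(z * std_dev, 2))
```

After standardizing, the z-scores average to 0 and have a standard deviation of 1, whatever the original units were.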
In this tutorial on Everything You Need to Know About the Normal Distribution, you looked at the normal distribution and how to recognize it. You then looked at standard deviation, saw why standardizing a normal distribution is useful, and finally explored the z-score with a solved example.