An In-Depth Explanation Of Cumulative Distribution Function

Lesson 8 of 24, by Avijeet Biswal

Last updated on Jun 4, 2024

The cumulative distribution function is an essential part of statistics. It helps you find the probability that a random variable falls within a specific range. This tutorial will teach you the basics of the cumulative distribution function and how to implement it in Python.

What Is the Cumulative Distribution Function?

The cumulative distribution function (CDF) describes the probability distribution of a random variable, whether that variable is discrete, continuous, or mixed. For a continuous variable, it is obtained by integrating the probability density function (PDF) up to a given point. For a discrete variable, it is obtained by summing the probabilities of all values up to that point. Either way, the result is the cumulative probability for the random variable.

The probability density function describes how probability is distributed over the values of a continuous random variable. Evaluating it at a point gives the probability density at that point, not a probability. Probabilities come from the area under the PDF over an interval.

The cumulative distribution function of a random variable X, evaluated at a point x, is written F_X(x). It is the probability that X takes a value less than or equal to x.

Consider the diagram shown below. It shows the probability density function f(x) of a variable that is uniformly distributed between the points a and b, which forms a rectangle when plotted. Between a and b, f(x) has the constant value 1/(b-a).

Figure 1: Probability Density Function

Now consider a point c on the x-axis between a and b. This is the point at which you need to find the cumulative distribution function. According to the definition, you need the total area under the probability density function up to point c, which is the area of the rectangle between points a and c.

Figure 2: Calculating the CDF

You can do this by multiplying the length and breadth of the rectangle. The breadth is the distance between a and c, which is c - a, and the length is the value of the probability density function, 1/(b-a). In the end, you get the CDF as:

F(c) = (c - a) × 1/(b - a) = (c - a)/(b - a)

Figure 3: CDF
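
As a minimal sketch of the rectangle calculation above, the Python function below returns the CDF of a variable spread evenly between a and b. The function name uniform_cdf and the example numbers are illustrative assumptions, not part of the original tutorial.

```python
def uniform_cdf(c, a, b):
    """CDF of a uniform distribution on [a, b], evaluated at the point c."""
    if b <= a:
        raise ValueError("b must be greater than a")
    if c <= a:
        return 0.0  # no area has accumulated to the left of a
    if c >= b:
        return 1.0  # the whole rectangle lies to the left of c
    # Area of the rectangle from a to c: breadth (c - a) times length 1 / (b - a).
    return (c - a) / (b - a)


print(uniform_cdf(2.5, a=0.0, b=10.0))  # prints 0.25
```

The two early returns cover points outside the rectangle, where the CDF is 0 before a and 1 after b.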

Since the cumulative distribution function is the total area under the probability density function up to a certain point x, it can be represented as the probability that the random variable X is less than or equal to x.

F_X(x) = P(X ≤ x)

Figure 4: CDF representation

Because the CDF accumulates the PDF, you can also write it as an integral of the PDF. The formula below gives the probability that X lies between two points a and b, for a variable with the PDF f_X(x).

P(a < X ≤ b) = F_X(b) - F_X(a) = ∫_a^b f_X(x) dx

Figure 5: CDF as the integration of PDF
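
The link between the integral of the PDF and the CDF can be checked numerically. The sketch below assumes the uniform example from Figure 1, with a and b as the ends of the rectangle and the hypothetical names lo and hi for the two points between which the probability is wanted. It uses a simple midpoint rule, which is one of several ways to approximate an integral.

```python
def uniform_pdf(x, a, b):
    """PDF of a uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


a, b = 0.0, 10.0   # ends of the rectangle (illustrative values)
lo, hi = 2.0, 7.0  # points between which the probability is wanted

# Area under the PDF between lo and hi.
area = integrate(lambda x: uniform_pdf(x, a, b), lo, hi)

# The same probability from the CDF of the uniform example: F(hi) - F(lo).
cdf_difference = (hi - a) / (b - a) - (lo - a) / (b - a)

print(area, cdf_difference)
```

Both values should agree up to floating-point error, since the area under the PDF between two points equals the difference of the CDF at those points.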

Understanding the Cumulative Distribution Function With the IRIS Dataset

In this case study, you will be looking at the Iris dataset, which contains information on the sepal length, sepal width, petal length, and petal width of three different species of Iris:

Iris Setosa
Iris Versicolor
Iris Virginica

Figure 6: Iris Dataset

All the values are in centimeters. The dataset contains 50 data points for each species. The goal is to find a reliable measure that you can use to differentiate the species from each other.

Now, plot each feature of the dataset as a histogram. Use a different color for each species to see how the distributions overlap with each other. This is a way of approximating the PDF of the data.

Figure 7: Iris Dataset PDF

The above graphs are as follows from top left to bottom right:

Sepal_length: In this graph, the sepal lengths of all three species overlap considerably. Hence, it is tough to set ranges that you can use to differentiate the flowers.
Sepal_width: This graph has even more overlap. It is also not a feature that you can use to differentiate between the flowers correctly.
Petal_length: This graph has far less overlap than the other two. The range for Setosa has no overlap with any other species, and Versicolor and Virginica overlap only slightly. You can easily find the ranges that most of the petal lengths fall into for each species.
Petal_width: This graph has significantly less overlap than the sepal features, but you can see a little bit of overlap between Setosa, Versicolor and Virginica.

Hence, you can conclude that the petal length is the best parameter for differentiating between iris species.

Next, cumulatively add up the PDF values and plot the result to see the CDF for the iris data.

Figure 8: Iris Dataset PDF and CDF

From the above graph, you can notice three things:

Petal length < 1.9 is most definitely ‘Setosa’. You can say this because the petal lengths for Setosa fall in this range and do not coincide with those of Versicolor or Virginica.
3.2 < Petal length < 5 has a 95% chance of being ‘Versicolor’. Versicolor and Virginica overlap slightly between 4.7 and 5. Hence, a machine learning model may mistake the two, but the chances of that happening are very low.
Petal length > 5 has a 90% chance of being ‘Virginica’. Again, there is a slight overlap between Versicolor and Virginica in the 5-5.2 cm region, which is the cause of the slight error.
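
Statements like the ones above come from reading a cumulative curve: the CDF at a value x is the fraction of flowers whose petal length is at most x. A minimal sketch of that idea follows. The function name ecdf and the sample values are hypothetical and are not the actual Iris measurements.

```python
def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v <= x for v in values) / len(values)


# Hypothetical petal lengths in cm for one species (illustrative only).
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.6, 1.7, 1.9]

for x in (1.0, 1.4, 1.9):
    print(x, ecdf(petal_lengths, x))
```

When the empirical CDF of a species reaches 1 at some value, every flower of that species in the sample has a petal length at or below that value.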

Implementing Cumulative Distribution Function With Python

Now, see how you can implement the cumulative distribution function in Python. Let’s start by importing the necessary libraries.

Figure 9: Importing necessary modules

Next, read in our iris dataset.

Figure 10: Importing Iris dataset
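
The original code for this step is only available as an image, so the block below is a stand-in sketch rather than the tutorial's own code. It uses Python's standard csv module and assumes a file with the columns sepal_length, sepal_width, petal_length, petal_width, and species. The species column name, the read_iris helper, and the sample rows are assumptions made for illustration.

```python
import csv
import io

# A few illustrative rows standing in for the real file (hypothetical values).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.0,3.4,1.5,0.2,setosa
6.0,2.8,4.3,1.3,versicolor
6.5,3.0,5.6,2.0,virginica
"""

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def read_iris(handle):
    """Parse CSV rows, converting the four measurements to floats."""
    rows = []
    for record in csv.DictReader(handle):
        row = {name: float(record[name]) for name in FEATURES}
        row["species"] = record["species"]
        rows.append(row)
    return rows


iris = read_iris(io.StringIO(SAMPLE_CSV))
print(len(iris), iris[0])
```

To read a file on disk instead, you could pass an open file handle, for example `open("iris.csv", newline="")`, where the path is a placeholder for wherever your copy of the dataset lives.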

You can find the mean and median of the data and see how they differ according to species.

Figure 11: Finding mean and median
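
One way to compute the mean and median per species is sketched below with the standard statistics module. The sample pairs are hypothetical and only show the shape of the calculation. They are not the real Iris values.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (species, petal_length) pairs, not the real measurements.
samples = [
    ("setosa", 1.4), ("setosa", 1.5), ("setosa", 1.3),
    ("versicolor", 4.5), ("versicolor", 4.1), ("versicolor", 4.7),
    ("virginica", 5.6), ("virginica", 5.1), ("virginica", 6.0),
]

# Group the petal lengths by species.
by_species = defaultdict(list)
for species, petal_length in samples:
    by_species[species].append(petal_length)

# Mean and median of petal length for each species.
summary = {
    species: {"mean": mean(values), "median": median(values)}
    for species, values in by_species.items()
}

for species, stats in summary.items():
    print(species, round(stats["mean"], 2), stats["median"])
```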

As you can see, the mean and median of each feature do not differ by much within a species. This suggests that the values of each feature are spread fairly evenly around their center and are not pulled strongly in one direction by outliers.

Now, find the standard deviation for the data.

Figure 12: Finding standard deviation
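
The standard deviation per species can be sketched the same way. The block below uses statistics.stdev, which computes the sample standard deviation, on hypothetical petal lengths that are not taken from the real dataset.

```python
from statistics import stdev

# Hypothetical petal lengths in cm, used only to illustrate the calculation.
petal_lengths = {
    "setosa": [1.4, 1.5, 1.3, 1.6],
    "versicolor": [4.5, 4.1, 4.7, 4.0],
    "virginica": [5.6, 5.1, 6.0, 5.8],
}

# Sample standard deviation (n - 1 in the denominator) for each species.
spread = {species: stdev(values) for species, values in petal_lengths.items()}

for species, value in spread.items():
    print(f"{species}: {value:.3f}")
```

A small value means the measurements of that species sit close to their mean.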

The standard deviation of each feature is small within every species. This means that the data is not spread out much and there is little variation. Outliers are also few and far between.

Now, plot a violin plot to see how the different features compare to each other.

Figure 13: Plotting violin plots

Figure 14: Violin plots

Violin plots show the range of values in the dataset along one axis and use their width to show how the data is distributed over that range. The above graphs show that the petal length has the narrowest violins and hence the fewest outliers. Its ranges of values also intersect the least, as you can see by comparing the heights of the violins.

If you were to choose two features to compare the flowers on, which ones would they be? You can find out by plotting pair plots of your data. Pair plots plot each feature against every other feature as a scatterplot, so you can see which pair separates the species best.

Figure 15: Plotting pairplots

You get the following plots on running the above code:

Figure 16: Iris Pair plots

The above plots label the y-axis feature on the leftmost side and the x-axis feature at the bottom. Where a feature is plotted against itself, you get the PDF of that feature instead of a scatterplot. From the above graphs, you can say that petal length and petal width would be the best features to use together, as the scatterplot of these two features has the least overlap, both along the x-axis and the y-axis. Using these, you can distinctly identify the unique ranges for different flowers.

Now, plot the PDF along with the histogram for the iris data.

Figure 17: Plotting PDF

Figure 18: PDF plots of Iris dataset

The above plots show that the petal_length feature has the least overlap between data of different iris species. The bell-shaped curve you see for each feature resembles a probability distribution called the normal distribution. The bell curve for petal_length is also smoother than the bell curve for petal_width. All in all, petal length is the best feature for classifying the data.

Now, using the above data, plot the cumulative distribution function. You first split the data into three sets, depending on the flower species.

Figure 19: Separating our datasets
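
Splitting the data by species can be sketched as a simple grouping step. The split_by_species helper, the row layout, and the sample rows below are hypothetical and only illustrate the idea.

```python
def split_by_species(rows):
    """Group rows into one list per species, keyed by the species name."""
    groups = {}
    for row in rows:
        groups.setdefault(row["species"], []).append(row)
    return groups


# Hypothetical rows (illustrative values, not the real dataset).
rows = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "setosa", "petal_length": 1.5},
    {"species": "versicolor", "petal_length": 4.5},
    {"species": "virginica", "petal_length": 5.6},
]

groups = split_by_species(rows)
setosa_lengths = [row["petal_length"] for row in groups["setosa"]]
print(sorted(groups), setosa_lengths)
```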

You then plot the PDF and CDF together on the same graph. To get the PDF, you count the number of data points in each histogram bin and divide that count by the total number of data points. The CDF is nothing but the cumulative sum of the PDF up to a certain point. To get the cumulative distribution function for a bin, you add the PDF of that bin to the PDF of all the previous bins.

Figure 20: Plotting PDF and CDF of Iris Dataset
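
The calculation behind this plot can be sketched without any plotting library. The block below bins the values, divides each bin count by the total number of data points to get the PDF, and takes a running sum to get the CDF. The histogram_pdf_cdf helper, the equal-width binning, and the sample values are assumptions made for illustration and may differ from the code in the original figure.

```python
from itertools import accumulate


def histogram_pdf_cdf(values, bins=10):
    """Return bin edges, per-bin PDF, and CDF for equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be identical")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    pdf = [count / total for count in counts]  # fraction of points per bin
    cdf = list(accumulate(pdf))                # running sum of the PDF
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, pdf, cdf


# Hypothetical petal lengths in cm for one species (illustrative only).
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 1.9]
edges, pdf, cdf = histogram_pdf_cdf(petal_lengths, bins=3)
print(pdf)
print(cdf)
```

The PDF values sum to 1, so the last CDF value is 1 up to floating-point error. Running the function once per species gives the three pairs of curves that are drawn on the same graph.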

On running the above code, you get the graph shown below. You can see that the Setosa species has its own range of petal_length values, just below 2 cm. For Virginica and Versicolor, there is a bit of overlap, but most of the flowers of these two species fall in their own ranges. Hence, the accuracy of prediction is not compromised much.

Figure 21: PDF and CDF of Iris Dataset

From the above diagram, you can say that different petal_length ranges can be determined for different species of Irises. The ranges are:

Iris Setosa has petal lengths < 2cm.
A flower has a 95% chance of being Iris Versicolor if (3.2 < petal_length < 5).
A flower has a 90% chance of being Iris Virginica if its petal length is more than 5.
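
These ranges can be turned into a simple rule, sketched below. The function name classify_by_petal_length is hypothetical, and the thresholds are the ones listed above. The rule returns only the most likely species for each range, so the 95% and 90% figures still apply to its answers.

```python
def classify_by_petal_length(petal_length_cm):
    """Apply the petal-length ranges listed above; return None if no range applies."""
    if petal_length_cm < 2:
        return "setosa"
    if 3.2 < petal_length_cm < 5:
        return "versicolor"
    if petal_length_cm > 5:
        return "virginica"
    return None  # for example, between 2 and 3.2 cm, or exactly 5 cm


for length in (1.4, 4.5, 5.8, 2.5):
    print(length, classify_by_petal_length(length))
```

Values that fall in none of the listed ranges are left unclassified rather than guessed.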

Conclusion

In this tutorial on cumulative distribution function, you first understood the concept of CDF and how to calculate it using PDF. You then did a case study on the iris dataset and found ranges to differentiate between different iris species based on their petal lengths. Finally, you used Python to implement the case study and derive its results.

We hope this article helped you understand how to find the cumulative distribution function of a random variable.

Happy learning!

About the Author

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
