An essential part of statistics is the cumulative distribution function which helps you find the probability for a random variable in a specific range. This tutorial will teach you the basics of the cumulative distribution function and how to implement it in Python.
What Is the Cumulative Distribution Function?
The cumulative distribution function (CDF) describes the probability distribution of a random variable, whether that variable is discrete, continuous, or mixed. For a continuous variable, it is obtained by integrating the probability density function (PDF); for a discrete variable, by summing the probabilities of the individual values. Either way, the result is the cumulative probability of the random variable up to a given value.
The probability density function is a function that describes how the probability of a continuous random variable is distributed over its possible values. Evaluating the PDF at a point gives the probability density there, not a probability; probabilities come from the area under the PDF over a range.
The cumulative distribution function of a random variable X evaluated at a point x is written F_X(x). It is the probability that the random variable X will take a value less than or equal to x.
Consider the diagram shown below. The diagram shows the probability density function f(x), which gives us a rectangle between the points (a, b) when plotted. f(x) has a value of 1/(b-a).
Figure 1: Probability Density Function
Now consider a point c on the x-axis between a and b. This is the point at which you need to find the cumulative distribution function. According to the definition, you need to accumulate the probability density function up to point c. This means that you have to find the area of the rectangle between points a and c.
Figure 2: Calculating the CDF
You can do this by multiplying the length and breadth of the rectangle. The breadth is the distance between a and c, that is, c - a, and the length is the value of the probability density function, 1/(b-a). In the end, you get the CDF as:
Figure 3: CDF
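With a breadth of c - a and a length of 1/(b-a), the product works out as follows. This is a reconstruction from the description in the text rather than a copy of the original figure, and it assumes that c lies between a and b:

```latex
F(c) = \underbrace{(c - a)}_{\text{breadth}} \times \underbrace{\frac{1}{b - a}}_{\text{length}} = \frac{c - a}{b - a}, \qquad a \le c \le b
```

At c = a the result is 0 and at c = b it is 1, which is what you would expect from a cumulative probability.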
Since the cumulative distribution function is the probability accumulated under the PDF up to a certain point x, it can be represented as the probability that the random variable X is less than or equal to x.
Figure 4: CDF representation
Because you need the total area under the PDF between two points, you can also represent the CDF as the integral of the PDF over that interval. The formula depicted below shows the probability accumulated between points (a, b) for the PDF f_X(x), which equals F_X(b) - F_X(a).
Figure 5: CDF as the integration of PDF
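The link between integration and the rectangle area can be checked numerically. The following is a minimal sketch using only the standard library; the endpoints a, b, and c are hypothetical values chosen for illustration, and the helper names are not from the original article:

```python
def uniform_pdf(x, a, b):
    """Density of a uniform distribution on [a, b]: 1/(b-a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def cdf_by_integration(pdf, lower, upper, steps=10_000):
    """Approximate the integral of `pdf` from `lower` to `upper` (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


# Hypothetical endpoints, chosen for illustration only.
a, b, c = 2.0, 6.0, 3.0

numeric = cdf_by_integration(lambda x: uniform_pdf(x, a, b), a, c)
closed_form = (c - a) / (b - a)  # area of the rectangle between a and c

print(f"numeric={numeric:.4f}, closed form={closed_form:.4f}")
```

Both numbers should agree, because integrating a constant density over [a, c] is the same as multiplying the breadth by the length of the rectangle.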
Understanding the Cumulative Distribution Function With the IRIS Dataset
In this case study, you will be looking at the Iris dataset, which contains information on the sepal length, sepal width, petal length, and petal width of three different species of Iris:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
Figure 6: Iris Dataset
All the values are in centimeters. The dataset contains 50 data points for each species. You need to find a reliable measure that you can use to differentiate the species from each other.
Now, plot each feature of the dataset as a histogram (a bar graph of value counts). You plot the features with a different color for each species to see how they overlap with each other. This is a way of approximating the PDF of the data.
Figure 7: Iris Dataset PDF
The above graphs are as follows from top left to bottom right:
- Sepal_length: In this graph, you can see that the sepal lengths of all three species overlap considerably. Hence, it becomes tough to set parameters or ranges that you can use to differentiate the flowers.
- Sepal_width: This graph has even more overlap. It is also not a feature that you can use to differentiate between the flowers correctly.
- Petal_length: This graph has far less overlap than the other two. The boundaries for Setosa have no overlap with any other species, and Versicolor and Virginica overlap only slightly. You can easily find the ranges that most of the petal lengths fall into for each species.
- Petal_width: This graph has significantly less overlap than the sepal parameters, but you can still see a little overlap between the species.
Hence, you can conclude that the petal length is the best parameter for differentiating between iris species.
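One rough way to put a number on the overlap discussed above is to compare the value ranges of two species for a single feature. The sketch below is an assumption about how this could be done, not the method used to produce the figures, and the petal lengths are a handful of hypothetical values rather than the full dataset:

```python
from itertools import combinations


def range_overlap(values_a, values_b):
    """Length of the interval shared by the ranges of two samples (0 if disjoint)."""
    low = max(min(values_a), min(values_b))
    high = min(max(values_a), max(values_b))
    return max(0.0, high - low)


# Hypothetical petal lengths in cm, for illustration only.
petal_length = {
    "setosa": [1.4, 1.3, 1.5, 1.7, 1.9],
    "versicolor": [3.3, 4.0, 4.5, 4.7, 4.9],
    "virginica": [4.5, 5.1, 5.6, 5.9, 6.6],
}

# Compare every pair of species once.
for (name_a, vals_a), (name_b, vals_b) in combinations(petal_length.items(), 2):
    print(f"{name_a} vs {name_b}: {range_overlap(vals_a, vals_b):.1f} cm of overlap")
```

A range comparison like this is sensitive to single outliers, so it is only a quick check; the histograms give a fuller picture of how much of the data actually sits in the shared region.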
Next, cumulatively sum the PDF and plot the resulting graph to see the CDF for the iris data.
Figure 8: Iris Dataset PDF and CDF
From the above graph, you can notice three things:
- Petal Length < 1.9 is most definitely ‘Setosa’. You can say this as the petal length for setosa falls in this range and does not coincide with the petal length of Versicolor or Virginica.
- 3.2 < Petal length < 5 has a 95% chance of being Versicolor. Versicolor and Virginica have a slight overlap between 4.7 and 5. Hence, a machine learning model may mistake the two, but the chances of that happening are very low.
- Petal length > 5 has a 90% chance of being ‘Virginica’. Again, there is a slight overlap between Versicolor and Virginica in the 5-5.2 cm region, which is the cause of the slight error.
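Statements such as "a 95% chance of being Versicolor" come down to counting: of all the flowers whose petal length falls in a range, what share belongs to each species? A minimal sketch of that count follows. The function name and the sample values are hypothetical, so the printed shares will not match the percentages quoted above:

```python
def species_share(samples, low, high):
    """Share of each species among all points with low < value < high."""
    counts = {
        name: sum(low < value < high for value in values)
        for name, values in samples.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No point falls in the range, so no share can be computed.
        return {name: 0.0 for name in samples}
    return {name: count / total for name, count in counts.items()}


# Hypothetical petal lengths in cm, for illustration only.
petal_length = {
    "setosa": [1.4, 1.3, 1.5, 1.7, 1.9],
    "versicolor": [3.3, 4.0, 4.5, 4.7, 4.9],
    "virginica": [4.5, 5.1, 5.6, 5.9, 6.6],
}

share = species_share(petal_length, 3.2, 5.0)
for name, fraction in share.items():
    print(f"{name}: {fraction:.0%} of the flowers with 3.2 < petal length < 5")
```

Reading these shares as probabilities assumes that the sample has the same mix of species as the flowers you want to classify, which holds for the Iris dataset because it has the same number of points for each species.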
Implementing Cumulative Distribution Function With Python
Now, see how you can implement the cumulative distribution function in Python. Let’s start by importing the necessary libraries.
Figure 9: Importing necessary modules
Next, read in our iris dataset.
Figure 10: Importing Iris dataset
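The original listing is not reproduced here. As a minimal sketch, assuming the data is available as a CSV file with the columns sepal_length, sepal_width, petal_length, petal_width, and species (the column names and the few sample rows below are assumptions), the data can be read and grouped by species with the standard library alone:

```python
import csv
import io
from collections import defaultdict

# A few rows in the assumed layout; a real file would hold all 150 rows.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""


def read_iris(handle):
    """Return {species: {feature: [values, ...]}} from a CSV file-like object."""
    data = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        species = row.pop("species")
        for feature, value in row.items():
            data[species][feature].append(float(value))
    return data


iris = read_iris(io.StringIO(SAMPLE_CSV))
print(sorted(iris))
print(iris["setosa"]["petal_length"])
```

To read from disk instead, pass an open file, for example `open(path, newline="")`, in place of the `io.StringIO` object. A data-frame library would do the same job in fewer lines; this version only avoids the extra dependency.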
You can find the mean and median of the data and see how they differ according to species.
Figure 11: Finding mean and median
As you can see, for each species the mean and the median of a feature do not differ by much. This suggests that the values of each feature are spread fairly symmetrically around their center and that the averages are not being pulled to one side by extreme values. Comparing the means (or the medians) across species then shows how far apart the species are for each feature.
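A per-species mean and median can be computed with the standard library's statistics module. This is a minimal sketch on a few hypothetical petal lengths, not the original listing or the full dataset:

```python
from statistics import mean, median


def summarize(samples):
    """Return {species: (mean, median)} for a dict of value lists."""
    return {name: (mean(values), median(values)) for name, values in samples.items()}


# Hypothetical petal lengths in cm, for illustration only.
petal_length = {
    "setosa": [1.4, 1.3, 1.5, 1.7, 1.9],
    "versicolor": [3.3, 4.0, 4.5, 4.7, 4.9],
    "virginica": [4.5, 5.1, 5.6, 5.9, 6.6],
}

summary = summarize(petal_length)
for name, (avg, mid) in summary.items():
    print(f"{name}: mean={avg:.2f} cm, median={mid:.2f} cm")
```

When the mean and the median of a group are close, as they are for each list here, the values in that group are roughly balanced around their center.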
Now, find the standard deviation for the data.
Figure 12: Finding standard deviation
The standard deviation of each feature within each species is small. This means that the data is not spread out much around the mean, and there is not much variation. Also, outliers are few and far between.
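The spread within each species can be sketched in the same way. The example below uses the sample standard deviation (dividing by n - 1), which is also the default of the pandas `std` method; the values are hypothetical and only illustrate the calculation:

```python
from statistics import stdev

# Hypothetical petal lengths in cm, for illustration only.
petal_length = {
    "setosa": [1.4, 1.3, 1.5, 1.7, 1.9],
    "versicolor": [3.3, 4.0, 4.5, 4.7, 4.9],
    "virginica": [4.5, 5.1, 5.6, 5.9, 6.6],
}

# Sample standard deviation per species: a small value means a tight cluster.
spread = {name: stdev(values) for name, values in petal_length.items()}
for name, value in spread.items():
    print(f"{name}: standard deviation={value:.2f} cm")
```

A small standard deviation relative to the gap between species means is what makes a feature useful for telling the species apart.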
Now, plot a violin plot to see how the different features compare to each other.
Figure 13: Plotting violin plots
Figure 14: Violin plots
Violin plots show the range of values of each feature along one axis and use the width of the violin to show how the data is spread out over that range. The above graphs show that the petal length has the narrowest violins and hence the fewest outliers. Their ranges of values also have the least intersection, as can be seen by comparing the heights of the violins.
If you were to choose two features to compare the flowers on, which ones would they be? This can be found by plotting pair plots of your data. Pair plots plot each feature against the others in the form of a scatterplot to see which pair performs the best.
Figure 15: Plotting pairplots
You get the following plots on running the above code:
Figure 16: Iris Pair plots
In the above plots, the label on the left of each row names the y-axis feature, and the label at the bottom of each column names the x-axis feature. On the diagonal, where a feature would be plotted against itself, you get the PDF of that feature instead. From the above graphs, you can say that petal length and petal width would be the best features to use together, as the scatter plot of these two features has the least overlap, both along the x-axis and the y-axis. Using these, you can identify distinct ranges for the different flowers.
Now, plot the PDF along with the histogram for the iris data.
Figure 17: plotting PDF
Figure 18: PDF plots of Iris dataset
The above plots show that the petal_length feature has the least overlap between data of different iris species. The bell-shaped curve you see for the features resembles a probability distribution called the normal distribution. The bell curve for petal_length is also smoother than the bell curve for petal_width. All in all, petal length is the best feature for classifying the data.
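To see what a normal distribution fitted to one species looks like in numbers, you can use `statistics.NormalDist` from the standard library. This is a minimal sketch under the assumption that the feature is roughly normal; the sample is hypothetical and far too small for a serious fit:

```python
from statistics import NormalDist

# Hypothetical Versicolor petal lengths in cm, for illustration only.
versicolor = [4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9]

# Estimate the mean and standard deviation of a normal curve from the sample.
fitted = NormalDist.from_samples(versicolor)

print(f"mean={fitted.mean:.2f} cm, stdev={fitted.stdev:.2f} cm")
print(f"density at 4.5 cm: {fitted.pdf(4.5):.3f}")
print(f"P(petal length <= 5.0 cm): {fitted.cdf(5.0):.3f}")
```

The `pdf` method returns the height of the bell curve at a point, while `cdf` returns the area under the curve to the left of that point, which is the distinction between PDF and CDF made earlier in this tutorial.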
Now, using the above data, plot the cumulative distribution function. You first split the data into three sets, depending on the flower species.
Figure 19: Separating our datasets
You then plot the PDF and CDF together on the same graph. To get the PDF, you count the number of data points in each histogram bin and divide the count by the total number of data points. The CDF is nothing but the cumulative sum of the PDF up to a certain point. To get the cumulative distribution function for every bin, you add the PDF of that bin to the PDF of all the previous bins.
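The arithmetic behind the two curves, without the plotting, can be sketched with NumPy as follows. The function name, the number of bins, and the sample values are assumptions made for illustration; the original listing may differ:

```python
import numpy as np


def histogram_pdf_cdf(values, bins=10):
    """Return (bin_edges, pdf, cdf), where pdf[i] is the share of points in bin i."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # divide each bin count by the TOTAL number of points
    cdf = np.cumsum(pdf)         # running total of the PDF over the bins
    return bin_edges, pdf, cdf


# Hypothetical Setosa petal lengths in cm, for illustration only.
setosa = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]

edges, pdf, cdf = histogram_pdf_cdf(setosa, bins=4)
for left, right, p, c in zip(edges[:-1], edges[1:], pdf, cdf):
    print(f"bin {left:.2f} to {right:.2f}: pdf={p:.2f}, cdf={c:.2f}")
```

The PDF values here are the share of points per bin, so they sum to 1 and the CDF ends at 1. If you need a density whose area is 1 instead, `np.histogram` accepts `density=True`, which additionally divides by the bin width.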
Figure 20: Plotting PDF and CDF of Iris Dataset
On running the plotting code, you get the graph shown below. You can see that the Setosa species has its own range of petal_length values, just below 2 cm. For Virginica and Versicolor, there is a bit of an overlap, but most of the flowers of these two species fall in their own ranges. Hence, the accuracy of prediction is not compromised much.
Figure 21: PDF and CDF of Iris Dataset
From the above diagram, you can say that different petal_length ranges can be determined for different species of Irises. The ranges are:
- Iris Setosa has petal lengths < 2cm.
- A flower has a 95% chance of being Iris Versicolor if (3.2 < petal_length < 5).
- A flower has a 90% chance of being Iris Virginica if its petal length is more than 5.
In this tutorial on cumulative distribution function, you first understood the concept of CDF and how to calculate it using PDF. You then did a case study on the iris dataset and found ranges to differentiate between different iris species based on their petal lengths. Finally, you used Python to implement the case study and derive its results.
We hope this article helped you understand how to find the cumulative distribution function of a random variable.