For discrete variables, the probability of each value is straightforward to calculate. A continuous variable, however, can take on infinitely many values, so the probability of any single exact value is zero and probabilities are instead assigned to ranges of values. The function that describes how probability is spread over those values is called a probability density function in statistics.
What Is the Probability Density Function?
A Probability Density Function (PDF) describes the relationship between a continuous random variable and its probability: the area under the function over an interval gives the probability that the variable falls within that interval.
Random variables are mainly of two types:
- Discrete Variable: A variable that can only take on distinct, countable values is called a discrete variable. Its values are usually separated by a finite interval, e.g., the sum of two dice. On rolling two dice and adding up the outcomes, the result can only be a whole number from 2 to 12 (as the maximum result of a single die throw is 6). The values are also definite.
- Continuous Variable: A continuous random variable can take on infinitely many different values within a range, e.g., the amount of rainfall occurring in a month. The rain observed can be recorded as 1.7cm, but the exact value is not known. It can, in actuality, be 1.701, 1.7687, etc. As such, you can only define the range of values it falls into. Within this range, it can take on infinitely many different values.
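To make the discrete case concrete, the following minimal sketch enumerates every outcome of two dice and tabulates the probability of each sum. It assumes the dice are fair and six-sided; the name `sum_probabilities` is a hypothetical one chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum, then turn counts into probabilities.
counts = Counter(a + b for a, b in outcomes)
sum_probabilities = {
    total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())
}

for total, prob in sum_probabilities.items():
    print(f"P(sum = {total}) = {prob}")
```

Because the variable is discrete, each possible sum has its own non-zero probability and the probabilities add up to 1. A continuous variable such as rainfall has no such table, since any single exact value has probability zero.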
Now, consider a continuous random variable X with a probability density function f(x), which describes how densely probability is concentrated around each value x. After plotting the PDF, you get a graph as shown below:
Figure 1: Probability Density Function
In the above graph, you get a bell-shaped curve after plotting the function against the variable. The blue curve shows this. Now consider a point b. The area under the curve to the left of b is the probability that the variable takes a value of at most b, written P(X <= b). The probability of one exact value is zero, so probabilities are always read off as areas. To find the probability of the variable falling between points a and b, you need to find the area under the curve between a and b, which is the area to the left of b minus the area to the left of a:
P(a <= X <= b) = P(X <= b) - P(X <= a).
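As a rough illustration of reading probability as an area, the sketch below assumes a standard normal curve (the curve in Figure 1 is not specified) and the hypothetical endpoints a = -1 and b = 1. It computes the probability once as a difference of cumulative probabilities and once by numerically integrating the density, using SciPy's `norm` and `quad`.

```python
from scipy.integrate import quad
from scipy.stats import norm

# Assumed for illustration: a standard normal curve and the interval [-1, 1].
a, b = -1.0, 1.0
dist = norm(loc=0.0, scale=1.0)

# Area to the left of b minus area to the left of a.
prob_from_cdf = dist.cdf(b) - dist.cdf(a)

# The same area obtained by integrating the density f(x) from a to b.
prob_from_integral, _abs_error = quad(dist.pdf, a, b)

print(f"P({a} <= X <= {b}) via CDF:      {prob_from_cdf:.4f}")
print(f"P({a} <= X <= {b}) via integral: {prob_from_integral:.4f}")
```

Both routes measure the same area, so the two printed values should agree to within numerical error.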
Consider the graph below, which shows the rainfall distribution in a year in a city. The x-axis has the rainfall in inches, and the y-axis has the probability density. The probability of at most some amount of rainfall is obtained by finding the area under the curve to the left of it.
Figure 2: Probability Density Function of the amount of rainfall
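To read the graph at 3 inches of rainfall, draw a vertical line up from 3 on the x-axis to the curve, and then a horizontal line from that point across to the y-axis. The value you read off is the probability density at 3 inches, which in this graph is less than or equal to 0.5. It is a density rather than a probability; the probability of at most 3 inches of rainfall is the area under the curve to the left of 3.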
How to Find the Probability Density Function in Statistics?
Below are the three main steps:
- Summarizing the density with a histogram: You first convert the data into discrete form by plotting it as a histogram. A histogram is a graph with ranges of values (bins) on the x-axis and a bar over each bin whose height gives the count of values falling in that bin. The number of bins is crucial as it determines how many bars the histogram will have and how wide they are. This, in turn, determines how well the histogram shows the shape of the density.
- Performing Parametric Density Estimation: A PDF can take on a shape similar to many standard functions. The shape of the histogram will help you determine which type of function it is. You can then calculate the parameters associated with that function from the data to get your density. To check whether the function is a good fit for the histogram, you can:
  - Plot the density function and compare it with the histogram shape
  - Compare samples of the function with actual samples
  - Use a statistical test
- Performing Non-Parametric Density Estimation: In cases where the shape of the histogram doesn't match a common probability density function, or cannot be made to fit one, you estimate the density from all the samples in the data by applying certain algorithms. One such algorithm is Kernel Density Estimation (KDE). It uses a mathematical function, called a kernel, to smooth the contribution of each sample so that the estimated density always integrates to 1. To do this, you need the following parameters:
  - Smoothing Parameter (bandwidth): Controls the width of the window of samples used to estimate the density at a new point.
  - Basis Function (kernel): Controls how much each sample within that window contributes to the estimate.
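To show what the two parameters do, here is a minimal from-scratch sketch of kernel density estimation with a Gaussian kernel, written with NumPy only. The function names `gaussian_kernel` and `kde`, the example sample, and the bandwidth value are hypothetical choices for illustration and are not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the basis function (kernel)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(points, sample, bandwidth):
    """Estimate the density at each of `points` from `sample`."""
    points = np.asarray(points, dtype=float)
    sample = np.asarray(sample, dtype=float)
    # Scaled distance from every evaluation point to every sample.
    u = (points[:, None] - sample[None, :]) / bandwidth
    # Average the kernel contributions; dividing by the bandwidth keeps
    # the total area under the estimate equal to 1.
    return gaussian_kernel(u).mean(axis=1) / bandwidth

rng = np.random.default_rng(seed=1)  # seed chosen only for reproducibility
sample = rng.normal(loc=50, scale=5, size=200)  # assumed example data

grid = np.linspace(10, 90, 801)
density = kde(grid, sample, bandwidth=2.0)

# Riemann-sum approximation of the area under the estimated density.
area = density.sum() * (grid[1] - grid[0])
print(f"Area under the estimate: {area:.4f}")
```

A smaller bandwidth lets each sample influence only a narrow neighbourhood and gives a bumpier curve, while a larger one smooths more aggressively.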
How to Implement the Probability Density Function in Python?
You will see how to find the probability density function of a random sample with the help of Python. You start by importing the necessary modules, which will help you plot the histogram and find the distribution.
Figure 3: Importing necessary modules
1. Plotting a Histogram
Now generate a random sample that has a probability density function resembling a bell-shaped curve. This type of probability distribution is called a Normal Distribution.
Figure 4: Plotting a histogram
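A minimal sketch of this step, assuming NumPy for sampling and Matplotlib's `pyplot` for plotting. The seed, the sample size of 1000, and the mean of 0 and standard deviation of 1 are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Draw a random sample from a normal distribution (mean 0, standard deviation 1).
rng = np.random.default_rng(seed=42)  # seed chosen only for reproducibility
sample = rng.normal(size=1000)

# Plot the sample as a histogram with 10 bins.
counts, bin_edges, _patches = plt.hist(sample, bins=10)
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
```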
Using the pyplot library, you plotted the distribution as a histogram. As you can see, the shape of the histogram resembles a bell curve.
Figure 5: Histogram
While plotting a histogram, it is important to plot it using the right number of bins. In the above diagram, you used 10 bins. See what happens if you use 4 bins.
Figure 6: Histogram with 4 bins
As you can see, this histogram doesn’t resemble a bell shape as much as the one with 10 bins. This can make it hard to recognize the type of distribution.
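The effect of the bin count can also be seen without a plot. This sketch assumes a standard normal sample of 1000 values like the one described earlier (regenerated here so the block is self-contained) and uses NumPy's `histogram` to print the bin counts for 10 bins and for 4 bins.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only for reproducibility
sample = rng.normal(size=1000)

# Count the values falling into 10 bins and into 4 bins over the same data.
counts_10, edges_10 = np.histogram(sample, bins=10)
counts_4, edges_4 = np.histogram(sample, bins=4)

print("10 bins:", counts_10)
print(" 4 bins:", counts_4)
```

With only 4 bins, each bar averages over a wide range of values, so the gradual rise and fall of the bell curve is flattened into a few coarse steps.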
2. Performing Parametric Density Estimation
Now, see how to perform parametric density estimation. First, generate 1000 samples from a normal distribution with a mean of 50 and a standard deviation of 5.
Figure 7: Generating Samples
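A minimal sketch of generating such a sample, assuming NumPy's random generator. The seed is an arbitrary choice made only so the run is reproducible.

```python
import numpy as np

# Generate 1000 values from a normal distribution with mean 50 and
# standard deviation 5.
rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = rng.normal(loc=50, scale=5, size=1000)

print(sample[:5])
```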
To perform parametric estimation, assume that you don't know the distribution of these samples. The first thing that you need to do with the sample is to assume a distribution for it. Let’s assume a normal distribution. The parameters associated with a normal distribution are the mean and the standard deviation. Calculate both for the samples.
Figure 8: Calculating mean and standard deviation
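The two parameters can be estimated directly from the data. This sketch assumes the same kind of sample as described above (regenerated here so the block runs on its own) and uses NumPy's `mean` and `std`; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = rng.normal(loc=50, scale=5, size=1000)

# Estimate the two parameters of the assumed normal distribution from the data.
sample_mean = np.mean(sample)
sample_std = np.std(sample)  # population form (ddof=0); pass ddof=1 for the sample form

print(f"Mean = {sample_mean:.3f}, Standard Deviation = {sample_std:.3f}")
```

The estimates should land close to, but not exactly on, the true values of 50 and 5, because they are computed from a finite random sample.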
Now, define a normal distribution with the above mean and standard deviation.
Figure 9: Normal distribution
Now, evaluate the probability density of the distribution defined above over a range of values.
Figure 10: Probability distribution for normal distribution
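One way to define the distribution and evaluate its density is with SciPy's `norm`, as sketched below. The use of SciPy, the grid of whole numbers from 30 to 70, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = rng.normal(loc=50, scale=5, size=1000)

# Define a normal distribution using the parameters estimated from the sample.
dist = norm(loc=np.mean(sample), scale=np.std(sample))

# Evaluate its density on a grid of values covering the sample (30 to 70 here).
values = np.arange(30, 71)
densities = dist.pdf(values)

print(f"Density at 50: {dist.pdf(50):.4f}")
```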
Now, plot the distribution you’ve defined on top of the sample data.
Figure 11: Plotting distribution on samples
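A minimal sketch of the overlay, again assuming NumPy, SciPy, and Matplotlib. Passing `density=True` to `hist` is a choice made here so that the histogram and the fitted curve share the same vertical scale.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = rng.normal(loc=50, scale=5, size=1000)

dist = norm(loc=np.mean(sample), scale=np.std(sample))
values = np.arange(30, 71)
densities = dist.pdf(values)

# density=True rescales the bars so their total area is 1, which puts the
# histogram on the same vertical scale as the fitted density curve.
bar_heights, bin_edges, _patches = plt.hist(sample, bins=10, density=True)
plt.plot(values, densities)
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
```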
As you can see, the distribution you assumed is almost a perfect fit for the samples. This suggests that the sample is normally distributed. If the fit were poor, you would have to assume some other distribution for the sample and repeat the process.
3. Performing Non-Parametric Density Estimation
It’s time to perform non-parametric estimations now. You start by importing some modules needed for it.
Figure 12: Importing necessary modules
To demonstrate non-parametric estimation, take two normal samples and join them together to get a sample that does not fit any common distribution.
Figure 13: Creating a sample
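A minimal sketch of this step, assuming NumPy's `hstack` to join the samples. The means (20 and 40), the standard deviation (5), and the sizes (300 and 700) are hypothetical values chosen so that the combined sample has two peaks.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

# Two normal samples with different means (values assumed for illustration).
sample_a = rng.normal(loc=20, scale=5, size=300)
sample_b = rng.normal(loc=40, scale=5, size=700)

# Join them into one sample with two peaks.
sample = np.hstack((sample_a, sample_b))

print(sample.shape)
```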
Now, plot the distribution to see what it looks like.
Figure 14: Plotting the distribution
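The combined sample can be inspected with a histogram, as in this sketch. The sample parameters are the same hypothetical ones (means of 20 and 40, standard deviation of 5, sizes of 300 and 700), and the choice of 50 bins is an assumption made to give enough resolution to show both peaks.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = np.hstack((
    rng.normal(loc=20, scale=5, size=300),
    rng.normal(loc=40, scale=5, size=700),
))

# A histogram with many narrow bins makes the two peaks visible.
counts, bin_edges, _patches = plt.hist(sample, bins=50)
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
```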
Now, use Kernel density estimation to get a model, which you can then fit to your sample to create a probability distribution curve.
Figure 15: Creating a Kernel Density Estimation Function
Now evaluate the density estimated by your kernel density estimation model over a range of values.
Figure 16: Creating a Kernel Density Estimation Function
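One way to carry out these two steps is with scikit-learn's `KernelDensity` class, sketched below. The choice of scikit-learn, the bimodal sample parameters, the Gaussian kernel, and the bandwidth of 2 are all assumptions for illustration. Note that `score_samples` returns log-densities, so they are exponentiated to obtain densities.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = np.hstack((
    rng.normal(loc=20, scale=5, size=300),
    rng.normal(loc=40, scale=5, size=700),
))

# scikit-learn expects a 2D array of shape (n_samples, n_features).
model = KernelDensity(kernel="gaussian", bandwidth=2.0)
model.fit(sample.reshape(-1, 1))

# Evaluate the fitted model on a grid; score_samples gives log-density values.
values = np.arange(0, 61).reshape(-1, 1)
log_densities = model.score_samples(values)
densities = np.exp(log_densities)

print(f"Estimated density at 40: {densities[40]:.4f}")
```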
Finally, plot the function on top of your samples.
Figure 17: Plotting distribution on samples
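A minimal sketch of the final plot, under the same assumptions as before: scikit-learn's `KernelDensity`, a hypothetical bimodal sample, a Gaussian kernel, and a bandwidth of 2. The histogram is drawn with `density=True` so that it shares a vertical scale with the estimated curve.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
sample = np.hstack((
    rng.normal(loc=20, scale=5, size=300),
    rng.normal(loc=40, scale=5, size=700),
))

model = KernelDensity(kernel="gaussian", bandwidth=2.0)
model.fit(sample.reshape(-1, 1))

values = np.arange(0, 61)
densities = np.exp(model.score_samples(values.reshape(-1, 1)))

# Histogram of the samples (scaled to unit area) with the KDE curve on top.
bar_heights, bin_edges, _patches = plt.hist(sample, bins=50, density=True)
plt.plot(values, densities)
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
```

Re-running the sketch with a different `bandwidth` value shows how the curve becomes bumpier or smoother.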
You can see that the kernel density estimate fits the samples pretty well. To further fine-tune the fit, you can change the bandwidth of the function.
In this tutorial on ‘Everything You Need to Know About the Probability Density Function’, you learned what a probability density function is in statistics. You then looked at how to find the probability density function, both in theory and in Python.