Everything You Need to Know About the Probability Density Function in Statistics

For discrete variables, probabilities are straightforward: each possible value has its own probability. A continuous variable, however, can take infinitely many values, so the probability of any single exact value is zero, and probability is instead assigned to ranges of values. The function that describes how probability is spread over those ranges is called a probability density function in statistics.

What Is the Probability Density Function?

A Probability Density Function (PDF) describes how the probability of a continuous random variable is distributed over its possible values. The value of the function at a point is a density, not a probability; the probability that the variable falls within a range is the area under the function over that range.

Random variables are mainly of two types:

  1. Discrete Variable: A variable that can only take on distinct, countable values within a specific range is called a discrete variable. Its values are usually separated by a finite interval, e.g., the sum of two dice. On rolling two dice and adding up the outcomes, the result can only be a whole number from 2 to 12 (as the maximum result of a single die is 6). Each value is also definite.
  2. Continuous Variable: A continuous random variable can take on infinitely many values within a range, e.g., the amount of rainfall in a month. The rain observed may be recorded as 1.7 cm, but the exact value is not known. It could, in actuality, be 1.701, 1.7687, etc. As such, you can only define the range of values it falls into. Within this range, it can take on infinitely many values.
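
To make the discrete case concrete, the short sketch below enumerates the two-dice example and computes the probability of every possible sum. The names used here (outcomes, pmf) are illustrative choices and do not come from the original article.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two six-sided dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)

# Probability of each possible sum, kept as an exact fraction.
pmf = {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}

for total, p in pmf.items():
    print(f"P(sum = {total:2d}) = {p}")
```

Every sum from 2 to 12 gets its own probability, with 7 the most likely at 1/6. A continuous variable such as rainfall has no equivalent table, which is why a density function is needed instead.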

Now, consider a continuous random variable X with a probability density function f(x). After plotting the PDF, you get a graph as shown below:

Figure 1: Probability Density Function

In the above graph, you get a bell-shaped curve after plotting the function against the variable. The blue curve shows this. Now consider a point b. The area under the curve to the left of b is the probability that the variable is at most b, written P(X <= b). To find the probability of the variable falling between points a and b, you need the area under the curve between a and b, which is the area to the left of b minus the area to the left of a:

P(a <= X <= b) = P(X <= b) - P(X <= a).
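
As a rough numerical check of the area idea, the sketch below approximates the area under a standard normal PDF between two points and compares it with the difference of the two left-hand areas. The choice of a standard normal curve, the points a = -1 and b = 1, and the helper names are assumptions made for illustration only.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, std=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def area_under_pdf(a, b, mean=0.0, std=1.0, steps=10_000):
    """Approximate the area under the PDF between a and b (trapezoidal rule)."""
    width = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mean, std) + normal_pdf(b, mean, std))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mean, std)
    return total * width

a, b = -1.0, 1.0
by_area = area_under_pdf(a, b)
by_cdf = normal_cdf(b) - normal_cdf(a)
print(f"area under the curve between a and b: {by_area:.4f}")
print(f"area left of b minus area left of a:  {by_cdf:.4f}")
```

Both numbers should agree closely (about 0.6827 for one standard deviation either side of the mean), which is the relationship stated above.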

Consider the graph below, which shows the rainfall distribution in a year in a city. The x-axis has the rainfall in inches, and the y-axis has the probability density. The probability of at most some amount of rainfall is obtained by finding the area under the curve to the left of it.

Figure 2: Probability Density Function of the amount of rainfall

To read the graph at 3 inches of rainfall, follow a vertical line up from 3 on the x-axis to the curve, then a horizontal line across to the y-axis. In this graph, the value read off this way is at most 0.5. Keep in mind that this is a density rather than a probability in its own right: the probability of at most 3 inches of rainfall is the area under the curve to the left of 3.

How to Find the Probability Density Function in Statistics?

Below are the three main steps:

  • Summarizing the density with a histogram: You first convert the data into discrete form by plotting it as a histogram. A histogram is a graph with ranges of values (bins) on the x-axis and bars of different heights, each giving you a count of the values that fall in that bin. The number of bins is crucial as it determines how many bars the histogram will have and their width, and therefore how coarsely or finely the density is shown.
  • Performing Parametric density estimation: A PDF can take on a shape similar to many standard functions. The shape of the histogram will help you determine which type of function it is. You can then calculate the parameters associated with that function from the data to get your density. To check if the function is a good fit for your histogram, you can:
    1. Plot the density function and compare histogram shape
    2. Compare samples of the function with actual samples
    3. Use a statistical test
  • Performing Non-Parametric Density Estimation: In cases where the shape of the histogram doesn't match a common probability density function, or cannot be made to fit one, you calculate the density using all the samples in the data and applying certain algorithms. One such algorithm is Kernel Density Estimation. It uses a mathematical function to smooth the contribution of each sample so that the estimated density always integrates to 1. To do this, you need the following parameters:
    1. Smoothing Parameter (bandwidth): Controls the width of the window of samples that contribute to the estimate at a new point.
    2. Basis Function (kernel): Controls how the contribution of each sample within that window is weighted.
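
To show what the bandwidth and the basis function do, here is a minimal hand-written kernel density estimate using NumPy. It is a sketch under stated assumptions, not the article's own implementation: the Gaussian kernel, the helper names gaussian_kernel and kde, the sample, and the bandwidth values are all chosen for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Basis function: weight given to a sample at scaled distance u."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(points, samples, bandwidth):
    """Estimate the density at each point from all samples."""
    points = np.asarray(points, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Scaled distance between every evaluation point and every sample.
    u = (points[:, None] - samples[None, :]) / bandwidth
    # Average the kernel weights; dividing by the bandwidth keeps the area at 1.
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-10.0, 10.0, 2001)
step = grid[1] - grid[0]

for bandwidth in (0.1, 0.5, 2.0):
    density = kde(grid, samples, bandwidth)
    area = np.sum(density) * step
    print(f"bandwidth={bandwidth}: peak={density.max():.3f}, area={area:.3f}")
```

The area should stay close to 1 for every bandwidth, while the peak changes: a small bandwidth follows individual samples closely and gives a spiky estimate, and a large one spreads each sample out and flattens the curve.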

How to Implement the Probability Density Function in Python?

You will see how to find the probability density function of a random sample with the help of Python. You start by importing the necessary modules, which will help you plot the histogram and find the distribution.

Figure 3: Importing necessary modules

1. Plotting a Histogram

Now generate a random sample that has a probability density function resembling a bell-shaped curve. This type of probability distribution is called a Normal Distribution.                                       

Figure 4: Plotting a histogram
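
The original code for this step is only available as an image, so the following is a minimal sketch of what it might look like rather than the article's exact code. It assumes NumPy for sampling and matplotlib's pyplot for plotting; the sample size of 1,000, the seed, and the helper names make_sample and plot_histogram are illustrative.

```python
import numpy as np

def make_sample(size=1000, seed=1):
    """Draw `size` values from a standard normal distribution."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=size)

def plot_histogram(sample, bins=10):
    """Draw the sample as a histogram with the given number of bins."""
    # Imported lazily so the rest of the sketch runs even where matplotlib is unavailable.
    from matplotlib import pyplot
    pyplot.hist(sample, bins=bins)
    pyplot.show()

sample = make_sample()
print(f"{sample.size} values, mean={sample.mean():.2f}, std={sample.std():.2f}")
```

Calling plot_histogram(sample, bins=10) should then display a roughly bell-shaped histogram.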

Using the pyplot library, you plotted the distribution as a histogram. As you can see, the shape of the histogram resembles a bell curve.

Figure 5: Histogram

While plotting a histogram, it is important to plot it using the right number of bins. In the above diagram, you used 10 bins. See what happens if you use 4 bins. 

Figure 6: Histogram with 4 bins

As you can see, this histogram doesn’t resemble a bell shape as much as the one with 10 bins. This can make it hard to recognize the type of distribution.
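
The effect of the bin count can also be seen without a plot. The sketch below, which assumes NumPy and an illustrative sample of 1,000 standard normal values, prints the bar heights that a 10-bin and a 4-bin histogram would have.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)

# Compute the bar heights (counts) and bin edges for each bin setting.
histograms = {bins: np.histogram(sample, bins=bins) for bins in (10, 4)}

for bins, (counts, edges) in histograms.items():
    width = edges[1] - edges[0]
    print(f"{bins} bins, width {width:.2f}: {counts.tolist()}")
```

With 10 bins the counts rise and fall gradually around the center, while 4 wide bins lump most values into the middle bars, which hides the shape of the curve.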

2. Performing Parametric Density Estimation

Now, see how to perform parametric density estimation. First, generate a sample of 1,000 values from a normal distribution with a mean of 50 and a standard deviation of 5.

Figure 7: Generating Samples

To perform parametric estimation, assume that you don't know the distribution of these samples. The first thing that you need to do with the sample is to assume a distribution for it. Let’s assume a normal distribution. The parameters associated with a normal distribution are the mean and the standard deviation. Calculate the mean and standard deviation of the samples.

Figure 8: Calculating mean and standard deviation
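
A minimal sketch of these two steps, assuming NumPy (the original code is shown only as images, and the seed and variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# 1,000 values from a normal distribution with mean 50 and standard deviation 5.
sample = rng.normal(loc=50, scale=5, size=1000)

# Estimate the two parameters of the assumed normal distribution from the data.
sample_mean = sample.mean()
sample_std = sample.std()  # divides by n; pass ddof=1 for the n - 1 version
print(f"Mean={sample_mean:.3f}, Standard Deviation={sample_std:.3f}")
```

The estimates should land near 50 and 5 but will not match them exactly, because they are computed from a finite random sample.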

Now, define a normal distribution with the above mean and standard deviation.

Figure 9: Normal distribution

Now, evaluate the probability density of the distribution defined above over a range of values.

Figure 10: Probability distribution for normal distribution

Now, plot the distribution you’ve defined on top of the sample data.

Figure 11: Plotting distribution on samples
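
The steps above (define the distribution, evaluate its density, and compare it with the sample) can be sketched as follows. This is an illustration under assumptions, not the article's code: the normal density is written out with NumPy instead of using a statistics library, and the helper names normal_pdf and plot_fit, the seed, and the range 30 to 69 are hypothetical choices.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=5, size=1000)

# Fit: estimate the parameters of the assumed normal distribution.
fitted_mean, fitted_std = sample.mean(), sample.std()

# Evaluate the fitted density over a range of values covering the sample.
values = np.arange(30, 70)
densities = normal_pdf(values, fitted_mean, fitted_std)

def plot_fit(sample, values, densities, bins=10):
    """Overlay the fitted density on a density-normalized histogram."""
    from matplotlib import pyplot  # lazy import: only needed for plotting
    pyplot.hist(sample, bins=bins, density=True)
    pyplot.plot(values, densities)
    pyplot.show()

# Numeric check: the fitted curve should track the histogram heights.
heights, edges = np.histogram(sample, bins=10, density=True)
centers = (edges[:-1] + edges[1:]) / 2
gap = np.max(np.abs(heights - normal_pdf(centers, fitted_mean, fitted_std)))
print(f"largest gap between histogram and fitted density: {gap:.4f}")
```

Calling plot_fit(sample, values, densities) draws the curve over the histogram. Passing density=True matters here: it rescales the bars so that their total area is 1, which puts the histogram on the same scale as the density curve.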

As you can see, the distribution you assumed is almost a perfect fit for the samples. This suggests that the sample follows a normal distribution. If the fit were poor, you would have to assume some other distribution for the sample and repeat the process.

3. Performing Non-Parametric Density Estimation

It’s time to perform non-parametric estimations now. You start by importing some modules needed for it.

Figure 12: Importing necessary modules

To perform non-parametric estimation, first create a sample that does not fit any common distribution by generating two normal samples and joining them together.

Figure 13: Creating a sample
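
One way to build such a sample, assuming NumPy, is shown below. The means, standard deviations, and sizes of the two parts are illustrative values, since the original code is available only as an image.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two normal samples with different centers (parameter values are illustrative).
sample1 = rng.normal(loc=20, scale=5, size=300)
sample2 = rng.normal(loc=40, scale=5, size=700)

# Join them into a single sample with two peaks.
sample = np.hstack((sample1, sample2))
print(sample.shape)
```

Because the two parts are centered at different values and have different sizes, the combined sample has two peaks of unequal height, which no single normal curve can match.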

Now, plot the distribution to see what it looks like.                        

Figure 14: Plotting the distribution

Now, use Kernel density estimation to get a model, which you can then fit to your sample to create a probability distribution curve.                        

Figure 15: Creating a Kernel Density Estimation Function

You will now find the probability distribution for your kernel density estimation function.

Figure 16: Finding the probability distribution for the kernel density estimation function

Finally, plot the function on top of your samples. 

Figure 17: Plotting distribution on samples
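
The three steps above (create the model, find the densities, plot them over the sample) might look like the sketch below. It assumes scikit-learn's KernelDensity class, which may or may not be what the original images used; the Gaussian kernel, the bandwidth of 2, the sample parameters, and the helper name plot_estimate are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
sample = np.hstack((
    rng.normal(loc=20, scale=5, size=300),
    rng.normal(loc=40, scale=5, size=700),
))

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
model = KernelDensity(kernel="gaussian", bandwidth=2.0)
model.fit(sample.reshape(-1, 1))

# score_samples returns log-densities, so exponentiate to get densities.
values = np.arange(0, 61, dtype=float).reshape(-1, 1)
densities = np.exp(model.score_samples(values))

def plot_estimate(sample, values, densities, bins=50):
    """Overlay the estimated density on a density-normalized histogram."""
    from matplotlib import pyplot  # lazy import: only needed for plotting
    pyplot.hist(sample, bins=bins, density=True)
    pyplot.plot(values[:, 0], densities)
    pyplot.show()

# The values are 1 apart, so the sum approximates the area under the estimate.
print(f"area under the estimate on [0, 60]: {densities.sum():.3f}")
```

Calling plot_estimate(sample, values, densities) draws the estimated curve over the histogram. Re-running the sketch with a different bandwidth value is a simple way to see how the smoothness of the curve changes.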

You can see that the estimations of the kernel density estimation fit the samples pretty well. To further fine-tune the fit, you can change the bandwidth of the function.

Conclusion 

In this tutorial on ‘Everything You Need to Know About the Probability Density Function’, you learned what a probability density function is in statistics. You then looked at how to find the probability density function, both in theory and in Python.

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.
