For discrete variables, probabilities are straightforward and can be calculated easily. But a continuous variable can take on infinitely many values, so the probability of it equaling any single exact value is zero; probabilities are instead assigned to ranges of values. The function that describes how probability is distributed for such variables is called a probability density function in statistics.
What Is the Probability Density Function?
A function that defines the relationship between a random variable and its probability, such that you can find the probability of the variable using the function, is called a Probability Density Function (PDF) in statistics.
Variables are mainly of two types:
1. Discrete Variable: A variable that can only take on certain definite values within a specific range is called a discrete variable. Its values are separated by finite intervals, e.g., the sum of two dice. On rolling two dice and adding up the resulting outcomes, the result can only belong to a set of numbers not exceeding 12 (as the maximum result of a single die is 6). The values are also definite.
2. Continuous Variable: A continuous random variable can take on infinitely many different values within a range of values, e.g., the amount of rainfall occurring in a month. The rain observed can be 1.7 cm, but the exact value is not known. It can, in actuality, be 1.701, 1.7687, etc. As such, you can only define the range of values it falls into. Within this range, it can take on infinitely many different values.
Now, consider a continuous random variable X with a probability density function f(x), which describes how probability is distributed over the values X can take. After plotting the PDF, you get a graph as shown below:
Figure 1: Probability Density Function
In the above graph, you get a bell-shaped curve after plotting the function against the variable. The blue curve shows this. Now consider the probability that the variable is at most some value b. To find it, you find the area under the curve to the left of b; this is written P(X <= b). To find the probability of the variable falling between points a and b, you find the area under the curve between a and b. Since that area is the area up to b minus the area up to a, you can represent it as:
P(a <= X <= b) = P(X <= b) - P(X <= a).
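This interval probability can be checked numerically. Below is a minimal sketch, assuming X follows a standard normal distribution (an illustrative choice, not from the tutorial) and using SciPy:

```python
# Probability that a random variable X falls between a and b is the
# area under its PDF from a to b. Here X is assumed standard normal.
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0

# Area under the PDF, computed by numerical integration.
area, _ = quad(norm.pdf, a, b)

# The same probability from the cumulative distribution function.
prob = norm.cdf(b) - norm.cdf(a)

print(round(area, 4), round(prob, 4))  # both are about 0.6827
```

Both approaches agree: integrating the PDF over [a, b] gives the same result as differencing the CDF at the endpoints.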
Consider the graph below, which shows the rainfall distribution in a year in a city. The x-axis has the rainfall in inches, and the y-axis has the probability density function. The probability of some amount of rainfall is obtained by finding the area of the curve on the left of it.
Figure 2: Probability Density Function of the amount of rainfall
For the probability of at most 3 inches of rainfall, you find the area under the curve to the left of the vertical line at 3 on the x-axis. Reading this off the graph tells you that the probability of at most 3 inches of rainfall is less than or equal to 0.5.
How to Find the Probability Density Function in Statistics?
Below are the three main steps:
Step 1: Summarizing the density with a histogram: You first convert the data into discrete form by plotting it as a histogram. A histogram is a graph that groups the values into bins along the x-axis and draws a bar for each bin, giving you a count of the values falling in that bin. The number of bins is crucial, as it determines how many bars the histogram will have and how wide they are, which in turn shapes how your density is plotted.
Step 2: Performing Parametric Density Estimation: A PDF can take on a shape similar to many standard functions. The shape of the histogram will help you determine which type of function it is. You can calculate the parameters associated with that function to get your density. To check whether the function is a good fit for your histogram, you can:
- Plot the density function and compare its shape with the histogram
- Compare samples of the function with actual samples
- Use a statistical test
Step 3: Performing Non-Parametric Density Estimation: In cases where the shape of the histogram doesn't match a common probability density function, or cannot be made to fit one, you estimate the density from all the samples in the data using certain algorithms. One such algorithm is Kernel Density Estimation (KDE). It uses a mathematical function to estimate and smooth the probabilities so that the total area under the resulting curve is always 1. To do this, you need the following parameters:
- Smoothing Parameter (bandwidth): Controls the number of samples used to estimate the probability of a new point.
- Basis Function: Helps to control the distribution of samples.
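The effect of these parameters can be sketched with SciPy's `gaussian_kde`, where `bw_method` plays the role of the smoothing parameter and the Gaussian kernel is the basis function (the sample below is an illustrative assumption, not data from the tutorial):

```python
import numpy as np
from scipy.stats import gaussian_kde

# An illustrative sample: 1000 draws from a normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=1000)

# bw_method is the smoothing parameter (bandwidth): smaller values use
# narrower kernels and give a bumpier estimate; larger values, smoother.
kde_narrow = gaussian_kde(sample, bw_method=0.1)
kde_smooth = gaussian_kde(sample, bw_method=0.5)

# Evaluate both density estimates on a grid of points.
x = np.linspace(30, 70, 200)
narrow_density = kde_narrow(x)
smooth_density = kde_smooth(x)
```

Whatever the bandwidth, the estimated curve still integrates to 1; only its smoothness changes.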
How to Implement the Probability Density Function in Python?
You will see how to find the probability density function of a random sample with the help of Python. You start by importing the necessary modules, which will help you plot the histogram and find the distribution.
Figure 3: Importing necessary modules
1. Plotting a Histogram
Now generate a random sample that has a probability density function resembling a bell-shaped curve. This type of probability distribution is called a Normal Distribution.
Figure 4: Plotting a histogram
Using the pyplot library, you plotted the distribution as a histogram. As you can see, the shape of the histogram resembles a bell curve.
Figure 5: Histogram
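The code in Figures 3–5 appears only as screenshots, so here is a minimal sketch of what it plausibly looks like, assuming NumPy and Matplotlib (the seed and sample parameters are illustrative choices, not taken from the figures):

```python
import matplotlib.pyplot as plt
import numpy as np

# Generate a random sample whose density resembles a bell curve
# (a normal distribution); the seed is added for reproducibility.
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=5, size=1000)

# Summarize the density with a 10-bin histogram. density=True scales
# the bar heights so the total area of the bars is 1, matching a PDF.
counts, bin_edges, _ = plt.hist(sample, bins=10, density=True)
plt.xlabel("value")
plt.ylabel("density")
plt.show()
```

Changing `bins=10` to `bins=4` reproduces the effect discussed next: fewer, wider bars that obscure the bell shape.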
While plotting a histogram, it is important to plot it using the right number of bins. In the above diagram, you used 10 bins. See what happens if you use 4 bins.
Figure 6: Histogram with 4 bins
As you can see, this histogram doesn’t resemble a bell shape as much as the one with 10 bins. This can make it hard to recognize the type of distribution.
2. Performing Parametric Density Estimation
Now, see how to perform parametric density estimation. First, generate 1,000 samples from a normal distribution with a mean of 50 and a standard deviation of 5.
Figure 7: Generating Samples
To perform parametric estimation, pretend you don't know the distribution these samples came from. The first thing you need to do is assume a distribution for them. Let's assume a normal distribution, whose parameters are the mean and standard deviation. Calculate the mean and standard deviation of the samples.
Figure 8: Calculating mean and standard deviation
Now, define a normal distribution with the above mean and standard deviation.
Figure 9: Normal distribution
Now, evaluate the probability density function of the normal distribution defined above over a range of values.
Figure 10: Probability distribution for normal distribution
Now, plot the distribution you’ve defined on top of the sample data.
Figure 11: Plotting distribution on samples
As you can see, the distribution you assumed is almost a perfect fit for the samples. This means the sample does follow a normal distribution. If the fit were poor, you would have to assume some other distribution for the sample and repeat the process.
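Figures 7–11 show this workflow as screenshots; below is a self-contained sketch of the same steps, assuming NumPy, Matplotlib, and SciPy (the seed is an illustrative addition; the mean, standard deviation, and sample size follow the text above):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Generate 1000 samples with mean 50 and standard deviation 5.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=1000)

# Estimate the parameters of the assumed normal distribution.
sample_mean = sample.mean()
sample_std = sample.std()

# Define a normal distribution with those parameters and evaluate
# its PDF over the range of the sample.
dist = norm(loc=sample_mean, scale=sample_std)
x = np.linspace(sample.min(), sample.max(), 200)
pdf = dist.pdf(x)

# Overlay the fitted PDF on a density-scaled histogram of the sample.
plt.hist(sample, bins=10, density=True)
plt.plot(x, pdf)
plt.show()
```

If the fitted curve tracks the histogram closely, the normal assumption is reasonable; otherwise, try a different distribution and repeat.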
3. Performing Non-Parametric Density Estimation
It’s time to perform non-parametric estimations now. You start by importing some modules needed for it.
Figure 12: Importing necessary modules
To perform non-parametric estimation, you combine two normal samples into a single sample that does not fit any known common distribution.
Figure 13: Creating a sample
Now, plot the distribution to see what it looks like.
Figure 14: Plotting the distribution
Now, use Kernel density estimation to get a model, which you can then fit to your sample to create a probability distribution curve.
Figure 15: Creating a Kernel Density Estimation Function
You will now evaluate the probability distribution given by your kernel density estimation function.
Figure 16: Probability distribution from the Kernel Density Estimation function
Finally, plot the function on top of your samples.
Figure 17: Plotting distribution on samples
You can see that the estimations of the kernel density estimation fit the samples pretty well. To further fine-tune the fit, you can change the bandwidth of the function.
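Figures 12–17 walk through this process as screenshots; here is a plausible self-contained sketch using scikit-learn's `KernelDensity` (the sample means, sizes, bandwidth, and seed are illustrative assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

# Join two normal samples to get a bimodal sample that no single
# common distribution fits.
rng = np.random.default_rng(0)
sample = np.concatenate([
    rng.normal(loc=20, scale=5, size=300),
    rng.normal(loc=40, scale=5, size=700),
])

# Fit a kernel density estimate: bandwidth is the smoothing parameter,
# and the Gaussian kernel is the basis function.
model = KernelDensity(bandwidth=2.0, kernel="gaussian")
model.fit(sample.reshape(-1, 1))

# score_samples returns log-densities, so exponentiate to get the PDF.
x = np.linspace(sample.min(), sample.max(), 200).reshape(-1, 1)
density = np.exp(model.score_samples(x))

# Plot the estimated PDF on top of a density-scaled histogram.
plt.hist(sample, bins=50, density=True)
plt.plot(x[:, 0], density)
plt.show()
```

Adjusting `bandwidth` is the fine-tuning step mentioned above: smaller values follow the sample more closely, larger values smooth it more aggressively.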
What Is the Central Limit Theorem (CLT) and How Does It Relate to PDFs?
The Central Limit Theorem (CLT) states that as the sample size grows, the distribution of the sample means will tend to be normal, regardless of the original data distribution. For instance, if you take many samples of 50 people's weights from a population, the averages of these samples will form a normal distribution, even if the population's weight distribution is not normal.
This theorem is crucial for understanding Probability Density Function (PDF) in stats because it justifies the use of normal distribution in statistical analysis. For example, when estimating the average height of a population using sample means, the normal PDF can model these estimates effectively due to the CLT. This makes the normal distribution a valuable tool for making predictions and analyzing data.
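The CLT described above can be checked with a small simulation; here is a sketch assuming NumPy, with an exponential population chosen purely for illustration:

```python
import numpy as np

# Simulate the CLT: draw many samples of size 50 from a clearly
# non-normal population (exponential with mean 10, an assumed choice)
# and compute each sample's mean.
rng = np.random.default_rng(0)
sample_means = rng.exponential(scale=10, size=(10_000, 50)).mean(axis=1)

# The sample means cluster around the population mean (10) with a
# spread of roughly scale / sqrt(n) = 10 / sqrt(50), about 1.41, and
# a histogram of them looks approximately normal despite the skewed
# population.
print(sample_means.mean(), sample_means.std())
```

Even though individual exponential draws are heavily skewed, the distribution of the 10,000 sample means is close to the bell shape the CLT predicts.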
Probability Density Function (PDF) vs. Cumulative Distribution Function (CDF)
Let's look at the distribution function vs density function comparison to understand the differences between these two concepts:
| Aspect | Probability Density Function (PDF) | Cumulative Distribution Function (CDF) |
|---|---|---|
| Definition | Describes the likelihood of a random variable falling within a small interval around a particular value. | Gives the probability that a random variable is less than or equal to a specific value. |
| Purpose | Represents the probability density over a continuous range, indicating how likely different outcomes are. | Shows how probabilities accumulate as you move through the range of possible values. |
| Representation | Depicted as a smooth curve that shows probability density. The area under the curve within an interval represents probability. | Plotted as a curve that is always non-decreasing. It reflects cumulative probabilities. |
| Behavior | Can vary in height and is not necessarily increasing; the area under the curve sums to 1. | Non-decreasing; as the variable's value increases, the probability either increases or remains the same. |
| Probability Calculation | Does not give probabilities for specific values but rather the density, used to find probabilities over intervals by integrating. | Provides the probability for specific values or ranges by looking up the CDF value at a given point. |
| Example Use Case | To determine the probability that a student's score falls between two values, you would use the PDF and integrate over that interval. | If you need to find the probability of a student's test score being below a certain threshold, you use the CDF. |
| Graph Characteristics | The PDF graph is a smooth curve where the total area under the curve equals 1. | The CDF graph continuously increases and flattens out as it approaches the maximum value of the variable. |
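The PDF/CDF relationship above can be illustrated numerically; here is a sketch using SciPy, with a hypothetical test-score distribution (the mean of 70 and standard deviation of 10 are assumptions chosen for illustration):

```python
from scipy.stats import norm

# A hypothetical distribution of test scores.
scores = norm(loc=70, scale=10)

# CDF use case: probability a score is below a threshold.
p_below_80 = scores.cdf(80)

# PDF use case: probability a score falls between two values, i.e. the
# area under the PDF from 60 to 80 (computed here as a CDF difference).
p_60_to_80 = scores.cdf(80) - scores.cdf(60)

print(round(p_below_80, 4), round(p_60_to_80, 4))
```

Note how the interval probability comes from the cumulative values at the endpoints: the CDF accumulates exactly the area that the PDF spreads out.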
Conclusion
In this tutorial on the Probability Density Function, we gave an overview of the probability density function in statistics, showed how to find it both in theory and in Python, and explained the difference between the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).
If you are keen on learning more about the probability density function and related statistical concepts, you have come to the right place. Simplilearn's Data Analytics Program is one of the most comprehensive online programs out there for this. It will help you master several skills and tools, including SQL, R, Python, data visualization, and predictive analytics. Explore and enroll today!
FAQs
1. What is the Z probability density function?
The Z probability density function describes how values are spread in a standard normal distribution, where the average is 0 and the standard deviation is 1. It's often depicted as a bell-shaped curve, known as the Bell Curve. This function helps us determine the probability of a value falling within a specific range by finding the area under the curve between two z-scores, which are standard units of measurement in statistics.
2. What does probability density function refer to?
A Probability Density Function shows how likely it is for a continuous random variable to fall within a certain range of values. Unlike discrete distributions that give the probability of exact values, the PDF tells us the probability density over intervals. The total area under the PDF curve equals 1, representing the certainty that the variable will fall somewhere within its possible range.
3. Why do you need probability density estimation?
Probability density estimation is important for continuous data because you can't assign a non-zero probability to individual values. Instead, the PDF estimates how likely it is for the variable to fall within a certain range. This helps in understanding the overall shape of the data distribution, making it easier to analyze patterns and make predictions based on how values are spread out across different intervals.