Welcome to the Introduction to statistical analysis and business applications tutorial offered by Simplilearn. The tutorial is a part of the Python for Data Science Certification Training Course.
Let us begin with the objectives in the next section.
In this statistical analysis and business applications tutorial, you will learn -
To define statistics and differentiate between statistical and non-statistical analysis.
The two major categories of statistical analysis, namely descriptive and inferential analysis and their differences.
To describe the statistical analysis process
To calculate mean, median, mode, and percentile
Representation of data distribution using various methods, such as histogram, bell curve, and kurtosis.
Hypothesis testing and the chi-square test
The types of frequencies
The correlation matrix.
Let us begin this tutorial by defining the term statistics.
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. It is widely used to understand the complex problems of the real world and simplify them to make well-informed decisions.
Several statistical principles, functions, and algorithms can be used to analyze primary data, build a statistical model, and predict the future outcome.
In the next section let us look at the statistical and non-statistical analysis.
An analysis of any situation can be done in two ways: statistical or non-statistical.
Let us now learn about the two.
Statistical Analysis is -
Scientific
Based on numbers or statistical values
Useful in providing complete insight into the data
Non-statistical Analysis is:
Based on very generic information
Exclusive of statistical or quantitative analysis
Although both forms of analysis provide results, quantitative analysis provides more insight and a clearer picture. This is why statistical analysis is important for businesses.
In the next section, we will look at the major categories of statistics.
There are two major categories of statistics
Descriptive analytics and
Inferential analytics.
Descriptive analytics helps organize the data and focuses on the main characteristics of the data. It provides a concise summary of the data. You can summarize data, numerically or graphically. For example, you can collect all the data about the visitors to your website for a week and summarize the data.
Inferential analytics generalizes to the larger data set and applies probability theory to arrive at a conclusion. In this approach, a random sample of data is taken from a population and used to describe and make inferences about the population.
Inferential statistics is valuable when it is not convenient or possible to examine each member of an entire population.
Let's understand both categories better through an example. Suppose you want to study the height of a person in an entire population. You can do this in two ways.
If you were to record the height of each person in the population, it would be a tedious process. Instead, if you categorize height as tall, medium, and small, and then take only a sample from the population, this is an inferential analysis.
In the descriptive analysis method, you would record the height of every person in the population and then provide the data for the maximum height, minimum height, and average height of the population.
In the next section, let us look at the statistical analysis considerations.
You need to keep in mind certain considerations to make your study meaningful and systematic. The various statistical analysis considerations are -
Purpose - The purpose of a statistical analysis should be clear and well-defined.
Document Questions - Prepare a questionnaire that you will ask the population in advance.
Define Population of Interest - Select the population of your study based on the purpose of analysis.
Determine Sample - Define the sample for your study based on the purpose of the study.
In the next section, let us look at population and sampling.
The population consists of various samples. The samples together represent the population.
A sample is a part or piece drawn from the large population. For statistical analysis, it can be treated as a subset of the population.
A sample should be random to help ensure that it has all the characteristics of the population and is also representative of the population.
Let us understand the terms statistics and parameters, which we will be using throughout this tutorial.
In the next section, let us look at Statistics and Parameters.
Statistics are quantitative values, calculated from the sample. Parameters are the characteristics of the population.
Suppose we have X0, X1, X2, ..., Xn, a sample from a population, and we want to know some vital information, such as the average, the most frequently occurring characteristic, and so on. These are calculated as shown in the picture below.
From the image we have -
Mean: Mean is the average, a typical value present in the distribution. It is calculated by summing the values and dividing the sum by the number of values.
Variance: Variance measures the sample variability.
Standard Deviation: Standard deviation explains how spread out the data is from the mean. The greater the standard deviation, the greater the spread of the data.
Standard deviation is measured in the same units as the mean.
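For reference, the standard textbook formulas for these three quantities can be written as follows. This is a sketch that assumes n observations labelled x_1, ..., x_n and the usual n - 1 divisor for the sample variance. The symbols \bar{x}, s^2, and s are notation chosen for this illustration and are not taken from the original picture.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}
```

Because s is the square root of s^2, it carries the same units as the data and the mean.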
Certain terms are used in statistical analysis to understand the data and gain insight into it. Some of these terms are as follows:
Search: Search is typically used to find unusual data. Unusual data refers to data that does not meet the parameters set at the beginning.
Inspect: Inspect refers to studying the shape of the data set and determining how spread out it is.
Characterize: Characterize determines the central tendency of the data.
Conclusion: Based on the understanding of search, inspect, and characterize, we can draw some preliminary or high-level conclusions about the data.
There are four steps in the statistical analysis process, they are -
Step 1: Find the population of interest that suits the purpose of statistical analysis.
Step 2: Draw a random sample that represents the population.
Step 3: Compute sample statistics to describe the spread and shape of the dataset.
Step 4: Make inferences using the sample and calculations. Apply it back to the population.
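Let us see how the four steps could look in Python. This is a minimal sketch using only the standard library. The population values, the sample size of 200, and the seed are all made up for illustration.

```python
import random
import statistics

# Step 1: a hypothetical population of interest (heights in cm, made-up values).
population = [150 + (i % 41) for i in range(10_000)]

# Step 2: draw a random sample; the fixed seed only makes the run repeatable.
rng = random.Random(42)
sample = rng.sample(population, k=200)

# Step 3: compute sample statistics that describe the centre and the spread.
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)

# Step 4: use the sample statistics as estimates of the population parameters.
print(f"estimated population mean: {sample_mean:.1f} cm")
print(f"estimated population std dev: {sample_std:.1f} cm")
print(f"actual population mean: {statistics.mean(population):.1f} cm")
```

In a real study the last line would not be available, because the whole point of sampling is that the full population cannot be measured.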
Data distribution is the collection of data values that are arranged in order along with the relative frequency and occurrences. To understand any kind of problem, it is important to describe the data in terms of its spread and shape using graphical techniques.
Range: Range of the data indicates the span of its quantitative values, from the minimum to the maximum.
Frequency: Frequency indicates the number of occurrences of any particular data value in the given data set, including the minimum and maximum frequency.
Central Tendency: Central tendency indicates whether the data values accumulate in the middle of distribution or toward the end.
The measures of central tendency are -
Mean,
Median, and
Mode.
Mean is the sum of all the values in the data set divided by the number of values
Median is the data value right in the middle of the data set, or the fiftieth percentile
Mode is the data value, which is the most common or frequent.
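The three measures can be computed with Python's built-in statistics module. This is a small sketch; the list of values is a made-up example.

```python
import statistics

# Hypothetical data set, chosen so that the three measures differ.
values = [1, 2, 2, 3, 4, 7, 9]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```

Here the mean is pulled upward by the larger values 7 and 9, while the median and mode stay closer to where most of the values sit.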
Let us now understand percentile in data distribution. A percentile or a centile is a measure used in statistics indicating the value below which a given percentage of observation in a group of observations falls.
For example, the twentieth percentile is the value or score below which twenty percent of the observations may be found. Usually, we report percentiles as below. The twenty-fifth percentile is called the first quartile. The fiftieth percentile is called the median, or second quartile. The seventy-fifth percentile is called the third quartile.
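Quartiles can be sketched with statistics.quantiles from the standard library. The scores below are made up, and the "inclusive" method is one of several interpolation conventions, so other tools may report slightly different cut points for the same data.

```python
import statistics

# Hypothetical test scores, already sorted for readability.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# n=4 splits the data into four equal groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

print(f"first quartile (25th percentile): {q1}")
print(f"second quartile (50th percentile, median): {q2}")
print(f"third quartile (75th percentile): {q3}")
```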
Dispersion, also called variability, scatter, or spread, denotes how stretched or squeezed a distribution is.
Range: The difference between the maximum and minimum values
Inter-quartile Range: Difference between the 25th and 75th percentiles
Variance: A measure of how far the data values spread around the mean
Standard Deviation: Square root of the variance, measured in the same units as the data
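The four measures of dispersion can be sketched in a few lines of Python. The data values are hypothetical, and the sample versions of variance and standard deviation (dividing by n - 1) are assumed here.

```python
import statistics

# Hypothetical data set.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Inter-quartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance and its square root, the sample standard deviation.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print(data_range, iqr, variance, round(std_dev, 2))
```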
The histogram is a graphical representation of data distribution.
The features of a Histogram include:
It was first introduced by Karl Pearson.
To construct a Histogram, the first step is to bin the range of values. That is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval.
The bins are usually specified as consecutive, non-overlapping intervals of a variable.
The bins must be adjacent and are usually of equal size
In the graphical representation, each bar represents a group of values, also called a bin.
The height of the bar represents the frequency of the values in the bin.
Histograms help assess the probability distribution of a given variable by depicting the frequencies of the observations occurring in a certain range of values.
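The binning step can be sketched without any plotting library. The function name histogram_counts, the ages, and the bin settings are all assumptions made for this illustration.

```python
def histogram_counts(values, start, bin_width, n_bins):
    """Count the values that fall into consecutive, equal-width bins.

    Bin i covers the half-open interval
    [start + i * bin_width, start + (i + 1) * bin_width).
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // bin_width)
        if 0 <= index < n_bins:  # values outside the binned range are ignored
            counts[index] += 1
    return counts


# Hypothetical ages of ten customers.
ages = [21, 22, 25, 31, 33, 34, 35, 41, 47, 58]
counts = histogram_counts(ages, start=20, bin_width=10, n_bins=4)

# Print a text histogram: one bar per bin, bar length equals the frequency.
for i, count in enumerate(counts):
    lower = 20 + i * 10
    print(f"[{lower}, {lower + 10}): {'#' * count}")
```

Each printed bar corresponds to one bin, and its length is the frequency of the values in that bin.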
A normal distribution is the most commonly used distribution in statistics. It is characterized by its bell shape and its two parameters, mean and standard deviation.
Bell curve or normal distribution is -
Symmetric around the mean.
If you draw a line at the center, you'll get the symmetric shapes on both sides.
The mean, median, and mode of a normal distribution are equal.
Normal distributions are denser in the center and less dense in the tails or sides.
Normal distributions are defined by two parameters, the mean and the standard deviation.
The bell curve is also known as the Gaussian curve.
Let us understand the Bell curve better in the next section.
The bell curve is divided into three parts to understand the data distribution better.
They are -
Peak: It is where most of the observations occur. Generally, the peak is within one standard deviation from the mean.
Flanks: They are the areas beyond the peak but between one and two standard deviations from the mean.
Tails: They refer to the area far from the center of the distribution and are considered to be beyond two standard deviations from the mean. Usually, five percent or less of the data falls in the tails.
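The share of observations in each of the three parts can be checked with statistics.NormalDist from the standard library. This sketch assumes a standard normal distribution (mean 0, standard deviation 1); the helper name share_within is made up for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


peak = share_within(1)                      # within one standard deviation
flanks = share_within(2) - share_within(1)  # between one and two
tails = 1 - share_within(2)                 # beyond two

print(f"peak: {peak:.1%}, flanks: {flanks:.1%}, tails: {tails:.1%}")
```

Roughly 68 percent of the observations fall in the peak, about 27 percent in the flanks, and a little under 5 percent in the tails.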
Skewed data distribution indicates that the tendency of data distribution is to be more spread out on one side than the other. In the graphical representation shown below,
The data is left skewed.
Mean is less than the median.
The distribution is negatively skewed, that is, its skewness statistic is negative.
The left tail is longer, so more of the distribution stretches out to the left.
In the graphical representation shown below,
The data is right skewed, or positively skewed.
Mean is greater than the median.
The right tail is longer, so more of the distribution stretches out to the right.
Let's now learn about kurtosis, another popular measure of data distribution. Like skewness, kurtosis is a descriptor of the shape of a probability distribution. And just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample of a population.
Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis. Kurtosis measures the tendency of the data towards the center or towards the tails. A platykurtic distribution has a negative kurtosis statistic. A mesokurtic distribution represents a normal distribution curve. A leptokurtic distribution has a positive kurtosis statistic.
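Skewness and kurtosis can be estimated from data in several ways. The sketch below assumes the simple moment-based definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores zero. The function names and the data sets are made up for illustration.

```python
import statistics


def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: negative = left skewed, positive = right skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


right_skewed = [1, 1, 2, 2, 3, 10]        # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]  # mirror image stretches the left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 3), round(excess_kurtosis(data), 3),
          "mean:", round(statistics.mean(data), 3), "median:", statistics.median(data))
```

In the right-skewed data the mean is greater than the median, and in the left-skewed data the mean is less than the median, which matches the descriptions above. The flat symmetric data has negative excess kurtosis, so it is platykurtic.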
Hypothesis testing is an inferential statistical technique to determine whether a certain condition is true for the entire population. For example, in a manufacturing company, the hypothesis is that every dress manufactured has no defect. Studying each dress manufactured and noting the defects as they occur will prove or disprove the hypothesis that every dress is flawless.
Hypothesis testing studies two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis.
Alternative Hypothesis (H1) | Null Hypothesis (H0) |
---|---|
A statement that has to be concluded as true. | The null hypothesis is the statement that has to be tested. A statement of “no effect” or “no difference”. |
It’s a research hypothesis. | It’s the logical opposite of the alternative hypothesis. |
It needs a significant level of evidence to support the initial hypothesis. | It indicates that the alternative hypothesis is incorrect. |
If the alternative hypothesis garners strong evidence, reject the null hypothesis. | Weak evidence for the alternative hypothesis means that you fail to reject the null hypothesis. |
The table shown below explains how a decision can be made based upon the hypothesis testing.
Decision | H0 is True | H0 is False |
---|---|---|
Fail to Reject Null | Correct | Type II Error |
Reject Null | Type I Error | Correct |
Type I error: Rejecting the null hypothesis when it is true. The probability of making a Type I error is represented by α.
Type II error: Failing to reject the null hypothesis when it is false. The probability of making a Type II error is represented by β.
p-value: The probability of observing a value as extreme as, or more extreme than, the one calculated from the collected data, assuming the null hypothesis is true.
Let us understand the process of hypothesis testing in the next section.
There are four steps to the hypothesis testing process:
The first step is to set the hypotheses, null and alternative. The null hypothesis, or H0, states that a population parameter is equal to a value. The alternative hypothesis, or H1, states that the population parameter is different from the value stated in the null hypothesis. The alternative hypothesis is what is believed to be true, or is to be proven true.
The second step is to set alpha, that is, to choose a significance level for the test.
The third step is to collect the sample from the population, which represents the characteristics of the population.
The final step is to compare the p-value and alpha. You reject the null hypothesis if the p-value is less than alpha and fail to reject the null hypothesis if the p-value is greater than or equal to alpha.
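The four steps can be sketched with a two-sided, one-sample z-test, which is only one of many possible tests. The sketch assumes the population standard deviation is known. The function name, the sample values, the hypothesized mean of 100, and the standard deviation of 15 are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_two_sided(sample, mu0, sigma):
    """Two-sided one-sample z-test, assuming a known population std dev sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Probability of a z value at least this extreme in either direction under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Step 1: H0 says the population mean equals 100; H1 says it differs from 100.
mu0 = 100
# Step 2: choose the significance level.
alpha = 0.05
# Step 3: collect a sample (hypothetical measurements).
sample = [95, 100, 105, 110, 110, 110, 115, 120, 125]
# Step 4: compare the p-value with alpha.
z, p_value = z_test_two_sided(sample, mu0, sigma=15)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up numbers the p-value is just below 0.05, so the null hypothesis is rejected at the 5 percent level but would not be rejected at a stricter level such as 1 percent.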
The example given here shows how clinical trials can be analyzed.
Suppose a pharmaceutical company wants to compare a medicine that it manufactures with a competitor's medicine. Hypothesis testing can be the method it adopts. The null hypothesis would be that both the medicines are equally effective. The alternative hypothesis would be that the two medicines are not equally effective.
There are three types of data on which you can perform hypothesis testing, they are -
Continuous data evaluates the mean, median, standard deviation, or variance. If you take the same example to test the efficacy of medicine, you would take the temperature of every person in the sample three hours after administering the medicine. This would be referred to as continuous data.
Binomial data evaluates the percentage in a general classification of data. When data is divided into two categories, you obtain binomial data. Suppose the same population was asked if their fever had subsided and the answers could be yes or no. Then the percentage who said yes would be compared with the null hypothesis.
Poisson data evaluates the rate of occurrence or frequency. Suppose the sample is asked how many times in a month they use the medicine. The rate is recorded and then compared with the null hypothesis, under which the rate should be less than a certain number.
Let us now understand the different types of variables to analyze categorical data.
Nominal variables have values with no logical ordering, they are independent of each other, and the sequence does not matter. For example, in a restaurant, you can order burgers, soda, and coffee in any order.
Ordinal variables have values in a logical order; however, the relative distance between two data values is not clear. For example, large, medium, and small coffee sizes.
Association indicates whether two variables are associated or independent of each other. For example, in the first data set, the weather conditions do not affect the train schedule, but in the second data set, they do. Associated variables have dependencies, and one changes if the other changes.
Chi-square test is a hypothesis test that compares the observed distribution of your data to an expected distribution of data. The test is usually applied when there are two categorical variables from a single population. Chi-square test is used for -
Test of Association: Chi-square test is used to determine whether there is a significant association between the two variables.
Test of Independence: Chi-square test is used to test the independence or association between categorical variables.
Goodness of Fit: Chi-square test is used to determine whether a statistical model fits the data adequately.
Let us understand the chi-square test through an example. We have a data set of male purchasers and female purchasers. Let us assess whether the probability of females purchasing items of $500 or more, which is 0.45, is significantly different from the probability of males purchasing items of $500 or more, which is 0.25.
Null hypothesis: There is no association between gender and purchase. The probability of purchase does not change for $500 or more, whether female or male.
Alternative hypothesis: There is an association between gender and purchase. The probability of purchase over $500 is different for female and male.
Let us look at the types of frequencies in the next section.
Expected frequencies or fe: The cell frequencies that are expected in a bivariate table if the two variables are statistically independent.
Observed frequencies or fo: The cell frequencies that are actually observed in a bivariate table.
If there is no association between gender and purchase, then the observed frequency would be equal to the expected frequency. If there was an association between gender and purchase behavior, then the observed frequency will not be equal to the expected frequency.
The formula for calculating chi-square is shown below.
χ² = Σ (fo - fe)² / fe
Chi-square is the sum, over all possible categories, of the squared difference between the observed frequency and the expected frequency, divided by the expected frequency. Expected and observed frequencies require no assumption about the underlying population. Both frequencies require random sampling.
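The chi-square calculation for the purchase example can be sketched in Python. The tutorial gives only the probabilities 0.45 and 0.25, so the counts below assume 100 female and 100 male purchasers. That sample size is an assumption made for this illustration. The p-value shortcut used here is valid only for one degree of freedom, which is what a 2 x 2 table has.

```python
from math import erfc, sqrt

# Observed frequencies (fo), assuming 100 purchasers of each gender:
# columns are [purchased $500 or more, purchased less than $500].
observed = [
    [45, 55],  # female
    [25, 75],  # male
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequencies (fe) if gender and purchase were independent.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square: sum of (fo - fe)^2 / fe over every cell of the table.
chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

# With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))

print(f"expected: {expected}")
print(f"chi-square = {chi_square:.3f}, p-value = {p_value:.4f}")
```

Under these assumed counts the p-value is well below 0.05, so the null hypothesis of no association between gender and purchase would be rejected. With a smaller sample the same two proportions could lead to a different conclusion.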
A simple matrix can be visualized as an Excel document with three rows and three columns.
A correlation matrix is always a square matrix, that is, it has the same number of rows and columns, but it can be larger.
A correlation matrix is expressed in the form of an n x n matrix.
When we compare n variables, the covariance is calculated by getting the sample covariance between each pair of the variables in question.
A correlation coefficient measures the extent to which two variables tend to change together. The coefficient describes both the strength and the direction of the relationship.
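A correlation matrix can be built from the covariances by dividing each covariance by the two standard deviations involved. The sketch below uses plain Python lists; the function names and the three data columns are made up for illustration.

```python
from math import sqrt


def covariance_matrix(columns):
    """Sample covariance between every pair of columns (divides by n - 1)."""
    n_obs = len(columns[0])
    means = [sum(col) / n_obs for col in columns]
    size = len(columns)
    return [
        [
            sum((a - means[i]) * (b - means[j])
                for a, b in zip(columns[i], columns[j])) / (n_obs - 1)
            for j in range(size)
        ]
        for i in range(size)
    ]


def correlation_matrix(columns):
    """Scale each covariance by the two standard deviations to get a value in [-1, 1]."""
    cov = covariance_matrix(columns)
    size = len(cov)
    return [
        [cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(size)]
        for i in range(size)
    ]


# Three hypothetical variables with five observations each.
columns = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [2, 1, 4, 3, 5],
]

for row in correlation_matrix(columns):
    print([round(value, 2) for value in row])
```

The result is a 3 x 3 matrix with ones on the diagonal, because every variable is perfectly correlated with itself, and it is symmetric, because the correlation of A with B equals the correlation of B with A.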
The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.
Spearman rank order correlation is also called Spearman's rho. The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, though not necessarily at a constant rate.
The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
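The difference between the two coefficients shows up clearly on data that is monotonic but not linear. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, with tied values given their average rank. The function names and the data are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength and direction of the linear relationship."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


def ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranked values."""
    return pearson(ranks(x), ranks(y))


# y always rises with x, but not at a constant rate (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Spearman's coefficient is exactly 1 because the ranks agree perfectly, while Pearson's coefficient is below 1 because the relationship is not a straight line.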
Here is the example of a correlation matrix calculated for the stock market. We can choose the data variables and calculate the correlation matrix to demonstrate the short, medium, and long-term relationships between them. Now you can see why a correlation matrix would prove useful for traders.
We know that inferential statistics uses a random sample from the data to make inferences about the population. This is a valuable method when each and every member of the population cannot be studied.
Ideally, inferential statistics is used under the following conditions -
A complete list of the members of the population is available.
A random sample has been drawn from this population.
Using a pre-established formula, you have determined that your sample size is large enough.
In practice, inferential statistics can be used even if your data does not meet these criteria.
Inferential statistics can help determine the strength of the relationship within your sample.
If it is very difficult to obtain a population list or draw a random sample, then you do the best you can with what you have.
Inferential statistics is an effective forecasting tool widely used in businesses. For example, companies use it to predict their finances for future quarters. Using current data, future patterns can be inferred.
Inferential Statistics has its uses in almost every field such as business, medicine, data science, and so on.
Inferential Statistics -
Is an effective tool for forecasting.
Is used to predict future patterns.
Let us summarize what we have learned in this statistical analysis and business applications tutorial:
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
Statistical analysis is more reliable, as it is based on numbers and scientific analysis when compared to non-statistical analysis.
Descriptive and inferential are the two major categories of statistics. Descriptive takes into account all the members of the population, while inferential uses only a sample.
Mean, median, and mode are measures of central tendency, while variance and standard deviation measure the spread of data.
The spread of a distribution is called dispersion and is graphically represented by a histogram and a bell curve.
The shape of the distribution is described by its skewness (left skewed or right skewed) and its kurtosis.
Hypothesis testing is an inferential statistical technique that is useful for forecasting future patterns.
Chi-square test is a hypothesis test that compares the observed distribution to an expected distribution.
The correlation coefficient, presented with the help of a correlation matrix, measures the extent to which two variables tend to change together.
This concludes the lesson on statistical analysis and business applications. The next lesson will discuss the Python environment setup.