We live in an information-driven world, one where data is king. Unsurprisingly, it’s necessary that we analyze the pertinent data to make crucial business decisions. Regression is one of the more widely used data analysis techniques. The field of machine learning is growing and with that growth comes a popular algorithm: linear regression. In this article, you will learn about linear regression in R and how it works.
Why Linear Regression?
Before we try to understand what linear regression is, let’s quickly explore the need for a linear regression algorithm by means of an analogy.
Imagine that we need to predict the number of skiers at a resort based on the area’s snowfall. The easiest way would be to plot a simple graph with snowfall amounts on the X-axis and the number of skiers on the Y-axis. Based on the graph, we could infer that as the amount of snowfall increases, the number of skiers increases too.
Hence, the graph makes it easy to see the relationship between skiers and snowfall. The number of skiers increases in direct proportion to the amount of snowfall. Based upon the knowledge the graph imparts, we can make better decisions relating to the operations of a ski area.
To understand linear regression, we need to understand the term “regression” first. Regression is used to find relationships between a dependent variable (Y) and one or more independent variables (X). Here, the independent variables are known as the predictors or explanatory variables, and the dependent variable is referred to as the response or target variable.
A linear regression’s equation looks like this:
y = B0 + B1x1 + B2x2 + B3x3 + ....
Where B0 is the intercept (the value of y when every x is 0)
B1, B2, B3 are the slopes
x1, x2, x3 are the independent variables
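As a rough illustration of how the equation is evaluated, the sketch below plugs a single predictor into y = B0 + B1x1 for the skiing example. The coefficient values and snowfall amounts are made up for this illustration and do not come from any real dataset.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- 50    # intercept: skiers expected when snowfall is 0
b1 <- 12    # slope: additional skiers per extra cm of snowfall

snowfall_cm <- c(0, 10, 25)   # assumed snowfall amounts

# Evaluate y = B0 + B1 * x1 for each snowfall value
predicted_skiers <- b0 + b1 * snowfall_cm
print(predicted_skiers)   # 50 170 350
```

With more predictors, each one simply adds another slope-times-value term to the sum.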
In the skiing example, snowfall is the independent variable and the number of skiers is the dependent variable. Since regression finds relationships between dependent and independent variables, what exactly is linear regression?
What is Linear Regression?
Linear regression is a form of statistical analysis that shows the relationship between two or more continuous variables. It creates a predictive model using relevant data to show trends. Analysts typically use the “least squares method” to create the model. There are other methods, but the least squares method is the most commonly used.
Below is a graph that depicts the relationship between the heights and weights of a sample of people. The red line is the linear regression line, which shows that a person’s height is positively related to their weight.
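A similar picture can be sketched with the `women` dataset that ships with base R (heights and weights of 15 women). This is not the data behind the graph described above; it is only a convenient stand-in for fitting a least-squares line with `lm` and drawing it in red.

```r
# `women` comes with base R (datasets package): height in inches, weight in pounds
fit <- lm(weight ~ height, data = women)

coef(fit)   # intercept and slope of the fitted line

# Scatter plot of the data with the fitted regression line in red
plot(women$height, women$weight, xlab = "Height (in)", ylab = "Weight (lb)")
abline(fit, col = "red")
```

A positive slope for `height` corresponds to the upward-sloping red line.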
Now that we understand what linear regression is, let’s learn how linear regression works and how we use the linear regression formula to derive the regression line.
How Does Linear Regression Work?
We can better understand how linear regression works by using the example of a dataset that contains two fields, Area and Rent, and is used to predict the house’s rent based on the area where it is located. The dataset is:
As you can see, we are using a simple dataset for our example. Using this uncomplicated data, let’s have a look at how linear regression works, step by step:
1. With the available data, we plot a graph with Area on the X-axis and Rent on the Y-axis. The graph will look like the following. Notice that it is a linear pattern with a slight dip.
2. Next, we find the mean of Area and Rent.
3. We then plot the mean on the graph.
4. We draw a line of best fit that passes through the mean.
5. But we encounter a problem. As you can see below, multiple lines can be drawn through the mean:
6. To overcome this problem, we keep moving the line until we find the one with the smallest sum of squared distances from the data points.
7. This least-squares value is found by adding up the squares of the residuals.
8. A residual is the distance between the actual value (Y-actual) and the predicted value (Y-pred).
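9. The values of m and c for the best fit line, y = mx + c, can be calculated using these formulas: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄, where x̄ and ȳ are the means of Area and Rent.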
10. This helps us find the corresponding values:
11. With that, we can obtain the values of m & c.
12. Now, we can find the value of Y-pred.
13. After calculating, we find that the least-squares value (the sum of squared residuals) for the line below is 3.02.
14. Finally, we plot Y-pred, which turns out to be the best fit line.
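The steps above can be sketched in a few lines of R. The Area and Rent values here are made up for illustration (they are not the article's dataset), and the closed-form expressions for m and c are the standard least-squares formulas; the result is cross-checked against R's built-in `lm`.

```r
# Hypothetical data, made up for illustration (not the article's dataset)
area <- c(500, 750, 1000, 1250, 1500)    # assumed Area values
rent <- c(900, 1250, 1150, 1700, 2000)   # assumed Rent values

# Steps 2-3: means of Area and Rent
mean_area <- mean(area)
mean_rent <- mean(rent)

# Step 9: slope m and intercept c of the least-squares line
m  <- sum((area - mean_area) * (rent - mean_rent)) / sum((area - mean_area)^2)
c0 <- mean_rent - m * mean_area   # named c0 to avoid confusion with R's c() function

# Steps 12-13: predictions, residuals, and the sum of squared residuals
y_pred <- m * area + c0
residuals_manual <- rent - y_pred
sse <- sum(residuals_manual^2)

# Cross-check against R's built-in least-squares fit
fit <- lm(rent ~ area)
print(c(intercept = c0, slope = m))
print(coef(fit))
print(sse)
```

Because c is computed from the means, the fitted line always passes through the point (mean Area, mean Rent), which matches step 4.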
This shows how the linear regression algorithm works. Now let's move on to our use case.
Use Case: Revenue Prediction Using Linear Regression
Predicting the revenue from paid, organic, and social media traffic using a linear regression model in R.
We will now look at a real-life scenario where we will predict the revenue by using regression analysis in R. The sample dataset we will be working with is shown below:
In this demo, we will work with the following three attributes to predict the revenue:
- Paid Traffic - Traffic coming through advertisement
- Organic Traffic - Traffic from search engines, which is non-paid
- Social Traffic - Traffic coming in from various social networking sites
We will be making use of multiple linear regression. The linear regression formula is:
Before we begin, let’s have a look at the program’s flow:
- Generate inputs using CSV files
- Import the required libraries
- Split the dataset into train and test
- Apply the regression on paid traffic, organic traffic, and social traffic
- Validate the model
So let’s start our step-by-step linear regression demo! Since we will perform linear regression in RStudio, we will open that first.
We type the following code in R:
# Import the dataset
sales <- read.csv('Mention your download path')
head(sales)     # Displays the first 6 rows of the dataset
summary(sales)  # Gives summary statistics for each column
The output will look like below:
dim(sales)      # Displays the dimensions of the dataset
Now, we move on to plotting the variables.
plot(sales)  # Plot the variables to see their trends
Let’s now see how the variables are correlated to each other. For that, we’ll take only the numeric column values.
library(corrplot)                      # Library to visualize the correlation between the variables
num.cols <- sapply(sales, is.numeric)  # Flag the numeric columns
num.cols
cor.data <- cor(sales[, num.cols])     # Correlation matrix of the numeric columns
cor.data
corrplot(cor.data, method = 'color')
As you can see from the above correlation matrix, the variables have a high degree of correlation with each other and with the Revenue variable.
Let’s now split the data into training and testing sets.
# Split the data into training and testing sets
set.seed(2)
library(caTools)  # caTools provides the sample.split function
# sample.split returns a logical vector that marks about 70% of the rows TRUE (training) and the remaining 30% FALSE (testing)
split <- sample.split(sales$Revenue, SplitRatio = 0.7)
split
train <- subset(sales, split == TRUE)   # Creating a training set from the rows marked TRUE
test <- subset(sales, split == FALSE)   # Creating a testing set from the rows marked FALSE
head(train)
head(test)
View(train)
View(test)
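If caTools is not installed, a similar 70/30 split can be sketched with base R's `sample`. The data frame below is a simulated stand-in for the sales data: the column names other than Revenue, and all of the values, are assumptions made for this example.

```r
# Hypothetical stand-in for the sales data; Paid, Organic and Social are assumed column names
set.seed(2)
demo <- data.frame(
  Paid    = runif(100, 0, 100),
  Organic = runif(100, 0, 100),
  Social  = runif(100, 0, 100)
)
demo$Revenue <- 100 + 2 * demo$Paid + 1.5 * demo$Organic +
  0.5 * demo$Social + rnorm(100, sd = 5)

# Draw 70% of the row indices at random for training
train_idx <- sample(seq_len(nrow(demo)), size = round(0.7 * nrow(demo)))

train_demo <- demo[train_idx, ]    # 70 rows for fitting
test_demo  <- demo[-train_idx, ]   # remaining 30 rows for validation

dim(train_demo)
dim(test_demo)
```

Splitting on row indices keeps every row in exactly one of the two sets, so the test rows are never seen while fitting.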
Now that we have the test and train variables, let’s go ahead and create the model:
Model <- lm(Revenue ~ ., data = train)  # Creates the model. Here, lm stands for linear model. Revenue is the target variable we want to predict.
summary(Model)
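To get a feel for what `lm` estimates in a multiple regression, here is a minimal sketch on simulated data where the true coefficients are known in advance. The column names Paid, Organic and Social and the coefficient values are assumptions made for this example, not properties of the article's dataset.

```r
# Simulated data with known coefficients (all values are assumptions for illustration)
set.seed(42)
n <- 200
sim <- data.frame(
  Paid    = runif(n, 0, 100),
  Organic = runif(n, 0, 100),
  Social  = runif(n, 0, 100)
)
# True relationship: Revenue = 100 + 2*Paid + 1.5*Organic + 0.5*Social + noise
sim$Revenue <- 100 + 2 * sim$Paid + 1.5 * sim$Organic +
  0.5 * sim$Social + rnorm(n, sd = 5)

# Fit Revenue against all other columns, as in the article's model
sim_model <- lm(Revenue ~ ., data = sim)

# The estimated slopes should land close to 2, 1.5 and 0.5
round(coef(sim_model), 2)
```

Each estimated slope is read as the expected change in Revenue for a one-unit increase in that traffic source while the other two are held fixed.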
# Prediction
pred <- predict(Model, test)  # The test data was kept for this purpose
pred                          # Displays the predicted values
res <- residuals(Model)       # Find the residuals of the fitted model
res <- as.data.frame(res)     # Convert the residuals into a data frame
res                           # Prints the residuals
# Compare the predicted vs actual values
results <- cbind(pred, test$Revenue)
results
colnames(results) <- c('predicted', 'real')
results <- as.data.frame(results)
head(results)
# Let's now compare the predicted vs actual values, starting with the actual test revenue
plot(test$Revenue, type = 'l', lwd = 1.8, col = "red")
The output of the above command is shown below in a graph that shows the actual revenue in the test set.
Now let’s add the predicted revenue to the same plot with the following command:
lines(pred, type = "l", col = "blue")  # The output looks like below
Let’s go ahead and plot the prediction on its own with the following command:
plot(pred, type = "l", lwd = 1.8, col = "blue")  # The output looks like below; this graph shows the predicted revenue
From the above output, we can see that the graphs of the predicted revenue and the actual revenue are very close. Let’s check the accuracy so we can validate the comparison.
# Calculating the accuracy
rmse <- sqrt(mean((pred - test$Revenue)^2))  # Root Mean Square Error: the square root of the average squared prediction error on the test set
rmse
The output looks like below:
You can see that this model’s accuracy is sound. This brings us to the end of the demo.
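For reference, the RMSE calculation from the validation step can be sketched as a small helper and checked by hand on a tiny example. The function name `compute_rmse` and the numbers below are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical helper: root mean square error between actual and predicted values
compute_rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((predicted - actual)^2))   # square the errors first, then average, then take the root
}

actual    <- c(3, 5, 7)   # made-up actual values
predicted <- c(2, 5, 9)   # made-up predictions

# Errors are -1, 0, 2, so the squared errors are 1, 0, 4
compute_rmse(actual, predicted)   # sqrt(5 / 3), about 1.29
```

RMSE is expressed in the same units as the target variable, so it is easiest to judge relative to the typical size of Revenue.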
Conclusion
Now you can see why linear regression is necessary, what a linear regression model is, and how the linear regression algorithm works. You also had a look at a real-life scenario in which we used RStudio to predict revenue from our dataset. You learned about various commands and packages and saw how to plot a graph in RStudio. Although this is a good start, there is still much more to discover about linear regression.
Want to Learn More?
If this has piqued your interest in advancing your career in data science, check out Simplilearn’s Data Science Certification Course, co-developed with IBM. This comprehensive course will help you develop your expertise in data science using the R and Python programming languages. You will also learn about regression analysis in depth, including linear regression.
Data scientists are some of the most sought after IT professionals in the world today, so what are you waiting for?