Getting Started with Linear Regression in R

We live in an information-driven world, one where data is king. To make sound business decisions, we need to analyze the relevant data. Regression is one of the most widely used data analysis techniques, and linear regression is among the most popular algorithms in the growing field of machine learning. In this article, you will learn about linear regression in R and how it works.

Why Linear Regression?

Before we try to understand what linear regression is, let’s quickly explore the need for a linear regression algorithm by means of an analogy. 

Imagine that we need to predict the number of skiers at a resort based on the area’s snowfall. The easiest way would be to plot a simple graph with snowfall amounts on the X-axis and the number of skiers on the Y-axis. Based on the graph, we could infer that as the amount of snowfall increases, the number of skiers tends to increase as well.

Hence, the graph makes it easy to see the relationship between skiers and snowfall: the number of skiers rises roughly in proportion to the amount of snowfall. Based on the knowledge the graph imparts, we can make better decisions about the operations of a ski area.
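As a minimal sketch of that graph, the snippet below plots snowfall against skier counts. The numbers and the names `snowfall_cm` and `skiers` are made up purely for illustration and do not come from any real resort.

```r
# Hypothetical data, invented for illustration only
snowfall_cm <- c(10, 20, 30, 40, 50, 60)
skiers      <- c(120, 210, 290, 405, 480, 610)

# Scatter plot: snowfall on the X-axis, number of skiers on the Y-axis
plot(snowfall_cm, skiers,
     xlab = "Snowfall (cm)", ylab = "Number of skiers",
     main = "Skiers vs. snowfall (made-up data)", pch = 19)

# A correlation close to 1 indicates a strong positive linear relationship
cor(snowfall_cm, skiers)
```

If the points line up along a rising straight line, a linear model is a reasonable first choice for this kind of data.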

To understand linear regression, we need to understand the term “regression” first. Regression is used to find relationships between a dependent variable (Y) and one or more independent variables (X). Here, the independent variables are known as the predictors or explanatory variables, and the dependent variable is referred to as the response or target variable.

A linear regression’s equation looks like this:

y = B0 + B1x1 + B2x2 + B3x3 + ...

Where B0 is the intercept (the value of y when every x is 0)

B1, B2, B3 are the slopes (coefficients) of the corresponding independent variables

x1, x2, x3 are the independent variables
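To make the equation concrete, here is a small sketch with a single independent variable. It fits a line to made-up snowfall and skier counts (the data frame `ski` and its columns are hypothetical), reads off B0 and B1, and checks that plugging a value into y = B0 + B1x1 gives the same answer as R's `predict()`.

```r
# Hypothetical data, invented for illustration only
ski <- data.frame(
  snowfall_cm = c(10, 20, 30, 40, 50, 60),
  skiers      = c(120, 210, 290, 405, 480, 610)
)

# Fit y = B0 + B1 * x1 with skiers as y and snowfall as x1
fit <- lm(skiers ~ snowfall_cm, data = ski)

b0 <- coef(fit)[["(Intercept)"]]  # intercept B0
b1 <- coef(fit)[["snowfall_cm"]]  # slope B1

# Evaluate the equation by hand for 35 cm of snowfall
by_hand <- b0 + b1 * 35

# The same prediction using predict()
by_predict <- predict(fit, newdata = data.frame(snowfall_cm = 35))

all.equal(by_hand, unname(by_predict))  # TRUE: both give the same value
```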

In the skiing example, snowfall is the independent variable and the number of skiers is the dependent variable. So, if regression finds relationships between dependent and independent variables, what exactly is linear regression?

What is Linear Regression?

Linear regression is a form of statistical analysis that models the relationship between a continuous dependent variable and one or more independent variables as a straight line. It creates a predictive model from relevant data to show trends. Analysts typically use the “least squares method” to fit the model. Other methods exist, but least squares is the most commonly used.

Below is a graph that depicts the relationship between the heights and weights of a sample of people. The red line is the fitted regression line, which shows that a person’s height is positively related to their weight.

linear-reg-height-wt
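A similar picture can be reproduced with the `women` dataset that ships with base R (average heights and weights of 15 women). This is not the data behind the figure above, only a convenient stand-in for a quick sketch.

```r
# 'women' is a built-in dataset: height in inches, weight in pounds
data(women)

# Fit weight as a linear function of height using least squares
fit <- lm(weight ~ height, data = women)

# Scatter plot of the data with the fitted regression line in red
plot(women$height, women$weight,
     xlab = "Height (in)", ylab = "Weight (lb)", pch = 19)
abline(fit, col = "red")

# A positive slope means weight tends to increase with height
coef(fit)
```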

Now that we understand what linear regression is, let’s learn how linear regression works and how we use the linear regression formula to derive the regression line.

How Does Linear Regression Work?

We can better understand how linear regression works by using a small dataset with two fields, Area and Rent, where the goal is to predict a house’s rent from its area. The dataset is:

area-image

As you can see, we are using a simple dataset for our example. Using this uncomplicated data, let’s have a look at how linear regression works, step by step:

1. With the available data, we plot a graph with Area on the X-axis and Rent on the Y-axis. The graph will look like the following. Notice that the pattern is roughly linear, with a slight dip.

rent-area

2. Next, we find the mean of Area and Rent.

mean-area-rent

3. We then plot the mean on the graph.


4. We draw a line of best fit that passes through the mean.

rent-area

5. But we encounter a problem. As you can see below, multiple lines can be drawn through the mean: 

rent-area-multiple-lines

6. To overcome this problem, we keep moving the line until we find the one with the smallest sum of squared distances from the data points. This is the best fit line.


best-fit-line

7. The least-squares distance is found by adding up the squares of the residuals.


adding-square

8. The residual is the distance between the actual value (Y-actual) and the predicted value (Y-pred).

rent-residual

9. The values of m and c for the best fit line, y = mx + c, can be calculated using these formulas:


value-m-c-1

10. This helps us find the corresponding values:

corresponding-values

11. With that, we can obtain the values of m & c.

value-m-c-3

12. Now, we can find the value of Y-pred.

pred-y.

13. After calculating, we find that the sum of squared residuals for the line below is 3.02.

least-square

14. Finally, we plot Y-pred, which turns out to be the best fit line.

plot-the-y-pred
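The steps above can be sketched in a few lines of R. The Area and Rent values below are hypothetical and are not the article's dataset, so the numbers will differ from the figures. The sketch uses the standard least-squares formulas m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and c = mean(y) - m * mean(x), and the intercept is stored as `c0` to avoid masking R's `c()` function.

```r
# Hypothetical Area/Rent values, NOT the article's dataset
area <- c(500, 750, 1000, 1250, 1500)
rent <- c(900, 1250, 1150, 1700, 2000)

# Step 2: means of Area and Rent
x_bar <- mean(area)
y_bar <- mean(rent)

# Steps 9-11: slope m and intercept c of the best fit line y = m * x + c
m  <- sum((area - x_bar) * (rent - y_bar)) / sum((area - x_bar)^2)
c0 <- y_bar - m * x_bar

# Step 12: predicted values
rent_pred <- m * area + c0

# Steps 7-8 and 13: residuals and the sum of their squares
res <- rent - rent_pred
sse <- sum(res^2)

# Steps 5-6: any other line through the mean point has a larger sum of squares
sse_for_slope <- function(slope) {
  intercept <- y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)
  sum((rent - (slope * area + intercept))^2)
}
c(best = sse, flatter = sse_for_slope(0.8 * m), steeper = sse_for_slope(1.2 * m))

# Cross-check against R's built-in least-squares fit
fit <- lm(rent ~ area)
all.equal(unname(coef(fit)), c(c0, m))
```

The last line confirms that the hand-computed intercept and slope match what `lm()` returns for the same data.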

This shows how the linear regression algorithm works. Now let's move on to our use case.

Use Case: Revenue Prediction with Linear Regression

In this use case, we predict revenue from paid, organic, and social media traffic using a linear regression model in R.

We will now look at a real-life scenario where we will predict the revenue by using regression analysis in R. The sample dataset we will be working with is shown below:

sample-dataset.

In this demo, we will work with the following three attributes to predict the revenue:

  1. Paid Traffic - Traffic coming through advertisements
  2. Organic Traffic - Non-paid traffic from search engines
  3. Social Traffic - Traffic coming in from various social networking sites

traffic

We will be making use of multiple linear regression. The linear regression formula is:

multiple-linear-regression

Before we begin, let’s have a look at the program’s flow:

  1. Generate inputs using csv files
  2. Import the required libraries
  3. Split the dataset into train and test
  4. Apply the regression on paid traffic, organic traffic, and social traffic
  5. Validate the model 

So let’s start our step-by-step linear regression demo! Since we will perform linear regression in RStudio, we will open that first.

We type the following code in R:

# Import the dataset

sales <- read.csv('Mention your download path')

head(sales) #Displays the top 6 rows of a dataset

summary(sales) #Gives certain statistical information about the data. The output will look like below:

head-sales

dim(sales) # Displays the dimensions of the dataset

dim-sales
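If you do not have the CSV file at hand, the sketch below builds a simulated stand-in so the same inspection commands can be tried. Everything in it is an assumption made for illustration: the name `sales_demo`, the column names `Paid`, `Organic`, and `Social`, and the coefficients used to generate `Revenue` are invented and are not taken from the article's dataset.

```r
# Simulated stand-in data; all values and column names are hypothetical
set.seed(42)
n <- 100

sales_demo <- data.frame(
  Paid    = round(runif(n, 1000, 5000)),  # visits from advertisements
  Organic = round(runif(n, 2000, 8000)),  # visits from search engines
  Social  = round(runif(n, 500, 3000))    # visits from social networks
)

# Revenue is simulated as a linear combination of traffic plus random noise
sales_demo$Revenue <- 50 + 0.8 * sales_demo$Paid + 0.5 * sales_demo$Organic +
  0.3 * sales_demo$Social + rnorm(n, sd = 200)

head(sales_demo)     # first 6 rows
summary(sales_demo)  # min, quartiles, mean, and max of each column
dim(sales_demo)      # number of rows and columns
```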

Now, we move on to plotting the variables.

plot(sales) # Plot the variables to see their trends

plot-variables

Let’s now see how the variables are correlated to each other. For that, we’ll take only the numeric column values.

library(corrplot) # Library used to visualize the correlation between the variables

num.cols<-sapply(sales, is.numeric)

num.cols

cor.data<-cor(sales[,num.cols])

cor.data

corrplot(cor.data, method='color')


cor
correlation-matrix

As you can see from the above correlation matrix, the variables are highly correlated with each other and with the Revenue variable.

Let’s now split the data into training and testing sets.
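To read such a matrix programmatically rather than visually, one option is to pull out the column for the target variable and rank the predictors by the strength of their correlation. The sketch below uses only base R and the built-in `mtcars` dataset as a stand-in, with `mpg` playing the role of the target.

```r
# Correlation matrix of all numeric columns (every column of mtcars is numeric)
cor_matrix <- cor(mtcars)

# Correlation of each variable with the target, dropping the target itself
cor_with_target <- cor_matrix[, "mpg"]
cor_with_target <- cor_with_target[names(cor_with_target) != "mpg"]

# Rank predictors by absolute correlation with the target
sort(abs(cor_with_target), decreasing = TRUE)
```

Values near 1 or -1 point to a strong linear relationship with the target; values near 0 suggest the variable carries little linear information about it.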

# Split the data into training and testing

set.seed(2)

library(caTools) #caTools has the split function 

split <- sample.split(sales$Revenue, SplitRatio = 0.7) # sample.split takes the target column and returns a logical vector with one entry per row. With a SplitRatio of 0.7, about 70% of the rows are marked TRUE (training) and the remaining 30% FALSE (testing)

split

train <- subset(sales, split == TRUE) # Creating a training set from the rows marked TRUE

test <- subset(sales, split == FALSE) # Creating a testing set from the rows marked FALSE

head(train)

head(test)

View(train)

View(test)
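For readers without caTools installed, a comparable 70/30 split can be sketched with base R's `sample()`. The built-in `mtcars` dataset stands in for the sales data here, and this is an alternative technique rather than the one used in the demo.

```r
set.seed(2)  # makes the random split reproducible

n <- nrow(mtcars)

# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_demo <- mtcars[train_idx, ]   # rows selected for training
test_demo  <- mtcars[-train_idx, ]  # all remaining rows for testing

nrow(train_demo)
nrow(test_demo)
```

Because the test rows are defined as everything not in `train_idx`, no row can end up in both sets.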

Now that we have the test and train variables, let’s go ahead and create the model:

Model <- lm(Revenue ~., data = train) #Creates the model. Here, lm stands for linear model. Revenue is the target variable, and the dot means all other columns are used as predictors.

summary(Model) 

call-formula
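The printed summary can also be accessed piece by piece, which is handy when you want to use the numbers in later code. The sketch below fits a small multiple regression on the built-in `mtcars` dataset as a stand-in for the sales model and extracts the main components of its summary.

```r
# Stand-in model: miles per gallon explained by weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

s$coefficients   # estimate, standard error, t value, and p-value per term
s$r.squared      # share of variance in the target explained by the model
s$adj.r.squared  # R-squared adjusted for the number of predictors
s$sigma          # residual standard error
```

A low p-value for a coefficient suggests that the corresponding predictor contributes to the model, while R-squared summarizes how well the model fits the training data overall.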

# Prediction

pred <- predict(Model, test) #The test data was kept for this purpose

pred #This displays the predicted values 

res<-residuals(Model) # Find the residuals

res<-as.data.frame(res) # Convert the residual into a dataframe

res # Prints the residuals

# compare the predicted vs actual values

results<-cbind(pred,test$Revenue)

results

colnames(results)<-c('predicted','real')

results<-as.data.frame(results)

head(results)

head-results

# Plot the actual test revenue as a red line

plot(test$Revenue, type = 'l', lty = 1, col = "red")

The output of the above command is a graph of the actual revenue in the test set.

predicted-revenue

Now let’s overlay the predicted revenue on the same graph with the following command:

lines(pred, type = "l", col = "blue") # Adds the predicted values to the existing plot

To view the predictions on their own, plot them separately with the following command:

plot(pred, type = "l", lty = 1, col = "blue") # The output looks like below; this graph shows the predicted revenue

 pred-index

From the above output, we can see that the graphs of the predicted revenue and the actual revenue are very close. Let’s measure the prediction error so we can validate the comparison.

# Calculating the accuracy

rmse <- sqrt(mean((pred - test$Revenue)^2)) # Root Mean Square Error: the square root of the average squared difference between predicted and actual values

rmse

 The output looks like below:

rmse
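Parenthesis placement matters in this formula. The sketch below wraps the calculation in a small helper (the function name `rmse_of` and the toy vectors are hypothetical) and contrasts it with what happens when the square is applied to the mean error instead of to each error.

```r
# RMSE: square each error, average the squares, then take the square root
rmse_of <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((predicted - actual)^2))
}

# Toy values, invented for illustration only
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)

rmse_of(actual, predicted)        # sqrt(17 / 3), about 2.38

# Squaring the mean error instead lets positive and negative errors cancel
sqrt(mean(predicted - actual)^2)  # 1, which understates the error
```

Because RMSE is expressed in the same units as the target, it is easiest to interpret next to the typical size of the actual values.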

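A small RMSE relative to the scale of the Revenue values indicates that the predictions are close to the actual figures, and on that basis this model’s accuracy is sound. This brings us to the end of the demo.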

Conclusion

Now you can see why linear regression is useful, what a linear regression model is, and how the linear regression algorithm works. You also looked at a real-life scenario in which we used RStudio to predict revenue from our dataset. Along the way, you learned about various commands and packages and saw how to plot graphs in RStudio. Although this is a good start, there is still much more to discover about linear regression.

About the Author

Shruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
