Linear Regression in Python

“Artificial intelligence,” “big data,” and “machine learning” are some of the most searched science-related terms on the Internet these days. Most of us are increasingly adopting AI in our daily lives, sometimes without realizing we’re doing it. AI-based products are capable of performing human-like activities because machine learning algorithms work as their brain. Linear regression is one of the most common machine learning algorithms. 

In this article, we will explore Linear Regression in Python and a few related topics:

  • Machine learning algorithms
  • Applications of linear regression 
  • Understanding linear regression
  • Multiple linear regression 
  • Use case: profit estimation of companies

Before turning to linear regression in Python, let's take a brief look at machine learning algorithms.

Machine Learning Algorithms 

Machine learning algorithms are commonly grouped into three categories:

  • Supervised 
  • Unsupervised 
  • Reinforcement 

We will deal only with supervised learning here, because that’s where linear regression fits in. Supervised learning uses labeled data, meaning each training example comes with a known answer, to build a model that can come up with answers for new inputs. The two most common uses for supervised learning are:

  • Regression 
  • Classification 

Three common types of regression are:

  • Simple linear regression 
  • Multiple linear regression
  • Polynomial linear regression 

We begin with some of the places where linear regression is applied.

Applications of Linear Regression in Python

Let’s look at a few applications of linear regression.

Economic Growth 

Linear regression can be used to forecast the economic growth of a country or a state in the upcoming quarter. It can also be used to predict a nation’s gross domestic product (GDP).

Product Price 

Linear regression can be used to predict the future price of a product, including whether the price is likely to go up or down.

Housing Sales

Linear regression can be used to estimate the number of houses a builder will sell in the coming months and at what price.

Score Predictions 

Linear regression can be used to predict the number of runs a baseball player will score in upcoming games based on previous performance.

Understanding Linear Regression in Python

Linear regression is a statistical model used to describe the relationship between independent and dependent variables, and to make predictions from it, by examining two factors:

  1. Which variables, in particular, are significant predictors of the outcome variable?
  2. How significant is the regression line in terms of making predictions with the highest possible accuracy?

To understand the terms “dependent” and “independent variable,” let’s take a real-world example. Imagine that we want to predict future crop yields based on the amount of rainfall, using data regarding past crops and rainfall amounts.

Independent Variable

The value of an independent variable does not change based on the effects of other variables. It is the input used to explain or predict the dependent variable, and it is often denoted by an “x.” In our example, rainfall is the independent variable: we can’t control the rain, but the rain affects the crop. In other words, the independent variable drives the dependent variable.

Dependent Variable

The value of a dependent variable changes when the values of the independent variables change. It is often denoted by a “y.” In our example, the crop yield is the dependent variable, and it depends on the amount of rainfall.

Regression Equation 

The simplest linear regression equation with one dependent variable and one independent variable is:

y = m*x + c

Here, m is the slope of the line and c is the intercept (the value of y when x is 0).

Look at this graphic:

Regression Equation - Graphic

We have plotted two points, (x1,y1) and (x2,y2). Let’s discuss the example of crop yield used earlier in the article, and plot the crop yield based on the amount of rainfall. Here, rainfall is the independent variable and crop yield is the dependent variable.

Consider these graphs:

Regression Graphs

Here, we’ve drawn a line through the middle of the data. The red point on the y-axis is the crop yield you can expect for the amount of rainfall (x) represented by the green dot.

If we have an idea about the amount of rainfall for a year, then we can predict how plentiful our crop will be.

Next, let us look at the reasoning behind the regression line.

Reasoning Behind the Regression Line

Let’s consider a sample data set with five rows and find out how to draw the regression line. We’ll take two columns of data in which x is the independent variable and y is the dependent variable:

x    y
1    2
2    4
3    5
4    4
5    5

This is a graph with the data plotted:

Regression line-graph with data plotted

Next, we calculate the means, or average values, of x and y. The average of the x values is 3, and the average of the y values is 4. 

We plot the point given by the two means, (3, 4), on the graph. The regression line passes through this point.

Regression Line - Graph

Now we’ll discuss the regression line equation. The computation is:

Regression Line Equation - Computation

We have calculated the values of x², y² and x*y for each row in order to calculate the slope and intercept of the line. The calculated values are:

m = 0.6

c = 2.2

The linear equation is:

y = m*x + c
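The slope and intercept can be reproduced in a few lines of Python. The computation graphic is not included in this text, so the sketch below assumes the standard least-squares formulas written in terms of the sums of x, y, x² and x*y. The function name fit_line is ours, chosen for illustration.

```python
# Sample data from the five-row table in this section
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Standard least-squares slope (assumed to match the article's graphic)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The intercept follows because the line passes through the two means
    c = (sum_y - m * sum_x) / n
    return m, c

m, c = fit_line(xs, ys)
print(f"m = {m:.1f}, c = {c:.1f}")  # m = 0.6, c = 2.2
```

The y² column is not needed for the slope or the intercept. It is typically used for related quantities such as the correlation coefficient.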

Let’s find out the predicted values of y for corresponding values of x using the linear equation in which m = 0.6 and c = 2.2 and plot them.

Predicted Values using Linear Regression
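As a small sketch, the predicted values can be computed by plugging each x from the sample data into the equation with m = 0.6 and c = 2.2. The variable names are illustrative only.

```python
# Sample data and the fitted coefficients from this section
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = 0.6, 2.2

# Predicted y for each x using y = m*x + c
predicted = [m * x + c for x in xs]

for x, actual, pred in zip(xs, ys, predicted):
    print(f"x={x}  actual={actual}  predicted={pred:.1f}")
```

For this data the predictions come out as 2.8, 3.4, 4.0, 4.6 and 5.2.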

Here, the blue points represent the actual y values, and the brown points represent the predicted y values based on the model we created. The distances between the actual and predicted values are known as residuals or errors. The best-fit line should have the lowest sum of squares of these errors, also known as “e square.”

E - Square

You can observe that the sum of squared errors for this regression line is 2.4. We check this error for each line and determine the best-fit line having the lowest e square value. The graphical representation is:

Data Points Best Fit

We keep adjusting the line through the data points until we find the best-fit line, the one with the least squared distance between the data points and the regression line.
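The idea of checking the error for several lines can be sketched as follows. The fitted line uses m = 0.6 and c = 2.2 from this section. The other candidate lines are arbitrary values picked here purely for comparison.

```python
# Sample data from this section
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_errors(m, c, xs, ys):
    """Sum of squared residuals for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

best = sum_squared_errors(0.6, 2.2, xs, ys)
print(f"fitted line: {best:.2f}")  # fitted line: 2.40

# Arbitrary alternative lines (hypothetical values, for comparison only)
candidates = [(0.5, 2.5), (0.7, 1.9), (1.0, 1.0), (0.0, 4.0)]
candidate_errors = [sum_squared_errors(m, c, xs, ys) for m, c in candidates]
for (m, c), err in zip(candidates, candidate_errors):
    print(f"m={m}, c={c}: {err:.2f}")
```

Each of the alternative lines gives a larger sum of squared errors than the fitted line, which is what makes the fitted line the best fit under this measure.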

The above example uses the sum of squared errors, the most commonly used measure of the distance between the line and the data points. There are other ways to measure this distance, such as the sum of absolute errors and the root mean square error.
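To make the three error measures concrete, here is a minimal sketch that computes each of them for the fitted line (m = 0.6, c = 2.2) on the sample data. The variable names are illustrative.

```python
import math

# Sample data and fitted coefficients from this section
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = 0.6, 2.2

# Residual = actual value minus predicted value
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]

sse = sum(r ** 2 for r in residuals)     # sum of squared errors
sae = sum(abs(r) for r in residuals)     # sum of absolute errors
rmse = math.sqrt(sse / len(residuals))   # root mean square error

print(f"SSE = {sse:.2f}, SAE = {sae:.2f}, RMSE = {rmse:.3f}")
```

For this line the sum of squared errors is 2.4, the sum of absolute errors is 3.2, and the root mean square error is about 0.693.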

So far we have dealt with only two variables, x and y. But it’s very rare in the real world to have only two variables when you’re calculating. Let’s talk about what happens when you have multiple inputs.

Next, let us look at multiple linear regression and how it works by implementing it in Python.

Multiple Linear Regression

In simple linear regression, we have the equation:

y = m*x + c

For multiple linear regression, we have the equation:

y = m1*x1 + m2*x2 + m3*x3 + ... + c

Here, we have multiple independent variables, x1, x2, x3 and so on, and multiple slopes, m1, m2, m3 and so on.
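In code, the multiple regression equation is a weighted sum of the inputs plus the intercept. The sketch below is only an illustration: the function name predict and the slope, intercept and input values are made up for the example.

```python
def predict(features, slopes, intercept):
    """Evaluate y = m1*x1 + m2*x2 + ... + c for one row of inputs."""
    if len(features) != len(slopes):
        raise ValueError("need exactly one slope per feature")
    return sum(m * x for m, x in zip(slopes, features)) + intercept

# Hypothetical slopes, intercept and inputs, chosen only for illustration
slopes = [2.0, 0.5, -1.0]    # m1, m2, m3
intercept = 3.0              # c
features = [1.0, 4.0, 2.0]   # x1, x2, x3

y = predict(features, slopes, intercept)
print(y)  # 2.0*1.0 + 0.5*4.0 - 1.0*2.0 + 3.0 = 5.0
```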

Implementation of Linear Regression 

Let’s discuss how multiple linear regression works by implementing it in Python.

A venture capital firm is trying to figure out which companies it should invest in. We need to predict the profit of each company based on its expenses in research and development, marketing, administration and so on. 
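The article's own data set and code are not included in this text, so what follows is only a rough sketch of how such a model could be fitted. Everything in it is an assumption made for illustration: the expense figures are synthetic, the profits are generated from a made-up linear rule so that the result can be checked, and the fit uses the normal equations in plain Python. In practice, a library such as scikit-learn or NumPy would usually do this work.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def fit(rows, targets):
    """Least-squares fit; returns [m1, m2, ..., c] via the normal equations."""
    design = [list(row) + [1.0] for row in rows]  # extra column of ones for c
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(p)]
    return solve(xtx, xty)

# SYNTHETIC data (not from the article): R&D, administration, marketing spend
expenses = [
    [10, 20, 30],
    [20, 25, 10],
    [30, 15, 40],
    [40, 30, 20],
    [50, 10, 50],
    [60, 35, 25],
]
# Made-up rule used to generate the profits, so the fit can be verified
true_coefs = [0.8, 0.1, 0.05, 5.0]  # m_rd, m_admin, m_marketing, c
profits = [0.8 * rd + 0.1 * ad + 0.05 * mk + 5.0 for rd, ad, mk in expenses]

coefs = fit(expenses, profits)
print([round(v, 4) for v in coefs])
```

Because the synthetic profits follow the made-up rule exactly, the fitted coefficients should recover that rule up to floating-point error. Real company data would be noisy, and the fitted slopes would only approximate the underlying relationship.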

Conclusion 

According to research, artificial intelligence was a $21 billion market in 2018, and that’s expected to reach more than $190 billion by 2025. This explains tech companies’ growing interest in developing AI-based devices and the need for data scientists. Many professionals are looking to gain expertise in this evolving world of machine learning and AI to take the next big leap in their careers. Simplilearn’s Machine Learning Certification Course is helpful if you want to master the concepts of machine learning. The course covers basic to advanced aspects of machine learning, such as regression, classification, and time series modeling. Get certified today and take your career to the next level!

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.
