Scikit-learn (sklearn) is one of Python's most widely used machine learning packages. It offers efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written mostly in Python and builds on NumPy, SciPy, and Matplotlib. In this article, you'll learn how to perform linear regression with sklearn.

What is SKlearn Linear Regression?

Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is the process of finding the straight line that best fits a set of scattered data points.

The fitted line can then be extended to predict values for new data points. Because it is simple to fit and easy to interpret, linear regression is a fundamental machine learning method.
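As a minimal sketch of this idea, the snippet below fits a line to four made-up points that lie exactly on y = 2x + 1 and then extends the line to an unseen input. The data values and variable names are assumptions chosen for illustration; they are not part of the dataset used later in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit the best-fit straight line
model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # the "a" in y = ax + b
intercept = model.intercept_  # the "b" in y = ax + b

# Extend the fitted line to an input that was not in the data
prediction = model.predict(np.array([[4.0]]))[0]
print(slope, intercept, prediction)
```

Note that scikit-learn expects the inputs as a 2-D array with one row per sample, which is why each x value sits in its own inner list.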

Sklearn Linear Regression Concepts

When working with scikit-learn's linear regression approach, you will encounter the following fundamental concepts:

  • Best Fit - The straight line in a plot that minimizes the overall deviation between the line and the scattered data points
  • Coefficient - Also known as a parameter, it is the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the response variable for a one-unit change in the corresponding predictor
  • Coefficient of Determination - Written R², it is the proportion of the variance in the response that the model explains. In simple linear regression it equals the square of the correlation coefficient. It is used to describe the goodness of fit
  • Correlation - The measurable strength and direction of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
  • Dependent Feature - A variable represented as y in the slope-intercept equation y=ax+b. Also referred to as an output or a response
  • Estimated Regression Line - The straight line that best fits a set of scattered data points
  • Independent Feature - A variable represented by x in the slope-intercept equation y=ax+b. Also referred to as an input or a predictor
  • Intercept - The point at which the regression line crosses the Y-axis, indicated by b in the slope-intercept equation y=ax+b
  • Least Squares - a method for calculating the best fit to data by minimizing the sum of the squares of the discrepancies between observed and estimated values
  • Mean - The average of a group of numbers. In linear regression, the mean of the response for a given input is modeled as a linear function of that input
  • OLS (Ordinary Least Squares) Regression - The most common form of linear regression, which fits the line using the least squares method
  • Residual - the vertical distance between a data point and the regression line
  • Regression - An estimate of how a variable is expected to change in relation to changes in other variables
  • Regression Model - The fitted formula that approximates the relationship between the predictors and the response
  • Response Variables - This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point) 
  • Slope - the steepness of a regression line. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
  • Simple linear regression - A linear regression with a single independent variable
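Several of the terms above (least squares, slope, intercept, residual, coefficient of determination, correlation) can be computed by hand for a small example. The following is a minimal sketch, assuming made-up sample values; it works out the least-squares line with plain NumPy and uses scikit-learn's LinearRegression only as a cross-check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample data (not taken from the article's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: slope a and intercept b of y = ax + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Residuals: vertical distances between each point and the fitted line
y_hat = a * x + b
residuals = y - y_hat

# Coefficient of determination: share of the variance explained by the line
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Cross-check against scikit-learn
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(a, model.coef_[0])
print(b, model.intercept_)
print(r_squared, r ** 2)
```

For a simple linear regression like this one, the two values on each printed line should agree up to floating-point rounding.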

How to Create a Sklearn Linear Regression Model

Step 1: Importing All the Required Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Step 2: Reading the Dataset

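# Changing the working directory to the location of the dataset
# (%cd is an IPython/Jupyter magic; in a plain script, pass the full path to read_csv instead)
%cd C:\Users\Dev\Desktop\Kaggle\Salinity

df = pd.read_csv('bottle.csv')

# Taking only the two selected attributes from the dataset
# (.copy() gives an independent frame, so later edits do not trigger pandas' SettingWithCopyWarning)
df_binary = df[['Salnty', 'T_degC']].copy()

# Renaming the columns for easier writing of the code
df_binary.columns = ['Sal', 'Temp']

# Displaying only the first rows along with the column names
df_binary.head()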
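If bottle.csv is not at hand, the column selection and renaming can still be tried on a small stand-in frame. The sketch below assumes made-up values and a hypothetical extra column named Depthm; only the column names Salnty and T_degC come from the code above.

```python
import pandas as pd

# Made-up stand-in for bottle.csv (values and the Depthm column are illustrative assumptions)
df = pd.DataFrame({
    "Depthm": [0, 8, 10, 19],
    "T_degC": [10.5, 10.46, None, 10.45],
    "Salnty": [33.44, 33.44, 33.437, None],
})

# Keep only the two columns of interest as an independent copy
df_binary = df[["Salnty", "T_degC"]].copy()

# Shorter names for convenience
df_binary.columns = ["Sal", "Temp"]

# Inspect the first rows and count the missing values per column
print(df_binary.head())
print(df_binary.isna().sum())
```

Counting missing values at this stage shows how much cleaning the later steps have to do.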

Step 3: Exploring the Data Scatter

# Plotting the data scatter with a second-order (order=2) polynomial fit and no confidence band
sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)

Step 4: Data Cleaning

# Filling NaN or missing values with the previous valid value (forward fill)
df_binary = df_binary.ffill()
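Forward fill copies the last valid value downward, so a missing value in the very first row has nothing above it and stays missing. The sketch below illustrates this on a made-up four-row frame (the values are assumptions); it shows why a dropna step is still useful after the fill.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps, including one in the first row
toy = pd.DataFrame({
    "Sal": [np.nan, 33.4, np.nan, 33.6],
    "Temp": [10.5, np.nan, np.nan, 10.2],
})

# Forward fill: each gap takes the previous valid value in its column
filled = toy.ffill()

# The leading NaN in "Sal" cannot be filled, so drop that row
cleaned = filled.dropna()

print(filled)
print(cleaned)
```

Keep in mind that forward fill assumes neighboring rows are similar; whether that holds depends on how the dataset is ordered.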

Step 5: Training Our Model

# Dropping any rows that still contain NaN values
# (this must happen before X and y are built, otherwise it has no effect on them)
df_binary = df_binary.dropna()

# Separating the data into independent and dependent variables
# and converting each single-column selection into a 2-D NumPy array
X = np.array(df_binary['Sal']).reshape(-1, 1)
y = np.array(df_binary['Temp']).reshape(-1, 1)

# Splitting the data into training (75%) and testing (25%) data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

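# Fitting the model on the training data
regr = LinearRegression()
regr.fit(X_train, y_train)

# Printing the coefficient of determination (R²) on the test data
print(regr.score(X_test, y_test))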
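Because train_test_split shuffles the rows at random, the printed score changes from run to run unless a seed is fixed. Here is a minimal, self-contained sketch of the split, fit, and score sequence on synthetic data; the data-generating line y = 3x + 2 plus noise and both seed values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus Gaussian noise (an assumption, not the salinity data)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

regr = LinearRegression()
regr.fit(X_train, y_train)

# score() returns R² on the held-out data; values near 1.0 indicate a close fit
r2 = regr.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print(r2)
```

Scoring on the held-out test rows rather than the training rows gives a fairer picture of how the line generalizes to new points.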

Step 6: Exploring Our Results

# Predicting the test set, then plotting the actual values (blue points) and the fitted line (black)
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='k')
plt.show()

Our model's low R² score indicates that the regression model does not fit the data very well. This suggests that the full dataset is not well suited to a single linear model. However, a linear regressor may still fit well if only a portion of the dataset is considered. Let us investigate that option.
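The effect of restricting the data can be illustrated on a synthetic example. The sketch below assumes a noise-free quadratic relationship, which is purely illustrative and says nothing about the shape of the salinity data. A straight line fitted over the full range explains almost none of the variance, while the same kind of line fitted to a narrow slice of the range fits closely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with a curved (quadratic) relationship
x = np.linspace(-3.0, 3.0, 101)
y = x ** 2
X = x.reshape(-1, 1)

# A straight line fitted to the whole range explains almost none of the variance
r2_full = LinearRegression().fit(X, y).score(X, y)

# Restricting to x >= 2, where the curve is close to straight, fits far better
mask = x >= 2.0
r2_part = LinearRegression().fit(X[mask], y[mask]).score(X[mask], y[mask])

print(f"R^2 on full range: {r2_full:.3f}")
print(f"R^2 on x >= 2:     {r2_part:.3f}")
```

A high score on a slice only describes that slice; the fitted line should not be used to predict outside it.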

Step 7: Working With a Smaller Dataset

# Selecting the first 500 rows of the data as an independent copy
df_binary500 = df_binary.iloc[:500].copy()

sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)

We can observe that the first 500 rows follow an approximately linear trend. We now repeat the same steps as before.

# Cleaning the subset: forward fill, then drop any rows that still contain NaN values
df_binary500 = df_binary500.ffill().dropna()

X = np.array(df_binary500['Sal']).reshape(-1, 1)
y = np.array(df_binary500['Temp']).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fitting the model and printing R² on the test data
regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))

# Plotting the actual values (blue points) and the fitted line (black)
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='k')
plt.show()
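Since the same clean, split, fit, and score sequence is run twice in this walkthrough, it could be wrapped in a small helper. The function name fit_and_score, its parameters, and the demo data below are all hypothetical and introduced only for this sketch; they are not part of scikit-learn or of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_and_score(frame, x_col, y_col, test_size=0.25, random_state=0):
    """Clean two columns, fit a linear regression, return (model, test R²)."""
    # Forward fill gaps, then drop rows that are still incomplete
    data = frame[[x_col, y_col]].ffill().dropna()
    X = data[[x_col]].to_numpy()  # 2-D array, one column
    y = data[y_col].to_numpy()    # 1-D array
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Made-up demo data with a roughly linear relationship and one missing value
rng = np.random.default_rng(1)
sal = rng.uniform(33.0, 35.0, size=100)
demo = pd.DataFrame({
    "Sal": sal,
    "Temp": -4.0 * sal + 150.0 + rng.normal(0, 0.2, size=100),
})
demo.loc[0, "Sal"] = np.nan

model, score = fit_and_score(demo, "Sal", "Temp")
print(model.coef_[0], score)
```

With a helper like this, comparing the full dataset against the first 500 rows would come down to two calls, one on each frame.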



About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
