Scikit-learn (sklearn) is one of Python's most widely used machine learning packages. It offers efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written mostly in Python and builds on NumPy, SciPy, and Matplotlib. This article walks through linear regression with sklearn.

What is SKlearn Linear Regression?

Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is the process of finding the straight line that best fits a set of scattered data points.

The line can then be extended to predict new data points. Because it is simple and easy to interpret, linear regression is a fundamental Machine Learning method.

Sklearn Linear Regression Concepts

When working with scikit-learn's linear regression approach, you will encounter the following fundamental concepts:

  • Best Fit - The straight line in a plot that minimizes the deviation between the line and the scattered data points
  • Coefficient - Also known as a parameter, the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the Response Variable for a one-unit change in the corresponding input
  • Coefficient of Determination - Written R², the proportion of the variance in the dependent variable that the model explains. In simple linear regression it equals the square of the correlation coefficient. It is used to describe the goodness of fit
  • Correlation - The measurable strength and direction of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
  • Dependent Feature - The variable represented as y in the slope equation y=ax+b. Also referred to as an Output or a Response
  • Estimated Regression Line - The straight line that best fits a set of scattered data points
  • Independent Feature - The variable represented by the letter x in the slope equation y=ax+b. Also referred to as an Input or a Predictor
  • Intercept - The point where the regression line crosses the Y-axis, indicated by the letter b in the slope equation y=ax+b
  • Least Squares - A method for calculating the best fit to data by minimizing the sum of the squared differences between observed and estimated values
  • Mean - The average of a group of numbers. In linear regression, the mean of the response is modeled as a linear function of the inputs
  • OLS (Ordinary Least Squares) - The least-squares method most commonly used to fit a linear regression. The term is often used interchangeably with Linear Regression
  • Residual - The vertical distance between a data point and the regression line
  • Regression - An estimate of how a variable is expected to change in relation to changes in other variables
  • Regression Model - The fitted formula used to approximate the relationship between the variables
  • Response Variables - This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point)
  • Slope - The steepness of a regression line, indicated by the letter a in y=ax+b. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
  • Simple Linear Regression - A linear regression with a single independent variable
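
Several of the concepts above can be made concrete with a few lines of NumPy. The sketch below is only an illustration: the five data points are made up, and the slope and intercept are computed with the standard closed-form least-squares formulas rather than with scikit-learn.

```python
import numpy as np

# Made-up toy data: five points lying close to the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares slope (a) and intercept (b) for y = ax + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

y_hat = a * x + b        # predicted responses on the estimated regression line
residuals = y - y_hat    # vertical distances between the points and the line

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print("slope:", a, "intercept:", b)
print("R^2:", r_squared, "correlation squared:", r ** 2)
```

For a simple linear regression like this one, the two values on the last line agree, which is why the coefficient of determination and the correlation coefficient are easy to confuse.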

How to Create a Sklearn Linear Regression Model

Step 1: Importing All the Required Libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

Step 2: Reading the Dataset

# Changing the working directory to the location of the dataset
# (%cd is an IPython/Jupyter magic command, not plain Python)

%cd C:\Users\Dev\Desktop\Kaggle\Salinity

df = pd.read_csv('bottle.csv')

# Taking only the two selected attributes from the dataset;
# .copy() avoids pandas' SettingWithCopyWarning on later edits

df_binary = df[['Salnty', 'T_degC']].copy()

# Renaming the columns to make the code easier to write

df_binary.columns = ['Sal', 'Temp']

# Displaying the first five rows along with the column names

df_binary.head()
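
If the dataset file is not at hand, the selection and renaming step can still be tried on a small in-memory table. The sketch below is an assumption-laden stand-in: the rows and the `station` column are made up, and only the column names `Salnty` and `T_degC` come from the article. It also uses `usecols` and `rename`, which are alternatives to the slicing shown above rather than the article's own method.

```python
import io

import pandas as pd

# Made-up rows standing in for bottle.csv; "station" is a hypothetical extra column
csv_text = """station,T_degC,Salnty
A,10.50,33.44
A,10.46,33.44
A,10.46,
B,10.45,33.42
"""

# usecols loads only the two columns that the regression needs
df = pd.read_csv(io.StringIO(csv_text), usecols=["Salnty", "T_degC"])

# rename() returns a new DataFrame, so later edits do not touch df
df_binary = df[["Salnty", "T_degC"]].rename(
    columns={"Salnty": "Sal", "T_degC": "Temp"}
)

print(df_binary.head())
print(df_binary.isna().sum())  # number of missing values per column
```

Checking `isna().sum()` at this point shows how much cleaning the data will need before a model can be fitted.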

Step 3: Exploring the Data Scatter

# Plotting the data scatter with a second-order polynomial trend line (order = 2)

sns.lmplot(x ="Sal", y ="Temp", data = df_binary, order = 2, ci = None)

Step 4: Data Cleaning

# Filling NaN or missing values with the previous valid value (forward fill);
# fillna(method ='ffill') is deprecated in recent pandas versions

df_binary = df_binary.ffill()
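
To see what forward filling does and does not fix, it can help to compare it with simply dropping incomplete rows. The following sketch uses a made-up four-row table; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps in both columns, including a gap in the very first row
toy = pd.DataFrame({
    "Sal": [np.nan, 33.4, np.nan, 33.5],
    "Temp": [10.5, np.nan, 10.4, 10.1],
})

# Forward fill copies the previous valid value down into each gap
filled = toy.ffill()

# A gap in the first row has no previous value, so it survives the forward fill
remaining_gaps = int(filled.isna().sum().sum())

# Dropping rows instead keeps only the rows that were complete to begin with
dropped = toy.dropna()

print(filled)
print("gaps left after ffill:", remaining_gaps)
print("rows left after dropna:", len(dropped))
```

Forward filling keeps more rows but assumes that neighboring rows are similar, while dropping rows makes no such assumption but discards data. Because a leading gap can remain after a forward fill, a later `dropna()` call is still useful.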

Step 5: Training Our Model

# Dropping any rows that still contain NaN values
# (this has to happen before the arrays are created to have any effect)

df_binary = df_binary.dropna()

# Separating the data into independent and dependent variables

# Converting each column into a NumPy array and reshaping it
# into a single-column 2-D array, as scikit-learn expects for X

X = np.array(df_binary['Sal']).reshape(-1, 1)

y = np.array(df_binary['Temp']).reshape(-1, 1)

# Splitting the data into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

regr = LinearRegression()

regr.fit(X_train, y_train)

# score() returns the coefficient of determination (R²) on the test data

print(regr.score(X_test, y_test))
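
Once a model is fitted, its slope and intercept can be read from the `coef_` and `intercept_` attributes. The sketch below runs on synthetic data, not on the salinity dataset: the true slope of -4.0, the intercept of 150.0, and the noise level are made-up values chosen only so that the result can be checked. The `random_state` argument is an addition that makes the split repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = -4x + 150 plus a little random noise (all values made up)
rng = np.random.default_rng(0)
X = rng.uniform(33.0, 35.0, size=(200, 1))
y = -4.0 * X[:, 0] + 150.0 + rng.normal(0.0, 0.3, size=200)

# random_state fixes the shuffle, so the same rows land in the test set every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

regr = LinearRegression().fit(X_train, y_train)

# With a 1-D y, coef_ holds one slope per input column and intercept_ is a scalar
print("slope:", regr.coef_[0])
print("intercept:", regr.intercept_)
print("R^2 on test data:", regr.score(X_test, y_test))
```

Without `random_state`, each run produces a different split and therefore a slightly different score. If y is reshaped into a single column, as in the steps above, `coef_` becomes a 2-D array and `intercept_` a 1-element array.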

Step 6: Exploring Our Results

y_pred = regr.predict(X_test)

# Scatter of the actual test values (blue) with the fitted line (black)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()

Our model's low R² score indicates that the regression line does not fit the data very well. This suggests that a single straight line is a poor description of the full dataset. However, a linear regressor may still fit well if only a portion of the data is considered. Let us investigate that option.
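
A low score from `score()` means that a straight line explains little of the variation in the data. It does not mean that the variables are unrelated. The sketch below illustrates this with made-up data that follows a perfect curve, y = x², for which the best straight line is flat. The use of `r2_score` and `mean_absolute_error` from `sklearn.metrics` is an addition for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up data on a perfect parabola, symmetric around x = 0
x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = x[:, 0] ** 2

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# r2_score computes the same quantity that model.score() reports
r2 = r2_score(y, y_pred)

# Mean absolute error is expressed in the units of y, which is often easier to read
mae = mean_absolute_error(y, y_pred)

print("slope:", model.coef_[0])
print("R^2:", r2)
print("mean absolute error:", mae)
```

Here y is fully determined by x, yet the linear fit scores close to zero, because the relationship is curved rather than straight.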

Step 7: Working With a Smaller Dataset

# Selecting the first 500 rows of the data;
# .copy() makes an independent DataFrame that can be modified safely

df_binary500 = df_binary[:500].copy()

# Plotting the subset with a second-order polynomial trend line (order = 2)

sns.lmplot(x ="Sal", y ="Temp", data = df_binary500, order = 2, ci = None)

We can observe that the first 500 rows follow a roughly linear trend, so we continue in the same manner as before.

# Cleaning the subset before the arrays are created

df_binary500 = df_binary500.ffill()

df_binary500 = df_binary500.dropna()

X = np.array(df_binary500['Sal']).reshape(-1, 1)

y = np.array(df_binary500['Temp']).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

regr = LinearRegression()

regr.fit(X_train, y_train)

print(regr.score(X_test, y_test))

y_pred = regr.predict(X_test)

plt.scatter(X_test, y_test, color ='b')

plt.plot(X_test, y_pred, color ='k')

plt.show()
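
Because Steps 5 and 7 repeat the same cleaning, splitting, and fitting code, one option is to wrap those steps in a small function. The sketch below is one possible way to do that and is not part of the original walkthrough: the name `fit_and_score` is hypothetical, and the demo table is synthetic, with made-up values that only reuse the `Sal` and `Temp` column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def fit_and_score(frame, test_size=0.25, random_state=0):
    """Drop incomplete rows, split, fit Temp against Sal, and return (model, R^2)."""
    clean = frame.dropna(subset=["Sal", "Temp"])
    X = clean[["Sal"]].to_numpy()   # 2-D array with a single column
    y = clean["Temp"].to_numpy()    # 1-D array
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic stand-in data (made-up values) with a few missing temperatures
rng = np.random.default_rng(1)
sal = rng.uniform(33.0, 35.0, size=1000)
temp = -4.0 * sal + 150.0 + rng.normal(0.0, 0.5, size=1000)
demo = pd.DataFrame({"Sal": sal, "Temp": temp})
demo.loc[demo.index[::50], "Temp"] = np.nan

# The same function handles the full table and the first 500 rows
_, score_all = fit_and_score(demo)
_, score_500 = fit_and_score(demo.iloc[:500])

print("R^2, all rows:", score_all)
print("R^2, first 500 rows:", score_500)
```

On the real dataset, the same function could be called once with `df_binary` and once with `df_binary500` to compare the two scores side by side.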

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.
