Scikit-learn (sklearn) is one of the most widely used machine learning packages for Python. It offers efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, through a consistent Python interface. The package is written mostly in Python and builds on NumPy, SciPy, and Matplotlib. In this article, you'll learn how to perform linear regression with sklearn.

## What is Sklearn Linear Regression?

Scikit-learn is a Python package that makes it easier to apply a variety of Machine Learning (ML) algorithms for predictive data analysis, such as linear regression.

Linear regression is the process of finding the straight line that best fits a set of scattered data points.

The line can then be extended to predict new data points. Because it is simple and easy to interpret, linear regression is a fundamental Machine Learning method.
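As a minimal sketch of this idea, the snippet below fits a line to a handful of points and then extends it to an unseen input. The data is synthetic and invented for illustration (it lies exactly on y = 2x + 1), and the variable names are assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 2x + 1 (made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# Extend the fitted line to an x value that was not in the training data
new_point = np.array([[10.0]])
prediction = model.predict(new_point)[0]

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction at x=10:", prediction)
```

Because the points lie exactly on a line, the fitted slope and intercept should recover 2 and 1 up to floating-point error. Real data is noisy, so the fit will only approximate the underlying trend.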

## Sklearn Linear Regression Concepts

When working with scikit-learn's linear regression approach, you will encounter the following fundamental concepts:

• Best Fit - The straight line in a plot that minimizes the deviation between itself and the scattered data points
• Coefficient - Also known as a parameter, the factor by which a variable is multiplied. In linear regression, a coefficient represents the change in the Response Variable for a one-unit change in the corresponding input
• Coefficient of Determination - Written R², the proportion of the variance in the dependent feature that the model explains. It describes the degree of fit; in simple linear regression it equals the square of the correlation coefficient
• Correlation - The measurable strength and direction of association between two variables, often known as the 'degree of correlation.' The values range from -1.0 to 1.0
• Dependent Feature - A variable represented as y in the slope equation y=ax+b. Also referred to as an Output or a Response
• Estimated Regression Line - The straight line that best fits a set of scattered data points
• Independent Feature - A variable represented by the letter x in the slope equation y=ax+b. Also referred to as an Input or a Predictor
• Intercept - The point at which the regression line intersects the Y-axis, indicated by the letter b in the slope equation y=ax+b
• Least Squares - A method for calculating the best fit to data by minimizing the sum of the squares of the differences between observed and estimated values
• Mean - An average of a group of numbers. In linear regression, the mean of the response for a given input is modeled by a linear function
• OLS (Ordinary Least Squares Regression) - The standard least squares method for fitting a linear regression, and the method that scikit-learn's LinearRegression implements. The term is often used as a synonym for Linear Regression
• Residual - The vertical distance between a data point and the regression line, that is, the actual value minus the predicted value
• Regression - An estimate of how a variable is expected to change in relation to changes in other variables
• Regression Model - The fitted formula used to approximate the relationship between the variables
• Response Variables - This category covers both the Predicted Response (the value predicted by the regression) and the Actual Response (the actual value of the data point)
• Slope - The steepness of a regression line, indicated by the letter a in y=ax+b. The linear relationship between two variables may be defined using slope and intercept: y=ax+b
• Simple linear regression - A linear regression with a single independent variable
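Several of these terms can be tied together in a few lines of NumPy. The sketch below computes the least squares slope and intercept for a simple linear regression by hand, then the residuals and the coefficient of determination. The five data points are hypothetical values chosen for illustration, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical toy data, invented for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent feature
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])   # dependent feature

# Least squares estimates for the line y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

y_hat = a * x + b        # predicted responses on the estimated regression line
residuals = y - y_hat    # actual response minus predicted response

# Coefficient of determination: share of the variance in y explained by the line
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print("slope a:", a)
print("intercept b:", b)
print("R squared:", r_squared)
print("r squared from correlation:", r ** 2)
```

With a single independent variable, the last two printed values should agree up to rounding, which is why the coefficient of determination and the correlation coefficient are related but not the same quantity.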

## How to Create a Sklearn Linear Regression Model

### Step 1: Importing All the Required Libraries

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn import preprocessing, svm
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

### Step 2: Reading the Dataset

    # Changing the file read location to the location of the dataset
    # (%cd is a Jupyter notebook command, not plain Python)
    %cd C:\Users\Dev\Desktop\Kaggle\Salinity

    df = pd.read_csv('bottle.csv')

    # Taking only the selected two attributes from the dataset
    # (.copy() avoids modifying a view of the original dataframe)
    df_binary = df[['Salnty', 'T_degC']].copy()

    # Renaming the columns for easier writing of the code
    df_binary.columns = ['Sal', 'Temp']

    # Displaying only the first five rows along with the column names
    df_binary.head()

### Step 3: Exploring the Data Scatter

    # Plotting the data scatter; order=2 overlays a second-degree polynomial fit
    sns.lmplot(x="Sal", y="Temp", data=df_binary, order=2, ci=None)

### Step 4: Data Cleaning

    # Filling missing values by carrying the last valid observation forward
    # (fillna(method='ffill') is deprecated in recent pandas versions)
    df_binary = df_binary.ffill()
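To make the effect of forward filling concrete, here is a small sketch on a made-up dataframe. The values and the names `toy`, `filled`, `cleaned`, and `dropped` are invented for illustration. It shows that forward filling cannot repair a gap in the very first row, and how it compares with simply dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for this illustration
toy = pd.DataFrame({
    "Sal": [np.nan, 33.4, np.nan, 33.6],
    "Temp": [10.5, np.nan, 10.1, 9.8],
})

# Forward fill: each gap takes the last valid value above it
filled = toy.ffill()

# The leading NaN in "Sal" has nothing above it, so it survives the fill;
# a follow-up dropna() removes that remaining row
cleaned = filled.dropna()

# For comparison: dropping incomplete rows without filling first
dropped = toy.dropna()

print(filled)
print("rows after ffill + dropna:", len(cleaned))
print("rows after dropna only:", len(dropped))
```

Forward filling keeps more rows, but the filled values are copies of neighbouring measurements rather than real observations, so whether it is appropriate depends on the dataset.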

### Step 5: Training Our Model

    # Dropping any rows that still contain NaN values, before building the arrays
    df_binary = df_binary.dropna()

    # Separating the data into independent and dependent variables
    # Converting each column into a 2-D numpy array
    X = np.array(df_binary['Sal']).reshape(-1, 1)
    y = np.array(df_binary['Temp']).reshape(-1, 1)

    # Splitting the data into training and testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    regr = LinearRegression()
    regr.fit(X_train, y_train)

    # score() returns the coefficient of determination (R²) on the test data
    print(regr.score(X_test, y_test))
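Two details of this step are easy to miss: `train_test_split` shuffles the rows at random, so the printed score changes from run to run unless `random_state` is fixed, and `score` on a regressor reports R² rather than a classification accuracy. The sketch below illustrates both on synthetic data. The numbers are invented and only loosely shaped like the salinity and temperature columns; they are not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: a noisy decreasing line
rng = np.random.default_rng(0)
X = rng.uniform(30.0, 35.0, size=(200, 1))
y = -3.0 * X.ravel() + 110.0 + rng.normal(0.0, 1.0, size=200)

# Fixing random_state makes the split, and so the score, reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

regr = LinearRegression().fit(X_train, y_train)

# score() is the coefficient of determination computed on the given data
score = regr.score(X_test, y_test)
manual = r2_score(y_test, regr.predict(X_test))

print("test rows:", X_test.shape[0])
print("score():", score)
print("r2_score():", manual)
```

The two printed R² values should match, since `score` computes the same quantity as `r2_score` on the model's predictions.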

### Step 6: Exploring Our Results

    y_pred = regr.predict(X_test)

    # Data scatter of the actual test values (blue) with the predicted line (black)
    plt.scatter(X_test, y_test, color='b')
    plt.plot(X_test, y_pred, color='k')
    plt.show()

The low R² score indicates that the linear model does not fit the full dataset very well, which suggests that a straight line is a poor description of this data as a whole. However, a linear regressor may still fit well if only a portion of the dataset is considered. Let us investigate that option.
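The idea that a line can fit a portion of the data well while fitting the whole poorly can be sketched with synthetic data. The example below uses an invented curved relationship (y = x²), not the salinity dataset, and scores each model on the same points it was fitted to so that the result is deterministic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curved data, invented for illustration
x = np.linspace(-3.0, 3.0, 121)
y = x ** 2
X = x.reshape(-1, 1)

# A straight line fitted to the whole curve explains almost none of the variance
full_r2 = LinearRegression().fit(X, y).score(X, y)

# Restricted to a narrow range, the same curve is close to a straight line
mask = x >= 2.0
subset_r2 = LinearRegression().fit(X[mask], y[mask]).score(X[mask], y[mask])

print("R squared on the full range:", full_r2)
print("R squared on x >= 2:", subset_r2)
```

A good fit on a subset only describes that subset. It says nothing about how well the line would predict points outside the selected range.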

### Step 7: Working With a Smaller Dataset

    # Selecting the first 500 rows of the data
    df_binary500 = df_binary[:500].copy()

    sns.lmplot(x="Sal", y="Temp", data=df_binary500, order=2, ci=None)

We can observe that the first 500 rows follow a roughly linear trend, so we repeat the same steps as before on this subset.

    # Cleaning the subset before building the arrays
    df_binary500 = df_binary500.ffill().dropna()

    X = np.array(df_binary500['Sal']).reshape(-1, 1)
    y = np.array(df_binary500['Temp']).reshape(-1, 1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    regr = LinearRegression()
    regr.fit(X_train, y_train)
    print(regr.score(X_test, y_test))

    y_pred = regr.predict(X_test)
    plt.scatter(X_test, y_test, color='b')
    plt.plot(X_test, y_pred, color='k')
    plt.show()


## About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
