Machine learning has revolutionized the world of business and is helping us build sophisticated applications to solve tough business problems. Using supervised and unsupervised machine learning models, you can solve problems using classification, regression, and clustering algorithms. In this article, we’ll discuss a supervised machine learning algorithm known as logistic regression in Python. Logistic regression can be used to solve both classification and regression problems.
Introduction to Supervised Learning
Supervised machine learning algorithms derive insights, patterns, and relationships from a labeled training dataset. It means the dataset already contains a known value for the target variable for each record. It is called supervised learning because the process of an algorithm learning from the training dataset is like an instructor supervising the learning process. You know the correct answers, the algorithm iteratively makes predictions on the training data and the instructor corrects it. Learning ends when the algorithm achieves the desired level of performance and accuracy.
Supervised learning problems can be further classified into regression and classification problems.
- Classification: In a classification problem, the output variable is a category, such as “red” or “blue,” “disease” or “no disease,” “true” or “false,” etc.
- Regression: In a regression problem, the output variable is a real continuous value, such as “dollars” or “weight.”
The following is an example of a supervised learning method where we have labeled data to identify dogs and cats. The algorithm learns from this data and trains a model to predict the new input.
Now that we learned the basics of supervised learning, let's have a look at a popular supervised machine learning algorithm: logistic regression.
What is Logistic Regression?
Logistic regression is a statistical method that is used for building machine learning models where the dependent variable is dichotomous: i.e. binary. Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.
The name “logistic regression” is derived from the concept of the logistic function that it uses. The logistic function is also known as the sigmoid function. The value of this logistic function lies between zero and one.
The following is an example of a logistic function we can use to find the probability of a vehicle breaking down, depending on how many years it has been since it was serviced last.
Here is how you can interpret the results from the graph to decide whether the vehicle will break down or not.
Logistic Regression Overview
Logistic Regression is a fundamental algorithm in the field of machine learning, particularly for binary classification tasks. It's widely used for predicting the probability of an event occurring based on one or more independent variables. Despite its name, logistic regression is a classification algorithm rather than a regression algorithm.
Math Prerequisites
To understand logistic regression thoroughly, a solid understanding of linear algebra, calculus, and probability theory is necessary. Linear algebra concepts like matrices and vectors are essential for understanding the underlying mathematics of logistic regression. Calculus concepts such as derivatives are crucial for optimizing the logistic regression model. Probability theory helps in understanding the probabilistic interpretation of logistic regression.
Problem Formulation
In logistic regression, the problem is formulated as predicting the probability that a given input belongs to a certain class. It involves estimating the parameters of a logistic function that maps the input features to the probability of the output class. The goal is to find the best-fitting model that maximizes the likelihood of the observed data.
Methodology
Logistic regression employs the logistic function, also known as the sigmoid function, to model the relationship between the input features and the output class probabilities. The logistic function maps any real-valued number to the range [0, 1], making it suitable for modeling probabilities. The model parameters are learned using optimization algorithms such as gradient descent, which iteratively updates the parameters to minimize the error between the predicted probabilities and the actual class labels.
Classification Performance
The performance of a logistic regression model is evaluated using various metrics such as accuracy, precision, recall, F1 score, and ROC curve. These metrics measure the model's ability to correctly classify instances into their respective classes and assess its overall performance.
Single-Variate Logistic Regression
In single-variate logistic regression, there is only one independent variable or feature used to predict the output class. This simpler form of logistic regression is useful for understanding the basic concepts and principles before moving on to more complex models.
Multi-Variate Logistic Regression
Multi-variate logistic regression involves multiple independent variables or features to predict the output class. It allows for capturing more complex relationships between the input features and the output class probabilities, leading to potentially higher predictive accuracy.
Regularization
Regularization techniques such as L1 regularization (Lasso) and L2 regularization (Ridge) can be applied to logistic regression to prevent overfitting and improve generalization performance. Regularization adds a penalty term to the loss function, which discourages large parameter values and promotes smoother models.
Advantages of the Logistic Regression Algorithm
- Logistic regression performs better when the data is linearly separable
- It does not require too many computational resources as it’s highly interpretable
- There is no problem scaling the input features—It does not require tuning
- It is easy to implement and train a model using logistic regression
- It gives a measure of how relevant a predictor (coefficient size) is, and its direction of association (positive or negative)
How Does the Logistic Regression Algorithm Work?
Consider the following example: An organization wants to determine an employee’s salary increase based on their performance.
For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by considering the employee’s performance as the independent variable, and the salary increase as the dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee would get a promotion or not based on their performance? The above linear graph won’t be suitable in this case. As such, we clip the line at zero and one, and convert it into a sigmoid curve (S curve).
Based on the threshold values, the organization can decide whether an employee will get a salary increase or not.
To understand logistic regression, let’s go over the odds of success.
Odds (𝜃) = Probability of an event happening / Probability of an event not happening
𝜃 = p / 1 - p
The values of odds range from zero to ∞ and the values of probability lies between zero and one.
Consider the equation of a straight line:
𝑦 = 𝛽0 + 𝛽1* 𝑥
Here, 𝛽0 is the y-intercept
𝛽1 is the slope of the line
x is the value of the x coordinate
y is the value of the prediction
Now to predict the odds of success, we use the following formula:
Exponentiating both the sides, we have:
Let Y = e 𝛽0+𝛽1 * 𝑥
Then p(x) / 1 - p(x) = Y
p(x) = Y(1 - p(x))
p(x) = Y - Y(p(x))
p(x) + Y(p(x)) = Y
p(x)(1+Y) = Y
p(x) = Y / 1+Y
The equation of the sigmoid function is:
The sigmoid curve obtained from the above equation is as follows:
Now that you know more about logistic regression algorithms, let’s look at the difference between linear regression and logistic regression.
Types of Logistic Regression
Logistic regression is a versatile machine learning algorithm used for binary and multi-class classification tasks. Depending on the nature of the dependent variable, logistic regression can be categorized into different types. The main types of logistic regression include Binary Logistic Regression, Multinomial Logistic Regression, and Ordinal Logistic Regression.
Binary Logistic Regression
Binary logistic regression is the most common type of logistic regression, where the dependent variable has only two possible outcomes or classes, typically represented as 0 and 1. It is used when the target variable is binary, such as yes/no, pass/fail, or true/false. The logistic function in binary logistic regression models the probability of an observation belonging to one of the two classes.
Multinomial Logistic Regression
Multinomial logistic regression, also known as softmax regression, is used when the dependent variable has more than two unordered categories. Unlike binary logistic regression, which deals with binary outcomes, multinomial logistic regression can handle multiple classes simultaneously. It models the probability of an observation belonging to each class using the softmax function, which ensures that the predicted probabilities sum up to one across all classes.
Ordinal Logistic Regression
Ordinal logistic regression is employed when the dependent variable has more than two ordered categories. In other words, the outcome variable has a natural ordering or hierarchy among its categories. Examples include ordinal scales like low, medium, and high, or Likert scale responses ranging from strongly disagree to strongly agree. Ordinal logistic regression models the cumulative probabilities of an observation falling into or below each category using the cumulative logistic distribution function.
Linear Regression vs. Logistic Regression
Linear Regression |
Logistic Regression |
Used to solve regression problems |
Used to solve classification problems |
The response variables are continuous in nature |
The response variable is categorical in nature |
It helps estimate the dependent variable when there is a change in the independent variable |
It helps to calculate the possibility of a particular event taking place |
It is a straight line |
It is an S-curve (S = Sigmoid) |
Now, let’s look at some logistic regression algorithm examples.
Applications of Logistic Regression
- Using the logistic regression algorithm, banks can predict whether a customer would default on loans or not
- To predict the weather conditions of a certain place (sunny, windy, rainy, humid, etc.)
- Ecommerce companies can identify buyers if they are likely to purchase a certain product
- Companies can predict whether they will gain or lose money in the next quarter, year, or month based on their current performance
- To classify objects based on their features and attributes
Now, let’s look at the assumptions you need to take to build a logistic regression model.
Assumption in a Logistic Regression Algorithm
- In a binary logistic regression, the dependent variable must be binary
- For a binary regression, the factor level one of the dependent variables should represent the desired outcome
- Only meaningful variables should be included
- The independent variables should be independent of each other. This means the model should have little or no multicollinearity
- The independent variables are linearly related to the log odds
- Logistic regression requires quite large sample sizes
Let’s now jump into understanding the logistics Regression algorithm in Python.
Logistic Regression in Python Example
Logistic regression is a popular machine learning algorithm used for binary classification tasks. In this example, we'll walk through the process of implementing logistic regression in Python using the scikit-learn library.
Step 1: Import Libraries
First, we need to import the necessary libraries.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Step 2: Load and Prepare Data:
Next, we'll load a dataset and prepare it for training our logistic regression model.
# Load dataset
data = pd.read_csv('dataset.csv')
# Separate features and target variable
X = data.drop('target', axis=1)
y = data['target']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train Logistic Regression Model:
Now, we'll create an instance of the logistic regression model and train it using the training data.
# Create logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
Step 4: Make Predictions:
Once the model is trained, we'll use it to make predictions on the test data.
# Make predictions
y_pred = model.predict(X_test)
Step 5: Evaluate Model Performance:
Finally, we'll evaluate the performance of our logistic regression model using metrics such as accuracy and classification report.
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
# Classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))
Use Case: Predict the Digits in Images Using a Logistic Regression Classifier in Python
We’ll be using the digits dataset in the scikit learn library to predict digit values from images using the logistic regression model in Python.
- Importing libraries and their associated methods
- Determining the total number of images and labels
- Displaying some of the images and their labels
- Dividing dataset into “training” and “test” set
- Importing the logistic regression model
- Making an instance of the model and training it
- Predicting the output of the first element of the test set
- Predicting the output of the first 10 elements of the test set
- Prediction for the entire dataset
- Determining the accuracy of the model
- Representing the confusion matrix in a heat map
- Presenting predictions and actual output
The images above depict the actual numbers and the predicted digit values from our logistic regression model.
Conclusion
We hope that this article has helped you get acquainted with the basics of supervised learning and logistic regression. We covered the logistic regression algorithm and went into detail with an elaborate example. Then, we looked at the different applications of logistic regression, followed by the list of assumptions you should make to create a logistic regression model. Finally, we built a model using the logistic regression algorithm to predict the digits in images.
If you're eager to deepen your understanding and proficiency in Python for data science and machine learning, consider enrolling in a comprehensive Python training course. Such a course can provide you with the foundational knowledge and practical skills needed to excel in the field. From mastering Python syntax to advanced data manipulation techniques and machine learning algorithms, a well-structured training program can be invaluable on your learning journey. Look for courses that offer hands-on projects and real-world applications to reinforce your learning.
Want to Learn More?
If you want to kick off a career in this exciting field, check out Simplilearn’s Artificial Intelligence Engineer Master’s Program, offered in collaboration with IBM. The program enables you to dive much deeper into the concepts and technologies used in AI, machine learning, and deep learning. You will also get to work on an awesome Capstone Project and earn a certificate in all disciplines in this exciting and lucrative field.
FAQs
1. What is classification?
Classification is a machine learning task where the goal is to categorize input data into predefined classes or categories based on certain features or attributes. It is a type of supervised learning, where the algorithm learns from labeled training data and then predicts the class labels of unseen data. The output of a classification model is a discrete class label or a probability distribution over the classes.
2. When Do You Need Classification?
Classification is needed in various real-world scenarios where the goal is to make predictions or decisions based on input data. Some common applications of classification include:
- Spam Detection: Classifying emails as spam or non-spam based on their content and characteristics.
- Disease Diagnosis: Predicting whether a patient has a certain medical condition based on their symptoms and test results.
- Sentiment Analysis: Categorizing text documents (e.g., reviews, social media posts) as positive, negative, or neutral based on their sentiment.
- Image Recognition: Identifying objects or patterns in images and assigning them to predefined categories.
- Fraud Detection: Detecting fraudulent transactions or activities in financial transactions or online platforms.
3. How to Import Logistic Regression in Python?
To import logistic regression in Python, you can use the scikit-learn library, which provides a comprehensive set of machine learning algorithms and tools. Here's how you can import logistic regression from scikit-learn:
from sklearn.linear_model import LogisticRegression
After importing the logistic regression module, you can create an instance of the logistic regression model and train it using your data. Here's a basic example:
# Import logistic regression
from sklearn.linear_model import LogisticRegression
# Create logistic regression model
model = LogisticRegression()
You can then use this model to fit the training data and make predictions on new data. Remember to import other necessary modules such as numpy and pandas for data manipulation and preprocessing.