Lesson 22 of 27, by Simplilearn

Last updated on Mar 1, 2021

While discussing model accuracy, we need to keep in mind the prediction errors, namely Bias and Variance, that are associated with any machine learning model. There will always be some difference between what our model predicts and the actual values. These differences are called errors. The goal of an analyst is not to eliminate errors but to reduce them, and reducing one kind of error tends to increase the other, so there is always a tradeoff. In this article titled ‘Everything You Need to Know About Bias and Variance’, we will discuss what these errors are. The topics covered in this article are:

- Errors in Machine Learning
- What is Bias?
- What is Variance?
- What is Bias-Variance Tradeoff?
- Plotting Bias and Variance using Python

We can describe an error as an action which is inaccurate or wrong. In Machine Learning, error measures how accurately our model predicts, both on the data it learns from and on new, unseen data. Based on this error, we choose the machine learning model that performs best on a particular dataset.

There are two main types of errors present in any machine learning model. They are Reducible Errors and Irreducible Errors.

- Irreducible errors are errors that will always be present in a machine learning model because of unknown variables or noise in the data; no choice of model can reduce them.
- Reducible errors are errors that can be lowered further to improve a model. They arise because our model’s output function does not match the desired output function, and they can be reduced by optimizing the model.

We can further divide reducible errors into two: Bias and Variance.

Figure 1: Errors in Machine Learning
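For squared error, this split is often written as the decomposition below. It is a standard textbook identity rather than something stated in this article, and the symbols are notation introduced here: $f$ is the true function, $\hat{f}$ is the trained model, and $\sigma^2$ is the variance of the noise in $y = f(x) + \varepsilon$. The expectation is taken over different training sets and over the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first two terms are the reducible part of the error; the last term stays no matter which model we choose.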

To make predictions, our model analyzes our data and finds patterns in it. Using these patterns, it can make generalizations about instances in our data. After training, the model applies the patterns it has learned to the test set to predict its values.

Bias is the difference between the actual values and our model’s average prediction. It arises from the simplifying assumptions that our model makes about the data in order to predict new data.

Figure 2: Bias

When the Bias is high, the assumptions made by our model are too basic and the model can’t capture the important features of our data. This means that our model hasn’t captured the patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot perform on new data and cannot be sent into production.

This instance, where the model cannot find patterns in our training set and hence fails for both seen and unseen data, is called Underfitting.

The figure below shows an example of Underfitting. As we can see, the model has found no patterns in our data, and the line of best fit is a straight line that does not pass through any of the data points. The model has failed to train properly on the given data and cannot predict new data either.

Figure 3: Underfitting
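A minimal sketch of underfitting, assuming made-up data that follows a curve (y = x²) and using NumPy's polynomial fitting rather than whatever model produced the figure. The straight line is too simple for the pattern, so its error is high even on the data it was trained on.

```python
import numpy as np

# Hypothetical data: a noise-free curve y = x^2 on a symmetric range.
x = np.linspace(-3, 3, 21)
y = x ** 2

# Fit a straight line (degree 1) and a parabola (degree 2) by least squares.
line = np.polynomial.Polynomial.fit(x, y, deg=1)
curve = np.polynomial.Polynomial.fit(x, y, deg=2)

def mse(actual, predicted):
    """Mean squared error between two arrays."""
    return float(np.mean((actual - predicted) ** 2))

line_error = mse(y, line(x))    # large: the line cannot bend to follow the data
curve_error = mse(y, curve(x))  # close to zero: the parabola matches the pattern

print(f"straight line training MSE: {line_error:.3f}")
print(f"parabola training MSE:      {curve_error:.3f}")
```

A high error on the training data itself is the usual sign that the model's assumptions are too basic.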

Variance is, in a sense, the opposite of Bias. During training, our model ‘sees’ the data a certain number of times to find patterns in it. If it does not work on the data for long enough, it will not find the patterns and high bias results. On the other hand, if our model is allowed to view the data too many times, it will learn very well for only that data. It will capture most patterns in the data, but it will also learn from the noise, that is, the unnecessary detail present in it.

We can define variance as the model’s sensitivity to fluctuations in the training data. A model with high variance may learn from noise, which causes it to treat trivial features as important.
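One way to see this sensitivity is to train the same kind of model many times on slightly different data and measure how much its predictions move. The sketch below assumes a made-up sine-shaped signal with random noise and compares a simple model (degree 1 polynomial) with a flexible one (degree 9); the data, noise level, and degrees are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
x_eval = np.linspace(0, 1, 50)

def true_signal(x):
    """Hypothetical underlying pattern the models try to learn."""
    return np.sin(2 * np.pi * x)

def prediction_variance(degree, n_repeats=200, noise_std=0.3):
    """Average variance of predictions across models trained on re-noised data."""
    preds = np.empty((n_repeats, x_eval.size))
    for i in range(n_repeats):
        # Same signal, fresh noise: a small fluctuation in the training data.
        y_train = true_signal(x_train) + rng.normal(0, noise_std, x_train.size)
        model = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
        preds[i] = model(x_eval)
    # Variance over repeats at each evaluation point, then averaged over points.
    return float(preds.var(axis=0).mean())

simple_var = prediction_variance(degree=1)
flexible_var = prediction_variance(degree=9)
print(f"degree 1 variance: {simple_var:.4f}")
print(f"degree 9 variance: {flexible_var:.4f}")
```

The flexible model's predictions swing much more from one training set to the next, which is what high variance means in practice.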

Figure 4: Example of Variance

In the above figure, we can see that our model has learned extremely well from our training data, which has taught it to identify cats. But when given new data, such as the picture of a fox, our model predicts it as a cat, as that is what it has learned. This happens when the Variance is high: our model captures all the features of the data given to it, including the noise, and tunes itself to that data. It predicts the training data very well, but when given new data it predicts poorly because it is too specific to the training data.

Hence, our model will perform really well on the training data and get high accuracy there, but will fail to perform on new, unseen data. New data may not have exactly the same features, and the model won’t be able to predict it very well. This is called Overfitting.

Figure 5: Over-fitted model where we see model performance on a) training data and b) new data

For any model, we have to find the right balance between Bias and Variance. This ensures that we capture the essential patterns in our data while ignoring the noise present in it. This is called the Bias-Variance Tradeoff. It helps optimize the error in our model and keeps it as low as possible.

An optimized model will be sensitive to the patterns in our data, but at the same time will be able to generalize to new data. For such a model, both the bias and the variance should be low so as to prevent overfitting and underfitting.

Figure 6: Error in Training and Testing with high Bias and Variance

In the above figure, we can see that when bias is high, the error on both the training and testing sets is high. When variance is high, the model performs well on the training set, where the error is low, but gives a high error on the testing set. There is a region in the middle where the error on both the training and testing sets is low and bias and variance are in balance.
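The same behaviour can be reproduced on a small scale by sweeping model complexity and recording the error on both sets. This is a sketch under assumed conditions (a noisy sine curve, polynomial models of degree 1, 3, and 9), not the experiment behind the figure.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_sine(x, noise_std=0.2):
    """Hypothetical data source: a sine pattern plus random noise."""
    return np.sin(2 * np.pi * x) + rng.normal(0, noise_std, x.size)

x_train = np.linspace(0, 1, 20)
y_train = noisy_sine(x_train)
x_test = np.linspace(0.01, 0.99, 200)
y_test = noisy_sine(x_test)

train_errors, test_errors = {}, {}
for degree in (1, 3, 9):
    model = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_errors[degree] = float(np.mean((y_train - model(x_train)) ** 2))
    test_errors[degree] = float(np.mean((y_test - model(x_test)) ** 2))
    print(f"degree {degree}: train MSE {train_errors[degree]:.3f}, "
          f"test MSE {test_errors[degree]:.3f}")
```

Training error keeps falling as the degree grows, while the degree 1 model shows high error on both sets, the high-bias end of the figure. Comparing the test errors of the more flexible models shows where extra complexity stops helping.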

Figure 7: Bull’s Eye Graph for Bias and Variance

The above bull’s eye graph helps explain the bias and variance tradeoff better. The best fit is when the predictions are concentrated in the center, i.e., at the bull’s eye. As we get farther and farther away from the center, the error in our model increases. The best model is one where bias and variance are both low.

Let’s find out the bias and variance in our weather prediction model. For this we use the daily forecast data as shown below:

Figure 8: Weather forecast data

We start off by importing the necessary modules and loading in our data.

Figure 9: Importing modules

In the data, we can see that the day and month are stored together in one column in a military-style date format. The day of the month will not have much effect on the weather, but monthly seasonal variations are important for predicting it. So, let’s make a new column which has only the month.

Figure 10: Creating new month column
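Since the original code appears only as a screenshot, here is a rough sketch of this step with pandas. The column names (`date`, `temp_high`) and the ISO-style date strings are assumptions made for illustration; the real dataset's names and date format are not shown in the text.

```python
import pandas as pd

# Hypothetical stand-in for the forecast data; column names are assumptions.
weather = pd.DataFrame({
    "date": ["2016-01-15", "2016-02-03", "2016-07-21"],
    "temp_high": [41, 38, 88],
})

# Parse the strings as dates and keep only the month number (1 to 12).
weather["month"] = pd.to_datetime(weather["date"]).dt.month
print(weather)
```

If the dates are stored in another format, `pd.to_datetime` accepts a `format` argument to describe it.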

The dataset now looks as shown below.

Figure 11: New dataset

Next, we drop the unnecessary columns.

Figure 12: Dropping columns
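A sketch of how columns might be dropped with pandas. Which columns the article actually removes is not stated, so the hypothetical `date` column below is only an example of a column made redundant by the new month column.

```python
import pandas as pd

# Hypothetical data after the month column has been added.
weather = pd.DataFrame({
    "date": ["2016-01-15", "2016-02-03"],
    "month": [1, 2],
    "temp_high": [41, 38],
})

# drop returns a new DataFrame without the listed columns.
weather = weather.drop(columns=["date"])
print(weather.columns.tolist())
```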

The dataset becomes as shown.

Figure 13: New Dataset

Let’s convert categorical columns to numerical ones.

Figure 14: Converting categorical columns to numerical form
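One possible way to do this conversion is with pandas category codes, sketched below. The `condition` column and its values are invented for the example, and the article's own encoding method is not visible in the text.

```python
import pandas as pd

# Hypothetical categorical column; the real column names are not shown.
weather = pd.DataFrame({"condition": ["Rain", "Clear", "Snow", "Rain"]})

# Each distinct label gets an integer code, assigned in sorted label order
# (Clear -> 0, Rain -> 1, Snow -> 2).
weather["condition"] = weather["condition"].astype("category").cat.codes
print(weather["condition"].tolist())
```

Integer codes imply an ordering between labels. When that ordering is not meaningful, one-hot encoding with `pd.get_dummies` is a common alternative.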

The dataset becomes as shown:

Figure 15: New Numerical Dataset

Let’s convert the precipitation column to numerical form, too.

Figure 16: Converting precipitation column to numerical form
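A hedged sketch of this step, assuming the precipitation values are stored as text and that some entries are not plain numbers. The `"T"` marker used here is a hypothetical example of such an entry, not something taken from the article's data.

```python
import pandas as pd

# Hypothetical text column with one entry that is not a number.
weather = pd.DataFrame({"precipitation": ["0.00", "0.12", "T", "1.05"]})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
weather["precipitation"] = pd.to_numeric(weather["precipitation"], errors="coerce")
print(weather)
```

Coercing produces missing values, which is one reason a missing-value check is a sensible next step.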

Finding all missing values.

Figure 17: Finding Missing values

Replacing missing values with ‘0’.

Figure 18: Replacing ‘NaN’ with 0
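The two steps above, counting missing values and replacing them with 0, might look like this in pandas. The small DataFrame is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
weather = pd.DataFrame({
    "month": [1, 2, 7],
    "precipitation": [0.0, np.nan, 1.05],
    "temp_high": [41.0, 38.0, np.nan],
})

# Count the missing values in each column.
missing_per_column = weather.isnull().sum()
print(missing_per_column)

# Replace every NaN with 0.
weather = weather.fillna(0)
```

Filling with 0 is the article's choice. For a column such as temperature, 0 is a real value, so filling with the column mean or median is another option worth considering.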

Let’s drop the prediction (target) column from our dataset to form the input variables.

Figure 19: Input variable

The output column looks as shown.

Figure 20: Output Variable
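A sketch of separating inputs from the output with pandas. The text does not say which column is being predicted, so `temp_high` is used here as a hypothetical target.

```python
import pandas as pd

# Hypothetical cleaned dataset; all column names are assumptions.
weather = pd.DataFrame({
    "month": [1, 2, 7, 8],
    "temp_low": [28, 25, 70, 72],
    "temp_high": [41, 38, 88, 90],
})

target_column = "temp_high"  # assumed target, not confirmed by the article

X = weather.drop(columns=[target_column])  # input variables
y = weather[target_column]                 # output variable
print(X.columns.tolist(), y.name)
```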

Splitting the dataset into training and testing data and fitting our model to it.

Figure 21: Splitting and fitting our dataset
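The model used in the article is not named in the text, so the following is only a NumPy stand-in: a random 80/20 split followed by an ordinary least squares fit on synthetic data. In practice, scikit-learn's `train_test_split` and `LinearRegression` are a common way to do the same thing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric features and target standing in for the weather data.
X = rng.uniform(0, 1, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, 100)

# Shuffle the row indices and hold out 20% of the rows for testing.
indices = rng.permutation(len(X))
n_test = len(X) // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

def add_intercept(features):
    """Prepend a column of ones so the fit includes a constant term."""
    return np.column_stack([np.ones(len(features)), features])

# Ordinary least squares fit on the training rows only.
coefficients, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
print("intercept and weights:", np.round(coefficients, 2))
```

Keeping the test rows out of the fit is what lets the later error measurements say something about unseen data.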

Predicting on our dataset and using NumPy’s variance function.

Figure 22: Finding variance
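A minimal sketch of the calculation with `np.var`, using a handful of made-up predictions in place of the model's real output.

```python
import numpy as np

# Hypothetical predictions standing in for the model's output on the test set.
predictions = np.array([61.0, 64.0, 58.0, 70.0, 67.0])

# np.var returns the mean of squared deviations from the mean (ddof=0 by default).
prediction_variance = np.var(predictions)
print(prediction_variance)  # 18.0
```

Note that this measures the spread of one set of predictions. Estimating the variance term of the bias-variance tradeoff more directly would require retraining the model on several different training samples and comparing its predictions.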

Using mean squared error to find the bias.

Figure 23: Finding Bias
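The mean squared error can be computed directly with NumPy, as sketched below on made-up values. The `actual` and `predictions` arrays are hypothetical and do not come from the article's dataset.

```python
import numpy as np

# Hypothetical actual values and model predictions for five test days.
actual = np.array([60.0, 65.0, 59.0, 72.0, 64.0])
predictions = np.array([61.0, 64.0, 58.0, 70.0, 67.0])

# Average of the squared differences between actual and predicted values.
mse = np.mean((actual - predictions) ** 2)
print(mse)  # 3.2
```

The article uses this number as its measure of bias. Strictly speaking, the mean squared error on test data mixes squared bias, variance, and irreducible noise, so it is best read as an overall error indicator.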

In this article, Everything You Need to Know About Bias and Variance, we looked at the various errors that can be present in a machine learning model. We learned about Bias and Variance, the two types of reducible error, which can be lowered and hence are used to help optimize a model. We also covered model optimization and error reduction, and finally saw how to find the bias and variance of our model using Python.
