#### How to Leverage KNN Algorithm in Machine Learning?

Lesson 9 of 15 | By Simplilearn | Part of the Machine Learning Tutorial: A Step-by-Step Guide for Beginners | Last updated on Sep 22, 2020


Python is one of the most widely used programming languages in the exciting field of data science. It leverages powerful machine learning algorithms to make data useful. One of those is K Nearest Neighbors, or KNN—a popular supervised machine learning algorithm used for solving classification and regression problems. The main objective of the KNN algorithm is to predict the classification of a new sample point based on data points that are separated into several individual classes. It is used in text mining, agriculture, finance, and healthcare.

In this KNN algorithm guide, we will cover the following:

- Why do we need the KNN algorithm?
- What is KNN?
- How do we choose the factor ‘K’?
- When do we use KNN?
- How does the KNN algorithm work?
- Use case: Predict whether a person will have diabetes or not

#### Why Do We Need the KNN Algorithm?

The KNN algorithm is useful when you are performing a pattern recognition task, such as classifying objects based on different features.

Suppose there is a dataset that contains information regarding cats and dogs. There is a new data point and you need to check if that sample data point is a cat or dog. To do this, you need to list the different features of cats and dogs.

Now, let us consider two features: claw sharpness and ear length. Plot these features on a 2D plane and check where the data points fit in.

On such a plot, the sharpness of claws is pronounced for cats, but not so much for dogs. On the other hand, the length of ears is pronounced for dogs, but not for cats.

Now, if we have a new data point based on the above features, we can easily determine if it’s a cat or a dog.

The new data point features indicate that the animal is, in fact, a cat.

Since KNN is based on feature similarity, we can perform classification tasks with the KNN classifier: a model trained with the KNN algorithm predicts that this new animal is a cat.

#### What is KNN?

K-Nearest Neighbors is one of the simplest supervised machine learning algorithms used for classification. It stores all available cases and classifies a new data point based on the classes of its most similar neighbors.

The following example shows a KNN algorithm being used to predict whether a glass of wine is red or white, based on variables such as sulphur dioxide and chloride levels.

K in KNN is a parameter that refers to the number of nearest neighbors in the majority voting process.

Here, we have taken **K=5**. The new data point is assigned the class held by the majority of its five nearest neighbors. The glass of wine is therefore classified as **red**, since four out of five neighbors are red.
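The voting step can be sketched in a few lines of Python; the neighbor labels below are made-up stand-ins for the five nearest wine samples:

```python
from collections import Counter

# Hypothetical class labels of the 5 nearest neighbors (K=5)
neighbor_labels = ["red", "red", "white", "red", "red"]

# The new point takes the most common label among its K neighbors
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # red: four of the five neighbors are red
```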

#### How Do We Choose the Factor ‘K’?

A KNN algorithm is based on feature similarity. Selecting the right K value, a process called parameter tuning, is important for achieving higher accuracy.

There is no definitive way to determine the best value of K; it depends on the type of problem you are solving, as well as the business scenario. A commonly used starting value for K is five. Selecting a K value of one or two makes the model sensitive to noise and outliers, resulting in overfitting: the algorithm performs well on the training set but poorly on unseen test data.

Consider the following example below to predict which class the new data point belongs to.

If you take K=3, the new data point is a red square.

But if we consider K=7, the new data point is a blue triangle, because blue triangles outnumber red squares among the seven nearest neighbors.

A common rule of thumb is to set K near the square root of n (sqrt(n)), where n is the total number of data points. An odd value of K is usually selected to avoid tied votes between two classes.
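This rule of thumb is easy to express in code; n = 154 here is an arbitrary example, not a value from the article:

```python
import math

n = 154                 # hypothetical number of data points
k = round(math.sqrt(n)) # sqrt(154) ≈ 12.4 → 12
if k % 2 == 0:          # prefer an odd K to avoid tied votes
    k += 1
print(k)  # 13
```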

#### When Do We Use the KNN Algorithm?

The KNN algorithm is best suited to the following scenarios:

- Data is labeled
- Data is noise-free
- Dataset is small, as KNN is a lazy learner

**Advantages of the KNN algorithm:**

- Since KNN requires no explicit training phase, new data can be added at any time without retraining the model.
- KNN is very easy to implement. Only two parameters are required: the value of K and the distance function (e.g., Euclidean or Manhattan distance).

**Disadvantages of the KNN algorithm:**

- KNN does not scale well to large datasets: computing the distance between the new point and every existing point is expensive, which degrades prediction performance.
- Feature scaling (standardization or normalization) is required before applying KNN to any dataset; otherwise, features with larger ranges dominate the distance calculation and KNN may generate wrong predictions.

#### How Does the KNN Algorithm Work?

Consider a dataset that contains two variables: height (cm) and weight (kg). Each point is classified as normal or underweight.

Based on the above data, you need to classify the following set as normal or underweight using the KNN algorithm.

To find the nearest neighbors, we will calculate the Euclidean distance.

The Euclidean distance between two points in the plane with coordinates (x, y) and (a, b) is given by:

dist = √((x − a)² + (y − b)²)

Let us calculate the Euclidean distance between the unknown data point and each known data point.

The following table shows the calculated Euclidean distance from the unknown data point to each known point.

Now, we have a new data point (57 kg, 170 cm), and we need to determine its class.

With K=3, we consider the three points with the smallest distances to the new point.

Since the majority of these three neighbors are classified as normal, the KNN algorithm predicts that the data point (57, 170) is normal.
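The whole worked example can be sketched as follows. The article's table is shown as an image, so the labeled (weight, height) points below are illustrative stand-ins; only the query point (57 kg, 170 cm) comes from the text:

```python
import math
from collections import Counter

# Illustrative (weight kg, height cm) points with class labels;
# these stand in for the article's table, which is not reproduced here
data = [
    ((58, 169), "normal"),
    ((55, 172), "normal"),
    ((53, 168), "underweight"),
    ((48, 165), "underweight"),
    ((62, 180), "normal"),
    ((70, 176), "normal"),
]
query = (57, 170)  # the unknown data point from the text

def euclidean(p, q):
    # dist = sqrt((x - a)^2 + (y - b)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Sort by distance to the query and vote among the K=3 closest points
nearest = sorted(data, key=lambda item: euclidean(item[0], query))[:3]
labels = [label for _, label in nearest]
prediction = Counter(labels).most_common(1)[0][0]
print(prediction)  # normal: two of the three nearest points are normal
```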

#### Use Case: Predict Whether a Person Will Have Diabetes or Not

The goal of this use case is to predict whether a person will be diagnosed with diabetes or not.

The dataset we’ll be using contains information on 768 people, some of whom were diagnosed with diabetes and some of whom were not.

The following is what the dataset looks like:

The dataset encompasses a wide range of features, such as pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, age, and outcome (target variable).

Let’s start by installing the necessary libraries.
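The article's code blocks are not reproduced here, so the snippets that follow are hedged sketches. A representative set of imports for this workflow, assuming the usual pandas/scikit-learn stack, is:

```python
# Representative imports for the diabetes use case (assumed stack:
# pandas for data handling, scikit-learn for the model and metrics)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
```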

Load the dataset using pandas:
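A sketch of the loading step. The file name `diabetes.csv` is an assumption, so this example reads an inline stand-in with the same nine columns (the first three rows of the Pima Indians Diabetes data):

```python
import io
import pandas as pd

# Inline stand-in for the diabetes file; in practice you would call
# pd.read_csv("diabetes.csv") on the downloaded dataset (path assumed)
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 9) for this stand-in; the full file is (768, 9)
print(df.columns.tolist())
```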

Certain columns, like glucose, blood pressure, insulin, and BMI, cannot meaningfully be zero; such zeroes represent missing measurements and would skew the results. We can replace these values with the mean of the respective column.
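One way to do this replacement, sketched on a tiny stand-in frame; `np.nan` marks the zeroes as missing before mean imputation:

```python
import numpy as np
import pandas as pd

# Stand-in frame: zeroes in these columns denote missing measurements
df = pd.DataFrame({
    "Glucose":       [148, 85, 183, 0],
    "BloodPressure": [72, 66, 0, 70],
    "Insulin":       [0, 94, 168, 88],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
})

zero_not_allowed = ["Glucose", "BloodPressure", "Insulin", "BMI"]
for col in zero_not_allowed:
    df[col] = df[col].replace(0, np.nan)      # mark zeroes as missing
    df[col] = df[col].fillna(df[col].mean())  # impute with the column mean
print(df)
```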

Next, we will split the dataset into training and testing sets.

Rule of thumb: If an algorithm computes distance or assumes normality, scale your features.
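A sketch of the split-and-scale step, using synthetic features in place of the real dataset; the 80/20 split ratio and `random_state` are assumptions, not values from the article:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features/labels standing in for the 768-row diabetes data
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# Hold out 20% of the rows for testing (assumed ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# KNN computes distances, so standardise features: fit the scaler on the
# training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.shape, X_test.shape)  # (614, 8) (154, 8)
```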

Now, define the model using KNeighborsClassifier and fit it to the training data.

Predict the test set results.
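These two steps might look as follows on synthetic stand-in data. `n_neighbors=25` (odd, near sqrt(614)) and the Euclidean metric are illustrative choices, not values taken from the article:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic scaled data standing in for the diabetes features
rng = np.random.default_rng(1)
X_train = rng.normal(size=(614, 8))
y_train = rng.integers(0, 2, size=614)
X_test = rng.normal(size=(154, 8))

# Define the model: K chosen odd and near sqrt(n_train), Euclidean distance
knn = KNeighborsClassifier(n_neighbors=25, metric="euclidean")
knn.fit(X_train, y_train)

# Predict the test set results
y_pred = knn.predict(X_test)
print(y_pred[:10])
```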

Calculate the accuracy of the model.

The accuracy of our model is (94 + 32) / (94 + 13 + 32 + 15) = 126/154 ≈ 0.82

You can also find the accuracy of the model using the accuracy_score function.
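Both routes can be checked with hypothetical labels arranged to reproduce the stated confusion matrix (94 true negatives, 13 false positives, 15 false negatives, 32 true positives over 154 test samples):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true/predicted labels matching the article's counts
y_test = [0] * 107 + [1] * 47
y_pred = [0] * 94 + [1] * 13 + [0] * 15 + [1] * 32

cm = confusion_matrix(y_test, y_pred)
print(cm)                # [[94 13]
                         #  [15 32]]

# Accuracy = (TN + TP) / total
manual = (94 + 32) / (94 + 13 + 32 + 15)
print(round(manual, 2))  # 0.82

# accuracy_score gives the same number directly
print(round(accuracy_score(y_test, y_pred), 2))  # 0.82
```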

In the real world, the KNN algorithm has applications for both classification and regression problems.

KNN is widely used in almost all industries, such as healthcare, financial services, eCommerce, political campaigns, etc. Healthcare companies use the KNN algorithm to determine if a patient is susceptible to certain diseases and conditions. Financial institutions predict credit card ratings or qualify loan applications and the likelihood of default with the help of the KNN algorithm. Political analysts classify potential voters into separate classes based on whom they are likely to vote for.

We hope that this guide helped to explain the basics of the KNN algorithm and how it classifies data based on feature similarity. We also examined how to choose the value of K and when the KNN algorithm is a good fit. Finally, we went through a use case demo to predict whether a person is likely to develop diabetes.

If you are looking to kickstart your career in this exciting field, check out our Machine Learning Certification Course today. You’ll get a solid foundation in leveraging algorithms to make valuable, game-changing predictions in any industry. What are you waiting for?
