Machine Learning with Scikit-Learn Tutorial

 

Welcome to lesson eight ‘Machine Learning with Scikit-Learn’ of the Data Science with Python Tutorial, which is a part of the Data Science with Python Course. In this lesson, we will study machine learning, its algorithms, and how Scikit-Learn makes it all so easy.

Objectives

In this lesson on Machine Learning with Scikit-Learn, you will learn:

  • What machine learning is and why it is important

  • The machine learning approach

  • Relevant terminologies that help you understand a dataset

  • Features of supervised and unsupervised learning models

  • Algorithms such as regression, classification, clustering, and dimensionality reduction

Why Machine Learning?

We generate 2.5 quintillion bytes of data every day. That's equal to the data stored on one hundred million Blu-ray discs which, when stacked together, would equal the height of four Eiffel Towers.

Note that none of this data is in a single format.

Now imagine if you had to sift through all of it. It would take ages. But if you let the machines take over, the task would be over in a jiffy. This is why we need machine learning.

Purpose of Machine Learning

Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and extract information. This enables data scientists to make information-driven decisions. Even businesses can benefit from it.

A business may hold a lot of historical or incoming data whose value it is not aware of. Machine learning can help provide insights into this data so that profitable decisions can be made and new business opportunities can be explored.

To do this, machine learning uses statistical and mathematical models and applies them to data sets. This process can either be semi-automated or fully automated.

Machine Learning Terminology

Let us take a look at some machine learning terms that will be used throughout this lesson.

Observations

These are the records, samples, or examples present in the data. They may contain one or more data points.

Features

These are the inputs or attributes that define a given data set. They're usually present as columns in a spreadsheet.

Response

A response is the label, outcome, target or some defined answer attached to the data set for the given set of data points. These terms will be explained later, using a dataset. But before that, let's look at the machine learning approach.

Machine Learning Approach

The machine learning approach starts with either a problem that you need to solve or a given dataset that you need to analyze.

  1. So your first step is to understand the problem or the data set. The size of the dataset does not matter at this point.

  2. The next step is to identify and extract the features of the dataset that affect the outcome.

  3. Then identify the problem type. For instance, ask whether the outcome you want to predict is categorical or takes a continuous range of values.

  4. Choose the appropriate model as the next step.

  5. After selecting the model, train and test it.

  6. The final step is to strive for accuracy. You should continue to fine-tune the parameters so your model performs optimally.

Steps 1 and 2: Understand the Dataset and Extract its Features

Let us now explore steps 1 and 2 of the machine learning approach - understanding the data set and extracting its features that affect the outcome.

This is a data set, which contains information about a firm.

Education (Yrs.)    Professional Training (1 = Yes, 0 = No)    Hourly Rate (USD)
16                  1                                          90
15                  0                                          65
12                  1                                          70
18                  1                                          130
16                  0                                          110
16                  1                                          100
15                  1                                          105
31                  0                                          70

It shows random records of some of its employees. Each row is an observation. Education and Professional Training are the attributes, or features. The first attribute indicates the employee's education in years, while the second shows whether the employee took professional training.

Now, the years of education and the professional training of the employees affect their pay. That's why the hourly rate is the response, or outcome, of the dataset.

Using this data, you can estimate that a person with sixteen years of education earns an average of 100 USD per hour, or about 95 USD per hour among those who also took professional training.
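The estimate can be checked with a few lines of NumPy. This is a minimal sketch: the array names `features` and `hourly_rate` are chosen here for illustration, and the values are copied from the table as printed.

```python
import numpy as np

# Columns: education in years, professional training (1 = yes, 0 = no).
features = np.array([
    [16, 1],
    [15, 0],
    [12, 1],
    [18, 1],
    [16, 0],
    [16, 1],
    [15, 1],
    [31, 0],
])
# Response: hourly rate in USD, one value per observation.
hourly_rate = np.array([90, 65, 70, 130, 110, 100, 105, 70])

# All observations with 16 years of education.
has_16_years = features[:, 0] == 16
average_rate_16 = hourly_rate[has_16_years].mean()

# Only those who also took professional training.
trained = has_16_years & (features[:, 1] == 1)
average_rate_16_trained = hourly_rate[trained].mean()

print(average_rate_16, average_rate_16_trained)  # 100.0 95.0
```

Averaging matching records is only a rough estimate; a trained model generalizes the same idea to combinations that do not appear in the data.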

Steps 3 and 4: Identify the Problem Type and Learning Model

Machine learning can either be supervised or unsupervised. The problem type should be selected based on the type of learning model.

Concept

Let us look at the difference between supervised and unsupervised learning.

Supervised Learning

  • In supervised learning, the dataset used to train a model should have observations, features, and responses. The model is trained to predict the “right” response for a given set of data points.

  • Supervised learning models are used to predict an outcome.

  • The goal of this model is to “generalize” from a dataset so that the “general rule” can be applied to new data as well.

Unsupervised Learning

  • In unsupervised learning, the response or the outcome of the data is not known.

  • Unsupervised learning models are used to identify and visualize patterns in data by grouping similar types of data.

  • The goal of this model is to “represent” data in a way that meaningful information can be extracted.

Problem Type

Data can either be continuous or categorical. Based on whether it is supervised or unsupervised learning, the problem type will differ. It is shown in the following image.
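As a rough sketch of how the learning type and the data type combine into a problem type, the helper below encodes one commonly used mapping. The function name `suggest_problem_type` and its string labels are assumptions made for illustration; they are not part of Scikit-Learn.

```python
def suggest_problem_type(has_response: bool, is_continuous: bool) -> str:
    """Map learning type and data type to a common problem type.

    has_response: True for supervised learning (responses are known).
    is_continuous: True if the data of interest is continuous,
    False if it is categorical.
    """
    if has_response:
        # Supervised: predict a quantity or a category.
        return "regression" if is_continuous else "classification"
    # Unsupervised: compress continuous data or group similar records.
    return "dimensionality reduction" if is_continuous else "clustering"


print(suggest_problem_type(has_response=True, is_continuous=True))
```

Real projects rarely reduce to two yes/no questions, so treat the result as a starting point for choosing a model rather than a rule.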

Example

Some examples of supervised and unsupervised learning models are shown in the following image.

How it Works—Supervised Learning Model

Before we look at the next steps of the machine learning approach, let's understand how supervised and unsupervised learning models work.

A known data set has observations, which include features and response. In supervised learning, the features and response are fed into the appropriate machine learning algorithm to train it. After the algorithm is fine-tuned, a predictive model is built on top of it.

Any new or unseen data with the same features will not have a label or response attached to it. The model is used to predict the response for this data.

In unsupervised learning, a known data set has a set of observations with features. But the response or outcome is not known.

Without the knowledge of what the expected outcome should be for a given set of data points, the machine learning algorithm cannot be taught to predict the outcome of any new dataset.

But we do know what the features are for a given data set. Based on this information and your domain expertise, you can choose a few assumptions that will help you define the features or attributes the algorithm should watch out for.

This helps it to identify, classify, and visually represent any new or unseen data. You can also use cross-validation to further test and train the model and improve its accuracy.
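To make the unsupervised case concrete, here is a minimal sketch that uses Scikit-Learn's `KMeans` clustering on made-up toy points. The data and the choice of two clusters are assumptions for illustration; the number of clusters plays the role of the domain assumption described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy features only; there is no response column.
X = np.array([
    [1, 2], [1, 4], [1, 0],
    [10, 2], [10, 4], [10, 0],
])

# Assumption for this sketch: the data contains two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Assign a new, unseen observation to one of the discovered groups.
new_point = np.array([[0, 0]])
new_label = kmeans.predict(new_point)[0]

print(labels, new_label)
```

The cluster numbers themselves are arbitrary; what matters is which observations end up grouped together.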

Steps 5 and 6: Train, Test, and Optimize the Model

The last two steps of the machine learning approach involve testing, training, and optimizing the model. Only supervised learning models can be trained because all the right features and labels or responses are already known.

In unsupervised learning, the machine learning algorithm looks for similarities based only on statistical properties. There are two approaches to training a supervised learning model.

The first approach is to use two separate datasets, one to train the model and the other to test it. The second approach is to split a single dataset into two parts, a training set and a testing set. Data analysts prefer the split approach because both sets are drawn from the same data, and the records assigned to each set can change from one iteration to the next.

The test set usually makes up about 20-40% of the original dataset. The split approach gives a more reliable estimate of how accurately the model predicts the unknowns.

To better understand how the split approach works, let's go back to the same records belonging to a company.

ID    Education (Yrs.)    Professional Training (1 = Yes, 0 = No)    Hourly Rate (USD)
10    16                  1                                          90
45    15                  0                                          65
83    12                  1                                          70
45    18                  1                                          130
54    16                  0                                          110
67    16                  1                                          100
71    15                  1                                          105
31    15                  0                                          70

Looking at the records, you can see that the response, the hourly rate, is continuous. Hence, the problem calls for a regression algorithm under the supervised learning model.

Our next step is to find the attributes that affect the outcome. Since the employee ID does not affect the outcome, we can drop it from the set.

Use five of the records for training and the remaining three for testing. This splits the dataset into 62.5% training data and 37.5% test data. The response is split into training and testing sets in the same way. This approach helps you measure and improve the accuracy of response predictions.
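The five-to-three split can be sketched with Scikit-Learn's `train_test_split`. The values are copied from the table above; the variable names and the fixed `random_state` are choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Columns: ID, education (years), professional training, hourly rate (USD).
records = np.array([
    [10, 16, 1, 90],
    [45, 15, 0, 65],
    [83, 12, 1, 70],
    [45, 18, 1, 130],
    [54, 16, 0, 110],
    [67, 16, 1, 100],
    [71, 15, 1, 105],
    [31, 15, 0, 70],
])

X = records[:, 1:3]  # drop the ID column, keep the two features
y = records[:, 3]    # response: hourly rate

# test_size=3 holds out three of the eight records (37.5%).
# random_state is fixed only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, random_state=0
)

print(X_train.shape, X_test.shape)  # (5, 2) (3, 2)
```

Changing `random_state` selects a different set of three test records, which is how the split can vary from one iteration to the next.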

Supervised Learning Model Considerations

While designing the supervised learning model, consider the following:

Feed the machine learning algorithm the response and only those features that directly affect the outcome.

Fine-tune the parameters of the model based on the training and testing results to optimize its performance and accuracy. This will help you to scale up the model easily. Generalization means predicting the response for data the model has not seen, so strive for a model that predicts consistently.
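One way to fine-tune parameters systematically is a cross-validated grid search. The sketch below is an illustration under stated assumptions: it uses synthetic data and swaps in `Ridge` regression because it has a tunable parameter (`alpha`); neither is taken from this lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only: 60 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=60)

# Candidate values for Ridge's regularization strength.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation scores every candidate on held-out folds.
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Because every candidate is scored on data it was not trained on, the selected setting reflects how consistently the model predicts rather than how well it memorizes.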

Scikit-Learn

Scikit-Learn is a powerful and modern machine learning library for Python. It's a great tool for fully and semi-automated advanced data analysis and information extraction. There are a lot of reasons why Scikit-Learn is a preferred machine learning tool.

It has efficient tools to identify and organize problems, such as whether a problem fits a supervised or unsupervised learning model.

It contains many free and open data sets. It has a rich set of built-in libraries for learning and predicting.

It provides model support for every problem type.

It also supports model persistence through Python's built-in pickle module.

It is supported by a huge open source community and vendor base.
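The model persistence mentioned above can be sketched as follows. The tiny training set is made up for the example, and `pickle` comes from the Python standard library rather than from Scikit-Learn itself.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set that follows y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)

# Serialize the fitted model to bytes, then restore it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model makes the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)  # True
```

Only unpickle data from sources you trust, and reload models with the same library version that saved them where possible.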

Scikit-Learn - Problem-Solution Approach

Scikit-Learn really helps data scientists organize their work through its problem-solution approach. This involves:

  • Choosing the model and machine learning algorithm based on the type of dataset.

  • Creating the estimator object, which represents the model in Scikit-Learn, by importing its class and instantiating it.

  • Fitting the model to the data to train and test it.

  • Using the predict method to forecast the response for new or unseen data.

  • Tuning the model through multiple iterations and observation of the results.

  • Striving for accuracy using built-in methods that support the predictive model.
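The steps above can be sketched on the iris dataset that ships with Scikit-Learn. The choice of `KNeighborsClassifier`, the 30% test split, and the fixed `random_state` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load a bundled dataset: features X and response y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 2. Import the estimator class (above) and instantiate it.
model = KNeighborsClassifier(n_neighbors=5)

# 3. Fit the model on the training data.
model.fit(X_train, y_train)

# 4. Predict the response for data the model has not seen.
predictions = model.predict(X_test)

# 5. Measure accuracy with the built-in score method.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Most Scikit-Learn estimators follow the same instantiate, fit, predict pattern, so swapping in a different model usually changes only the import and the constructor call.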

While working with a Scikit-Learn dataset or loading your own data to Scikit-Learn, always consider these points.

  • Create separate objects for feature and response.

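  • Ensure that features and response have only numeric values.

  • Features and response should be in the form of a NumPy ndarray.

  • Since features and response are arrays, they have shapes and sizes. Features are typically a two-dimensional array with one row per observation and one column per feature, and the response is a one-dimensional array with one value per observation.

  • By convention, features are mapped as X, and the response is mapped as y.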
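These points can be illustrated with a short sketch. The `raw_records` list is a hypothetical input format invented for the example; it shows text values being encoded as numbers before the arrays are built.

```python
import numpy as np

# Hypothetical raw records, with training recorded as text.
raw_records = [
    {"education": 16, "training": "Yes", "hourly_rate": 90},
    {"education": 15, "training": "No", "hourly_rate": 65},
    {"education": 12, "training": "Yes", "hourly_rate": 70},
]

# Features: numeric only, so "Yes"/"No" is encoded as 1/0.
X = np.array(
    [[r["education"], 1 if r["training"] == "Yes" else 0] for r in raw_records],
    dtype=float,
)

# Response: kept in a separate object from the features.
y = np.array([r["hourly_rate"] for r in raw_records], dtype=float)

print(X.shape, y.shape)  # (3, 2) (3,)
```

Checking the shapes before fitting catches most loading mistakes: the first dimension of X must match the length of y.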

Supervised Learning Models: Linear Regression

Linear regression is one of the most basic and widely used techniques for predicting the value of an attribute. It's easy to use because the model doesn't require a lot of tuning. It also runs very fast, which makes it time efficient.
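As a minimal sketch, Scikit-Learn's `LinearRegression` can be fitted to the eight employee records used earlier in this lesson, with the ID column dropped. Eight records are far too few for a dependable model, so the numbers it prints are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: education (years), professional training (1 = yes, 0 = no).
X = np.array([
    [16, 1], [15, 0], [12, 1], [18, 1],
    [16, 0], [16, 1], [15, 1], [15, 0],
])
# Response: hourly rate in USD.
y = np.array([90, 65, 70, 130, 110, 100, 105, 70])

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus an intercept.
print(model.coef_, model.intercept_)

# Predicted rate for 16 years of education with professional training.
predicted = model.predict(np.array([[16, 1]]))
print(predicted)
```

The fitted model is a weighted sum of the features plus an intercept, which is why there is little to tune beyond choosing which features to include.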

Consider the equations shown below:
