Welcome to lesson eight ‘Machine Learning with Scikit-Learn’ of the Data Science with Python Tutorial, which is a part of the Data Science with Python Course. In this lesson, we will study machine learning, its algorithms, and how Scikit-Learn makes it all so easy.
In this lesson on Machine Learning with Scikit-Learn, you'll get to know:
What machine learning is and why it is important
The machine learning approach
Relevant terminologies that help you understand a dataset
Features of supervised and unsupervised learning models
Algorithms such as regression, classification, clustering, and dimensionality reduction
We generate 2.5 quintillion bytes of data every day. That's equal to the data stored on one hundred million Blu-ray discs which, when stacked together, would reach the height of four Eiffel Towers.
Note that none of this data is in a single format.
Now imagine if you had to sift through all of it. It would take ages. But if you let the machines take over, the task would be over in a jiffy. This is why we need machine learning.
Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and extract information. This enables data scientists to make information-driven decisions. Even businesses can benefit from it.
A business may have a lot of historical or incoming data that it is not aware of. Machine learning can help provide insights into this data so that profitable decisions can be made and new business opportunities can be explored.
To do this, machine learning uses statistical and mathematical models and applies them to data sets. This process can either be semi-automated or fully automated.
Let us take a look at some machine learning terminologies that will be used throughout this lesson.
Observations
These are the records, samples, or examples present in the data. They may contain one or more data points.
Features
These are the inputs or attributes that define a given data set. They're usually present as columns in a spreadsheet.
Response
A response is the label, outcome, target or some defined answer attached to the data set for the given set of data points. These terms will be explained later, using a dataset. But before that, let's look at the machine learning approach.
The machine learning approach starts with either a problem that you need to solve or a given dataset that you need to analyze.
So your first step is to understand the problem or the data set. The size of the dataset does not matter at this point.
The next step is to identify and extract the features of the dataset that affect the outcome.
Then identify the problem type. For instance, ask whether the outcome is categorical or takes a continuous range of values.
Choose the appropriate model as the next step.
After selecting the model, train and test it.
The final step is to strive for accuracy. You should continue to fine-tune the parameters so your model performs optimally.
Let us now explore steps 1 and 2 of the machine learning approach - understanding the data set and extracting its features that affect the outcome.
This is a data set, which contains information about a firm.
Education (Yrs.) | Professional Training (Yes/No) | Hourly Rate (USD) |
---|---|---|
16 | 1 | 90 |
15 | 0 | 65 |
12 | 1 | 70 |
18 | 1 | 130 |
16 | 0 | 110 |
16 | 1 | 100 |
15 | 1 | 105 |
15 | 0 | 70 |
It shows random records of some of its employees. Each record is an Observation, and the values in it are its data points. Education and Professional Training are our Attributes, or Features. The first attribute indicates the employee's education in years, while the second shows whether the employee took professional training (1 for yes, 0 for no).
Now, the years of education and the professional training of the employees affect their salaries. That's why the hourly rate is the Response or outcome of the data set.
By using this data, you can now predict that a person with sixteen years of education and some professional training can fetch about 95 USD per hour, the average of the two matching records (90 and 100 USD).
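The terminology can be made concrete with a short sketch. It assumes pandas is available, and the column names are hypothetical labels chosen for this illustration. It separates the features from the response, then averages the hourly rate of the records with sixteen years of education and professional training.

```python
import pandas as pd

# Employee records from the lesson. The last education value is taken as 15,
# matching the version of these records that includes employee IDs
# (an assumption of this sketch).
records = pd.DataFrame({
    "education_years": [16, 15, 12, 18, 16, 16, 15, 15],
    "professional_training": [1, 0, 1, 1, 0, 1, 1, 0],
    "hourly_rate_usd": [90, 65, 70, 130, 110, 100, 105, 70],
})

# Features (attributes) are the inputs; the response is the outcome.
features = records[["education_years", "professional_training"]]
response = records["hourly_rate_usd"]

# Each row is one observation: 8 observations, 2 features.
print(features.shape, response.shape)

# Average hourly rate of employees with 16 years of education and training.
matching = records[
    (records["education_years"] == 16)
    & (records["professional_training"] == 1)
]
estimate = matching["hourly_rate_usd"].mean()
print(estimate)
```

Averaging the matching rows is the simplest possible prediction. A trained model generalizes the same idea to combinations of features that never appear in the data.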
Machine learning can either be supervised or unsupervised. The problem type should be selected based on the type of learning model.
Concept
Let us look at the difference between supervised and unsupervised learning.
[Figure: Supervised Learning compared with Unsupervised Learning]
Problem Type
Data can either be continuous or categorical. Based on whether it is supervised or unsupervised learning, the problem type will differ. It is shown in the following image.
Example
Some examples of supervised and unsupervised learning models are shown in the following image.
Before we look at the next steps of the machine learning approach, let's understand how supervised and unsupervised learning models work.
A known data set has observations, which include features and response. In supervised learning, the features and response are fed into the appropriate machine learning algorithm to train it. After the algorithm is fine-tuned, a predictive model is built on top of it.
Any new or unseen data with the same features will not have a label or response attached to it. The model is used to predict the response for this data.
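As a minimal sketch of this flow, the example below trains a classifier on a known data set and predicts the response for one unseen observation. The choice of the bundled iris data set, the k-nearest-neighbors classifier, and the measurements of the new flower are all assumptions made for illustration. The lesson does not prescribe them.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Known data set: features X and response y (the species label).
X, y = load_iris(return_X_y=True)

# Feed features and response into the algorithm to train it.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# New, unseen observation with the same four features but no label
# (hypothetical measurements in centimeters).
new_flower = [[5.0, 3.5, 1.4, 0.25]]

# The trained model predicts the missing response.
predicted = model.predict(new_flower)
print(predicted)
```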
In unsupervised learning, a known data set has a set of observations with features. But the response or outcome is not known.
Without the knowledge of what the expected outcome should be for a given set of data points, the machine learning algorithm cannot be taught to predict the outcome of any new dataset.
But we do know what the features are for a given data set. Based on this information and your domain expertise, you can choose a few assumptions that will help you define the features or attributes the algorithm should watch out for.
This helps it to identify, classify, and visually represent any new or unseen data. You can also use cross-validation to further test and train the model and improve its accuracy.
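The unsupervised case can be sketched in the same way. The example below uses k-means clustering on the iris measurements with the labels deliberately ignored. The number of clusters (three) plays the role of the assumption drawn from domain expertise, and both the algorithm and the data set are illustrative choices, not ones named in the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Only the features are used; the known labels are discarded on purpose.
X, _ = load_iris(return_X_y=True)

# Assumption from domain knowledge: the flowers fall into three groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# A new observation is assigned to the nearest existing cluster.
new_cluster = kmeans.predict([[5.0, 3.5, 1.4, 0.25]])
print(cluster_ids[:10], new_cluster)
```

The cluster numbers are arbitrary identifiers. They group similar observations but carry no meaning of their own until someone interprets them.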
The last two steps of the machine learning approach involve testing, training, and optimizing the model. Only supervised learning models can be trained because all the right features and labels or responses are already known.
In unsupervised learning, the machine learning algorithm looks for similarities based only on statistical properties. There are two approaches to training a supervised learning model.
The first approach is to use two separate data sets, one to train the model and the other to test it. The second approach is to split a single data set into two parts, a training set and a testing set. Data analysts prefer the split approach because both sets come from the same data, and the points assigned to each set can change from one iteration to the next.
The test set usually makes up about 20-40% of the original data set. The split approach gives greater accuracy when it comes to predicting the unknowns.
To better understand how the split approach works, let's go back to the same records belonging to a company.
ID | Education (Yrs.) | Professional Training (Yes/No) | Hourly Rate (USD) |
---|---|---|---|
10 | 16 | 1 | 90 |
45 | 15 | 0 | 65 |
83 | 12 | 1 | 70 |
45 | 18 | 1 | 130 |
54 | 16 | 0 | 110 |
67 | 16 | 1 | 100 |
71 | 15 | 1 | 105 |
31 | 15 | 0 | 70 |
Looking at the records, you can see that the response, the hourly rate, is continuous. Hence, the problem calls for a regression algorithm under the supervised learning model.
Our next step is to find the attributes which affect the outcome. Since the employee ID does not affect the outcome, we can drop it from the set.
Use five of the records for training and the remaining three for testing. This splits the data set into 62.5% training data and 37.5% test data. The response is split into training and testing sets in the same way. This whole approach helps achieve greater accuracy for response predictions.
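The split described above can be sketched with Scikit-Learn's `train_test_split`. The column names are hypothetical, and `random_state=0` is an arbitrary choice that only makes the shuffle repeatable. Passing an integer as `test_size` requests that exact number of test records.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The eight employee records, including the ID column.
records = pd.DataFrame({
    "id": [10, 45, 83, 45, 54, 67, 71, 31],
    "education_years": [16, 15, 12, 18, 16, 16, 15, 15],
    "professional_training": [1, 0, 1, 1, 0, 1, 1, 0],
    "hourly_rate_usd": [90, 65, 70, 130, 110, 100, 105, 70],
})

# Drop the ID (it does not affect the outcome) and separate the response.
X = records.drop(columns=["id", "hourly_rate_usd"])
y = records["hourly_rate_usd"]

# Hold out 3 of the 8 records (37.5%) for testing; train on the other 5.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, random_state=0
)
print(X_train.shape, X_test.shape)
```

Changing `random_state` assigns different records to each set, which is how the split can vary from one iteration to the next.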
While designing the supervised learning model, consider the following:
Feed the response and the features that directly affect the outcome into the machine learning algorithm.
Fine-tune the parameters of the model based on the training and testing results to optimize its performance and accuracy. This will help you to scale up the model easily. Generalization means predicting the response accurately for data the model has not seen. Strive for a model which can predict consistently.
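One way to sketch parameter tuning is a cross-validated grid search. Everything specific here is an assumption made for illustration: the synthetic data from `make_regression`, the Ridge regression model, and the candidate values for its `alpha` parameter. The held-out test set is used only once, at the end, to check how well the tuned model generalizes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, used here only as a stand-in for real records.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Try several values of the regularization strength with 5-fold
# cross-validation on the training data.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]

# R^2 of the best model on data it has never seen.
test_r2 = search.score(X_test, y_test)
print(best_alpha, test_r2)
```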
Scikit-Learn is a powerful and modern machine learning library for Python. It's a great tool for fully and semi-automated advanced data analysis and information extraction. There are a lot of reasons why Scikit-Learn is a preferred machine learning tool.
It has efficient tools to identify and organize problems, such as whether it fits a supervised or unsupervised learning model.
It contains many free and open data sets. It has a rich set of built-in libraries for learning and predicting.
It provides model support for every problem type.
It also supports model persistence through tools such as Python's pickle module.
It is supported by a huge open source community and vendor base.
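Model persistence, mentioned in the list above, can be sketched with Python's standard `pickle` module. The tiny data set and the choice of `LinearRegression` are assumptions made for illustration. The fitted model is serialized to bytes, restored, and checked against the original.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# A small, made-up data set where y is exactly twice x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Serialize the fitted model to bytes, then restore it.
saved = pickle.dumps(model)
restored = pickle.loads(saved)

# The restored model gives the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle data from a source you trust, and restore models with the same Scikit-Learn version that saved them where possible.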
Scikit-Learn really helps data scientists organize their work through its problem-solution approach. This involves:
Choosing the model and machine learning algorithm based on the data set type.
Creating the estimator object, which represents the model in Scikit-Learn, by importing its class and instantiating it.
Fitting the data into the model to train and test it.
Using the predict method to forecast the response for a new or unseen data set.
Tuning the model through multiple iterations and result observations.
Striving for accuracy using built-in methods that support the predictive model.
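The steps above can be sketched end to end. The bundled breast cancer data set, the decision tree estimator, and the candidate `max_depth` values are assumptions chosen for this illustration. Each iteration instantiates the estimator, fits it, predicts on the test set, and records the accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Categorical response, so a classification model is chosen.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

accuracies = {}
for depth in (1, 2, 3, 4):
    # Import the class and instantiate it to create the estimator object.
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Fit the training data to the model.
    model.fit(X_train, y_train)
    # Predict the response for data the model has not seen.
    predictions = model.predict(X_test)
    # Observe the result of this iteration.
    accuracies[depth] = accuracy_score(y_test, predictions)

best_depth = max(accuracies, key=accuracies.get)
print(accuracies, best_depth)
```

Choosing the depth directly on the test set is a simplification. In practice a separate validation set or cross-validation keeps the final test score honest.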
While working with a Scikit-Learn dataset or loading your own data to Scikit-Learn, always consider these points.
Create separate objects for feature and response.
Ensure that features and response have only numeric values.
Features and response should be in the form of a NumPy ndarray.
Since features and response would be in the form of arrays, they would have shapes and sizes.
By convention, features are mapped as X, and the response is mapped as y.
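These points can be checked on one of the bundled data sets. The sketch below uses iris as an assumed example and inspects the type, shape, and numeric content of the feature and response objects.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Separate objects for features and response.
X = iris.data
y = iris.target

# Both are NumPy arrays holding only numeric values.
print(type(X), X.dtype, X.shape)  # 2-D: (observations, features)
print(type(y), y.dtype, y.shape)  # 1-D: one response per observation

# A single feature must still be 2-D before it is passed to an estimator.
single_feature = X[:, 0].reshape(-1, 1)
print(single_feature.shape)
```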
Linear regression is the most basic and most widely used technique for predicting the value of an attribute. It's pretty easy to use as the model doesn't require a lot of tuning. It also runs very fast, which makes it more time efficient.
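As a minimal sketch, Scikit-Learn's `LinearRegression` can be fitted to the employee records from earlier in the lesson, with the ID column left out. With only eight records the result is an illustration of the API and not a reliable salary model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: years of education, professional training (1 = yes, 0 = no).
X = np.array([
    [16, 1], [15, 0], [12, 1], [18, 1],
    [16, 0], [16, 1], [15, 1], [15, 0],
])
# Response: hourly rate in USD.
y = np.array([90, 65, 70, 130, 110, 100, 105, 70])

# No parameters need tuning for a plain least-squares fit.
model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus the intercept.
print(model.coef_, model.intercept_)

# Predicted hourly rate for 16 years of education with training.
predicted_rate = model.predict(np.array([[16, 1]]))[0]
print(predicted_rate)
```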
Consider the equations shown below:
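A common way to write the linear regression model is given here as an assumption about the intended notation, and the lesson's own symbols may differ. The predicted response is a weighted sum of the features $x_1, \dots, x_n$, and the coefficients $\beta_0, \dots, \beta_n$ are chosen to minimize the squared error over the $m$ training observations:

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n

\min_{\beta_0, \dots, \beta_n} \; \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```

Here $y_i$ is the known response of the $i$-th observation and $\hat{y}_i$ is the model's prediction for it.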