If you are a Python programmer looking for a robust library to build machine learning models, scikit-learn is one you will want to seriously consider. This cheat sheet gives an overview of the most common features and tools available in the scikit-learn library and can serve as your go-to scikit-learn glossary.
Scikit Learn Cheatsheet
In this scikit learn cheat sheet, you will learn about:
- Scikit Learn Tool (A robust library available in Python)
- Dataset Loading (Different ways to load data into Scikit learn before building models)
- Data Representation (Different ways to represent the data in Scikit learn)
- Estimator API (One of the main APIs implemented by Scikit-learn)
- Linear regression (To extend linear models)
- Stochastic Gradient Descent (an optimization technique to train models)
- Anomaly Detection (To identify data points that do not fit well with the rest of the data in the dataset)
- KNN Learning (K-Nearest Neighbor (KNN) - a simple machine learning algorithm)
- Boosting Methods (To build an ensemble model in an incremental way)
- Clustering Methods (To find patterns among data samples and cluster them into groups)
Scikit Learn Tool
Scikit-learn is one of the most popular and robust libraries available in Python. It provides a number of efficient tools for machine learning and statistical modeling, including regression, classification, clustering, and dimensionality reduction, all through a consistent Python interface. The library is mostly written in Python but is built upon NumPy, SciPy, and Matplotlib.
Dataset Loading
In order to build machine learning models, you need to first load your data into memory.
Packaged Datasets
The scikit-learn library comes with a number of packaged datasets. These datasets are useful for getting a handle on a given algorithm or library feature before building your own models.
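For example, the bundled iris dataset can be loaded and inspected in a few lines:

```python
# Load the packaged iris dataset and inspect its shape.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)            # (150, 4): 150 samples, 4 features
print(iris.target_names)  # the three iris species
```

Other loaders such as load_digits and load_wine follow the same pattern.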
Load from CSV
If you have a dataset as a CSV file on your local workstation or on a remote server, you can also load it directly into scikit-learn. Once the X and y variables are prepared, you can start training a machine learning model.
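A minimal sketch of loading a CSV with pandas; the file name `data.csv` and its `target` column are assumptions (the example writes a tiny CSV first so it is self-contained):

```python
import pandas as pd

# Hypothetical CSV written here only so the example runs end to end.
with open("data.csv", "w") as f:
    f.write("f1,f2,target\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n")

df = pd.read_csv("data.csv")
X = df.drop(columns=["target"]).values  # feature matrix
y = df["target"].values                 # target array
print(X.shape, y.shape)                 # (3, 2) (3,)
```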
Data Representation
Scikit-learn comes with multiple ways to represent data, such as tables, feature matrices, and target arrays.
Data as Tables
Tables are the best way to represent data in scikit-learn. A table is a 2D grid of data, with rows representing individual elements of the dataset and columns representing the quantities related to those elements.
Data as Feature Matrix
A feature matrix is a table layout stored in a variable X, assumed to be 2D with shape [n_samples, n_features]. It is most often held in a NumPy array or a Pandas DataFrame. As in a table, the rows are the individual samples in the dataset, and the features (columns) are the distinct observations that describe each sample in a quantitative manner.
Data as Target Array
A target array, or label, is denoted by the variable y. It is usually 1D with length n_samples and is most often held in a NumPy array or a Pandas Series. A target array may contain continuous numerical values or discrete class labels.
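Putting the two together, a small illustrative example (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Feature matrix X: 2D, shape [n_samples, n_features]
X = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})

# Target array y: 1D, length n_samples
y = np.array([0, 1, 1])

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```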
Estimator API
The Estimator API is one of the main APIs implemented by scikit-learn. It provides a consistent interface for a wide range of ML applications, which is why all machine learning algorithms in scikit-learn are implemented via the Estimator API. An estimator is any object that learns from data (by fitting the data), whether it implements classification, regression, or clustering, or is a transformer that extracts useful features from raw data.
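Every estimator follows the same fit/predict pattern; here is a sketch using logistic regression on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = LogisticRegression(max_iter=1000)  # the estimator object
est.fit(X_train, y_train)                # learn from the training data
pred = est.predict(X_test)               # predict on unseen data
print(est.score(X_test, y_test))         # mean accuracy
```

Swapping `LogisticRegression` for any other estimator leaves the rest of the code unchanged; that uniformity is the point of the API.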
Generalized Linear Regression
Generalized Linear Models (GLM) extend linear models in two ways. First, the predicted values ŷ are linked to a linear combination of the input variables X via an inverse link function h, so that ŷ = h(Xw). Second, the squared loss function is replaced by the unit deviance d of a distribution in the exponential family. The minimization problem is thus

min_w (1 / (2 · n_samples)) · Σᵢ d(yᵢ, ŷᵢ) + (α / 2) · ‖w‖₂²

where α is the L2 regularization penalty.
The following table lists some specific EDMs (Exponential Dispersion Models) and their unit deviance:

| Distribution | Target Domain | Unit Deviance d(y, ŷ) |
|---|---|---|
| Normal | y ∈ (−∞, ∞) | (y − ŷ)² |
| Poisson | y ∈ [0, ∞) | 2(y·log(y/ŷ) − y + ŷ) |
| Gamma | y ∈ (0, ∞) | 2(log(ŷ/y) + y/ŷ − 1) |
| Inverse Gaussian | y ∈ (0, ∞) | (y − ŷ)² / (y·ŷ²) |
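As a sketch, scikit-learn's PoissonRegressor fits a GLM with the Poisson unit deviance and a log link; the toy count data below is generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Toy count data: Poisson targets with a log-linear mean.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(X @ np.array([1.0, 0.5])))

glm = PoissonRegressor(alpha=1e-3)  # alpha is the L2 penalty above
glm.fit(X, y)
print(glm.predict(X[:3]))  # predictions are always positive (log link)
```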
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is an optimization technique that provides a way to train machine learning models. It can be applied to large-scale and sparse machine learning problems which are often found in text classification and natural language processing.
Advantages of SGD:
- Efficient
- Easy to implement (lots of opportunities for code tuning).
Disadvantages of SGD:
- Requires a number of hyperparameters
- Sensitive to feature scaling
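Because of that sensitivity to feature scaling, a common sketch is to pair SGDClassifier with StandardScaler in a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Scale features first, then train a linear SVM via SGD (hinge loss).
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))
```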
Anomaly Detection
Anomaly detection is used to identify data points that do not fit well with the rest of the data in the dataset. Anomalies or outliers can be divided into three categories:
- Point anomalies − Occurs if an individual data instance is considered anomalous with respect to the rest of the data.
- Contextual anomalies − Occurs if a data instance is anomalous in a specific context.
- Collective anomalies − Occurs if a collection of related data instances is anomalous with respect to the entire dataset rather than individual values.
There are two methods used for anomaly detection - outlier detection and novelty detection:

| SNo | Method | Description |
|---|---|---|
| 1 | Outlier detection | The training data contains outliers, i.e. observations that are far from the rest. The estimator fits the regions where the training data is most concentrated, ignoring the deviant observations. |
| 2 | Novelty detection | The training data is not polluted by outliers. The estimator decides whether a new, unseen observation is an outlier (a novelty) or not. |
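A novelty-detection sketch with One-Class SVM: fit on clean data, then flag new points (+1 = inlier, −1 = outlier). The data here is synthetic, for illustration only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Clean training data clustered around the origin.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

detector = OneClassSVM(nu=0.05).fit(X_train)

X_new = np.array([
    [0.1, -0.2],  # close to the training cloud
    [8.0, 8.0],   # far away from it
])
print(detector.predict(X_new))  # +1 for inliers, -1 for outliers
```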
KNN Learning
K-Nearest Neighbor (KNN) is one of the simplest machine learning algorithms. It does not build a model during a training phase; instead, the whole training dataset is used at prediction (testing) time. The k-NN algorithm consists of two steps:
Step 1: Compute and store the k nearest neighbors for each sample in the training set.
Step 2: Retrieve the k nearest neighbors from the dataset. Among these k-nearest neighbors, predict the class through voting.
The module sklearn.neighbors provides the functionality for unsupervised and supervised KNN learning methods.
Unsupervised KNN Learning
The sklearn.neighbors.NearestNeighbors module is used to implement unsupervised KNN learning. The following parameters are used in this module:

| SNo | Parameter | Description |
|---|---|---|
| 1 | n_neighbors − int, optional | Number of neighbors to use for kneighbors queries. Default is 5. |
| 2 | radius − float, optional | Range of parameter space for radius_neighbors queries. Default is 1.0. |
| 3 | algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional | Algorithm used to compute the nearest neighbors. 'auto' attempts to pick the best option based on the training data. |
| 4 | leaf_size − int, optional | Leaf size passed to BallTree or KDTree; affects construction and query speed. Default is 30. |
| 5 | metric − string or callable | Distance metric to use for the tree. Default is 'minkowski'. |
| 6 | p − integer, optional | Parameter for the Minkowski metric: p=1 is Manhattan distance, p=2 is Euclidean. Default is 2. |
| 7 | metric_params − dict, optional | Additional keyword arguments for the metric function. |
| 8 | n_jobs − int or None, optional | Number of parallel jobs to run for the neighbors search. |
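A short sketch of querying the nearest neighbors of a new point with NearestNeighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0]])

nn = NearestNeighbors(n_neighbors=2, algorithm="auto").fit(X)
dist, idx = nn.kneighbors([[0.1, 0.1]])
print(idx)   # indices of the 2 nearest training samples
print(dist)  # their distances to the query point
```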
Supervised KNN Learning
Supervised KNN learning is used for classification (for data with discrete labels) and regression (for data with continuous labels). There are two different types of nearest neighbor classifiers used by scikit-learn:

| SNo | Classifier | Description |
|---|---|---|
| 1 | KNeighborsClassifier | Implements learning based on the k nearest neighbors of each query point, where k is a user-specified integer. |
| 2 | RadiusNeighborsClassifier | Implements learning based on the neighbors within a fixed radius r of each query point, where r is a user-specified floating-point value. |
Boosting Methods
Boosting methods help you build an ensemble model in an incremental way by training each base model estimator sequentially. The sklearn.ensemble module has two boosting methods - AdaBoost and Gradient Tree Boosting.
AdaBoost
AdaBoost is a very successful boosting ensemble method that assigns weights to the instances in the dataset, increasing the weight of incorrectly classified instances so that subsequent estimators focus on them.
Classification With AdaBoost
The sklearn.ensemble.AdaBoostClassifier is used to build a classifier in AdaBoost. The main parameter of this module is base_estimator (renamed estimator in scikit-learn 1.2), the base estimator from which the boosted ensemble is built. If this parameter is set to None, the base estimator defaults to DecisionTreeClassifier(max_depth=1).
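A minimal sketch on synthetic data, leaving the base estimator at its default (a depth-1 decision "stump"):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 sequentially trained stumps, each focusing on the
# instances the previous ones misclassified.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```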
Gradient Tree Boost
Gradient Tree Boost is a generalization of boosting to arbitrary differentiable loss functions. It can be used for any type of regression and classification problem. The main advantage of gradient tree boosting is that it can handle mixed-type data.
Classification With Gradient Tree Boost
The sklearn.ensemble.GradientBoostingClassifier is used to build the classifier in Gradient Tree Boost. The main parameter of this module is ‘loss’, which is the value of the loss function to be optimized.
- If we set this value to "deviance" (named "log_loss" in newer scikit-learn versions), it refers to deviance for classification with probabilistic outputs.
- If we set this value to "exponential", it recovers the AdaBoost algorithm.
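A classification sketch with GradientBoostingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # depth of the individual regression trees
    random_state=0,
)
gbc.fit(X, y)
print(gbc.score(X, y))
```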
Clustering Methods
Clustering methods are used to find similarities and relationship patterns among data samples and to group them into clusters based on feature similarity. The following clustering methods are available in the sklearn.cluster module of scikit-learn.
| SNo | Algorithm | Parameters | Scalability | Metric Used |
|---|---|---|---|---|
| 1 | K-Means | No. of clusters | Very large n_samples | Distance between points |
| 2 | Affinity Propagation | Damping | Not scalable with n_samples | Graph distance |
| 3 | Mean-Shift | Bandwidth | Not scalable with n_samples | Distance between points |
| 4 | Spectral Clustering | No. of clusters | Medium n_samples, small n_clusters | Graph distance |
| 5 | Hierarchical Clustering | Distance threshold or no. of clusters | Large n_samples, large n_clusters | Distance between points |
| 6 | DBSCAN | Size of neighborhood | Very large n_samples, medium n_clusters | Nearest point distance |
| 7 | OPTICS | Minimum cluster membership | Very large n_samples, large n_clusters | Distance between points |
| 8 | BIRCH | Threshold, branching factor | Large n_samples, large n_clusters | Euclidean distance between points |
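As a sketch, K-Means on synthetic blob data, asking for the three clusters we know are there:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs of points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(np.unique(labels))  # three cluster ids: [0 1 2]
```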
Want to Learn More?
Scikit-learn is a very simple tool that lets you see how everything works behind the scenes. Once you have mastered it, you can move on to other advanced tools that make building models much easier. If you want to learn more, you can check out Simplilearn’s Data Scientist course created in collaboration with IBM. It features exclusive IBM hackathons, masterclasses, ask-me-anything sessions, live interaction with practitioners, practical labs, and projects. Get started with this course today and boost your career in data science.