If you are a Python programmer looking for a robust library to build machine learning models, scikit-learn is one you will want to seriously consider. This scikit-learn cheat sheet gives you an overview of the most common features and tools available in the scikit-learn library, and it can serve as your go-to scikit-learn glossary.

Scikit Learn Cheatsheet

In this scikit-learn cheat sheet, you will learn about:

  • Scikit Learn Tool (a robust machine learning library available in Python)
  • Dataset Loading (different ways to load data into scikit-learn before building models)
  • Data Representation (different ways to represent data in scikit-learn)
  • Estimator API (one of the main APIs implemented by scikit-learn)
  • Generalized Linear Regression (models that extend linear regression)
  • Stochastic Gradient Descent (an optimization technique used to train models)
  • Anomaly Detection (identifying data points that do not fit well with the rest of the data in the dataset)
  • KNN Learning (K-Nearest Neighbors, a simple machine learning algorithm)
  • Boosting Methods (building an ensemble model in an incremental way)
  • Clustering Methods (finding patterns among data samples and clustering them into groups)

Scikit Learn Tool

Scikit-learn is one of the most popular and robust libraries available in Python. It provides a number of efficient tools for machine learning and statistical modeling, including regression, classification, clustering, and dimensionality reduction, all through a consistent Python interface. The library is written mostly in Python and is built upon NumPy, SciPy, and Matplotlib.

Dataset Loading

In order to build machine learning models, you need to first load your data into memory.

Packaged Datasets

The scikit-learn library comes with many packaged datasets. These datasets are useful for getting a handle on a given algorithm or library feature before building models on your own data.
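
For example, a minimal sketch that loads the bundled iris dataset and inspects its shape:

```python
from sklearn.datasets import load_iris

# Load the bundled iris dataset as a feature matrix X and a target vector y
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4) -> 150 samples, 4 features
print(y.shape)  # (150,)   -> one label per sample
```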

Load from CSV

If you have a dataset as a CSV file on your local workstation or on a remote server, you can also load it directly into scikit-learn. Once the X and y variables are prepared, you can start training a machine learning model.
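
A minimal sketch, assuming a local file named data.csv whose last column holds the labels (both the file name and the column layout are hypothetical here):

```python
import pandas as pd

# Read the CSV file into a DataFrame (data.csv is a hypothetical file name)
df = pd.read_csv("data.csv")

# Split into a feature matrix X and a label vector y,
# assuming the label is stored in the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```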

Data Representation

Scikit-learn supports multiple ways of representing data, such as tables, feature matrices, and target arrays.

Data as Tables

Tables are the best way to represent data in scikit-learn. A table consists of a 2D grid of data in which the rows represent individual elements of the dataset and the columns represent the quantities related to those elements.

Data as Feature Matrix

A feature matrix is a table layout stored in a variable X and assumed to be two-dimensional with shape [n_samples, n_features]. It is most often contained in a NumPy array or a pandas DataFrame. As in a table, the rows correspond to the individual objects described by the dataset, while the features (columns) are the distinct observations that describe each sample in a quantitative manner.

Data as Target Array 

A target array, or label array, is conventionally denoted by the variable y. It is usually one-dimensional with length n_samples and is most often contained in a NumPy array or a pandas Series. A target array may contain either continuous numerical values or discrete class labels.
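
A minimal sketch of these conventions, using the bundled iris dataset (the as_frame option assumes a reasonably recent scikit-learn version):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

X = iris.data    # feature matrix: a pandas DataFrame with shape (150, 4)
y = iris.target  # target array: a pandas Series of length 150

print(X.shape)   # (150, 4) -> [n_samples, n_features]
print(y.shape)   # (150,)   -> one label per sample
```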

Estimator API

The Estimator API is one of the main APIs implemented by scikit-learn. It provides a consistent interface for a wide range of ML applications, which is why all machine learning algorithms in scikit-learn are implemented via the Estimator API. An estimator is any object that learns from data (by fitting it); this applies to classification, regression, and clustering algorithms, and even to transformers that extract useful features from raw data.
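
A minimal sketch of the instantiate / fit / predict pattern that the Estimator API standardizes, shown here with LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x, with X as a single-feature matrix
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()     # 1. instantiate the estimator
model.fit(X, y)                # 2. learn from the data
print(model.predict([[5.0]]))  # 3. predict on new data -> approximately [10.]
```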

Generalized Linear Regression

Generalized Linear Models (GLMs) extend linear models in two ways. First, the predicted values $\hat{y}$ are linked to a linear combination of the input variables $X$ via an inverse link function $h$:

$\hat{y}(w, X) = h(Xw)$

Second, the squared loss function is replaced by the unit deviance $d$ of a distribution in the exponential family (an exponential dispersion model, EDM).

The minimization problem is thus,

$\min_{w} \frac{1}{2\,n_{\text{samples}}} \sum_{i} d(y_i, \hat{y}_i) + \frac{\alpha}{2} \lVert w \rVert_2^2$

where $\alpha$ is the L2 regularization penalty.

The following table lists some specific EDMs and their unit deviance:

| Distribution | Unit Deviance $d(y, \hat{y})$ |
|---|---|
| Normal | $(y - \hat{y})^2$ |
| Poisson | $2\left(y \log\frac{y}{\hat{y}} - y + \hat{y}\right)$ |
| Gamma | $2\left(\log\frac{\hat{y}}{y} + \frac{y}{\hat{y}} - 1\right)$ |
| Inverse Gaussian | $\frac{(y - \hat{y})^2}{y \hat{y}^2}$ |
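
As a sketch of how this looks in practice, scikit-learn exposes several of these GLMs directly; for example, Poisson regression via sklearn.linear_model.PoissonRegressor (available in recent scikit-learn versions):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Toy count data: non-negative integer targets
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 1, 2, 4])

# alpha is the L2 regularization penalty from the objective above
glm = PoissonRegressor(alpha=1.0)
glm.fit(X, y)
print(glm.predict([[5.0]]))
```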

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an optimization technique that provides a way to train machine learning models. It is well suited to large-scale and sparse machine learning problems, which are often encountered in text classification and natural language processing.

Advantages of SGD:

  • Efficient
  • Easy to implement (lots of opportunities for code tuning).

Disadvantages of SGD:

  • Requires a number of hyperparameters 
  • Sensitive to feature scaling
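
A minimal sketch using SGDClassifier; because SGD is sensitive to feature scaling, the features are standardized first inside a pipeline:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize features before SGD, since SGD is sensitive to feature scaling
clf = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", max_iter=1000))
clf.fit(X, y)
print(clf.predict(X[:3]))
```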

Anomaly Detection

Anomaly detection is used to identify data points that do not fit well with the rest of the data in the dataset. Anomalies or outliers can be divided into three categories:

  • Point anomalies − occur when an individual data instance is considered anomalous with respect to the rest of the data.
  • Contextual anomalies − occur when a data instance is anomalous in a specific context.
  • Collective anomalies − occur when a collection of related data instances is anomalous with respect to the entire dataset rather than the individual values.

There are two methods used for anomaly detection - outlier detection and novelty detection:

| S.No | Method | Description |
|------|--------|-------------|
| 1 | Outlier detection | The training data contains outliers that are far from the rest of the data. It is also known as unsupervised anomaly detection. |
| 2 | Novelty detection | The training data does not contain outliers. It is also known as semi-supervised anomaly detection. |
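
A minimal sketch of outlier detection with sklearn.ensemble.IsolationForest, one of several anomaly detection estimators in scikit-learn:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly clustered points plus two obvious outliers
X = np.array([[0.1], [0.2], [0.15], [0.3], [10.0], [-9.0]])

iso = IsolationForest(contamination=0.3, random_state=0)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier
print(labels)
```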

KNN Learning

K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms. It does not build a model in a separate training phase; instead, the whole training set is stored and used at prediction time. The k-NN algorithm consists of two steps:

Step 1: Compute and store the k nearest neighbors for each sample in the training set.

Step 2: For a new sample, retrieve its k nearest neighbors from the stored dataset and predict its class by majority vote among those neighbors.

The module sklearn.neighbors provides the functionality for unsupervised and supervised KNN learning methods.

Unsupervised KNN Learning

The sklearn.neighbors.NearestNeighbors class is used to implement unsupervised KNN learning. It accepts the following parameters:

| S.No | Parameter | Description |
|------|-----------|-------------|
| 1 | n_neighbors − int, optional | The number of neighbors to fetch. The default value is 5. |
| 2 | radius − float, optional | Limits the distance within which neighbors are returned. The default value is 1.0. |
| 3 | algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional | The algorithm used to compute the nearest neighbors. If set to 'auto', the most appropriate algorithm is chosen based on the values passed to the fit method. |
| 4 | leaf_size − int, optional | Affects the speed of tree construction and querying, as well as the memory required to store the tree. The default value is 30. |
| 5 | metric − string or callable | The metric used to compute the distance between points. It can be passed as a string or a callable function. |
| 6 | p − integer, optional | The power parameter for the Minkowski metric. The default value is 2 (Euclidean distance). |
| 7 | metric_params − dict, optional | Additional keyword arguments for the metric function. The default value is None. |
| 8 | n_jobs − int or None, optional | The number of parallel jobs to run for the neighbor search. The default value is None. |
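
A minimal sketch of unsupervised nearest-neighbor queries with NearestNeighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)

# For each sample, find its 2 nearest neighbors
# (each point counts as its own closest neighbor)
distances, indices = nbrs.kneighbors(X)
print(indices)
print(distances)
```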

Supervised KNN Learning

Supervised KNN learning is used for classification (for data with discrete labels) and regression (for data with continuous labels). scikit-learn provides two different nearest neighbor classifiers:

| S.No | Classifier | Description |
|------|------------|-------------|
| 1 | KNeighborsClassifier | Implements learning based on the k nearest neighbors of each query point. The choice of the value of k depends on the data. |
| 2 | RadiusNeighborsClassifier | Implements learning based on the number of neighbors within a fixed radius of each training point. |
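
A minimal sketch of classification with KNeighborsClassifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5, the default
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out data
```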

Boosting Methods

Boosting methods help you build an ensemble model in an incremental way by training each base estimator sequentially. The sklearn.ensemble module provides two boosting methods: AdaBoost and Gradient Tree Boosting.

AdaBoost

AdaBoost is a very successful boosting ensemble method that assigns weights to the instances in the dataset.

Classification With AdaBoost

The sklearn.ensemble.AdaBoostClassifier is used to build an AdaBoost classifier. The main parameter of this class is base_estimator, the base estimator from which the boosted ensemble is built. If this parameter is left as None, the base estimator is DecisionTreeClassifier(max_depth=1).
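
A minimal sketch with the default base estimator (a depth-1 decision tree); note that newer scikit-learn releases rename base_estimator to estimator:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# 50 sequentially trained weak learners (depth-1 decision trees by default)
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))  # mean accuracy on the training data
```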

Gradient Tree Boost

Gradient Tree Boost is a generalization of boosting to arbitrary differentiable loss functions. It can be used for any type of regression and classification problem. The main advantage of gradient tree boosting is that it can handle mixed-type data.

Classification With Gradient Tree Boost

The sklearn.ensemble.GradientBoostingClassifier is used to build the classifier in Gradient Tree Boost. The main parameter of this class is loss, the loss function to be optimized.

  • If this value is set to "deviance", it refers to deviance for classification with probabilistic outputs.
  • If this value is set to "exponential", gradient boosting recovers the AdaBoost algorithm.
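
A minimal sketch with the default loss (the deviance/log-loss described above; recent scikit-learn versions spell this value "log_loss"):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0
)
gbc.fit(X, y)
print(gbc.score(X, y))  # mean accuracy on the training data
```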

Clustering Methods

Clustering methods are used to find similarity and relationship patterns among data samples and to group samples with similar features into clusters. The following clustering methods are available in scikit-learn's sklearn.cluster module.

| S.No | Algorithm | Parameters | Scalability | Metric Used |
|------|-----------|------------|-------------|-------------|
| 1 | K-Means | Number of clusters | Very large n_samples | Distance between points |
| 2 | Affinity Propagation | Damping | Not scalable with n_samples | Graph distance |
| 3 | Mean-Shift | Bandwidth | Not scalable with n_samples | Distance between points |
| 4 | Spectral Clustering | Number of clusters | Medium scalability with n_samples; small scalability with n_clusters | Graph distance |
| 5 | Hierarchical Clustering | Distance threshold or number of clusters | Large n_samples and large n_clusters | Distance between points |
| 6 | DBSCAN | Size of neighborhood | Very large n_samples and medium n_clusters | Nearest point distance |
| 7 | OPTICS | Minimum cluster membership | Very large n_samples and large n_clusters | Distance between points |
| 8 | BIRCH | Threshold, branching factor | Large n_samples and large n_clusters | Euclidean distance between points |
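
A minimal sketch of clustering with KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Group the samples into 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)           # cluster assignment for each sample
print(kmeans.cluster_centers_)  # coordinates of the cluster centers
```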

Learn over a dozen data science tools and skills with the PG Program in Data Science and get access to masterclasses by Purdue faculty. Enroll now and add a shining star to your data science resume!

Want to Learn More?

Scikit-learn is a very simple tool that lets you see how everything works behind the scenes. Once you have mastered it, you can move on to other advanced tools that make building models much easier. If you want to learn more, you can check out Simplilearn's Data Scientist course, created in collaboration with IBM. It features exclusive IBM hackathons, masterclasses, ask-me-anything sessions, live interaction with practitioners, practical labs, and projects. Get started with this course today and boost your career in data science.
