If you are a Python programmer looking for a robust library to build machine learning models, scikit-learn is one you will want to seriously consider. This cheat sheet gives an overview of the most common features and tools available in the scikit-learn library and can serve as your go-to scikit-learn glossary.
Scikit Learn Cheatsheet
In this scikit learn cheat sheet, you will learn about:
- Scikit Learn Tool (A robust library available in Python)
- Dataset Loading (Different ways to load data into Scikit learn before building models)
- Data Representation (Different ways to represent the data in Scikit learn)
- Estimator API (One of the main APIs implemented by Scikit-learn)
- Linear regression (To extend linear models)
- Stochastic Gradient Descent (an optimization technique to train models)
- Anomaly Detection (To identify data points that do not fit well with the rest of the data in the dataset)
- KNN Learning (K-Nearest Neighbor (KNN) - a simple machine learning algorithm)
- Boosting Methods (To build an ensemble model in an incremental way)
- Clustering Methods (To find patterns among data samples and cluster them into groups)
Scikit Learn Tool
Scikit-learn is one of the most popular and robust libraries available in Python. It provides a number of efficient tools for machine learning and statistical modeling, including regression, classification, clustering, and dimensionality reduction, all through a consistent Python interface. The library is mostly written in Python but is built upon NumPy, SciPy, and Matplotlib.
Dataset Loading
In order to build machine learning models, you need to first load your data into memory.
Packaged Datasets
The scikit-learn library comes with a number of packaged datasets. These datasets are useful for getting a handle on a given algorithm or library feature before building your own models.
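For example, the bundled iris dataset can be loaded and inspected in a few lines:

```python
# Load the packaged iris dataset and inspect its shape.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)            # (150, 4): 150 samples, 4 features
print(iris.target_names)  # the three iris species
```

Other loaders such as load_digits and load_wine follow the same pattern.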
Load from CSV
If you have a dataset as a CSV file on your local workstation or on a remote server, you can also load it directly into scikit-learn. Once the X and y variables are prepared, you can start training a machine learning model.
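A minimal sketch of loading a CSV with pandas; the file name `data.csv` and its `target` column are assumptions (the example writes a tiny CSV first so it is self-contained):

```python
import pandas as pd

# Hypothetical CSV written here only so the example runs end to end.
with open("data.csv", "w") as f:
    f.write("f1,f2,target\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n")

df = pd.read_csv("data.csv")
X = df.drop(columns=["target"]).values  # feature matrix
y = df["target"].values                 # target array
print(X.shape, y.shape)                 # (3, 2) (3,)
```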
Data Representation
Scikit-learn comes with multiple ways to represent data, such as tables, feature matrices, and target arrays.
Data as Tables
Tables are the best way to represent data in scikit-learn. A table is a 2D grid of data, with rows representing individual elements of the dataset and columns representing the quantities related to those elements.
Data as Feature Matrix
A feature matrix is a table layout stored in a variable X, assumed to be 2D with shape [n_samples, n_features]. It is most often held in a NumPy array or a Pandas DataFrame. As in a table, the rows are the individual samples in the dataset, and the features (columns) are the distinct observations that describe each sample in a quantitative manner.
Data as Target Array
A target array, or label, is denoted by the variable y. It is usually 1D with length n_samples and is most often held in a NumPy array or a Pandas Series. A target array may contain continuous numerical values or discrete class labels.
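Putting the two together, a small illustrative example (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Feature matrix X: 2D, shape [n_samples, n_features]
X = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})

# Target array y: 1D, length n_samples
y = np.array([0, 1, 1])

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```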
Estimator API
The Estimator API is one of the main APIs implemented by scikit-learn. It provides a consistent interface for a wide range of ML applications, which is why all machine learning algorithms in scikit-learn are implemented via the Estimator API. An estimator is any object that learns from data (by fitting the data), whether it implements classification, regression, or clustering, or is a transformer that extracts useful features from raw data.
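Every estimator follows the same fit/predict pattern; here is a sketch using logistic regression on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = LogisticRegression(max_iter=1000)  # the estimator object
est.fit(X_train, y_train)                # learn from the training data
pred = est.predict(X_test)               # predict on unseen data
print(est.score(X_test, y_test))         # mean accuracy
```

Swapping `LogisticRegression` for any other estimator leaves the rest of the code unchanged; that uniformity is the point of the API.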
Generalized Linear Regression
Generalized Linear Models (GLM) extend linear models in two ways. First, the predicted values ŷ are linked to a linear combination of the input variables X via an inverse link function h, so that ŷ = h(Xw). Second, the squared loss function is replaced by the unit deviance d of a distribution in the exponential family. The minimization problem is thus

min_w (1 / (2 · n_samples)) · Σᵢ d(yᵢ, ŷᵢ) + (α / 2) · ‖w‖₂²

where α is the L2 regularization penalty.
The following table lists some specific EDMs (Exponential Dispersion Models) and their unit deviance:

| Distribution | Target Domain | Unit Deviance d(y, ŷ) |
|---|---|---|
| Normal | y ∈ (−∞, ∞) | (y − ŷ)² |
| Poisson | y ∈ [0, ∞) | 2(y·log(y/ŷ) − y + ŷ) |
| Gamma | y ∈ (0, ∞) | 2(log(ŷ/y) + y/ŷ − 1) |
| Inverse Gaussian | y ∈ (0, ∞) | (y − ŷ)² / (y·ŷ²) |
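As a sketch, scikit-learn's PoissonRegressor fits a GLM with the Poisson unit deviance and a log link; the toy count data below is generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Toy count data: Poisson targets with a log-linear mean.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(X @ np.array([1.0, 0.5])))

glm = PoissonRegressor(alpha=1e-3)  # alpha is the L2 penalty above
glm.fit(X, y)
print(glm.predict(X[:3]))  # predictions are always positive (log link)
```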
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is an optimization technique that provides a way to train machine learning models. It can be applied to large-scale and sparse machine learning problems which are often found in text classification and natural language processing.
Advantages of SGD:
- Efficient
- Easy to implement (lots of opportunities for code tuning).
Disadvantages of SGD:
- Requires a number of hyperparameters
- Sensitive to feature scaling
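Because of that sensitivity to feature scaling, a common sketch is to pair SGDClassifier with StandardScaler in a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

# Scale features first, then train a linear SVM via SGD (hinge loss).
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),
)
clf.fit(X, y)
print(clf.score(X, y))
```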
Anomaly Detection
Anomaly detection is used to identify data points that do not fit well with the rest of the data in the dataset. Anomalies or outliers can be divided into three categories:
- Point anomalies − Occurs if an individual data instance is considered anomalous with respect to the rest of the data.
- Contextual anomalies − Occurs if a data instance is anomalous in a specific context.
- Collective anomalies − Occurs if a collection of related data instances is anomalous with respect to the entire dataset rather than individual values.
There are two methods used for anomaly detection - outlier detection and novelty detection:

| SNo | Method | Description |
|---|---|---|
| 1 | Outlier detection | The training data contains outliers, i.e. observations that are far from the rest. The estimator fits the regions where the training data is most concentrated, ignoring the deviant observations. |
| 2 | Novelty detection | The training data is not polluted by outliers. The estimator decides whether a new, unseen observation is an outlier (a novelty) or not. |
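A novelty-detection sketch with One-Class SVM: fit on clean data, then flag new points (+1 = inlier, −1 = outlier). The data here is synthetic, for illustration only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Clean training data clustered around the origin.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

detector = OneClassSVM(nu=0.05).fit(X_train)

X_new = np.array([
    [0.1, -0.2],  # close to the training cloud
    [8.0, 8.0],   # far away from it
])
print(detector.predict(X_new))  # +1 for inliers, -1 for outliers
```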
KNN Learning
K-Nearest Neighbor (KNN) is one of the simplest machine learning algorithms. It does not build a model during a training phase; instead, the whole training dataset is used at prediction (testing) time. The k-NN algorithm consists of two steps:
Step 1: Compute and store the k nearest neighbors for each sample in the training set.
Step 2: Retrieve the k nearest neighbors from the dataset. Among these k-nearest neighbors, predict the class through voting.
The module sklearn.neighbors provides the functionality for unsupervised and supervised KNN learning methods.
Unsupervised KNN Learning
The sklearn.neighbors.NearestNeighbors module is used to implement unsupervised KNN learning. The following parameters are used in this module:

| SNo | Parameter | Description |
|---|---|---|
| 1 | n_neighbors − int, optional | Number of neighbors to use for kneighbors queries. Default is 5. |
| 2 | radius − float, optional | Range of parameter space for radius_neighbors queries. Default is 1.0. |
| 3 | algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional | Algorithm used to compute the nearest neighbors. 'auto' attempts to pick the best option based on the training data. |
| 4 | leaf_size − int, optional | Leaf size passed to BallTree or KDTree; affects construction and query speed. Default is 30. |
| 5 | metric − string or callable | Distance metric to use for the tree. Default is 'minkowski'. |
| 6 | p − integer, optional | Parameter for the Minkowski metric: p=1 is Manhattan distance, p=2 is Euclidean. Default is 2. |
| 7 | metric_params − dict, optional | Additional keyword arguments for the metric function. |
| 8 | n_jobs − int or None, optional | Number of parallel jobs to run for the neighbors search. |
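A short sketch of querying the nearest neighbors of a new point with NearestNeighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0]])

nn = NearestNeighbors(n_neighbors=2, algorithm="auto").fit(X)
dist, idx = nn.kneighbors([[0.1, 0.1]])
print(idx)   # indices of the 2 nearest training samples
print(dist)  # their distances to the query point
```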
Supervised KNN Learning
Supervised KNN learning is used for classification (for data with discrete labels) and regression (for data with continuous labels). There are two different types of nearest neighbor classifiers used by scikit-learn:

| SNo | Classifier | Description |
|---|---|---|
| 1 | KNeighborsClassifier | Implements learning based on the k nearest neighbors of each query point, where k is a user-specified integer. |
| 2 | RadiusNeighborsClassifier | Implements learning based on the neighbors within a fixed radius r of each query point, where r is a user-specified floating-point value. |
Boosting Methods
Boosting methods help you build an ensemble model in an incremental way by training each base model estimator sequentially. The sklearn.ensemble module has two boosting methods - AdaBoost and Gradient Tree Boosting.
AdaBoost
AdaBoost is a very successful boosting ensemble method that assigns weights to the instances in the dataset, increasing the weight of incorrectly classified instances so that subsequent estimators focus on them.
Classification With AdaBoost
The sklearn.ensemble.AdaBoostClassifier is used to build a classifier in AdaBoost. The main parameter of this module is base_estimator (renamed estimator in scikit-learn 1.2), the base estimator from which the boosted ensemble is built. If this parameter is set to None, the base estimator defaults to DecisionTreeClassifier(max_depth=1).
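A minimal sketch on synthetic data, leaving the base estimator at its default (a depth-1 decision "stump"):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 sequentially trained stumps, each focusing on the
# instances the previous ones misclassified.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```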
Gradient Tree Boost
Gradient Tree Boost is a generalization of boosting to arbitrary differentiable loss functions. It can be used for any type of regression and classification problem. The main advantage of gradient tree boosting is that it can handle mixed-type data.
Classification With Gradient Tree Boost
The sklearn.ensemble.GradientBoostingClassifier is used to build the classifier in Gradient Tree Boost. The main parameter of this module is ‘loss’, which is the value of the loss function to be optimized.
- If we set this value to "deviance" (named "log_loss" in newer scikit-learn versions), it refers to deviance for classification with probabilistic outputs.
- If we set this value to "exponential", it recovers the AdaBoost algorithm.
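A classification sketch with GradientBoostingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,        # depth of the individual regression trees
    random_state=0,
)
gbc.fit(X, y)
print(gbc.score(X, y))
```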
Clustering Methods
Clustering methods are used to find similarities and relationship patterns among data samples and to group them into clusters based on feature similarity. The following clustering methods are available in the sklearn.cluster module of scikit-learn.
| SNo | Algorithm | Parameters | Scalability | Metric Used |
|---|---|---|---|---|
| 1 | K-Means | No. of clusters | Very large n_samples | Distance between points |
| 2 | Affinity Propagation | Damping | Not scalable with n_samples | Graph distance |
| 3 | Mean-Shift | Bandwidth | Not scalable with n_samples | Distance between points |
| 4 | Spectral Clustering | No. of clusters | Medium n_samples, small n_clusters | Graph distance |
| 5 | Hierarchical Clustering | Distance threshold or no. of clusters | Large n_samples, large n_clusters | Distance between points |
| 6 | DBSCAN | Size of neighborhood | Very large n_samples, medium n_clusters | Nearest point distance |
| 7 | OPTICS | Minimum cluster membership | Very large n_samples, large n_clusters | Distance between points |
| 8 | BIRCH | Threshold, branching factor | Large n_samples, large n_clusters | Euclidean distance between points |
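As a sketch, K-Means on synthetic blob data, asking for the three clusters we know are there:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs of points.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(np.unique(labels))  # three cluster ids: [0 1 2]
```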
Want to Learn More?
Scikit-learn is a very simple tool that lets you see how everything works behind the scenes. Once you have mastered it, you can move on to other advanced tools that make building models much easier. If you want to learn more, you can check out Simplilearn’s Data Scientist course created in collaboration with IBM. It features exclusive IBM hackathons, masterclasses, ask-me-anything sessions, live interaction with practitioners, practical labs, and projects. Get started with this course today and boost your career in data science.