What is Hierarchical Clustering and How Does It Work?

It’s difficult to comprehend the amount of data that is generated daily. In fact, we create 2.5 quintillion bytes of data each day. Analyzing that data is a challenge and not just because of the quantity; the data also comes from many sources, in many forms, and is delivered at rapid speeds.

Data analysts are responsible for organizing these massive amounts of data into meaningful patterns—interpreting it to find meaning in a language only those versed in data science can understand. These analysts rely on tools to help make their jobs easier in the face of overwhelming bits of information.

Enter clustering: one of the most common methods of unsupervised learning, a type of machine learning using unknown or unlabeled data.

Looking forward to a career in Data Science? Check out the Data Science Bootcamp now.

What is Clustering?

To understand what clustering is, let’s begin with an applicable example. Let’s say you want to travel to 20 places over a period of four days. How can you visit them all? We can come to a solution using clustering, and grouping the places into four sets (or clusters).

To determine these clusters, places that are nearest to one another are grouped together. The result is four clusters based on proximity, allowing you to visit all 20 places within your allotted four-day period.

Clustering is the method of dividing objects into sets that are similar, and dissimilar to the objects belonging to another set. There are two different types of clustering, each divisible into two subsets

Hierarchical clustering

Agglomerative
Divisive
Partial clustering

K-means
Fuzzy c-means

Every kind of clustering has its own purpose and numerous use cases.

Customer Segmentation

In customer segmentation, clustering can help answer the questions:

What people belong to together?
How do we group them together?

Social Network Analysis

User personas are a good use of clustering for social networking analysis. We can look for similarities between people and group them accordingly.

City Planning

Clustering is popular in the realm of city planning. Planners need to check that an industrial zone isn’t near a residential area, or that a commercial zone somehow wound up in the middle of an industrial zone.

However, in this article, we’ll focus on hierarchical clustering.

An Example of Hierarchical Clustering

Hierarchical clustering is separating data into groups based on some measure of similarity, finding a way to measure how they’re alike and different, and further narrowing down the data.

Let's consider that we have a set of cars and we want to group similar ones together. Look at the image shown below:

SUVs

For starters, we have four cars that we can put into two clusters of car types: sedan and SUV. Next, we'll bunch the sedans and the SUVs together. For the last step, we can group everything into one cluster and finish when we’re left with only one cluster.

Types of Hierarchical Clustering

Hierarchical clustering is divided into:

Agglomerative
Divisive

Divisive Clustering

Divisive clustering is known as the top-down approach. We take a large cluster and start dividing it into two, three, four, or more clusters.

Agglomerative Clustering

Agglomerative clustering is known as a bottom-up approach. Consider it as bringing things together.

Both of these approaches are as shown below:

divisive

Next, let us discuss how hierarchical clustering works.

How Does Hierarchical Clustering Work?

Let's consider that we have a few points on a 2D plane with x-y coordinates.

y-values

Here, each data point is a cluster of its own. We want to determine a way to compute the distance between each of these points. For this, we try to find the shortest distance between any two data points to form a cluster.

Once we find those with the least distance between them, we start grouping them together and forming clusters of multiple points.

y-values-2

This is represented in a tree-like structure called a dendrogram.

dendogram

As a result, we have three groups: P1-P2, P3-P4, and P5-P6. Similarly, we have three dendrograms, as shown below:

3-dendograms

In the next step, we bring two groups together. Now the two groups P3-P4 and P5-P6 are all under one dendrogram because they're closer together than the P1-P2 group. This is as shown below:

yvalue-p3-p6

We finish when we’re left with one cluster and finally bring everything together.

y-value-01

You can see how the cluster on the right went to the top with the gray hierarchical box connecting them.

The next question is: How do we measure the distance between the data points? The next section of the Hierarchical clustering article answers this question.

Learn data analysis, data visualization, machine learning, deep learning, SQL, R, and Python with the Data Science Course with Placement Guarantee. Check out the course now!

What is the Distance Measure?

Distance measure determines the similarity between two elements and it influences the shape of the clusters.

Some of the ways we can calculate distance measures include:

Euclidean distance measure
Squared Euclidean distance measure
Manhattan distance measure
Cosine distance measure

Euclidean Distance Measure

The most common method to calculate distance measures is to determine the distance between the two points. Let’s say we have a point P and point Q: the Euclidean distance is the direct straight-line distance between the two points.

The formula for distance between two points is shown below:

euclidian-distance

As this is the sum of more than two dimensions, we calculate the distance between each of the different dimensions squared and then take the square root of that to get the actual distance between them.

Squared Euclidean Distance Measurement

This is identical to the Euclidean measurement method, except we don't take the square root at the end. The formula is shown below:

squared-distance

Depending on whether the points are farther apart or closer together, then the difference in distances can be computed faster by using squared Euclidean distance measurement.

While this method gives us the exact distance, it won't make a difference when calculating which is smaller and which is larger. Removing the square root can make the computation faster.

Manhattan Distance Measurement

This method is a simple sum of horizontal and vertical components or the distance between two points measured along axes at right angles.

The formula is shown below:

manhattan-distance

This method is different because you're not looking at the direct line, and in certain cases, the individual distances measured will give you a better result.

Most of the time, you’ll go with the Euclidean squared method because it's faster. But when using the Manhattan distance, you measure either the X difference or the Y difference and take the absolute value of it.

Cosine Distance Measure

The cosine distance similarity measures the angle between the two vectors. The formula is:

cosine-distance

As the two vectors separate, the cosine distance becomes greater. This method is similar to the Euclidean distance measure, and you can expect to get similar results with both of them.

Note that the Manhattan measurement method will produce a very different result. You can end up with bias if your data is very skewed or if both sets of values have a dramatic size difference.

Let us now take a detailed look at the types of hierarchical clustering, starting with agglomerative clustering.

What is Agglomerative Clustering?

Agglomerate clustering begins with each element as a separate cluster and merges them into larger clusters.

There are three key questions need to be answered:

How do we represent a cluster that has more than one point?
How do we determine the nearness of clusters?
When do we stop combining clusters?

Let's assume that we have six data points in a Euclidean space. We're dealing with X-Y dimensions in such a case.

x-y-distance

How do we represent a cluster of more than one point?
Here, we will make use of centroids, which is the average of its points.
Let’s first take the points 1.2 and 2.1, and we’ll group them together because they're close. For these points, we compute a point in the middle and mark it as (1.5,1.5). It’s the centroid of those two points.

centroid-2points

Next, we measure the other group of points by taking 4.1 and 5.0. We set up a centroid of those two points as (4.5,0.5).

centroid-points

Once we have the centroid of the two groups, we see that the next closest point to a centroid (1.5, 1.5) is (0,0) and group them, computing a new centroid based on those three points. The new centroid will be (1,1).

new-centroid

We do the same with the last point (5,3), and it computes into the first group. You can see that the dendrogram on the right is growing. Now each of these points is connected. We group them, and finally, we get a centroid of that group, too, at (4.7,1.3).

centroid-group

Finally, we combine the two groups by their centroids and end up with one large group that has its centroid. Usually, we don't compute the last centroid; we just put them all together.

last-centroid.

Now that we’ve resolved the matter of representing clusters and determining their nearness, when do we stop combining clusters?

There are many approaches you can take:

Approach 1: Pick several clusters(k) upfront

When we don't want to look at 200 clusters, we pick the K value. We decide the number of clusters (say, the first six or seven) required in the beginning, and we finish when we reach the value K. This is done to limit the incoming information.

That can be very important, especially if you're feeding it into another algorithm that requires three or four values.

Possible challenges: This approach only makes sense when you know the data well. When you're clustering with K clusters, you probably already know that domain. But if you're exploring brand new data, you may not know how many clusters you need.

Approach 2: Stop when the next merge would create a cluster with low cohesion.

We keep clustering until the next merge of clusters creates a bad cluster/low cohesion setup. That means the point is so close to being in both the clusters that it doesn't make sense to bring them together.

Approach 3.1: Diameter of a cluster

Diameter is the maximum distance between any pair of points in the cluster. We finish when the diameter of a new cluster exceeds the threshold. We don't want the two circles or clusters to overlap as that diameter increases.

Approach 3.2: Radius of a cluster

Radius is the maximum distance of a point from the centroid. We finish when the radius of a new cluster exceeds the threshold.

Let us now discuss another type of hierarchical clustering i.e. divisive clustering.

What is Divisive Clustering?

The divisive clustering approach begins with a whole set composed of all the data points and divides it into smaller clusters. This can be done using a monothetic divisive method.

But what is a monothetic divisive method?

Let's try to understand it by using the example from the agglomerative clustering section above. We consider a space with six points in it as we did before.

agglomerative-clustering.

We name each point in the cluster as ABCDEF.
Here, we obtain all possible splits into two clusters, as shown.

abcdef-clustering

For each split, we can compute cluster sum of squares as shown:

cluster-sum

Next, we select the cluster with the largest sum of squares. Let's assume that the sum of squared distance is the largest for the third split ABCDEF. We split the ABC out, and we're left with the DEF on the other side. We again find this sum of squared distances and split it into clusters, as shown.

split-cluster

You can see the hierarchical dendrogram coming down as we start splitting everything apart. It continues to divide until every data point has its node or until we get to K (if we have set a K value).

How skilled are you with the concepts of Machine learning? Try asnwering these Machine Learning MCQs and find out now!

Applying Hierarchical Clustering: Hands-On Demonstration

Problem statement: A U.S. oil organization needs to know its sales in various states in the United States and cluster them based on their sales.

The steps we take are:

Import the dataset
Create a scatter plot
Normalize the data
Calculate Euclidean distance
Create a dendrogram

Explore the Concepts of Machine Learning

Are you thinking about the next step after learning about hierarchical clustering? Since there are so many other important aspects to be covered while trying to understand machine learning, we suggest you in the Simplilearn AI and ML Certification Course, Machine Learning Course.

The course covers all the machine learning concepts, from supervised learning to modeling and developing algorithms. In our course, you’ll learn the skills needed to become a machine learning engineer and unlock the power of this emerging field. Start your machine learning journey today!

What is Hierarchical Clustering and How Does It Work

Table of Contents