What is Hierarchical Clustering and How Does It Work

It’s difficult to comprehend the amount of data that is generated daily. In fact, we create 2.5 quintillion bytes of data each day. Analyzing that data is a challenge and not just because of the quantity; the data also comes from many sources, in many forms, and is delivered at rapid speeds.

Data analysts are responsible for organizing these massive amounts of data into meaningful patterns—interpreting it to find meaning in a language only those versed in data science can understand. These analysts rely on tools to help make their jobs easier in the face of overwhelming bits of information. 

Enter clustering: one of the most common methods of unsupervised learning, a type of machine learning using unknown or unlabeled data. 

In this Hierarchical clustering articleHere, we’ll explore the important details of clustering, including:

  • What is clustering?
  • What is hierarchical clustering?
  • How does hierarchical clustering work? 
  • What is the distance measure?
  • What is agglomerative clustering?
  • What is divisive clustering?
  • Application with a clustering demo
Looking forward to a career in Data Science? Check out the Data Science Certification course now.

What is Clustering?

To understand what clustering is, let’s begin with an applicable example. Let’s say you want to travel to 20 places over a period of four days. How can you visit them all? We can come to a solution using clustering, and grouping the places into four sets (or clusters). 

To determine these clusters, places that are nearest to one another are grouped together. The result is four clusters based on proximity, allowing you to visit all 20 places within your allotted four-day period. 

Clustering is the method of dividing objects into sets that are similar, and dissimilar to the objects belonging to another set. There are two different types of clustering, each divisible into two subsets

  1. Hierarchical clustering
    • Agglomerative 
    • Divisive 
  2. Partial clustering 
    • K-means 
    • Fuzzy c-means

Every kind of clustering has its own purpose and numerous use cases. 

Customer Segmentation

In customer segmentation, clustering can help answer the questions:

  • What people belong to together? 
  • How do we group them together?

Social Network Analysis 

User personas are a good use of clustering for social networking analysis. We can look for similarities between people and group them accordingly. 

City Planning 

Clustering is popular in the realm of city planning. Planners need to check that an industrial zone isn’t near a residential area, or that a commercial zone somehow wound up in the middle of an industrial zone. 

However, in this article, we’ll focus on hierarchical clustering. 

Data Science with R Programming

Explore Course
Data Science with R Programming

An Example of Hierarchical Clustering 

Hierarchical clustering is separating data into groups based on some measure of similarity, finding a way to measure how they’re alike and different, and further narrowing down the data. 

Let's consider that we have a set of cars and we want to group similar ones together. Look at the image shown below:


For starters, we have four cars that we can put into two clusters of car types: sedan and SUV. Next, we'll bunch the sedans and the SUVs together. For the last step, we can group everything into one cluster and finish when we’re left with only one cluster. 

Types of Hierarchical Clustering 

Hierarchical clustering is divided into:

  • Agglomerative 
  • Divisive 

Divisive Clustering

Divisive clustering is known as the top-down approach. We take a large cluster and start dividing it into two, three, four, or more clusters.

Agglomerative Clustering

Agglomerative clustering is known as a bottom-up approach. Consider it as bringing things together.

Both of these approaches are as shown below:


Next, let us discuss how hierarchical clustering works.

How Does Hierarchical Clustering Work?

Let's consider that we have a few points on a 2D plane with x-y coordinates.


Here, each data point is a cluster of its own. We want to determine a way to compute the distance between each of these points. For this, we try to find the shortest distance between any two data points to form a cluster.

Once we find those with the least distance between them, we start grouping them together and forming clusters of multiple points. 


This is represented in a tree-like structure called a dendrogram. 


As a result, we have three groups: P1-P2, P3-P4, and P5-P6. Similarly, we have three dendrograms, as shown below:


In the next step, we bring two groups together. Now the two groups P3-P4 and P5-P6 are all under one dendrogram because they're closer together than the P1-P2 group. This is as shown below:


We finish when we’re left with one cluster and finally bring everything together.


You can see how the cluster on the right went to the top with the gray hierarchical box connecting them. 

The next question is: How do we measure the distance between the data points? The next section of the Hierarchical clustering article answers this question.

Distance Measure

Distance measure determines the similarity between two elements and it influences the shape of the clusters.

Some of the ways we can calculate distance measures include:

  • Euclidean distance measure 
  • Squared Euclidean distance measure
  • Manhattan distance measure 
  • Cosine distance measure 

Euclidean Distance Measure 

The most common method to calculate distance measures is to determine the distance between the two points. Let’s say we have a point P and point Q: the Euclidean distance is the direct straight-line distance between the two points.

The formula for distance between two points is shown below:


As this is the sum of more than two dimensions, we calculate the distance between each of the different dimensions squared and then take the square root of that to get the actual distance between them. 

Squared Euclidean Distance Measurement

This is identical to the Euclidean measurement method, except we don't take the square root at the end. The formula is shown below:


Depending on whether the points are farther apart or closer together, then the difference in distances can be computed faster by using squared Euclidean distance measurement.

While this method gives us the exact distance, it won't make a difference when calculating which is smaller and which is larger. Removing the square root can make the computation faster. 

Manhattan Distance Measurement 

This method is a simple sum of horizontal and vertical components or the distance between two points measured along axes at right angles. 

The formula is shown below: 


This method is different because you're not looking at the direct line, and in certain cases, the individual distances measured will give you a better result.

Most of the time, you’ll go with the Euclidean squared method because it's faster. But when using the Manhattan distance, you measure either the X difference or the Y difference and take the absolute value of it.

Cosine Distance Measure

The cosine distance similarity measures the angle between the two vectors. The formula is:


As the two vectors separate, the cosine distance becomes greater. This method is similar to the Euclidean distance measure, and you can expect to get similar results with both of them.

Note that the Manhattan measurement method will produce a very different result. You can end up with bias if your data is very skewed or if both sets of values have a dramatic size difference.

Let us now take a detailed look at the types of hierarchical clustering, starting with agglomerative clustering.

What is Agglomerative Clustering?

Agglomerate clustering begins with each element as a separate cluster and merges them into larger clusters. 

There are three key questions need to be answered:

  1. How do we represent a cluster that has more than one point?
  2. How do we determine the nearness of clusters?
  3. When do we stop combining clusters? 

Let's assume that we have six data points in a Euclidean space. We're dealing with X-Y dimensions in such a case.


How do we represent a cluster of more than one point?
Here, we will make use of centroids, which is the average of its points. 
Let’s first take the points 1.2 and 2.1, and we’ll group them together because they're close. For these points, we compute a point in the middle and mark it as (1.5,1.5). It’s the centroid of those two points. 


Next, we measure the other group of points by taking 4.1 and 5.0. We set up a centroid of those two points as (4.5,0.5).


Once we have the centroid of the two groups, we see that the next closest point to a centroid (1.5, 1.5) is (0,0) and group them, computing a new centroid based on those three points. The new centroid will be (1,1). 


We do the same with the last point (5,3), and it computes into the first group. You can see that the dendrogram on the right is growing. Now each of these points is connected. We group them, and finally, we get a centroid of that group, too, at (4.7,1.3). 


Finally, we combine the two groups by their centroids and end up with one large group that has its centroid. Usually, we don't compute the last centroid; we just put them all together.


Now that we’ve resolved the matter of representing clusters and determining their nearness, when do we stop combining clusters?

There are many approaches you can take:

Approach 1: Pick several clusters(k) upfront

When we don't want to look at 200 clusters, we pick the K value. We decide the number of clusters (say, the first six or seven) required in the beginning, and we finish when we reach the value K. This is done to limit the incoming information.

That can be very important, especially if you're feeding it into another algorithm that requires three or four values.

Possible challenges: This approach only makes sense when you know the data well. When you're clustering with K clusters, you probably already know that domain. But if you're exploring brand new data, you may not know how many clusters you need.

Approach 2: Stop when the next merge would create a cluster with low cohesion.

We keep clustering until the next merge of clusters creates a bad cluster/low cohesion setup. That means the point is so close to being in both the clusters that it doesn't make sense to bring them together.

Approach 3.1: Diameter of a cluster 

Diameter is the maximum distance between any pair of points in the cluster. We finish when the diameter of a new cluster exceeds the threshold. We don't want the two circles or clusters to overlap as that diameter increases.

Approach 3.2: Radius of a cluster 

Radius is the maximum distance of a point from the centroid. We finish when the radius of a new cluster exceeds the threshold.

Let us now discuss another type of hierarchical clustering i.e. divisive clustering.

Divisive Clustering 

The divisive clustering approach begins with a whole set composed of all the data points and divides it into smaller clusters. This can be done using a monothetic divisive method.

But what is a monothetic divisive method?

Let's try to understand it by using the example from the agglomerative clustering section above. We consider a space with six points in it as we did before. 


We name each point in the cluster as ABCDEF.
Here, we obtain all possible splits into two clusters, as shown. 


For each split, we can compute cluster sum of squares as shown:


Next, we select the cluster with the largest sum of squares. Let's assume that the sum of squared distance is the largest for the third split ABCDEF. We split the ABC out, and we're left with the DEF on the other side. We again find this sum of squared distances and split it into clusters, as shown. 


You can see the hierarchical dendrogram coming down as we start splitting everything apart. It continues to divide until every data point has its node or until we get to K (if we have set a K value). 

How skilled are you with the concepts of Machine learning? Try asnwering these Machine Learning MCQs and find out now!

Applying Hierarchical Clustering: Hands-On Demonstration 

Problem statement: A U.S. oil organization needs to know its sales in various states in the United States and cluster them based on their sales. 

The steps we take are:

  1. Import the dataset 
  2. Create a scatter plot 
  3. Normalize the data
  4. Calculate Euclidean distance
  5. Create a dendrogram

Explore the Concepts of Machine Learning

Are you thinking about the next step after learning about hierarchical clustering? Since there are so many other important aspects to be covered while trying to understand machine learning, we suggest you in the Simplilearn Machine Learning Certification Course

The course covers all the machine learning concepts, from supervised learning to modeling and developing algorithms. In our course, you’ll learn the skills needed to become a machine learning engineer and unlock the power of this emerging field. Start your machine learning journey today!

About the Author


Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.

View More
  • Disclaimer
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.