Every machine learning engineer wants to achieve accurate predictions with their algorithms. Learning algorithms are generally divided into two types: supervised and unsupervised. K-means clustering is one of the unsupervised algorithms, used when the available input data does not have a labeled response.
What is Clustering?
Clustering is like sorting a bunch of similar items into different groups based on their characteristics. In data mining and machine learning, it’s a powerful technique used to group similar data points together, making it easier to find patterns or understand large datasets. Essentially, clustering helps identify natural groupings in your data. There are two common types of clustering methods:
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.
The various types of clustering are:
 Hierarchical clustering
 Partitioning clustering
Hierarchical clustering is further subdivided into:
 Agglomerative clustering
 Divisive clustering
Partitioning clustering is further subdivided into:
 K-means clustering
 Fuzzy C-means clustering
Hierarchical Clustering
Hierarchical clustering uses a tree-like structure, like so:
Agglomerative clustering is a bottom-up approach: we begin with each element as a separate cluster and merge them into successively larger clusters, as shown below:
Divisive clustering is a top-down approach: we begin with the whole set and proceed to divide it into successively smaller clusters, as you can see below:
Partitioning Clustering
Partitioning clustering is split into two subtypes: K-means clustering and Fuzzy C-means clustering.
In K-means clustering, the objects are divided into a number of clusters specified by the number 'K.' So if we say K = 2, the objects are divided into two clusters, c1 and c2, as shown:
Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together.
Fuzzy C-means is very similar to K-means in the sense that it clusters objects with similar characteristics together. In K-means clustering, a single object cannot belong to two different clusters; in C-means, objects can belong to more than one cluster, as shown.
What is K-Means Clustering?
K-means clustering is a way of grouping data based on how similar or close the data points are to each other. Imagine you have a bunch of points, and you want to group them into clusters. The algorithm works by first randomly picking some central points (called centroids) and then assigning every data point to the nearest centroid.
Once that’s done, it recalculates the centroids based on the new groupings and repeats the process until the clusters make sense. It’s a pretty fast and efficient method, but it works best when the clusters are distinct and not too mixed up. One challenge, though, is figuring out the right number of clusters (K) beforehand. Plus, if there’s a lot of noise or overlap in the data, K-means might not perform as well.
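The assign-recompute loop described above can be sketched with scikit-learn (a minimal illustration, assuming scikit-learn is available; the blob data here is synthetic, standing in for a real dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-means: pick initial centroids, assign each point to the nearest one,
# recompute the centroids, and repeat until the assignments stop changing
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)
```

Here, `fit_predict` runs the full loop and returns the cluster label assigned to each point; the final centroids are available as `model.cluster_centers_`.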
Objective of K-Means Clustering
K-means clustering primarily aims to organize similar data points into distinct groups. Here’s a look at its key objectives:

Grouping Similar Data Points
K-means is designed to cluster data points that share common traits, allowing patterns or trends to emerge. Whether analyzing customer behavior or images, the method helps reveal hidden relationships within your dataset.

Minimizing Within-Cluster Distance
Another objective is to keep data points in each group as close to the cluster's centroid as possible. Reducing this internal distance results in compact, cohesive clusters, enhancing the accuracy of your results.

Maximizing Between-Cluster Distance
K-means also aims to maintain clear separation between different clusters. By maximizing the distance between groups, the algorithm ensures that each cluster remains distinct, providing a better understanding of data categories without overlap.
Properties of K-Means Clustering
Now, let’s look at the key properties that make the K-means clustering algorithm effective in creating meaningful groups:

Similarity Within a Cluster
One of the main things K-means aims for is that all the data points in a cluster should be pretty similar to each other. Imagine a bank that wants to group its customers based on income and debt. If customers within the same cluster have vastly different financial situations, then a one-size-fits-all approach to offers might not work. For example, a customer with high income and high debt might have different needs compared to someone with low income and low debt. By making sure the customers in each cluster are similar, the bank can create more tailored and effective strategies.

Differences Between Clusters
Another important aspect is that the clusters themselves should be as distinct from each other as possible. Going back to our bank example, if one cluster consists of high-income, high-debt customers and another cluster has high-income, low-debt customers, the differences between the clusters are clear. This separation helps the bank create different strategies for each group. If the clusters are too similar, it can be challenging to treat them as separate segments, which can make targeted marketing less effective.
Applications of K-Means Clustering
Here are some interesting ways K-means clustering is put to work across different fields:

Distance Measures
At the heart of K-means clustering is the concept of distance. Euclidean distance, for example, is a simple straight-line measurement between points and is commonly used in many applications. Manhattan distance, however, follows a grid-like path, much like how you'd navigate city streets. Squared Euclidean distance makes calculations easier by skipping the square root, while cosine distance is handy when working with text data because it measures the angle between data vectors. Picking the right distance measure really depends on what kind of problem you’re solving and the nature of your data.
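As an illustration, the four measures mentioned above can be computed directly with NumPy (a sketch on two hypothetical vectors):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: straight-line distance between the points
euclidean = np.sqrt(np.sum((a - b) ** 2))   # 5.0

# Manhattan: grid-like path, sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))           # 7.0

# Squared Euclidean: same comparison, without the square root
sq_euclidean = np.sum((a - b) ** 2)         # 25.0

# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that squared Euclidean distance preserves the ordering of distances, which is why K-means implementations can skip the square root entirely.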

K-Means for Geyser Eruptions
K-means clustering has even been applied to studying the eruptions of the Old Faithful geyser in Yellowstone. The data collected includes eruption duration and the waiting time between eruptions. By clustering this information, researchers can uncover patterns that help predict the geyser’s behavior. For instance, you might find clusters of similar eruption durations and intervals, which could improve predictions for future eruptions.

Customer Segmentation
One of the most popular uses of K-means clustering is customer segmentation. From banks to e-commerce, businesses use K-means clustering to group customers based on their behaviors. For example, in the telecom or sports industries, companies can create targeted marketing campaigns by understanding different customer segments better. This allows for personalized offers and communications, boosting customer engagement and satisfaction.

Document Clustering
When dealing with a vast collection of documents, K-means can be a lifesaver. It groups similar documents together based on their content, which makes it easier to manage and retrieve relevant information. For instance, if you have thousands of research papers, clustering can quickly help you find related studies, improving both organization and efficiency in accessing valuable information.

Image Segmentation
In image processing, K-means clustering is commonly used to group pixels with similar colors, which divides the image into distinct regions. This is incredibly helpful for tasks like object detection and image enhancement. For instance, clustering can help separate objects within an image, making analysis and processing more accurate. It’s also widely used to extract meaningful features from images in various visual tasks.

Recommendation Engines
K-means clustering also plays a vital role in recommendation systems. Say you want to suggest new songs to a listener based on their past preferences; clustering can group similar songs together, helping the system provide personalized suggestions. By clustering content that shares similar features, recommendation engines can deliver a more tailored experience, helping users discover new songs that match their taste.

K-Means for Image Compression
K-means can even help with image compression by reducing the number of colors in an image while keeping the visual quality largely intact. By clustering similar colors and replacing each pixel with the average color of its cluster, K-means reduces the image size without losing much detail. It’s a practical method for compressing images for easier storage and transmission, all while maintaining visual clarity.
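As a rough sketch of this color-quantization idea (using a random array as a stand-in for a real image, which would normally be loaded from a file, and scikit-learn's KMeans):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in "image": random RGB pixels; a real image would be loaded with,
# e.g., matplotlib.image.imread
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(40, 40, 3)).astype(float)

# Treat each pixel as a 3-D point and cluster into 8 representative colors
pixels = image.reshape(-1, 3)
model = KMeans(n_clusters=8, n_init=4, random_state=0).fit(pixels)

# Replace every pixel with the average color of its cluster
compressed = model.cluster_centers_[model.labels_].reshape(image.shape)

# Same shape as the original, but at most 8 distinct colors remain
n_colors = len(np.unique(compressed.reshape(-1, 3), axis=0))
```

Storing 8 palette colors plus a 3-bit index per pixel takes far less space than 24 bits per pixel, which is where the compression comes from.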
Advantages of K-Means
 Simple and easy to implement: The K-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
 Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
 Scalability: K-means can handle large datasets with many data points and can be easily scaled to handle even larger datasets.
 Flexibility: K-means can be easily adapted to different applications and can be used with varying distance metrics and initialization methods.
Disadvantages of K-Means
 Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
 Requires specifying the number of clusters: The number of clusters K needs to be specified before running the algorithm, which can be challenging in some applications.
 Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.
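The sensitivity to initial centroids is commonly mitigated in practice by smarter seeding and multiple restarts; here is a hedged sketch using scikit-learn's `init` and `n_init` parameters on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

# A single run from purely random centroids can land in a poor local optimum
single_run = KMeans(n_clusters=4, init='random', n_init=1,
                    random_state=0).fit(X)

# k-means++ seeding plus 10 restarts keeps the best (lowest-inertia) result
multi_run = KMeans(n_clusters=4, init='k-means++', n_init=10,
                   random_state=0).fit(X)
```

Comparing `single_run.inertia_` with `multi_run.inertia_` shows how much the initialization strategy can matter on a given dataset.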
Different Evaluation Metrics for Clustering
When it comes to evaluating how well your clustering algorithm is working, there are a few key metrics that can help you get a clearer picture of your results. Here’s a rundown of the most useful ones:

Silhouette Analysis
Silhouette analysis is like a report card for your clusters. It measures how well each data point fits into its own cluster compared to other clusters. A high silhouette score means that your points are snugly fitting into their clusters and are quite distinct from points in other clusters. Imagine a score close to 1 as a sign that your clusters are well-defined and separated. Conversely, a score close to 0 indicates some overlap, and a negative score suggests that the clustering might need some work.

Inertia
Inertia is a bit like a gauge of how tightly packed your data points are within each cluster. It calculates the sum of squared distances from each point to the cluster's center (or centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia means that points are closer to the centroid and to each other, which generally indicates that your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your data includes categorical features, Manhattan distance might be better.

Dunn Index
The Dunn Index takes a broader view by considering both the distance within and between clusters. It’s calculated as the ratio of the smallest distance between any two clusters (inter-cluster distance) to the largest distance within a cluster (intra-cluster distance). A higher Dunn Index means that clusters are not only tight and cohesive internally but also well-separated from each other. In other words, you want your clusters to be as far apart as possible while being as compact as possible.
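A sketch of these three metrics on synthetic data (silhouette and inertia come from scikit-learn; the Dunn index is computed by hand below, since scikit-learn does not provide it):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=1)
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
labels = model.labels_

# Silhouette: close to 1 means tight, well-separated clusters
sil = silhouette_score(X, labels)

# Inertia: sum of squared distances to the nearest centroid (lower = tighter)
inertia = model.inertia_

def dunn_index(X, labels):
    """Smallest inter-cluster distance divided by the largest
    intra-cluster diameter; higher is better."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # minimum distance between points belonging to different clusters
    inter = min(
        np.min(np.linalg.norm(a[:, None] - b[None, :], axis=2))
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    # maximum distance between points within the same cluster
    intra = max(
        np.max(np.linalg.norm(c[:, None] - c[None, :], axis=2))
        for c in clusters
    )
    return inter / intra

dunn = dunn_index(X, labels)
```

On well-separated blobs like these, the silhouette score lands close to 1 and the Dunn index is comfortably positive; on overlapping data, both shrink.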
How Does K-Means Clustering Work?
The flowchart below shows how K-means clustering works:
The goal of the K-means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use trial and error by specifying a value of K (e.g., 3, 4, 5) and changing it as we progress until we get the best clusters.
Another method is to use the elbow technique to determine the value of K. Once we have the value of K, the system assigns that many centroids randomly and measures the distance of each data point from these centroids. Each point is then assigned to the centroid from which its distance is minimum, so every data point belongs to the centroid closest to it. This gives us K initial clusters.
The algorithm then calculates a new centroid position for each newly formed cluster; each centroid moves compared to its randomly allocated position.
Once again, the distance of each point is measured from the new centroids. If required, the data points are relocated to the new centroids, and the mean position, or new centroid, is calculated once again.
If the centroids move, the iteration continues, indicating that convergence has not been reached. Once the centroids stop moving (which means the clustering process has converged), the final result is reflected.
Let's use a visualization example to understand this better.
We have a data set for a grocery shop, and we want to find out how many clusters this has to be spread across. To find the optimum number of clusters, we break it down into the following steps:
Step 1:
The elbow method is the best way to find the number of clusters. It involves running K-means clustering on the dataset for a range of K values.
Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters for a given dataset. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
The WSS is measured for each value of K. Because WSS keeps decreasing as K grows, the optimum value is not simply the K with the least WSS; rather, it is the point where the decrease in WSS begins to level off.
Now, we draw a curve between WSS and the number of clusters.
Here, WSS is on the y-axis and the number of clusters is on the x-axis.
You can see that there is a very gradual change in the value of WSS as the K value increases from 2.
So, you can take the elbow point as the optimal value of K; it should be two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS, which stabilizes.
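The elbow curve described above can be reproduced numerically on synthetic data (a sketch assuming scikit-learn; `inertia_` is scikit-learn's name for the WSS):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

# WSS for each candidate K; the curve typically drops sharply until K
# reaches the true cluster count, then flattens out
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)]
```

Plotting `wss` against `range(1, 7)` would show the sharp drop up to K = 3 (the elbow) followed by only marginal improvement.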
Step 2:
Let's assume that these are our delivery points:
We can randomly initialize two points called the cluster centroids.
Here, C1 and C2 are the centroids assigned randomly.
Step 3:
Now the distance of each location from the centroid is measured, and each data point is assigned to the centroid, which is closest to it.
This is how the initial grouping is done:
Step 4:
Compute the actual centroid of data points for the first group.
Step 5:
Reposition the random centroid to the actual centroid.
Step 6:
Compute the actual centroid of data points for the second group.
Step 7:
Reposition the random centroid to the actual centroid.
Step 8:
Once the clusters become static, the K-means algorithm is said to have converged.
The final cluster with centroids c1 and c2 is as shown below:
K-Means Clustering Algorithm
Let's say we have x1, x2, x3, ..., x(n) as our inputs, and we want to split this into K clusters.
The steps to form clusters are:
Step 1: Choose K random points as cluster centers called centroids.
Step 2: Assign each x(i) to the closest cluster by computing the Euclidean distance from x(i) to each centroid.
Step 3: Identify new centroids by taking the average of the assigned points.
Step 4: Keep repeating steps 2 and 3 until convergence is achieved.
Let's take a detailed look at each of these steps.
Step 1:
We randomly pick K points as centroids. We name them c1, c2, ..., cK, and we can say that
C = {c1, c2, ..., cK}
where C is the set of all centroids.
Step 2:
We assign each data point to its nearest centroid, which is accomplished by calculating the Euclidean distance:
arg min_(ci ∈ C) dist(ci, x)²
where dist() is the Euclidean distance.
Here, we calculate each x value's distance from each c value, i.e., the distance between x1 and c1, x1 and c2, x1 and c3, and so on. Then we find the lowest value and assign x1 to that particular centroid.
Similarly, we find the minimum distance for x2, x3, etc.
Step 3:
We identify the actual centroid by taking the average of all the points assigned to that cluster:
ci = (1 / |Si|) ∑_(xi ∈ Si) xi
where Si is the set of all points assigned to the ith cluster.
It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups.
Step 4:
Keep repeating steps 2 and 3 until convergence is achieved.
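Steps 1 through 4 can be written out from scratch with NumPy (a minimal teaching sketch, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Tiny example: two obvious groups on a line
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])
centroids, labels = kmeans(X, k=2)
```

On this toy input the loop converges in a few iterations, with the two centroids settling near the means of the two groups.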
How to Choose the Value of K (the Number of Clusters) in K-Means Clustering?
Although many choices are available for choosing the optimal number of clusters, the elbow method is one of the most popular and appropriate methods. The elbow method uses the idea of the WCSS value, which is short for Within-Cluster Sum of Squares. WCSS measures the total variation within a cluster. This is the formula used to calculate the value of WCSS (for three clusters), provided courtesy of Javatpoint:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
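To illustrate, the formula above can be computed by hand and compared against scikit-learn's `inertia_` attribute (on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=3)
model = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X)

# WCSS by hand: for each cluster j, sum the squared distances from its
# points Pi to its centroid Cj, then add the per-cluster sums together
wcss = sum(
    np.sum((X[model.labels_ == j] - center) ** 2)
    for j, center in enumerate(model.cluster_centers_)
)
```

The hand-computed `wcss` matches `model.inertia_` up to floating-point rounding, confirming that inertia is exactly this quantity.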
Python Implementation of the K-Means Clustering Algorithm
Here’s how to use Python to implement the K-means clustering algorithm. These are the steps you need to take:
 Data preprocessing
 Finding the optimal number of clusters using the elbow method
 Training the K-means algorithm on the training data set
 Visualizing the clusters
1. Data Pre-Processing. Import the libraries and the dataset, and extract the independent variables.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values
2. Find the optimal number of clusters using the elbow method. Here’s the code you use:
#finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # initializing the list for the values of WCSS
# using a for loop to iterate over K from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
3. Train the K-means algorithm on the training dataset. Use the same code as in the previous section, but instead of using i, use 5, because there are 5 clusters that need to be formed. Here’s the code:
#training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.
#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Challenges With the K-Means Clustering Algorithm
K-means is a powerful tool, but it’s not without its hiccups. Here are a couple of common challenges you might face:

Uneven Cluster Sizes
One issue you might run into with K-means is when clusters vary in size. Picture this: you have clusters that are small and spread out on the edges, and a larger, more central cluster. When K-means is applied, it can struggle to evenly distribute the data. The algorithm might create clusters that don’t quite match the actual data distribution, leading to some clusters being too small or too large compared to others.

Different Densities
Another challenge comes up when the clusters have different densities. Imagine you have some clusters with tightly packed points and others where the points are more scattered. K-means might have trouble with this. It tends to group points based on distance from the cluster center, so tightly packed points might end up in one cluster, while scattered points could be split across different clusters, even if they actually belong together. This can lead to clusters that don’t accurately reflect the true structure of your data.
Demo: K-Means Clustering
Problem Statement: Walmart wants to open a chain of stores across the state of Florida and find the optimal store locations to maximize revenue.
The issue here is that if they open too many stores close to each other, they will not make a profit. But if the stores are too far apart, they will not have enough sales coverage.
Solution: As a retail giant, Walmart already has a database of its customers' addresses, which it can use to perform K-means clustering and find the optimal store locations.