The best beginner’s guide on K-Means Clustering.
Everything you need to know about K-means clustering
Data is essential for data science. With a lot of data being generated every second, it’s no surprise that most of this data is unlabeled. But this is okay because there are different techniques, to handle unlabeled datasets. Even there’s an entire domain of Machine Learning called “Unsupervised Learning” that deals with unlabeled data.
Sometimes we want to see how the data is organized, and that’s where clustering comes into play. Though it’s mostly used for unlabeled data, it works fine for labeled data as well. The word ‘clustering’ means grouping similar items together. The most commonly used clustering method is K-Means.
This article will be the best guide for you which explains how K-Means Clustering works, how to measure the quality of clusters, choose the optimal number of K.
Imagine you’re opening a book store. You have a stack of books and 3 bookshelves. Your goal is to place similar ones on one shelf. What you would do, is pick up 3 books, one for each shelf in order to set a theme for every shelf. These books will now dictate which of the remaining books will go on which shelf.
Every time you pick a book up from the stack, you would compare it with those 3 books, and place this new book on the shelf that is similar. You would repeat this process until all the books are placed.
Once you’re done, you might see that changing the number of bookshelves, and picking up different books for those shelves would increase how well you’ve grouped the books. So, you should repeat the process in hopes of a better outcome.
K-means clustering is a better place to start exploring unlabeled data. The “K” in K-Means denotes the number of clusters. This algorithm is bound to focalize to a solution after some iterations. It has 4 steps:
- Initialize Cluster Centroids
- Assign datapoints to Clusters
- Update Cluster centroids
- Repeat steps 2–3 until the stopping condition is met.
You don’t have to start with 3 or more than 3 clusters initially, but 2–3 is generally a good place to start and update later on.
1. Initialize K & Centroids
At the start, you tell your model how many clusters it should make. First, the model picks up K. For now, let’s take ‘K’ as 3 data points from the dataset. These points are called cluster centroids.
There are different ways to initialize the centroids, you can either choose them at random — or sort the dataset, split it into K portions and pick one data point from each portion as a centroid.
2. Assigning Clusters to datapoints
From here, the model performs calculations on its own and assigns a cluster to each point. Your model would calculate the distance between the point & all the centroids and will be assigned to the cluster with the nearest centroid. There are different ways you can calculate this distance; all having their pros and cons.
The picture above shows how to calculate the L2 distance between the centroid and a data point. Every time a data point is assigned to a cluster the following steps are followed.
3. Updating Centroids
As the initial centroids were chosen arbitrarily, your model updates them with new cluster values. The new value may or may not occur in the dataset, in fact, it would be a coincidence if it does. This is because the updated cluster centroid is the mean value of all the data points within that cluster.
Updating cluster centroids
4. Stopping Criterion
As steps 2 and 3 would be performed iteratively, it would go on forever if we don’t set a stopping criterion. This stopping criterion tells our algorithm when to stop updating our clusters. It is important to note that setting a stopping criterion would not necessarily return the best cluster but to make sure it returns fairly good clusters, and more importantly at least return some clusters.
Like everything else, there are different ways to set the stopping criterion. You can even set multiple conditions, if met, would stop the iteration and return the results. Some conditions are:
- The data points assigned to specific cluster remain the same
- Centroids remain the same
- The distance of data points from their centroid is minimum
- The fixed number of iterations has reached
Evaluating the cluster quality
The goal here isn’t just to make clusters, but to make, meaningful clusters. Quality clustering is when the data points within a cluster are close together, and far from others.
The two methods to measure the cluster quality are:
- Inertia: Inertia tells how far the points within a cluster are. The range of inertia’s value starts from zero and goes up.
- Silhouette score: The silhouette score tells how far the data points in one cluster are, from the data points in the cluster. The range of silhouette score is from -1 to 1. The score should be closer to 1 than -1.
How many clusters?
You have to define the number of clusters you want to make. There are a few methods available to choose the number of K. The straightforward method is to just plot the data points and see if it gives you a hint. As you can see in the figure, making 3 clusters seems like a good choice.
K=3 seems like a good choice
Another method is to use the value of inertia. The idea behind good clustering is having a small value of inertia, and a small number of clusters.
The value of inertia decreases as the number of clusters increase. So, it’s a trade-off here. Rule of thumb: The elbow point in the inertia graph is a good choice because after that the change in the value of inertia isn’t significant.
K=3 becomes the optimal choice
I hope this article would have given you a better explanation and guide for you as a beginner in K-means clustering.
Don’t forget to leave your responses.✌
Everyone stay tuned!! To get my stories in your mailbox kindly subscribe to my newsletter.
Thank you for reading! Do not forget to give your claps and to share your responses and share it with a friend!