The best beginner’s guide on K-Means Clustering.

Picture by Radu Marcusu on Unsplash

Data is essential for data science. With a lot of data being generated every second, it’s no surprise that most of this data is unlabeled. But this is okay because there are different techniques, to handle unlabeled datasets. Even there’s an entire domain of Machine Learning called “Unsupervised Learning” that deals with unlabeled data.

Sometimes we want to see how the data is organized, and that’s where clustering comes into play. Though it’s mostly used for unlabeled data, it works fine for labeled data as well. The word ‘clustering’ means grouping similar items together. The most commonly used clustering method is K-Means.

This article will be the best guide for you which explains how K-Means Clustering works, how to measure the quality of clusters, choose the optimal number of K.

The Concept

Imagine you’re opening a book store. You have a stack of books and 3 bookshelves. Your goal is to place similar ones on one shelf. What you would do, is pick up 3 books, one for each shelf in order to set a theme for every shelf. These books will now dictate which of the remaining books will go on which shelf.

Every time you pick a book up from the stack, you would compare it with those 3 books, and place this new book on the shelf that is similar. You would repeat this process until all the books are placed.

Once you’re done, you might see that changing the number of bookshelves, and picking up different books for those shelves would increase how well you’ve grouped the books. So, you should repeat the process in hopes of a better outcome.

The Algorithm

K-means clustering is a better place to start exploring unlabeled data. The “K” in K-Means denotes the number of clusters. This algorithm is bound to focalize to a solution after some iterations. It has 4 steps:

  1. Initialize Cluster Centroids
  2. Assign datapoints to Clusters
  3. Update Cluster centroids
  4. Repeat steps 2–3 until the stopping condition is met.

You don’t have to start with 3 or more than 3 clusters initially, but 2–3 is generally a good place to start and update later on.

1. Initialize K & Centroids

At the start, you tell your model how many clusters it should make. First, the model picks up K. For now, let’s take ‘K’ as 3 data points from the dataset. These points are called cluster centroids.

There are different ways to initialize the centroids, you can either choose them at random — or sort the dataset, split it into K portions and pick one data point from each portion as a centroid.

2. Assigning Clusters to datapoints

From here, the model performs calculations on its own and assigns a cluster to each point. Your model would calculate the distance between the point & all the centroids and will be assigned to the cluster with the nearest centroid. There are different ways you can calculate this distance; all having their pros and cons.

The picture above shows how to calculate the L2 distance between the centroid and a data point. Every time a data point is assigned to a cluster the following steps are followed.

3. Updating Centroids

As the initial centroids were chosen arbitrarily, your model updates them with new cluster values. The new value may or may not occur in the dataset, in fact, it would be a coincidence if it does. This is because the updated cluster centroid is the mean value of all the data points within that cluster.

Updating cluster centroids

4. Stopping Criterion

As steps 2 and 3 would be performed iteratively, it would go on forever if we don’t set a stopping criterion. This stopping criterion tells our algorithm when to stop updating our clusters. It is important to note that setting a stopping criterion would not necessarily return the best cluster but to make sure it returns fairly good clusters, and more importantly at least return some clusters.

Like everything else, there are different ways to set the stopping criterion. You can even set multiple conditions, if met, would stop the iteration and return the results. Some conditions are:

  1. The data points assigned to specific cluster remain the same
  2. Centroids remain the same
  3. The distance of data points from their centroid is minimum
  4. The fixed number of iterations has reached

Evaluating the cluster quality

The goal here isn’t just to make clusters, but to make, meaningful clusters. Quality clustering is when the data points within a cluster are close together, and far from others.

The two methods to measure the cluster quality are:

  1. Inertia: Inertia tells how far the points within a cluster are. The range of inertia’s value starts from zero and goes up.
  2. Silhouette score: The silhouette score tells how far the data points in one cluster are, from the data points in the cluster. The range of silhouette score is from -1 to 1. The score should be closer to 1 than -1.

How many clusters?

You have to define the number of clusters you want to make. There are a few methods available to choose the number of K. The straightforward method is to just plot the data points and see if it gives you a hint. As you can see in the figure, making 3 clusters seems like a good choice.

K=3 seems like a good choice

Another method is to use the value of inertia. The idea behind good clustering is having a small value of inertia, and a small number of clusters.

The value of inertia decreases as the number of clusters increase. So, it’s a trade-off here. Rule of thumb: The elbow point in the inertia graph is a good choice because after that the change in the value of inertia isn’t significant.

K=3 becomes the optimal choice

Conclusion

I hope this article would have given you a better explanation and guide for you as a beginner in K-means clustering.

Don’t forget to leave your responses.✌

Everyone stay tuned!! To get my stories in your mailbox kindly subscribe to my newsletter.

Thank you for reading! Do not forget to give your claps and to share your responses and share it with a friend!

--

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

3. Credit explained: first-party data

How to Run Hive Scripts?

UPNG 2019 Intake Data Visualization

Finding Biological Taxonomic Database Model

What is The Difference Between Data Science and Machine Learning?

Financial Modeling with Excel

Dive into Agro API Part 5 | Accumulated temperature and precipitation by polygon

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Adith - The Data Guy

Adith - The Data Guy

More from Medium

Normal Distribution and Normality Test

Recognizing Handwritten Digits with scikit-learn

Curse of Dimensionality

Getting started data visualization with Seaborn

png