Clustering Analysis: k-Means

An unsupervised learning method for cluster analysis: k-Means. Part 1 of 2 on clustering analysis

Richard Mei
4 min read · Nov 9, 2020

What is Clustering?

Clustering analysis is an unsupervised method or technique for breaking down data into groups/clusters. Like most unsupervised learning, clustering is a great method for exploring a new data set in hopes of finding some type of pattern or underlying structure. This is unsupervised because we aren’t predicting any labels, but rather finding ways to make groups. (Great article on unsupervised vs. supervised learning).

The two types of clustering that I'm familiar with are k-Means and Hierarchical Clustering (HCA), and they work in different ways. I'll first go over the k-Means method.

k-Means Clustering

This method of clustering sounds very much like our k-Nearest Neighbors algorithm, and the naming won't be too surprising once you understand how it works. Before going into an instance where you would want to use this, the steps are:

  • Step 1: Pick the number of groups you believe there are in the data. In more technical terms, pick the number of centroids to begin with. The centroids are initialized at random points.
  • Step 2: Assign all points to the closest centroid. This is done by looking at each data point, calculating its distance from each centroid, and assigning it to whichever centroid is the shortest distance away. At this point you will have as many clusters formed as centroids.
  • Step 3: Look at each cluster and calculate the mean of all the points within that cluster. Once we find the mean, we will use it to update the centroid.
  • Step 4: Update the centroids and repeat the previous steps until the centroids stop changing (see the sketch just after this list for these steps in code).
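To make these steps concrete, here is a minimal NumPy sketch of the loop above. The function name k_means, the convergence check, and the use of randomly chosen data points as starting centroids are just my own illustration of the idea, not any particular library's implementation.

import numpy as np

def k_means(X, k, n_iters=100, seed=42):
    # X is an (n_samples, n_features) array; k is the number of centroids
    rng = np.random.default_rng(seed)
    # Step 1: initialize the centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster ends up empty, which a robust version would handle)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels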

For example, a 3-mean cluster would look something like this.

One case to consider is that, depending on where the randomly initialized centroids start, the ordering of the groups may change. Looking at the picture above, if the blue centroid started on the bottom right of the data and was farther away than the red one, then the bottom right of the data set would be blue instead. This makes it worth considering other ways of initializing the centroids, like choosing specific points or spacing the initial centroids a certain distance away from each other.

Another case, also tied to random initialization, is the potential of having too many centroids. This may result in the data not being clustered properly. A really simple and great tool for visualizing the methodology of k-Means can be found here. It was a big help for me to see what's going on in a simple 2-dimensional example.

So overall, the methodology is taking "k" centroids and repeatedly moving each one to the mean of its cluster, hence k-Means clustering!
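In practice you would usually reach for a library instead of writing the loop yourself. As an illustration (scikit-learn isn't mentioned above, it's just my example), its KMeans lets you pick k and also rerun the algorithm from several random starts to soften the initialization issue:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-dimensional data with 3 underlying groups, just for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters is "k"; n_init reruns k-Means with different random starts and
# keeps the best run, and init="k-means++" spreads the starting centroids out
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)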

Clustering Metric

Once we have our clusters, we want to know how well the clustering performed. The ideal scenario would be to have all the points within one cluster be a short distance from the other points in the same cluster. This can be referred to as the cohesion of a cluster. We also want points to be farther away from points in other clusters than from points in their own cluster. This can be referred to as separation from other clusters. A metric that captures both of these relative to the whole data set is the Silhouette Value, and for calculating distance we can use any distance formula.

We have “b_i” as the mean cohesion and “a_i” as the separation. When we divide by the max of either of these, we will have a silhouette score that ranges between -1 and 1. Having a higher value means its assigned in the proper cluster and isn’t too similar to a neighboring ones. From here we can observe if most of our points are close to 1, then our clustering has done a decent job. If we see a lot of points closer to -1, then it may be our clustering may have too many or too little clusters and not grouped properly.

Downsides

In order for k-Means clustering to be fully effective, we do make some assumptions: independent variables, balanced cluster sizes, similar cluster density, and roughly circular (spherical) cluster shapes. The shape of the clusters can be thought about in terms of variance, and it's explained much better in this article by David Robinson.

Besides the sensitivity to random initialization, another potential downside of k-Means clustering is that it tends to create roughly equal-sized clusters even if the data doesn't have equal groups. The same issue shows up when the groups have very different densities. To help, one can apply dimensionality reduction, scale the data, or even try another clustering method.
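As a rough sketch of what that might look like (again with scikit-learn, and with an arbitrary choice of 2 components just for illustration), scaling and dimensionality reduction can be chained in front of k-Means:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, reduce them to 2 dimensions, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)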

Conclusion

Hope this was a helpful overview of k-Means clustering! I will explore other clustering methods like Hierarchical Clustering Analysis (HCA) in another post!
