Clustering Analysis: HCA

An unsupervised learning method for cluster analysis: Hierarchical Clustering Analysis (HCA). Part 2 of 2 of Clustering Analysis.

Richard Mei
Nov 16, 2020 · 3 min read

What is HCA Clustering?

To remind you from the previous post, clustering analysis is an unsupervised method or technique for breaking data down into groups/clusters. We aren’t predicting any labels; we’re just finding groups, though HCA goes about forming them differently than k-Means does.

As a refresher, k-Means starts by picking a number of starting points (centroids) to reflect the number of groups we want, assigning every point to its closest centroid, recomputing each centroid as the mean of the points assigned to it, and repeating those last two steps until things settle. Remember, k-Means has the downside of potentially picking too many or too few groups/starting points. Picking the number of groups up front is a top-down approach, but HCA is a bottom-up approach that lets the data point us toward a number of groups.
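If you want to see that top-down flavor in code, here’s a minimal sketch with scikit-learn (the library and the toy data are my assumptions, not something from the original post): notice that we have to commit to n_clusters before we ever fit anything.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Toy data: two loose blobs in 2-D, standing in for real features.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Top-down: the number of groups is decided before fitting.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)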

Rather than listing out steps like in k-Means, HCA is much simpler. It starts off with each data point being its own group. Next, the two groups closest to each other become a single group. Two groups are closer if the distance between them is shorter, measured with whatever metric you like: Euclidean, Manhattan, etc. This process continues on and on until there is one final single group. To picture the result of this process, you can picture a tree forming from the end leaves to the root. The tree that forms is called a dendrogram.

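To make the process concrete, here’s a minimal sketch using SciPy (the library and the toy points are my assumption; the post itself doesn’t name a tool). linkage records the merge history, and dendrogram draws it as the tree described above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy 2-D data; in practice this would be your feature matrix.
X = np.array([[1, 1], [2, 1], [5, 4], [9, 8], [10, 9]])

# Agglomerative HCA: every point starts as its own cluster and the
# two closest clusters are merged repeatedly until one remains.
Z = linkage(X, method="single", metric="euclidean")

# The merge history in Z is exactly what the dendrogram draws.
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()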

To be more specific, this is called agglomerative HCA. Agglomerative means to gather or amass, so the name fits perfectly. The other approach is divisive HCA, which, as you can probably guess, starts with one big group and divides it up until every point is its own group. That’s a top-down approach like k-Means clustering. I’ll only go into agglomerative HCA since I’m more familiar with that one.

Agglomerative HCA

To go further into detail, agglomerative HCA has a few different ways of determining the distance between clusters. If we had three clusters with multiple points in each of them, how would we measure the distance between two of them? In other words, how do we decide which groups are close enough to link?

Every point in one cluster has a distance to every point in the other cluster. One way of dealing with that is to take the average of those distances, called average linkage. It comes in a weighted-average and an unweighted-average flavor, which the article I linked explains much better than I could here. Besides that, there are also single linkage and complete linkage.
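If you’re wondering how you’d actually pick between these, SciPy’s linkage function takes them by name. This is just a sketch under my assumption that SciPy is the tool in play; in SciPy’s naming, 'average' is the unweighted average (UPGMA) and 'weighted' is the weighted average (WPGMA).

from scipy.cluster.hierarchy import linkage

# Five 1-D points (the same ones used in the worked example below).
X = [[1], [2], [5], [9], [10]]

for method in ["single", "complete", "average", "weighted"]:
    Z = linkage(X, method=method)
    print(method)
    # Each row of Z: the two clusters merged, their distance, and the new cluster's size.
    print(Z)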

Single linkage uses the shortest possible distance between two groups, while complete linkage uses the largest distance between them. It’s much easier to see with an example, so we start with every point as its own group.

[1, 2, 5, 9, 10]

Calculating the distances, the two closest pairs are tied with a distance of 1, so each pair merges into a group:

[1, 2] 5 [9, 10]

Now where does the 5 go? We’ll call [1, 2] group A and [9, 10] group B.

Distances from 5 to group A: [5-1, 5-2] = [4, 3]
Distances from 5 to group B: [9-5, 10-5] = [4, 5]

Using single linkage, the distance from 5 to group A is the shortest of [4, 3], which is 3, and to group B it’s the shortest of [4, 5], which is 4. Group A is closer, so 5 joins [1, 2] to give [1, 2, 5]. Using complete linkage, the distance to group A is the largest of [4, 3], which is 4, and to group B it’s the largest of [4, 5], which is 5. Group A is still the closer one, so 5 again joins [1, 2]. The two linkages happen to agree on this merge, but because they define the distance between groups differently, they can produce different merges, and different dendrograms, on other data.
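As a quick sanity check on the arithmetic above (again assuming SciPy, which isn’t something the post requires), you can cut the tree into two groups and confirm that both linkages place 5 with [1, 2] on these five points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1], [2], [5], [9], [10]])  # the same five points as above

for method in ["single", "complete"]:
    Z = linkage(X, method=method)
    # Cut the tree into two groups and see where each point lands.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)  # 5 ends up with [1, 2] under both linkages here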

Evaluation

After obtaining a dendrogram laying out the groups, you can judge for yourself what a natural grouping would be. After deciding the number of groups to use, you can visualize the data to see if the grouping makes sense. Moreover, just like with k-Means, there is the silhouette score as a metric, measuring how similar each point is to its own cluster compared to the other clusters.
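Here’s a minimal sketch of that evaluation step with scikit-learn (my assumption, with toy data standing in for whatever you’re clustering): cut the hierarchy at a few different group counts and compare silhouette scores.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: two loose blobs standing in for a real feature matrix.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Try a few cluster counts; scores closer to 1 mean points sit
# snugly inside their own cluster.
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))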

Conclusion

Overall, HCA is another type of clustering analysis that is simple and easy to implement. It’s a great tool for trying to establish groupings, but it for sure isn’t perfect. One disadvantage is that outliers can throw the merging out of whack. Another is that there’s no back-tracking on merges: once two groups are joined, that decision is never revisited, which can be a problem further down the line.
