Hierarchical Clustering - An Unsupervised Learning Algorithm

Introduction

Unsupervised learning is a type of machine learning in which we use unlabeled data and try to find patterns in it.
Clustering algorithms fall under the category of unsupervised learning. These algorithms try to group the data into different clusters.
Hierarchical Clustering algorithms build a hierarchy of clusters, in which each node is a cluster made up of the clusters of its child nodes.

To check its implementation in Python, CLICK HERE

There are two main strategies in Hierarchical Clustering:
  • Divisive
  • Agglomerative

Divisive - A top-down approach: we start with all observations in one large cluster and break it down into smaller ones.
Agglomerative - The opposite of Divisive, a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as we move up the hierarchy.
The resulting hierarchy is usually drawn as a tree diagram called a dendrogram.
(Data scientists generally use Agglomerative more often than Divisive.)
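As a quick sketch of the agglomerative approach, SciPy can compute the merges and the dendrogram layout on a small, made-up dataset (the points below are purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: two obvious groups of points
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)

Z = linkage(X, method="average")    # bottom-up (agglomerative) merges
tree = dendrogram(Z, no_plot=True)  # dendrogram layout, without drawing it
print(tree["ivl"])                  # order of the leaves (one per data point)
```

Passing the same `Z` to `dendrogram` inside a matplotlib figure draws the actual tree.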


Example of Hierarchical Clustering - An international team of scientists led by UCLA biologists used a dendrogram to report genetic data from more than 900 dogs of 85 breeds and more than 200 wild grey wolves worldwide. They used the diagram to see the similarity in the genetic data of these animals.


Algorithm

The algorithm for (agglomerative) Hierarchical Clustering is as follows:
  1. Create n clusters, one for each data point.
  2. Compute the proximity matrix.
  3. Repeat -
    1. Merge the two closest clusters.
    2. Update the proximity matrix.
  4. Until only a single cluster remains.
Now you must be wondering: what is a proximity matrix?
A proximity matrix is simply an n x n matrix, where n is the number of data points. Its entry (i, j) holds the distance between cluster i and cluster j.
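The steps above can be sketched as a naive single-linkage implementation. This is an illustrative toy version (names like `agglomerative` are my own, and real libraries use much faster algorithms), but it follows the four steps literally:

```python
import numpy as np

def agglomerative(points):
    """Naive single-linkage agglomerative clustering.
    Returns the sequence of merges performed."""
    # Step 1: every data point starts as its own cluster
    clusters = [[i] for i in range(len(points))]
    # Step 2: proximity matrix of pairwise Euclidean distances
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    merges = []
    # Steps 3-4: repeat until only a single cluster remains
    while len(clusters) > 1:
        best, best_d = (0, 1), np.inf
        # find the two closest clusters (single linkage: minimum pairwise distance)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        merges.append((clusters[a], clusters[b], best_d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # "update" the set of clusters
    return merges

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
result = agglomerative(pts)
for left, right, d in result:
    print(left, right, round(d, 2))
```

On these four points, the two tight pairs are merged first, and the final merge joins the two pairs at a much larger distance.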

Computing the Proximity Matrix and the Distance Between Two Clusters

There are 4 common ways of computing the distance between clusters -
  1. Single-Linkage Clustering
    • The minimum distance between clusters
  2. Complete-Linkage Clustering
    • The maximum distance between clusters
  3. Average-Linkage Clustering
    • The average distance between clusters
  4. Centroid-Linkage Clustering
    • The distance between cluster centroids
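All four linkage criteria are available in SciPy, so we can run them side by side on a small made-up dataset (the points and the choice of 3 flat clusters are just for illustration) and see the cluster labels each one produces:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight pairs plus one far-away point
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [20, 0]], dtype=float)

flat = {}
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                          # (n-1) x 4 merge table
    flat[method] = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 flat clusters
    print(method, flat[method])
```

Different linkage choices can produce different clusterings on the same data; on well-separated toy data like this, they all recover the same three groups.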

Advantages and Disadvantages

Advantages:
  • No need to specify the number of clusters in advance; the dendrogram lets you pick it afterwards.
  • The dendrogram gives an interpretable picture of how the data is structured at every level.
Disadvantages:
  • It is slow on large datasets (the naive algorithm takes O(n^3) time and O(n^2) memory).
  • Merges cannot be undone, so an early bad merge propagates up the hierarchy.
Hierarchical Clustering V/S K-Means

  • K-Means needs the number of clusters k up front; Hierarchical Clustering does not.
  • K-Means depends on the random initialization of centroids, while agglomerative Hierarchical Clustering is deterministic.
  • K-Means scales to large datasets much better than Hierarchical Clustering.
  • K-Means tends to find roughly spherical clusters; Hierarchical Clustering can capture other shapes depending on the linkage used.
Thanks for reading the blog. Drop your feedback and suggestions in the comments. The next post will be on its implementation, so make sure to check it out.

To check its implementation in Python, CLICK HERE

Visit my website - https://chandbud.me/
