Implementing Hierarchical Clustering (Python)

Clustering is an important unsupervised learning technique. It aims to split data into distinct clusters. For instance, if you input data pertaining to shoppers at a local grocery store, clustering could output age-based clusters of, say, under 12 years, 12-18, 18-60, and over 60. Another intuitive example is banking: clustering financial data for a large group of individuals could output income-based clusters, say 3 pertaining to the lower middle, upper middle, and upper classes.

The most basic and intuitive method of clustering is K-means, which identifies K clusters. It randomly initialises K centroids, assigns each point to its nearest centroid, recomputes each centroid as the average of its assigned points, and repeats until the K centroids stop changing. Hierarchical clustering, however, is based on a quite different principle.
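For reference, here is a minimal sketch of K-means with scikit-learn. The shopper ages here are made-up placeholder data, not the dataset used below:

import pandas as pd
from sklearn.cluster import KMeans

# hypothetical shopper data with a single 'age' feature
shoppers = pd.DataFrame({'age': [8, 10, 15, 17, 25, 40, 55, 70, 75]})

# K-means alternates between assigning points to the nearest centroid
# and recomputing each centroid as the mean of its assigned points
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(shoppers[['age']])
print(labels)                     # cluster index for each shopper
print(kmeans.cluster_centers_)    # the K averaged-out centroids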

There are two methods of hierarchical clustering: agglomerative and divisive. Agglomerative starts by putting each data point into a cluster of its own. Hence, if you input 6000 points, you start out with 6000 clusters. It then merges the closest pair of clusters, and repeats this process until only one giant cluster encompassing the entire dataset is left. Divisive does the exact opposite: it starts with one giant cluster and splits clusters repeatedly until each point is its own cluster. The results are plotted on a special plot called a dendrogram. As a rule of thumb, cutting the dendrogram across its longest vertical line segment gives the optimum number of clusters for the analysis. This will be much easier to understand when I show a dendrogram below.

Here is a link to the dataset I’ve used. You can access the full code here. This tutorial on analyticsvidhya was also immensely helpful to me in understanding how hierarchical clustering works.

Note down the libraries I’ve imported. The dataset is fairly straightforward: you have an ID for each customer, and financial data corresponding to that ID. You’ll notice that the standard deviation, or range, of each feature is quite different. Where balance frequency tends to stay close to 1 for each ID, account balance varies wildly. This can cause issues during clustering, which is why I’ve scaled my data so that each feature is on a similar scale relative to the others.
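If you want a concrete picture of that step, here is a minimal sketch of the scaling with scikit-learn's StandardScaler. The file name and the dropped ID column are assumptions about my setup rather than something you need to copy exactly:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('CC GENERAL.csv')           # assumed file name for the credit card dataset
data = data.drop('CUST_ID', axis=1).fillna(0)  # drop the ID column, fill missing values

scaler = StandardScaler()
scaled = scaler.fit_transform(data)            # every feature now has mean 0 and unit variance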

Here is the dendrogram for the data. The y-axis represents the ‘closeness’ (linkage distance) of each individual data point/cluster. You’d obviously expect y to be maxed out when there’s only 1 cluster, so that’s no surprise. Now, looking at this graph, we must select the number of clusters for our model. A general rule of thumb is to take the number of clusters you get by cutting across the longest vertical line visible here. The length of each vertical line represents how far apart the clusters it joins are. Hence, you want a small number of clusters (not strictly necessary, but optimal in this application), but you also want your clusters to be spaced far apart, so that they clearly represent different groups of people (in this context).
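For reference, a dendrogram like this can be produced with SciPy. This is a sketch using Ward linkage, which is an assumption on my part, and it reuses the scaled matrix from the sketch above:

import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

linkage_matrix = sch.linkage(scaled, method='ward')  # build the merge hierarchy
plt.figure(figsize=(12, 6))
sch.dendrogram(linkage_matrix)
plt.ylabel('linkage distance')
plt.show()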

I’m taking 3 clusters, which corresponds to a vertical-axis value of 23. 3 clusters also intuitively makes sense to me, as any customer can broadly be classified into lower, middle, and upper class. Of course, there are subdivisions within these 3 broad categories too, and you might argue that the lower class wouldn’t even be represented here, so we can say that these 3 clusters correspond to the lower middle, upper middle, and upper classes.

Here is a diagrammatic representation of what I’ve chosen.
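Here is a minimal sketch of building that model with scikit-learn's AgglomerativeClustering (Ward linkage again assumed), using the scaled matrix from earlier:

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(scaled)   # one cluster label (0, 1 or 2) per customer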

After building the model, all that’s left is visualising the results. There are more than two features, so I’m arbitrarily selecting two, plotting all points using those features as my axes, and giving each point a color that corresponds to its cluster.
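A sketch of that visualisation, assuming the two arbitrarily chosen features sit in the first two columns of the scaled matrix:

import matplotlib.pyplot as plt

plt.scatter(scaled[:, 0], scaled[:, 1], c=labels, cmap='viridis', s=10)
plt.xlabel('feature 1 (scaled)')
plt.ylabel('feature 2 (scaled)')
plt.show()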

You’ll notice that there is a lot of data here, but also a clear pattern. Points belonging to the purple cluster visibly tend towards the upper left corner. Similarly, points in the teal cluster tend to the bottom left corner, and points in the yellow cluster tend to the bottom right corner.

Implementing PCA and UMAP in Python

You can find the full code for PCA here, and the full code for UMAP here.

Dimensionality reduction is an important part of constructing Machine Learning models. Dimensionality reduction is basically the process of combining multiple features into a smaller number of features. Features that have a higher contribution to the target value have a greater representation in the final combined feature than features that contribute less. For instance, if you have 8 features, the first 6 of which have a summed contribution of around 95%, and the last 2 of which have a contribution of only about 5%, then those 6 features will have a greater representation in the final combined feature. In terms of advantages, the most significant is lower memory usage and hence higher modeling and processing speed. Other advantages include simplicity and easier visualization. For instance, you can easily plot the contribution of two combined features to the target, especially compared to plotting, say, 20 initial features. Another significant aspect is that features with less contribution, which would otherwise add useless ‘weight’ to the model, are removed early on.
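To make the idea of ‘contribution’ concrete, scikit-learn's PCA exposes how much of the variance each combined feature explains. A quick sketch, where 'features' stands in for any scaled feature matrix and the printed numbers will obviously depend on your data:

from sklearn.decomposition import PCA

pca = PCA()                             # keep every component so we can inspect them all
pca.fit(features)                       # 'features' is a placeholder for your scaled feature matrix
print(pca.explained_variance_ratio_)    # typically the first few components dominate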

The two methods of dimensionality reduction I will be using are PCA and UMAP. I won’t be going through how they work in detail, as I’ve given a short overview of their purpose above. Instead, I’ll go through the code I implemented for each, and visualize the results. For this exercise, I’m using the WHO Life Expectancy Dataset that can be found on Kaggle, as it’s very small and easy to work with. My target variable will be life expectancy, and my features will be aspects like adult mortality, schooling, GDP, etc. I randomly selected these features from the dataset.

Here is a list of the modules we will be using. train_test_split will help us break our data into a training set and a testing set (about a 7:3 ratio). While this isn’t significant right now, it aids in the detection of under-fitting and over-fitting. Under-fitting is detected by bad performance on both the training set and the testing set, whereas over-fitting is detected by really good performance on the training set but bad performance on the testing set. StandardScaler has been used to normalise features. Feature normalisation is a technique that reduces the range, or standard deviation, of each feature, in layman’s terms. Lastly, we’ve imported both PCA and UMAP, which will do the dimensionality reduction itself.
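For completeness, here is roughly what those imports and the split/scale steps look like. This is a sketch: X and y are the feature matrix and target vector built in the next step, and the 7:3 ratio is set through test_size:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import umap

# X and y are defined from the dataframe in the next step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit the scaler on the training set only
X_test = scaler.transform(X_test)         # apply the same scaling to the test set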

Here we just load our dataset, extract the features that will be used (see the column names in the dataframe), and rename them for the sake of simplicity. As you can see, some of the original names contain stray spaces and not all use underscores, so I decided to have one uniform way of writing each feature. Now, to extract a feature matrix and a target vector, just drop the life_expectancy column from the dataframe and convert the result into a NumPy array, and convert the life_expectancy column into a separate NumPy array. I won’t dwell on the code for splitting and normalising (a sketch of it is included above), because that’s pretty much boilerplate here.
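A sketch of that step follows. The exact file name, column selection, and renaming dictionary are assumptions that simply mirror whichever columns you picked:

df = pd.read_csv('Life Expectancy Data.csv')

# keep a handful of columns and give them uniform snake_case names (assumed selection)
df = df[['Life expectancy ', 'Adult Mortality', 'Schooling', 'GDP']].rename(columns={
    'Life expectancy ': 'life_expectancy',
    'Adult Mortality': 'adult_mortality',
    'Schooling': 'schooling',
    'GDP': 'gdp',
}).dropna()

X = df.drop('life_expectancy', axis=1).to_numpy()   # feature matrix
y = df['life_expectancy'].to_numpy()                # target vector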

Implementing PCA in itself is very simple, as shown above. You’ll notice that I’ve specified n_components to be equal to 2. This is because the number of combined features you want at the end can be set by you; if you don’t set it, scikit-learn’s PCA keeps all the components it can (one per input feature here), so setting it to 2 is what lets us plot the results in two dimensions. After that, I’ve fitted the training data to PCA.
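In code, that amounts to something like this sketch:

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)   # each row is now just two combined features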

Here’s a bit of data treatment before I finally plot the results. I’ve basically converted PCA’s output, which was a numpy array, to a pandas dataframe, and then added life_expectancy as a column because that will be used for the color-bar you will see below.
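Roughly, that treatment looks like this; the component column names are my own labels:

pca_df = pd.DataFrame(X_train_pca, columns=['component_1', 'component_2'])
pca_df['life_expectancy'] = y_train   # used for the color-bar in the plot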

Here is the code for my plot, and here is the plot:
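Since the plotting code itself lives in the linked notebook, here is a minimal sketch of the same idea with matplotlib:

import matplotlib.pyplot as plt

plt.scatter(pca_df['component_1'], pca_df['component_2'],
            c=pca_df['life_expectancy'], cmap='viridis', s=10)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar(label='life expectancy')
plt.show()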

You can see each data point’s value along the two components, with its target value, or life expectancy, represented by the color of the marker. While I don’t see any patterns straightaway (specific colors being clustered somewhere, etc.), the primary thing that does stand out is how heavily green dots (~70 life expectancy) are clustered towards the bottom left. There are other colors there as well, but there don’t seem to be many green dots anywhere else.

The code for UMAP is exactly the same, except with UMAP as our reducer instead of PCA. Here’s the plot.
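The corresponding sketch, swapping in UMAP (from the umap-learn package) as the reducer:

import umap

reducer = umap.UMAP(n_components=2, random_state=42)
X_train_umap = reducer.fit_transform(X_train)   # same plotting code as before, just different inputs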

You can straightaway see that the results of UMAP are quite different. Once again, there are no noticeable patterns in terms of specific colors being clustered in specific locations, but the overall structure is quite different from that of PCA. We can see that each color is distributed throughout.

There’s no way to say which method is better without modeling your target variable with respect to each method’s components and measuring the accuracy on the testing set. This post just aims to illustrate how both of them work without going into specific details.

Implementing Univariate Linear Regression in Python

The objective of this post is to explain the steps I took to implement univariate linear regression in Python. Do note that I’m not using libraries with inbuilt ML models like sklearn or SciPy here.

Here is our gradient descent function, utilising mean squared error as the cost function.

def gradientDescent(theta, alpha, iterations):
    
    m = ex1data.shape[0]  # number of training examples (rows)
    n = ex1data.shape[1]  # number of columns: the column of 1s, the feature, and the target
    
    for iteration in range(iterations):
        
        total0 = 0
        total1 = 0
        
        for row in range(m): # iterating over each training example
        
            hypothesis = 0
            
            # hypothesis = theta0 * 1 + theta1 * x for this training example
            for val in range(n-1):
                hypothesis += ex1data.values[row][val] * theta[val][0] 
                
            # error term (h(x) - y), where y sits in the last column of ex1data
            load = hypothesis - ex1data.values[row][n-1]
            total0 += load*ex1data.values[row][0]  # accumulate the gradient for theta0 (column of 1s)
            total1 += load*ex1data.values[row][1]  # accumulate the gradient for theta1 (feature column)
            
        # simultaneous update: theta_j := theta_j - (alpha/m) * accumulated gradient
        temp0 = theta[0][0] - ((alpha*total0)/m)
        temp1 = theta[1][0] - ((alpha*total1)/m)
        theta = [[round(temp0, 4)], [round(temp1, 4)]]
        
    return theta

We carry out gradient descent 1500 times here, by setting iterations equal to 1500. Our starting values of theta are 0. Now, for those of you who don’t know how gradient descent works, here’s a short explanation that attempts to cover the crux of it. Intuitively, we subtract the target value we’re trying to predict from each of the outputs given by the hypothesis function, then square the difference; averaging these squared differences over the training set gives the cost. The gradient descent update rule subtracts the partial derivative of this cost (beyond us mortals for now), scaled by the learning rate, from the existing values of theta – updating them. This entire process repeats 1500 times, until gradient descent converges to an optimal value of theta, i.e. the minimum point of the cost function.
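For reference, the update applied to each parameter on every iteration works out to the following, writing h(x) = theta0 + theta1 * x for the hypothesis and m for the number of training examples:

theta_j := theta_j - (alpha / m) * sum over all examples of (h(x_i) - y_i) * x_ij

which is exactly what the total0/total1 accumulators and the temp0/temp1 lines in the function above compute.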

Now, we plot our data to see what it looks like.

m = ex1data.shape[0]
print ('No. of training examples --> {}'.format(m)) # outputting the number of training examples for the user
eye = []

for i in range(0,m): 
    eye.append(1)  # building a list of m 1s to use as the bias column
    
if len(ex1data.columns) == 2:  # to avoid an error wherein ex1data already has the column of 1s
    ex1data.insert(0, "feature1", eye)
    
print ('here is theta (initial)')
theta = [[0], [0]]
matrix_print(theta)

Now, we first add a column vector consisting entirely of 1s, of dimensions m by 1, to our feature matrix. Hence, we have a feature matrix of m by 2, where one column holds our variable data and the other is the column of 1s. We then initialise both values of theta to 0. Our learning rate, alpha, will be set to 0.01 for gradient descent, and we will execute gradient descent 1500 times. After running gradient descent and plotting our predicted values against the actual dataset, this is what we get:
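In case it helps, here is roughly what the call and the plotting step look like; matplotlib is an assumption here, since my original plotting code isn’t shown in this post:

import matplotlib.pyplot as plt

theta = gradientDescent(theta, 0.01, 1500)   # alpha = 0.01, 1500 iterations

x = ex1data.values[:, 1]                     # the feature column (column 0 is the 1s)
y = ex1data.values[:, 2]                     # the target column
predictions = theta[0][0] + theta[1][0] * x  # h(x) = theta0 + theta1 * x

plt.scatter(x, y, s=10, label='training data')
plt.plot(x, predictions, color='red', label='linear fit')
plt.legend()
plt.show()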

Pretty cool, right?

This entire example was based on solving the Week 1 problem set of Andrew Ng’s machine learning course on coursera.org through Python. So credits to Stanford University. Stay quarantined, stay safe!