Chapter 8 Non-spatial clustering

Sessions 8.1 + 8.2

8.1 Cluster Analysis

In the previous weeks, you have successfully used different dimension reduction techniques to help ‘make sense’ of a dataset with many variables (or dimensions). When creating visualizations of our observations based on these reduced dimensions (e.g. the 2-dimensional plots derived from multi-dimensional scaling or a plot of the largest two factors in a factor analysis), depending on the data, you can’t help but to find patterns in the resulting plots. You notice certain observations that group or cluster together. You can try to analyze these clusters visually, by drawing (imaginary) lines around them, but we can also do so in a more structured, programmatic way. Doing this a few advantages: we can do this without the ‘bias’ of the human observer (but, as we discussed at length before, quantitative techniques can be biased too!), calculate goodness-of-fit metrics, and, importantly, find clusters in high-dimensional data (rather than reducing everything to 2 or 3 dimensions so we can visualize it).

As you might imagine, there are different types of techniques for cluster analysis (and it goes by different names too: in network analysis we call this community detection instead) and there isn’t a single good, or optimal, technique. As with many techniques, what is ‘correct’ depends on you - the researcher - and the specifics of the research question and dataset. In this block, we will cover a few of the core, conventional cluster analysis techniques. We will cover a partitional technique (k-means clustering), a hierarchical approach and a density-based approach (DBSCAN).

8.2 Partitional Clustering

To explore how to perform clustering analysis in R, we will yet again use our familiar dataset and initially just extract two factors using factor analysis (through the principal function from the psych package). As we can see in the resulting plot, we can distinguish a few groups of planning areas that seem more closely related to each other than to other planning areas in the country. Can we identify these groups in a more automatic fashion?

To do so, we can use a partitional technique called k means. Partitional techniques all follow the same steps:

  • The number of clusters is pre-assigned (by you, the researcher). We call this number of clusters k.
  • The goal is to divide our space into k clusters in a way that minimizes or maximizes a certain cost function. You are already familiar with cost functions: e.g., in linear regression we are trying to minimize the difference between the estimated values and the observed values.
  • There are different cost functions you can use. The most popular cost function is the total within-cluster distance or squared error, which is referred to as k means. In equation form: \(\sum_{i=1}^{k}{ \sum_{x_j \in S_i}{\|x_j - c_i\|^2}}\). Let’s unpack it.
    • We label our clusters from 1 to \(k\).
    • \(c_i\) is the centroid of a cluster \(i\). You can think of it as the average, middle point of all observations assigned to cluster \(i\).
    • \(x_j\) is an observation belonging to cluster \(i\). We get the distance between \(x_j\) and \(c_i\) with \(\|x_j - c_i\|^2\).
    • Points (i.e., observations) which are in cluster \(i\) are said to belong to \(S_i\), the set of observations assigned to cluster \(i\).
    • We sum over all clusters all distances between observations and their assigned cluster centroid. Just like linear regression, it is a sum of square deviations!

How does the algorithm work?

  1. The algorithm starts out by distributing randomly k cluster centroids so that they are as far away from each other as possible.
  2. Each observation is assigned to the cluster centroid it is closest to.
  3. The cluster centroids are re-calculated.
  4. Observations are re-assigned to new clusters if they are now closer to a different centroid.
  5. Recalculate cluster centroids, and repeat process until a stable solution is reached.

You can watch this process interactively on several websites. I find this simple approach by Karanveer Mohan quite elegant, as well as this interactive article by Naftali Harris.

We can run k-means clustering in R through the built-in kmeans function. It takes two required arguments: the dataset to operate on and the number of cluster to identify. As the solution for k-means might depend on its initial starting points, we can run the algorithm multiple times and take the best solution - this is done by setting the nstart parameter. For now, let’s set the number of clusters to 4 and run the algorithm 50 times.

## K-means clustering with 4 clusters of sizes 6, 13, 2, 7
## Cluster means:
##          RC1        RC2
## 1  0.2104976  0.7698051
## 2 -0.7817176 -0.5659694
## 3 -0.3865277  2.7751919
## 4  1.3817713 -0.4016589
## Clustering vector:
##             4             1             1             2             4 
##             2             3             2             1             4 
##             2             2             2             4             1 
##             1             4             2             2             4 
##             2             2             1             2             3 
##             4             2             2 
## Within cluster sum of squares by cluster:
## [1] 1.59724144 3.45176404 0.08472948 2.74002273
##  (between_SS / total_SS =  85.4 %)
## Available components:
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Inspect the output of the clustering algorithm. We indeed find 4 clusters with 2, 6, 7, and 13 members respectively. We can visualize the clustering results with the fviz_cluster function from the factoextra package and change the size of the labels with option cex.

8.2.1 Exercise

Try to re-run the same analysis with only 2 clusters. What do you think of the resulting clusters? Do the results still make sense?

The previous exercise is a good illustration of picking the ‘right’ number of clusters. As you might have guessed, this again relates deeply to the bias-variance tradeoff we have seen in class.

  • With a small number of clusters, we have more bias. At one extreme, when we only pick one cluster, all observations are assigned to it. The model cannot discriminate between different groups, i.e., our model is too “simple”.
  • With a large number of clusters, we have more variance. At the other extreme, when we pick as many clusters as we have observations, each observation is its own cluster, which it doesn’t share with any other observation. The model is “too good” at discriminating, putting everyone in their own group: our model is too “complex”.

We will see how to approach that in a later section but first we will try to use a different clustering technique altogether.

8.3 Determining the optimal number of clusters

So far, we have set the number of clusters ourselves. In many cases, you can have good theoretical or a-priori reasons to set the number of clusters ahead of times. In other cases, you have no notion about the expected number of clusters yet. So how to come up with a ‘good’ number? There are different techniques to help you assess this. One of these techniques is called the ‘elbow’ method - similar to how we can decide how many factors to extract in PCA.

As k-means clustering is trying to minimize a specific cost function (the total within-cluster squared error), we can simply iterate through all the potential number of clusters in our dataset and calculate this statistic. We can then look for the ‘bend’ point: the point where adding additional clusters doesn’t greatly add to minimizing our cost function anymore. As with any repeated operation, we will use purrr to automate the loop (specifically, the map_dbl function).

We can also do this with one of the convenience functions offered in the factoextra package. How many clusters would you pick based on this plot?

There are additional methods too. Average silhouette is another often-used approach. Conceptually, the silhouette is a measure reflecting how well an individual point fits within its cluster (this Stackoverflow answer provides a more detailed description). The average silhouette simply takes the average of the silhouettes of all points. Again, factoextra makes plotting this quite easy.

What number of clusters would you pick based on this method?

Finally, there is also the gap statistic. This statistic is calculated based on a comparison of our cluster result with a null distribution that assumes no clustering. We will cover the conceptual logic behind this when we discuss spatial clustering next week so for now we will just look at what this metric indicates the best k would be (read the original paper by Tibshirani, Walther and Hastie (2000) if you’re interested). We need an additional library (cluster) to calculate this statistic.

What is the suggested number of clusters based on the gap statistic?

8.4 Hierarchical clustering

Hierarchical clustering works a little differently from partition-based clustering. There are two types of hierarchical clustering: agglomerative and divisive. In the first approach, each observation starts as its own cluster (these are called leaves) and observations are iteratively merged until only a single, large cluster is left (we call this the root cluster). At each step, the two clusters that are most similar are merged together. The second approach works the other way around, we start with a single, large cluster and then split the cluster up into smaller clusters until each observation is its own cluster.

We can run agglomerative hierarchical clustering in R with the built-in hclust function. It takes a single requirement, namely the distance matrix of our observations (remember those from MDS in the previous block?).

The resulting plot, called a dendogram, is a visual summary of all the steps of the agglomerative algorithm. Can you identify the two planning areas in Singapore that are most similar? We can use this structure to make ‘cuts’ at different parts depending on how many clusters we would like to identify. To extract 4 clusters:

cutree separates the k clusters into k groups, then plotted using fviz_cluster.

8.4.1 Exercise

Try to re-run the same analysis with only 2 clusters again and compare the results to the k-means based outcome. Which method do you think is better?

8.5 Extension: Density-based clustering (DBSCAN)

You have now seen two different approaches to clustering. There are many more clustering techniques available today. We will cover one more that is called DBSCAN. It is a density-based clustering algorithm, which works slightly different from the partion and hierarchical methods covered so far. The goal of such an algorithm is to cluster points together that are close to other (groups of) points. To do so, it uses two specific parameters. Again, these will need to be set by hand.

  • First, is the minimum number of points that a cluster needs to have to be considered a cluster.
  • Second, is the ‘epsilon’ or distance threshold within which points can be considered to be part of the same cluster.

How does DBSCAN work?

  1. The algorithm starts at a random observation.
  2. It then looks around that point within the epsilon radius. If it finds more points than the minium number of points, it will mark all those points as belonging to the same cluster as the current point.
  3. It will then move on to the next point and repeat the procedure.

Naftali Harris has made another good interactive exploration of the steps involved in this algorithm.

There are a few advantages of this method. First, it is able to distinguish oddly shaped clusters (see for example the multishapes example in the factoextra package). This can be an important advantage but this highly depends on the meaning of your dimensions. In our example planning area dataset this isn’t necessarily a good thing. Second, it will not assign all points to a cluster per se but can also identify outliers that do not belong to any cluster.

You might wonder how do we determine the minimum number of points and the epsilon parameters? This can come from some meaningful theoretical basis or can be determined by looking at the k-nearest neighbor distances. We are looking - again! - for an elbow or knee in this plot.

Based on this plot, an epsilon of 0.55 might be a reasonable starting point. Let’s use it to run the dbscan algorithm.

8.5.1 Exercise

Can you interpret the results and compare with the result of the kmeans algorithm? To get a better sense of the importance of setting the ‘right’ epsilon, try to increase and decrease it in steps of 0.05.