Chapter 9 Spatial clustering

Sessions 9.1 + 9.2

library(tidyverse)
library(factoextra)
library(sf)
library(tmap)

9.1 Cluster analysis on spatial data

In the previous week, we saw the utility of several clustering approaches for finding groups of similar observations in ‘data space’. So far, we have ignored the actual geographic dimension of the underlying data. But, as you might have already intuited, we can apply clustering approaches to explicitly geographic variables as well. To explore this, we will switch back to the HDB resale data that you previously analyzed from a predominantly non-spatial angle. To help you on your way, you can read in the resale data with an added geometry column containing the spatial (point) location of each flat. Find the dataset here and add it to your project’s data folder.

We read in our data and, for now, extract a 5000-row sample to keep things running relatively snappy. Note that we are using the set.seed function: in this Chapter, we need to make sure that the randomness of sample_n is “fixed”, i.e., that we all obtain the same sample every time. The tm_ functions are loaded from the tmap package, which we will use heavily in this Chapter and the remaining ones. The package’s “get started” page is a great place to review basic commands, which we will also reintroduce as we go along.

resale <- readRDS(here::here("data/resale_with_geom.rds"))

set.seed(13)
resale %>% 
  sample_n(5000) %>% # plotting large datasets can take a long time, use a sample to speed things up
  tm_shape() + tm_dots(col = "town")

set.seed(13)
resale_sample <- sample_n(resale, 5000)

In density-based clustering (dbscan), you used the DBSCAN algorithm on two factors extracted with factor analysis. Instead, we can use the explicit x/y location of each flat to look for clustering patterns in space. As HDB estates are not a ‘natural’ phenomenon but are instead planned by the Singapore government, we would certainly expect them to be clustered in space. We first look for an elbow in our k-nearest-neighbour distance plot (st_ functions are part of the sf package).

library(dbscan)
coordinates_resale <- resale_sample %>% st_coordinates()  # we only need a matrix of the actual coordinates for now
kNNdistplot(coordinates_resale, k = 30)
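
To make the eventual choice of threshold easier to see, we can add a horizontal reference line at a candidate epsilon (we will use 800 meters below) on top of the plot we just made. This is a small sketch using base R’s abline():

abline(h = 800, lty = 2) # candidate epsilon: look for the elbow around this distance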

Once we have chosen an appropriate epsilon, we plot our samples over the map.

resale_sample$geographical_cluster <- dbscan(coordinates_resale, eps = 800, minPts = 30) %>% 
  pluck('cluster') %>% as.character()

resale_sample %>% 
  sample_n(500) %>% 
  tm_shape() + tm_text(text = "geographical_cluster", col = "town")

As you can see, if we set the epsilon or distance threshold to 800 meters, we find about 10 clusters (0 is the ‘cluster’ containing the outliers). It is here that we start to see very explicitly that clustering is scale dependent. This notion is very important when dealing with spatial data: what is ‘clustered’ at one scale level might not be clustered at another. We will come back to this conceptual notion a few times in the coming weeks.


9.1.1 Exercise

Try decreasing the epsilon to 600 or even 400 meters, and back up to 1000 meters. What do you see happening? Can you explain what you see? Can you interpret it as another bias-variance trade-off?
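
If you want a quick way to compare settings, the following sketch re-runs DBSCAN for several epsilon values and counts the resulting clusters (excluding the ‘noise’ cluster 0); the interpretation is still up to you:

for (eps in c(400, 600, 800, 1000)) {
  clusters <- dbscan(coordinates_resale, eps = eps, minPts = 30)$cluster
  n_clusters <- length(setdiff(unique(clusters), 0)) # drop cluster 0 (outliers)
  print(paste("eps =", eps, "->", n_clusters, "clusters"))
}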


To illustrate the contrast with using DBSCAN on non-geographic variables, let’s repeat this clustering exercise on the floor_area_sqm, remaining_lease and resale_price variables instead.

resale_attributes <- resale_sample %>% 
  select(floor_area_sqm, remaining_lease, resale_price) %>% 
  st_set_geometry(NULL) %>% 
  scale()

kNNdistplot(resale_attributes, k = 30)

From the plot, we choose epsilon = 0.6.

dbscan(resale_attributes, eps = 0.6, minPts = 30) %>% 
  fviz_cluster(data = resale_attributes)

Can you interpret these results? Can you distinguish clusters? As is often the case with variables related to socio-economic processes, DBSCAN might not be the most useful algorithm. We can compare with a partitioning approach instead.

  • First, we look for the elbow to choose our number of clusters.
fviz_nbclust(resale_attributes, kmeans, method = "wss")

  • We pick 4 clusters and plot the result (note that we cluster over three variables, yet fviz_cluster plots in a 2D space: it uses the first two principal components to do so; see the sketch after this list).
kmeans(resale_attributes, centers = 4, nstart = 50) %>% 
  fviz_cluster(data = resale_attributes)

  • Finally, we plot the cluster each observation belongs to on the map of Singapore.
resale_sample$attributes_cluster <- kmeans(resale_attributes, centers = 4, nstart = 50) %>% 
  pluck('cluster') %>% as.character()

resale_sample %>% 
  sample_n(500) %>% 
  tm_shape() + tm_text(text = "attributes_cluster")
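
As mentioned above, fviz_cluster projects the three clustering variables onto their first two principal components before plotting. A minimal sketch to verify this for yourself with prcomp() (the coordinates should match fviz_cluster’s axes up to sign and scaling choices):

pca_attributes <- prcomp(resale_attributes)
summary(pca_attributes)       # proportion of variance captured by the first two components
head(pca_attributes$x[, 1:2]) # the 2D coordinates underlying the fviz_cluster plot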

The results of the k-means algorithm might be a little more interpretable, and they clearly show some geographic variation as well, even though geographic space has not been considered explicitly by the algorithm. However, if we wanted to draw logical, coherent spatial units (for example, to form the basis of administrative units), this wouldn’t yet be possible. For this process - often referred to as regionalization - it can be useful to combine both spatial and non-spatial characteristics of the area. To do this, in the next sections we will first aggregate our individual flats to larger spatial units (often referred to as ‘basic units’) before performing a spatially-constrained clustering.

9.2 Aggregation to larger spatial units

Video lecture 9.2a + 9.2b

To aggregate to slightly larger spatial units, we will create a grid of small hexagons over Singapore (find out why we use hexagons here (ArcGIS website)). As our flat sales do not cover the entire country, we will read in the planning areas again (the shapefiles are available here: decompress the zip archive and place the files in your project data folder).

If you need a refresher on spatial data structures in R, have a look at chapters 2-4 of Geocomputation With R.

planning_areas <- st_read(here::here("data/MP14_PLNG_AREA_NO_SEA_PL.shp")) %>% 
  filter(!(OBJECTID == 49 | OBJECTID == 18)) # remove islands
## Reading layer `MP14_PLNG_AREA_NO_SEA_PL' from data source `/Users/barnabe/Documents/Research/Teaching/Computation Urban Design/lecturenotes/data/MP14_PLNG_AREA_NO_SEA_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 55 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21
hex_grid <- planning_areas %>%
  st_make_grid(st_bbox(.), square = FALSE, cellsize = 1500) %>% 
  st_sf() %>% 
  mutate(hex_id = row_number())

tm_shape(hex_grid) + tm_polygons()

Once we have our hexagonal spatial units ready, we can spatially join each flat sale to its corresponding grid cell. If you set the seed before creating resale_sample above, you should find that the first observation of resale_sample is a flat sold in “2018-08” in Woodlands, block 897A:

resale_sample %>% head(1)
## Simple feature collection with 1 feature and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 23620.03 ymin: 46468.18 xmax: 23620.03 ymax: 46468.18
## CRS:            +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
## # A tibble: 1 x 14
##   month town  flat_type block street_name storey_range floor_area_sqm flat_model
##   <chr> <chr> <fct>     <chr> <chr>       <fct>                 <dbl> <fct>     
## 1 2018… WOOD… 4 ROOM    897A  WOODLANDS … 10 TO 12                 99 Model A   
## # … with 6 more variables: lease_commence_date <dbl>, remaining_lease <dbl>,
## #   resale_price <dbl>, geometry <POINT [m]>, geographical_cluster <chr>,
## #   attributes_cluster <chr>

If for some reason you have a different first observation, go back and re-run the set.seed(13) and sample_n(resale, 5000) calls above before continuing, so that your results match the ones shown here.

We now join the sample data we have with the hexagon grid.

resale_hex <- st_join(resale_sample, hex_grid) %>% 
  st_set_geometry(NULL)

Note: we set the geometry to NULL here so that the POINT geometry of the flats does not interfere with our analysis later - we don’t need it any longer after the join.
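
Equivalently, sf’s st_drop_geometry() does the same thing and reads a little more naturally; a small sketch if you prefer it:

resale_hex <- st_join(resale_sample, hex_grid) %>% 
  st_drop_geometry()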

We can now use the resulting dataset to create a series of metrics that reflect the situation in each hexagon. For example, we can calculate the average number of years of lease remaining in each hexagon.

remaining_lease <- resale_hex %>% 
  group_by(hex_id) %>% 
  summarise(remaining_lease = mean(remaining_lease)) %>% 
  left_join(hex_grid, .) %>% 
  filter(remaining_lease > 0)

tm_shape(remaining_lease) + tm_polygons(col = "remaining_lease")
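
The same pattern works for any other per-hexagon metric. For example, here is a sketch computing the average price per square meter in each hexagon (one of the variables suggested in the assignment below), following exactly the same steps:

price_sqm <- resale_hex %>% 
  group_by(hex_id) %>% 
  summarise(price_sqm = mean(resale_price / floor_area_sqm)) %>% 
  left_join(hex_grid, .) %>% 
  filter(price_sqm > 0)

tm_shape(price_sqm) + tm_polygons(col = "price_sqm")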

9.3 Spatially-constrained clustering

Video lecture 9.3.1

For the spatially-constrained clustering, we will use the SKATER algorithm. Although it is a bit less efficient than the REDCAP algorithms discussed in the reading for this Chapter, it has a solid implementation in the spdep package for R. The algorithm has a number of key ingredients (most of these are shared with other regionalization/clustering algorithms), which we will construct step by step in the following sections.

We convert our sf object to an sp object first: sp is the precursor to sf, and not all methods in spdep are able to handle sf objects appropriately. We then start by constructing a list of neighbors.

library(spdep)

hex_sp <- as(remaining_lease, 'Spatial')
hex_neighbors <- poly2nb(hex_sp)
## visually inspect neighbors
plot(hex_neighbors, coordinates(hex_sp))

9.3.1 From areal representation to network

As you can see, the northern part of Singapore is disconnected from the other areas. This is simply caused by a lack of HDB resale transactions in the in-between area. However, this is somewhat problematic for the algorithm, as it expects all areas in the dataset to be connected to each other. We solve this manually by connecting the disconnected component to the rest of our hexagons.
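
You can confirm the problem programmatically: n.comp.nb() from spdep reports the number of connected components in a neighbor list (we need this to be 1 before running SKATER). A minimal sketch:

n.comp.nb(hex_neighbors)$nc # number of connected components in the neighbor graph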

To make this manual connection, we will need to run an interactive session. We will go over this together in video lecture 9.3.1, and a rough sketch of the script is included after the list below. Alternatively, if you prefer not to bother, you can jump directly to Section 9.3.2.

Note that you may need to open the R console client (not RStudio) that was downloaded when you installed R. The application is usually called R.

  • Open R.
  • Copy the code here to an empty new script file. For now, only copy the code lying between the two “Start copy” and “End copy” comments.
  • Replace the project path with your own.
  • Paste the code into the R application. Once executed, a window will open.
  • Click on one node of the small component at the top of the window.
  • Click on another node in the large component. You should see the result below.

  • In the console, enter “y” to add the contiguity. Then either enter “c” to keep connecting (not necessary here) or “q” to quit.
  • Now copy the saveRDS command. This saves the modified object into your data folder.
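
For reference, the interactive step relies on spdep’s edit.nb() function. The sketch below is roughly what the linked script does; the exact code and file paths in the provided script may differ, so treat this only as an outline:

# run this in the plain R console, not in RStudio (edit.nb needs an interactive graphics device)
library(spdep)
hex_neighbors_connect <- edit.nb(hex_neighbors, coordinates(hex_sp), polys = hex_sp)
saveRDS(hex_neighbors_connect, here::here("data/hex_neighbors_connect.rds"))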

9.3.2 Building the minimum spanning tree

Video lecture 9.3.2

If you followed the previous Section, you should have the hex_neighbors_connect.rds file in your project data folder. Otherwise, make sure to download it here and place it in your data folder.

We load in this modified hex_neighbors object and plot it.

hex_neighbors <- readRDS(here::here("data/hex_neighbors_connect.rds"))
plot(hex_neighbors, coordinates(hex_sp))
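
With the added link in place, the neighbor graph should now form a single connected component, which we can verify again:

n.comp.nb(hex_neighbors)$nc # should now be 1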

We also need the attribute data to base our clustering on. For now we just use the remaining_lease variable.

cluster_data <- remaining_lease %>% 
  st_set_geometry(NULL) %>%
  select(remaining_lease)

Once all the core ingredients are in place, we can calculate a few other required variables. Make sure to inspect each of the new variables you create to check that you understand the steps here.

# For each edge between neighbors we calculate the (dis)similarity or 'cost' in 'data space'
hex_edge_costs <- nbcosts(hex_neighbors, cluster_data)
# we add the calculated costs to the neighbor list
hex_edge_weights <- nb2listw(hex_neighbors, hex_edge_costs, style="B")
# based on the weighted neighbor list we calculate a minimum spanning tree
hex_mstree <- mstree(hex_edge_weights)
plot(hex_mstree, coordinates(hex_sp))
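
Before running SKATER, it can help to inspect the tree itself: mstree() returns a matrix with one row per edge and three columns, giving the two nodes each edge connects and the cost of that edge. A quick look:

head(hex_mstree) # columns: from node, to node, edge cost
dim(hex_mstree)  # the number of edges equals the number of hexagons minus 1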

9.3.3 Clustering with SKATER

Video lecture 9.3.3

It is this minimum spanning tree that is the essential ingredient in the SKATER algorithm. We only need the first two columns of the tree (defining the “from” and “to” of each edge in the tree), the cluster data and the number of cuts (the number of clusters will be equal to the number of cuts + 1).

hex_skater_k6 <- skater(hex_mstree[,1:2], 
                        cluster_data, 
                        ncuts = 5)

remaining_lease$cluster <- as.character(hex_skater_k6$groups)
tm_shape(remaining_lease) + tm_polygons(col = "cluster")
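
It is worth checking how balanced the resulting regions are. The groups element of the SKATER result gives the cluster membership of each hexagon, so a simple frequency table shows the cluster sizes:

table(hex_skater_k6$groups) # number of hexagons in each of the 6 clusters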

The SKATER algorithm has a number of useful additional constraints. For example, we can set the minimum size of each cluster to be 10 hexagons, like so:

hex_skater_k6_crit <- skater(hex_mstree[,1:2], 
                        cluster_data, 
                        ncuts = 5,
                        crit = 10)
remaining_lease$cluster <- as.character(hex_skater_k6_crit$groups)
tm_shape(remaining_lease) + tm_polygons(col = "cluster")
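
Tabulating the cluster memberships again confirms the effect of the constraint:

table(hex_skater_k6_crit$groups) # every cluster should now contain at least 10 hexagons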

9.4 Assignment 4 (due Monday, April 13th, 23:59)

For your fourth assignment, you need to write a short report on spatial clustering. You are also encouraged to submit the project update described in the second subsection below.

9.4.1 Spatially-constrained clustering

In Sections 9.2 and 9.3, we have discussed how to cluster neighbourhoods of Singapore based on the remaining_lease variable. Repeat the analysis with at least 2 of the following 4 variables:

  • Average price per square meter
  • % of 3 room flats relative to total number of flats
  • Average floor area
  • Average number of years remaining on the lease

Focus on how you would set the parameters of the SKATER algorithm (e.g., the number of cuts or minimum number of hexagons). How does varying the parameters change the results you obtain? Can you explain the changes you observe?

9.4.2 Final project update

By the time you submit this assignment, you should already have obtained most of the data you will use for your project. The first part of your project should consist of Exploratory Data Analysis (EDA), much like you have done for your first assignment.

To start with your project, do the following steps:

  1. You should already have a project folder with your proposal in your RStudio project. Create a new RMarkdown file called project.Rmd.

  2. Use your project title for the title of project.Rmd. Don’t forget to change the output to github_document, unless you have interactive maps, in which case you can leave html_document (or output both).

  3. Get started on the EDA section of your project. EDA usually features (at least) the following:

  • A description of your datasets (typically, one dataset = one csv file). For each dataset:
    • Name of the dataset.
    • Schema of the dataset: list of variables and their types.
      • What is the variable?
      • If categorical variable: is it nominal? ordinal? What are the levels?
      • If continuous variable: what is the unit?
    • How was the data collected? What is the source?
      • Sampling procedure of the dataset (is it a survey? how were respondents sampled?)
    • Centrality, dispersion, higher moments of the variables (see also Chapter 3). What are the notable features of your data?
  • New methods we have seen in class:
    • Visualise data with MDS (either the “plain vanilla” version, or t-SNE/umap).
    • Extract and interpret components with PCA.
    • Find clusters.

Note that if you have spatial data (you most likely do), you can plot your data on a map using the tmap library we used in this Chapter. You are free to reuse any dataset from the class, including the planning areas shapefile MP14_PLNG_AREA_NO_SEA_PL.shp.

Note also that this part of the assignment is ungraded. The purpose of this update is to help you plan your work and complete tangible parts of your project ahead of the final deadline. If you have completed this update, leave a note that you have done the update in your submission issue on Github and I will provide comments on the update.

9.4.3 Submitting the assignment

  • Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your data folder with the here library). It should be saved in the assignments/ folder, as assignment4.Rmd.
  • Change the output parameter to github_document so that your assignment is rendered visually on Github.
  • Do not forget to run styler on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
  • Once you are finished, navigate to your own Github repository at https://github.com/02522-cua/[your-name]. Create a new issue, title it ‘Assignment 4 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include @barnabemonnot in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

  1. The Rmarkdown does not run and there’s incomplete description and contextualization.
  2. The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output.
  3. The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited.
  4. The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
  5. As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.