Chapter 9 Spatial clustering
library(tidyverse) library(factoextra) library(sf) library(tmap)
9.1 Cluster analysis on spatial data
In the previous week, we have seen the utility of several clustering approaches for finding groups of similar observations in ‘data space’. So far, we have ignored the actual geographic dimension of the underlying data. But, as you might have already intuited, we can apply clustering approaches to explicitly geographic variables as well. To explore that, we will switch back to analyzing the HDB resale data that you have analyzed previously from a predominantly non-spatial angle. To help you on your way, you can read in the resale data with an added
geometry column containing the spatial (point) location of each flat. Find the dataset here and add it to your project’s
We read in our data and, for now, extract a 5000 row sample to keep things running relatively snappy. Note that we are using the
set.seed function: in this Chapter, we will need to make sure that the randomness of
sample_n is “fixed”, i.e., that we all always obtain the same sample. The
tm_ functions are loaded from the
tmap package. Note that we will make heavy use of
tmap in this Chapter and the remaining ones. The package’s “get started” page is a great place to review basic commands, which we will also reintroduce as we go along.
readRDS(here::here("data/resale_with_geom.rds")) resale <- set.seed(13) %>% resale sample_n(5000) %>% # plotting large datasets can take a long time, use a sample to speed things up tm_shape() + tm_dots(col = "town")
set.seed(13) sample_n(resale, 5000)resale_sample <-
In density-based clustering (dbscan), you used the DBSCAN algorithm on two factors extracted with factor analysis. Instead, we can use the explicit x/y location of each flat to look for clustering pattern in space. As HDB estates are not a ‘natural’ phenomenon but are instead planned by the Singapore governement, we certainly would expect them to be clustered in space. We first look for an elbow in our k-nearest-neighbour distance plot (
st_ functions are part of the
library(dbscan) resale_sample %>% st_coordinates() # we only need a matrix of the actual coordinates for now coordinates_resale <-kNNdistplot(coordinates_resale, k = 30)
Once we have chosen an appropriate epsilon, we plot our samples over the map.
$geographical_cluster <- dbscan(coordinates_resale, eps = 800, minPts = 30) %>% resale_sample pluck('cluster') %>% as.character() %>% resale_sample sample_n(500) %>% tm_shape() + tm_text(text = "geographical_cluster", col = "town")
As you can see, if we set the epsilon or distance threshold to 800 meters we find about 10 clusters (
0 is the ‘cluster’ with outliers). It is here that we start to see very explicitely that clustering is scale dependendent. This notion is very important when dealing with spatial data: what is ‘clustered’ at one scale level might not be clustered on another. We will get back to this conceptual notion a few times in the coming weeks.
Try decreasing the epsilon to 600 or even 400 meters, and back up to 1000 meters. What do you see happening? Can you explain what you see? Can you interpret it as another bias-variance trade-off?
To illustrate the contrast with using DBSCAN on non-geographic variables instead, let’s repeat this clustering exercise on the
resale_sample %>% resale_attributes <- select(floor_area_sqm, remaining_lease, resale_price) %>% st_set_geometry(NULL) %>% scale() kNNdistplot(resale_attributes, k = 30)
From the plot, we choose epsilon = 0.6.
dbscan(resale_attributes, eps = 0.6, minPts = 30) %>% fviz_cluster(data = resale_attributes)
Can you interpret these results? Can you distinguish clusters? As is often the case with variables related to socio-economic processes, DBSCAN might not be the most useful algorithm. We can compare with a partitioning approach instead.
- First, look for the elbow to choose our number of clusters.
fviz_nbclust(resale_attributes, kmeans, method = "wss")
- We pick 4 clusters and plot the result (note that we cluster over three variables, yet
fviz_clusterplots in a 2D space: it is using the first two principal components to do so).
kmeans(resale_attributes, centers = 4, nstart = 50) %>% fviz_cluster(data = resale_attributes)
- Finally, we plot on the geography of Singapore the cluster that our observations belong to.
$attributes_cluster <- kmeans(resale_attributes, centers = 4, nstart = 50) %>% resale_sample pluck('cluster') %>% as.character() %>% resale_sample sample_n(500) %>% tm_shape() + tm_text(text = "attributes_cluster")
The results of the k-means algorithm might be a little more interpretable and they clearly show some specific geographic variation as well despite the fact that geographic space hasn’t been considered explicitly in the decisions of the algorithm. However, if we were to draw logical, coherent spatial units (for example, to form the basis of administrative units) this wouldn’t necessarily be possible yet. For this process - often referred to as regionalization - it can be useful to combine both spatial and non-spatial characteristics of the area. To do this, in the next sections we will first aggregate our individual flats to larger spatial units (often referred to as ‘basic units’) before performing a spatially-constrained clustering.
9.2 Aggregation to larger spatial units
To aggregate to slightly larger spatial units, we will create a grid of small hexagons over Singapore (find out why we use hexagons here (ArcGIS website)). As our flat sales do not cover the entire country, we will read in the planning areas again (the shapefiles are available here: decompress the zip archive and place the files in your project
If you need a refresher on spatial data structures in R, have a look at chapter 2-4 from Geocomputation With R.
st_read(here::here("data/MP14_PLNG_AREA_NO_SEA_PL.shp")) %>% planning_areas <- filter(!(OBJECTID == 49 | OBJECTID == 18)) # remove islands
## Reading layer `MP14_PLNG_AREA_NO_SEA_PL' from data source `/Users/barnabe/Documents/Research/Teaching/Computation Urban Design/lecturenotes/data/MP14_PLNG_AREA_NO_SEA_PL.shp' using driver `ESRI Shapefile' ## Simple feature collection with 55 features and 12 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33 ## projected CRS: SVY21
planning_areas %>% hex_grid <- st_make_grid(st_bbox(.), square = FALSE, cellsize = 1500) %>% st_sf() %>% mutate(hex_id = row_number()) tm_shape(hex_grid) + tm_polygons()
Once we have our hexagon spatial units ready we can spatially join each flat sales to its corresponding grid cell. If you set the seed before creating
resale_sample above, you should find that the first observation of
resale_sample is a flat sold on “2018-08” in Woodlands, block 897A:
## Simple feature collection with 1 feature and 13 fields ## geometry type: POINT ## dimension: XY ## bbox: xmin: 23620.03 ymin: 46468.18 xmax: 23620.03 ymax: 46468.18 ## CRS: +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs ## # A tibble: 1 x 14 ## month town flat_type block street_name storey_range floor_area_sqm flat_model ## <chr> <chr> <fct> <chr> <chr> <fct> <dbl> <fct> ## 1 2018… WOOD… 4 ROOM 897A WOODLANDS … 10 TO 12 99 Model A ## # … with 6 more variables: lease_commence_date <dbl>, remaining_lease <dbl>, ## # resale_price <dbl>, geometry <POINT [m]>, geographical_cluster <chr>, ## # attributes_cluster <chr>
If for some reason you have a different first observation:
- Download the
- Place it in your
- Run the following command:
We now join the sample data we have with the hexagon grid.
st_join(resale_sample, hex_grid) %>% resale_hex <- st_set_geometry(NULL)
Note: we set the geometry to
NULL here so that the
POINT geometry of the flats does not interfere with our analysis later - we don’t need it any longer after the join.
We can now use the resulting dataset to create a series of metrics that reflect the situation in each hexagon. For example, to calculate the average years of lease remaining in each hexagon.
resale_hex %>% remaining_lease <- group_by(hex_id) %>% summarise(remaining_lease = mean(remaining_lease)) %>% left_join(hex_grid, .) %>% filter(remaining_lease > 0) tm_shape(remaining_lease) + tm_polygons(col = "remaining_lease")
9.3 Spatially-constrained clustering
Video lecture 9.3.1
For the spatially constrained clustering, we will use the SKATER algorithm. Although it is a bit less efficient than the REDCAP algorithms discussed in the reading for this Chapter, it has a solid implementation in the
spdep package for R. The algorithm has a number of key ingredients (most of these are shared with other regionalization/clustering algorithms)
We convert our
sf object to an
sp object first.
sp is the pre-cursor to
sf and not all methods in
spdep are able to handle
sf objects appropriately. We then start with constructing a list of neighbors.
library(spdep) as(remaining_lease, 'Spatial') hex_sp <- poly2nb(hex_sp) hex_neighbors <-## visually inspect neighbors plot(hex_neighbors, coordinates(hex_sp))
9.3.1 From areal representation to network
As you can see the Northern part of Singapore is disconnected from the other areas. This is simply caused by a lack of HDB resale transactions in the in-between area. However, for the algorithm this is somewhat problematic as it expects all areas in the dataset to be connected to each other. We solve this manually by connecting it to the rest of our hexagons.
To do so, we will need to run an interactive session. We will go over this together in video lecture 9.3.1. Alternatively, if you prefer not to bother, you can jump directly to Section 9.3.2.
Note that you may need to open the R console client (not RStudio) that was downloaded when you installed R. The application is usually called R.
- Open R.
- Copy the code here to an empty new script file. For now, only copy the code lying between the two “Start copy” and “End copy” comments.
- Replace the project path with your own.
- Paste the code into the R application. Once executed, a window will open.
- Click on one node of the small component at the top of the window.
- Click on another node in the large component. You should see the result below.
- In the console, enter “y” to add contiguity. Then either enter “c” to keep connecting (it is not necessary) or “q” to quit.
- Now copy the
saveRDScommand. This saves the modified object into your
9.3.2 Building the minimum spanning tree
Video lecture 9.3.2
If you followed the previous Section, you should have the
hex_neighbors_connect.rds file in your project
data folder. Otherwise, make sure to download it here and place it in your
We load in this
hex_neighbors and plot.
readRDS(here::here("data/hex_neighbors_connect.rds")) hex_neighbors <-plot(hex_neighbors, coordinates(hex_sp))
We also need the attribute data to base our clustering on. For now we just use the
remaining_lease %>% cluster_data <- st_set_geometry(NULL) %>% select(remaining_lease)
Once all the core ingredients are in place we can calculate a few other required variables. Make sure to look into each of the new variables you get to check whether you understand the steps here.
# For each edge between neighbors we calculate the (dis)similarity or 'cost' in 'data space' nbcosts(hex_neighbors, cluster_data) hex_edge_costs <-# we add the calculated costs to neighbor list nb2listw(hex_neighbors, hex_edge_costs, style="B") hex_edge_weights <-# based on the weighted neighbor list we calculate a minimum spanning tree mstree(hex_edge_weights) hex_mstree <-plot(hex_mstree, coordinates(hex_sp))
9.3.3 Clustering with SKATER
Video lecture 9.3.3
It is this minimum spanning tree that is the essential ingredient in the SKATER algorithm. We only need the first two columns of the tree (defining the “from” and “to” of each edge in the tree), the cluster data and the number of cuts (the number of clusters will be equal to the number of cuts + 1).
skater(hex_mstree[,1:2], hex_skater_k6 <- cluster_data, ncuts = 5) $cluster <- as.character(hex_skater_k6$groups) remaining_leasetm_shape(remaining_lease) + tm_polygons(col = "cluster")
The SKATER algorithm has a number of useful additional constraints. For example we can set the minimum size of the cluster to be 10 hexagons, like so:
skater(hex_mstree[,1:2], hex_skater_k6_crit <- cluster_data, ncuts = 5, crit = 10) $cluster <- as.character(hex_skater_k6_crit$groups) remaining_leasetm_shape(remaining_lease) + tm_polygons(col = "cluster")
9.4 Assignment 4 (due Monday, April 13th, 23:59)
For your fourth assignment, you need to write a short report on spatial clustering. You are also encouraged to submit the project update described in the second subsection below.
9.4.1 Spatially-constrained clustering
In Secions 9.2 and 9.3, we have discussed how to cluster neighbourhoods of Singapore based on the
remaining_lease variable. Repeat the analysis with at least 2 out of the 4 following variables:
- Average price per square meter
- % of 3 room flats relative to total number of flats
- Average floor area
- Average number of years remaining on the lease
Focus on how you would set the parameters of the SKATER algorithm (e.g., the number of cuts or minimum number of hexagons). How does varying the parameters change the results you obtain? Can you explain the changes you observe?
9.4.2 Final project update
By the submission of this assignment, you should already have obtained most of the data you will use for your project. The first part of your project should consist of Exploratory Data Analysis (EDA), much like you have done for your first assignment.
To start with your project, do the following steps:
You should already have a
projectfolder with your proposal in your RStudio project. Create a new RMarkdown file called
Use your project title for the title of
project.Rmd. Don’t forget to change
output: github_document, unless you have interactive maps, in which case you can leave
html_document(or output both).
Get started on the EDA section of your proejct. EDA usually features (at least) the following:
- A description of your datasets (typically, one dataset = one csv file). For each dataset:
- Name of the dataset.
- Schema of the dataset: list of variables and their types.
- What is the variable?
- If categorical variable: is it nominal? ordinal? What are the levels?
- If continuous variable: what is the unit?
- How was the data collected? What is the source?
- Sampling procedure of the dataset (is it a survey? how were respondents sampled?)
- Centrality, dispersion, higher moments of the variables (see also Chapter 3). What are the notable features of your data?
- New methods we have seen in class:
- Visualise data with MDS (either the “plain vanilla” version, or
- Extract and interpret components with PCA.
- Find clusters.
- Visualise data with MDS (either the “plain vanilla” version, or
Note that if you have spatial data (you most likely do), you can plot your data on a map using the
tmap library we used in this Chapter. You are free to reuse any dataset from the class, including the planning areas shapefile
Note also that this part of the assignment is ungraded. The purpose of this update is to help you plan your work and complete tangible parts of your project ahead of the final deadline. If you have completed this update, leave a note that you have done the update in your submission issue on Github and I will provide comments on the update.
9.4.3 Submitting the assignment
- Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your
datafolder with the
herelibrary). It should be saved in the
- Change the
github_documentso that your assignment is rendered visually representation on Github.
- Do not forget to run
styleron your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
- Once you are finished, navigate to your own Github repository at
https://github.com/02522-cua/[your-name]. Create a new issue, title it ‘Assignment 4 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include
@barnabemonnotin your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!
Your assignments will be graded on a 5-point scale, according to this rubric:
- The Rmarkdown does not run and there’s incomplete description and contextualization
- The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output
- The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited
- The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
- As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.
Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.