Chapter 10 Spatial autocorrelation

Session 10

10.1 Introduction

In the previous week, we saw that we could combine both the spatial location (as a constraint on the clustering procedure) and the other attributes of that location together in a single clustering algorithm. This is a powerful combination partly because of one of Geography’s central tenets (often referred to as Tobler’s first law):

Everything is related to everything else, but near things are more related than far things.

This notion holds true (to different degrees) for many social processes (I’m sure you can think of many yourself!). Apart from visually inspecting this tendency for similar things (or values) to cluster together in space, we can also directly quantify to what degree a variable is clustered in space. That is, rather than looking for individual clusters in space, we can calculate whether the pattern shows clear signs of clustering in the first place. This is also often referred to as spatial autocorrelation: the degree to which a variable correlates with itself across space. As we will discuss next week, this has major implications for many statistical procedures (e.g., it violates one of the central assumptions in regression analysis) but for now we will just focus on describing the degree to which this might be present in a dataset.

10.2 Lagged average over neighbours

Lab practices 10.0 + 10.1

To calculate spatial autocorrelation, we need to walk through some similar steps as we did for the spatial clustering in the previous Chapter. Our ultimate goal is to look at the relation between a variable (in our case remaining lease years) and a spatially lagged version of that variable.

The intuition behind spatial lag is the following: if I didn’t know the average remaining lease for one hexagon, how could I estimate it if I knew the value of that variable for its neighbours? One thing I could do is take the average over all values of the variable for my hexagon’s neighbours, and use that as an estimator for the unknown average remaining lease.

The first step is to construct a list of neighbours for each area. Here we use the same remaining_lease data as earlier, which you can download here.

remaining_lease <- readRDS(here::here("data/remaining_lease.rds"))
hex_sp <- as(remaining_lease, 'Spatial')
hex_neighbours <- poly2nb(hex_sp)
plot(hex_neighbours, coordinates(hex_sp))

Once we have our list of neighbours, we need to assign a weight to the relationship to each neighbour. For clustering, we used the similarity in ‘data space’ to determine this weight but here we used a weighted approach: if a hexagon has 4 neighbours, each neighbour will weigh 1/4.

Note: In the first assignment, you computed the moving average of a variable over time. “Averaging” in that context meant assigning weights to observations before and after some time \(t\). We are doing the same thing here: we assign weights to neighbouring observations (but here we do it in space, not in time).

hex_weights <- nb2listw(hex_neighbours, style="W", zero.policy=TRUE)

Open hex_weights in your Environment and observe the weights attribute!

Once we have the weights, we can calculate a spatially lagged version of the variable of interest. In this case this creates an average years of lease left for all the neighbours of a hexagon. lag.listw is an spdep function that does just that.

lease_lag <- lag.listw(hex_weights, hex_sp$remaining_lease)

lease_lag is a list of length equal to the number of hexagons we have. We can plot it, but first we must assign each lease_lag value to its hexagon. Since our remaining_lease object is a data frame containing all heaxagon indices, their average remaining lease and the geometry of the hexagon, we can simply add the lagged version to it and plot.

lease_lag_df <- remaining_lease %>%
  add_column(lease_lag = lease_lag)

current.mode <- tmap_mode("plot")
tm_shape(lease_lag_df) +
  tm_polygons(col = "lease_lag")

10.2.1 Exercise

Compare the plot of lease_lag that we have just produced with the plot of remaining_lease (you can reuse lease_lag_df since it has a remaining_lease column, or try out tm_facets). What do you observe?

10.3 Moran’s I

Lab practice 10.2

We saw in class that we measure spatial autocorrelation from looking at how the point average (the average in one hexagon, contained in remaining_lease) varies with the moving average (the average over a neighbourhood, contained in lease_lag). Let’s plot them both:

ggplot(lease_lag_df) +
  geom_point(aes(x = remaining_lease, y = lease_lag))

What is this plot telling us?

  • When remaining_lease is low, lease_lag is low too.
  • When remaining_lease is high, lease_lag is high too.

In other words, there is a positive correlation between remaining_lease and lease_lag!

To measure this correlation, we compare the lease_lag with the original remaining_lease variable by regressing the original variable on the spatially-lagged version. We call this metric the ‘Moran’s I’.

moran_manual <- lm(lease_lag ~ remaining_lease, data=lease_lag_df)
## Call:
## lm(formula = lease_lag ~ remaining_lease, data = lease_lag_df)
## Coefficients:
##     (Intercept)  remaining_lease  
##          35.565            0.504

The coefficient or slope of the line is about 0.51, indicating that there is indeed positive autocorrelation: 0 would mean no autocorrelation, -1 fully negative and 1 fully positive.

       aes(x = remaining_lease, y = lease_lag)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

10.3.1 Exercise

Based on our bootstrapping approach in Chapter 4, can you think of a way to calculate the statistical significance for this coefficient?

We can also use the moran.test function from the spdep package to calculate Moran’s I in a more automatic fashion. It should give (almost) the same result as our manual regression analysis. What do we conclude from this number?

moran_auto <- moran.test(lease_lag_df$remaining_lease, hex_weights)

10.4 Local value of autocorrelation

Lab practice 10.3

Just as we can calculate spatial autocorrelation globally, we can also look at this on a more local scale (autocorrelation is very dependent on the scale regardless!). In spdep Moran’s I has a local counterpart called localmoran. Instead of a global metric, it will give an indication on how much clustering is present around each polygon and indicates whether the neighbourhood values indicate a ‘cold’ cluster or a ‘hot’ cluster. In our case: cold means fewer years left on the lease!

moran_local <- localmoran(lease_lag_df$remaining_lease, hex_weights)
lease_lag_df$moran <- moran_local[,1]

tm_shape(lease_lag_df) +
  tm_polygons(col = "moran", palette = "-RdBu")

10.4.1 Exercise

Which are cold and hot clusters for the price per square metre variable?