Chapter 10 Spatial autocorrelation
In the previous week, we saw that we could combine both the spatial location (as a constraint on the clustering procedure) and the other attributes of that location together in a single clustering algorithm. This is a powerful combination partly because of one of Geography’s central tenets (often referred to as Tobler’s first law):
Everything is related to everything else, but near things are more related than far things.
This notion holds true (to different degrees) for many social processes (I’m sure you can think of many yourself!). Apart from visually inspecting this tendency for similar things (or values) to cluster together in space, we can also directly quantify to what degree a variable is clustered in space. That is, rather than looking for individual clusters in space, we can calculate whether the pattern shows clear signs of clustering in the first place. This is also often referred to as spatial autocorrelation: the degree to which a variable correlates with itself across space. As we will discuss next week, this has major implications for many statistical procedures (e.g., it violates one of the central assumptions in regression analysis) but for now we will just focus on describing the degree to which this might be present in a dataset.
10.2 Lagged average over neighbours
To calculate spatial autocorrelation, we need to walk through some similar steps as we did for the spatial clustering in the previous Chapter. Our ultimate goal is to look at the relation between a variable (in our case remaining lease years) and a spatially lagged version of that variable.
The intuition behind spatial lag is the following: if I didn’t know the average remaining lease for one hexagon, how could I estimate it if I knew the value of that variable for its neighbours? One thing I could do is take the average over all values of the variable for my hexagon’s neighbours, and use that as an estimator for the unknown average remaining lease.
The first step is to construct a list of neighbours for each area. Here we use the same
remaining_lease data as earlier, which you can download here.
readRDS(here::here("data/remaining_lease.rds")) remaining_lease <- as(remaining_lease, 'Spatial') hex_sp <- poly2nb(hex_sp) hex_neighbours <-plot(hex_neighbours, coordinates(hex_sp))
Once we have our list of neighbours, we need to assign a weight to the relationship to each neighbour. For clustering, we used the similarity in ‘data space’ to determine this weight but here we used a weighted approach: if a hexagon has 4 neighbours, each neighbour will weigh 1/4.
Note: In the first assignment, you computed the moving average of a variable over time. “Averaging” in that context meant assigning weights to observations before and after some time \(t\). We are doing the same thing here: we assign weights to neighbouring observations (but here we do it in space, not in time).
nb2listw(hex_neighbours, style="W", zero.policy=TRUE)hex_weights <-
hex_weights in your Environment and observe the
Once we have the weights, we can calculate a spatially lagged version of the variable of interest. In this case this creates an average years of lease left for all the neighbours of a hexagon.
lag.listw is an
spdep function that does just that.
lag.listw(hex_weights, hex_sp$remaining_lease)lease_lag <-
lease_lag is a list of length equal to the number of hexagons we have. We can plot it, but first we must assign each
lease_lag value to its hexagon. Since our
remaining_lease object is a data frame containing all heaxagon indices, their average remaining lease and the geometry of the hexagon, we can simply add the lagged version to it and plot.
remaining_lease %>% lease_lag_df <- add_column(lease_lag = lease_lag) tmap_mode("plot") current.mode <-tm_shape(lease_lag_df) + tm_polygons(col = "lease_lag")
Compare the plot of
lease_lag that we have just produced with the plot of
remaining_lease (you can reuse
lease_lag_df since it has a
remaining_lease column, or try out
tm_facets). What do you observe?
10.3 Moran’s I
Lab practice 10.2
We saw in class that we measure spatial autocorrelation from looking at how the point average (the average in one hexagon, contained in
remaining_lease) varies with the moving average (the average over a neighbourhood, contained in
lease_lag). Let’s plot them both:
ggplot(lease_lag_df) + geom_point(aes(x = remaining_lease, y = lease_lag))
What is this plot telling us?
lease_lagis low too.
lease_lagis high too.
In other words, there is a positive correlation between
To measure this correlation, we compare the
lease_lag with the original
remaining_lease variable by regressing the original variable on the spatially-lagged version. We call this metric the ‘Moran’s I’.
lm(lease_lag ~ remaining_lease, data=lease_lag_df) moran_manual <-moran_manual
## ## Call: ## lm(formula = lease_lag ~ remaining_lease, data = lease_lag_df) ## ## Coefficients: ## (Intercept) remaining_lease ## 35.565 0.504
The coefficient or slope of the line is about 0.51, indicating that there is indeed positive autocorrelation: 0 would mean no autocorrelation, -1 fully negative and 1 fully positive.
ggplot(lease_lag_df, aes(x = remaining_lease, y = lease_lag)) + geom_point() + geom_smooth(method = lm, se = FALSE)
Based on our bootstrapping approach in Chapter 4, can you think of a way to calculate the statistical significance for this coefficient?
We can also use the
moran.test function from the
spdep package to calculate Moran’s I in a more automatic fashion. It should give (almost) the same result as our manual regression analysis. What do we conclude from this number?
moran.test(lease_lag_df$remaining_lease, hex_weights)moran_auto <-
10.4 Local value of autocorrelation
Lab practice 10.3
Just as we can calculate spatial autocorrelation globally, we can also look at this on a more local scale (autocorrelation is very dependent on the scale regardless!). In
spdep Moran’s I has a local counterpart called
localmoran. Instead of a global metric, it will give an indication on how much clustering is present around each polygon and indicates whether the neighbourhood values indicate a ‘cold’ cluster or a ‘hot’ cluster. In our case: cold means fewer years left on the lease!
localmoran(lease_lag_df$remaining_lease, hex_weights) moran_local <-$moran <- moran_local[,1] lease_lag_df tm_shape(lease_lag_df) + tm_polygons(col = "moran", palette = "-RdBu")
Which are cold and hot clusters for the price per square metre variable?