Chapter 11 Spatial regression

Sessions 11.1 + 11.2

11.1 Introduction

In the previous Chapter, you looked at the concept of spatial autocorrelation from an exploratory perspective. As already foreshadowed, the same concept also plays a very specific role when conducting regression analyses on spatial data. When observations are correlated in space, this can (but does not always) result in the violation of several key assumptions underlying regression analysis. Luckily, this can be corrected for by explicitly bringing a spatial component (i.e., a model term) into the regression. We will explore this type of analysis by building a model of HDB resale prices across Singapore. We will try to explain resale prices not only by characteristics of the flats but also by the socio-economic characteristics of the underlying planning area.

11.2 Setting up the regression

Lab practices 11.0 + 11.1

To start, we load the shapefile of the planning areas and join it to the socio-economic data we also used in Chapter 6 (with MDS). We only keep planning areas for which we have data.
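
If you are starting from a fresh R session, this chapter relies on the same packages as the previous chapters. A minimal sketch of the libraries assumed to be loaded:

library(tidyverse)  # read_csv, dplyr verbs, ggplot2
library(sf)         # vector data handling
library(tmap)       # thematic maps
library(spdep)      # neighbour lists, spatial weights, Moran's I
library(spatialreg) # lagsarlm, errorsarlm, impacts
library(broom)      # tidy, glance, augment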

planning_areas_sf <- st_read(here::here("data/MP14_PLNG_AREA_NO_SEA_PL.shp")) %>% 
  filter(!(OBJECTID == 49 | OBJECTID == 18)) # remove islands
## Reading layer `MP14_PLNG_AREA_NO_SEA_PL' from data source `/Users/barnabe/Documents/Research/Teaching/Computation Urban Design/lecturenotes/data/MP14_PLNG_AREA_NO_SEA_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 55 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21
planning_areas <- read_csv(here::here("data/planning_areas.csv")) %>% 
  left_join(planning_areas_sf, ., by = c("PLN_AREA_N" = "planning_area")) %>% 
  filter(area_hdb > 0) %>%
  st_buffer(0) # zero-width buffer: a common trick to fix invalid geometries in the shapefile

tm_shape(planning_areas) + tm_polygons(col = 'area_hdb')

We want to analyse housing prices at a slightly finer level of detail than the planning area. To do so, we construct a hexagonal grid with a cell size of 1,500 metres.

hex_grid <- planning_areas %>%
  st_make_grid(st_bbox(.), square = FALSE, cellsize = 1500) %>% 
  st_sf() %>% 
  mutate(hex_id = row_number())

tm_shape(hex_grid) + tm_polygons()

We can join information from each planning area to the corresponding hexagon. Because some hexagons might intersect with multiple planning areas, we set the largest = TRUE argument so that each hexagon takes the information from the planning area with which it has the largest overlap.

hex_grid <- st_join(hex_grid, planning_areas, largest = TRUE)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
tm_shape(hex_grid) + tm_polygons(col = 'area_hdb')

Great! We can now load the resale data and join each flat to its hexagon, similar to how we did this in the last two weeks. Note: we set the geometry to NULL here so that the POINT geometry of the flats does not interfere with our analysis later; we no longer need it after the join.

resale <- readRDS(here::here("data/resale_with_geom.rds"))
resale_hex <- st_join(resale, hex_grid) %>%
  st_set_geometry(NULL)

Based on this tibble, we can now create a series of metrics that reflect the situation in each hexagon. In addition to the mean price per square meter, we also calculate the average floor area and the average years remaining on the lease for each hexagon.

hex_grid <- resale_hex %>%
  mutate(price_sqm = resale_price / floor_area_sqm) %>%
  group_by(hex_id) %>%
  summarise(mean_price = mean(price_sqm),
            mean_lease = mean(remaining_lease),
            mean_floor_area = mean(floor_area_sqm)) %>%
  left_join(hex_grid, .) %>%
  filter(mean_price > 0)

tm_shape(hex_grid) + tm_polygons(col = "mean_price")

11.3 A naive linear model

Lab practice 11.2

With the hex_grid object, we now have all the information to start constructing a model of HDB resale prices. We will start by creating a simple linear model where we try to predict the mean price in each hexagon with the % of households with an income lower than 2000, the % of people with a university degree, and the average number of years left on the lease. We estimate the model using OLS (with the built-in lm function) and inspect the results with tidy and glance from the broom package, as we have done in Chapter 5.

price_ols <- lm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease, data = hex_grid)
glance(price_ols)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.764         0.759  493.      141. 1.25e-40     3 -1019. 2048. 2062.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

On the surface, this does not look like a bad model! The R-squared is around 0.76, which is pretty high considering we only have limited data available at the hexagon scale.

tidy(price_ols)
## # A tibble: 4 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)        -1659.     542.       -3.06 2.69e- 3
## 2 hh_income_lt_2000  11866.     743.       16.0  1.90e-32
## 3 edu_university      6254.     508.       12.3  1.47e-23
## 4 mean_lease            35.0      5.78      6.05 1.43e- 8

11.3.0.1 Exercise

Can you interpret the effect size and direction of each of the coefficients? Hint: closer to the city centre, a larger share of flats are small, single-person flats, hence hh_income_lt_2000 tends to be higher there.


11.3.1 Residuals of the model

Lab practice 11.3

Whenever we are dealing with spatial data, it is good practice to plot the residuals of your model on a map. This allows you to quickly determine if there is any spatial clustering present in the residuals (there shouldn’t be!).

hex_ols <- augment(price_ols, data = hex_grid) %>%
  st_as_sf()

tm_shape(hex_ols) + tm_polygons(col = ".resid", palette = "RdBu")

What do you think? Are the residuals randomly distributed?

We can assess this more quantitatively by comparing the residuals with a lagged version of the residuals, as we have done in the previous Chapter.

hex_sp <- as(hex_ols, 'Spatial')
hex_neighbors <- poly2nb(hex_sp)
hex_weights <- nb2listw(hex_neighbors, style = "W", zero.policy = TRUE)
hex_ols$.resid_lag <- lag.listw(hex_weights, hex_ols$.resid)
ggplot(hex_ols, aes(.resid, .resid_lag)) + geom_point() + geom_smooth()

lm(.resid_lag ~ .resid, data = hex_ols)
## 
## Call:
## lm(formula = .resid_lag ~ .resid, data = hex_ols)
## 
## Coefficients:
## (Intercept)       .resid  
##      30.356        0.273

Remember, this is the same as conducting a Moran’s I test, so we can do this in one fell swoop too.

moran.test(hex_ols$.resid, hex_weights)
## 
##  Moran I test under randomisation
## 
## data:  hex_ols$.resid  
## weights: hex_weights    
## 
## Moran I statistic standard deviate = 4.5642, p-value = 2.507e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.272953968      -0.007518797       0.003776109

The Moran’s I is significantly greater than zero. This indicates that our residuals are positively autocorrelated, which is a problem: independence of the residuals is a condition for using regression for inference (see, for instance, Modern Dive 10.3)!
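
As an aside (not required here), spdep also provides lm.morantest(), which tests for residual autocorrelation directly on a fitted lm object and adjusts the expectation and variance of Moran’s I for the fact that residuals are estimated quantities. A minimal sketch:

lm.morantest(price_ols, listw = hex_weights)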

11.4 Two spatial models

When the residuals are NOT randomly distributed in space, we need to bring the spatial component into the model in a more explicit way. Generally speaking, there are two ways of doing this. First, spatial lag models are based on the idea that the dependent variable in one location depends on the values of the dependent variable in neighbouring locations (the strength of this dependence is often referred to as rho). Second, spatial error models are based on the idea that the model errors of different observations are (partially) spatially dependent (often referred to as lambda), typically because of some other (unmeasured) spatial covariate. Although there are some statistical tests that you can do to determine which type of model is better, ultimately this choice needs to be based on your theoretical understanding of the modelled process.
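
In matrix notation, with \(W\) the spatial weights matrix, the two specifications can be written as (these are the standard textbook formulations, not derived in class):

\[ y = \rho W y + X \beta + \varepsilon \qquad \text{(spatial lag model)} \]

\[ y = X \beta + u, \qquad u = \lambda W u + \varepsilon \qquad \text{(spatial error model)} \]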

Whichever model we choose to use, we will always eventually check whether the residuals still exhibit any form of spatial dependence.

11.4.1 Spatial lag model

Lab practice 11.4.1

In the case of housing values, we could argue that the housing prices in one location are influenced by the surrounding housing prices. So let’s start with creating a spatial lag model. We can use the lagsarlm function from the spatialreg package (historically part of spdep) to do so.

price_lag <- lagsarlm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease,
                      data = hex_grid,
                      listw = hex_weights)
summary(price_lag)
## 
## Call:lagsarlm(formula = mean_price ~ hh_income_lt_2000 + edu_university + 
##     mean_lease, data = hex_grid, listw = hex_weights)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1557.057  -269.052   -31.263   292.198  1382.822 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                     Estimate Std. Error z value  Pr(>|z|)
## (Intercept)       -2230.3114   449.1756 -4.9653 6.858e-07
## hh_income_lt_2000  6651.7726   919.2316  7.2362 4.614e-13
## edu_university     3761.7596   536.6864  7.0092 2.396e-12
## mean_lease           32.0712     4.9985  6.4161 1.398e-10
## 
## Rho: 0.50774, LR test value: 36.294, p-value: 1.6967e-09
## Asymptotic standard error: 0.066144
##     z-value: 7.6763, p-value: 1.6431e-14
## Wald statistic: 58.925, p-value: 1.6431e-14
## 
## Log likelihood: -1000.849 for lag model
## ML residual variance (sigma squared): 166840, (sigma: 408.45)
## Number of observations: 134 
## Number of parameters estimated: 6 
## AIC: 2013.7, (AIC for lm: 2048)
## LM test for residual autocorrelation
## test value: 1.9792, p-value: 0.15948

We can see that the rho is highly significant! However, reading the coefficients of spatial lag models isn’t as straightforward anymore because we now have what are often referred to as spill-over effects. Because observations depend on each other, changing the value of a variable in one region can have a ripple effect throughout the study area. To gain insight into this, we can use the impacts function from the spatialreg package to split the total effect of each variable into its direct and indirect parts.

impacts(price_lag, listw = hex_weights)
## Impact measures (lag, exact):
##                       Direct   Indirect       Total
## hh_income_lt_2000 7227.03535 6285.68837 13512.72372
## edu_university    4087.08637 3554.72888  7641.81526
## mean_lease          34.84481   30.30615    65.15095

Just as we did with the OLS model, we can check the residuals for any spatial correlation.

hex_grid$resid_lagsarlm <- residuals(price_lag)
tm_shape(hex_grid) + tm_polygons(col = "resid_lagsarlm", palette = "RdBu")

moran.test(hex_grid$resid_lagsarlm, hex_weights)
## 
##  Moran I test under randomisation
## 
## data:  hex_grid$resid_lagsarlm  
## weights: hex_weights    
## 
## Moran I statistic standard deviate = 1.0585, p-value = 0.1449
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.057200593      -0.007518797       0.003738136

Spatial autocorrelation has largely disappeared. If it had not, that would have been a good indication that a spatial error model might be more appropriate.

11.4.2 Spatial error model

Lab practice 11.4.2

Let’s now see how to set up and run a spatial error model. The procedure is very similar to that of the spatial lag model and the naive linear regression: fit the model, then interpret the results!

price_error <- errorsarlm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease,
                          data = hex_grid, listw = hex_weights)
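
To read off the estimated coefficients and the lambda parameter, we can inspect the fitted model (output not shown here):

summary(price_error)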

The lambda is around 0.8. Let’s see how our residuals are distributed.

hex_grid$resid_error <- residuals(price_error)
tm_shape(hex_grid) + tm_polygons(col = "resid_error", palette = "RdBu")

Finally, let’s check the Moran’s I.

moran.test(hex_grid$resid_error, hex_weights)
## 
##  Moran I test under randomisation
## 
## data:  hex_grid$resid_error  
## weights: hex_weights    
## 
## Moran I statistic standard deviate = 0.051575, p-value = 0.4794
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.004361457      -0.007518797       0.003747754

It is very close to the expected value under the assumption of no autocorrelation, so our spatial error model has also managed to remove most of the spatial autocorrelation.

11.4.3 Comparing the two spatial models

We can test which model might be more appropriate with the Lagrange Multiplier diagnostics (that’s a mouthful). It’s not a foolproof method, but it gives a good indication.

lm.LMtests(model = price_ols, listw = hex_weights, test = c("LMerr", "LMlag")) %>% summary()
##  Lagrange multiplier diagnostics for spatial dependence
## data:  
## model: lm(formula = mean_price ~ hh_income_lt_2000 + edu_university +
## mean_lease, data = hex_grid)
## weights: hex_weights
##  
##       statistic parameter   p.value    
## LMerr    19.156         1 1.204e-05 ***
## LMlag    30.984         1 2.602e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The multipliers here indicate that both models “make sense”. It is difficult to argue, on the basis of these statistics alone, that the lag model “makes more sense” than the error model. Think of them more as a clue, and then think back to what your data represent and which model is more sensible (is it a lag issue or an unobserved-variable issue?).
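
When both basic tests come out significant, as they do here, a common follow-up (not covered in class, sketched here for completeness) is to look at the robust versions of the same tests, preferring the specification whose robust statistic is the more significant one:

lm.LMtests(model = price_ols, listw = hex_weights, test = c("RLMerr", "RLMlag")) %>% summary()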

You will work with other model selection methods in your assignment. You already know the R-squared and we will introduce the AIC and log-likelihood briefly in class too.

11.5 Assignment 5 (Thursday, April 23rd, 23:59)

Your fifth assignment consists of the following parts.

11.5.1 Spatial regression

Compare the three models (price_ols, price_lag, price_error) with each other in a single table. Compare and discuss the meaning of three model fit statistics (R-squared, log-likelihood, AIC) as well as the model coefficients. Based on your understanding of the underlying process determining housing prices and the model fit, which model would you pick?

Hint: The R-squared is not directly given in our two spatial regression models. This is not a problem! We can compute it ourselves. Remember that the R-squared is simply:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i = 1}^n r_i^2}{\sum_{i = 1}^n (y_i - \bar{y})^2} \]

where \(r_i\) is the residual for hexagon \(i\) and \(\bar{y}\) is the mean value of the response variable. We can compute \(\bar{y}\) easily as:

y_bar = mean(hex_grid$mean_price)

and \(SS_{tot}\) as

SS_tot = sum((hex_grid$mean_price - y_bar) ^ 2)

The residuals of the spatial lag model are contained in hex_grid$resid_lagsarlm, and those of the spatial error model are contained in hex_grid$resid_error. So

SS_reslag = sum(hex_grid$resid_lagsarlm ^ 2)
SS_reserror = sum(hex_grid$resid_error ^ 2)

You now have everything you need to compute the R-squared for the spatial lag and error models!
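
For instance, a quick sketch using the quantities defined above:

rsq_lag <- 1 - SS_reslag / SS_tot
rsq_error <- 1 - SS_reserror / SS_tot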

Hint 2: You can print the value of a variable from your R environment into your RMarkdown text. For instance, if I run the code chunk below:

x <- 2

x is now a variable in my environment. I can write the following in RMarkdown text (outside of a code chunk), replacing [backtick] with a single backtick (`):

Hello [backtick]r x[backtick].

Notice the r at the beginning? This indicates that you want to evaluate x and insert its value into your text. In your knitted notebook, the text above will be rendered as

Hello 2.

This is quite useful to report single statistics, such as the R-squared, log-likelihood and AIC, in your assignments. For instance, if you have stored the three values for the price_ols model in variables (one way of doing so is sketched below the table), you can write the following as RMarkdown text:

Model | R-squared | AIC | Log-likelihood
-|-|-|-
`price_ols` | [backtick]r rsq[backtick] | [backtick]r AIC[backtick] | [backtick]r loglik[backtick]

This will create a neat little table that you never have to update manually if your model or your values change.
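
The variable names rsq, AIC and loglik above are just an illustration; you could, for example, compute them beforehand in a code chunk along these lines for the OLS model:

rsq <- glance(price_ols)$r.squared      # R-squared as reported by broom
AIC <- AIC(price_ols)                   # Akaike Information Criterion (the variable name masks the AIC() function, which is harmless here)
loglik <- as.numeric(logLik(price_ols)) # log-likelihood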

11.5.2 Submitting the assignment

  • Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your data folder with the here library). It should be saved in the assignments/ folder, as assignment5.Rmd.
  • Change the output parameter to github_document so that your assignment is rendered visually on GitHub.
  • Do not forget to run styler on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
  • Once you are finished, navigate to your own GitHub repository at https://github.com/02522-cua/[your-name]. Create a new issue and title it ‘Assignment 5 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include @barnabemonnot in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

  1. The RMarkdown does not run and there is incomplete description and contextualization.
  2. The RMarkdown does not run, or runs but does not produce the expected output. There is limited description and contextualization of the code and its output.
  3. The RMarkdown runs and the code is clearly structured, but the description and contextualization of the code and its output is limited.
  4. The RMarkdown runs, the code is clearly structured, and the description and contextualization is complete.
  5. As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.

11.5.3 Final Project Update

You will need to submit a prototype of your project. The prototype is not the final version; you can think of it as “a pretty good version you would show your colleagues, but not your boss (yet)”. You want your prototype to feature:

  • Most of the code, which should be able to run without issues.
  • The EDA part you were asked to submit for the previous assignment.
  • A pretty good write-up of your analysis, although not a very final one.

In other words, from looking at your prototype, your Prof. should understand the overall structure of the project, have a good understanding of the data you are using, and have a good sense of the missing parts (for instance, if you were not able to test your hypothesis yet, I should still be able to understand how you are planning to do so and what steps are necessary to get there).

Once again, I encourage you to look into bookdown for your project write-up, which will allow you to deploy your project analysis and notebooks online. It is easy to set up, with the steps described here.