# Chapter 11 Spatial regression

## 11.1 Introduction

In the previous Chapter, you looked at the concept of spatial autocorrelation from an exploratory perspective. As already foreshadowed, the same concept also plays a very specific role when conducting regression analyses on spatial data. When observations are correlated in space this can (but not always) result in the violation of several key assumptions underlying regression analyses. Luckily this can be corrected for by explicitly bringing a spatial component (i.e., model term) into the regression. We will explore this type of analysis by building a model of HDB resale prices across Singapore. We will try to explain resale prices not only by characterics of the flats but also based on the socio-economic characteristics of the underlying planning area.

## 11.2 Setting up the regression

To start, we load the shapefile of the planning areas and join to the socio-economic data we also used in Chapter 6, with MDS. We only keep planning areas for which we have data.

```
planning_areas_sf <- st_read(here::here("data/MP14_PLNG_AREA_NO_SEA_PL.shp")) %>%
filter(!(OBJECTID == 49 | OBJECTID == 18)) # remove islands
```

```
## Reading layer `MP14_PLNG_AREA_NO_SEA_PL' from data source `/Users/barnabe/Documents/Research/Teaching/Computation Urban Design/lecturenotes/data/MP14_PLNG_AREA_NO_SEA_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 55 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## epsg (SRID): NA
## proj4string: +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
```

```
planning_areas <- read_csv(here::here("data/planning_areas.csv")) %>%
left_join(planning_areas_sf, ., by = c("PLN_AREA_N" = "planning_area")) %>%
filter(area_hdb > 0) %>%
st_buffer(0) # we use this to correct shapefile issues
```

```
## Warning: Column `PLN_AREA_N`/`planning_area` joining factor and character
## vector, coercing into character vector
```

We want to analyse housing prices at a slightly finer detail than the planning area. To do so, we construct a hexagonal grid with a grid size of 1500 metres.

```
hex_grid <- planning_areas %>%
st_make_grid(st_bbox(.), square = FALSE, cellsize = 1500) %>%
st_sf() %>%
mutate(hex_id = row_number())
tm_shape(hex_grid) + tm_polygons()
```

We can join information from each planning area to the corresponding hexagon. Because some hexagons might intersect with multiple planning areas, we set the `largest = T`

argument so that we use the information from the planning area that has the largest overlap.

```
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
```

Great! We can now load the resale data and join this each hexagon, similar to how we did this in the last 2 weeks. Note: we set the geometry to `NULL`

here so that the POINT geometry of the flats does not interfere with our analysis later - we don’t need it any longer after the join.

```
resale <- readRDS(here::here("data/resale_with_geom.rds"))
resale_hex <- st_join(resale, hex_grid) %>%
st_set_geometry(NULL)
```

Based on this tibble, we can now create a series of metrics that reflect the situation in each hexagon. In addition to the mean price per square meter, we also calculate the average floor area and the average years remaining on the lease for each hexagon.

```
hex_grid <- resale_hex %>%
mutate(price_sqm = resale_price / floor_area_sqm) %>%
group_by(hex_id) %>%
summarise(mean_price = mean(price_sqm),
mean_lease = mean(remaining_lease),
mean_floor_area = mean(floor_area_sqm)) %>%
left_join(hex_grid, .) %>%
filter(mean_price > 0)
tm_shape(hex_grid) + tm_polygons(col = "mean_price")
```

## 11.3 A naive linear model

Lab practice 11.2

With the `hex_grid`

object, we now have all the information to start constructing a model of HDB resale prices. We will start by creating a simple, linear model where we try to predict the mean price in each hexagon by the % of households with an income lower than 2000; the % of people with a university degree; and the average number of years left on the lease. We estimate the model using ols (with the built-in `lm`

function) and inspect the results with `tidy`

and `glance`

from the `broom`

package, as we have done in Chapter 5.

```
price_ols <- lm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease, data = hex_grid)
glance(price_ols)
```

```
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.764 0.759 493. 141. 1.25e-40 4 -1019. 2048. 2062.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```

On the surface, this does not look like a bad model! The R-squared hovers around 0.7, which is pretty high considering we only have limited data available at the hexagon scale.

```
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1659. 542. -3.06 2.69e- 3
## 2 hh_income_lt_2000 11866. 743. 16.0 1.90e-32
## 3 edu_university 6254. 508. 12.3 1.47e-23
## 4 mean_lease 35.0 5.78 6.05 1.43e- 8
```

#### 11.3.0.1 Exercise

Can you understand the effect size and direction of each of the coefficients? *Hint: More flats are single-person flats closer to the city centre, hence hh_income_lt_2000 is higher.*

### 11.3.1 Residuals of the model

Lab practice 11.3

Whenever we are dealing with spatial data, it is good practice to plot the residuals of your model on a map. This allows you to quickly determine if there is any spatial clustering present in the residuals (there shouldn’t be!).

```
hex_ols <- augment(price_ols, data = hex_grid) %>%
st_as_sf()
tm_shape(hex_ols) + tm_polygons(col = ".resid", palette = "RdBu")
```

What do you think? Are the residuals randomly distributed?

We can assess this more quantitatively by comparing the residuals with a lagged version of the residuals, as we have done in the previous Chapter.

```
hex_sp <- as(hex_ols, 'Spatial')
hex_neighbors <- poly2nb(hex_sp)
hex_weights <- nb2listw(hex_neighbors, style="W", zero.policy=TRUE)
hex_ols$.resid_lag <- lag.listw(hex_weights, hex_ols$.resid)
ggplot(hex_ols, aes(.resid, .resid_lag)) + geom_point() + geom_smooth()
```

```
##
## Call:
## lm(formula = .resid_lag ~ .resid, data = hex_ols)
##
## Coefficients:
## (Intercept) .resid
## 30.356 0.273
```

Remember, this is the same as conducting a Moran’s I test - so we can do this in one swoop too.

```
##
## Moran I test under randomisation
##
## data: hex_ols$.resid
## weights: hex_weights
##
## Moran I statistic standard deviate = 4.5642, p-value = 2.507e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.272953968 -0.007518797 0.003776109
```

The Moran’s I seems significantly greater than zero. This is an indication that our residuals are positively autocorrelated, which is a problem. For instance, see Modern Dive 10.3: Independence of the residuals is a condition to use regression for inference!

## 11.4 Two spatial models

When the residuals are NOT randomly distributed in space, we need to bring the spatial component into the model in a more explicit way. Generally speaking, there are two ways of doing this. First, *spatial lag* models are based on the idea that there is spatial dependence between the dependent variable for different observations (often referred to as *rho*). Second, *spatial error* models are based on the idea that the model error for different observations is (partially) spatially dependent (often referred to as *lambda*). This is caused by some other (unmeasured) spatial covariate. Although there are some statistical tests that you can do to determine which type of model is better, ultimately this choice needs to be based on your theoretical understanding of the modeled process.

Whichever model we choose to use, we will always check eventuall whether the residuals still exhibit any form of dependency.

### 11.4.1 Spatial lag model

Lab practice 11.4.1

In the case of housing values, we could argue that the housing prices in one location are influenced by the surrounding housing prices. So let’s start with creating a *spatial lag* model. We can use the `lagsarlm`

function from the `spdep`

package to do so.

```
price_lag <- lagsarlm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease,
data = hex_grid,
listw = hex_weights)
summary(price_lag)
```

```
##
## Call:lagsarlm(formula = mean_price ~ hh_income_lt_2000 + edu_university +
## mean_lease, data = hex_grid, listw = hex_weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1557.057 -269.052 -31.263 292.198 1382.822
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2230.3114 449.1756 -4.9653 6.858e-07
## hh_income_lt_2000 6651.7726 919.2316 7.2362 4.614e-13
## edu_university 3761.7596 536.6864 7.0092 2.396e-12
## mean_lease 32.0712 4.9985 6.4161 1.398e-10
##
## Rho: 0.50774, LR test value: 36.294, p-value: 1.6967e-09
## Asymptotic standard error: 0.066144
## z-value: 7.6763, p-value: 1.6431e-14
## Wald statistic: 58.925, p-value: 1.6431e-14
##
## Log likelihood: -1000.849 for lag model
## ML residual variance (sigma squared): 166840, (sigma: 408.45)
## Number of observations: 134
## Number of parameters estimated: 6
## AIC: 2013.7, (AIC for lm: 2048)
## LM test for residual autocorrelation
## test value: 1.9792, p-value: 0.15948
```

We can see that the rho is quite significant! However, reading the coefficients for spatial lag models isn’t as straightforward anymore because we now have what is often referred to as spill-over effects. Because observations depend on each other, changing the value of a variable in one region can have a ripple effect throughout the study area. To gain insight into this, we can use the `impacts`

function from the `spatialreg`

package to split both the direct and indirect effects.

```
## Impact measures (lag, exact):
## Direct Indirect Total
## hh_income_lt_2000 7227.03535 6285.68837 13512.72372
## edu_university 4087.08637 3554.72888 7641.81526
## mean_lease 34.84481 30.30615 65.15095
```

Just as we did with the OLS model, we can check the residuals for any spatial correlation.

```
hex_grid$resid_lagsarlm <- residuals(price_lag)
tm_shape(hex_grid) + tm_polygons(col = "resid_lagsarlm", palette = "RdBu")
```

```
##
## Moran I test under randomisation
##
## data: hex_grid$resid_lagsarlm
## weights: hex_weights
##
## Moran I statistic standard deviate = 1.0585, p-value = 0.1449
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.057200593 -0.007518797 0.003738136
```

Spatial autocorrelation has largely disappeared. If it wouldn’t have, this would be a good indication a spatial error model would be more appropriate.

### 11.4.2 Spatial error model

Lab practice 11.4.2

Let’s see now how to set up and run the spatial error model. The procedure is very similar to that of the spatial lag model and naive linear regression: fit the model, interpret the results!

```
price_error <- errorsarlm(mean_price ~ hh_income_lt_2000 + edu_university + mean_lease,
data = hex_grid, listw = hex_weights)
```

The lambda is around 0.8. Let’s see how our residuals are distributed.

```
hex_grid$resid_error <- residuals(price_error)
tm_shape(hex_grid) + tm_polygons(col = "resid_error", palette = "RdBu")
```

Finally, let’s check the Moran’s I.

```
##
## Moran I test under randomisation
##
## data: hex_grid$resid_error
## weights: hex_weights
##
## Moran I statistic standard deviate = 0.051575, p-value = 0.4794
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.004361457 -0.007518797 0.003747754
```

It is very close to the expected value assuming no autocorrelation, so our spatial error model also managed to significantly reduce the spatial autocorrelation.

### 11.4.3 Comparison with spatial error model

We can test which model might be more appropriate with the Lagrange Multiplier diagnostics (that’s a mouthful). It’s not a foolproof method but gives a good indication.

```
## Lagrange multiplier diagnostics for spatial dependence
## data:
## model: lm(formula = mean_price ~ hh_income_lt_2000 + edu_university +
## mean_lease, data = hex_grid)
## weights: hex_weights
##
## statistic parameter p.value
## LMerr 19.156 1 1.204e-05 ***
## LMlag 30.984 1 2.602e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The multipliers here indicate that both models “make sense”. It is difficult to argue on the basis of the statistic alone that the lag model “makes more sense” than the error model. Think of it more as a clue and then think back to what your data is and what is more sensible (is it a lag issue or is it an unobserved variable issue?)

You will work with other model selection methods in your assignment. You already know the R-squared and we will introduce the AIC and log-likelihood briefly in class too.

## 11.5 Assignment 5 (Thursday, April 23rd, 23:59)

Your fifth assignment consists of the following parts.

### 11.5.1 Spatial regression

Compare the three models (`price_ols`

, `price_lag`

, `price_error`

) with each other in a single table. Compare and discuss the meaning of three model fit statistics (R-squared, log-likelihood, AIC) as well as the model coefficients. Based on your understanding of the underlying process determining housing prices and the model fit, which model would you pick?

*Hint:* The R-squared is not directly given in our two spatial regression models. This is not a problem! We can compute it ourselves. Remember that the R-squared is simply:

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i = 1}^n r_i^2}{\sum_{i = 1}^n (y_i - \bar{y})^2} \]

where \(r_i\) is the residual for hexagon \(i\) and \(\bar{y}\) is the mean value of the response variable. We can compute \(\bar{y}\) easily as:

and \(SS_{tot}\) as

The residuals of the spatial lag model are contained in `hex_grid$resid_lagsarlm`

, and those of the spatial error model are contained in `hex_grid$resid_error`

. So

You now have everything you need to compute the R-squared for the spatial lag and error models!

### 11.5.2 Submitting the assignment

- Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your
`data`

folder with the`here`

library). It should be saved in the`assignments/`

folder, as`assignment5.Rmd`

. - Change the
`output`

parameter to`github_document`

so that your assignment is rendered visually representation on Github. - Do not forget to run
`styler`

on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so). - Once you are finished, navigate to your own Github repository at
`https://github.com/02522-cua/[your-name]`

. Create a new issue, title it ‘Assignment 1 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include`@barnabemonnot`

in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

- The Rmarkdown does not run and there’s incomplete description and contextualization
- The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output
- The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited
- The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
- As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.

### 11.5.3 Final Project Update

You will need to submit a prototype of your project. The prototype is not the final version, you can think of it as “a pretty good version you would show your colleagues, but not your boss (yet)”. You want your prototype to feature:

- Most of the code, which should be able to run without issues.
- The EDA part you were suggested to submit for the previous assignment.
- A pretty good write-up of your analysis, although not a very final one.

In other words, from looking at your prototype, your Prof. should understand the overall structure of the project, have a good understanding of the data you are using and a good expectation of the missing parts (for instance, if you were not able to test your hypothesis yet, I should still be able to understand how you are planning to do so and what are the steps necessary to get there).

Once again, I encourage you to look into `bookdown`

for your project write-up, which will allow you to deploy your project analysis and notebooks online. It is easy to set up, with the steps described here.