Chapter 5 Correlation and regression

5.1 Introduction

In the previous weeks, we have already taken a relatively deep dive into different variables recorded for HDB resale transactions. So far, we have mostly looked at single variables in isolation - for this reason we call this univariate statistics. However, most interesting statistics involves more than a single variable and instead looks at relations between variables. This week we will look at two techniques to look at more than one variable, so we are now in the territory of _bi_variate or _multi_variate statistics!

5.2 Visualize first

Before we look at quantitative statistics that summarize relationships in our dataset, it is a good idea to visualize potential relations first. For continuous variables, this can often be efficiently done through a scatterplot matrix. You can create these with the base R pairs function or use a ggplot extension such as the GGally library.

sales <- readRDS(here::here("data/sales.rds"))

sales %>% 
  select(floor_area_sqm, remaining_lease, lease_commence_date, resale_price) %>% 

The above command might take a minute or so to run because our dataset has about 80k points and R needs to draw each of these points for each graph in our matrix. Who has time for that? We can make things a little snappier during our initial exploration by just taking a 5% sample. Remember this from last week?

sales %>% 
  select(floor_area_sqm, remaining_lease, lease_commence_date, resale_price) %>% 
  sample_frac(0.05) %>% 

Enough to see potential trends - but much faster! On first sight, it seems that we have one variable pair that is very related: remaining_lease and lease_commencement_date. This is obvious but it is a good sanity test. We also find a potentially strong relation between price and the square footage of the flat. The relationship between lease and price seems a bit more complicated than we might expect at first sight so we’ll ignore that for now.

You also know that there are several nominal or ordinal variables in the dataset that might be of importance. The primary ones are the flat type, floor number and the town. You might have already experimented with visualizing relations between these variables and the continous variables in the previous assignment. What are potentially useful graphs that you can use for such relationships that give you similar insights as our scatterplot matrix?

5.3 Correlation

Once you have explored relationships in your data visually, it can be useful to find more numerical ways to express these relationships. One such way is through an analysis of correlation. The most widely known statistic for correlation is the Pearson product moment correlation coefficient. Please refer back to Chapter 4.3 in Burt et al. for a longer discussion on the calculation of this coefficient. Luckily in R, we can do this very simpluy.

cor(sales$floor_area_sqm, sales$resale_price, method = "pearson")
## [1] 0.6425493

As we can already see in the scatterplot, these two variables are strongly, and positively correlated. The higher the square footage, the higher the resale price (duh!). Explore some of the other relationships between the continuous variables and try to explain their correlation coefficients.

Pearson’s coefficient can only be applied on continuous variables. It is also a measure of linear relationship, so it may not always be the best to measure non-linear relationships. Luckily, we also have another correlation coefficient, Spearman’s rank correlation coefficient, that works on both continuous and ordinal variables and doesn’t look at the values of the variable, but only their ranks. We first apply this to a relationship we already know.

cor(sales$floor_area_sqm, sales$resale_price, method = "spearman")
## [1] 0.7134009

A slightly different number (because we use a different calculation approach) but our conclusion about the relationship between these two variables stands. Now we can apply the same technique to see, for example, the relationship between floor number and price. To do so, we need to make sure we have ordered the levels of the factor storey_range appropriately because Spearman uses the ‘rank’ to calculate the correlation.

cor(as.integer(sales$storey_range), sales$resale_price, method = "spearman") # before an appropriate sort
## [1] 0.1286719
sales <- sales %>% 
  mutate(storey_range = fct_relevel(storey_range, sort(levels(storey_range))))

cor(as.integer(sales$storey_range), sales$resale_price, method = "spearman") # can you figure out what the `as.integer()` function does?
## [1] 0.2721566

5.4 Regression

Regression analysis is very much like correlation analysis but it does two additional things. First, while correlation allows us to summarize the direction and strength of a relationship between two variables, these two variables can be treated interchangeably. In regression analysis, we change this by specifying a specific dependent and one or more independent variables. This is also the second difference: regression analysis allows us to consider a relation between more than just two variables. In the next section we will conduct an ordinary least squares regression analysis with resale_price as our dependent variable. To keep things simple we will focus on transactions from 2016 only.

To do an OLS regression in R, we can use the base R function lm(). We need to specify two important parts: the dataset we want to use and a regression formula. The formula takes the following form: dep_var ~ indep_var. So to regress floor area on resale price:

sales_2016 <- filter(sales, month > "2015-12-01" & month < "2017-01-01")
ols <- lm(resale_price ~ floor_area_sqm, sales_2016)

We have now stored our model results in the ols variable. To inspect the model results, we can use the glance and tidy functions from the broom package.

glance(ols) # glance returns goodness of fit measures
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.425         0.425 1.05e5    14323.       0     1 -2.52e5 5.03e5 5.03e5
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Can your interpret the r squared of your model? The sigma is another term for the standard error of the estimate or root mean square error. Can you use Burt et al. Chapter 4 to interpret this coefficient?

tidy(ols) # tidy returns information about the regression coefficients
## # A tibble: 2 x 5
##   term           estimate std.error statistic   p.value
##   <chr>             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      73681.    3143.       23.4 7.86e-120
## 2 floor_area_sqm    3754.      31.4     120.  0.

Can you interpret the coefficient for floor area?

We can add information on the model fit and its residuals to the original tibble by using augment.

sales_ols <- augment(ols, data = sales_2016)
## # A tibble: 19,373 x 17
##    month      town  flat_type block street_name storey_range floor_area_sqm
##    <date>     <chr> <fct>     <chr> <chr>       <fct>                 <dbl>
##  1 2016-01-01 ANG … 2 ROOM    406   ANG MO KIO… 07 TO 09                 44
##  2 2016-01-01 ANG … 2 ROOM    116   ANG MO KIO… 01 TO 03                 44
##  3 2016-01-01 ANG … 3 ROOM    170   ANG MO KIO… 01 TO 03                 60
##  4 2016-01-01 ANG … 3 ROOM    560   ANG MO KIO… 01 TO 03                 67
##  5 2016-01-01 ANG … 3 ROOM    258   ANG MO KIO… 01 TO 03                 73
##  6 2016-01-01 ANG … 3 ROOM    639   ANG MO KIO… 01 TO 03                 68
##  7 2016-01-01 ANG … 3 ROOM    538   ANG MO KIO… 04 TO 06                 68
##  8 2016-01-01 ANG … 3 ROOM    419   ANG MO KIO… 01 TO 03                 74
##  9 2016-01-01 ANG … 3 ROOM    435   ANG MO KIO… 10 TO 12                 67
## 10 2016-01-01 ANG … 3 ROOM    180   ANG MO KIO… 04 TO 06                 68
## # … with 19,363 more rows, and 10 more variables: flat_model <fct>,
## #   lease_commence_date <dbl>, remaining_lease <dbl>, resale_price <dbl>,
## #   .fitted <dbl>, .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>

We can now use this information to visualize our model results in more details - specifically by looking at the residuals of our model. We know that these residuals ideally do not exhibit any systematic patterns. For example, the histogram of the residuals should be roughly normally distributed.

sales_ols %>% 
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 50)

Is the histogram of residuals normally distributed? If not, what could be a potential reason? How do you interpret this histogram? We will discuss a few reasons for this phenomenon in class. Some of these issues can be caused by non-linearity in the relationship between some of our variables. In that case, we can alleviate the situation by transforming either the independent or dependent variable (check out Chapter 13 in Burt et al. if you’re interested). In this case, the ‘strange’ pattern might very well be caused by additional variables that have an effect on the resale price. You can include additional variables in your regression analysis like so dep_var ~ indep_var1 + indep_var2 + indep_var3. In this Block’s assignment you will experiment with the inclusion of different variables to understand the dynamics of HDB property prices in Singapore.

5.5 Assignment (Thursday, February 27th, 23:59)

For your second assignment, you will conduct an exploration of the effect of different variables on resale transactions in Singapore. You will have to integrate at least the following:

5.5.1 Hypothesis testing

In Chapter 3, we saw that the price of a flat can be quite different depending on what storey the flat is located on. We want to investigate the following research question: “Is the average price of 3 room flats on storey 04-06 significantly different from 3 room flats on storey 07-09?”. To do so, we will use all transactions for 3 room flats in storeys 04-06 or 07-09 in our dataset.

  • State your null hypothesis and your alternative hypothesis.
  • Construct the null distribution.
  • Find and visualise the p-value.
  • Construct and visualise the confidence interval of the sample difference in means and discuss its relation with the p-value.
  • BONUS: Plot the difference in means for successive groups of flats per storey, i.e., obtain the difference in means for 04-06 to 01-03, then 07-09 to 04-06, then 12-10 to 07-09 etc. What do you observe? When does the price difference stop being significant? (if ever?)

5.5.2 Regression

  • Experiment with the addition of different continuous and ordinal variables to the regression model built in Chapter 5 (but EXCLUDE town for now). Pay attention to the overall goodness of fit of the model but also to the regression coefficients. Interpret and explain the coefficients.
  • Add the town variable to your regression model. Because town is a categorical variable R will create ‘dummy’ or binary variables for each unique entry. The first entry will be the ‘reference’. All the other coefficients will be reported relative to that reference (in this case AMK). If, say, the coefficient for Bedok is -30,000 that means that relative to Ang Mo Kio, flats in Bedok are $30,000 cheaper. You can change the ‘reference’ by chaning town into a factor and order your factor levels so that the reference is the first level. Interpret and explain the goodness of fit measures and the coefficients. Tip: if you have a lot of coefficients to report, don’t shy away from using visualiations.
  • Based on your model, try to work through a simple example. Let’s say you have a family in the market for a 3 room flat in Punggol. They are looking for a flat sized at 67 sq meters and 95 years left on the lease. What should they be looking to pay according to your model? Can you explain to them how each of their choices translates into the flat price? Compare your estimate with all known transactions of flats with those characteristics - how correct was your model?
  • The coefficients are very useful in helping us understand how the resale price is affected by other variables. However, sometimes we conduct regression analysis simply because we are interested in predicting the dependent variable for new observations. So far, you have done your analysis only on the 2016 data. Use your 2016 model, to predict the 2017 prices. Tip: you can use the augment function with the newData argument to do this. Calculate the standard error of the estimate for the 2017 and compare that with your 2016 data. Did the error go up or down and what might cause this?
  • BONUS: Finally, discuss how you could further enhance your analysis with the inclusion of additional variables or data not contained within the HDB resale transaction dataset itself, or try out a non-linear model by transforming some of your independent variables. For instance (this is an example), if you find that the resale_price is usually roughly dependent on the square of the remaining_lease, you could run a linear model with remaining_lease ^ 2.

5.5.3 Submitting the assignment

  • Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your data folder with the here library). It should be saved in the assignments/ folder, as assignment2.Rmd.
  • Change the output parameter to github_document so that your assignment is rendered visually representation on Github.
  • Do not forget to run styler on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
  • Once you are finished, navigate to your own Github repository at[your-name]. Create a new issue, title it ‘Assignment 1 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include @barnabemonnot in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

  1. The Rmarkdown does not run and there’s incomplete description and contextualization
  2. The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output
  3. The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited
  4. The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
  5. As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.