Chapter 12 Towards Machine Learning

12.1 Logistic Regression

When doing (linear) regression analysis so far, we have used the technique to model the effect of one or more independent variables (X) on a dependent variable (Y). This dependent variable has so far always been continuous, such as the resale price of an HDB flat. But what if our dependent variable is instead categorical in nature? Predicting the ‘outcome’ of a categorical variable is also referred to as classification and is part of the supervised set of machine learning techniques. Although an ordinary linear regression cannot be used for classification, there are classification techniques based on regression as well. In this Section we look at one such technique—logistic regression.

12.1.1 Model

Let’s say you are interested in understanding the effect of parental income on a student’s chance of graduating university. Income is a continuous variable but graduating from a university is a binary event (only two possible outcomes: you graduate or not). Let’s try and think of this as the chance of graduating, and model this probability with a plain vanilla linear regression. We try to fit the slope \(\beta_1\) and intercept \(\beta_0\),

\[ p = \text{Probability of graduating} = \beta_0 + \beta_1 \, \times \text{Income} \]

There is an issue: The predicted chance can be lower than 0 or higher than 1. That’s not theoretically possible!

Since we are interested in the chance of a certain outcome happening, we can instead look at the odds ratio. The odds ratio is the ratio of the chance a positive outcome and a negative outcome: \(\frac{p}{1−p}\). Since a regular odds ratio goes from 0 to Infinity (with a midpoint at 1) and we rather deal with a variable that goes from -Infinity to Infinity (with a midpoint at 0), we take the natural logarithm of the odds ratio \(log(\frac{p}{1−p})\). We call this the log odds or logit, hence the name logistic regression! We are now modeling the log odds in the same way as we did with linear regression:

\[ \log(\frac{p}{1−p})= \beta_0 + \beta_1 X \]

where we now call \(X\) the income.

After manipulating this formula above, you can see we can use it to also predict the chance of only the positive event happening (\(p\)): \(p=\frac{\exp{(\beta_0 + \beta_1 X)}}{\exp(\beta_0 + \beta_1 X) + 1}\). What’s more the effect of \(X\) on a positive outcome isn’t linear!

x <- seq(from = -15, to = 15, by = 0.1)
p <- exp(x)/(exp(x) + 1)
ggplot(data = NULL, mapping = aes(x = x, y = p)) +
  geom_line(colour = 'blue') +
  ggtitle('Logistic Regression Line') +
  xlab("Income X") +
  ylab("Probability of graduating p")

12.1.2 Setting up the data

Let’s see how this works in practice. Do you think we can predict whether an HDB flat is in the ‘Central Region’ of Singapore based on its characteristics? Let’s find out. You can download the data here.

df <- readRDS(here::here("data/resale_pln_region.rds")) %>%
  mutate(region = as_factor(str_remove(region, "-"))) %>%
  mutate(resale_price = resale_price / 1000) %>% # measure price in k
  mutate(region = relevel(region, ref = "Noncentral")) %>% # we change the reference level because we want to predict whether something is 'Central'

## Rows: 78,853
## Columns: 11
## $ town               <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
## $ storey_range       <dbl> 3, 1, 1, 1, 3, 3, 1, 1, 1, 5, 3, 4, 4, 3, 3, 3, 4,…
## $ floor_area_sqm     <dbl> 60, 68, 69, 68, 68, 67, 68, 68, 67, 68, 67, 68, 68…
## $ remaining_lease    <dbl> 70, 65, 64, 63, 64, 64, 64, 65, 62, 69, 60, 64, 65…
## $ resale_price       <dbl> 255.0, 275.0, 285.0, 290.0, 290.0, 290.0, 290.0, 2…
## $ hh_income_lt_2000  <dbl> 0.2592, 0.2592, 0.2592, 0.2592, 0.2592, 0.2592, 0.…
## $ hh_income_gt_17500 <dbl> 0.1296, 0.1296, 0.1296, 0.1296, 0.1296, 0.1296, 0.…
## $ edu_university     <dbl> 0.2434663, 0.2434663, 0.2434663, 0.2434663, 0.2434…
## $ status_unemployed  <dbl> 0.02926209, 0.02926209, 0.02926209, 0.02926209, 0.…
## $ region             <fct> Noncentral, Noncentral, Noncentral, Noncentral, No…

We will create a sample (for faster calculation) and split the data in two parts, a ‘training’ set and a ‘testing’ set. This is a common pattern in Machine Learning:

  • Train on a reduced dataset.
  • Test on new data, unseen by the model previously, to check how well the model generalises (i.e., if it overfits or not).

We will set aside the test set and will first create the model on the training dataset.

df_sample <- df %>%

# Creating a fixed sample for training/testing
sample <- sample(c(TRUE, FALSE), nrow(df_sample), replace = T, prob = c(0.7,0.3))

sample_train <- df_sample %>%

sample_test <- df_sample %>% 

12.1.3 Performing a logistic regression

We can conduct a logistic regression with the general modeling function glm. It is similar to the familiar lm but instead of ordinary least squares (OLS, minimising the sum of squared residuals), glm uses the maximum likelihood estimation and so is a much more versatile function.1

In this case, we need specify to specify that we want to use the binomial distribution to run a logistic regression but the other parameters are similar to lm.

model_logit1 <- glm(region ~ resale_price, family = "binomial", data = sample_train)
## Call:
## glm(formula = region ~ resale_price, family = "binomial", data = sample_train)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7275  -0.6264  -0.5271  -0.4318   2.4113  
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.638621   0.073851  -49.27   <2e-16 ***
## resale_price  0.004688   0.000144   32.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 13607  on 14002  degrees of freedom
## Residual deviance: 12482  on 14001  degrees of freedom
## AIC: 12486
## Number of Fisher Scoring iterations: 4

Great, our model ran but what does it tell us? We can start by interpreting the model coefficient. We can see that the coefficient for resale_price is 0.0046. What does this number mean? It means that for a 1-unit increase in resale price, the log odds of that flat being in the Central region increase with 0.0046. In most cases we are interested in the odds, so we can exponentiate the coefficient.

##  (Intercept) resale_price 
##   0.02628858   1.00469939

In other words, for a $1,000 increase in sale price, the odds of that flat being in the Central region increase by ~1.005 (0.5%).

We can use this coefficient (together with the intercept to make prediction on new data as well) with the predict function.

predict(model_logit1, tibble(resale_price = c(300, 900)), type = "response")
##          1          2 
## 0.09690421 0.64128002

Based on this simple model, a flat with a price of $300k is likely to be outside of the centre (9.6% chance of being in the centre), while a $900k is likely in the Central Region (64.1% chance). We can probably come up with a better model by adding additional variables to the model.

model_logit2 <- glm(region ~ resale_price + remaining_lease, family = "binomial", data = sample_train)
## Call:
## glm(formula = region ~ resale_price + remaining_lease, family = "binomial", 
##     data = sample_train)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5852  -0.5893  -0.3911  -0.2155   2.7746  
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.3472795  0.1628678   14.41   <2e-16 ***
## resale_price     0.0077473  0.0001839   42.13   <2e-16 ***
## remaining_lease -0.1028643  0.0027784  -37.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 13607  on 14002  degrees of freedom
## Residual deviance: 10700  on 14000  degrees of freedom
## AIC: 10706
## Number of Fisher Scoring iterations: 5 Exercise

  • How do you interpret the new model coefficients?
  • We can update our prediction too. What happened with the predictions now that the remaining_lease has been added?
        tibble(resale_price = c(300, 900), remaining_lease = c(50, 80)),
        type = "response")

12.1.4 Variable importance

What if you wanted to know which of the two variable was more important to determine Central-ness? We can calculate this ‘importance’ with the varImp function from the caret package.

##                  Overall
## resale_price    42.13532
## remaining_lease 37.02267

For logistic regression, the ‘importance’ is the same as the ‘Z-value’ reported by summary(), which is calculated by dividing the coefficient by the standard error. In this case both our variables are quite (equally) important.

12.1.5 Model selection

We now have two models – but how do you determine which model is the better model of the two? One way is to look at the log-likelihood of each model. We see that the second model has a much better likelihood - but we should verify and test whether this difference is significant too. The test is called “Chi-squared” (or \(\chi^2\)). It is a very common test—so common in fact that pchisq is part of the base R stats package.

## 'log Lik.' -6241.22 (df=2)
## 'log Lik.' -5350.031 (df=3)
lrtest <- function(model1, model2) {
  ll1 <- logLik(model1)
  ll2 <- logLik(model2)
  two_lr <- 2 * (as.numeric(ll2) - as.numeric(ll1))
  df_diff <- abs(attr(ll2, which = "df") - attr(ll1, which = "df"))
  pchisq(two_lr, df = df_diff, lower.tail = F) # this is the same as looking up in a chi square table and returns a p value
lrtest(model_logit1, model_logit2)
## [1] 0

Let’s add in the remaining variables! Clearly this results in an even better model.

model_logit3 <- glm(region ~ resale_price + remaining_lease + storey_range + floor_area_sqm + hh_income_lt_2000 + hh_income_gt_17500 + edu_university + status_unemployed, family = "binomial", data = sample_train)
## Call:
## glm(formula = region ~ resale_price + remaining_lease + storey_range + 
##     floor_area_sqm + hh_income_lt_2000 + hh_income_gt_17500 + 
##     edu_university + status_unemployed, family = "binomial", 
##     data = sample_train)
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.21712  -0.00353  -0.00032  -0.00002   3.13987  
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -5.558e+01  2.196e+00  -25.31   <2e-16 ***
## resale_price        1.582e-02  8.733e-04   18.11   <2e-16 ***
## remaining_lease    -9.536e-02  7.165e-03  -13.31   <2e-16 ***
## storey_range       -3.038e-01  3.030e-02  -10.03   <2e-16 ***
## floor_area_sqm     -8.158e-02  5.104e-03  -15.98   <2e-16 ***
## hh_income_lt_2000   9.462e+01  3.499e+00   27.04   <2e-16 ***
## hh_income_gt_17500 -7.628e+01  3.216e+00  -23.72   <2e-16 ***
## edu_university      1.670e+02  6.088e+00   27.43   <2e-16 ***
## status_unemployed   2.409e+02  1.791e+01   13.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 13606.8  on 14002  degrees of freedom
## Residual deviance:  2978.6  on 13994  degrees of freedom
## AIC: 2996.6
## Number of Fisher Scoring iterations: 10
## 'log Lik.' -1489.302 (df=9) Exercise

Try to understand new estimated coefficients. How do they affect the Central-ness of a flat? And which predictors are the most important? Hint: you can check the importance of each variable with varImp.

The log likelihood tells us this model is better than the previous models. But how good is this model in a more absolute way?

One way to look at this is by calculating a pseudo R-squared (see this for an overview). Remember that it’s not clear what we should call “residuals” when we have a categorical response variable! Then what should we use instead of the R-squared we defined for linear regression? In logistic regression, we sometimes see McFadden’s R-squared used.

model_logit_null <- glm(region ~ 1, family = "binomial", data = sample_train)
mcfadden_r2 <- 1-(as.numeric(logLik(model_logit3))/as.numeric(logLik(model_logit_null)))
## [1] 0.7810944

An R-squared of 0.78 doesn’t seem so bad! But if we are ultimately interested in prediction, the best way to assess our model fit is how well it predicts.

12.1.6 Predictions on the test dataset

We can do this with the test data we set aside. We can apply our model to the test data, have it predict whether an observation is in the Central Region and compare that with the real observed value.

predict_model_3 <- predict(model_logit3, newdata = sample_test, type = "response")

Notice that predict_model_3 is a vector of probabilities, numbers between 0 and 1. For each data point in our test data, the vector holds the estimated probability of this data point belonging to Central Singapore. Here are the first ten values of the vector (expressed as a percentage).

round(predict_model_3[1:10] * 100)
##   1   2   3   4   5   6   7   8   9  10 
## 100   0  10  93   0  10   6   0   0   0

To make our prediction, we could decide that if the probability is greater than 50%, we’ll predict that the flat is in Central Singapore. We can look at the performance of the prediction through the raw confusion matrix. This shows the following 4 cases:

  • True Negatives: The true response is “non-central”, and we predicted non-central as well (top left)
  • False Negatives: The true response is “non-central”, but we predicted central (top right)
  • True Positives: The true response is “central”, and we predicted central (bottom right)
  • False Positives: The true response is “central”, but we predicted non-central (bottom left)
table(sample_test$region, predict_model_3 > 0.5)
##              FALSE TRUE
##   Noncentral  4693   97
##   Central      163 1044

Depending on your research question, you might want to tune your model to prevent a certain type of error (e.g., many use cases rely on minimises false negatives, and are OK with increasing false positives, and vice versa). This is formalised in the concepts of ‘accuracy’, ‘precision’ and ‘recall’ (or prevalence, specificity and sensitivity) – see the help of caret for an explanation (?posPredValue). Which measure would be of most value here?

Precision and recall Wikipedia

We can get a balanced, summary statistic with the F1 score. We’ll add the prediction to our test data table first to make things a bit easier.

sample_test <- sample_test %>%
  mutate(pred = as.factor(ifelse(predict_model_3 > 0.5, "Central", "Noncentral")) %>% relevel(., ref = "Noncentral"))

table(sample_test$region, sample_test$pred)
##              Noncentral Central
##   Noncentral       4693      97
##   Central           163    1044
## these function come from caret
precision <- posPredValue(sample_test$pred, sample_test$region, positive = "Central")
recall <- sensitivity(sample_test$pred, sample_test$region, positive = "Central")

F1 <- (2 * precision * recall) / (precision + recall)
## [1] 0.8892675

12.2 Random forests

Random forests are a powerful ML technique to train decision trees without overfitting. One nice thing is that random forests are able to estimate both continuous and categorical variables, with minimal difference between the two implementations. Compared to other popular ML methods such as neural networks, they also provide slightly more interpretative power. They are based on many primitives we have already seen in class (bootstrap, hierarchical trees, model selection) and thus will be a nice send-off towards your future data-driven roles!

Let’s first build the overall intuition behind random forests, starting with decision trees.

12.2.1 Decision trees

We have seen “trees” on two occasions. When we performed hierarchical clustering, we were building a tree that placed observations at its leaves, branching out to indicate clusters. We have also discussed the Minimum Spanning Tree when we performed spatial clustering with SKATER.

In decision trees, you start from the root node, which branches in two directions, and keep moving from node to node until you reach a leaf at the bottom of a tree.

In this sense, decision trees look like flowcharts. You start from the root of your tree and make a decision at each branch. The decisions are based on attributes of the observation, such as “Is the remaining lease of this flat greater than 75 years? Yes or no.” or “Is this flat located in Serangoon? Yes or no.”

When you finally reach a leaf (a node that is not branching anymore), the leaf contains an estimate for your response variable. For instance, if we answered “Yes” to the question “Is the resale price greater than 900k?”, we will likely end up at a leaf that tells us “The flat is in Central Singapore”.

Training a decision tree then means that we must find the collection of branches and leaves that most “explain” our data. If we can decide early on that a flat is centrally located by asking whether it has an unusually high resale price, we want this decision to be made as high in the tree as possible2.

Let’s work out an example on our data! We will take a simple model first to build intuition, estimating the resale_price from floor_area_sqm and edu_university. As often in R, we only need to give a formula and the data to train on. We use the tree package to get the decision tree. We’ll also get a smaller sample this time around.


df_sample_rf <- df %>%

# Creating a fixed sample for training/testing
sample <- sample(c(TRUE, FALSE), nrow(df_sample_rf), replace = T, prob = c(0.7,0.3))

sample_train_rf <- df_sample_rf %>%

sample_test_rf <- df_sample_rf %>% 

# train the tree on the training set
tr <- tree(resale_price ~ floor_area_sqm + edu_university,
           data = sample_train_rf)

We can plot the tree to see how it arranged the decisions and which values it returns on its leaves (you can expand it in your RStudio session if it is too overplotted).


The single values at the bottom of the tree are predictions made by the tree. If the floor area of the flat is less than 84.5, we go to the left of the tree. If it is also less than 74.5, we return 307.5k. Otherwise, return 375.5.

As we explained before, the nice thing about trees is that they are also able to perform classification, when the response variable is categorical. Let’s try it out.

tr_class <- tree(region ~ remaining_lease,
                 data = sample_train_rf %>%
                   mutate(region = as_factor(REGION_N)))

Finally, we can also have categorical variables in our nodes (e.g., “Is the flat in Ang Mo Kio?”) but we need to convert the town categorical variable to a collection of binary numerical variables, one for each town. For instance, we will have variables is_in_ang_mo_kio, is_in_bedok etc. which only take values 0 or 1. This type of conversion is very frequent in ML and called one-hot encoding or dummy variables. The caret package has a function to help produce these variables.

dummy_model <- dummyVars(~ ., data = sample_train_rf)
dummy_sample <- predict(dummy_model, newdata = sample_train_rf) %>%
  as_tibble() %>%
  select(resale_price, remaining_lease, starts_with("region", = FALSE))

Open up dummy sample and check that it indeed has the encoding we are looking for! Let’s now plot the tree.

tr_cat <- tree(resale_price ~ .,
               data = dummy_sample)

12.2.2 Bagging

Decision trees have a few defects. By branching early, they tend to overfit the data they are trained on, such that a single decision tree usually makes up its mind very quickly. That’s bad, because our model cannot generalise well then. So instead of relying on a single tree, we train a whole bunch of them and ask each and every tree what their estimate is based on some observation we pipe in.

Each tree returns some estimate (“Central!”, “Not central!”, “Central!”) and it is then up to us to smartly aggregate their replies. Many variations exist here, from a “democracy” that picks the most frequently returned estimate (in the case of a categorical response variable) to a “make everyone happy” simple average (in the case of a continuous response variable).

So how do we grow our forest? We use bootstrap! We have a sample of n observations that we would like to train our trees on. We specify how many trees we would like to train (call this number B) and repeat the following B times:

  1. Sample n times with replacement from the original sample (bootstrap!)
  2. Train a decision tree on this resampled dataset.

At the end of the procedure, we have B trained decision trees ready to estimate for us. This method is called bagging, short for “bootstrap aggregating”.

12.2.3 Random forests

We have to dig a bit deeper into how our trees are trained to have the full picture. We must decide at each node which input variables to test, and what the level of the test is. For instance, if we decide to test remaining_lease, we must decide if we want to test “is greater than 75 years”, “is greater than 50 years” or any other number of years.

What is commonly understood as random forests (there are possible variations) uses a technique called feature bagging. We start the tree from the top (the root) and at each node we sample a small number of our predictor variables to define the test. If we have \(p\) variables, we tend to randomly choose about \(\sqrt{p}\) variables to produce the split. The level of the test is then decided such that the split it operates (there are observations that satisfy the test, and others that do not) most separates the data. This can be measured by what is called Gini impurity, on which we will not expand.

After each tree is trained with a combination of both bagging and feature bagging, we have a forest of trained decision trees and can start making our predictions.

For now, let’s do random forest without categorical variables.


price_rf <- randomForest(
  resale_price ~ storey_range + floor_area_sqm + remaining_lease + edu_university,
  data = sample_train_rf, ntree = 100)

## Call:
##  randomForest(formula = resale_price ~ storey_range + floor_area_sqm +      remaining_lease + edu_university, data = sample_train_rf,      ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
##           Mean of squared residuals: 4127.839
##                     % Var explained: 80.7
print(importance(price_rf, type = 2))
##                 IncNodePurity
## storey_range          7690541
## floor_area_sqm       30420891
## remaining_lease      10761294
## edu_university       13493451

We can plot the error made by the model as more trees are getting trained.


Let’s try out the model on our test set.

sample_test_rf <- sample_test_rf %>%
  add_column(pred = predict(price_rf, newdata = sample_test_rf))

We could get the Root Mean Square Error (RMSE), which is simply the square root of the mean squared residuals. Assuming we have \(n\) observations in our test set:

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - \hat{y}_i)^2} \]

We compare it with a linear regression trained on the same dataset.

rmse_rf <- sqrt(mean((sample_test_rf$resale_price - sample_test_rf$pred)^2))

price_ols <- lm(
  resale_price ~ storey_range + floor_area_sqm + remaining_lease + edu_university,
  data = sample_train_rf)
sample_test_rf <- augment(price_ols, newdata = sample_test_rf)
rmse_lm <- sqrt(mean((sample_test_rf$resale_price - sample_test_rf$.fitted)^2))

c(list(rmse_rf = rmse_rf, rmse_lm = rmse_lm))
## $rmse_rf
## [1] 65.57966
## $rmse_lm
## [1] 95.49317

The RMSE of random forests is lower indeed than that of a simple linear regression. This indicates that random forests performance was better at predicting the price of unseen resale transactions. Exercise

Add a categorical variable to the model. How does the model perform now?

12.2.4 Interpreting random forests

In terms of interpretation, random forests are not as clear-cut as decision trees, since they return an aggregate of many decision trees’ output. There is still useful information we can glean from them, in particular how important are the variables we include in our model.

One measure of importance could be checking the minimum depth of the variable in each single decision tree. If the first decision a tree makes is based on variable \(x\), then the minimum depth of \(x\) is 0. If only the second decision is based on \(x\), then the minimum depth of \(x\) is 1. As we explained before, the earlier a variable is placed in the tree, the more “influential” it is, since it determines where the decision will branch earlier on.

With the randomForestExplainer library, we can plot the distribution of minimum depths for each variable, across all our individual decision trees in the forest.

# this takes a long time to run, so we save it to the disk
# you can load the .rds file if you prefer, or run it yourself

# mdd <- min_depth_distribution(price_rf)
# saveRDS(mdd, here::here("data/random_forest_mdd.rds"))

mdd <- readRDS(here::here("data/random_forest_mdd.rds"))

We can also look at the interaction of two variables on the response variable.

plot_predict_interaction(price_rf, sample_train_rf, "floor_area_sqm", "remaining_lease")

You can even get a whole report by running explain_forest! Check out the one I obtained here.

# we need to add localImp = TRUE to get the whole report
price_rf <- randomForest(
  resale_price ~ storey_range + floor_area_sqm + remaining_lease + edu_university,
  data = sample_train_rf, ntree = 100, localImp = TRUE)

# comment it out and run it in a console rather than leave it in a RMarkdown notebook.
# explain_forest(price_rf, interactions = TRUE, data = sample_train_rf)

  1. To see why it is more versatile, think about the question “What are the residuals of the logistic regression model?”↩︎

  2. Decision trees have no respect for gravity: we often speak of “high” to mean closer to the root, i.e, an early decision↩︎