Chapter 3 Descriptive statistics and visualisations

3.1 Introduction

This week we will investigate the HDB resale data more thoroughly. To make the data available within your Rmarkdown notebook, you have to reload it. You can do this by executing the data loading code again.

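library(tidyverse) # read_csv(), mutate(), as_factor(), %>%
library(lubridate) # ymd()

sales <- read_csv(here::here("data/hdb_resale_2015_onwards.csv")) %>%
  mutate(
    month = ymd(month, truncated = 1), # truncated = 1 accepts dates without a day
    flat_type = as_factor(flat_type),
    storey_range = as_factor(storey_range),
    flat_model = as_factor(flat_model)
  )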

However, the data loading and cleaning steps can often be quite ‘expensive’: they take a long time to run. To cut down on this, in practice these steps are often split off into a separate Rmd file. You can save an object to disk with saveRDS() and load it again with readRDS(). So in this case, you could save the sales object at the end of your first notebook and read it back in at the start of the next one.

# At the end of the first (loading and cleaning) notebook
saveRDS(sales, here::here("data/sales.rds"))

# At the start of any later notebook
sales <- readRDS(here::here("data/sales.rds"))
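One reason to prefer an .rds file over re-reading the CSV is that it stores the R object itself, so column types such as dates and factors come back as they were saved. A minimal sketch of that round trip, using a small made-up data frame (toy_sales is a hypothetical stand-in, not the real HDB data) and a temporary file:

```r
# Toy stand-in for the sales data (made-up values, not real HDB records)
toy_sales <- data.frame(
  month = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01")),
  flat_type = factor(c("3 ROOM", "4 ROOM", "3 ROOM")),
  floor_area_sqm = c(67, 92, 68)
)

# Write to a temporary file so the example leaves nothing behind
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_sales, rds_path)
restored <- readRDS(rds_path)

# Column types survive the round trip: Date stays Date, factor stays factor
identical(toy_sales, restored)
class(restored$month)
class(restored$flat_type)
```

With a CSV, the factor and date conversions in the mutate() step would have to be repeated after every read.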

3.2 Central Tendency

In the previous week, we answered a few initial questions about the dataset (e.g. ‘what is the most expensive flat in Punggol?’). These questions were all descriptive and meant to give a sense of the distribution of different variables in our dataset. As we learn from Burt et al.’s Describing Data with Statistics chapter, we also have a suite of more quantitative measures available, often referred to as descriptive statistics. The chapter discusses two different types of such statistics: measures of central tendency and measures of dispersion. Once we understand what these statistics do conceptually, it is very straightforward to calculate them in R. The measures of central tendency discussed in the chapter are:

• median
• mean
• mode

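Mean (mean()) and median (median()) already exist in base R. The statistical mode does not (base R’s mode() returns the storage mode of an object, not its most frequent value), but we can easily create our own function: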

# from https://stackoverflow.com/a/25635740
# named manual_mode because base R already has a function called mode()
manual_mode <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }

  ux <- unique(x)
  # count occurrences of each unique value and return the most frequent one
  return(ux[which.max(tabulate(match(x, ux)))])
}
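One property of this function is worth knowing: which.max() returns only the first maximum, so when several values are tied for most frequent, manual_mode() reports just the one that appears first in the data. The sketch below is a hypothetical variant (all_modes is not part of the chapter or of base R) that returns every tied value, illustrated on made-up numbers:

```r
# Hypothetical helper: return every value tied for most frequent
all_modes <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  ux <- unique(x)
  counts <- tabulate(match(x, ux)) # occurrences of each unique value
  ux[counts == max(counts)]
}

areas <- c(67, 67, 92, 92, 110) # made-up floor areas with a two-way tie
all_modes(areas)
```

For a variable with a single clear peak, both approaches give the same answer.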

We can now use these three functions to describe the central tendency of each of our variables, for example:

mean(sales$floor_area_sqm)
## [1] 97.58903
median(sales$floor_area_sqm)
## [1] 96
manual_mode(sales$floor_area_sqm)
## [1] 67

What do these three measures tell you about the floor area variable?

3.3 Dispersion

Similarly, R has a built-in set of functions for dispersion statistics. Remember, while the measures of central tendency give you an indication of a typical value, the measures of dispersion give you an indication of the spread of that variable around the central tendency (most often, the mean). We can calculate these with R:

# Range
max(sales$floor_area_sqm) - min(sales$floor_area_sqm)
# Interquartile Range
IQR(sales$floor_area_sqm)

# Standard Deviation
sd(sales$floor_area_sqm)
# Coefficient of variation
sd(sales$floor_area_sqm) / mean(sales$floor_area_sqm)
# Kurtosis and skewness (measures of shape), from the e1071 package
kurtosis(sales$floor_area_sqm)
skewness(sales$floor_area_sqm)

Together with the measures of central tendency, are you able to understand the distribution of the floor area variable based on these measures?

Note that if you want a quick summary of a variable, running each of these descriptive statistics one by one isn’t very efficient. You can use R’s built-in summary() function or the ‘tidy’ version skim() (from the skimr package) to get an immediate overview of many measures.

summary(sales$floor_area_sqm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   31.00   76.00   96.00   97.59  112.00  280.00
skim(sales$floor_area_sqm)

Name: sales$floor_area_sqm
Number of rows: 79100
Number of columns: 1
Column type frequency: numeric 1
Group variables: None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean   sd     p0  p25  p50  p75  p100  hist
data           0          1              97.59  24.22  31  76   96   112  280   ▃▇▁▁▁
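To make it clearer what the dispersion and shape functions used in this section compute, the sketch below works them out by hand in base R on a short made-up vector. The skewness and excess kurtosis here use the simple moment-based formulas; the e1071 functions offer several estimator types, so their results may differ slightly from these.

```r
x <- c(60, 67, 67, 75, 90, 110, 150) # made-up floor areas, right-skewed
n <- length(x)
dev <- x - mean(x) # deviations from the mean

# Sample variance and standard deviation (n - 1 in the denominator, as in sd())
variance <- sum(dev^2) / (n - 1)
std_dev <- sqrt(variance)

# Coefficient of variation: spread relative to the mean
cv <- std_dev / mean(x)

# Moment-based shape measures (n in the denominator)
m2 <- mean(dev^2)
skew <- mean(dev^3) / m2^1.5
excess_kurt <- mean(dev^4) / m2^2 - 3

all.equal(std_dev, sd(x)) # the hand calculation agrees with sd()
c(sd = std_dev, cv = cv, skewness = skew, excess_kurtosis = excess_kurt)
```

A positive skewness, as in this toy vector, indicates a longer tail towards large values.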

3.4 Visualization

So far, we have explored the floor area variable with a range of different statistical measures. However, Anscombe’s Quartet tells us that we should not blindly trust summary statistics by themselves: datasets with nearly identical means and standard deviations can look completely different when plotted. Combining these statistics with a visual exploration is therefore advisable. Based on what you now know about the floor area of resale flats, can you visualize (in your head) the histogram of this variable? Let’s see if you were about right. Note that you can make use of Healy’s Data Visualization Chapter 3 to refresh your memory on plotting with R/ggplot2. For example:

ggplot(sales, aes(x = floor_area_sqm)) +
  geom_histogram(binwidth = 5)

If we overlay this with a normal distribution, the difference is clear:

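ggplot(sales, aes(x = floor_area_sqm)) +
  # after_stat(density) replaces the older ..density.. notation (needs ggplot2 3.3.0 or later)
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  stat_function(fun = dnorm, args = list(mean = mean(sales$floor_area_sqm), sd = sd(sales$floor_area_sqm)))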

This isn’t all that surprising. Social data, especially data where governments and policy have a strong impact, very often do not follow normal distributions very strictly. In this case, there might very well be specific historical or policy reasons for that ‘weird’ peak on the left side of the graph. It coincides with our mode (calculated earlier). We can inspect what kind of flats have exactly this much floor area:

sales %>%
  filter(floor_area_sqm == 67) %>%
  View()

Can you think of specific contextual reasons why there might be so many flats with 66-70 sq meters of floor area?

There are a few other visualizations that are useful to explore distributions of a single variable. First there is the boxplot: a visual summary of the median, IQR and outliers.

ggplot(sales, aes(x = 1, y = floor_area_sqm)) +
  geom_boxplot()
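The numbers behind a boxplot can also be inspected directly. The sketch below uses base R’s boxplot.stats() on a made-up vector to show the five values that define the box and whiskers, plus the points flagged as outliers (those more than 1.5 times the box height beyond the box). Note that geom_boxplot() computes the box edges as the 25th and 75th percentiles, which can differ slightly from the ‘hinges’ used here.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30) # made-up values with one extreme point

bp <- boxplot.stats(x) # default coef = 1.5
bp$stats # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out   # points beyond the whiskers, drawn individually in the plot
```

The upper whisker stops at the largest value that is not an outlier, which is why it ends well below the extreme point.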

With the boxplot, we ‘lose’ a bit of insight into the distribution within the IQR (after all, this is just a box). We can alleviate this by drawing violin plots instead.

ggplot(sales, aes(x = 1, y = floor_area_sqm)) +
  geom_violin()

So far, we have only looked at the distribution of a variable for an entire dataset. In practice, these distributions often look very different for different subsets of a dataset. For example, in the case of HDB data, the floor area distribution will be very different for the different flat types. Visualizations are particularly useful to explore distributions within subgroups. In ggplot, we can do that with facets:

ggplot(sales, aes(x = floor_area_sqm)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(vars(flat_type), scales = "free_y")

Violin plots and boxplots can be useful in this scenario as well:

ggplot(sales, aes(x = flat_type, y = floor_area_sqm)) +
  geom_violin()
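The same subgroup logic applies to the descriptive statistics themselves. As a minimal base R sketch, the example below splits a made-up data frame (toy is a hypothetical stand-in that borrows two column names from the sales data) by flat type and computes a few statistics per group. In a tidyverse workflow the same idea is usually written with group_by() and summarise().

```r
# Made-up values for illustration only, not real HDB records
toy <- data.frame(
  flat_type = factor(c("3 ROOM", "3 ROOM", "3 ROOM", "4 ROOM", "4 ROOM", "4 ROOM")),
  floor_area_sqm = c(65, 67, 70, 90, 92, 100)
)

# Split the floor areas by flat type, then summarise each group
by_type <- sapply(
  split(toy$floor_area_sqm, toy$flat_type),
  function(v) c(n = length(v), mean = mean(v), median = median(v), sd = sd(v))
)

t(by_type) # one row per flat type, one column per statistic
```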

3.5 Assignment (Monday, February 17th, 23:59)

For your first assignment, you will conduct an extensive exploration of the distribution of all the variables in our dataset. You will have to integrate at least the following:

• Summarize the continuous variables (area, price, remaining lease) as well as the nominal/ordinal variables (month, flat_type, town, flat_model, storey_range), in both table form (statistics on central tendency and dispersion) and visual form.
• Analyze the distribution of (some of) these variables for different subsets of the data. For example, explore the difference between towns, or between flat types.
• Analyze the distribution of at least one variable for unique combinations of town and flat_type (for each town, for each flat type: Ang Mo Kio, 1 room; Ang Mo Kio 2 room; etc.)
• Analyze change in resale price per square meter over time. Use a 6-month moving average to do so.
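For the moving average in the last point, one possible approach in base R is stats::filter() with equal weights. The sketch below is only an illustration on a made-up monthly series (monthly_price is hypothetical), and it assumes a trailing window, where each value averages the current month and the five months before it; whether a trailing or centred window suits your analysis is a choice for you to make and justify.

```r
# Made-up price per square meter, one value per month
monthly_price <- c(4200, 4250, 4300, 4280, 4350, 4400, 4380, 4420)

# Trailing 6-month mean (sides = 1 uses past values only).
# stats::filter is spelled out because dplyr::filter masks it once the tidyverse is loaded.
ma6 <- as.numeric(stats::filter(monthly_price, rep(1 / 6, 6), sides = 1))

ma6 # the first five entries are NA: fewer than six months are available
```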

Make sure you answer the points listed above but also provide a short introduction to the dataset (what is it? why is it interesting?). Take care not to just create a bunch of tables and visualizations: explain to the reader what they are seeing and pick out interesting patterns. After reading your report, the reader (and you!) should have a solid understanding of the distribution of the variables in the HDB dataset and, ideally, have a series of observations to explore more in-depth in subsequent analyses. To create nice looking tables, you can make use of the gt package.

3.5.1 Submitting the assignment

• Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your data folder with the here library). It should be saved in the assignments/ folder, as assignment1.Rmd.
• Change the output parameter to github_document so that your assignment has a visual representation on GitHub.
• Do not forget to run styler on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
• Once you are finished, navigate to your own GitHub repository at https://github.com/02522-cua/[your-name]. Create a new issue and title it ‘Assignment 1 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and mention your prof (include @barnabemonnot in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

1. The Rmarkdown does not run, and the description and contextualization are incomplete.
2. The Rmarkdown does not run, or runs but doesn’t produce the expected output. The description and contextualization of the code and its output are limited.
3. The Rmarkdown runs and the code is clearly structured, but the description and contextualization of the code and its output are limited.
4. The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
5. As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.