# Chapter 4 Sample statistics

## 4.1 Introduction

In the previous week, you have explored the different variables in the HDB dataset through a series of descriptive statistics that summarize the central tendency and dispersion of each variable. However, with this dataset, as with many real-world datasets we have to distinguish between the ‘real’ population and the sample. For a longer discussion of this difference, please refer back to Chapter 6 in Burt et al. In short, the sample is most often a subset of all ‘real’ data points. This can be because it is cheaper to collect a sample (e.g. this is why surveys are done on only a small subset of a population) or simply because data on the entire population does not exist. The good thing is that we can, powered with some statistical tools, still make inferences about the population with a relatively small sample. The best news is that a sample introduces two potential reasons for statistics on the sample to deviate from the population: sampling error and sampling bias. Sampling error is present in all samples: it is simply the uncertainty caused by the sample being - by definition - different from the population. Sampling bias is slightly different: it is introduced by the sampling process not being entirely random. This can cause certain people or objects to have a higher chance to be included in the sample. We see this a lot in social science research and always need to be aware. In this week, we will focus on sampling error only.

## 4.2 Sampling resale transactions

In the previous week we calculated the mean resale price for our entire dataset.

sales <- readRDS(here::here("data/sales.rds"))

mu <- sales %>%
summarize(stat = mean(resale_price))

mu
## # A tibble: 1 x 1
##      stat
##     <dbl>
## 1 439793.

The mean resale price across Singapore is $439,793 for the period 2015 to now. If we assume our datasets consists of all transactions in Singapore during that period (the population), we can simulate taking a sample from this dataset and calculate the mean for that sample. For example, let’s randomly sample only 100 transactions and see what the mean is for that sample. sales %>% sample_n(100) %>% summarize(stat = mean(resale_price)) ## # A tibble: 1 x 1 ## stat ## <dbl> ## 1 424047. Rerun this chunk of code a few times yourself. What do you observe? Is the mean of our sample always close the population mean? We can automate this re-running with R. We create a simple function that takes a sample and calculates the mean and execute that function multiple times with purrr’s rerun. sample_mean <- function(data, size = 100) { data %>% sample_n(size) %>% summarize(stat = mean(resale_price)) } sampled_means <- rerun(500, sample_mean(sales)) %>% bind_rows() What does the sampled_means table represent? Let’s look at the sampled means visually and draw a vertical line to indicate the population mean. sampled_means %>% ggplot(aes(x = stat)) + geom_histogram() + geom_vline(xintercept=mu$stat)

In this case, we know that the ‘real’ mean is about 440,000. However, quite a few of our perfectly random samples (we have no sampling bias!) have a mean resale price that’s quite different from the population.

Adapt the code above to increase the size of each sample from 100 to 5000. How does the histogram change? Then change the sample size again from 5000 to 15000. If a higher sample size leads to a better ‘prediction’ of the population mean, should we just always aim for the highest possible sample size?

## 4.3 Confidence intervals and bootstrapping

Here’s some good news: in most cases we are not interested in knowing the population mean exactly. We want to know the population mean approximately. In other words, we want to be reasonably certain that our sample mean is close to the population mean. In statistical language, we can do this through confidence intervals. You might have seen this before in newspapers or scientific articles (or calculated them yourself). It is often described something along these lines: ‘we are 95% certain that the mean resale price of HDB flats in Singapore is between 438,750 and 440,750’.

We can calculate confidence interval through a process called bootstrapping. To do so we are going to take a sample of size n = 500 observations.

sales_sample <- sales %>%
sample_n(500)

We are also going to keep in mind the sample mean of our sample.

x_bar <- sales_sample %>%
summarize(stat = mean(resale_price))

x_bar
## # A tibble: 1 x 1
##      stat
##     <dbl>
## 1 438806.

We will use the infer package (see docs here) to create the confidence intervals. It has a number of useful functions that can be flexibly combined to do all kinds of statistical inference (i.e. determining properties of a population based on a sample). We will largely follow the chapter on confidence intervals from Modern Dive here.

sales_sample %>%
specify(response = resale_price) %>%
calculate(stat = "mean")
## # A tibble: 1 x 1
##      stat
##     <dbl>
## 1 438806.

As you can see, we can also calculate our sample mean with infer. It has a slightly different syntax: we first specify a response variable (resale_price) and then indicate which statistic we would like to calculate.

### 4.3.1 Bootstrapping

We can use this same procedure but introduce one more step: generate. This will generate a ‘bootstrapped’ sample. A bootstrap procedure simply takes a sample that is equally sized to the entire dataset (79,100 rows in this case) and it specifcally takes this sample with replacement. This means that we copy one observation from our dataset and add it to the sample but we do not remove this observation from the dataset. We keep doing this until our sample is equal in size to our dataset (see Burt Chapter 10.7 for an additional explanation).

bootstrapped_resale_price <- sales_sample %>%
specify(response = resale_price) %>%
generate(reps = 500) %>%
calculate(stat = "mean")

Inspect the bootstrapped_resale_price object - does it look similar to the sampled_means object?

Once we have the bootstrapped means, we can visualize the histogram again:

bootstrapped_resale_price %>%
visualise() +

## 4.4 Comparing means between groups

Finally, assume somebody claims that flats in Queenstown are, on average, more expensive than those in Marine Parade. With the sales data you could verify this claim:

town_means <- sales %>%
filter(town == "QUEENSTOWN" | town == "MARINE PARADE") %>%
group_by(town) %>%
summarise(mean = mean(resale_price))
town_means
## # A tibble: 2 x 2
##   town             mean
##   <chr>           <dbl>
## 2 QUEENSTOWN    549037.
ggplot() +
geom_histogram(
data = sales %>%
filter(town == "QUEENSTOWN" | town == "MARINE PARADE"),
mapping = aes(x = resale_price, y = ..density.., group = town, fill = town),
position = position_dodge()
) +
geom_vline(
data = town_means,
mapping = aes(xintercept = mean, group = town, color = town)
)

They’re right! Queenstown flats are about $8,500 more expensive. However, could this difference between the two towns be caused by sampling error instead? After all, the means plotted above are pretty close to each other. In other words, are the two means significantly different? A first step might be to look at the confidence interval for the Queenstown flats bootstrapped_resale_price_qt <- sales %>% filter(town == "QUEENSTOWN") %>% specify(response = resale_price) %>% generate(reps = 100) %>% calculate(stat = "mean") bootstrapped_resale_price_qt %>% get_ci() ## # A tibble: 1 x 2 ## lower_ci upper_ci ## <dbl> <dbl> ## 1 540845. 558184. You obtained the confidence interval for the mean resale price of Marine Parade above, in Section 4.3.3. Do the confidence intervals overlap? What do you think overlapping confidence intervals indicate about the significant difference between the two means? It is a common error to assume that if confidence intervals overlap this means there is no significant difference. This is not necessarily true (you can read why here and here). In short: if the confidence intervals do not overlap, you know for certain there is a significant difference. If they overlap, there might or might not be a significant difference. We can test the significance of the difference more formally with infer as well (please read Modern Dive Chapter 9 if you are new to hypothesis testing). First we calculate a new stat, the difference between the two means. mean_diff <- sales %>% filter(town == "MARINE PARADE" | town == "QUEENSTOWN") %>% specify(formula = resale_price ~ town) %>% calculate(stat = "diff in means", order = c("QUEENSTOWN", "MARINE PARADE")) mean_diff ## # A tibble: 1 x 1 ## stat ## <dbl> ## 1 8628. We then generate a set of sampled datasets. But instead of doing a simple bootstrap, we generate each sample based on the null hypothesis, namely that there is no difference between the mean of the two towns. null_distribution <- sales %>% filter(town == "MARINE PARADE" | town == "QUEENSTOWN") %>% specify(formula = resale_price ~ town) %>% hypothesize(null = "independence") %>% generate(reps = 500, type = "permute") %>% calculate(stat = "diff in means", order = c("QUEENSTOWN", "MARINE PARADE")) Once we have done this, we can visualize the result. Since our hypothesis was that Queenstown resales are significantly higher than Marine Parade, we set the direction argument of p-value functions to greater (note than we specified an order in the calculate function of the previous code chunk). null_distribution %>% visualise(bins = 100) + shade_p_value(obs_stat = mean_diff, direction = "greater") Based on this visualization, if you had to take a guess on the p-value, what would it be? Let’s see if you are right: null_distribution %>% get_pvalue(obs_stat = mean_diff, direction = "greater") There’s another way to look at this too: by calculating the confidence interval of the difference between the two means. Remember that earlier we calculated that the difference in our sample is about 8600 (Queenstown is$8600 more expensive than Marine Parade).

sales %>%
filter(town == "MARINE PARADE" | town == "QUEENSTOWN") %>%
specify(formula = resale_price ~ town) %>%
# hypothesize(null = "independence") %>% # disable the null hypothesis
generate(reps = 500, type = "bootstrap") %>% # set type to bootstrap
calculate(stat = "diff in means", order = c("QUEENSTOWN", "MARINE PARADE")) %>%
get_ci()
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1  -10475.   28136.

What does the confidence interval tell you in this case? Use shade_confidence_interval to plot the CI.

Exercise 4: In the RStudio help, find out what the direction argument does for the get_p_value and shade_p_value functions. Try setting it to both on the data above.

Exercise 5: Are the resale prices means between Marine Parade and Bukit Timah significantly different?

Exercise 6: During the previous week, we saw that the price of a flat can be quite different depending on what storey the flat is located on. Use the same procedure to see if the average price of 3 room flats on storey 04-06 is significantly different from 3 room flats on storey 07-09.