# Chapter 7 Principal Component Analysis (PCA)

## 7.1 Introduction

In the previous section, we used Multi-Dimensional Scaling (MDS) as a dimension reduction technique. Although MDS can be quite effective in its own right, the interpretation of the resulting plots can sometimes be difficult - especially because the axes lose their explicit meaning. There are other dimension reduction strategies available that might help in this regard. One of these strategies is called factor analysis: the analysis of underlying or *latent* ‘factors’ present in larger set of variables or dimensions (e.g. income and education could both be dimensions of an underlying factor ‘class’). A special case of factor analysis is called Principal Component Analysis (PCA). PCA allows us to show which variables/dimensions in a dataset are correlated, potentially allowing them to be expressed in fewer dimensions or ‘components’. It also has another useful feature for subsequent analytical procedures: the components derived from PCA are, by definition, uncorrelated which is often a requirement for (for example) regression analyses. For a longer discussion and a worked example of PCA, see Boehmke’s US crime rates notebook or Wyly’s background paper and play with the interactive example made by Victor Powell and Lewis Lehe.

## 7.2 PCA with two variables

Conducting PCA in R is relatively straightforward. It has a built-in function for PCA (`prcomp`

) but we will also use the `broom`

, `ggfortify`

and `psych`

packages to help make our analysis a bit easier.

We first give some of the intuition behind PCA by restricting our analysis to two variables only. Our goal will be to *describe* our observations as succinctly as possible, while providing the maximum amount of information. If our data had no variation and all neighbourhoods had the same two values for `hh_income_gt_17500`

and `area_res`

, we wouldn’t need to summarise it more than knowing the two values. But the data varies, as evidenced by the following plot:

```
read_csv(here::here("data/planning_areas.csv"))
planning_areas <-
%>%
planning_areas select(planning_area, area_res, hh_income_gt_17500) %>%
ggplot() +
geom_point(aes(x = area_res, y = hh_income_gt_17500))
```

`area_res`

varies between 0 and 0.15, while `hh_income_gt_17500`

varies between a little below 0.1 and a little over 0.5. PCA attempts to explain this variation by capturing as much of it as possible. This is done by extracting *components* (sometimes called *factors*, although factor analysis is related but different from PCA), with the first component extracted to explain as much variation as possible. Let’s see what this means on our example.

### 7.2.1 Scaling

First we scale our data as we have done in MDS, this time using the `scale`

function. Scaling a variable subtracts the mean of that variable from all observations, then divides all observations by the standard deviation of the variable. In other words, you can check that the following two expressions give you the same result.

`scale(planning_areas$area_res)[1:6] # showing first six observations only`

`## [1] 0.1432271 1.3028069 0.5597262 -0.4560503 -0.8413429 -0.4478156`

```
%>%
planning_areas select(area_res) %>%
mutate(area_res_scaled = (area_res - mean(area_res)) / sd(area_res)) %>%
head()
```

```
## # A tibble: 6 x 2
## area_res area_res_scaled
## <dbl> <dbl>
## 1 0.0401 0.143
## 2 0.0836 1.30
## 3 0.0557 0.560
## 4 0.0176 -0.456
## 5 0.00310 -0.841
## 6 0.0179 -0.448
```

Let’s decompose what is happening when we apply `scale`

.

- First, we look at the distribution of
`area_res`

.

```
%>%
planning_areas ggplot() +
geom_histogram(aes(x = area_res), bins = 30)
```

- Second, we subtract the mean of
`area_res`

from all the observations to obtain a shifted`area_res`

. Note that this does not change the shape of the distribution (up to some artefacts from the binning), but shifted it to the left. The distribution is now*centred at zero*: the mean of our shifted`area_res`

is 0.

```
%>%
planning_areas ggplot() +
geom_histogram(aes(x = area_res - mean(area_res)), bins = 30)
```

- Finally, we divide by the standard deviation to scale the shifted
`area_res`

. Again the shape does not change but you can observe that the variable takes values over a larger interval.

```
%>%
planning_areas ggplot() +
geom_histogram(aes(x = (area_res - mean(area_res)) / sd(area_res)), bins = 30)
```

The idea is to apply the same procedure to all our variables. You can see the result below on some of the variables. We combine a few commands below:

- To scale our variables, we cast them into a matrix and apply the
`scale`

function. - We cast the matrix back to a tibble with
`as_tibble()`

. - We add back the planning area names with
`bind_cols`

.

```
library(patchwork)
# Choose some variables from our dataset
planning_areas %>%
scale_example_data <- select(
planning_area,starts_with("area"),
starts_with("age")
)
# Cast dataset to a long format
scale_example_data %>%
long_data <- pivot_longer(-planning_area)
# Scale dataset and cast it to a long format
scale(
long_data_scaled <-%>%
scale_example_data select(-planning_area) %>%
as.matrix()
%>%
) as_tibble() %>%
bind_cols(scale_example_data %>% select(planning_area)) %>% # add area names back
pivot_longer(-planning_area)
# Compare plots
%>%
long_data ggplot() +
geom_histogram(aes(x = value)) +
facet_wrap(vars(name)) |
long_data_scaled %>%
ggplot() +
geom_histogram(aes(x = value)) +
facet_wrap(vars(name))
```

Let’s scale our two variables `area_res`

and `hh_income_gt_17500`

and plot them on a scatterplot.

```
scale(
planning_areas_scaled <-%>%
planning_areas select(hh_income_gt_17500, area_res) %>%
as.matrix()) %>%
as_tibble() %>%
bind_cols(planning_areas %>% select(planning_area))
%>%
planning_areas_scaled ggplot() +
geom_text(aes(x = area_res, y = hh_income_gt_17500, label = planning_area))
```

Note that scaling does not change the relative positions of the planning areas in the `hh_income_gt_17500`

/ `area_res`

scatterplot, only their coordinates. Now let’s extract our first component.

### 7.2.2 First component extraction

```
library(broom)
# `prcomp` extracts principal components for us
planning_areas_scaled %>%
pc <- column_to_rownames(var = "planning_area") %>%
select(area_res, hh_income_gt_17500) %>%
prcomp(.)
tidy(pc, "pcs") %>% # summary information on the components
filter(PC == 1) # only look at the first component
```

```
## # A tibble: 1 x 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.33 0.879 0.879
```

The `percent`

variable refers to the amount of total variance explained by the first component. 88% is a high number! Now let’s see if we can glean more information about this component.

```
tidy(pc, "variables") %>% spread(key = "PC", value = "value") %>%
factors <- select(column, "1")
factors
```

```
## # A tibble: 2 x 2
## column `1`
## <chr> <dbl>
## 1 area_res -0.707
## 2 hh_income_gt_17500 -0.707
```

The two numbers in the “1” column above give us the “direction” of the first component. We can plot these directions in a *directions plot*.

```
tidy(pc, "variables") %>%
filter(PC == 1) %>%
ggplot(aes(x = column, y = value)) +
geom_hline(yintercept = 0) +
geom_col() +
ylim(-1,1) +
coord_flip() +
facet_grid(~PC)
```

Note that we can represent it with points too:

```
tidy(pc, "variables") %>%
filter(PC == 1) %>%
ggplot(aes(x = column, y = value)) +
geom_hline(yintercept = 0) +
geom_point(size = 3) +
ylim(-1,1) +
coord_flip() +
facet_grid(~PC)
```

We can also plot the component’s direction in our initial scatterplot. Since we scaled our variables, the component is an arrow (“vector”) originating from the center of the plot.

This should remind you of the linear regression. There is a rather positive relationship between `hh_income_gt_17500`

and `area_res`

(the higher one is, the higher we expect the other to be as well). Meanwhile, linear regression gave us the line that explained the most variation (by minimising the sum of squared residuals). The first component encapsulates this relationship by telling us that the data moves along the arrow, either in the same direction (the smaller `area_res`

, the smaller `hh_income_gt_17500`

) or in the opposite (the higher `area_res`

, the higher `hh_income_gt_17500`

).

### 7.2.3 Second component extraction

Before going further, let’s extract *another* component. Note that there is an order to the operations: the first component explains the most variation possible (in this case 88%). Then the second component explains the most variation possible among the variation remaining (12% remain).

```
tidy(pc, "pcs") %>%
filter(PC == 2) # now we look at the second component
```

```
## # A tibble: 1 x 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 2 0.492 0.121 1
```

We can also get the directions plot for the two components.

```
tidy(pc, "variables") %>%
ggplot(aes(x = column, y = value)) +
geom_hline(yintercept = 0) +
geom_col() +
ylim(-1,1) +
coord_flip() +
facet_grid(~PC)
```

The second component we extract explains 12% of the total variation, and so explains all the variation left! Let’s see why this is not surprising by plotting the second component.

The second component points to a different direction. While the first represented the positive relationship between `hh_income_gt_17500`

and `area_res`

, the second says “this is not a perfect relationship, there is still a bit of variation left on either side of the line”. In particular, now that we have two arrows, we could write any planning area in our dataset as a mixture of the first and the second arrow, since they define a coordinate system. This is done with the following chunk.

```
tidy(pc) %>% spread(key = "PC", value = "value") %>%
rename(component1 = "1", component2 = "2") # position of each point on the principal components
```

```
## # A tibble: 28 x 3
## row component1 component2
## <chr> <dbl> <dbl>
## 1 ANG MO KIO 0.255 -0.458
## 2 BEDOK -1.04 -0.805
## 3 BISHAN -0.922 0.130
## 4 BUKIT BATOK 0.527 0.118
## 5 BUKIT MERAH 0.968 0.222
## 6 BUKIT PANJANG 0.570 0.0635
## 7 BUKIT TIMAH -3.74 0.786
## 8 CHOA CHU KANG 0.827 0.0933
## 9 CLEMENTI 0.0485 0.0551
## 10 GEYLANG 0.350 -0.490
## # … with 18 more rows
```

Let’s look at Bukit Timah and understand what these numbers mean. The coefficient of component 1 for Bukit Timah is -3.74, while the coefficient of component 2 is 0.79. This means that we can understand Bukit Timah as “a lot of negative component 1, and some amount of component 2”. We can plot all these coefficients in a scatterplot, called a *loadings plot*.

```
tidy(pc) %>% spread(key = "PC", value = "value") %>%
rename(component1 = "1", component2 = "2") %>%
ggplot() +
geom_text(aes(x = component1, y = component2, label = row))
```

In the plot above, we express each planning area in terms of component 1 and component 2, instead of in terms of `hh_income_gt_17500`

and `area_res`

. Why is that useful? Because each component captures some aspect of the relationship between `hh_income_gt_17500`

and `area_res`

, the first one capturing their positive relationship, and the second one correcting for the additional variance that the first could not explain. We can plot the *loadings* of each variable onto the loading plot, represented by arrows.

```
library(ggfortify)
autoplot(pc, label = TRUE, x = 1, y = 2, loadings = T, loadings.label = T)
```

Now the plots start to make sense. Points moving towards the direction of the `hh_income_gt_17500`

are those for which this variable has a high value, and the same conclusion holds for `area_res`

. The resulting plot however is not much more informative than the one we started with: it is simply rotated and has two extra red arrows. This is something we already observed with MDS. Starting with a 2D dataset and reducing it to a 2D dataset moves things around, giving perhaps a bit more clarity but not much more interpretation. So let’s look at PCA applied on more variables of our dataset to see where the method shines.

## 7.3 PCA with several variables

The strength of PCA really comes out once our dataset has more than two variables. Let’s analyse all of the age, occupation, area, dwelling and household variables we have available in our dataset.

```
planning_areas %>%
pc <- column_to_rownames(var = "planning_area") %>%
select(starts_with("age"),
starts_with("occupation"),
starts_with("area"),
starts_with("dwelling"),
starts_with("hh")) %>%
prcomp(., center = T, scale. = T) # instead of scaling our data first, we can let `prcomp` do it for us
```

When we inspect the results of our PCA, you notice that you get as many components back as there were variables in your dataset (14). PCA will always give back an equal number of components compared to your input variables. Together, all of these components explain 100% of the variance in your data. But, and this is where things get useful, the first few components explain much more variance! In our case, the first three components explain 87% of all variance.

`tidy(pc, "pcs") # summary information on the components`

```
## # A tibble: 14 x 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 2.43 0.422 0.422
## 2 2 2.22 0.353 0.774
## 3 3 1.14 0.0935 0.868
## 4 4 0.843 0.0507 0.919
## 5 5 0.628 0.0282 0.947
## 6 6 0.503 0.0181 0.965
## 7 7 0.415 0.0123 0.977
## 8 8 0.355 0.00901 0.986
## 9 9 0.279 0.00554 0.992
## 10 10 0.217 0.00337 0.995
## 11 11 0.194 0.00268 0.998
## 12 12 0.137 0.00133 0.999
## 13 13 0.0921 0.00061 1.00
## 14 14 0.0633 0.00029 1
```

### 7.3.1 Deciding how many components to keep

How many componenents should we keep? There are different ways to decide which components to cut off.

The first method looks for the elbow in the so-called scree plot, which is just a line plot of the explained variance of each components. Where would you place the cut-off here?

```
tidy(pc, "pcs") %>%
ggplot(aes(x = PC, y = percent)) +
geom_line() +
geom_text(aes(x = PC, y = percent, label = PC), nudge_y = 0.03)
```

The second method picks component until some threshold of explained variance is reached (usually, 90%). `tidy(pc, "pcs")`

above returned the cumulative variance explained. We can use it to plot the cumulative scree plot and place our cut-off.

```
tidy(pc, "pcs") %>%
ggplot(aes(x = PC, y = cumulative)) +
geom_hline(yintercept = 0.9, color = "orange") +
geom_line() +
geom_text(aes(x = PC, y = cumulative, label = PC), nudge_y = 0.03)
```

### 7.3.2 Directions plot

First, we can look at the directions plot, to ask how variables tend to vary with each other.

```
tidy(pc, "variables") %>%
filter(PC < 5) %>% # only show first 4 components
ggplot(aes(x = column, y = value)) +
geom_hline(yintercept = 0) +
geom_col(aes(fill=(value >= 0)),
show.legend = FALSE) +
coord_flip() +
facet_grid(~PC)
```

Looking at the first component here, it seems to say that for a high value of `occupation_senior_managers_professional`

, we expect a low value of `occupation_cleaners_labourers_plant`

, a low value of `hh_income_lt_2000`

, a high value of `hh_income_gt_17500`

etc. Another interpretation: Looking at the bottom three variables on age, we can see that they are ordered similarly in the first and second components. However, the first factor has a positive loading for `area_res`

while the second factor has a positive loading for `area_hdb`

. The first factor seems to say “when we are in a younger neighbourhood, we expect more senior managers”. The second factor comes to give more clarity to this assessment: “This is not true if the neighbourhood is more HDB!” We will also see a variation on PCA that makes such statements even clearer in the next Section.

### 7.3.3 Loading plot

We can plot the results of our analysis for the first few components by hand with `ggplot`

as we did above, but we can also do this more easily with the `autoplot`

function provided by the `ggfortify`

package.

`autoplot(pc, label = TRUE) # first two components by default`

`autoplot(pc, label = TRUE, x = 3, y = 4) # PC3 and PC4`

These plots look quite interesting and you might already be able to infer some patterns. Is there a more principled way to get at that meaning? We can do this by look at the *loadings* of each variable on the component. Loadings are nothing more than a fancy term for the correlation between the original variable and the factor. We can do this in table-form:

`tidy(pc, "variables") %>% spread(key = "PC", value = "value")`

```
## # A tibble: 14 x 15
## column `1` `2` `3` `4` `5` `6` `7` `8`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age_0… 0.121 0.103 -0.795 -0.0881 0.105 -0.0108 0.276 -0.169
## 2 age_1… 0.328 0.164 0.312 -0.129 -0.310 -0.283 0.182 0.250
## 3 age_g… -0.303 -0.285 0.140 0.0485 0.102 0.00362 0.0714 0.0669
## 4 area_… -0.0785 0.329 0.0295 -0.675 0.333 -0.320 -0.439 -0.0590
## 5 area_… -0.320 -0.0803 0.0678 -0.549 -0.343 0.601 0.0861 -0.0466
## 6 area_… 0.133 -0.344 0.221 -0.129 0.716 0.224 0.248 0.0516
## 7 dwell… -0.373 -0.0277 -0.162 -0.239 -0.0674 -0.328 0.554 0.194
## 8 dwell… 0.173 -0.396 0.0132 -0.103 -0.0713 -0.176 -0.0941 -0.305
## 9 hh_1p… -0.341 -0.221 0.0244 0.0713 -0.243 -0.168 -0.226 -0.0581
## 10 hh_5p… 0.369 0.0354 0.271 -0.212 -0.140 -0.115 0.396 -0.186
## 11 hh_in… 0.178 -0.390 -0.0326 -0.201 -0.139 -0.0810 0.0307 -0.315
## 12 hh_in… -0.349 -0.195 0.0936 0.0585 0.121 -0.439 0.112 0.0612
## 13 occup… -0.251 0.303 0.221 0.177 0.0704 -0.0416 0.168 -0.785
## 14 occup… 0.150 -0.398 -0.195 -0.0827 -0.120 -0.157 -0.240 -0.0866
## # … with 6 more variables: `9` <dbl>, `10` <dbl>, `11` <dbl>, `12` <dbl>,
## # `13` <dbl>, `14` <dbl>
```

Admittedly this is not very readable. When trying to understand the ‘meaning’ of each component, it sometimes makes sense to look at both the loadings and the observations in PCA ‘space’ at the same time. We can do so with `autoplot`

.

`autoplot(pc, label = TRUE, x = 1, y = 2, loadings = T, loadings.label = T)`

That’s a lot of arrows! Yet, this could drive our analysis to understand why Marine Parade is different from Queenstown for example. It seems for instance that `area_res`

is quite higher in Marine Parade, while `area_other`

dominates in Queenstown. We start to see groups appear and the variables that drive the differences between these groups.

#### 7.3.3.1 Exercise

Obtain the loadings plot for components 3 and 4. Does this add to your analysis?

This is what we would obtain doing MDS with the same set of variables.

```
library(distances)
planning_areas %>%
pln_area_matrix <- select(starts_with("age"),
starts_with("occupation"),
starts_with("area"),
starts_with("dwelling"),
starts_with("hh")) %>%
as.matrix()
distances(pln_area_matrix, normalize = "studentize") %>% as.matrix()
pln_area_dist <-
cmdscale(pln_area_dist) %>%
mds_planning_area <- as_tibble() %>%
rename(Y1 = V1, Y2 = V2)
# add planning area name back
planning_areas %>%
mds_planning_area <- select(planning_area) %>%
bind_cols(., mds_planning_area)
ggplot(mds_planning_area, aes(x = Y1, y = Y2)) +
geom_text(aes(label = planning_area))
```

The groups we see are not so different, but it is quite hard for us to tell why they are plotted that way. MDS in particular does not give us loadings, which tell us what drives observations to be close to each other.

## 7.4 Extensions to PCA

Finally, if our concern is purely with interpretability and finding ‘latent’ factors, we can also relax some of the restrictions that PCA brings with it. If you remember, in PCA the first component always explains as much of the variation in our dataset as possible. But this isn’t a hard requirement and we can obtain our components/factors based on other rules as well. One such rule could be that we want to maximize the loadings/correlation of individual variables on a single component. If only a single or small set of variables would load on a factor (and they wouldn’t load on the other components), this would really help our interpretation. To do this, we can use the `principal`

function from the `psych`

package.

```
library(psych)
planning_areas %>%
fa <- column_to_rownames(var = "planning_area") %>%
select(starts_with("age"),
starts_with("occupation"),
starts_with("area"),
starts_with("dwelling"),
starts_with("hh")) %>%
principal(nfactors = 4, rotate = "varimax")
fa
```

```
## Principal Components Analysis
## Call: principal(r = ., nfactors = 4, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC4 h2 u2 com
## age_0_9 -0.17 0.00 -0.97 0.01 0.97 0.029 1.1
## age_10_19 -0.93 0.00 0.17 0.11 0.91 0.094 1.1
## age_gt_65 0.88 0.24 0.37 -0.06 0.97 0.030 1.5
## occupation_senior_manager_professional 0.07 0.96 -0.13 -0.15 0.97 0.032 1.1
## occupation_cleaners_labourers_plant 0.23 -0.90 0.23 0.06 0.91 0.088 1.3
## area_hdb -0.13 -0.60 -0.10 0.71 0.90 0.104 2.1
## area_res -0.04 0.81 0.32 -0.05 0.76 0.235 1.3
## area_other 0.75 -0.06 0.19 0.51 0.86 0.143 1.9
## dwelling_1_2_room 0.87 -0.26 -0.05 0.27 0.90 0.100 1.4
## dwelling_condo_landed -0.03 0.97 0.09 -0.12 0.96 0.041 1.1
## hh_income_lt_2000 0.91 0.02 0.30 -0.03 0.92 0.079 1.2
## hh_income_gt_17500 -0.04 0.98 0.03 -0.04 0.97 0.032 1.0
## hh_1person 0.93 0.08 0.23 -0.06 0.93 0.068 1.1
## hh_5person_plus -0.90 0.31 0.15 0.12 0.94 0.064 1.3
##
## RC1 RC2 RC3 RC4
## SS loadings 5.57 4.88 1.50 0.92
## Proportion Var 0.40 0.35 0.11 0.07
## Cumulative Var 0.40 0.75 0.85 0.92
## Proportion Explained 0.43 0.38 0.12 0.07
## Cumulative Proportion 0.43 0.81 0.93 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.03
## with the empirical chi square 3.98 with prob < 1
##
## Fit based upon off diagonal values = 1
```

The object returned by `principal`

has a slightly different structure than that from `prcomp`

. By looking at the summary, you will see that the proportion of the variance explained by each component is slightly different than in the previous PCA. This is to be expected, but luckily we see that our 4 factors still explain 92% of the variance in our dataset. Let’s have a look at the resulting directions plot.

```
'loadings']] %>%
fa[[ unclass() %>%
as_tibble(rownames = "planning_area") %>% # convert loadings to a tibble or easy plotting
gather(key = "component", value = "value", -planning_area) %>%
ggplot(aes(x = planning_area, y = value)) +
geom_hline(yintercept = 0) +
geom_col(aes(fill=(value >= 0)),
show.legend = FALSE) +
ylim(-1,1) +
coord_flip() +
facet_grid(~component)
```

### 7.4.1 Exercise

Compare with the same plot based on PCA - which one do you find easier to interpret? What do you take away from this plot?

Finally, we can also make a plot of the planning areas for the first four factors.

```
'scores']] %>%
fa[[ unclass() %>%
as_tibble(rownames = "planning_area") %>% # convert loadings to a tibble or easy plotting
ggplot(aes(x = RC1, y = RC2)) + geom_text(aes(label = planning_area))
```

```
'scores']] %>%
fa[[ unclass() %>%
as_tibble(rownames = "planning_area") %>% # convert loadings to a tibble or easy plotting
ggplot(aes(x = RC3, y = RC4)) + geom_text(aes(label = planning_area))
```

## 7.5 Assignment 3 (Monday, March 30th, 23:59)

For your third assignment, you need to write a short report on the dimension reduction analysis we have conducted on the planning areas dataset so far.

### 7.5.1 Data analysis

In this report, you will conduct your own data analysis of the `planning_areas`

dataset. Since most of the necessary code/syntax has already been discussed in class, you should focus most of your attention to the *interpretation* of the analysis of *outcomes*. You can discuss your understanding of the methods, or complement your analysis with conceptual details we have not covered in class.

Among the 21 variables in our dataset, choose a subset of around 10 variables, preferably with a small overlap between the variables you choose and those we used in class in Section 7.3.

Perform PCA on this set of variables.

Extract your components using

`prcomp`

, without forgetting to scale your data (either using`scale`

or`prcomp`

’s scaling options).Discuss how many components you would keep for your analysis and why.

Obtain the directions plot of your selected components. What are the relationships or contrasts you observe from them?

Use

`autoplot`

to obtain the loadings plot. Your chosen variables should appear as red arrows and your observations should be visible on the plot.What can you tell about your data that you could not tell from MDS alone?

Bonus: Contrast your analysis with the

`varimax`

method presented in Section 7.4. Which do you find easier to interpret? Why?Conclusion.

- How does the socio-economic landscape of Singapore’s planning areas look like?
- Are there specific patterns, clusters and contrasts?
- Are neighbourhoods all the same or quite different? What could be potential reasons for these differences?
- Can you think of variables not contained in this dataset which would explain some of these differences?

### 7.5.2 Bonus: Extensions

MDS and PCA are far from the only dimension reduction techniques. We discussed `UMAP`

in the previous chapter, while `t-SNE`

is another method. Both are available in the smallvis package. Choose **one** of the two to produce a different plot and discuss how this representation differs from MDS or whether this gives you a different interpretation.

### 7.5.3 Submitting the assignment

- Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your
`data`

folder with the`here`

library). It should be saved in the`assignments/`

folder, as`assignment3.Rmd`

. - Change the
`output`

parameter to`github_document`

so that your assignment is rendered visually on Github. - Do not forget to run
`styler`

on your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so). - Once you are finished, navigate to your own Github repository at
`https://github.com/02522-cua/[your-name]`

. Create a new issue, title it ‘Assignment 3 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include`@barnabemonnot`

in your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!

Your assignments will be graded on a 5-point scale, according to this rubric:

- The Rmarkdown does not run and there’s incomplete description and contextualization
- The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output
- The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited
- The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
- As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.

Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.