Chapter 7 Principal Component Analysis (PCA)
In the previous section, we used Multi-Dimensional Scaling (MDS) as a dimension reduction technique. Although MDS can be quite effective in its own right, the interpretation of the resulting plots can sometimes be difficult - especially because the axes lose their explicit meaning. There are other dimension reduction strategies available that might help in this regard. One of these strategies is called factor analysis: the analysis of underlying or latent ‘factors’ present in larger set of variables or dimensions (e.g. income and education could both be dimensions of an underlying factor ‘class’). A special case of factor analysis is called Principal Component Analysis (PCA). PCA allows us to show which variables/dimensions in a dataset are correlated, potentially allowing them to be expressed in fewer dimensions or ‘components’. It also has another useful feature for subsequent analytical procedures: the components derived from PCA are, by definition, uncorrelated which is often a requirement for (for example) regression analyses. For a longer discussion and a worked example of PCA, see Boehmke’s US crime rates notebook or Wyly’s background paper and play with the interactive example made by Victor Powell and Lewis Lehe.
7.2 PCA with two variables
Conducting PCA in R is relatively straightforward. It has a built-in function for PCA (
prcomp) but we will also use the
psych packages to help make our analysis a bit easier.
We first give some of the intuition behind PCA by restricting our analysis to two variables only. Our goal will be to describe our observations as succinctly as possible, while providing the maximum amount of information. If our data had no variation and all neighbourhoods had the same two values for
area_res, we wouldn’t need to summarise it more than knowing the two values. But the data varies, as evidenced by the following plot:
area_res varies between 0 and 0.15, while
hh_income_gt_17500 varies between a little below 0.1 and a little over 0.5. PCA attempts to explain this variation by capturing as much of it as possible. This is done by extracting components (sometimes called factors, although factor analysis is related but different from PCA), with the first component extracted to explain as much variation as possible. Let’s see what this means on our example.
First we scale our data as we have done in MDS, this time using the
scale function. Scaling a variable subtracts the mean of that variable from all observations, then divides all observations by the standard deviation of the variable. In other words, you can check that the following two expressions give you the same result.
##  0.1432271 1.3028069 0.5597262 -0.4560503 -0.8413429 -0.4478156
## # A tibble: 6 x 2 ## area_res area_res_scaled ## <dbl> <dbl> ## 1 0.0401 0.143 ## 2 0.0836 1.30 ## 3 0.0557 0.560 ## 4 0.0176 -0.456 ## 5 0.00310 -0.841 ## 6 0.0179 -0.448
Let’s decompose what is happening when we apply
- First, we look at the distribution of
- Second, we subtract the mean of
area_resfrom all the observations to obtain a shifted
area_res. Note that this does not change the shape of the distribution (up to some artefacts from the binning), but shifted it to the left. The distribution is now centred at zero: the mean of our shifted
- Finally, we divide by the standard deviation to scale the shifted
area_res. Again the shape does not change but you can observe that the variable takes values over a larger interval.
The idea is to apply the same procedure to all our variables. You can see the result below on some of the variables. We combine a few commands below:
- To scale our variables, we cast them into a matrix and apply the
- We cast the matrix back to a tibble with
- We add back the planning area names with
library(patchwork) # Choose some variables from our dataset scale_example_data <- planning_areas %>% select( planning_area, starts_with("area"), starts_with("age") ) # Cast dataset to a long format long_data <- scale_example_data %>% pivot_longer(-planning_area) # Scale dataset and cast it to a long format long_data_scaled <- scale( scale_example_data %>% select(-planning_area) %>% as.matrix() ) %>% as_tibble() %>% bind_cols(scale_example_data %>% select(planning_area)) %>% # add area names back pivot_longer(-planning_area) # Compare plots long_data %>% ggplot() + geom_histogram(aes(x = value)) + facet_wrap(vars(name)) | long_data_scaled %>% ggplot() + geom_histogram(aes(x = value)) + facet_wrap(vars(name))
Let’s scale our two variables
hh_income_gt_17500 and plot them on a scatterplot.
Note that scaling does not change the relative positions of the planning areas in the
area_res scatterplot, only their coordinates. Now let’s extract our first component.
7.2.2 First component extraction
library(broom) # `prcomp` extracts principal components for us pc <- planning_areas_scaled %>% column_to_rownames(var = "planning_area") %>% select(area_res, hh_income_gt_17500) %>% prcomp(.) tidy(pc, "pcs") %>% # summary information on the components filter(PC == 1) # only look at the first component
## # A tibble: 1 x 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1.33 0.879 0.879
percent variable refers to the amount of total variance explained by the first component. 88% is a high number! Now let’s see if we can glean more information about this component.
## # A tibble: 2 x 2 ## column `1` ## <chr> <dbl> ## 1 area_res -0.707 ## 2 hh_income_gt_17500 -0.707
The two numbers in the “1” column above give us the “direction” of the first component. We can plot these directions in a directions plot.
Note that we can represent it with points too:
We can also plot the component’s direction in our initial scatterplot. Since we scaled our variables, the component is an arrow (“vector”) originating from the center of the plot.
This should remind you of the linear regression. There is a rather positive relationship between
area_res (the higher one is, the higher we expect the other to be as well). Meanwhile, linear regression gave us the line that explained the most variation (by minimising the sum of squared residuals). The first component encapsulates this relationship by telling us that the data moves along the arrow, either in the same direction (the smaller
area_res, the smaller
hh_income_gt_17500) or in the opposite (the higher
area_res, the higher
7.2.3 Second component extraction
Before going further, let’s extract another component. Note that there is an order to the operations: the first component explains the most variation possible (in this case 88%). Then the second component explains the most variation possible among the variation remaining (12% remain).
## # A tibble: 1 x 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 2 0.492 0.121 1
We can also get the directions plot for the two components.
The second component we extract explains 12% of the total variation, and so explains all the variation left! Let’s see why this is not surprising by plotting the second component.
The second component points to a different direction. While the first represented the positive relationship between
area_res, the second says “this is not a perfect relationship, there is still a bit of variation left on either side of the line”. In particular, now that we have two arrows, we could write any planning area in our dataset as a mixture of the first and the second arrow, since they define a coordinate system. This is done with the following chunk.
## # A tibble: 28 x 3 ## row component1 component2 ## <fct> <dbl> <dbl> ## 1 ANG MO KIO 0.255 -0.458 ## 2 BEDOK -1.04 -0.805 ## 3 BISHAN -0.922 0.130 ## 4 BUKIT BATOK 0.527 0.118 ## 5 BUKIT MERAH 0.968 0.222 ## 6 BUKIT PANJANG 0.570 0.0635 ## 7 BUKIT TIMAH -3.74 0.786 ## 8 CHOA CHU KANG 0.827 0.0933 ## 9 CLEMENTI 0.0485 0.0551 ## 10 GEYLANG 0.350 -0.490 ## # … with 18 more rows
Let’s look at Bukit Timah and understand what these numbers mean. The coefficient of component 1 for Bukit Timah is -3.74, while the coefficient of component 2 is 0.79. This means that we can understand Bukit Timah as “a lot of negative component 1, and some amount of component 2”. We can plot all these coefficients in a scatterplot, called a loadings plot.
In the plot above, we express each planning area in terms of component 1 and component 2, instead of in terms of
area_res. Why is that useful? Because each component captures some aspect of the relationship between
area_res, the first one capturing their positive relationship, and the second one correcting for the additional variance that the first could not explain. We can plot the loadings of each variable onto the loading plot, represented by arrows.
Now the plots start to make sense. Points moving towards the direction of the
hh_income_gt_17500 are those for which this variable has a high value, and the same conclusion holds for
area_res. The resulting plot however is not much more informative than the one we started with: it is simply rotated and has two extra red arrows. This is something we already observed with MDS. Starting with a 2D dataset and reducing it to a 2D dataset moves things around, giving perhaps a bit more clarity but not much more interpretation. So let’s look at PCA applied on more variables of our dataset to see where the method shines.
7.3 PCA with several variables
The strength of PCA really comes out once our dataset has more than two variables. Let’s analyse all of the age, occupation, area, dwelling and household variables we have available in our dataset.
When we inspect the results of our PCA, you notice that you get as many components back as there were variables in your dataset (14). PCA will always give back an equal number of components compared to your input variables. Together, all of these components explain 100% of the variance in your data. But, and this is where things get useful, the first few components explain much more variance! In our case, the first three components explain 87% of all variance.
## # A tibble: 14 x 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 2.43 0.422 0.422 ## 2 2 2.22 0.353 0.774 ## 3 3 1.14 0.0935 0.868 ## 4 4 0.843 0.0507 0.919 ## 5 5 0.628 0.0282 0.947 ## 6 6 0.503 0.0181 0.965 ## 7 7 0.415 0.0123 0.977 ## 8 8 0.355 0.00901 0.986 ## 9 9 0.279 0.00554 0.992 ## 10 10 0.217 0.00337 0.995 ## 11 11 0.194 0.00268 0.998 ## 12 12 0.137 0.00133 0.999 ## 13 13 0.0921 0.00061 1.00 ## 14 14 0.0633 0.00029 1
7.3.1 Deciding how many components to keep
How many componenents should we keep? There are different ways to decide which components to cut off.
The first method looks for the elbow in the so-called scree plot, which is just a line plot of the explained variance of each components. Where would you place the cut-off here?
The second method picks component until some threshold of explained variance is reached (usually, 90%).
tidy(pc, "pcs") above returned the cumulative variance explained. We can use it to plot the cumulative scree plot and place our cut-off.
7.3.2 Directions plot
First, we can look at the directions plot, to ask how variables tend to vary with each other.
Looking at the first component here, it seems to say that for a high value of
occupation_senior_managers_professional, we expect a low value of
occupation_cleaners_labourers_plant, a low value of
hh_income_lt_2000, a high value of
hh_income_gt_17500 etc. Another interpretation: Looking at the bottom three variables on age, we can see that they are ordered similarly in the first and second components. However, the first factor has a positive loading for
area_res while the second factor has a positive loading for
area_hdb. The first factor seems to say “when we are in a younger neighbourhood, we expect more senior managers”. The second factor comes to give more clarity to this assessment: “This is not true if the neighbourhood is more HDB!” We will also see a variation on PCA that makes such statements even clearer in the next Section.
7.3.3 Loading plot
We can plot the results of our analysis for the first few components by hand with
ggplot as we did above, but we can also do this more easily with the
autoplot function provided by the
These plots look quite interesting and you might already be able to infer some patterns. Is there a more principled way to get at that meaning? We can do this by look at the loadings of each variable on the component. Loadings are nothing more than a fancy term for the correlation between the original variable and the factor. We can do this in table-form:
## # A tibble: 14 x 15 ## column `1` `2` `3` `4` `5` `6` `7` `8` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age_0… 0.121 0.103 -0.795 -0.0881 0.105 -0.0108 0.276 -0.169 ## 2 age_1… 0.328 0.164 0.312 -0.129 -0.310 -0.283 0.182 0.250 ## 3 age_g… -0.303 -0.285 0.140 0.0485 0.102 0.00362 0.0714 0.0669 ## 4 area_… -0.0785 0.329 0.0295 -0.675 0.333 -0.320 -0.439 -0.0590 ## 5 area_… -0.320 -0.0803 0.0678 -0.549 -0.343 0.601 0.0861 -0.0466 ## 6 area_… 0.133 -0.344 0.221 -0.129 0.716 0.224 0.248 0.0516 ## 7 dwell… -0.373 -0.0277 -0.162 -0.239 -0.0674 -0.328 0.554 0.194 ## 8 dwell… 0.173 -0.396 0.0132 -0.103 -0.0713 -0.176 -0.0941 -0.305 ## 9 hh_1p… -0.341 -0.221 0.0244 0.0713 -0.243 -0.168 -0.226 -0.0581 ## 10 hh_5p… 0.369 0.0354 0.271 -0.212 -0.140 -0.115 0.396 -0.186 ## 11 hh_in… 0.178 -0.390 -0.0326 -0.201 -0.139 -0.0810 0.0307 -0.315 ## 12 hh_in… -0.349 -0.195 0.0936 0.0585 0.121 -0.439 0.112 0.0612 ## 13 occup… -0.251 0.303 0.221 0.177 0.0704 -0.0416 0.168 -0.785 ## 14 occup… 0.150 -0.398 -0.195 -0.0827 -0.120 -0.157 -0.240 -0.0866 ## # … with 6 more variables: `9` <dbl>, `10` <dbl>, `11` <dbl>, `12` <dbl>, ## # `13` <dbl>, `14` <dbl>
Admittedly this is not very readable. When trying to understand the ‘meaning’ of each component, it sometimes makes sense to look at both the loadings and the observations in PCA ‘space’ at the same time. We can do so with
That’s a lot of arrows! Yet, this could drive our analysis to understand why Marine Parade is different from Queenstown for example. It seems for instance that
area_res is quite higher in Marine Parade, while
area_other dominates in Queenstown. We start to see groups appear and the variables that drive the differences between these groups.
Obtain the loadings plot for components 3 and 4. Does this add to your analysis?
This is what we would obtain doing MDS with the same set of variables.
library(distances) pln_area_matrix <- planning_areas %>% select(starts_with("age"), starts_with("occupation"), starts_with("area"), starts_with("dwelling"), starts_with("hh")) %>% as.matrix() pln_area_dist <- distances(pln_area_matrix, normalize = "studentize") %>% as.matrix() mds_planning_area <- cmdscale(pln_area_dist) %>% as_tibble() %>% rename(Y1 = V1, Y2 = V2) # add planning area name back mds_planning_area <- planning_areas %>% select(planning_area) %>% bind_cols(., mds_planning_area) ggplot(mds_planning_area, aes(x = Y1, y = Y2)) + geom_text(aes(label = planning_area))
The groups we see are not so different, but it is quite hard for us to tell why they are plotted that way. MDS in particular does not give us loadings, which tell us what drives observations to be close to each other.
7.4 Extensions to PCA
Finally, if our concern is purely with interpretability and finding ‘latent’ factors, we can also relax some of the restrictions that PCA brings with it. If you remember, in PCA the first component always explains as much of the variation in our dataset as possible. But this isn’t a hard requirement and we can obtain our components/factors based on other rules as well. One such rule could be that we want to maximize the loadings/correlation of individual variables on a single component. If only a single or small set of variables would load on a factor (and they wouldn’t load on the other components), this would really help our interpretation. To do this, we can use the
principal function from the
## Principal Components Analysis ## Call: principal(r = ., nfactors = 4, rotate = "varimax") ## Standardized loadings (pattern matrix) based upon correlation matrix ## RC1 RC2 RC3 RC4 h2 u2 com ## age_0_9 -0.17 0.00 -0.97 0.01 0.97 0.029 1.1 ## age_10_19 -0.93 0.00 0.17 0.11 0.91 0.094 1.1 ## age_gt_65 0.88 0.24 0.37 -0.06 0.97 0.030 1.5 ## occupation_senior_manager_professional 0.07 0.96 -0.13 -0.15 0.97 0.032 1.1 ## occupation_cleaners_labourers_plant 0.23 -0.90 0.23 0.06 0.91 0.088 1.3 ## area_hdb -0.13 -0.60 -0.10 0.71 0.90 0.104 2.1 ## area_res -0.04 0.81 0.32 -0.05 0.76 0.235 1.3 ## area_other 0.75 -0.06 0.19 0.51 0.86 0.143 1.9 ## dwelling_1_2_room 0.87 -0.26 -0.05 0.27 0.90 0.100 1.4 ## dwelling_condo_landed -0.03 0.97 0.09 -0.12 0.96 0.041 1.1 ## hh_income_lt_2000 0.91 0.02 0.30 -0.03 0.92 0.079 1.2 ## hh_income_gt_17500 -0.04 0.98 0.03 -0.04 0.97 0.032 1.0 ## hh_1person 0.93 0.08 0.23 -0.06 0.93 0.068 1.1 ## hh_5person_plus -0.90 0.31 0.15 0.12 0.94 0.064 1.3 ## ## RC1 RC2 RC3 RC4 ## SS loadings 5.57 4.88 1.50 0.92 ## Proportion Var 0.40 0.35 0.11 0.07 ## Cumulative Var 0.40 0.75 0.85 0.92 ## Proportion Explained 0.43 0.38 0.12 0.07 ## Cumulative Proportion 0.43 0.81 0.93 1.00 ## ## Mean item complexity = 1.3 ## Test of the hypothesis that 4 components are sufficient. ## ## The root mean square of the residuals (RMSR) is 0.03 ## with the empirical chi square 3.98 with prob < 1 ## ## Fit based upon off diagonal values = 1
The object returned by
principal has a slightly different structure than that from
prcomp. By looking at the summary, you will see that the proportion of the variance explained by each component is slightly different than in the previous PCA. This is to be expected, but luckily we see that our 4 factors still explain 92% of the variance in our dataset. Let’s have a look at the resulting directions plot.
fa[['loadings']] %>% unclass() %>% as_tibble(rownames = "planning_area") %>% # convert loadings to a tibble or easy plotting gather(key = "component", value = "value", -planning_area) %>% ggplot(aes(x = planning_area, y = value)) + geom_hline(yintercept = 0) + geom_col(aes(fill=(value >= 0)), show.legend = FALSE) + ylim(-1,1) + coord_flip() + facet_grid(~component)
Compare with the same plot based on PCA - which one do you find easier to interpret? What do you take away from this plot?
Finally, we can also make a plot of the planning areas for the first four factors.
7.5 Assignment 3 (Monday, March 30th, 23:59)
For your third assignment, you need to write a short report on the dimension reduction analysis we have conducted on the planning areas dataset so far.
7.5.1 Data analysis
In this report, you will conduct your own data analysis of the
planning_areas dataset. Since most of the necessary code/syntax has already been discussed in class, you should focus most of your attention to the interpretation of the analysis of outcomes. You can discuss your understanding of the methods, or complement your analysis with conceptual details we have not covered in class.
Among the 21 variables in our dataset, choose a subset of around 10 variables, preferably with a small overlap between the variables you choose and those we used in class in Section 7.3.
- Perform PCA on this set of variables.
- Extract your components using
prcomp, without forgetting to scale your data (either using
prcomp’s scaling options).
- Discuss how many components you would keep for your analysis and why.
- Obtain the directions plot of your selected components. What are the relationships or contrasts you observe from them?
autoplotto obtain the loadings plot. Your chosen variables should appear as red arrows and your observations should be visible on the plot.
- What can you tell about your data that you could not tell from MDS alone?
Bonus: Contrast your analysis with the
varimaxmethod presented in Section 7.4. Which do you find easier to interpret? Why?
- How does the socio-economic landscape of Singapore’s planning areas look like?
- Are there specific patterns, clusters and contrasts?
- Are neighbourhoods all the same or quite different? What could be potential reasons for these differences?
- Can you think of variables not contained in this dataset which would explain some of these differences?
7.5.2 Bonus: Extensions
MDS and PCA are far from the only dimension reduction techniques. We discussed
UMAP in the previous chapter, while
t-SNE is another method. Both are available in the smallvis package. Choose one of the two to produce a different plot and discuss how this representation differs from MDS or whether this gives you a different interpretation.
7.5.3 Submitting the assignment
- Create your assignment as a single self-contained RMarkdown document (you may read in cleaned data at the beginning from your
datafolder with the
herelibrary). It should be saved in the
- Change the
github_documentso that your assignment is rendered visually on Github.
- Do not forget to run
styleron your code before submitting! (Check out the Session 2.2 slides if you are not sure how to do so).
- Once you are finished, navigate to your own Github repository at
https://github.com/02522-cua/[your-name]. Create a new issue, title it ‘Assignment 3 by [your_name]’. In the body of the issue, include at least a link to the last commit you made as part of your assignment and ‘at’ your prof (include
@barnabemonnotin your text). To link to a specific commit, you can just include the SHA (a short code uniquely identifying a commit) in your issue. You can find the SHA codes in the history of your commits (e.g. at https://github.com/02522-cua/barnabe-monnot/commits/master my first commit has SHA ‘1f2c9c1’). Note that you can always edit the text of the issue if you want to update the assignment commit, but commits are timestamped so your prof can check whether they were submitted before the deadline!
Your assignments will be graded on a 5-point scale, according to this rubric:
- The Rmarkdown does not run and there’s incomplete description and contextualization
- The Rmarkdown does not run or runs but doesn’t produce the expected output. There is limited description and contextualization of code and its output
- The Rmarkdown runs, the code is clearly structured but the description and contextualization of code and its output is limited
- The Rmarkdown runs, the code is clearly structured and the description and contextualization is complete.
- As above, plus you have included some new or creative approaches, methods or angles that were not covered in class.
Note that half or quarter-point grades might be given. The rubric above is only an indication for specific points on the grading scale.