A Slack nuggets

A.1 git woes

A quick post-mortem of some git issues we ran into:

  • Greyed-out push/pull options: Happens when git is not aware of any location to push your changes to. Two fixes:

    1. Redo the steps as we did today (Create new project from version control, enter the URL of your repo and make/commit/push the changes)
    2. Set the origin remote in a Shell (this is explained in the link).
  • Conflicting changes: In general, not editing anything on GitHub directly will help you avoid this issue, since you will be the only one working on your repository. This error can come up when GitHub and your local work are out of sync with conflicting updates, e.g., if you edited something on GitHub and then made more edits locally without integrating the GitHub edits first, or if you and a friend worked on the same file and made conflicting changes.

  • Nuclear option: Sometimes git just gets too hairy. If you have a fairly recent version of your work already pushed to GitHub and you can easily retrace the changes you have made since that version, delete your project locally, create it again in RStudio with version control and manually add the changes you have made since the last push. Commit, push and check that everything works.
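For the second fix under greyed-out push/pull options, the underlying Shell command is git remote add origin followed by the URL of your repository. Below is a minimal sketch of the same step driven from the R console, assuming git is installed and on your PATH. The helpers run_git and set_origin are names made up for this illustration, and the URL is a placeholder, not a real repository.

```r
# Thin wrapper: run `git <args>` inside `dir` and return the printed lines.
run_git <- function(args, dir = ".") {
  system2("git", c("-C", shQuote(dir), args), stdout = TRUE, stderr = TRUE)
}

# Register `url` as the remote called "origin" unless one is already set.
# Shell equivalent: git remote add origin <url>
set_origin <- function(url, dir = ".") {
  if (!"origin" %in% run_git("remote", dir)) {
    run_git(c("remote", "add", "origin", shQuote(url)), dir)
  }
  # Shell equivalent: git remote get-url origin
  run_git(c("remote", "get-url", "origin"), dir)
}

# Demonstration on a throwaway repository in a temporary folder.
demo_dir <- file.path(tempdir(), "git-remote-demo")
dir.create(demo_dir, showWarnings = FALSE)
run_git("init", demo_dir)

# Hypothetical placeholder URL; use the URL of your own repository instead.
demo_url <- "https://github.com/your-username/your-repo.git"
origin_url <- set_origin(demo_url, demo_dir)
print(origin_url)
```

In your own project you would call set_origin() with your repository URL from the project folder, then restart RStudio and check whether the push/pull buttons are active again.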

You can also check Chapter 9 of happygitwithr on how to tie GitHub a bit more tightly to your laptop. We didn’t run into too many issues pushing the work once the permissions were set, but it is possible to do one of two things:

In any case, do not hesitate to create a repository for yourself outside of the 02522-cua organisation on GitHub and test things out for yourself. We will have plenty more opportunities to go over git troubles as they come. And for more background on why people care so much about it, check out this article from the syllabus!

A.2 Minimal reproducible example

Hi everyone, posting here a link to a famous Stack Overflow post on minimal reproducible examples. When asking for help, whether from me or from one of your classmates, save yourself and each other time by making sure that you communicate a minimal reproducible example. In the case of R, this means giving a piece of code that can be run on its own in a single chunk. For instance, if your question is “I cannot get the number of sales per month!”, this is a minimal reproducible example:

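library(tidyverse)

sales <- readRDS(here::here("data/sales.rds"))
# Cleaning steps omitted here; include the ones you actually ran
sales_cleaned <- sales
sales_summary <- sales_cleaned %>%
  group_by(month) %>%
  summarise(m = count())  # this is the call that fails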

while this is not:

sales_summary <- sales_cleaned %>%
  group_by(month) %>%
  summarise(m = count())

An external helper will not know what you mean by sales_cleaned, nor which libraries you are using (maybe count() with no arguments is well-defined in some library, but in the tidyverse dplyr's count() expects a data frame as its first argument, so this call fails).
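A helper who does not have your data file still cannot run the example, so it often pays to replace the file with a few rows of toy data. Here is one possible sketch of such a fully self-contained version, which also shows one way to get the count per month. The toy values are invented for illustration, and it assumes the question really is about counting rows per month.

```r
library(dplyr)

# Hypothetical toy data standing in for data/sales.rds, so that anyone can run this.
sales <- tibble(
  month  = c("Jan", "Jan", "Feb", "Mar", "Mar", "Mar"),
  amount = c(10, 25, 40, 5, 15, 30)
)

# n() counts the rows in each group; this is what the failing count() call was after.
sales_summary <- sales %>%
  group_by(month) %>%
  summarise(m = n())

# Equivalent shortcut: count() takes the data frame and the grouping column.
sales_count <- sales %>% count(month, name = "m")

print(sales_summary)
```

Both sales_summary and sales_count hold one row per month with the number of sales in column m.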

Additionally, make sure to include any leads you have explored to solve the problem by yourself. Chances are, expanding on these leads and working out a minimal example will lead you to a solution on your own (this is close to the rubber duck method).

A.3 On good plots

We discussed experiment design yesterday, that is, how to collect samples. If you get the chance, pick up a copy of The Truthful Art, by Alberto Cairo (apparently the full text is available from the SUTD library website). A chapter on experiment design is available online; take a look to go deeper into what we discussed! I added the link to the readings in the lecture notes.

Towards the end of the session, I also got a lot of questions from you on what should appear in your plots, and more questions on how to make plots look good when they get cluttered with text. The same Truthful Art book contains a lot of wisdom on visualising data.

According to Cairo’s definition, “A visualisation is any kind of visual representation of information designed to enable communication, analysis, discovery, exploration, etc.” For your assignments, think of what you plot and report as part of an internal monologue that you carry out as you explore the dataset, from the first time you import it to the point where you have built an intuition about the relationships between its variables. Your first plots need not be very complicated, but they should answer questions such as:

  • What is this data?
  • Why do I care about it?
  • How can I appropriately describe it?

Make sure that you clearly show your understanding of what this dataset is about and what it contains (yes, all the variables!). If you need inspiration, take a look at the long list of geometries ggplot can draw for you. We have of course not used all of them in class, so try different ones and ask yourself which one does a better job of communicating your point. You can also take a look at Chapter 6 of The Truthful Art.

If your later plots—where you start looking at relationships between variables—become cluttered beyond readability, ask yourself whether they are a meaningful addition to this mental conversation that you are having, or what their strength as an argument is.

  • Does the plot communicate salient features of the dataset?
  • Does it point towards interesting explorations?
  • If it does, can you highlight these directions better in the plot instead of having a flat visual model where all the information is on the same plane?

Cairo also states as an axiom that “A visualisation is a model”. A model doesn’t seek to capture everything, but it tries to isolate the parts that are meaningful, while remaining truthful. It’s not always the case that because you can plot it, you should, just as a model does not need to add “needless complexity”, in Cairo’s words. In this sense, a single plot is like a single argument in this long internal conversation. If you are honest in your internal monologue (e.g., do not seek to confirm your own biases and properly document your exploratory work), someone who has not played with the data but reads what you have to say about it should be able to pick up the conversation where you left it off.

A final point: because you are having this internal monologue with yourself, there is a good chance that your report will not look like that of one of your classmates. This conversation is personal, and your understanding of the data is too. Obviously, the 20 of you will not compute 20 different means of the resale_price (it is what it is), and there is a good chance of overlap for the initial exploratory part. But bring your own sensibilities to working with the data, or even your own aesthetics as you produce different plots. ggplot has great style defaults, but that doesn’t mean they are set in stone and cannot be changed (check the options of the theme() function). What you find interesting may not be what your classmate finds interesting. The way you present your arguments, or how you conduct your investigation and report it, should convey your own reasoning. This may be a computational class first, but you are building up to a project that will reflect your own interests and creativity. It is not too early to start building up your own thought process and your own methods for working with data.
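As a small illustration of changing the style defaults, here is a sketch that swaps the default theme and then adjusts a couple of individual elements. It uses the mtcars dataset that ships with R as a stand-in for your own data. The particular choices (colour, bold title, no minor grid) are only examples, not a recommended house style.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for your own dataset.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "steelblue") +
  labs(
    title = "Heavier cars travel fewer miles per gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

# Swap the default grey theme for a lighter one, then adjust individual elements.
p_styled <- p +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

print(p_styled)
```

Calls to theme() come after the complete theme (here theme_minimal()), because a complete theme resets any element set before it.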

A.4 Common assignment issues

This follows some notes posted on assignment 1 in my repo. Take a look!

I would like to point out the following issues which reappear often:

  • Structure of the document: Markdown has very expressive syntax, and super simple commands. If you are not sure which to use, do bookmark this page. In general, try to structure your document well with ## for sections, ### for subsections, etc. It makes it a lot easier for a reader to follow what is going on.
  • Plots: ggplot has a ton of options and out-of-the-box plots won’t always work. I often see facets used in your assignments and they are a good model, but they are not necessarily the way to go when you have a lot of different facets. When comparing things across towns, for instance, too many facets appear and the results are hard to read. Think of each plot as trying to make a (single) point; your job is to make that point as clear as possible.
  • Statistics vomit: It’s little use for a reader to be assailed with many tables and plots that are not explained, interpreted or described at all. When you work on the assignments, you will be running a ton of commands and plots before settling on which is the most effective. That doesn’t mean everything you have run should be put in the report! Think like an editor: who is your audience, what are the things you are trying to communicate, how can you make your arguments as precisely and concisely as possible? In the mock assignment I created above, I have left many “building” steps to show you how you might be iterating over a few designs before finding one that works. In your reports there is no need to show these different steps, unless you think each one is meaningful.
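To make the point about facets concrete, here is a sketch of one possible alternative to a facet per town: summarise each town to a single number and show all towns in one sorted plot. The data are simulated and the town names are made up; only the column names town and resale_price echo the assignment. The median is just one choice of summary.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: simulated resale prices for 12 made-up towns.
set.seed(1)
resale <- tibble(
  town = rep(paste("Town", LETTERS[1:12]), each = 50),
  resale_price = rnorm(600, mean = rep(seq(300, 520, by = 20), each = 50), sd = 40)
)

# The cluttered version would be one small panel per town:
# ggplot(resale, aes(resale_price)) + geom_histogram() + facet_wrap(~ town)

# A single-point alternative: one median per town, sorted so the ranking is visible.
town_medians <- resale %>%
  group_by(town) %>%
  summarise(median_price = median(resale_price))

p_towns <- ggplot(town_medians, aes(x = median_price, y = reorder(town, median_price))) +
  geom_point() +
  labs(x = "Median resale price (simulated)", y = NULL)

print(p_towns)
```

The plot now makes one point (how towns rank by median price), at the cost of hiding the spread within each town. Whether that trade-off is right depends on the argument you are making.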