Part 3: ggplot2, factors, boxplots, dplyr: subsetting

Materials from class on Wednesday, January 25, 2023

R Project files

Please download the part3 folder from this dropbox folder link.

Use the grey “download” button to download the whole folder. Please keep the file structure and folder organization exactly the same, as we need this for class. You may move the folder part3 wherever you like on your computer. Be sure to unzip if necessary.

In advance of class, please open the part3 RStudio project (double click on the .Rproj file), open part3.Rmd, and knit this file (click the Knit button at the top of the file). This will install the packages that you need for the Rmd to run.

Readings

Required and suggested class readings can be found on the Readings tab by class. These readings may be done anytime before or after class, but they will supplement your understanding of the class materials and help make homework and project work easier.

Class Video

The class video is here, but I forgot to record the part about here::here(). If I have a chance I will re-record myself talking about it, but in the meantime, click here for Ted’s video from last year, which explains similar ideas. Start at the linked time and watch for about 6 minutes.

View last year’s class and materials here.

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

  • Clearest Point: What was the most clear part of the lecture?
  • Muddiest Point: What was the most unclear part of the lecture to you?
  • Anything Else: Is there something you’d like me to know?

https://bit.ly/bsta504_postclass_survey

Muddiest points

themes in ggplot

Check out this reference about ggplot themes first.

Here are a couple of examples using one plot, so you can see how the built-in themes from the ggplot2 package change the look of the figure (yes, these themes only work on ggplot figures, for the person who asked about that).

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
p <- ggplot(mtcars, aes(x = mpg, y = carb, color = factor(cyl))) +
  geom_point() +
  labs(title = "My scatterplot")
p

Here are some built in themes:

p + theme_bw()

p + theme_minimal()

p + theme_classic()

However, you can make more customized changes by adding the theme() function, which lets you choose specific parts of the plot to change, like this example from the above reference. Anything specified in theme() will override the built-in theme selected first. There are many options, and looking at specific examples will help. I am always, always googling how to change parts of the theme/plot like this, because there are so many options that it’s too hard to remember them all.

p + theme_classic() +
  theme(
    plot.title = element_text(face = "bold", size = 12),
    legend.background = element_rect(fill = "white", size = 4, colour = "white"),
    legend.justification = c(0, 1),
    legend.position = c(0, 1),
    axis.ticks = element_line(colour = "grey70", size = 0.2),
    panel.grid.major = element_line(colour = "grey70", size = 0.2),
    panel.grid.minor = element_blank()
  )
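
The order matters here: a built-in theme such as theme_classic() is a "complete" theme, so it replaces any theme() settings added before it. Below is a minimal sketch of that behavior. The names th_after and th_before are made up for illustration, and the size of 20 is an arbitrary choice.

```r
library(ggplot2)

# Tweak added AFTER the built-in theme: the custom title size is kept
th_after <- theme_classic() + theme(plot.title = element_text(size = 20))

# Tweak added BEFORE the built-in theme: theme_classic() replaces it
th_before <- theme(plot.title = element_text(size = 20)) + theme_classic()

# calc_element() reports the final, resolved settings for one theme element
calc_element("plot.title", th_after)$size   # the custom size
calc_element("plot.title", th_before)$size  # the theme_classic() default size
```

Either object can be added to a plot in the usual way, for example p + th_after.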

Here are some simpler examples that change only the title (from the above reference):

p + theme(plot.title = element_text(size = 16))

p + theme(plot.title = element_text(face = "bold", colour = "red"))

p + theme(plot.title = element_text(hjust = 1))

here()

Agreed, it’s very confusing, more in class 4!
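
In the meantime, here is a small sketch of the basic idea, assuming the here package is installed and you are working inside an R project. The folder name "data" and the file name "penguins.csv" are made up for illustration and are not part of the class materials.

```r
library(here)

# With no arguments, here() returns the project root that the package
# detected (for example, the folder that holds the .Rproj file)
here()

# Build a path from folder and file names, starting at the project root.
# "data" and "penguins.csv" are hypothetical; the file does not need to exist
csv_path <- here("data", "penguins.csv")
csv_path

# The path can then be passed to a reading function, e.g.
# readr::read_csv(csv_path)
```

The point is that the path is built from the project root rather than from wherever the Rmd file happens to live, so the same code works when the project folder is moved.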

Could you please clarify about the use of select(one_of) and the count command that was mentioned in dplyr cheatsheet?

These are very different, if you are talking about select() vs count(). One thing to note is that since I recorded that class, one_of() has been superseded/replaced by any_of() and all_of().

First, select() is a function to subset columns.

library(palmerpenguins)

Here we specify the columns we want in the order we want:

penguins %>% select(bill_length_mm, island, species, year)
## # A tibble: 344 × 4
##    bill_length_mm island    species  year
##             <dbl> <fct>     <fct>   <int>
##  1           39.1 Torgersen Adelie   2007
##  2           39.5 Torgersen Adelie   2007
##  3           40.3 Torgersen Adelie   2007
##  4           NA   Torgersen Adelie   2007
##  5           36.7 Torgersen Adelie   2007
##  6           39.3 Torgersen Adelie   2007
##  7           38.9 Torgersen Adelie   2007
##  8           39.2 Torgersen Adelie   2007
##  9           34.1 Torgersen Adelie   2007
## 10           42   Torgersen Adelie   2007
## # … with 334 more rows
## # ℹ Use `print(n = ...)` to see more rows

Here, we pass a character vector of column names, first directly and then wrapped in any_of(). Both work:

penguins %>% select(c("bill_length_mm","island","species","year"))
## # A tibble: 344 × 4
##    bill_length_mm island    species  year
##             <dbl> <fct>     <fct>   <int>
##  1           39.1 Torgersen Adelie   2007
##  2           39.5 Torgersen Adelie   2007
##  3           40.3 Torgersen Adelie   2007
##  4           NA   Torgersen Adelie   2007
##  5           36.7 Torgersen Adelie   2007
##  6           39.3 Torgersen Adelie   2007
##  7           38.9 Torgersen Adelie   2007
##  8           39.2 Torgersen Adelie   2007
##  9           34.1 Torgersen Adelie   2007
## 10           42   Torgersen Adelie   2007
## # … with 334 more rows
## # ℹ Use `print(n = ...)` to see more rows
penguins %>% select(any_of(c("bill_length_mm","island","species","year")))
## # A tibble: 344 × 4
##    bill_length_mm island    species  year
##             <dbl> <fct>     <fct>   <int>
##  1           39.1 Torgersen Adelie   2007
##  2           39.5 Torgersen Adelie   2007
##  3           40.3 Torgersen Adelie   2007
##  4           NA   Torgersen Adelie   2007
##  5           36.7 Torgersen Adelie   2007
##  6           39.3 Torgersen Adelie   2007
##  7           38.9 Torgersen Adelie   2007
##  8           39.2 Torgersen Adelie   2007
##  9           34.1 Torgersen Adelie   2007
## 10           42   Torgersen Adelie   2007
## # … with 334 more rows
## # ℹ Use `print(n = ...)` to see more rows

This might be useful if we have that character vector already saved from some other data work we are doing:

colnames_needed <- c("bill_length_mm","island","species","year")
penguins %>% select(colnames_needed)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(colnames_needed)` instead of `colnames_needed` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## # A tibble: 344 × 4
##    bill_length_mm island    species  year
##             <dbl> <fct>     <fct>   <int>
##  1           39.1 Torgersen Adelie   2007
##  2           39.5 Torgersen Adelie   2007
##  3           40.3 Torgersen Adelie   2007
##  4           NA   Torgersen Adelie   2007
##  5           36.7 Torgersen Adelie   2007
##  6           39.3 Torgersen Adelie   2007
##  7           38.9 Torgersen Adelie   2007
##  8           39.2 Torgersen Adelie   2007
##  9           34.1 Torgersen Adelie   2007
## 10           42   Torgersen Adelie   2007
## # … with 334 more rows
## # ℹ Use `print(n = ...)` to see more rows

But the key difference between any_of() and all_of() is what each one allows. any_of() allows column names that don’t exist, and simply skips them! Using no tidyselect helper, or using all_of(), gives an error instead. Which you use depends on what you want to allow to happen.

colnames_needed <- c("bill_length_mm","island","species","year","MISSING")
# penguins %>% select(colnames_needed) # does not work
penguins %>% select(any_of(colnames_needed)) # works!
## # A tibble: 344 × 4
##    bill_length_mm island    species  year
##             <dbl> <fct>     <fct>   <int>
##  1           39.1 Torgersen Adelie   2007
##  2           39.5 Torgersen Adelie   2007
##  3           40.3 Torgersen Adelie   2007
##  4           NA   Torgersen Adelie   2007
##  5           36.7 Torgersen Adelie   2007
##  6           39.3 Torgersen Adelie   2007
##  7           38.9 Torgersen Adelie   2007
##  8           39.2 Torgersen Adelie   2007
##  9           34.1 Torgersen Adelie   2007
## 10           42   Torgersen Adelie   2007
## # … with 334 more rows
## # ℹ Use `print(n = ...)` to see more rows
# penguins %>% select(all_of(colnames_needed)) # does not work
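
To see the all_of() failure without stopping a script, one option is to wrap the call in tryCatch(). This is only a sketch: df is a tiny made-up data frame standing in for penguins, and the message text in the error handler is my own wording, not what dplyr prints.

```r
library(dplyr)

# Tiny made-up data frame standing in for penguins
df <- tibble(species = c("Adelie", "Gentoo"), year = c(2007L, 2008L))
cols <- c("species", "year", "MISSING")

# any_of(): skips "MISSING" and returns the columns that do exist
df %>% select(any_of(cols))

# all_of(): stops with an error because "MISSING" is not a column;
# tryCatch() captures the error so the rest of the script keeps running
result <- tryCatch(
  df %>% select(all_of(cols)),
  error = function(e) "all_of() failed: a requested column is missing"
)
result
```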

any_of() is an example of a tidyselect helper, which we will see a lot more of when we start using across() with mutate() and summarize() in class 4. See this long list of useful tidyselect functions for more.

On the other hand, count() is mainly used to count the number of rows for each unique value in a column/vector:

penguins %>% count(species)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

This created a new tibble that summarizes the species column by counting the number of rows for each species. This works for any type of vector but is most useful with character and factor columns. You can also use multiple columns here to count each combination of values that occurs in the data:

penguins %>% count(species, year)
## # A tibble: 9 × 3
##   species    year     n
##   <fct>     <int> <int>
## 1 Adelie     2007    50
## 2 Adelie     2008    50
## 3 Adelie     2009    52
## 4 Chinstrap  2007    26
## 5 Chinstrap  2008    18
## 6 Chinstrap  2009    24
## 7 Gentoo     2007    34
## 8 Gentoo     2008    46
## 9 Gentoo     2009    44
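
One way to think about count() is as a shortcut for grouping and then summarizing with n(). The sketch below uses a tiny made-up data frame (df) rather than penguins, and the names counted and by_hand are just for illustration.

```r
library(dplyr)

# Tiny made-up data frame: two Adelie rows and one Gentoo row
df <- tibble(species = c("Adelie", "Adelie", "Gentoo"))

# count() in one step
counted <- df %>% count(species)

# The same counts written out by hand
by_hand <- df %>%
  group_by(species) %>%
  summarize(n = n())

counted
by_hand
```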

Clearest Points

filter, select, arrange, pipes

Super! I want to take the time to mention (and hopefully not confuse everyone) that a pipe has recently (2021, in R 4.1.0) been integrated into “base R”, that is, it’s available without loading the tidyverse package. HOWEVER, it is this symbol |> and it does not behave exactly like the tidyverse pipe %>% (which actually comes from the magrittr package within the tidyverse). For all the usual uses it works the same, so the two could be used interchangeably in this class. Just know that if you see that type of pipe being used, you can assume it’s doing basically the same thing. Even R for Data Science is likely moving to use the native/base pipe, see this explanation. I will probably switch everything over to use it for next year’s class, though I still use %>% in my own work as it’s slightly more flexible for more “advanced” usage.

penguins |> count(species, year)
## # A tibble: 9 × 3
##   species    year     n
##   <fct>     <int> <int>
## 1 Adelie     2007    50
## 2 Adelie     2008    50
## 3 Adelie     2009    52
## 4 Chinstrap  2007    26
## 5 Chinstrap  2008    18
## 6 Chinstrap  2009    24
## 7 Gentoo     2007    34
## 8 Gentoo     2008    46
## 9 Gentoo     2009    44
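
One place where the two pipes differ is the placeholder used when the piped object should not go into the first argument. This is a small sketch, assuming R 4.2 or later (the _ placeholder for |> is not available before that). The names fit_magrittr and fit_native are made up for illustration.

```r
library(magrittr)  # provides %>%; it is also available once the tidyverse is loaded

# In the usual case both pipes give the same result
c(1, 4, 9) %>% sqrt()
c(1, 4, 9) |> sqrt()

# Sending the data to an argument other than the first one:
# %>% uses a dot as the placeholder
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# |> uses an underscore, and the argument must be named (data = _)
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

coef(fit_magrittr)
coef(fit_native)
```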

Other messages

Density ridges are cool!

I agree!