Part 4. dplyr: mutate, group_by, summarize, across
R Project files
Please download the part4 folder from this dropbox folder link. Be sure to unzip if necessary. “Knit” the part4.Rmd file to install packages and make sure everything is working with data loading.
(We did not finish part4, and will finish it in class 5.)
This year’s class video
See Slack for the zoom recording link
Last Year’s Class Video
View last year’s class and materials here.
Slides
During “Muddiest Parts” review, we will go over these slides
Another useful video
Post-Class
Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.
- Clearest Point: What was the most clear part of the lecture?
- Muddiest Point: What was the most unclear part of the lecture to you?
- Anything Else: Is there something you’d like me to know?
Muddiest points
I’ve noticed some confusion about what I call “saving your work”, so we’ll go over these slides.
using factors, what you’re doing and the benefit of turning things into factors in mutate
I usually turn something into a factor for plotting (especially if I have a categorical numeric variable), and we’ll see more examples of that. We will also see later how it matters in statistical modeling/regression. It is also often easier to manage levels/categories this way, as we will see when we talk about the forcats package again in class 6.
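Here is a minimal sketch of the "categorical numeric variable" case. It uses the built-in mtcars data, which is my choice for illustration and not something we used in class: cyl is stored as a number but really only has three categories.

```r
# cyl is stored as a number, but it only takes the values 4, 6, 8
class(mtcars$cyl)
## [1] "numeric"

# convert to a factor so R treats the values as categories
cyl_fac <- factor(mtcars$cyl)
levels(cyl_fac)
## [1] "4" "6" "8"

# we can also choose the level order and nicer labels ourselves
cyl_fac2 <- factor(mtcars$cyl,
                   levels = c(8, 6, 4),
                   labels = c("eight", "six", "four"))
levels(cyl_fac2)
## [1] "eight" "six" "four"
```

The level order is what a plot legend or axis would typically follow, which is why controlling it is handy.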
case_when is not easy
Correct! There were also some other comments asking for more practice with case_when(). We will continue to see examples of it as we finish part5 and in other classes. It’s a very handy function, so I use it a lot! See also the video above about factors for another explanation.
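For a bit of extra practice, here is a small sketch of case_when() on a made-up data frame (the scores data and the cutoffs are invented for illustration). The main things to notice are that the conditions are checked in order, and that an NA matches none of them.

```r
library(dplyr)

# a small made-up data frame, just for illustration
scores <- data.frame(score = c(95, 72, 58, NA))

scores <- scores %>%
  mutate(
    grade = case_when(
      score >= 90 ~ "high",    # conditions are checked in order, top to bottom
      score >= 60 ~ "medium",  # only reached if the first condition was not TRUE
      score < 60 ~ "low"       # NA matches no condition, so it stays NA
    ))

scores$grade
## [1] "high" "medium" "low" NA
```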
The function for converting a vector back from factor to character - I thought I had it, but I didn’t.
Oh, I didn’t show this!
# make a character vector
myvec <- c("medium", "low", "high", "low")
myvec_fac <- factor(myvec)
myvec_fac
## [1] medium low high low
## Levels: high low medium
class(myvec_fac)
## [1] "factor"
# get the levels out
levels(myvec_fac)
## [1] "high" "low" "medium"
# Note we can "test" the classes of something like so:
is.factor(myvec_fac)
## [1] TRUE
is.character(myvec_fac)
## [1] FALSE
# Now we can change it back
myvec2 <- as.character(myvec_fac)
myvec2
## [1] "medium" "low" "high" "low"
class(myvec2)
## [1] "character"
levels(myvec2) # no levels, because it's not a factor
## NULL
# we could also change to numeric, how do you think it picks which number is which?
myvec3 <- as.numeric(myvec_fac)
myvec3
## [1] 3 2 1 2
# the levels, in order, are assigned 1, 2, 3
table(myvec_fac, myvec3)
## myvec3
## myvec_fac 1 2 3
## high 1 0 0
## low 0 2 0
## medium 0 0 1
# change the level order
myvec_fac2 <- factor(myvec, levels = c("low", "medium", "high"))
levels(myvec_fac2)
## [1] "low" "medium" "high"
myvec4 <- as.numeric(myvec_fac2)
myvec4
## [1] 2 1 3 1
table(myvec_fac2, myvec4)
## myvec4
## myvec_fac2 1 2 3
## low 2 0 0
## medium 0 1 0
## high 0 0 1
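One place this numbering can surprise you is when the factor labels themselves look like numbers. A minimal sketch with a made-up vector of years (yearvec_fac is just an illustrative name):

```r
# a factor whose labels look like numbers
yearvec_fac <- factor(c("2009", "2007", "2008", "2007"))
levels(yearvec_fac)
## [1] "2007" "2008" "2009"

# as.numeric() returns the level positions, not the years
as.numeric(yearvec_fac)
## [1] 3 1 2 1

# convert to character first to get the original numbers back
as.numeric(as.character(yearvec_fac))
## [1] 2009 2007 2008 2007
```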
factor vs as.factor
Essentially the same. From the help documentation ?factor: “as.factor coerces its argument to a factor. It is an abbreviated (sometimes faster) form of factor.”
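One small difference you might run into, sketched here with a made-up vector: when the input is already a factor, as.factor() hands it back unchanged, while factor() rebuilds it and drops any levels that do not appear in the data. Also, only factor() has the levels and labels arguments.

```r
# "medium" is declared as a level, but never appears in the data
myfac <- factor(c("low", "high", "low"), levels = c("low", "medium", "high"))
levels(myfac)
## [1] "low" "medium" "high"

# as.factor() returns an existing factor as it is
levels(as.factor(myfac))
## [1] "low" "medium" "high"

# factor() rebuilds the factor and drops the unused level
levels(factor(myfac))
## [1] "low" "high"
```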
I would like to know when you recommend that we save a new data set once we create new covariates. Also, it is unclear to me how you add the variable to the existing data.
If I want to use that column/covariate again, I save it (so almost always, as I don’t often make a column without using it later). I usually save it back into the original data set I’m working with, that is, overwrite that object to be updated with the new column. As long as I keep track of my changes this is definitely ok. It can get confusing having too many versions of a data set floating around. If something is broken, the worst that happens is that you’ll just need to start from the beginning and reload your data (the data file will remain untouched) and re-run the code.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(palmerpenguins)
# does not save the new column, just prints result
penguins %>%
  mutate(newvec = bill_length_mm/bill_depth_mm)
## # A tibble: 344 × 9
## species island bill_length_mm bill_de…¹ flipp…² body_…³ sex year newvec
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 2.09
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 2.27
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 2.24
## 4 Adelie Torgersen NA NA NA NA <NA> 2007 NA
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 1.90
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 1.91
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007 2.19
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007 2
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007 1.88
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007 2.08
## # … with 334 more rows, and abbreviated variable names ¹bill_depth_mm,
## # ²flipper_length_mm, ³body_mass_g
## # ℹ Use `print(n = ...)` to see more rows
# saves new column in a data frame that is called penguins2
penguins2 <- penguins %>%
  mutate(newvec = bill_length_mm/bill_depth_mm)
glimpse(penguins2)
## Rows: 344
## Columns: 9
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ newvec <dbl> 2.090909, 2.270115, 2.238889, NA, 1.901554, 1.907767…
glimpse(penguins) # has not been changed
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
# saves the new column back into the original data frame penguins
# *overwrites penguins*
penguins <- penguins %>%
  mutate(newvec = bill_length_mm/bill_depth_mm)
glimpse(penguins)
## Rows: 344
## Columns: 9
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ newvec <dbl> 2.090909, 2.270115, 2.238889, NA, 1.901554, 1.907767…
arrange vs filter
arrange() sorts the rows of your data and does not remove or add anything, while filter() keeps only the rows that meet a condition and removes the rest.
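A quick way to convince yourself of the difference is to count rows before and after. This is a minimal sketch on a made-up data frame (mydata and its values are invented for illustration):

```r
library(dplyr)

# a small made-up data frame, just for illustration
mydata <- data.frame(id = 1:4, mass = c(3750, 3250, 4675, 3450))

sorted <- mydata %>% arrange(mass)        # same rows, new order
kept   <- mydata %>% filter(mass > 3500)  # only rows where the condition is TRUE

nrow(mydata)
## [1] 4
nrow(sorted)
## [1] 4
nrow(kept)
## [1] 2

# the ids show the new row order after sorting by mass
sorted$id
## [1] 2 4 1 3
```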
Clearest points
- working directory, here
- reordering factors
- mutate
- tibble vs data frame
- factors
- filtering
Glad to hear we’re making progress!
Other points
Is there a list somewhere of all potential colors?
A couple answers:
- See this page for a list of “named” colors in R, or the ggplot2 cookbook for a smaller list.
- We will talk more about palettes when we finish part4, but there are many, many. I suggest finding a package or two that has the palettes you like and working with those. See a bunch listed here (scroll down in the readme). My favorites are ggthemes and colorBlindness.
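You can also get the list of named colors from inside R itself with the base function colors(). A small sketch (the search pattern "orchid" is just an example):

```r
# colors() returns the names of all built-in named colors
allcolors <- colors()
length(allcolors)
## [1] 657

# look at the first few names
head(allcolors)

# search the names for a pattern, for example every color containing "orchid"
grep("orchid", allcolors, value = TRUE)
```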
I’m curious what the best practice is for stringing things together versus breaking them into pieces. For example, if I was trying to make a binary variable where all values were classified as larger or greater than the mean, I could use mean() inside several other functions like mutate(). Alternately I could calculate mean() [meanxx <- mean(xxx)] and save it as an object, and then use the other functions on that value. I’m curious because it seems like if you did too many functions at once and were getting errors, it would be hard to tell what was wrong. But if you did it in a more stepwise fashion, you could see (for example) that mean() wasn’t working because there were NAs in your dataset. More importantly, I think if you were getting an erroneous answer (not an error, but a wrong answer, like if you calculated the mean of a variable but your NA’s were marked with “-88” and so R considered these actual observations) you might not know if you joined too many functions together and didn’t “see” what was happening under the hood. I’m curious how to deal with that.
I copied over this whole question because I think it is an excellent one, and well said (hope you don’t mind)! I think this is something that evolves as you become more experienced in coding and debugging, and as you find your own style of coding. I will talk some about debugging later, but what you are saying about breaking things up into pieces absolutely helps with that.
The one thing to make sure of is that if you are saving intermediate steps, such as meanxx <- mean(mydata$xx), and using it later, but then you update the data set (filter, replace NAs, fix an incorrect data entry, whatever), you need to make sure to update/re-calculate that mean object as it no longer matches your newer data set! So there is more to keep track of, in that case.
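Here is a minimal sketch of how a saved value goes stale, using a made-up vector where a missing value was coded as -88, as in the question:

```r
# made-up data where -88 was used to mark a missing value
xx <- c(10, 20, -88, 30)

meanxx <- mean(xx)   # calculated before cleaning the data
meanxx
## [1] -7

# fix the data: recode -88 as NA
xx[xx == -88] <- NA

meanxx               # still the old value, it does not update on its own
## [1] -7

meanxx <- mean(xx, na.rm = TRUE)  # re-calculate after the fix
meanxx
## [1] 20
```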
I will say that if you are keeping track of all the steps well, then functionally it does not matter too much, so if it makes things easier to break it up, do that! If you like to chain everything together (often I do) you can run each piece by highlighting the code and running just that part to see what is going on, and this is something I do often.
Your example is something I would probably do, though, as using the mean inside mutate does make me a bit nervous. For example, let’s use median because it’s easier to check my work at the end:
library(janitor) # for tabyl()
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# there are NAs in here:
median(penguins$body_mass_g)
## [1] NA
# let's save the median as a vector of length 1, remove NAs
tmpmedian <- median(penguins$body_mass_g, na.rm = TRUE)
tmpmedian
## [1] 4050
penguins <- penguins %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= tmpmedian ~ "yes",
      body_mass_g < tmpmedian ~ "no" # this allows NAs to remain NA
    ))
penguins %>% tabyl(large_mass)
## large_mass n percent valid_percent
## no 170 0.494186047 0.497076
## yes 172 0.500000000 0.502924
## <NA> 2 0.005813953 NA
# if I had just used median() without removing NAs, every value is NA:
penguins %>%
  mutate(large_mass = 1*(body_mass_g >= median(body_mass_g))) %>%
  tabyl(large_mass)
## large_mass n percent valid_percent
## NA 344 1 NA
# Note if I just want females, this no longer makes sense,
# because tmpmedian was calculated from all penguins:
penguins %>%
  filter(sex=="female") %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= tmpmedian ~ "yes",
      body_mass_g < tmpmedian ~ "no" # this allows NAs to remain NA
    )) %>%
  tabyl(large_mass)
## large_mass n percent
## no 107 0.6484848
## yes 58 0.3515152
# but this would, because the median is re-calculated on the filtered data:
penguins %>%
  filter(sex=="female") %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= median(body_mass_g, na.rm = TRUE) ~ "yes",
      body_mass_g < median(body_mass_g, na.rm = TRUE) ~ "no"
    )) %>%
  tabyl(large_mass)
## large_mass n percent
## no 80 0.4848485
## yes 85 0.5151515
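One more option, which was not part of the answer above: if you want a separate cutoff within each group rather than filtering to one group, group_by() makes the median inside mutate() get calculated per group. This is a sketch on made-up data (mydata, group, and mass are invented names), not the penguins result:

```r
library(dplyr)

# made-up data, just for illustration
mydata <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  mass  = c(1, 2, 3, 10, 20, NA)
)

mydata <- mydata %>%
  group_by(group) %>%
  mutate(
    group_median = median(mass, na.rm = TRUE),  # calculated separately per group
    large_mass = case_when(
      mass >= group_median ~ "yes",
      mass < group_median ~ "no"                # NA stays NA
    )) %>%
  ungroup()

mydata$group_median
## [1] 2 2 2 15 15 15
mydata$large_mass
## [1] "no" "yes" "yes" "no" "yes" NA
```

Saving the median as its own column also makes it easy to check the cutoff that was actually used for each row.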