Part 4. dplyr: mutate, group_by, summarize, across

Materials from class on Wednesday, February 1, 2023

R Project files

Please download the part4 folder from this Dropbox folder link. Be sure to unzip it if necessary. “Knit” the part4.Rmd file to install packages and to make sure the data loads correctly.

(We did not finish part4, and will finish it in class 5.)

This year’s class video

See Slack for the Zoom recording link.

Last Year’s Class Video

View last year’s class and materials here.

Slides

During the “Muddiest points” review, we will go over these slides.

Another useful video

Dr. Kelly Bodwin’s forcats/factor video

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

Clearest Point: What was the most clear part of the lecture?
Muddiest Point: What was the most unclear part of the lecture to you?
Anything Else: Is there something you’d like me to know?

https://bit.ly/bsta504_postclass_survey

Muddiest points

I’ve noticed some confusion about what I call “saving your work”, so we’ll go over these slides.

using factors, what you’re doing and the benefit of turning things into factors in mutate

I usually turn something into a factor for plotting (especially if I have a categorical variable stored as numbers), and we’ll see more examples of that. We will also see later how it matters in statistical modeling/regression. It is also often easier to manage levels/categories this way, as we will see when we talk about the forcats package again in class 6.
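As a small sketch of what happens when a numeric category is turned into a factor inside mutate(), here is a toy example. The data frame toy and the column year_fac are made up for illustration and are not part of the class materials.

```r
library(dplyr)

# hypothetical toy data: year is stored as a number but is really a category
toy <- tibble(
  year = c(2007L, 2008L, 2009L, 2008L),
  mass = c(3750, 4200, 5100, 3900)
)

# add a factor version of year as a new column and save it back into toy
toy <- toy %>%
  mutate(year_fac = factor(year))

class(toy$year)       # "integer"
class(toy$year_fac)   # "factor"
levels(toy$year_fac)  # "2007" "2008" "2009"

# counting by the factor treats each year as its own category
toy %>% count(year_fac)
```

In a ggplot, mapping color to a factor column like year_fac should give one distinct color per level, while the numeric year would be drawn on a continuous color gradient.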

case_when is not easy

Correct! There were also other comments asking for more practice with case_when(). We will continue to see examples of it as we finish part5 and in other classes. It’s a very handy function, so I use it a lot! See also the video above about factors for another explanation.
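For a bit of extra practice, here is a minimal case_when() sketch on made-up data (the toy data frame and the cutoffs 3500 and 5000 are assumptions chosen only for illustration). The conditions are checked from top to bottom, and each row gets the value from the first condition that is TRUE.

```r
library(dplyr)

# hypothetical toy data, not from the class materials
toy <- tibble(mass = c(3200, 4100, 5600, NA))

toy <- toy %>%
  mutate(
    size = case_when(
      mass < 3500  ~ "small",
      mass < 5000  ~ "medium",  # only reached when mass is not below 3500
      mass >= 5000 ~ "large"    # an NA mass matches no condition, so size stays NA
    )
  )

toy$size
```

Because the first matching condition wins, the second line does not need to say mass >= 3500 & mass < 5000; rows below 3500 have already been labeled "small".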

The function for converting a vector back from factor to character - I thought I had it, but I didn’t.

Oh, I didn’t show this!

# make a character vector
myvec <- c("medium", "low", "high", "low")
myvec_fac <- factor(myvec)
myvec_fac

## [1] medium low    high   low   
## Levels: high low medium

class(myvec_fac)

## [1] "factor"

# get the levels out
levels(myvec_fac)

## [1] "high"   "low"    "medium"

# Note we can "test" the classes of something like so:
is.factor(myvec_fac)

## [1] TRUE

is.character(myvec_fac)

## [1] FALSE

# Now we can change it back
myvec2 <- as.character(myvec_fac)
myvec2

## [1] "medium" "low"    "high"   "low"

class(myvec2)

## [1] "character"

levels(myvec2) # no levels, because it's not a factor

## NULL

# we could also change to numeric, how do you think it picks which number is which?
myvec3 <- as.numeric(myvec_fac)
myvec3

## [1] 3 2 1 2

# the levels, in order, are assigned 1, 2, 3
table(myvec_fac, myvec3)

##          myvec3
## myvec_fac 1 2 3
##    high   1 0 0
##    low    0 2 0
##    medium 0 0 1

# change the level order
myvec_fac2 <- factor(myvec, levels = c("low", "medium", "high"))
levels(myvec_fac2)

## [1] "low"    "medium" "high"

myvec4 <- as.numeric(myvec_fac2)
myvec4

## [1] 2 1 3 1

table(myvec_fac2, myvec4)

##           myvec4
## myvec_fac2 1 2 3
##     low    2 0 0
##     medium 0 1 0
##     high   0 0 1
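One related thing to watch for, sketched here with a made-up vector (dose_fac is hypothetical): when the factor labels themselves look like numbers, as.numeric() still returns the level codes, not the labels. Going through as.character() first recovers the original numbers.

```r
# hypothetical example: a factor whose labels look like numbers
dose_fac <- factor(c("10", "50", "10", "100"))

# levels are sorted as text, not as numbers
levels(dose_fac)

# this gives the position of each value in the levels, not the value itself
codes <- as.numeric(dose_fac)
codes

# convert to character first to get the actual numbers back
values <- as.numeric(as.character(dose_fac))
values
```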

factor vs as.factor

Essentially the same. From the help documentation ?factor: “as.factor coerces its argument to a factor. It is an abbreviated (sometimes faster) form of factor.”
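A quick sketch of the comparison, reusing the same values as myvec above (the object names x, f1, f2, f3 are just for this example). For a character vector the two give the same result; the practical difference is that factor() has extra arguments such as levels, while as.factor() only takes the vector.

```r
x <- c("medium", "low", "high", "low")

f1 <- as.factor(x)
f2 <- factor(x)

# same factor either way for a character vector
identical(f1, f2)

# factor() lets us set the level order at the same time
f3 <- factor(x, levels = c("low", "medium", "high"))
levels(f3)
```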

I would like to know when you recommend that we save a new data set once we create new covariates. Also, it is unclear to me how you add the variable to the existing data.

If I want to use that column/covariate again, I save it (so almost always, as I don’t often make a column without using it later). I usually save it back into the original data set I’m working with, that is, overwrite that object to be updated with the new column. As long as I keep track of my changes this is definitely ok. It can get confusing having too many versions of a data set floating around. If something is broken, the worst that happens is that you’ll just need to start from the beginning and reload your data (the data file will remain untouched) and re-run the code.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(palmerpenguins)

# does not save the new column, just prints result
penguins %>% 
  mutate(newvec = bill_length_mm/bill_depth_mm)

## # A tibble: 344 × 9
##    species island    bill_length_mm bill_de…¹ flipp…² body_…³ sex    year newvec
##    <fct>   <fct>              <dbl>     <dbl>   <int>   <int> <fct> <int>  <dbl>
##  1 Adelie  Torgersen           39.1      18.7     181    3750 male   2007   2.09
##  2 Adelie  Torgersen           39.5      17.4     186    3800 fema…  2007   2.27
##  3 Adelie  Torgersen           40.3      18       195    3250 fema…  2007   2.24
##  4 Adelie  Torgersen           NA        NA        NA      NA <NA>   2007  NA   
##  5 Adelie  Torgersen           36.7      19.3     193    3450 fema…  2007   1.90
##  6 Adelie  Torgersen           39.3      20.6     190    3650 male   2007   1.91
##  7 Adelie  Torgersen           38.9      17.8     181    3625 fema…  2007   2.19
##  8 Adelie  Torgersen           39.2      19.6     195    4675 male   2007   2   
##  9 Adelie  Torgersen           34.1      18.1     193    3475 <NA>   2007   1.88
## 10 Adelie  Torgersen           42        20.2     190    4250 <NA>   2007   2.08
## # … with 334 more rows, and abbreviated variable names ¹bill_depth_mm,
## #   ²flipper_length_mm, ³body_mass_g
## # ℹ Use `print(n = ...)` to see more rows

# saves new column in a data frame that is called penguins2
penguins2 <- penguins %>% 
  mutate(newvec = bill_length_mm/bill_depth_mm)
glimpse(penguins2)

## Rows: 344
## Columns: 9
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ newvec            <dbl> 2.090909, 2.270115, 2.238889, NA, 1.901554, 1.907767…

glimpse(penguins) # has not been changed

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

# saves new column in the original data frame penguins
# *overwrites penguins*
penguins <- penguins %>% 
  mutate(newvec = bill_length_mm/bill_depth_mm)
glimpse(penguins)

## Rows: 344
## Columns: 9
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
## $ newvec            <dbl> 2.090909, 2.270115, 2.238889, NA, 1.901554, 1.907767…

arrange vs filter

arrange orders (sorts) the rows of your data and does not remove or add anything, while filter keeps only the rows that meet a condition and removes the rest.
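Here is a small side-by-side sketch on made-up data (toy, sorted, and heavy are hypothetical names, and the 4000 cutoff is arbitrary). Checking the number of rows before and after is an easy way to see the difference.

```r
library(dplyr)

# hypothetical toy data, not from the class materials
toy <- tibble(
  id   = 1:4,
  mass = c(4100, 3200, 5600, 3900)
)

# arrange: same rows, new order (smallest mass first)
sorted <- toy %>% arrange(mass)
sorted$id
nrow(sorted)  # still 4 rows

# filter: keeps only rows where the condition is TRUE
heavy <- toy %>% filter(mass > 4000)
heavy$id
nrow(heavy)   # fewer rows than toy
```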

Clearest points

working directory, here
reordering factors
mutate
tibble vs data frame
factors
filtering

Glad to hear we’re making progress!

Other points

Is there a list somewhere of all potential colors?

A couple answers:

See this page for a list of “named” colors in R, or the ggplot2 cookbook for a smaller list.
We will talk more about palettes when we finish part4, but there are many, many of them. I suggest finding a package or two that has the palettes you like and working with those. See a bunch listed here (scroll down in the readme). My favorites are ggthemes and colorBlindness.

I’m curious what the best practice is for stringing things together versus breaking them into pieces. For example, if I was trying to make a binary variable where all values were classified as larger or greater than the mean, I could use mean() inside several other functions like mutate(). Alternately I could calculate mean() [meanxx <- mean(xxx)] and save it as an object, and then use the other functions on that value. I’m curious because it seems like if you did too many functions at once and were getting errors, it would be hard to tell what was wrong. But if you did it in a more stepwise fashion, you could see (for example) that mean() wasn’t working because there were NAs in your dataset. More importantly, I think if you were getting an erroneous answer (not an error, but a wrong answer, like if you calculated the mean of a variable but your NA’s were marked with “-88” and so R considered these actual observations) you might not know if you joined too many functions together and didn’t “see” what was happening under the hood. I’m curious how to deal with that.

I copied over this whole question because I think it is an excellent one, and well said (hope you don’t mind)! I think this is something that evolves as you become more experienced in coding and debugging, and as you find your own style of coding. I will talk some about debugging later, but what you are saying about breaking things up into pieces absolutely helps with that.

The one thing to make sure of is that if you are saving intermediate steps, such as meanxx <- mean(mydata$xx) and using it later, but then you update the data set (filter, replace NAs, fix an incorrect data entry, whatever), you need to make sure to update/re-calculate that mean object as it no longer matches your newer data set! So there is more to keep track of, in that case.
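To make that concrete, here is a minimal sketch with made-up numbers (mydata, xx, and the -88 missing-value code are borrowed from the question above and are purely illustrative). The saved mean does not change when the data set is updated, so it has to be re-calculated.

```r
# hypothetical data where -88 was used as a code for "missing"
mydata <- data.frame(xx = c(2, 4, 6, -88))

# saved intermediate step: the -88 is treated as a real observation
meanxx <- mean(mydata$xx)
meanxx

# later we clean the data by dropping the -88 rows
mydata <- mydata[mydata$xx != -88, , drop = FALSE]

# the saved object is now stale: it still reflects the old data
meanxx
mean(mydata$xx)

# so re-calculate it after every change to the data
meanxx <- mean(mydata$xx)
```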

I will say that if you are keeping track of all the steps well, then functionally it does not matter too much, so if it makes things easier to break it up, do that! If you like to chain everything together (often I do) you can run each piece by highlighting the code and running just that part to see what is going on, and this is something I do often.

Your example is something I would probably do, though, as using the mean inside mutate does make me a bit nervous. For example, let’s use median because it’s easier to check my work at the end:

library(janitor) # for tabyl()

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# there are NAs in here:
median(penguins$body_mass_g)

## [1] NA

# let's save the median as a vector of length 1, remove NAs
tmpmedian <- median(penguins$body_mass_g, na.rm = TRUE)
tmpmedian

## [1] 4050

penguins <- penguins %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= tmpmedian ~ "yes",
      body_mass_g < tmpmedian ~ "no" # this allows NAs to remain NA
    ))

penguins %>% tabyl(large_mass)

##  large_mass   n     percent valid_percent
##          no 170 0.494186047      0.497076
##         yes 172 0.500000000      0.502924
##        <NA>   2 0.005813953            NA

# if I had just used median without checking for NAs, they all are NA:
penguins %>%
  mutate(large_mass = 1*(body_mass_g >= median(body_mass_g))) %>%
  tabyl(large_mass)

##  large_mass   n percent valid_percent
##          NA 344       1            NA

# Note if I just want females, this no longer makes sense:
penguins %>%
  filter(sex=="female") %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= tmpmedian ~ "yes",
      body_mass_g < tmpmedian ~ "no" # this allows NAs to remain NA
    )) %>%
  tabyl(large_mass)

##  large_mass   n   percent
##          no 107 0.6484848
##         yes  58 0.3515152

# but this would:
penguins %>%
  filter(sex=="female") %>%
  mutate(
    large_mass = case_when(
      body_mass_g >= median(body_mass_g, na.rm = TRUE) ~ "yes",
      body_mass_g < median(body_mass_g, na.rm = TRUE) ~ "no" 
    )) %>%
  tabyl(large_mass)

##  large_mass  n   percent
##          no 80 0.4848485
##         yes 85 0.5151515

Last updated on February 13, 2023
