(Class 9) Part 7 continued and Part 8. Intro to stats/`broom`/More purrr

Materials from class on Wednesday, March 8, 2023

R Project files
Class Video
Post-Class
Muddiest Points
Clearest points

R Project files

Please download the part8 sub-folder from this Dropbox link. Be sure to unzip it if necessary. Knit part8.Rmd to install any required packages.

Class Video

View last year’s class and materials here.

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

Clearest Point: What was the most clear part of the lecture?
Muddiest Point: What was the most unclear part of the lecture to you?
Anything Else: Is there something you’d like me to know?

https://bit.ly/bsta504_postclass_survey

Muddiest Points

A lot of things were difficult including “this whole class” as one person said, and yeah, I get it, it’s hard stuff! The reason I teach these harder topics like for loops, functions, map, etc, as opposed to just going over more of the same kind of data cleaning tasks with various examples, is because it’s a lot harder to be motivated to learn the hard stuff if you’ve never been exposed to it. It will probably seem too daunting (I know this because it took me a long time to force myself to learn ggplot, or purrr::map, or even across and the new pivot_longer because I already had other ways of doing that).

You have the tools by now to learn how to do other data cleaning tasks related to what we’ve learned (i.e. more factor and string manipulation, even working with dates will not be that hard to figure out).

Also, part of the reason R is so powerful and useful is that it’s a “real” programming language (more similar to C, Python, Java, etc. than to SPSS, or even SAS or Stata). This part of it will take a lot of practice to feel comfortable with if you haven’t had any programming experience. If you have had programming experience, seeing how it’s done in R will get you started in the right direction toward using R-specific programming tools like purrr::map, which are truly useful when automating data tasks.

for loop was a bit confusing when making empty vector

It really is, and that is why I recommend not using for loops but embracing map()! We could get even more technical and talk about how it’s actually better (faster and more memory-efficient) to specify the length or dimension of the empty vector (or data frame, or list, or whatever) up front, which is called pre-allocation, because of how memory is allocated in R. But no, I refuse to go down that rabbit hole and will just say: use map()!

Side note: If you’re working with data with millions of records, you’ll have plenty of speed issues to worry about, and you need an even more advanced R programming class focused on big data.
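For anyone who does want to see the three approaches side by side, here is a minimal sketch. The vector `x` and the result names are made up for illustration; the point is only that all three give the same answer, and map_dbl() takes care of creating the output vector for you.

```r
library(purrr)

x <- c(2, 4, 6)

# 1. Start with an empty vector and grow it inside the loop
squares_grow <- c()
for (i in seq_along(x)) {
  squares_grow[i] <- x[i]^2
}

# 2. Pre-allocate the result to its final length, then fill it in
squares_prealloc <- vector("double", length(x))
for (i in seq_along(x)) {
  squares_prealloc[i] <- x[i]^2
}

# 3. Let map_dbl() create and fill the output vector
squares_map <- map_dbl(x, function(value) value^2)

squares_map
```

All three objects hold the same values; the map_dbl() version simply has no bookkeeping to get wrong.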

I think the whole creation of the function is still quite a bit hazy for me. I believe it’s something that just takes some more practice. Hoping we can fit some more practice challenges to help really build this understanding.

We will start class with another function example, but please ask questions about anything confusing about it during class, too!
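As a reminder of the basic anatomy of a function, here is a minimal sketch. `std_error` is a made-up helper for illustration, not something from the class materials: it has a name, arguments (one with a default value), a body, and it returns the last value computed.

```r
# hypothetical helper: standard error of the mean
std_error <- function(x, na.rm = TRUE) {
  # optionally drop missing values before computing anything
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  # the last expression evaluated is what the function returns
  sd(x) / sqrt(length(x))
}

std_error(c(1, 2, 3, NA))
```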

I am still struggling with functions! In the reading on functions, I got confused about the difference between the && || operators and the & | operators.. The reading said “beware of floating point numbers” and I’m not sure what that is.

As we saw in class, the & “and” operator and the | “or” operator are logical operators used to combine one condition with another, such as:

thing <- 3
is.na(thing) | thing == 3

## [1] TRUE

is.na(thing) & thing == 3

## [1] FALSE

But remember we talked about how most functions in R are vectorized, which means they work element by element over a whole vector. This is true for | and & as well. The double && and || are the non-vectorized versions: they expect a single TRUE/FALSE on each side and return a single TRUE/FALSE, which is what an if statement needs. In versions of R before 4.3.0 they used only the first element of a longer vector (as in the output below, with a warning in R 4.2); since R 4.3.0 a longer vector is an error. You likely don’t need to worry about this, and you probably want the single & and |.

thing <- 1:3
is.na(thing) | thing == 3

## [1] FALSE FALSE  TRUE

is.na(thing) & thing == 3

## [1] FALSE FALSE FALSE

is.na(thing) || thing == 3

## [1] FALSE

is.na(thing) && thing == 3

## [1] FALSE
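To show where the double operators fit, here is a small sketch with made-up objects (`thing_na`, `value`). It uses only length-one conditions with &&, so it should behave the same way in old and new versions of R.

```r
thing_na <- c(1, NA, 3)

# & is vectorized: one TRUE/FALSE per element
elementwise <- !is.na(thing_na) & thing_na > 2
elementwise

# any() and all() collapse a logical vector to a single TRUE/FALSE,
# which is what if() wants
if (any(is.na(thing_na))) {
  message("thing_na has at least one missing value")
}

# && works on single values and stops as soon as the answer is known,
# so the right-hand side is never evaluated when the left side is FALSE
value <- NULL
is_big <- !is.null(value) && value > 2
is_big
```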

Another very specific situation mentioned in that reading involves floating point numbers (numeric values with a decimal part, which the computer can only store to a limited precision). Because of rounding in how they are stored and computed, two values that should be equal are sometimes not exactly equal, so you need to be wary of using == with them. The example from the reading sums it up well:

thing <- sqrt(2)^2 # should be 2, right?
2==thing # huh

## [1] FALSE

identical(2,thing) # weird

## [1] FALSE

2 - thing # extremely small value

## [1] -4.440892e-16

# I used to check for "equality" this way...before I knew about dplyr::near()
# (abs() matters: the difference can be negative, as it is here)
abs(2 - thing) < 1e-8

## [1] TRUE

dplyr::near(2,thing)

## [1] TRUE
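The same issue shows up with very ordinary-looking numbers. Here is a minimal sketch using the classic 0.1 + 0.2 example (my own choice of example, not from the reading), along with two tolerance-based comparisons: base R's all.equal() and dplyr::near().

```r
total <- 0.1 + 0.2

# exact comparison fails because neither side is stored exactly
exact_match <- total == 0.3

# all.equal() compares within a small tolerance;
# wrap it in isTRUE() to get a plain TRUE/FALSE
base_match <- isTRUE(all.equal(total, 0.3))

# near() is the vectorized tidyverse version, handy inside filter()
near_match <- dplyr::near(total, 0.3)

c(exact = exact_match, all_equal = base_match, near = near_match)
```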

Still struggling with the difference between [[]] and [] and unclear on whether that distinction is actually important functionally.

It is very important functionally. Think back to the homework question where you got different data types depending on which one you used: applied to a list, [] always returns a list, while [[]] returns whatever is stored inside that one “slot”. Sometimes you want a list, sometimes you don’t. Usually you only want a list (e.g. list[1:2]) if you are asking for multiple elements; otherwise you want to pull out what’s inside the slot and use list[[1]].
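A minimal sketch of why the distinction matters in practice, using a made-up list called `example_list`: the single bracket hands back a smaller list, which most functions that expect a vector cannot use.

```r
example_list <- list(a = 1:3, b = "hello")

# single bracket: still a list (of length 1)
class(example_list[1])

# double bracket: the integer vector stored in the slot
class(example_list[[1]])

# functions that expect a vector need the contents, not the list
slot_mean <- mean(example_list[[1]])
slot_mean

# mean(example_list[1]) would return NA with a warning,
# because a list is not numeric

# $ is shorthand for [[ ]] with a name
identical(example_list[["a"]], example_list$a)
```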

Note that a lot of newer packages make dealing with complex lists less common than it used to be. The example I gave was the broom package's tidy() function. In the past, we all learned how to pull out parts of regression output by accessing parts of the list using [[]] and $, just like I showed in class. Probably a lot of your biostats classes still do it this way because that is how your professor learned it. But now we can just use broom::tidy() to get a data frame of coefficients, standard errors, and p-values (plus confidence intervals if you add conf.int = TRUE).
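Here is a small sketch of the two styles side by side. The model itself (body mass regressed on flipper length in the penguins data) is just an assumption for illustration and is not necessarily the one used in class.

```r
library(broom)
library(palmerpenguins)

fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# the "old" way: dig into the list returned by summary()
old_way <- summary(fit)[["coefficients"]]
old_way["flipper_length_mm", "Estimate"]

# the broom way: one data frame, one row per term
coef_table <- tidy(fit, conf.int = TRUE)
coef_table
```

Because the result is a regular data frame, it can go straight into filter(), mutate(), or a table or plot.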

If pluck and pull do the same thing, is there any advantage to using one over the other?

As I mentioned last class, pluck and pull are similar in that they both “pull out” an element, but they are meant for different objects, so it isn’t really a question of one having an “advantage” over the other. pluck is for lists and pull is for data frames (which are also lists, but you can’t use pull on a list that isn’t a data frame; you need to use pluck in that case).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(palmerpenguins)
# try this on your own
# a list that is not a data frame
# WHY is it not a data frame?
mylist <- list("a"=1:3, "b" = 2) 
# mylist %>% pull("a")
# Error in UseMethod("pull") : 
#  no applicable method for 'pull' applied to an object of class "list"

Side note, see the difference here:

as.data.frame(mylist)

##   a b
## 1 1 2
## 2 2 2
## 3 3 2

mylist <- list("a"=1:3, "b"=2:4)
mylist

## $a
## [1] 1 2 3
## 
## $b
## [1] 2 3 4

as.data.frame(mylist)

##   a b
## 1 1 2
## 2 2 3
## 3 3 4

If we do have a data frame/tibble and want to “pull out” a column as a vector (not as a data frame), we are also pulling out an element from a list because a data frame is also a list!

Here is how we would use pull and pluck to do the “same thing” on a data frame:

# remember a tibble is a special kind of data frame, which is a special kind of list
str(penguins)

## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

class(penguins)

## [1] "tbl_df"     "tbl"        "data.frame"

typeof(penguins)

## [1] "list"

s = penguins %>% pull(species)
str(s)

##  Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...

# does not work: pluck() needs the list element name in quotes
# s2 = penguins %>% pluck(species)
# Error in list2(...) : object 'species' not found

s2 = penguins %>% pluck("species")
str(s2)

##  Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...

# are they the same?
identical(s, s2)

## [1] TRUE
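One place pluck can be handy is with lists inside lists, because it accepts several names or positions in a row. A minimal sketch, using a made-up list called `nested`:

```r
library(purrr)

nested <- list(a = 1:3, b = list(c = "deep", d = 10))

# go two levels down in one call; same as nested[["b"]][["c"]]
deep_value <- pluck(nested, "b", "c")
deep_value

# names and positions can be mixed; same as nested[["a"]][[2]]
second_a <- pluck(nested, "a", 2)
second_a

# a missing element gives NULL by default, or whatever .default says
missing_value <- pluck(nested, "z", .default = NA)
missing_value
```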

I am not in the habit of using pluck yet, because I am used to [[]] and use it when I need it. I do use pull all the time to get a vector, though, for example:

penguins %>% 
  group_by(species) %>%
  summarize(m = mean(bill_length_mm, na.rm = TRUE)) %>%
  pull(m)

## [1] 38.79139 48.83382 47.50488

Or let’s say I want a vector of patient (penguin) ids for a subset:

mypenguins <- penguins %>%
  mutate(id = row_number(), .before = "species")

mypenguins %>% 
  filter(bill_length_mm < 35)

## # A tibble: 9 × 9
##      id species island    bill_length_mm bill_dept…¹ flipp…² body_…³ sex    year
##   <int> <fct>   <fct>              <dbl>       <dbl>   <int>   <int> <fct> <int>
## 1     9 Adelie  Torgersen           34.1        18.1     193    3475 <NA>   2007
## 2    15 Adelie  Torgersen           34.6        21.1     198    4400 male   2007
## 3    19 Adelie  Torgersen           34.4        18.4     184    3325 fema…  2007
## 4    55 Adelie  Biscoe              34.5        18.1     187    2900 fema…  2008
## 5    71 Adelie  Torgersen           33.5        19       190    3600 fema…  2008
## 6    81 Adelie  Torgersen           34.6        17.2     189    3200 fema…  2008
## 7    93 Adelie  Dream               34          17.1     185    3400 fema…  2008
## 8    99 Adelie  Dream               33.1        16.1     178    2900 fema…  2008
## 9   143 Adelie  Dream               32.1        15.5     188    3050 fema…  2009
## # … with abbreviated variable names ¹bill_depth_mm, ²flipper_length_mm,
## #   ³body_mass_g

ids_short_bill <- mypenguins %>% 
  filter(bill_length_mm < 35) %>% 
  pull(id)

Now I have a vector of IDs that satisfy my bill length requirements.

ids_short_bill

## [1]   9  15  19  55  71  81  93  99 143

I just want to check my understanding is correct. The map() is for list and it can be used as itself, but the across() function is only for data frame or tibble and can be used inside the mutate() function. Is that correct? Then, can we use any function inside those map(), and mutate() ?

I really like this distinction and clarification! Yes to this part:

map() can be used by itself like, list %>% map(.f = length), applied to a list or vector
across() can only be used as a helper function inside mutate or summarize applied to a data frame/tibble

Also:

inside across() we need to use very specific syntax which is called tidyselect.
Think of across() and select() as friends, because they use the same language to select columns.

But across() is used more like map() in that it takes a “what” argument (.cols = tidyselect columns for across, .x = a list or vector for map) and a “function” argument (.fns = for across because multiple functions can be supplied, .f = for map because only one function can be applied).
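To make the .fns versus .f difference concrete, here is a minimal sketch. The choice of columns and summary functions is mine, for illustration only: across() gets a named list of two functions, while map_dbl() gets a single function.

```r
library(dplyr)
library(purrr)
library(palmerpenguins)

# across(): tidyselect columns plus a named list of functions
bill_summary <- penguins %>%
  summarize(across(
    .cols = starts_with("bill"),
    .fns = list(
      mean = function(x) mean(x, na.rm = TRUE),
      max = function(x) max(x, na.rm = TRUE)
    )
  ))
bill_summary

# map_dbl(): a list (here, two columns of a data frame) plus one function
bill_means <- penguins %>%
  select(starts_with("bill")) %>%
  map_dbl(.f = function(x) mean(x, na.rm = TRUE))
bill_means
```

The across() result is a one-row data frame with one column per column-function pair, while the map_dbl() result is a named numeric vector.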

library(palmerpenguins)

penguins %>% select(where(is.numeric))

## # A tibble: 344 × 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           39.1          18.7               181        3750  2007
##  2           39.5          17.4               186        3800  2007
##  3           40.3          18                 195        3250  2007
##  4           NA            NA                  NA          NA  2007
##  5           36.7          19.3               193        3450  2007
##  6           39.3          20.6               190        3650  2007
##  7           38.9          17.8               181        3625  2007
##  8           39.2          19.6               195        4675  2007
##  9           34.1          18.1               193        3475  2007
## 10           42            20.2               190        4250  2007
## # … with 334 more rows

# penguins %>% across(where(is.numeric))
# Error in `across()`:
# ! Must only be used inside data-masking verbs like `mutate()`,
#   `filter()`, and `group_by()`.

# mutate requires a function that returns a vector the same length as the original vector (or length 1)
penguins %>% mutate(across(.cols = where(is.numeric), .fns = as.character))

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex   year 
##    <fct>   <fct>     <chr>          <chr>         <chr>      <chr>   <fct> <chr>
##  1 Adelie  Torgersen 39.1           18.7          181        3750    male  2007 
##  2 Adelie  Torgersen 39.5           17.4          186        3800    fema… 2007 
##  3 Adelie  Torgersen 40.3           18            195        3250    fema… 2007 
##  4 Adelie  Torgersen <NA>           <NA>          <NA>       <NA>    <NA>  2007 
##  5 Adelie  Torgersen 36.7           19.3          193        3450    fema… 2007 
##  6 Adelie  Torgersen 39.3           20.6          190        3650    male  2007 
##  7 Adelie  Torgersen 38.9           17.8          181        3625    fema… 2007 
##  8 Adelie  Torgersen 39.2           19.6          195        4675    male  2007 
##  9 Adelie  Torgersen 34.1           18.1          193        3475    <NA>  2007 
## 10 Adelie  Torgersen 42             20.2          190        4250    <NA>  2007 
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## #   ²body_mass_g

# this works but it shouldn't, and it is "deprecated" as of dplyr 1.1.0
# summarize SHOULD return a vector of length 1 per group
penguins %>% summarize(across(.cols = where(is.numeric), .fns = as.character))

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.

## # A tibble: 344 × 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year 
##    <chr>          <chr>         <chr>             <chr>       <chr>
##  1 39.1           18.7          181               3750        2007 
##  2 39.5           17.4          186               3800        2007 
##  3 40.3           18            195               3250        2007 
##  4 <NA>           <NA>          <NA>              <NA>        2007 
##  5 36.7           19.3          193               3450        2007 
##  6 39.3           20.6          190               3650        2007 
##  7 38.9           17.8          181               3625        2007 
##  8 39.2           19.6          195               4675        2007 
##  9 34.1           18.1          193               3475        2007 
## 10 42             20.2          190               4250        2007 
## # … with 334 more rows

penguins %>% summarize(across(.cols = where(is.numeric), .fns = length))

## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <int>         <int>             <int>       <int> <int>
## 1            344           344               344         344   344

mylist <- list("a"=1:3, "b" = 2, c = penguins) 

# .x can be piped into map or used as an explicit argument
mylist %>% map(.f = length)

## $a
## [1] 3
## 
## $b
## [1] 1
## 
## $c
## [1] 8

map(.x = mylist, .f = length)

## $a
## [1] 3
## 
## $b
## [1] 1
## 
## $c
## [1] 8

# this also works because penguins is a data frame which means it is also a list (columns are elements)
penguins %>% map(.f = length)

## $species
## [1] 344
## 
## $island
## [1] 344
## 
## $bill_length_mm
## [1] 344
## 
## $bill_depth_mm
## [1] 344
## 
## $flipper_length_mm
## [1] 344
## 
## $body_mass_g
## [1] 344
## 
## $sex
## [1] 344
## 
## $year
## [1] 344

map(.x = penguins, .f = length)

## $species
## [1] 344
## 
## $island
## [1] 344
## 
## $bill_length_mm
## [1] 344
## 
## $bill_depth_mm
## [1] 344
## 
## $flipper_length_mm
## [1] 344
## 
## $body_mass_g
## [1] 344
## 
## $sex
## [1] 344
## 
## $year
## [1] 344

However, as we will see in class today, we can also use map() inside mutate() when we are using nested data frames, or when we need to “vectorize” a non-vectorized function. In this case, map() is being applied to a list of data that is inside a column of a data frame. It’s complicated, and we’ll see more today.
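As a small preview, here is a minimal sketch of map() inside mutate() on a nested data frame. The column names `n_penguins` and `mean_mass` are made up for illustration, and the class example may look different.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(palmerpenguins)

by_species <- penguins %>%
  # one row per species; all other columns are tucked into a list-column called data
  nest(data = -species) %>%
  mutate(
    # map_int()/map_dbl() run the function once per nested data frame
    n_penguins = map_int(data, nrow),
    mean_mass = map_dbl(data, function(df) mean(df$body_mass_g, na.rm = TRUE))
  )

by_species
```

Each element of the data column is itself a tibble, so the function given to map sees one species' data at a time.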

Clearest points

Every topic in the muddy list was also in the clear list, so at least all is not lost. I think more practice will help.

Last updated on March 14, 2023
