(Class 9) Part 7 continued and Part 8. Intro to stats/`broom`/More Purrr
R Project files
Please download the part8 sub-folder from this dropbox link. Be sure to unzip if necessary. Knit the part8.Rmd to install any required packages.
Class Video
View last year’s class and materials here.
Post-Class
Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.
- Clearest Point: What was the most clear part of the lecture?
- Muddiest Point: What was the most unclear part of the lecture to you?
- Anything Else: Is there something you’d like me to know?
Muddiest Points
A lot of things were difficult including “this whole class” as one person said, and yeah, I get it, it’s hard stuff! The reason I teach these harder topics like for loops, functions, map, etc, as opposed to just going over more of the same kind of data cleaning tasks with various examples, is because it’s a lot harder to be motivated to learn the hard stuff if you’ve never been exposed to it. It will probably seem too daunting (I know this because it took me a long time to force myself to learn ggplot, or purrr::map, or even across and the new pivot_longer because I already had other ways of doing that).
You have the tools by now to learn how to do other data cleaning tasks related to what we’ve learned (i.e. more factor and string manipulation, even working with dates will not be that hard to figure out).
Also, part of the reason R is so powerful and useful is that it’s a “real” programming language (more similar to C, Python, Java, etc. than SPSS or even SAS or Stata). This part of it will take a lot of practice to feel comfortable with if you haven’t had any programming experience. If you have had programming experience, seeing how it’s done in R will get you started in the right direction toward using the R-specific programming tools like purrr::map that are truly so useful when automating data tasks.
for loop was a bit confusing when making empty vector
It really is, and that is why I recommend not using for loops but embracing map()! We could get even more technical and talk about how it’s actually better (faster and more memory-efficient) to specify the length or dimension of the empty vector (or data frame, or list, or whatever) up front, which is called pre-allocation, because of how memory is allocated in R, but, no, I refuse to go down that rabbit hole and just say, use map()!
Side note: If you’re working with data with millions of records, you’ll have plenty of speed issues to worry about, and you need an even more advanced R programming class focused on big data.
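For anyone curious what pre-allocation looks like without going down the rabbit hole, here is a minimal sketch. The vector x and the squaring step are made up for illustration; the point is only that the loop needs a result vector of the right length set up first, while map_dbl() handles that bookkeeping for you.

```r
library(purrr)

x <- c(2, 4, 6)  # made-up input, just for illustration

# for loop: create the result vector at its final length first (pre-allocation)
squares_loop <- vector("double", length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# map_dbl(): no empty vector to set up, the result is built for us
squares_map <- map_dbl(x, ~ .x^2)

squares_loop
squares_map
identical(squares_loop, squares_map)
```

Both approaches should give the same numeric vector; the map version just has fewer places to make a mistake.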
I think the whole creation of the function is still quite a bit hazy for me. I believe it’s something that just takes some more practice. Hoping we can fit some more practice challenges to help really build this understanding.
We will start class with another function example, but please ask questions about anything confusing about it during class, too!
I am still struggling with functions! In the reading on functions, I got confused about the difference between the && || operators and the & | operators. The reading said “beware of floating point numbers” and I’m not sure what that is.
As we saw in class, the & “and” operator and | “or” operator are logical operators used to combine one condition with another, such as:
thing <- 3
is.na(thing) | thing == 3
## [1] TRUE
is.na(thing) & thing == 3
## [1] FALSE
But remember we talked about how most functions in R are vectorized, which means they work seamlessly over a vector. This is true for | and & as well. The double && and || are not vectorized: they expect a single TRUE or FALSE on each side, and they stop evaluating as soon as the answer is known. Older versions of R quietly used only the first element when given a longer vector (that is the output shown here); since R 4.3.0 this is an error instead. The double operators become useful for if statements, but you likely don’t need to worry about them, and you probably want the single & and |.
thing <- 1:3
is.na(thing) | thing == 3
## [1] FALSE FALSE TRUE
is.na(thing) & thing == 3
## [1] FALSE FALSE FALSE
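# in R < 4.3.0 the double operators only look at the first element of each side;
# in R >= 4.3.0 these two lines give an error because the inputs have length 3
is.na(thing) || thing == 3
## [1] FALSE
is.na(thing) && thing == 3
## [1] FALSE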
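Here is a small sketch of where the double && earns its keep inside an if statement. The variable x and the messages are made up for illustration. Because && stops as soon as the left side is FALSE, the right side is never evaluated, which protects us from checking a value that isn’t there.

```r
x <- NULL  # pretend a value was never supplied

# && evaluates left to right and stops at the first FALSE,
# so `x > 0` is never run when x is NULL
if (!is.null(x) && x > 0) {
  result <- "positive"
} else {
  result <- "missing or not positive"
}
result
## [1] "missing or not positive"

# with a single & both sides are evaluated, the condition has length zero,
# and if() stops with an error:
# if (!is.null(x) & x > 0) "positive"
```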
Another very specific situation mentioned in that reading is floating point numbers, which is how the computer stores numeric values with decimal places. Because of rounding in how these values are stored and computed, two numbers that should be equal will sometimes not be exactly equal, so you need to be wary of using == there. The example from the reading sums it up well:
thing <- sqrt(2)^2 # should be 2, right?
2==thing # huh
## [1] FALSE
identical(2,thing) # weird
## [1] FALSE
2 - thing # extremely small value
## [1] -4.440892e-16
# I used to check for "equality" this way...before I knew about dplyr::near()
# (abs() matters: without it, any negative difference passes the check)
abs(2 - thing) < 1e-8
## [1] TRUE
dplyr::near(2,thing)
## [1] TRUE
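If you don’t have dplyr loaded, base R can do the same kind of “close enough” comparison. This is a minimal sketch using the classic 0.1 + 0.2 example (my example, not from the reading); the tolerance of 1e-8 is an arbitrary choice for illustration.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # exact comparison
## [1] FALSE
isTRUE(all.equal(a, b))   # comparison within a small tolerance
## [1] TRUE
abs(a - b) < 1e-8         # the same idea written by hand
## [1] TRUE
```

Wrapping all.equal() in isTRUE() is the usual pattern because all.equal() returns a description of the difference, not FALSE, when the values differ.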
Still struggling with the difference between [[]] and [] and unclear on whether that distinction is actually important functionally.
It is very important functionally. Think back to the homework question where you got different data types depending on which one you used. Sometimes you want a list, sometimes you don’t. Usually you only want a list (i.e. list[1:2]) if you are asking for multiple elements of a list; otherwise you want to pull out what’s inside that “slot” and use list[[1]].
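A quick way to see the difference for yourself is to check the class of what comes back. This is a small sketch with a made-up list called example_list:

```r
example_list <- list(a = 1:3, b = "hello")

# single brackets keep the list wrapper: you get a smaller list
class(example_list[1])
## [1] "list"

# double brackets pull out what is inside the slot
class(example_list[[1]])
## [1] "integer"

# so only the double-bracket version can be used like a regular vector
sum(example_list[[1]])
## [1] 6
# sum(example_list[1])  # error: a list is not a valid argument for sum()
```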
Note that a lot of newer packages make dealing with complex lists less common than it used to be. The example I gave was the broom package tidy() function. In the past, we all learned how to pull out parts of regression output by accessing parts of the list using [[]] and $, just like I showed in class. Probably a lot of your biostats classes still do it this way because that is how your professor learned it. But, now we just need to use broom::tidy() to get a data frame of coefficients, confidence intervals, and p-values.
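To make the comparison concrete, here is a minimal sketch of both approaches. The model (mpg ~ wt on the built-in mtcars data) is just a stand-in I picked for illustration, not the model from class.

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)  # stand-in model for illustration

# the older way: dig into the summary object with $ and [ ]
slope_old <- summary(fit)$coefficients["wt", "Estimate"]

# the broom way: one data frame with estimates, CIs, and p-values
coef_table <- tidy(fit, conf.int = TRUE)
coef_table

# the slope is now just a value in a column
slope_new <- coef_table$estimate[coef_table$term == "wt"]
all.equal(slope_old, slope_new)
```

Because the result is a regular data frame, you can filter(), select(), or plot it with the same tools we have been using all term.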
If pluck and pull do the same thing, is there any advantage to using one over the other?
As I mentioned last class, pluck and pull are similar in that they “pull out” elements from lists, but they are used in different situations, so one doesn’t really have an “advantage” over the other. pluck is for lists and pull is for data frames (which are also lists, but you can’t use pull on a non-df list! You need to use pluck in that case).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
# try this on your own
# a list that is not a data frame
# WHY is it not a data frame?
mylist <- list("a"=1:3, "b" = 2)
# mylist %>% pull("a")
# Error in UseMethod("pull") :
# no applicable method for 'pull' applied to an object of class "list"
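Side note, see the difference here: as.data.frame() recycles the length-1 element b to fill all three rows, while elements that already have the same length simply line up as columns.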
as.data.frame(mylist)
## a b
## 1 1 2
## 2 2 2
## 3 3 2
mylist <- list("a"=1:3, "b"=2:4)
mylist
## $a
## [1] 1 2 3
##
## $b
## [1] 2 3 4
as.data.frame(mylist)
## a b
## 1 1 2
## 2 2 3
## 3 3 4
If we do have a data frame/tibble and want to “pull out” a column as a vector (not as a data frame), we are also pulling out an element from a list because a data frame is also a list!
Here is how we would use pull and pluck to do the “same thing” on a data frame:
# remember a tibble is a special kind of data frame, which is a special kind of list
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
class(penguins)
## [1] "tbl_df" "tbl" "data.frame"
typeof(penguins)
## [1] "list"
s = penguins %>% pull(species)
str(s)
## Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
# does not work because pluck() needs the list element name in quotes
# s2 = penguins %>% pluck(species)
# Error in list2(...) : object 'species' not found
s2 = penguins %>% pluck("species")
str(s2)
## Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
# are they the same?
identical(s, s2)
## [1] TRUE
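Where pluck really differs from pull is on nested lists that are not data frames. Here is a small sketch with a made-up list called study (the names and values are hypothetical); it mixes names and positions to reach a deeply nested element.

```r
library(purrr)

# made-up nested list, not a data frame
study <- list(
  site = "A",
  visits = list(
    list(day = 1, weight = 70.2),
    list(day = 8, weight = 69.8)
  )
)

# weight at the second visit: names and positions can be mixed
second_weight <- pluck(study, "visits", 2, "weight")
second_weight
## [1] 69.8

# same thing with base R
study[["visits"]][[2]][["weight"]]
## [1] 69.8

# asking for something that is not there gives the default instead of an error
third_weight <- pluck(study, "visits", 3, "weight", .default = NA)
third_weight
## [1] NA
```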
I am not in the habit of using pluck yet, because I am used to [[]] and use it when I need it. I do use pull all the time to get a vector, though, for example:
penguins %>%
  group_by(species) %>%
  summarize(m = mean(bill_length_mm, na.rm = TRUE)) %>%
  pull(m)
## [1] 38.79139 48.83382 47.50488
Or let’s say I want a vector of patient (penguin) ids of a subset:
mypenguins <- penguins %>%
  mutate(id = row_number(), .before = "species")
mypenguins %>%
  filter(bill_length_mm < 35)
## # A tibble: 9 × 9
## id species island bill_length_mm bill_dept…¹ flipp…² body_…³ sex year
## <int> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 2 15 Adelie Torgersen 34.6 21.1 198 4400 male 2007
## 3 19 Adelie Torgersen 34.4 18.4 184 3325 fema… 2007
## 4 55 Adelie Biscoe 34.5 18.1 187 2900 fema… 2008
## 5 71 Adelie Torgersen 33.5 19 190 3600 fema… 2008
## 6 81 Adelie Torgersen 34.6 17.2 189 3200 fema… 2008
## 7 93 Adelie Dream 34 17.1 185 3400 fema… 2008
## 8 99 Adelie Dream 33.1 16.1 178 2900 fema… 2008
## 9 143 Adelie Dream 32.1 15.5 188 3050 fema… 2009
## # … with abbreviated variable names ¹bill_depth_mm, ²flipper_length_mm,
## # ³body_mass_g
ids_short_bill <- mypenguins %>%
  filter(bill_length_mm < 35) %>%
  pull(id)
Now I have a vector of IDs that satisfy my bill length requirements.
ids_short_bill
## [1] 9 15 19 55 71 81 93 99 143
I just want to check my understanding is correct. The map() is for list and it can be used as itself, but the across() function is only for data frame or tibble and can be used inside the mutate() function. Is that correct? Then, can we use any function inside those map(), and mutate() ?
I really like this distinction and clarification! Yes to this part:
- map() can be used by itself, like list %>% map(.f = length), applied to a list or vector
- across() can only be used as a helper function inside mutate or summarize, applied to a data frame/tibble
Also:
- Inside across() we need to use very specific syntax, which is called tidyselect.
- Think of across() and select() as friends, because they use the same language to select columns.
But across() is used more like map() in that it takes a “what” argument (.cols = tidyselect columns for across, .x = a list or vector for map) and a “function” argument (.fns = for across because multiple functions can be supplied, .f = for map because only one function can be applied).
library(palmerpenguins)
penguins %>% select(where(is.numeric))
## # A tibble: 344 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <dbl> <dbl> <int> <int> <int>
## 1 39.1 18.7 181 3750 2007
## 2 39.5 17.4 186 3800 2007
## 3 40.3 18 195 3250 2007
## 4 NA NA NA NA 2007
## 5 36.7 19.3 193 3450 2007
## 6 39.3 20.6 190 3650 2007
## 7 38.9 17.8 181 3625 2007
## 8 39.2 19.6 195 4675 2007
## 9 34.1 18.1 193 3475 2007
## 10 42 20.2 190 4250 2007
## # … with 334 more rows
# penguins %>% across(where(is.numeric))
# Error in `across()`:
# ! Must only be used inside data-masking verbs like `mutate()`,
# `filter()`, and `group_by()`.
# mutate requires a function that returns a vector the same length as the original vector
penguins %>% mutate(across(.cols = where(is.numeric), .fns = as.character))
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <chr> <chr> <chr> <chr> <fct> <chr>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen <NA> <NA> <NA> <NA> <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
# this works but it shouldn't and is "deprecated" in dplyr 1.1.0
# summarize SHOULD return a vector of length 1
penguins %>% summarize(across(.cols = where(is.numeric), .fns = as.character))
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## # A tibble: 344 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <chr> <chr> <chr> <chr> <chr>
## 1 39.1 18.7 181 3750 2007
## 2 39.5 17.4 186 3800 2007
## 3 40.3 18 195 3250 2007
## 4 <NA> <NA> <NA> <NA> 2007
## 5 36.7 19.3 193 3450 2007
## 6 39.3 20.6 190 3650 2007
## 7 38.9 17.8 181 3625 2007
## 8 39.2 19.6 195 4675 2007
## 9 34.1 18.1 193 3475 2007
## 10 42 20.2 190 4250 2007
## # … with 334 more rows
penguins %>% summarize(across(.cols = where(is.numeric), .fns = length))
## # A tibble: 1 × 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <int> <int> <int> <int> <int>
## 1 344 344 344 344 344
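Since .fns can take more than one function, here is a minimal sketch of what that looks like. The tibble measurements and its columns are made up for illustration; the functions go in a named list, and the names end up in the new column names.

```r
library(dplyr)

# made-up data for illustration
measurements <- tibble(
  group  = c("a", "a", "b"),
  height = c(1, 3, 5),
  weight = c(10, 20, 60)
)

# a named list of functions: every numeric column gets every function
group_summary <- measurements %>%
  group_by(group) %>%
  summarize(across(
    .cols = where(is.numeric),
    .fns = list(mean = mean, max = max)
  ))
group_summary
```

The result should have one row per group and columns named height_mean, height_max, weight_mean, and weight_max, because across() pastes the column name and the function name together by default.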
mylist <- list("a"=1:3, "b" = 2, c = penguins)
# .x can be piped into map or used as an explicit argument
mylist %>% map(.f = length)
## $a
## [1] 3
##
## $b
## [1] 1
##
## $c
## [1] 8
map(.x = mylist, .f = length)
## $a
## [1] 3
##
## $b
## [1] 1
##
## $c
## [1] 8
# this also works because penguins is a data frame which means it is also a list (columns are elements)
penguins %>% map(.f = length)
## $species
## [1] 344
##
## $island
## [1] 344
##
## $bill_length_mm
## [1] 344
##
## $bill_depth_mm
## [1] 344
##
## $flipper_length_mm
## [1] 344
##
## $body_mass_g
## [1] 344
##
## $sex
## [1] 344
##
## $year
## [1] 344
map(.x = penguins, .f = length)
## $species
## [1] 344
##
## $island
## [1] 344
##
## $bill_length_mm
## [1] 344
##
## $bill_depth_mm
## [1] 344
##
## $flipper_length_mm
## [1] 344
##
## $body_mass_g
## [1] 344
##
## $sex
## [1] 344
##
## $year
## [1] 344
However, as we will see in class today, we can also use map() inside mutate() when we are using nested data frames, or when we need to “vectorize” a non-vectorized function. In this case, map() is being applied to a list of data that is inside a column of a data frame. It’s complicated, and we’ll see more today.
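As a preview, here is a minimal sketch of map() inside mutate() on a nested data frame. I am using the built-in mtcars data and a simple regression as stand-ins, so the column names (data, fit, slope) and the model are my choices for illustration, not necessarily what we will do in class.

```r
library(dplyr)
library(tidyr)
library(purrr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                       # one row per cyl, with a list-column called data
  mutate(
    # map() runs lm() once for each data frame stored in the data column
    fit = map(data, ~ lm(mpg ~ wt, data = .x)),
    # map_dbl() pulls one number out of each fitted model
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )
by_cyl
```

Each row of by_cyl holds a whole data frame and a whole model, which is why we need map() here: lm() does not know how to work down a column of data frames on its own.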
Clearest points
Every topic in the muddy list was also in the clear list, so at least all is not lost. I think more practice will help.