(Class 7) Part 6: Messy data example, more data wrangling and ggplot

Materials from class on Wednesday, February 22, 2023

R Project files

Last class we finished up part5 materials. This is class 7, and we will start with part6 now (sorry, we’re going to be off by one from now on). Please download the part6 sub-folder from this dropbox folder link.

This section is mainly a practice, with some additional ggplot lessons. There will be lots of time for challenges so that you can get practice working on these data wrangling and graphing problems together.

This year’s class video

See Slack for the zoom recording link (though zoom had some malfunction that failed to show the correct Rstudio screen, so last year’s video may be more helpful)

Last Year’s Class Video (Class 7, Part 6)

View last year’s class and materials here.

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

  • Clearest Point: What was the most clear part of the lecture?
  • Muddiest Point: What was the most unclear part of the lecture to you?
  • Anything Else: Is there something you’d like me to know?

https://bit.ly/bsta504_postclass_survey

Muddiest Parts

Still some pain points related to pivot_-ing. I get it, it took me a long time to get comfortable with this (and I’ve gone through multiple function and argument transformations from melt() to gather() to pivot_longer() etc, all those transitions were hard!). We might not see many more examples with this because we have so much more to cover, but keep practicing, and ask for help when you need it! Relatedly:

pivot_longer is so versatile for data manipulation but sometimes it’s contains many arguments

When ever I use pivot_longer and pivot_wider I get the columns that are being switched wrong.Is there a way you think about it that helps you sort out which is which? The Stata manual uses this ‘i’ and ‘j’ notation that’s helping when I’m working in Stata, but I haven’t found an easy way to work with those functions in R.

The argument names have changed a lot since I started pivoting with the tidyverse, but now they are using names_to= because the authors think this makes more sense, and I tend to agree. This argument name helps me figure out what to do more than old versions. I think of this argument as “column names turn into one column called = X” or send the “names” of these columns “to” this new column. Therefore, I need to specify which cols= are the column names going into names_to=

Then, when we go to pivot_wider() we get names_from= which is asking, where are the column names coming from? Also, values_from= which is asking, where are the actual values coming from? Pivot wider to me is easier because we don’t need to specify any other information than those two pieces (you can optionally specify id_cols= but the default is just to use all the other columns that you’re not pivoting). Pivot wider is tricky, however, in that you do need just one value for each combination of id columns.

If you’re still having trouble, you’re not alone, look at this article on how to create a code snippet that pops up to tell you exactly what to do, every time: (and there’s more about what “snippets” are in that article).

snippet plonger
	pivot_longer(${1:mydf},
	             cols = ${2:columns to pivot long},
	             names_to = "${3:desired name for category column}",
	             values_to = "${4:desired name for value column}"
	)

I struggled with the pivoting of tables on assignment #5, and still have some trouble wrapping my head around it– especially when we pivoted the long table back to wide but with different column names.

One difficult thing to grasp about pivoting is the idea of pivoting long, and then even longer (doubly long?), and then back again to wide but in a different way, with different information in the columns. This takes a lot of practice to get there without a lot of struggle. After you have done this many times you’ll be better able to see what you need out of the data frame and how to get it there. I wish we had more time to just pivot things in all sorts of ways, because it’s a powerful form of data manipulation!

Faceting in ggplot I need to play with setting plot scales to actual values with “free_x/free_y”

This is something I think you’ll need to “play around with” to just, try some things and see how it affects the plot. The homework faceting is similar to the example plot we did in class last week, but, there’s a lot more you can do with faceting and it’s a very powerful way to display data. The ggplot2 book’s Faceting chapter is a nice review of this.

I’ve also been meaning to mention that the ggplot2 package website has useful FAQs on lots of tricky subjects. Here’s the faceting one.

Reading the vignette for summarize and list. Can you explain how to read this vignette.

Sorry I couldn’t figure out which vignette you were talking about here, could you send me the link? Or maybe you are looking for a link… Usually if I mention a vignette in class I have the link the Rmd/html file. Otherwise, the best way to find package vignettes is by going to the package website. Most of the tidyverse packages have a website, and the vignettes will often be in the “Articles” drop down. For instance, here’s dplyr’s website and list of articles/vignettes, but I don’t see one on summarize/lists! You can also see vignettes in the CRAN package’s website usually, for instance here’s dplyr

Can we go over how to create a summary table with percentages using summary and across?

This depends on exactly what you want to do. I might use one of the fancier functions like gtsummary::tbl_summary() or table1::table1 for a true “summary table” of all my categorical variables, but we can see how this would work with summarize() “by hand”. We are going to see in part6 some examples with tabyl, as well.

Here’s an example just with summarize:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(palmerpenguins)

# First with tabyl, using adorn_ (which we will see in part6 today)
penguins %>%
  tabyl(species, sex) %>%
  adorn_percentages() %>%
  adorn_pct_formatting()
##    species female  male  NA_
##     Adelie  48.0% 48.0% 3.9%
##  Chinstrap  50.0% 50.0% 0.0%
##     Gentoo  46.8% 49.2% 4.0%
# try to get the same information with summarize:
penguins %>%
  group_by(species) %>%
  summarize(pct_male = sum(sex=="male", na.rm = TRUE)/length(sex),
            pct_female = sum(sex=="female", na.rm = TRUE)/length(sex),
            pct_NA = sum(is.na(sex), na.rm = TRUE)/length(sex)) %>%
  mutate(across(where(is.numeric), ~.x*100))
## # A tibble: 3 × 4
##   species   pct_male pct_female pct_NA
##   <fct>        <dbl>      <dbl>  <dbl>
## 1 Adelie        48.0       48.0   3.95
## 2 Chinstrap     50         50     0   
## 3 Gentoo        49.2       46.8   4.03
# mean also works:
penguins %>%
  group_by(species) %>%
  summarize(pct_male = mean(sex=="male", na.rm = TRUE),
            pct_female = mean(sex=="female", na.rm = TRUE),
            pct_NA = mean(is.na(sex), na.rm = TRUE)) %>%
  mutate(across(where(is.numeric), ~.x*100))
## # A tibble: 3 × 4
##   species   pct_male pct_female pct_NA
##   <fct>        <dbl>      <dbl>  <dbl>
## 1 Adelie        50         50     3.95
## 2 Chinstrap     50         50     0   
## 3 Gentoo        51.3       48.7   4.03

But it’s hard to generalize that specific summarize to other columns with across(), because nothing else has the category “male”, “female”. You’d need your data to be in quite a particular format, so I think this not something you’ll commonly do. You could calculate proportion missing, though, which is a similar idea:

penguins %>% group_by(species) %>%
  summarize(across(everything(), .fns= ~ mean(is.na(.x))))
## # A tibble: 3 × 8
##   species   island bill_length_mm bill_depth_mm flipper_l…¹ body_…²    sex  year
##   <fct>      <dbl>          <dbl>         <dbl>       <dbl>   <dbl>  <dbl> <dbl>
## 1 Adelie         0        0.00658       0.00658     0.00658 0.00658 0.0395     0
## 2 Chinstrap      0        0             0           0       0       0          0
## 3 Gentoo         0        0.00806       0.00806     0.00806 0.00806 0.0403     0
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Clearest Points

Lots of joining and binding, great!

Outlining the steps you took and what you are looking for in the data was really helpful.

I really liked the flow of explanation on data management yesterday, thank you!

That’s great to hear, I almost completely got rid of part6 because I wasn’t sure if it would be that helpful. Hopefully it provided some useful practice with messier data!

Other

We followed an example where we had a graph to strive for, but is there an example or just advice you can provide when we don’t have that? How do you know to leave the demographic information aside and only join the other data first?

I don’t think there’s good one size fits all advice here. For that particular question, I could have joined the demographic data with each piece of the other data, that is also an option. That also would have avoided the need for bind_rows(), so that’s a good point. I do tend to create separate data sets by “type” so for instance in that example I wanted all of the biomarker/outcome data together, just so I knew where it was. That’s my main reason for binding those data together first.

Working with messy data seems very scary!!!

It’s so useful to learn how to work with super messy data because unfortunately that’s often how people have their data set up!

Messy data is scary, I agree!

Why do you use here::here sometimes and other times it’s just here without the here:: preceding it?

Usually pckgname::functioname() is not necessary but used for clarification where that function is coming from. I sometimes do that in case I haven’t loaded the package first. Since I often use the here package at the beginning of my script when loading in data for an analysis, I’m not used to loading like other packages, but instead just call that function by using here::here(), so it’s somewhat due to habit!