Part 2: Loading Data, data.frames, and ggplot2

Materials from class on Wednesday, January 18, 2023

R Project files

Please download the part2 folder from this dropbox folder link. Be sure to unzip it if necessary. In advance of class, please open the part2 RStudio project (double click on the .Rproj file), open part2.Rmd, and knit this file (click the Knit button at the top of the file). This will install the packages that the Rmd needs to run.

Readings

Required and suggested class readings can be found on the Readings tab by class. These readings may be done anytime before or after class, but they will supplement your understanding of the class materials and help make homework and project work easier.

This Year’s Class Video

See Slack for the Zoom recording link.

Last Year’s Class Video

View last year’s class and materials here.

Post-Class

Please fill out the following survey and we will discuss the results during the next lecture. All responses will be anonymous.

  • Clearest Point: What was the most clear part of the lecture?
  • Muddiest Point: What was the most unclear part of the lecture to you?
  • Anything Else: Is there something you’d like me to know?

https://bit.ly/bsta504_postclass_survey

Muddiest Points

the benefits of tibble vs data frame and when to use which?

In this class we will always use tibbles. Just remember that an object can have more than one class. Every tibble is a data frame, but not every data frame is a tibble. A tibble is really a data frame with “perks”. See this explanation from the tibble 1.0 package release:

There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
class(mtcars)
## [1] "data.frame"
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
mtcars_tib <- as_tibble(mtcars)
class(mtcars_tib)
## [1] "tbl_df"     "tbl"        "data.frame"
mtcars_tib
## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
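
The hint at the bottom of the tibble printout mentions print(n = ...). Here is a minimal sketch of how that might be used, assuming the tibble package is installed. The row counts are arbitrary choices for illustration.

```r
library(tibble)

mtcars_tib <- as_tibble(mtcars)

# Show the first 15 rows instead of the default 10
print(mtcars_tib, n = 15)

# Show every row (fine here because mtcars only has 32 rows)
print(mtcars_tib, n = Inf)

# glimpse() turns the display sideways: one line per column, with its type
glimpse(mtcars_tib)
```

None of these calls change the data. They only change how much of it is displayed.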

Another interesting difference is that tibbles don’t have row names, but a lot of built-in data.frames in R do. Row names are awkward to work with because they are not a regular column. So, when you convert a data.frame to a tibble, you can tell the function to store the row names as a column:

mtcars_tib <- as_tibble(mtcars, rownames = "car_name")
mtcars_tib
## # A tibble: 32 × 12
##    car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
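
As an alternative sketch of the same idea, the tibble package also provides rownames_to_column() and has_rownames(). The object name mtcars_tib2 is made up for this example.

```r
library(tibble)

# The built-in data frame carries the car names as row names
has_rownames(mtcars)

# Move the row names into a regular column first, then convert to a tibble
mtcars_tib2 <- as_tibble(rownames_to_column(mtcars, var = "car_name"))

# The tibble has no row names; the car names are now ordinary data
has_rownames(mtcars_tib2)
names(mtcars_tib2)
```

Either approach should give you a car_name column that you can filter, sort, or plot like any other variable.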

Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!

If we ask for the first column using the [ ] notation, a data frame returns a vector (numeric here), while a tibble returns another tibble. Note that the first column of mtcars_tib is now car_name, because we stored the row names as a column above.

We have not covered [[ ]] yet because we have not talked about lists in R, but we will soon. As the code below shows, [[ ]] returns the first column as a vector for both a data frame and a tibble.

mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars_tib[,1]
## # A tibble: 32 × 1
##    car_name         
##    <chr>            
##  1 Mazda RX4        
##  2 Mazda RX4 Wag    
##  3 Datsun 710       
##  4 Hornet 4 Drive   
##  5 Hornet Sportabout
##  6 Valiant          
##  7 Duster 360       
##  8 Merc 240D        
##  9 Merc 230         
## 10 Merc 280         
## # … with 22 more rows
## # ℹ Use `print(n = ...)` to see more rows
mtcars[[1]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars_tib[[1]]
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
class(mtcars[,1])
## [1] "numeric"
class(mtcars_tib[,1])
## [1] "tbl_df"     "tbl"        "data.frame"
class(mtcars_tib[[1]])
## [1] "character"
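
To tie the subsetting rules together, here is a small sketch (assuming tibble and dplyr are installed) of how to ask for a data frame or a vector explicitly, so the result does not depend on which kind of object you started with. The object names are made up for this example.

```r
library(tibble)
library(dplyr)

mtcars_tib <- as_tibble(mtcars, rownames = "car_name")

# Data frame: drop = FALSE keeps a single column as a data frame
mpg_df <- mtcars[, "mpg", drop = FALSE]
class(mpg_df)

# Tibble: three equivalent ways to get one column as a plain vector
mpg_a <- mtcars_tib[["mpg"]]
mpg_b <- mtcars_tib$mpg
mpg_c <- pull(mtcars_tib, mpg)

identical(mpg_a, mpg_b)
identical(mpg_a, mpg_c)
```

Selecting columns by name rather than by position also avoids the surprise seen above, where column 1 changed from mpg to car_name.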

As I mentioned in class, some (older) functions don’t like tibbles, but all you need to do is convert the tibble back to a plain data.frame, as shown below.

A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:

mtcars_df <- as.data.frame(mtcars_tib)
class(mtcars_df)
## [1] "data.frame"

Back to muddy quotes:

path files and knowing if you’re in a project or just an RMD

R markdown vs R projects

I hope to spend more time talking about this in class 4.

ggplot stuff was the most muddy, but I also haven’t done a lot of ggplot stuff before

Yes, this was definitely expected for a brief intro. ggplot takes a while to get the hang of! We will use ggplot in every class from now on, so we will go through it in bite-sized pieces.

Using na=“NA” to pull in data and how to know that it’s needed.

I will show more examples of this. Rule number one of importing data in any software is to look at your data and check whether what you see in the software is what you expect. Always look at your data! The read_excel(filename, na = "NA") example is an unusual case: it is not very common for missing data to be coded as the literal text “NA”, but I wanted to show you how the result looks different when it does happen. Usually, missing data is just a blank cell, which is automatically read in as the special missing value NA in R.

# Without `na = "NA"`, the data would have been read in like this
df1 <- tibble(a = c("NA", "C", "D"), b = 1:3, c = c(1, 3, "NA"))
# With `na = "NA"`, the data would have been read in like this
df2 <- tibble(a = c(NA, "C", "D"), b = 1:3, c = c(1, 3, NA))

# Note the column types of the two tibbles and the way NA is printed
df1
## # A tibble: 3 × 3
##   a         b c    
##   <chr> <int> <chr>
## 1 NA        1 1    
## 2 C         2 3    
## 3 D         3 NA
df2
## # A tibble: 3 × 3
##   a         b     c
##   <chr> <int> <dbl>
## 1 <NA>      1     1
## 2 C         2     3
## 3 D         3    NA
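
If you only notice the problem after the data is already loaded, one possible repair is sketched below, assuming tibble and dplyr are installed. It rebuilds df1 from above and uses dplyr::na_if() to turn the text "NA" into a real missing value. The name df1_fixed is made up for this example.

```r
library(tibble)
library(dplyr)

# Same as df1 above: "NA" is ordinary text, and column c became character
df1 <- tibble(a = c("NA", "C", "D"), b = 1:3, c = c(1, 3, "NA"))

# is.na() does not see the text "NA" as missing
sum(is.na(df1$a))

# Replace the text "NA" with a real NA, then convert c back to numeric
df1_fixed <- df1 %>%
  mutate(
    a = na_if(a, "NA"),
    c = as.numeric(na_if(c, "NA"))
  )

sum(is.na(df1_fixed$a))
class(df1_fixed$c)
```

Fixing the import call itself with na = "NA" is usually cleaner, since the column types are then correct from the start.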

I saw a lot of code with the two colons (“::”) in the middle. It is unclear to me if this is an alternative way to write some commands or if there is a certain context in which it is used.

Good question. The :: operator pulls a function from a specific package, so the call works whether or not you have loaded the package (using library() or p_load()), as long as the package is installed. I mainly use it as a clue to you about where the function is coming from. Otherwise, you may not know that you need to load that package to use it! For instance:

# does not work if the janitor package has not been loaded
mtcars %>% tabyl(am, cyl)
# does work
mtcars %>% janitor::tabyl(am, cyl)
##  am 4 6  8
##   0 3 4 12
##   1 8 3  2
# also works
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
mtcars %>% tabyl(am, cyl)
##  am 4 6  8
##   0 3 4 12
##   1 8 3  2
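
The :: operator is also handy when two packages use the same function name, as in the tidyverse startup message above (dplyr::filter() masks stats::filter()). A minimal sketch, assuming dplyr is installed; the object names are made up for this example.

```r
# dplyr::filter() keeps the rows of a data frame that meet a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)  # 11 cars have 4 cylinders

# stats::filter() is a completely different function:
# it applies a linear filter (here a 3-point moving average) to a series
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Writing the package name out removes any doubt about which filter() is being called, no matter what order the packages were loaded in.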

Clearest Points

skim

loading our excel to R studio

Loading in the data and selecting the sheets that are most relevant to what we are looking to do was very clear and a nice foundation for future projects. I found that showing different ways of importing the data was helpful.

I’m glad! The import tool in RStudio is very nice. Just remember to save the code it generates in your Rmd.

functionality of ggplot

tidying the data

Found out what eval=TRUE and eval=FALSE mean!

Great, and I’ll show that again for anyone who was confused! (“still a little bit confused about the {r, EVAL} code”)

Other messages

Some people had trouble getting the fig.path= to work in the knitr options. I’m not sure what could be causing that but feel free to ask me during break.

Here’s a good reference for all the code chunk options, if you want to read about it.
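
As a rough sketch of how these options fit together, the code below sets defaults for every chunk from a setup chunk, assuming knitr is installed. The folder name "figs/" is a made-up example. One thing worth checking (an assumption about a possible cause, not a diagnosis of anyone’s file) is the trailing slash: fig.path is a prefix added to each figure’s file name, so without the slash the figures are not written inside the folder.

```r
library(knitr)

# Defaults applied to every chunk that follows in the Rmd
opts_chunk$set(
  echo = TRUE,        # show the code in the knitted document
  eval = TRUE,        # run the code; eval = FALSE shows it without running it
  fig.path = "figs/"  # hypothetical folder; note the trailing slash
)

# Check what is currently set
opts_chunk$get("eval")
opts_chunk$get("fig.path")

# A single chunk can override a default in its header, for example:
#   {r, eval=FALSE}
```

Options set in an individual chunk header take priority over the defaults set with opts_chunk$set().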

link to the course website that is in the overview tab in SAKAI links to last years materials.

Oops, thank you, great catch. Fixed!

Speed is going great. I’m just worried as we progress through the course, it’ll be more difficult. Overall, really enjoying this class.

I understand the concern. Some things will get more difficult (I’m thinking of across() in class 4, writing functions, and purrr), but we will also circle back to some things that might be familiar or less complicated to start with (stats models, making tables). Definitely keep asking questions and I will slow down as needed!