Submission Instructions

Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing

For this assignment, please submit both the .Rmd and the .html files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.

dplyr::slice_sample

In this document, I will introduce the slice_sample() function and show what it’s for.

# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#example dataset
library(palmerpenguins)
data(penguins)

What is it for?

This function selects a random sample of rows from your dataset. Like filter(), it returns a subset of the rows and keeps every column, but the rows are chosen at random rather than by a condition.

Without any arguments, it will select a single random row:

slice_sample(penguins)
## # A tibble: 1 × 8
##   species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
##   <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
## 1 Adelie  Dream            36.8          18.5            193    3500 fema…  2009
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
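
Because the row is drawn with R's random number generator, rerunning the call will usually return a different penguin. As a minimal sketch, assuming you want a draw that can be repeated, you can call set.seed() first (the seed value 123 is arbitrary):

```r
library(dplyr)
library(palmerpenguins)

# Fix the random number generator state so the draw can be repeated.
set.seed(123)
first_draw <- slice_sample(penguins)

# Resetting the same seed reproduces the same row.
set.seed(123)
second_draw <- slice_sample(penguins)

identical(first_draw, second_draw)
```

Without the calls to set.seed(), the two draws would most likely differ.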

You can also specify how many rows of data you want selected:

slice_sample(penguins, n=30)
## # A tibble: 30 × 8
##    species   island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year
##    <fct>     <fct>              <dbl>         <dbl>    <int>   <int> <fct> <int>
##  1 Gentoo    Biscoe              46.5          14.5      213    4400 fema…  2007
##  2 Chinstrap Dream               58            17.8      181    3700 fema…  2007
##  3 Adelie    Torgersen           35.2          15.9      186    3050 fema…  2009
##  4 Chinstrap Dream               50.5          18.4      200    3400 fema…  2008
##  5 Adelie    Torgersen           34.6          21.1      198    4400 male   2007
##  6 Chinstrap Dream               49.2          18.2      195    4400 male   2007
##  7 Chinstrap Dream               46.4          18.6      190    3450 fema…  2007
##  8 Adelie    Dream               36            17.8      195    3450 fema…  2009
##  9 Adelie    Biscoe              36.4          17.1      184    2850 fema…  2008
## 10 Adelie    Torgersen           41.5          18.3      195    4300 male   2009
## # … with 20 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g

You can also specify the proportion of data you want kept:

slice_sample(penguins, prop = 0.5)
## # A tibble: 172 × 8
##    species   island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##    <fct>     <fct>           <dbl>         <dbl>       <int>   <int> <fct> <int>
##  1 Adelie    Biscoe           35.7          16.9         185    3150 fema…  2008
##  2 Adelie    Dream            39.8          19.1         184    4650 male   2007
##  3 Gentoo    Biscoe           47.5          14.2         209    4600 fema…  2008
##  4 Gentoo    Biscoe           47.7          15           216    4750 fema…  2008
##  5 Gentoo    Biscoe           45.4          14.6         211    4800 fema…  2007
##  6 Chinstrap Dream            52.2          18.8         197    3450 male   2009
##  7 Gentoo    Biscoe           43.6          13.9         217    4900 fema…  2008
##  8 Adelie    Dream            42.3          21.2         191    4150 male   2007
##  9 Gentoo    Biscoe           49.4          15.8         216    4925 male   2009
## 10 Chinstrap Dream            49            19.6         212    4300 male   2009
## # … with 162 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g
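
The 172 rows above are half of the 344 penguins in the dataset. On a grouped data frame, slice_sample() draws the sample within each group rather than from the whole table. The sketch below checks both points; the names half and per_species are only illustrative:

```r
library(dplyr)
library(palmerpenguins)

# prop = 0.5 keeps half of the 344 rows.
half <- slice_sample(penguins, prop = 0.5)
nrow(half)

# On a grouped data frame the sample is drawn within each group,
# so this keeps two penguins per species.
per_species <- penguins %>%
  group_by(species) %>%
  slice_sample(n = 2)
count(per_species, species)
```

Sampling within groups is one way to make sure a small sample still contains every species.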
# slice_sample() lets you randomly select rows with or without replacement (without replacement is the default)
penguins %>% slice_sample(n = 5)
## # A tibble: 5 × 8
##   species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
##   <fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
## 1 Chinstrap Dream            45.7          17            195    3650 fema…  2009
## 2 Adelie    Dream            37            16.9          185    3000 fema…  2007
## 3 Gentoo    Biscoe           46.4          15            216    4700 fema…  2008
## 4 Adelie    Biscoe           45.6          20.3          191    4600 male   2009
## 5 Gentoo    Biscoe           48.7          14.1          210    4450 fema…  2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
penguins %>% slice_sample(n = 5, replace = TRUE)
## # A tibble: 5 × 8
##   species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
##   <fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
## 1 Adelie    Biscoe           37.8          18.3          174    3400 fema…  2007
## 2 Gentoo    Biscoe           45.4          14.6          211    4800 fema…  2007
## 3 Gentoo    Biscoe           45.5          15            220    5000 male   2008
## 4 Chinstrap Dream            51.7          20.3          194    3775 male   2007
## 5 Adelie    Biscoe           41.4          18.6          191    3700 male   2008
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
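# The difference between the two modes is hard to see with 5 rows out of 344.
# A minimal sketch on a hypothetical three-row tibble (tiny is not part of the
# original example) makes it visible:

```r
library(dplyr)

# A hypothetical three-row tibble, small enough to see the effect.
tiny <- tibble(id = 1:3)

# Without replacement a row can be drawn at most once,
# so asking for 10 rows returns only the 3 that exist.
no_repl <- slice_sample(tiny, n = 10)
nrow(no_repl)

# With replacement the same row can be drawn repeatedly,
# so all 10 requested rows come back, with duplicates.
with_repl <- slice_sample(tiny, n = 10, replace = TRUE)
nrow(with_repl)
```
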
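# you can optionally weight the sampling by a variable, so that rows with larger weights are more likely to get selected
# weight_by is documented as taking non-negative numeric weights, so a factor such as species only works if it is converted to numbers (its handling may differ between dplyr versions)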
penguins %>% slice_sample(weight_by = species, n = 5)
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Gentoo  Biscoe              48.7          15.7         208    5350 male   2008
## 2 Adelie  Torgersen           34.4          18.4         184    3325 fema…  2007
## 3 Gentoo  Biscoe              43.5          14.2         220    4700 fema…  2008
## 4 Gentoo  Biscoe              48.7          15.1         222    5350 male   2007
## 5 Adelie  Torgersen           42.9          17.6         196    4700 male   2008
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
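
To see what the weights do, it helps to use explicit numeric weights. This is a small sketch on hypothetical tibbles (weighted, relative, and the column w are made up for illustration), not the penguins example above:

```r
library(dplyr)

# A hypothetical tibble with an explicit numeric weight column.
weighted <- tibble(
  id = c("a", "b", "c"),
  w  = c(0, 0, 1)
)

# Rows with weight 0 can never be drawn, so this always returns row "c".
only_c <- slice_sample(weighted, n = 1, weight_by = w)
only_c

# Weights are relative: with replacement, "c" should appear roughly
# three times as often as "a" in a large sample.
relative <- tibble(id = c("a", "c"), w = c(1, 3))
draws <- slice_sample(relative, n = 1000, weight_by = w, replace = TRUE)
count(draws, id)
```

The exact counts in the last step change from run to run, but "c" should clearly dominate.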

Is it helpful?

I think this function can definitely be helpful! It would be a great way to get a random sample from a large dataset so that you have something more manageable to work with.