Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing
For this assignment, please submit both the .Rmd and the .html files. I will add them to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function that was presented last year, please develop your own examples and content.
dplyr::slice_sample
In this document, I will introduce the slice_sample() function and show what it’s for.
# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#example dataset
library(palmerpenguins)
data(penguins)
This function selects a random sample of rows from your dataset. Like filter(), it returns a subset of the rows, but the rows are chosen at random rather than by a condition.
Without any arguments, it will select a single random row:
slice_sample(penguins)
## # A tibble: 1 × 8
## species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Dream 36.8 18.5 193 3500 fema… 2009
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
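Because the rows are drawn at random, rerunning the call will usually return different rows. A minimal sketch, assuming you want a result that can be reproduced, is to call base R's set.seed() right before sampling. The seed value 123 and the object names first_draw and second_draw are arbitrary choices for illustration.

```r
library(dplyr)
library(palmerpenguins)

# The seed value 123 is arbitrary; any fixed integer works.
set.seed(123)
first_draw <- slice_sample(penguins, n = 3)

# Resetting the same seed before sampling again reproduces the same rows.
set.seed(123)
second_draw <- slice_sample(penguins, n = 3)

# Check whether the two draws match.
identical(first_draw, second_draw)
```

If the seed is not reset before the second call, the two draws would generally differ.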
You can also specify how many rows of data you want selected:
slice_sample(penguins, n=30)
## # A tibble: 30 × 8
## species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 46.5 14.5 213 4400 fema… 2007
## 2 Chinstrap Dream 58 17.8 181 3700 fema… 2007
## 3 Adelie Torgersen 35.2 15.9 186 3050 fema… 2009
## 4 Chinstrap Dream 50.5 18.4 200 3400 fema… 2008
## 5 Adelie Torgersen 34.6 21.1 198 4400 male 2007
## 6 Chinstrap Dream 49.2 18.2 195 4400 male 2007
## 7 Chinstrap Dream 46.4 18.6 190 3450 fema… 2007
## 8 Adelie Dream 36 17.8 195 3450 fema… 2009
## 9 Adelie Biscoe 36.4 17.1 184 2850 fema… 2008
## 10 Adelie Torgersen 41.5 18.3 195 4300 male 2009
## # … with 20 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
You can also specify the proportion of data you want kept:
slice_sample(penguins, prop = 0.5)
## # A tibble: 172 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Biscoe 35.7 16.9 185 3150 fema… 2008
## 2 Adelie Dream 39.8 19.1 184 4650 male 2007
## 3 Gentoo Biscoe 47.5 14.2 209 4600 fema… 2008
## 4 Gentoo Biscoe 47.7 15 216 4750 fema… 2008
## 5 Gentoo Biscoe 45.4 14.6 211 4800 fema… 2007
## 6 Chinstrap Dream 52.2 18.8 197 3450 male 2009
## 7 Gentoo Biscoe 43.6 13.9 217 4900 fema… 2008
## 8 Adelie Dream 42.3 21.2 191 4150 male 2007
## 9 Gentoo Biscoe 49.4 15.8 216 4925 male 2009
## 10 Chinstrap Dream 49 19.6 212 4300 male 2009
## # … with 162 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
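The 172 rows above are half of the 344 rows in penguins. When the proportion does not divide the row count evenly, the sample size appears to be rounded down. The sketch below checks this for prop = 0.1; the names n_total and tenth are only for illustration, and the rounding behaviour is an assumption you can verify on your own installation.

```r
library(dplyr)
library(palmerpenguins)

n_total <- nrow(penguins)                    # total rows in the dataset
tenth <- slice_sample(penguins, prop = 0.1)  # ask for 10% of the rows

# Compare the sample size with the proportion rounded down.
nrow(tenth)
floor(0.1 * n_total)
```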
# slice_sample() lets you sample rows with or without replacement (the default, replace = FALSE, is without replacement)
penguins %>% slice_sample(n = 5)
## # A tibble: 5 × 8
## species island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Chinstrap Dream 45.7 17 195 3650 fema… 2009
## 2 Adelie Dream 37 16.9 185 3000 fema… 2007
## 3 Gentoo Biscoe 46.4 15 216 4700 fema… 2008
## 4 Adelie Biscoe 45.6 20.3 191 4600 male 2009
## 5 Gentoo Biscoe 48.7 14.1 210 4450 fema… 2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
penguins %>% slice_sample(n = 5, replace = TRUE)
## # A tibble: 5 × 8
## species island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Biscoe 37.8 18.3 174 3400 fema… 2007
## 2 Gentoo Biscoe 45.4 14.6 211 4800 fema… 2007
## 3 Gentoo Biscoe 45.5 15 220 5000 male 2008
## 4 Chinstrap Dream 51.7 20.3 194 3775 male 2007
## 5 Adelie Biscoe 41.4 18.6 191 3700 male 2008
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
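With only 5 rows drawn from 344, a repeated row is unlikely, so the two outputs above look alike. One way to see replacement at work is a bootstrap-style resample of the whole dataset. This is a sketch only: the row_id column and the object name boot are hypothetical helpers added for illustration, and the seed is arbitrary.

```r
library(dplyr)
library(palmerpenguins)

set.seed(42)  # arbitrary seed, only so the result can be reproduced

# Add a row id so repeated draws of the same penguin are easy to spot.
boot <- penguins %>%
  mutate(row_id = row_number()) %>%
  slice_sample(prop = 1, replace = TRUE)

nrow(boot)               # same number of rows as the original data
n_distinct(boot$row_id)  # fewer distinct ids, because some rows repeat
```

Without replace = TRUE, the same call would simply return every row once in shuffled order.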
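# you can optionally weight the sampling with weight_by, which is documented to take non-negative numeric weights (one per row) - this code passes the species factor, so certain species are more likely to get selected, but a numeric column is the safer choice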
penguins %>% slice_sample(weight_by = species, n = 5)
## # A tibble: 5 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 48.7 15.7 208 5350 male 2008
## 2 Adelie Torgersen 34.4 18.4 184 3325 fema… 2007
## 3 Gentoo Biscoe 43.5 14.2 220 4700 fema… 2008
## 4 Gentoo Biscoe 48.7 15.1 222 5350 male 2007
## 5 Adelie Torgersen 42.9 17.6 196 4700 male 2008
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
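Since weight_by is documented as a vector of non-negative numbers, a more explicit version builds a numeric weight column first. The sketch below is one possible way to do that: the sampling_weight column and the choice to give Gentoo rows three times the weight of the others are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

set.seed(1)  # arbitrary seed, only so the result can be reproduced

weighted <- penguins %>%
  # Hypothetical weights: Gentoo rows get weight 3, all other rows weight 1.
  mutate(sampling_weight = if_else(species == "Gentoo", 3, 1)) %>%
  slice_sample(n = 50, weight_by = sampling_weight)

# Count how many rows of each species ended up in the sample.
count(weighted, species)
```

A weight column must not contain missing values, so a column such as body_mass_g would need its NA rows handled before being used this way.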
I think this function can definitely be helpful! It would be a great way to get a random sample from a large dataset so that you have something more manageable to work with.