Please sign up for a function here: https://docs.google.com/spreadsheets/d/1-RWAQTlLwttjFuZVAtSs8OiHIwu6AZLUdWugIHHTWVo/edit?usp=sharing
For this assignment, please submit both the .Rmd and the .html files. I will add it to the website. Remove your name from the Rmd if you do not wish it shared. If you select a function which was presented last year, please develop your own examples and content.
dplyr::distinct()
In this document, I will introduce the distinct() function and show what it’s for.
# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
#example dataset
library(palmerpenguins)
data(penguins)
Discuss what the function does. Learn from the examples, but show how to use it using another dataset such as penguins. If you can provide two examples, even better!
Arguments
distinct() comes from the dplyr package. It selects the distinct (unique) rows of a data frame or tibble, which also makes it a way to drop duplicate rows. Its syntax is:
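distinct(.data, ..., .keep_all = FALSE)
.data: a data frame or tibble.
...: optional variables to use when determining uniqueness. If omitted, all columns are used.
.keep_all: if TRUE, keep all variables/columns in the input data frame. If FALSE (the default), keep only the variables named in ...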
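To make the effect of .keep_all concrete, here is a minimal sketch on a small made-up tibble. The object names toy, pairs_only and first_rows and the column mass_g are hypothetical and are not part of the penguins data.

```r
library(dplyr)

# Hypothetical toy data: the first two rows share the same species/island pair
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  island  = c("Dream", "Dream", "Biscoe"),
  mass_g  = c(3250, 3900, 4500)
)

# Default (.keep_all = FALSE): only the columns named in ... are returned
pairs_only <- toy %>% distinct(species, island)

# .keep_all = TRUE: every column is kept, using the first row of each combination
first_rows <- toy %>% distinct(species, island, .keep_all = TRUE)

pairs_only
first_rows
```

With .keep_all = TRUE, the values in the other columns come from the first row found for each combination, so the result depends on row order.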
Examples with penguins data
# distinct() on all columns in the data
penguins %>% distinct() %>% head()
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
# distinct() on selected columns in the data
# only factor columns
penguins %>% select(where(is.factor)) %>% distinct()
## # A tibble: 13 × 3
## species island sex
## <fct> <fct> <fct>
## 1 Adelie Torgersen male
## 2 Adelie Torgersen female
## 3 Adelie Torgersen <NA>
## 4 Adelie Biscoe female
## 5 Adelie Biscoe male
## 6 Adelie Dream female
## 7 Adelie Dream male
## 8 Adelie Dream <NA>
## 9 Gentoo Biscoe female
## 10 Gentoo Biscoe male
## 11 Gentoo Biscoe <NA>
## 12 Chinstrap Dream female
## 13 Chinstrap Dream male
penguins %>% distinct(species, island, sex, .keep_all = TRUE)
## # A tibble: 13 × 8
## species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen NA NA NA NA <NA> 2007
## 4 Adelie Biscoe 37.8 18.3 174 3400 fema… 2007
## 5 Adelie Biscoe 37.7 18.7 180 3600 male 2007
## 6 Adelie Dream 39.5 16.7 178 3250 fema… 2007
## 7 Adelie Dream 37.2 18.1 178 3900 male 2007
## 8 Adelie Dream 37.5 18.9 179 2975 <NA> 2007
## 9 Gentoo Biscoe 46.1 13.2 211 4500 fema… 2007
## 10 Gentoo Biscoe 50 16.3 230 5700 male 2007
## 11 Gentoo Biscoe 44.5 14.3 216 4100 <NA> 2007
## 12 Chinstrap Dream 46.5 17.9 192 3500 fema… 2007
## 13 Chinstrap Dream 50 19.5 196 3900 male 2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
Similar functions
# dplyr::n_distinct()
penguins %>% summarise(
  across(where(is.factor), n_distinct)
)
## # A tibble: 1 × 3
## species island sex
## <int> <int> <int>
## 1 3 3 3
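The count of 3 for sex above includes NA as one of the values. As a small sketch of how to exclude it, n_distinct() has an na.rm argument. The vector sex_values below is a made-up example, not the penguins column.

```r
library(dplyr)

# Hypothetical toy vector with one missing value
sex_values <- c("male", "female", NA, "female")

# By default NA is counted as its own value
with_na <- n_distinct(sex_values)

# na.rm = TRUE drops missing values before counting
without_na <- n_distinct(sex_values, na.rm = TRUE)

c(with_na = with_na, without_na = without_na)
```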
# janitor::tabyl()
penguins %>% janitor::tabyl(species)
## species n percent
## Adelie 152 0.4418605
## Chinstrap 68 0.1976744
## Gentoo 124 0.3604651
# unique() `Base R`
unique(penguins$species)
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
# duplicated() `Base R`
duplicated(penguins$species)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [97] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [109] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [145] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [157] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [169] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [193] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [205] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [217] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [229] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [253] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [277] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [289] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [313] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [325] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [337] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
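The logical vector from duplicated() can be used to get the same rows that distinct() returns. The sketch below compares the two approaches on a small made-up tibble (toy, base_result and dplyr_result are hypothetical names).

```r
library(dplyr)

# Hypothetical toy data with repeated species
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  year    = c(2007, 2008, 2009, 2007, 2008)
)

# Base R: keep the rows where species has not been seen before
base_result <- toy[!duplicated(toy$species), ]

# dplyr: first row per species, keeping all columns
dplyr_result <- toy %>% distinct(species, .keep_all = TRUE)

# Both approaches keep the first occurrence of each species, in original order
identical(base_result$species, dplyr_result$species)
identical(base_result$year, dplyr_result$year)
```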
knitr::include_graphics("remove-duplicate-data-r.png")
Image Reference: https://www.datanovia.com/en/lessons/identify-and-remove-duplicate-data-in-r/
Discuss whether you think this function is useful for you and your work. Is it the best thing since sliced bread, or is it not really relevant to your work?
This function gives a quick view of the values present in the data before or during wrangling, especially for categorical variables. NAs and typos stand out in the result, because distinct() treats NA, like any misspelled value, as its own distinct value.
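As an illustration of this kind of check, here is a minimal sketch on made-up data. The tibble survey and its messy island values are hypothetical and were chosen to include a case difference, a trailing space and a missing value.

```r
library(dplyr)

# Hypothetical messy categorical column
survey <- tibble(
  island = c("Dream", "dream", "Biscoe", NA, "Dream", "Biscoe ")
)

# Each spelling variant and the NA show up as a separate row
island_values <- survey %>% distinct(island)
island_values

# count() additionally shows how often each value occurs
survey %>% count(island)
```

Here "Dream" and "dream", and "Biscoe" and "Biscoe " (with a trailing space), appear as separate values, which is the signal that the column needs cleaning.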