`dplyr::distinct()`

In this document, I will introduce the distinct() function and show what it’s for.

# load the tidyverse
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

#example dataset
library(palmerpenguins)
data(penguins)

What is it for?

distinct() is part of the dplyr package. It keeps only the unique rows of a data frame or tibble and drops any duplicates. When several rows share the same combination of values, the first one is kept.

Arguments

The syntax is as follows.

distinct(.data, ..., .keep_all = FALSE)

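.data : a data frame or tibble.
... : optional variables used to determine uniqueness. If none are given, all columns are used.
.keep_all : if TRUE, keep all columns of .data, not only those named in `...`. When several rows share a combination of values, only the first row is kept.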
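To make the arguments concrete, here is a minimal sketch on a made-up three-row tibble. The object `toy` and its columns `id` and `value` are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data: two rows share the same id
toy <- tibble::tibble(
  id    = c(1, 1, 2),
  value = c("a", "b", "c")
)

# No columns named: rows are compared on every column,
# so all three rows count as distinct
all_cols <- distinct(toy)

# Only id is compared, and only id is returned
id_only <- distinct(toy, id)

# .keep_all = TRUE also returns the other columns,
# taking them from the first row seen for each id
id_keep <- distinct(toy, id, .keep_all = TRUE)
```

Note that `id_keep` contains the `value` from the first row of each `id`, so the row with `value == "b"` is dropped.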

Examples with penguins data

# distinct() on all columns in the data

penguins %>% distinct() %>% head()

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g

# distinct() on selected columns in the data 
# only factor columns

penguins %>% select(where(is.factor)) %>% distinct()

## # A tibble: 13 × 3
##    species   island    sex   
##    <fct>     <fct>     <fct> 
##  1 Adelie    Torgersen male  
##  2 Adelie    Torgersen female
##  3 Adelie    Torgersen <NA>  
##  4 Adelie    Biscoe    female
##  5 Adelie    Biscoe    male  
##  6 Adelie    Dream     female
##  7 Adelie    Dream     male  
##  8 Adelie    Dream     <NA>  
##  9 Gentoo    Biscoe    female
## 10 Gentoo    Biscoe    male  
## 11 Gentoo    Biscoe    <NA>  
## 12 Chinstrap Dream     female
## 13 Chinstrap Dream     male

# distinct() on named columns, keeping every column of the first matching row
penguins %>% distinct(species, island, sex, .keep_all = TRUE)

## # A tibble: 13 × 8
##    species   island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year
##    <fct>     <fct>              <dbl>         <dbl>    <int>   <int> <fct> <int>
##  1 Adelie    Torgersen           39.1          18.7      181    3750 male   2007
##  2 Adelie    Torgersen           39.5          17.4      186    3800 fema…  2007
##  3 Adelie    Torgersen           NA            NA         NA      NA <NA>   2007
##  4 Adelie    Biscoe              37.8          18.3      174    3400 fema…  2007
##  5 Adelie    Biscoe              37.7          18.7      180    3600 male   2007
##  6 Adelie    Dream               39.5          16.7      178    3250 fema…  2007
##  7 Adelie    Dream               37.2          18.1      178    3900 male   2007
##  8 Adelie    Dream               37.5          18.9      179    2975 <NA>   2007
##  9 Gentoo    Biscoe              46.1          13.2      211    4500 fema…  2007
## 10 Gentoo    Biscoe              50            16.3      230    5700 male   2007
## 11 Gentoo    Biscoe              44.5          14.3      216    4100 <NA>   2007
## 12 Chinstrap Dream               46.5          17.9      192    3500 fema…  2007
## 13 Chinstrap Dream               50            19.5      196    3900 male   2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g

Similar functions

# dplyr::n_distinct() counts unique values; NA counts as a value, so sex gives 3
penguins %>% summarise(
  across(where(is.factor), n_distinct)
)

## # A tibble: 1 × 3
##   species island   sex
##     <int>  <int> <int>
## 1       3      3     3

# janitor::tabyl()
penguins %>% janitor::tabyl(species)

##    species   n   percent
##     Adelie 152 0.4418605
##  Chinstrap  68 0.1976744
##     Gentoo 124 0.3604651

# unique()  `Base R`
unique(penguins$species)

## [1] Adelie    Gentoo    Chinstrap
## Levels: Adelie Chinstrap Gentoo

# duplicated()   `Base R`
duplicated(penguins$species)

##   [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [145]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [157]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [169]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [205]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [217]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [241]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [253]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [277] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [289]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [301]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [313]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [325]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [337]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
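The logical vector from duplicated() can be used to reproduce what distinct() does with `.keep_all = TRUE`. The sketch below assumes the palmerpenguins package is installed; the names `first_base` and `first_dplyr` are made up for this comparison.

```r
library(dplyr)
library(palmerpenguins)

# Base R: keep the rows whose species value has not appeared before
first_base <- penguins[!duplicated(penguins$species), ]

# dplyr: the same idea, first row for each species with every column kept
first_dplyr <- penguins %>% distinct(species, .keep_all = TRUE)
```

Both objects should hold the first row for each of the three species, matching the three FALSE entries in the duplicated() output above.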

knitr::include_graphics("remove-duplicate-data-r.png")

Image Reference: https://www.datanovia.com/en/lessons/identify-and-remove-duplicate-data-in-r/

Is it helpful?


distinct() gives a quick look at the values present in the data before or during wrangling, especially for categorical columns. Because it treats NA as a value of its own and every misspelling as a separate value, missing values and typos show up as extra rows in the result.
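As a small illustration of that use, the sketch below runs distinct() on a made-up column containing an inconsistent spelling and a missing value. The tibble `survey` and its `site` column are hypothetical.

```r
library(dplyr)

# Hypothetical data with a lower-case typo ("dream") and a missing value
survey <- tibble::tibble(
  site = c("Dream", "dream", "Biscoe", NA, "Dream")
)

# Each spelling and the NA appear as their own row
site_values <- distinct(survey, site)
```

Here `site_values` has four rows instead of the two site names one would expect, which flags both the typo and the NA for cleaning.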

Function of the Week:

Haemin Lee

2023-02-13
