uncount()

#load tidyverse up
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(broom)
#example dataset
library(palmerpenguins)
library(ggplot2)
library(knitr)
library(gt)
data(penguins)

In this document, I will introduce the uncount() function and show what it can be used for.

The uncount() function can be found within the tidyr library (not dplyr), which I’ve loaded below. It is also attached as part of the tidyverse.

library(tidyr)

What is it for?

If you type ?uncount into the console, the help page tells you that uncount() “performs the opposite operation to dplyr::count()”. So the documentation suggests that in order to understand uncount(), we must first understand count(). I don’t think that is strictly necessary, but I have included a short section on count() in case it makes uncount() clearer.

count() comes from the dplyr library, while uncount() comes from tidyr. count() lets you easily count the number of rows for each unique value of one or more variables. It produces the same output as group_by() combined with summarise(), as you can see below for the species found within the penguins dataset.

penguins %>% 
  group_by(species) %>% 
  summarise(n = n())
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
penguins %>% 
  count(species) 
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

So as we can see, count() simply counts the number of rows for each unique value in a specified column. uncount() goes the other way: instead of collapsing rows into counts, it builds a data frame by repeating rows according to the weights it is given. The weights determine how many copies of each row appear in the new dataset, and they can be a single number applied to every row or a separate value for each row. This will make more sense as we consider the two examples below.
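To make the "opposite of count()" idea concrete, here is a minimal sketch of a round trip. The counts table is made up for illustration (it is not taken from the penguins data), and the sketch assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical count table (not from the penguins data): one row per species
counts <- tibble(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  n = c(3, 1, 2)
)

# uncount() repeats each row n times and drops the n column
expanded <- uncount(counts, n)
expanded

# count() collapses the repeated rows back into one row per species
recounted <- count(expanded, species)
recounted
```

Here expanded has six rows (3 + 1 + 2) and only the species column, and recounted recovers the original counts.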

Example 1

So now we’re going to actually use the uncount() function with the penguins dataset.

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
##    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
##  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
##  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
##  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
##  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
##  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
##  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
##  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
##  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
## 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
## # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g

We can see the above penguins dataset contains 344 rows. With uncount(), we can quickly replicate every row of the dataset n times. Below, I set n to 2, so each row is duplicated and the copy is placed directly after the original. A similar function that we’ve covered is rbind(); however, if you used rbind() to bind a duplicate of the dataset, the copies would be attached to the bottom of the dataset instead.

penguins2 <- uncount(penguins, 2)
penguins2
## # A tibble: 688 × 8
##    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
##    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
##  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
##  2 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
##  3 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
##  4 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
##  5 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
##  6 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
##  7 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
##  8 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
##  9 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
## 10 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
## # … with 678 more rows, and abbreviated variable names ¹​flipper_length_mm,
## #   ²​body_mass_g
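The difference in row order between uncount() and rbind() is easier to see on a tiny table. The sketch below uses a made-up three-row tibble called small (an assumption for illustration, not part of the original analysis).

```r
library(tidyr)

# Hypothetical three-row table used only to show the row order
small <- tibble(id = c("a", "b", "c"))

# uncount() places each copy directly after its original row
interleaved <- uncount(small, 2)
interleaved$id   # "a" "a" "b" "b" "c" "c"

# rbind() attaches the whole duplicate to the bottom
stacked <- rbind(small, small)
stacked$id       # "a" "b" "c" "a" "b" "c"
```

Both results contain the same six rows; only the ordering differs.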

Example 2

Another way to use uncount() is to use it to build your own dataset quickly from scratch.

data_color <- tibble(color = c("red","orange","yellow","green","blue","purple"))
data_color 
## # A tibble: 6 × 1
##   color 
##   <chr> 
## 1 red   
## 2 orange
## 3 yellow
## 4 green 
## 5 blue  
## 6 purple
ggplot(data = data_color,
  aes(x = color))+
  geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
  labs(x = "Color", y = "Count")

Above I have made a very simple dataset containing the colors of the rainbow, with one row for each color. Below, I have taken the same dataset and manipulated it using uncount() with an n of 3 to triplicate each row.

data2 <- uncount(data_color, 3)
data2 
## # A tibble: 18 × 1
##    color 
##    <chr> 
##  1 red   
##  2 red   
##  3 red   
##  4 orange
##  5 orange
##  6 orange
##  7 yellow
##  8 yellow
##  9 yellow
## 10 green 
## 11 green 
## 12 green 
## 13 blue  
## 14 blue  
## 15 blue  
## 16 purple
## 17 purple
## 18 purple
ggplot(data = data2,
  aes(x = color))+
  geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
  labs(x = "Color", y = "Count")

If you want a different number of copies for each row, you can create a vector with one weight per row and pass it to uncount(), as I have shown below.

n <- c(1, 2, 3, 4, 5, 6)
data3 <- uncount(data_color, n)
data3 
## # A tibble: 21 × 1
##    color 
##    <chr> 
##  1 red   
##  2 orange
##  3 orange
##  4 yellow
##  5 yellow
##  6 yellow
##  7 green 
##  8 green 
##  9 green 
## 10 green 
## # … with 11 more rows
ggplot(data = data3,
  aes(x = color))+
  geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
  labs(x = "Color", y = "Count")
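The weights can also live in a column of the data instead of a separate vector. The sketch below is one way this might look, using a made-up color_counts table; the column name copy is my own choice for illustration. It also uses the .id and .remove arguments of uncount(), which label each copy and keep the weights column, respectively.

```r
library(tidyr)

# Hypothetical table where the weights are stored in a column called n
color_counts <- tibble(
  color = c("red", "orange", "yellow"),
  n = c(1, 2, 3)
)

# Weights taken from the n column; n is dropped from the result by default
by_column <- uncount(color_counts, n)

# .id adds a column that numbers the copies of each original row
with_id <- uncount(color_counts, n, .id = "copy")

# .remove = FALSE keeps the weights column in the result
kept <- uncount(color_counts, n, .remove = FALSE)
```

Each result has six rows (1 + 2 + 3). Keeping the weights next to the values they describe can be easier to read than matching a separate vector to rows by position.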

Is it helpful?

uncount() could be useful if you needed to quickly build small example datasets. Outside of that, I have not seen a need for uncount(), since most R users tend to work with datasets that already exist. The function could also be used if you needed to duplicate a dataset row by row, but I cannot think of a reason you might choose to do so. Overall, I do not expect to use uncount() frequently in my work.