uncount()
# load the tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(broom)
#example dataset
library(palmerpenguins)
library(ggplot2)
library(knitr)
library(gt)
data(penguins)
In this document, I will introduce the uncount()
function and show what it can be used for.
The uncount() function can be found within the
tidyr library, which I’ve loaded below.
library(tidyr)
If you type ?uncount() into the console, what R tells
you is that uncount() “performs the opposite operation to
dplyr::count()”. So R suggests that in order to understand
uncount(), we must first understand the function
count(). I disagree with R on this idea, but I have
included a short section on count() in case it makes
uncount() clearer.
count() is found in the dplyr library and uncount() in tidyr; both are
loaded along with the tidyverse.
count() lets you easily count how many rows there are for each unique
value of one or more variables. This function produces the same output
as group_by() combined with summarise(n = n()), as you
can see below when considering the species found within the
penguins dataset.
penguins %>%
group_by(species) %>%
summarise(n = n())
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
penguins %>%
count(species)
## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
As we can see, count() simply tallies how many rows share each unique
value in a specified column. uncount() goes the other way: instead of
collapsing rows into counts, it produces a data frame by repeating each
row according to the weights assigned. The weights determine how many
copies of each row are created in the new dataset, and they can be
assigned as a single number for the whole dataset or as a separate
number for each row. This will make more sense as we consider the
examples below.
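To make the "opposite of count()" idea concrete, here is a minimal sketch of a round trip on a small made-up dataset. The animals tibble and its values are hypothetical and only there for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per observed animal
animals <- tibble(species = c("cat", "dog", "dog", "bird", "dog", "cat"))

# count() collapses the rows into one row per species plus a count column n
counts <- animals %>% count(species)

# uncount() repeats each row n times; the weight column n is dropped by default
expanded <- counts %>% uncount(n)

nrow(counts)    # 3 distinct species
nrow(expanded)  # 6 rows again, now grouped by species
```

The round trip gives back the same rows, but not necessarily in the original order, because count() returns one sorted row per species.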
So now we’re going to actually use the uncount()
function with the penguins dataset.
penguins
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42 20.2 190 4250 <NA> 2007
## # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
We can see the above penguins dataset contains 344
rows. With uncount(), we can quickly replicate every row of the dataset
n times. Below, I set n to 2, so each row is duplicated and the copy is
placed directly after the original row. A similar
function that we’ve covered is rbind(); however, if you used
rbind() to bind a duplicate dataset, it would attach the whole copy to
the bottom of the dataset instead.
penguins2 <- uncount(penguins, 2)
penguins2
## # A tibble: 688 × 8
## species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 3 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 4 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 5 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 6 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 7 Adelie Torgersen NA NA NA NA <NA> 2007
## 8 Adelie Torgersen NA NA NA NA <NA> 2007
## 9 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 10 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## # … with 678 more rows, and abbreviated variable names ¹flipper_length_mm,
## # ²body_mass_g
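The difference in row order between uncount() and rbind() is easier to see on a tiny dataset. This is a small sketch using a hypothetical three-row tibble called small, not part of the penguins data.

```r
library(tidyr)

# Hypothetical three-row data frame so the ordering is easy to see
small <- tibble(id = c("a", "b", "c"))

# uncount() repeats each row in place
interleaved <- uncount(small, 2)
interleaved$id  # "a" "a" "b" "b" "c" "c"

# rbind() stacks a full copy underneath the original
stacked <- rbind(small, small)
stacked$id      # "a" "b" "c" "a" "b" "c"
```

Both results have six rows; only the ordering differs.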
Another way to use uncount() is to use it to build your
own dataset quickly from scratch.
data_color <- tibble(color = c("red","orange","yellow","green","blue","purple"))
data_color
## # A tibble: 6 × 1
## color
## <chr>
## 1 red
## 2 orange
## 3 yellow
## 4 green
## 5 blue
## 6 purple
ggplot(data = data_color,
aes(x = color))+
geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
labs(x = "Color", y = "Count")
Above I have made a very simple dataset containing the colors of the
rainbow. There is 1 of each color. Below, I have taken the same dataset
and manipulated it using uncount() with an n of 3 to
triplicate each row.
data2 <- uncount(data_color, 3)
data2
## # A tibble: 18 × 1
## color
## <chr>
## 1 red
## 2 red
## 3 red
## 4 orange
## 5 orange
## 6 orange
## 7 yellow
## 8 yellow
## 9 yellow
## 10 green
## 11 green
## 12 green
## 13 blue
## 14 blue
## 15 blue
## 16 purple
## 17 purple
## 18 purple
ggplot(data = data2,
aes(x = color))+
geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
labs(x = "Color", y = "Count")
If you want to specify a different number of copies for each row, you
can create a vector with one weight per row and pass that into the
uncount() function as I have shown below.
n <- c(1, 2, 3, 4, 5, 6)
data3 <- uncount(data_color, n)
data3
## # A tibble: 21 × 1
## color
## <chr>
## 1 red
## 2 orange
## 3 orange
## 4 yellow
## 5 yellow
## 6 yellow
## 7 green
## 8 green
## 9 green
## 10 green
## # … with 11 more rows
ggplot(data = data3,
aes(x = color))+
geom_bar(fill = c("blue","green","orange","purple","red","yellow"))+
labs(x = "Color", y = "Count")
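The weights do not have to live in a separate vector. As a sketch, the example below stores them in a column of the data frame itself and also tries the .remove and .id arguments of uncount(). The color_counts tibble and the column name copy are my own made-up names for illustration.

```r
library(tidyr)

# Hypothetical data with the weights stored in a column called n
color_counts <- tibble(color = c("red", "orange", "yellow"), n = c(1, 2, 3))

# Default: each row is repeated n times and the n column is removed
by_column <- uncount(color_counts, n)

# .remove = FALSE keeps the weight column;
# .id adds a column numbering the copies made from each original row
with_id <- uncount(color_counts, n, .remove = FALSE, .id = "copy")

names(by_column)  # "color"
names(with_id)    # "color" "n" "copy"
with_id$copy      # 1 1 2 1 2 3
```

Keeping the weights in a column means they travel with the data, which may be handy when the counts come straight out of count().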
uncount() could be useful if you needed to quickly build
small example datasets. Outside of that, I have not seen a need for
uncount(), since most R users tend to use already generated
datasets. This function could be used if you needed to duplicate a
dataset row by row, but I cannot think of a reason you might choose
to do so. Overall, I do not expect to use uncount() frequently
in my work.