dplyr::recode

In this document, I will introduce the recode() function and show what it’s for.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(palmerpenguins)

What does the command do?

recode() replaces numeric values based on either their position or their name, and replaces character or factor values only by their name. This function has been superseded by case_match().

The syntax differs from that of other common dplyr commands, which put the new name on the left:

->mutate

syntax: df %>% mutate(new_name = old_name)

->rename

syntax: rename(df, new_name = old_name)

In the recode command the order is reversed: the old value goes on the left and the new value on the right.

recode(x, old_value = new_value)
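The difference in argument order can be sketched on a small toy data frame. The data frame df and the column name grp are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- data.frame(grp = c("a", "b", "a"), stringsAsFactors = FALSE)

# rename(): new name on the left, old name on the right
renamed <- rename(df, group = grp)
names(renamed)

# recode(): old value on the left, new value on the right
recoded <- recode(df$grp, a = "alpha", b = "beta")
recoded
```

rename() changes the column name, while recode() changes the values inside the vector.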

Syntax

Syntax of recode

recode(.x, ..., .default = NULL, .missing = NULL)

where

->argument .x is the vector to modify

->argument ... takes the replacements as old = new pairs. It supports dynamic dots, so arguments can be spliced in with the splice operator !!! and names can be injected with glue syntax on the left-hand side of :=

->argument .default: if it is specified, all unmatched values take on the specified value. If it is not specified and the replacements are of the same type as the original values in .x, unmatched values are not changed. If it is not specified and the replacements are not compatible with the original values, the unmatched values are replaced with NA.

->argument .missing: replaces missing values in .x with the specified value

->Replacements must have either length one or the same length as .x.
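As a minimal sketch of how .default and .missing work together, the following recodes a small character vector. The vector x_chr and the replacement labels are made up for illustration.

```r
library(dplyr)

# Illustrative vector with one missing value
x_chr <- c("a", "b", NA, "c")

# "a" is matched by name, "b" and "c" fall through to .default,
# and the NA is handled by .missing
res_args <- recode(x_chr, a = "apple", .default = "other", .missing = "unknown")
res_args
## [1] "apple"   "other"   "unknown" "other"
```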

Syntax of recode_factor

recode_factor(.x, ..., .default = NULL, .missing = NULL, .ordered = FALSE)

->argument .ordered: if TRUE, recode_factor() creates an ordered factor
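A short sketch of the .ordered argument, using a made-up vector x_lvl. The levels of the result follow the order in which the replacements are written, not the order in which the values appear.

```r
library(dplyr)

# Illustrative character vector
x_lvl <- c("b", "a", "b")

# Replacements are listed as low first, then high
f_ord <- recode_factor(x_lvl, a = "low", b = "high", .ordered = TRUE)

is.ordered(f_ord)  # TRUE
levels(f_ord)      # "low" "high"
```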

What is it for?

Recoding numeric variables

Example 1

set.seed(123)

x1 <- sample(c(1:20),size=6,replace=TRUE)
x1
## [1] 15 19 14  3 10 18
dplyr::recode(x1, `3` = 30, `10` = 100, .default =5)
## [1]   5   5   5  30 100   5

In this example, the number 3 was replaced by 30, 10 was replaced by 100, and all other observations were changed to 5 by the .default argument.

Example 2

x2 <- c(2, 3, 100, 1, 4, 3, 3,2,3)               # Create example vector
x2
## [1]   2   3 100   1   4   3   3   2   3
y2<-dplyr::recode(x1,"3"=99)
## Warning: Unreplaced values treated as NA as `.x` is not compatible.
## Please specify replacements exhaustively or supply `.default`.
y2
## [1] NA NA NA 99 NA NA
# after specifying default argument
y2<-dplyr::recode(x1,"3"=99,.default=10)
y2
## [1] 10 10 10 99 10 10

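In the above example, the recode is applied to the integer vector x1 from Example 1 (x2 is created but not used). 3 was replaced by 99. Because the replacement 99 is a double while x1 is an integer, the replacement is not compatible with the original values, so the rest of the values were changed to NA. After specifying the .default argument, the values except 3 were recoded as 10.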
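The NA values in Example 2 come from a type mismatch between an integer vector and a double replacement. As a sketch, supplying an integer replacement (with the L suffix) leaves the unmatched values unchanged. The vector x_int is made up for illustration.

```r
library(dplyr)

# Integer vector, like the one produced by sample(1:20, ...)
x_int <- c(15L, 19L, 3L)

# 99L is an integer, so it is the same type as x_int
# and unmatched values are kept as they are
y_int <- recode(x_int, `3` = 99L)
y_int
## [1] 15 19 99
```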

Example 3

x3 <- rep(1:5, 5)
x3
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
class(x3)
## [1] "integer"
dplyr::recode(x3,`1`=6L,`2`=7L,`3`=8L) # the L suffix creates an integer rather than a double (numeric) value
##  [1] 6 7 8 4 5 6 7 8 4 5 6 7 8 4 5 6 7 8 4 5 6 7 8 4 5
# it can be noted that even though .default argument is not specified, since the replacements are the same type as the original values of .x, the unmatched values 4 and 5 are not changed.

dplyr::recode(x3,"a","b","c") # look at the warnings, need to specify .default argument, the unmatched values are changed to NA
## Warning: Unreplaced values treated as NA as `.x` is not compatible.
## Please specify replacements exhaustively or supply `.default`.
##  [1] "a" "b" "c" NA  NA  "a" "b" "c" NA  NA  "a" "b" "c" NA  NA  "a" "b" "c" NA 
## [20] NA  "a" "b" "c" NA  NA
#use default argument

dplyr::recode(x3,"a","b","c",.default ="nothing")
##  [1] "a"       "b"       "c"       "nothing" "nothing" "a"       "b"      
##  [8] "c"       "nothing" "nothing" "a"       "b"       "c"       "nothing"
## [15] "nothing" "a"       "b"       "c"       "nothing" "nothing" "a"      
## [22] "b"       "c"       "nothing" "nothing"
dplyr::recode(x3,"a","b","c",.default ="other")
##  [1] "a"     "b"     "c"     "other" "other" "a"     "b"     "c"     "other"
## [10] "other" "a"     "b"     "c"     "other" "other" "a"     "b"     "c"    
## [19] "other" "other" "a"     "b"     "c"     "other" "other"
# missing argument
x4 <- c(1:10, NA)
x4
##  [1]  1  2  3  4  5  6  7  8  9 10 NA
dplyr::recode(x4,"a","b","c",.default ="other",.missing=NA_character_)
##  [1] "a"     "b"     "c"     "other" "other" "other" "other" "other" "other"
## [10] "other" NA

When .default was not specified and the replacement values (6L, 7L, 8L) had the same data type as the original values (integer), the unmatched values 4 and 5 were not changed.

When .default was not specified and the data type of the replacement values (“a”, “b”, “c”) didn’t match the original values (integers), the unmatched numbers 4 and 5 were replaced by NA.

When the .default argument was specified, the numbers 4 and 5 were replaced by “nothing” or “other”.

The .missing argument is used to specify a replacement value for any missing values in .x. In the x4 example it is set to NA_character_, so the missing value stays NA.
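The calls with unnamed replacements above rely on position-based matching: each numeric value in .x is used as an index into the replacements. A small sketch with a made-up vector x_pos shows this more directly.

```r
library(dplyr)

# Values are deliberately out of order to show the lookup by position
x_pos <- c(3, 1, 2)

# 1 picks the first replacement, 2 the second, 3 the third
res_pos <- recode(x_pos, "a", "b", "c")
res_pos
## [1] "c" "a" "b"
```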

Recoding characters

Example 4

# load penguins dataset
penguins<-penguins

# check the datatype
class(penguins$species)
## [1] "factor"
# species is a factor; extract its levels as a character vector to demonstrate recoding characters
char_vec1<-as.character(levels(penguins$species))
class(char_vec1)
## [1] "character"
char_vec1
## [1] "Adelie"    "Chinstrap" "Gentoo"
#use recode to change the names
dplyr::recode(char_vec1, 
       "Adelie"="Adelie Penguins",
       "Chinstrap"="Chinstrap Penguins",
       .default="Gentoo Penguins")
## [1] "Adelie Penguins"    "Chinstrap Penguins" "Gentoo Penguins"

In the above example, we first extracted the levels of the factor variable species as a character vector and then recoded the names of the different types of penguins to new names. The third name, “Gentoo”, was not listed explicitly and was recoded through the .default argument.

Example 5

set.seed(123)

char_vec2<-sample(c("ca", "or", "wa"), 15, replace = TRUE)

char_vec2
##  [1] "wa" "wa" "wa" "or" "wa" "or" "or" "or" "wa" "ca" "or" "or" "ca" "or" "wa"
# use splice operator !!!
key <- c(ca = "california",  or= "oregon", wa = "washington")

dplyr::recode(char_vec2, !!!key)
##  [1] "washington" "washington" "washington" "oregon"     "washington"
##  [6] "oregon"     "oregon"     "oregon"     "washington" "california"
## [11] "oregon"     "oregon"     "california" "oregon"     "washington"

In the above example, the !!! operator splices the named vector key into the recode command, so each element is passed as a separate old = new argument and the characters in the character vector are recoded accordingly.
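Splicing is mainly useful when the lookup key is built programmatically rather than typed by hand. The sketch below builds the key from two parallel vectors with setNames(); the names codes, labels and key2 are illustrative.

```r
library(dplyr)

# Two parallel vectors: old codes and their new labels
codes  <- c("ca", "or", "wa")
labels <- c("california", "oregon", "washington")

# Named vector: names are the old values, elements are the new values
key2 <- setNames(labels, codes)

res_key <- recode(c("wa", "ca"), !!!key2)
res_key
## [1] "washington" "california"
```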

Recoding factors

Example 6

x_fac<-as.factor(c("apple","banana","orange"))
x_fac
## [1] apple  banana orange
## Levels: apple banana orange
# replace factor level "banana" by new factor level "strawberry"
y_factor<-dplyr::recode(x_fac,"banana"="strawberry")
y_factor
## [1] apple      strawberry orange    
## Levels: apple strawberry orange
# alternatively use recode_factor, look at the difference!!

y_factor<-dplyr::recode_factor(x_fac,"banana"="strawberry")
y_factor
## [1] apple      strawberry orange    
## Levels: strawberry apple orange

In the above example, the recode command changes the factor level “banana” to “strawberry” and preserves the ordering of the levels. The recode_factor command also changes “banana” to “strawberry”, but it reorders the levels so that the replaced level “strawberry” comes first.

Is this helpful?

-> recode() is superseded in favor of case_match(), which has a more elegant interface

-> recode_factor() is also superseded; however, a direct replacement is not currently available and will most likely eventually be in forcats.

-> Use if_else() for creating new variables based on logical vectors.

-> For more complex recoding, use case_when().

-> I think this function is helpful: it recodes numeric, character and factor variables. We could use it for simple recoding of variables, but I prefer the recode commands from the other packages (car, admisc), as they have other useful arguments such as cut, separator and interval, which the recode command in dplyr doesn’t have. I would prefer case_when(), especially when the recoding of variables involves more complexity. Thus, I can use this function in my work provided the recoding involved is simple.
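For comparison, here is a minimal sketch of the two alternatives mentioned in the list, if_else() and case_when(). The vector x_cw and the labels are made up, and the .default argument of case_when() assumes dplyr 1.1.0 or later (the same version that provides case_match()).

```r
library(dplyr)

x_cw <- c(12, 25, 33, 40, 47)

# if_else(): a single logical condition with two outcomes
flag <- if_else(x_cw > 30, "above 30", "30 or below")

# case_when(): several conditions, evaluated in order
band <- case_when(
  x_cw <= 25 ~ "low",
  x_cw <= 40 ~ "mid",
  .default = "high"
)
band
## [1] "low"  "low"  "mid"  "mid"  "high"
```

Unlike recode(), both functions work on logical conditions, so they can express ranges directly.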

Extra comments

case_match command

This function allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used.

Example 7

x <- c("ca", "or", "wa", "al", "tx", NA, "ut", "co")

dplyr::case_match(x,"ca" ~ 1,"or" ~ 2,"tx" ~ 3,"wa" ~ 4,NA ~ 0,.default =5)
## [1] 1 2 4 5 3 0 5 5

In the above code, whenever the character “ca” was encountered in the vector x, it was replaced by 1, similarly “or” was replaced by 2, “tx” replaced by 3, “wa” replaced by 4 and NA replaced by 0 and any other character in x such as “al”,“ut” and “co” are recoded as 5.

Example 8

num<-rep(1:10,5)
num
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
## [26]  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
dplyr::case_match(num,5~50,10 ~ 100, 8~80, .default = num)
##  [1]   1   2   3   4  50   6   7  80   9 100   1   2   3   4  50   6   7  80   9
## [20] 100   1   2   3   4  50   6   7  80   9 100   1   2   3   4  50   6   7  80
## [39]   9 100   1   2   3   4  50   6   7  80   9 100

In this example, the number 5 in the numeric vector was replaced by 50, 8 was replaced by 80, 10 was replaced by 100, and the other values in the numeric vector were retained as they are.
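case_match() also accepts a vector of values on the left-hand side, so several old values can share one replacement. The following is a small sketch with a made-up vector x_cm; .default = x_cm keeps unmatched values unchanged, as in Example 8.

```r
library(dplyr)

x_cm <- c("ca", "or", "tx", NA)

res_cm <- case_match(
  x_cm,
  c("ca", "or") ~ "west coast",  # two old values, one replacement
  NA ~ "unknown",                # missing values are matched explicitly
  .default = x_cm                # everything else is kept as is
)
res_cm
## [1] "west coast" "west coast" "tx"         "unknown"
```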

Recode commands in other packages!

recode commands are also present in the admisc and car packages.

car (Companion to Applied Regression) package

The recode command in the car package recodes a numeric vector, character vector, or factor according to simple recode specifications.

Syntax:

recode(var, recodes, as.factor, as.numeric=TRUE, levels, to.value="=", interval=":", separator=";")

Example 9

x <- rep(1:10, 3)
x
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
## [26]  6  7  8  9 10
car::recode(x, "c(2, 5) = 'In' ; else = 'Out'")
##  [1] "Out" "In"  "Out" "Out" "In"  "Out" "Out" "Out" "Out" "Out" "Out" "In" 
## [13] "Out" "Out" "In"  "Out" "Out" "Out" "Out" "Out" "Out" "In"  "Out" "Out"
## [25] "In"  "Out" "Out" "Out" "Out" "Out"

The numbers 2 and 5 in the vector x are replaced by the character value “In” and all other numbers in x are recoded as “Out”.
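The interval argument listed in the syntax allows ranges of values to be recoded in one rule. The sketch below assumes the car package is installed and uses a made-up vector x_car; rules are separated by the default separator ";" and ranges use the default interval symbol ":".

```r
# Illustrative vector
x_car <- 1:8

# 1 to 3 become "low", 4 to 6 become "mid", everything else "high"
res_car <- car::recode(x_car, "1:3 = 'low'; 4:6 = 'mid'; else = 'high'")
res_car
```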

admisc (Adrian Dusa’s Miscellaneous) package

The recode command in the admisc package recodes a vector (numeric, character or factor) according to a set of rules. It is similar to the function recode() from package car, but more flexible.

Syntax:

recode(x, rules, cut, values, ...)

Some examples:

Example 10

# more treatment of "else" values
x <- 12:25
x
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# recoding rules don't overlap all existing values, the rest are empty
y<-admisc::recode(x, "15:20=10")
y
##  [1] NA NA NA 10 10 10 10 10 10 NA NA NA NA NA
# all other values are copied
z<-admisc::recode(x, "15:20=10; else=copy")
z
##  [1] 12 13 14 10 10 10 10 10 10 21 22 23 24 25

In the vector y, the numbers from 15 (inclusive) to 20 (inclusive) are replaced by 10 and the rest of the numbers in the vector x are assigned NA.

In the vector z, the numbers from 15 (inclusive) to 20 (inclusive) are replaced by 10 and the rest of the numbers in the vector x are retained as they are.
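For comparison, a similar "recode a range and copy everything else" result can be sketched in dplyr with case_match(), using .default to keep the unmatched values. This is only a dplyr counterpart of the else=copy rule, not how admisc works internally; the names x_rng and z_dplyr are illustrative.

```r
library(dplyr)

x_rng <- 12:25

# Values 15 to 20 become 10; all other values are copied from x_rng
z_dplyr <- case_match(x_rng, 15:20 ~ 10, .default = x_rng)
z_dplyr
##  [1] 12 13 14 10 10 10 10 10 10 21 22 23 24 25
```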

Example 11

set.seed(1234)

x2 <- factor(sample(letters[1:10], 20, replace = TRUE),levels = letters[1:10])
x2
##  [1] j f e i e f d b g f j f d h d d e h d h
## Levels: a b c d e f g h i j
y2<-admisc::recode(x2,"a:d=1;e:i=0;else=NA")
y2
##  [1] NA  0  0  0  0  0  1  1  0  0 NA  0  1  0  1  1  0  0  1  0

Similarly, the levels a to d in the factor x2 are recoded as 1, the levels e to i are recoded as 0, and any other level in x2 (here j) is recoded as NA.

Example 12

set.seed(123)

x <- sample(10:50, 20, replace = TRUE)
x
##  [1] 40 24 23 12 46 23 34 35 36 14 36 37 18 38 44 17 35 16 18 28
y<-admisc::recode(x, cut = "25,40")
y
##  [1] 2 1 1 1 3 1 2 2 2 1 2 2 1 2 3 1 2 1 1 2

Numeric values less than or equal to 25 in the vector x are recoded as 1, values greater than 25 and less than or equal to 40 are recoded as 2, and values exceeding 40 are recoded as 3.
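The same binning can be sketched in base R with findInterval(). This is an illustration of the cut points described above, not admisc's own implementation; the vector x_cut repeats the values printed in Example 12.

```r
# Values copied from the output of Example 12
x_cut <- c(40, 24, 23, 12, 46, 23, 34, 35, 36, 14,
           36, 37, 18, 38, 44, 17, 35, 16, 18, 28)

# left.open = TRUE makes the intervals closed on the right:
# x <= 25 -> 0, 25 < x <= 40 -> 1, x > 40 -> 2; adding 1 gives codes 1 to 3
y_cut <- findInterval(x_cut, c(25, 40), left.open = TRUE) + 1L
y_cut
##  [1] 2 1 1 1 3 1 2 2 2 1 2 2 1 2 3 1 2 1 1 2
```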