Intro

I am going to introduce the matches() function and show how it can be used to select columns with regular expressions (a tool for describing patterns in strings). I will use a dataset on hotel reservations as an example. Below are the names of all the variables in the hotel dataset.

# load the tidyverse
library(tidyverse)
# example dataset
reserve <- read.csv("Data/Hotel_Reservations.csv")
colnames(reserve)

##  [1] "Booking_ID"                          
##  [2] "no_of_adults"                        
##  [3] "no_of_children"                      
##  [4] "no_of_weekend_nights"                
##  [5] "no_of_week_nights"                   
##  [6] "type_of_meal_plan"                   
##  [7] "required_car_parking_space"          
##  [8] "room_type_reserved"                  
##  [9] "lead_time"                           
## [10] "arrival_year"                        
## [11] "arrival_month"                       
## [12] "arrival_date"                        
## [13] "market_segment_type"                 
## [14] "repeated_guest"                      
## [15] "no_of_previous_cancellations"        
## [16] "no_of_previous_bookings_not_canceled"
## [17] "avg_price_per_room"                  
## [18] "no_of_special_requests"              
## [19] "booking_status"

What is it for?

The matches() function selects columns whose names match a regular expression, whereas the contains() function looks only for a literal string. For instance:

reserve %>% select(contains("val")) %>% glimpse()

## Rows: 36,275
## Columns: 3
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date  <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…

reserve %>% select(matches("val")) %>% glimpse()

## Rows: 36,275
## Columns: 3
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date  <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…

Both give the same result, because both find the columns with “val” in their names. On the other hand:

reserve %>% select(contains("val_[ym]")) %>% glimpse()

## Rows: 36,275
## Columns: 0

reserve %>% select(matches("val_[ym]")) %>% glimpse()

## Rows: 36,275
## Columns: 2
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …

Notice that contains("val_[ym]") selects zero columns, because no column name includes the literal text “val_[ym]”.

matches("val_[ym]") is interpreted as a regular expression, where “[ym]” states that the next character must be “y” or “m”. This is why matches("val_[ym]") does not include the column “arrival_date”, whereas matches("val") did.
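The contrast can be reproduced without the CSV file. The sketch below is only an illustration: `toy` is a made-up data frame that reuses three column names from the hotel dataset, and its values are invented.

```r
library(dplyr)

# Hypothetical toy data: real column names, made-up values
toy <- tibble(
  arrival_year  = c(2017, 2018),
  arrival_month = c(10, 11),
  arrival_date  = c(2, 6)
)

# contains() looks for the literal text "val_[ym]", which no name includes
toy %>% select(contains("val_[ym]")) %>% names()
# character(0)

# matches() reads "[ym]" as a character class: a single "y" or "m"
toy %>% select(matches("val_[ym]")) %>% names()
# "arrival_year"  "arrival_month"
```

Calling names() instead of glimpse() keeps the output down to the selected column names.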

Another quick example is:

reserve %>% select(matches("of_[^p]")) %>% glimpse()

## Rows: 36,275
## Columns: 6
## $ no_of_adults           <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, 1, 2, 2,…
## $ no_of_children         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ no_of_weekend_nights   <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, 2, 2, 2, 0, 0,…
## $ no_of_week_nights      <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, 1, 1, 0, 2, 2,…
## $ type_of_meal_plan      <chr> "Meal Plan 1", "Not Selected", "Meal Plan 1", "…
## $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, 1, 0, 2, 2, 1,…

Here matches("of_[^p]") finds columns in which “of_” is followed by some character other than “p”. This excludes “no_of_previous_cancellations” and “no_of_previous_bookings_not_canceled”.
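One way to check a pattern before using it in select() is to test it on the column names directly with base R's grepl(). This is a rough sketch, not part of the original analysis: `cols` is a hand-picked subset of the names listed by colnames(reserve). Note that grepl() is case-sensitive by default while matches() ignores case by default, so the two can disagree on names that contain capital letters.

```r
# Hand-picked subset of the hotel dataset's column names (for illustration)
cols <- c("no_of_adults", "no_of_previous_cancellations",
          "type_of_meal_plan", "lead_time")

# TRUE where "of_" is followed by a character other than "p"
grepl("of_[^p]", cols)
# TRUE FALSE  TRUE FALSE
```

“lead_time” is FALSE simply because it has no “of_” at all, not because of the “[^p]” part.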

Is it helpful?

This can be useful for large datasets with many similar variable names. With regular expressions, matches() lets you be more specific about which variables you want to select.
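As a small sketch of that kind of specificity, the pattern below uses the anchors “^” (start of the name) and “$” (end of the name) to pick out only the night-count columns. The data frame `nights_toy` is hypothetical: it borrows four column names from the hotel dataset with made-up values.

```r
library(dplyr)

# Hypothetical toy data: real column names, made-up values
nights_toy <- tibble(
  no_of_adults           = 2,
  no_of_weekend_nights   = 1,
  no_of_week_nights      = 2,
  no_of_special_requests = 0
)

# "^no_of_" must start the name, "nights$" must end it,
# and ".*" allows any characters in between
nights_toy %>% select(matches("^no_of_.*nights$")) %>% names()
# "no_of_weekend_nights" "no_of_week_nights"
```

A literal search such as contains("nights") would give the same two columns here; the anchored pattern matters more when other names also contain “nights” somewhere in the middle.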

Function of the Week: matches()

Kellen Stark

2023-02-20
