Intro

I am going to introduce the matches() function and show how it can be used to select columns with regular expressions (a tool for describing patterns in strings). I will use a dataset on hotel reservations as an example. Below are the names of all the variables in the hotel dataset.

# Load the tidyverse
library(tidyverse)
# Example dataset
reserve <- read.csv("Data/Hotel_Reservations.csv")
colnames(reserve)
##  [1] "Booking_ID"                          
##  [2] "no_of_adults"                        
##  [3] "no_of_children"                      
##  [4] "no_of_weekend_nights"                
##  [5] "no_of_week_nights"                   
##  [6] "type_of_meal_plan"                   
##  [7] "required_car_parking_space"          
##  [8] "room_type_reserved"                  
##  [9] "lead_time"                           
## [10] "arrival_year"                        
## [11] "arrival_month"                       
## [12] "arrival_date"                        
## [13] "market_segment_type"                 
## [14] "repeated_guest"                      
## [15] "no_of_previous_cancellations"        
## [16] "no_of_previous_bookings_not_canceled"
## [17] "avg_price_per_room"                  
## [18] "no_of_special_requests"              
## [19] "booking_status"

What is it for?

The matches() function selects columns whose names match a regular expression, whereas the contains() function treats its argument as a literal string. For instance:

reserve %>% select(contains("val")) %>% glimpse()
## Rows: 36,275
## Columns: 3
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date  <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…
reserve %>% select(matches("val")) %>% glimpse()
## Rows: 36,275
## Columns: 3
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date  <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…

Both calls return the same columns, because both look for “val” anywhere in the column name. With a pattern that uses regular-expression syntax, on the other hand, the results differ:

reserve %>% select(contains("val_[ym]")) %>% glimpse()
## Rows: 36,275
## Columns: 0
reserve %>% select(matches("val_[ym]")) %>% glimpse()
## Rows: 36,275
## Columns: 2
## $ arrival_year  <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …

Notice that contains("val_[ym]") selects zero columns, because no column name includes the literal text “val_[ym]”.

matches("val_[ym]") is interpreted as a regular expression, where “[ym]” means that the character right after “val_” must be “y” or “m”. This is why matches("val_[ym]") does not select the column “arrival_date”, whereas matches("val") did.
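The difference between a literal and a regular-expression search can also be checked without loading the dataset. The following is only a rough sketch in base R: it assumes that grepl() with fixed = TRUE is a fair stand-in for the literal search contains() performs, and that plain grepl() is a fair stand-in for matches(). The vector cols is typed in by hand from the listing above.

```r
# Column names copied by hand from the colnames() listing (assumption: no typos)
cols <- c("arrival_year", "arrival_month", "arrival_date")

# Literal search (stand-in for contains()): "[ym]" is treated as plain text
literal_hits <- cols[grepl("val_[ym]", cols, fixed = TRUE)]

# Regex search (stand-in for matches()): "[ym]" means "one character, y or m"
regex_hits <- cols[grepl("val_[ym]", cols)]

print(literal_hits)
print(regex_hits)
```

The literal search comes back empty, while the regex search keeps “arrival_year” and “arrival_month” and drops “arrival_date”, mirroring the select() results above.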

Another quick example:

reserve %>% select(matches("of_[^p]")) %>% glimpse()
## Rows: 36,275
## Columns: 6
## $ no_of_adults           <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, 1, 2, 2,…
## $ no_of_children         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ no_of_weekend_nights   <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, 2, 2, 2, 0, 0,…
## $ no_of_week_nights      <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, 1, 1, 0, 2, 2,…
## $ type_of_meal_plan      <chr> "Meal Plan 1", "Not Selected", "Meal Plan 1", "…
## $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, 1, 0, 2, 2, 1,…

Here “[^p]” matches any single character other than “p”, so matches("of_[^p]") finds columns whose names contain “of_” followed by a character that is not “p”. This excludes “no_of_previous_cancellations” and “no_of_previous_bookings_not_canceled”.
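To see the negated character class at work on the names alone, here is a small base R sketch. It assumes grep() with value = TRUE behaves like matches() for this pattern; the vector of_cols is a hand-typed copy of the eight column names containing “of_” from the listing above.

```r
# The eight column names containing "of_", copied by hand from the listing
of_cols <- c(
  "no_of_adults", "no_of_children", "no_of_weekend_nights",
  "no_of_week_nights", "type_of_meal_plan",
  "no_of_previous_cancellations", "no_of_previous_bookings_not_canceled",
  "no_of_special_requests"
)

# "[^p]" matches exactly one character that is not "p"
kept <- grep("of_[^p]", of_cols, value = TRUE)

# Names that did not match the pattern
dropped <- setdiff(of_cols, kept)

print(kept)
print(dropped)
```

Only the two “no_of_previous_...” names end up in dropped, because in both of them the character after “of_” is “p”.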


Is it helpful?

matches() is most useful for large datasets with many similarly named variables. Because it accepts regular expressions, it lets you be more specific about which variables you want to select than a literal string allows.