I am going to introduce the matches() function and show how it can be used to find columns using regular expressions (a tool for describing patterns in strings). I will be using a dataset on hotel reservations as an example. Below is the list of names for all the variables in the hotel dataset.
# load the tidyverse
library(tidyverse)
# example dataset
reserve <- read.csv("Data/Hotel_Reservations.csv")
colnames(reserve)
## [1] "Booking_ID"
## [2] "no_of_adults"
## [3] "no_of_children"
## [4] "no_of_weekend_nights"
## [5] "no_of_week_nights"
## [6] "type_of_meal_plan"
## [7] "required_car_parking_space"
## [8] "room_type_reserved"
## [9] "lead_time"
## [10] "arrival_year"
## [11] "arrival_month"
## [12] "arrival_date"
## [13] "market_segment_type"
## [14] "repeated_guest"
## [15] "no_of_previous_cancellations"
## [16] "no_of_previous_bookings_not_canceled"
## [17] "avg_price_per_room"
## [18] "no_of_special_requests"
## [19] "booking_status"
The matches() function allows the use of regular expressions to find columns, whereas the contains() function is limited to a literal interpretation of the string it is given. For instance:
reserve %>% select(contains("val")) %>% glimpse()
## Rows: 36,275
## Columns: 3
## $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…
reserve %>% select(matches("val")) %>% glimpse()
## Rows: 36,275
## Columns: 3
## $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
## $ arrival_date <int> 2, 6, 28, 20, 11, 13, 15, 26, 6, 18, 11, 30, 26, 20, 20,…
Both give the same answer, because both find the columns with “val” in the name. On the other hand:
reserve %>% select(contains("val_[ym]")) %>% glimpse()
## Rows: 36,275
## Columns: 0
reserve %>% select(matches("val_[ym]")) %>% glimpse()
## Rows: 36,275
## Columns: 2
## $ arrival_year <int> 2017, 2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018, 20…
## $ arrival_month <int> 10, 11, 2, 5, 4, 9, 10, 12, 7, 10, 9, 4, 11, 11, 10, 6, …
Notice that contains("val_[ym]") selects zero columns, as no column name includes the literal text “val_[ym]”. In contrast, matches("val_[ym]") is interpreted as a regular expression, where “[ym]” states that the next character must be “y” or “m”. This is why matches("val_[ym]") does not include the column “arrival_date”, whereas matches("val") did.
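The difference between the literal and the regular-expression reading can be sketched in base R, without the dataset. This is only an analogy, not how contains() and matches() are implemented: grepl() with fixed = TRUE stands in for the literal search, plain grepl() stands in for the regular-expression search, and the vector cols is a hand-typed subset of the column names listed above.

```r
# A few column names copied from the hotel dataset listing (hand-typed subset)
cols <- c("arrival_year", "arrival_month", "arrival_date", "lead_time")

# Literal search (analogous to contains()): "[ym]" is treated as ordinary text
literal_hits <- cols[grepl("val_[ym]", cols, fixed = TRUE)]

# Regular-expression search (analogous to matches()): "[ym]" means "y" or "m"
regex_hits <- cols[grepl("val_[ym]", cols)]

print(literal_hits)
print(regex_hits)
```

The literal search finds nothing because no name contains the square brackets themselves, while the regular-expression search keeps only the names in which “val_” is followed by “y” or “m”.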
Another quick example is:
reserve %>% select(matches("of_[^p]")) %>% glimpse()
## Rows: 36,275
## Columns: 6
## $ no_of_adults <int> 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, 1, 2, 2,…
## $ no_of_children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ no_of_weekend_nights <int> 1, 2, 2, 0, 1, 0, 1, 1, 0, 0, 1, 2, 2, 2, 0, 0,…
## $ no_of_week_nights <int> 2, 3, 1, 2, 1, 2, 3, 3, 4, 5, 0, 1, 1, 0, 2, 2,…
## $ type_of_meal_plan <chr> "Meal Plan 1", "Not Selected", "Meal Plan 1", "…
## $ no_of_special_requests <int> 0, 1, 0, 0, 0, 1, 1, 1, 1, 3, 0, 1, 0, 2, 2, 1,…
Here, matches("of_[^p]") finds columns in which the character immediately after “of_” is not “p”. This excludes “no_of_previous_cancellations” and “no_of_previous_bookings_not_canceled”.
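As a rough check of the negated character class, the same pattern can be tried in base R on the column names that contain “of_”. The vector of_cols is typed in by hand from the listing above, and grep() is used here as a stand-in for matches(), not as its actual implementation.

```r
# Column names containing "of_", copied from the hotel dataset listing
of_cols <- c(
  "no_of_adults", "no_of_children", "no_of_weekend_nights",
  "no_of_week_nights", "type_of_meal_plan",
  "no_of_previous_cancellations",
  "no_of_previous_bookings_not_canceled",
  "no_of_special_requests"
)

# "[^p]" matches any single character except "p"
kept <- grep("of_[^p]", of_cols, value = TRUE)

# Names that were filtered out by the pattern
dropped <- setdiff(of_cols, kept)

print(kept)
print(dropped)
```

Six names are kept, and the two dropped names are the ones in which “of_” is followed by “previous”.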
This could be useful for large datasets with similar variable names. Using matches() allows you to be more specific about which variables you want to find by using regular expressions.
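For readers who do not have the CSV file at hand, the comparisons from this post can be reproduced on a small stand-in data frame. The data frame toy and its values are made up for illustration; only the column names are borrowed from the hotel dataset. The sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: the values are invented, the names come from the hotel dataset
toy <- data.frame(
  arrival_year = c(2017L, 2018L),
  arrival_month = c(10L, 11L),
  arrival_date = c(2L, 6L),
  no_of_adults = c(2L, 2L),
  no_of_previous_cancellations = c(0L, 0L)
)

# contains() reads the pattern literally, so nothing is selected
literal_cols <- toy %>% select(contains("val_[ym]")) %>% names()

# matches() reads the pattern as a regular expression
regex_cols <- toy %>% select(matches("val_[ym]")) %>% names()

# Negated class: keep "of_" columns where the next character is not "p"
negated_cols <- toy %>% select(matches("of_[^p]")) %>% names()

print(literal_cols)
print(regex_cols)
print(negated_cols)
```

Swapping toy for a larger data frame leaves the select() calls unchanged, which is where pattern-based selection saves the most typing.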