Ridgeline Plot, AKA density ridges plot.

In this document, I will introduce the geom_density_ridges function from the ggridges package, and show what it’s for.

What is it for?

A ridgeline plot draws one density curve for each level of a categorical variable, so you can compare how the distribution of a continuous variable differs across those levels. To use geom_density_ridges, we need to load the ggplot2 and ggridges packages.

library(ggplot2)
library(ggridges)
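
To make the idea concrete, here is a minimal base-R sketch of what sits behind each ridge: one kernel density estimate per category, all computed with a shared bandwidth. The built-in iris data is only a stand-in, and the rule used for the shared bandwidth (the mean of each group's default bw.nrd0 bandwidth) is my assumption about how ggridges arrives at the "joint bandwidth" it reports, so treat it as an illustration rather than the package's exact method.

```r
# Split a continuous variable by a categorical one
# (built-in iris data, used here only as a stand-in).
groups <- split(iris$Sepal.Length, iris$Species)

# One shared ("joint") bandwidth: the mean of each group's default bandwidth.
# Assumption: this approximates the joint bandwidth that ggridges reports.
joint_bw <- mean(vapply(groups, bw.nrd0, numeric(1)))

# One kernel density estimate per group, all using the same bandwidth.
densities <- lapply(groups, density, bw = joint_bw)

# Each element holds the x/y coordinates of one curve.
vapply(densities, function(d) length(d$x), integer(1))
```

A ridgeline plot then stacks these curves vertically, one baseline per category, which is what geom_density_ridges does for us in a single call.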

Example #1, from our smoke_complete dataset that we used in Homework 3, shows us the distribution of years of age at lung disease diagnosis by race. This visualization makes it very easy to show differences in distribution of a continuous variable by specified categories of interest; for example, you might choose to view age at diagnosis among categories of non, light, moderate, and heavy smoking.

# Load data (read_excel comes from the readxl package)
library(readxl)
smoke_complete <- read_excel("smoke_complete.xlsx",
                             sheet = 1,
                             na = "NA")

# Convert age at diagnosis from days to years
smoke_complete$age_dx_yrs <- smoke_complete$age_at_diagnosis/365
# Example 1, distribution of age at diagnosis with lung disease by race
ggplot(data = smoke_complete,
       aes(x = age_dx_yrs,
           y = race,
           fill = race)
       ) + 
  geom_density_ridges(alpha = 0.4) +
  ggthemes::theme_clean() + 
  theme(legend.position="none") + 
  labs(
    x = "Age at diagnosis (years)",   
    y = "Race",
    title = "Age at diagnosis vs. race"
    )
## Picking joint bandwidth of 4.02

table(smoke_complete$race)
## 
##          american indian or alaska native 
##                                         2 
##                                     asian 
##                                        27 
##                 black or african american 
##                                        70 
## native hawaiian or other pacific islander 
##                                         1 
##                              not reported 
##                                       210 
##                                     white 
##                                       842

In this example, we can see that the distributions are somewhat left-skewed in every race group, and that there is some hint of bimodality in the Asian group and perhaps also the Black or African American group, with a rather substantial share of diagnoses before age 50. However, the irregularity we see in these curves may simply reflect the smaller numbers in these groups (27 and 70 observations, compared with 842 in the white group). We can also see that the categories “Native Hawaiian or other Pacific Islander” (1 observation) and “American Indian or Alaska Native” (2 observations) have too few observations to generate a curve, so those lines are left empty.
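
One way to handle sparse categories is to count the observations per group, drop the groups that are too small to trust, and put the group size in the axis label. The sketch below uses a hypothetical toy data frame (toy, with columns value and group) rather than smoke_complete, and the cutoff of 10 is an arbitrary choice for illustration. Passing bandwidth = 4 fixes the smoothing bandwidth explicitly instead of letting ggridges pick one.

```r
library(ggplot2)
library(ggridges)

# Hypothetical toy data: two well-populated groups and one with 2 observations.
set.seed(1)
toy <- data.frame(
  value = c(rnorm(100, 60, 10), rnorm(40, 55, 12), 48, 71),
  group = rep(c("large", "medium", "tiny"), times = c(100, 40, 2))
)

# Count observations per group and keep only groups at or above the cutoff.
# The cutoff of 10 is arbitrary and chosen only for this illustration.
min_n <- 10
counts <- table(toy$group)
toy_kept <- toy[toy$group %in% names(counts[counts >= min_n]), ]

# Put the group size in the label so readers can judge each curve.
toy_kept$group_label <- paste0(
  toy_kept$group, " (n = ", as.integer(counts[toy_kept$group]), ")"
)

p_toy <- ggplot(toy_kept,
                aes(x = value, y = group_label, fill = group_label)) +
  geom_density_ridges(alpha = 0.4, bandwidth = 4) +
  theme(legend.position = "none") +
  labs(x = "Value", y = "Group")
p_toy
```

The same pattern should carry over to smoke_complete by swapping in age_dx_yrs and race, although whether to drop or merely flag small groups is a judgment call that depends on the analysis.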

Example #2, using the penguin data from the palmerpenguins library, shows us the distribution of body mass in grams by species of penguin.

# Example 2, distribution of penguin body mass in grams by species
library(palmerpenguins)
data(penguins)

ggplot(data = penguins,
       aes(x = body_mass_g,
           y = species,
           fill = species)
       ) + 
  geom_density_ridges(alpha = 0.4) +
  ggthemes::theme_clean() + 
  theme(legend.position="none") + 
  labs(
    x = "Body Mass (grams)",   
    y = "Species",
    title = "Penguin body mass by species"
    )
## Picking joint bandwidth of 153
## Warning: Removed 2 rows containing non-finite values (`stat_density_ridges()`).

We can see right away that Gentoos are much heavier than the other penguin species, and that their body mass also spans a wider range. The Gentoo distribution has a hint of bimodality, which we might guess points to sex differences in size (and if you wanted to explore that further, a ridgeline plot would also be a good tool for visualizing that!).
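
As a minimal sketch of that follow-up, mapping fill to sex while keeping species on the y-axis draws two overlapping ridges per species. The object name penguins_sexed is my own, and dropping the rows with missing sex is an assumption about how you would want to treat them.

```r
library(ggplot2)
library(ggridges)
library(palmerpenguins)

# Keep only rows where both sex and body mass were recorded,
# so every ridge belongs to a defined sex.
penguins_sexed <- subset(penguins, !is.na(sex) & !is.na(body_mass_g))

# Mapping fill to sex splits each species into two ridges on the same baseline.
ggplot(data = penguins_sexed,
       aes(x = body_mass_g,
           y = species,
           fill = sex)
       ) +
  geom_density_ridges(alpha = 0.4) +
  labs(
    x = "Body Mass (grams)",
    y = "Species",
    fill = "Sex",
    title = "Penguin body mass by species and sex"
    )
```

If the two ridges within a species sit clearly apart, that would support the guess that the bimodality reflects sex differences.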

Is it helpful?

This is really useful for my work, because I often look at outcomes like gestational weight gain or birthweight whose distributions may differ by categorical variables such as race, age bracket, income bracket, insurance type, or parity. Comparing the shapes of these distributions helps me decide whether I need to dig deeper into quantifying any differences, and it can also help me start to uncover the story that underlies them. For example, suppose I see bimodality (two humps) in the distribution of birthweight in one income bracket but not in others. That raises a question: does this reflect people at the low end of the bracket who qualify for food assistance, people at the high end who have enough money to buy the groceries they need, and people in the middle who fall through the gap, with too much income to qualify for assistance but not enough to keep themselves and their families adequately fed? In other words, it can help point me toward where to start looking at problems, and start me on the path to thinking about solutions.