I will introduce the geom_qq function from the ggplot2 package and show how to make QQ plots.
# load ggplot2, which provides geom_qq() and geom_qq_line()
library(ggplot2)
# example dataset
data("mtcars")
This function creates a quantile-quantile plot (also called a QQ plot or normal probability plot), which plots the quantiles of your sample against the quantiles you would expect from a theoretical distribution (the normal distribution by default). This allows you to assess whether your data are approximately normally distributed: the closer the dots are to a straight line, the more closely your data approximate a normal distribution.
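To make the idea concrete, here is a minimal base R sketch of the two sets of quantiles behind a normal QQ plot. The names `sample_q`, `theoretical_q`, and `qq_manual` are just illustrative, and the plotting positions come from `ppoints()`, which is what `qqnorm()` uses.

```r
data("mtcars")

# Sample quantiles: the observed values, sorted from smallest to largest
sample_q <- sort(mtcars$mpg)
n <- length(sample_q)

# Theoretical quantiles: where n evenly spaced probabilities fall
# on a standard normal distribution
theoretical_q <- qnorm(ppoints(n))

# Each row is one point on the QQ plot
qq_manual <- data.frame(theoretical = theoretical_q, sample = sample_q)
head(qq_manual)

# Sanity check against base R's qqnorm(), without drawing anything
ref <- qqnorm(mtcars$mpg, plot.it = FALSE)
all.equal(sort(ref$x), theoretical_q)
```

If the data were exactly normal, the points in `qq_manual` would fall on a straight line.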
This is an important step when doing almost any kind of modeling. Normality is an assumption of many statistical tests. QQ plots are just one of many tools for assessing your data for violations of the assumptions of the statistical tests you use.
I am going to use the mtcars dataset. It contains fuel consumption in miles per gallon (mpg) and ten other design and performance measurements, such as weight, horsepower, and number of cylinders, for 32 cars.
# QQ plot with 1 var: miles per gallon
qq <- ggplot(mtcars, aes(sample = mpg)) +
  geom_qq()
qq
# QQ plot with labels
qq +
  labs(x = "Theoretical quantiles",
       y = "Miles per gallon (sample quantiles)",
       title = "QQ plot of MPG for cars from the mtcars dataset")
This only tells you about the normality of a single variable. If you are interested in whether a regression model meets its normality assumption, you need to look at the model residuals instead.
# Call simple regression model
model1<-lm(mpg~wt, data=mtcars)
# Residuals
res1 <- resid(model1) # residual
# Plot residuals
qq2 <- ggplot(mtcars, aes(sample = res1)) +
  geom_qq(color = "aquamarine4") +
  labs(x = "Theoretical quantiles",
       y = "Sample quantiles of residuals",
       title = "QQ Plot of Model Residuals for MPG versus Weight") +
  theme_minimal()
qq2
# You can also add a reference line showing where the points would fall if the residuals were normal
qq3 <- qq2 +
  geom_qq_line(color = "mediumorchid3")
qq3
When the observations form a curve instead of a straight line, this suggests your sample data may be skewed; when they peel away from the line mainly at the ends, the tails are heavier or lighter than a normal distribution's. From this graph it looks like our residuals are approximately normal through the middle, but they drift away from the line near the ends of the distribution.
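To get a feel for these patterns, it can help to look at QQ plots of simulated data whose shape is known. The sketch below is an illustration rather than part of the analysis above: the data frame `sim`, the seed, and the sample size of 200 are arbitrary choices. It puts a normal sample next to a right-skewed (exponential) one.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the simulated data are reproducible

# 200 draws from a normal distribution and 200 from a right-skewed one
sim <- data.frame(
  value = c(rnorm(200), rexp(200)),
  shape = rep(c("normal", "right-skewed"), each = 200)
)

# One QQ panel per distribution, each with its own reference line
p_sim <- ggplot(sim, aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ shape, scales = "free_y") +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")
p_sim
```

The normal panel should hug its line, while the right-skewed panel should bend away from it in a curve.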
I already use something else (qqnorm), so this doesn’t improve on that for me. Here is my code in case that’s useful. I also usually make residual-vs-fitted plots, which geom_qq() can’t produce because it accepts only a single variable (the sample aesthetic).
# Generate fitted values + other residuals
fitted1 <- model1$fitted.values # fitted value
stand1 <- rstandard(model1) # standardized residual
stud1 <- rstudent(model1) # studentized residual
# Plot of residuals vs. fitted for model1
par(mfrow=c(1,3)) # arrange the three plots side by side in one row
plot(fitted1, res1,
main="Residual Plot",
xlab="Fitted Value",
ylab="Residual")
abline(0,0, col="seagreen")
# Plot of standardized residual vs. fitted for model1
plot(fitted1, stand1,
main="Standardized Residual Plot",
xlab="Fitted Value",
ylab="Residual")
abline(0,0, col="hotpink1")
# Plot of studentized residual vs. fitted for model1
plot(fitted1, stud1,
main="Studentized Residual Plot",
xlab="Fitted Value",
ylab="Residual")
abline(0,0, col="salmon2")
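# Skewness of the residuals
# skewness() is not in base R; it is provided by packages such as moments or e1071
# (assumption: moments is used here; the exact value can differ between packages)
library(moments)
skew <- skewness(res1)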
skew
## [1] 0.6678291
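If you would rather not depend on an extra package, skewness can be computed by hand. This is a minimal sketch using one common moment-based definition (third central moment divided by the second central moment raised to the power 3/2). The helper `skew_manual` is a hypothetical name, and other packages may apply small-sample corrections that give slightly different values.

```r
# Hypothetical helper: moment-based sample skewness
skew_manual <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  d <- x - mean(x)           # deviations from the mean
  m2 <- mean(d^2)            # second central moment
  m3 <- mean(d^3)            # third central moment
  m3 / m2^(3 / 2)
}

data("mtcars")
model1 <- lm(mpg ~ wt, data = mtcars)
skew_manual(resid(model1))
```

A positive value points to a longer right tail, a negative value to a longer left tail, and a value near zero to a roughly symmetric distribution.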
# QQ plots
par(mfrow=c(1,3)) # three QQ plots side by side in one row
qqnorm(res1,
ylab="Residuals")
qqline(res1)
qqnorm(stand1,
ylab="Standardized Residuals")
qqline(stand1)
qqnorm(stud1,
ylab="Studentized Residuals")
qqline(stud1)
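The residual-vs-fitted plot can also be drawn in ggplot2, just not with geom_qq(). A minimal sketch, assuming the same simple regression of mpg on wt as above, uses geom_point() for the points and geom_hline() for the zero line. The data frame `diag1` and the plot object `rvf` are illustrative names.

```r
library(ggplot2)

data("mtcars")
model1 <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in one data frame for plotting
diag1 <- data.frame(
  fitted   = fitted(model1),
  residual = resid(model1)
)

# Residuals vs. fitted values, with a horizontal reference line at zero
rvf <- ggplot(diag1, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "seagreen") +
  labs(x = "Fitted Value",
       y = "Residual",
       title = "Residual Plot") +
  theme_minimal()
rvf
```

Swapping `resid(model1)` for `rstandard(model1)` or `rstudent(model1)` gives the standardized and studentized versions.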