I noticed a pattern at one university: students in the business program were using a P-P plot to examine the distribution of residuals in regression models, even though the Q-Q plot is the one widely referenced in statistics textbooks. So I looked deeper into the differences between P-P and Q-Q plots using simulated data.
First, I used R to create a normally distributed data set (N = 100) with M = 3.0 and SD = 1.0.
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)
Review of Histogram
Next, I used ggplot2 (Wickham et al., 2020) to create a histogram of the data.
The data looks approximately normal; however, note the distance between the two tails and the other data points.
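The histogram code is not shown in this post, so here is a minimal sketch of how it might be drawn. The seed, the number of bins, and the axis labels are my assumptions, not the original settings; the data frame name `test` simply mirrors the name used in the plotting calls later in the post.

```r
library(ggplot2)

set.seed(42)  # hypothetical seed, only so the sketch is reproducible
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

# ggplot2 works on data frames, so wrap the vector in one
test <- data.frame(ndata = ndata)

hist_plot <- ggplot(data = test, mapping = aes(x = ndata)) +
  geom_histogram(bins = 15, colour = "white") +  # bins = 15 is an arbitrary choice
  labs(x = "Value", y = "Count")

print(hist_plot)
```

Changing `bins` can make the same data look more or less normal, which is one reason a histogram alone is a weak basis for a decision.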
Tests of Normality
Next, I’ll perform a series of statistical tests to see whether the data follow a theoretical normal distribution. For this illustration, I’ll use six tests: the Shapiro-Wilk test (found in the stats package, which R loads automatically), plus the Anderson-Darling, Cramer-von Mises, Kolmogorov-Smirnov with Lilliefors correction, Pearson chi-square, and Shapiro-Francia tests (all found in the nortest package; Gross & Ligges, 2015).
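The six tests can be run along the following lines. This is a sketch rather than the exact script behind this post: the seed is hypothetical, so the statistics and p-values it prints will not match the output reported here.

```r
library(nortest)  # ad.test, cvm.test, lillie.test, pearson.test, sf.test

set.seed(42)  # hypothetical seed; the original seed is not given
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

# Each function returns an "htest" object holding a statistic and a p-value
results <- list(
  shapiro_wilk     = shapiro.test(ndata),  # from the stats package
  anderson_darling = ad.test(ndata),
  cramer_von_mises = cvm.test(ndata),
  lilliefors       = lillie.test(ndata),
  pearson_chisq    = pearson.test(ndata),
  shapiro_francia  = sf.test(ndata)
)

# Collect the p-values in one named vector for side-by-side comparison
p_values <- sapply(results, function(res) res$p.value)
print(round(p_values, 4))
```

Printing any single element, for example `results$shapiro_wilk`, gives the familiar two-line summary with the test name, statistic, and p-value.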
Shapiro-Wilk normality test
W = 0.99155, p-value = 0.02226
Anderson-Darling normality test
A = 0.72545, p-value = 0.05808
Cramer-von Mises normality test
W = 0.10841, p-value = 0.0859
Lilliefors (Kolmogorov-Smirnov) normality test
D = 0.042662, p-value = 0.07763
Pearson chi-square normality test
P = 62.88, p-value = 1.344e-06
Shapiro-Francia normality test
W = 0.99237, p-value = 0.03868
Interesting. Three of the tests (Anderson-Darling, Cramer-von Mises, and Kolmogorov-Smirnov with Lilliefors correction) did not reject the hypothesis that the data follow a theoretical normal distribution (p > .05), while the three others (Shapiro-Wilk, Pearson chi-square, and Shapiro-Francia) did (p < .05). What to do?
One could pick a test and make a decision, but given the histogram and the split among the tests, a reader could fairly view that decision as subjective. Let’s plot the data against a theoretical normal distribution instead.
The P-P Plot
Using ggplot2 and qqplotr (Almeida et al., 2020), I created a P-P plot of the data and added a 95% CI band around the AB line:
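# `test` is assumed to be a data frame with ndata as a column
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_pp_band() + stat_pp_line() + stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")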
Note the “submarine sandwich” shape of the 95% CI band around the data. A P-P plot compares cumulative probabilities, so it is most sensitive to differences in the middle of the distribution, where skewness or asymmetry shows up; deviations near the mode are magnified. An emerging researcher relying on a P-P plot could therefore cite some of the statistical tests to state that the distribution is normal and use the P-P plot to support that conclusion.
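To see why the middle of the distribution dominates a P-P plot, it can help to compute the plotted coordinates by hand. The sketch below uses base R only; the seed is hypothetical, and estimating the normal parameters from the sample is my assumption about how the theoretical distribution is fitted.

```r
set.seed(42)  # hypothetical seed
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

n <- length(ndata)
sorted <- sort(ndata)

# x-axis: evenly spaced probability points for a sample of size n
prob_points <- ppoints(n)

# y-axis: cumulative probability of each ordered observation under a normal
# distribution whose mean and SD are estimated from the sample
cum_prob <- pnorm(sorted, mean = mean(ndata), sd = sd(ndata))

plot(prob_points, cum_prob,
     xlab = "Probability Points", ylab = "Cumulative Probability")
abline(a = 0, b = 1)  # reference line: perfect agreement with the normal

# Largest vertical gap between the points and the reference line
max_gap <- max(abs(cum_prob - prob_points))
print(max_gap)
```

Both axes are probabilities, so every point is confined to the unit square. The normal CDF is nearly flat in the tails, which means even a badly misplaced extreme value moves its point only slightly.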
The Q-Q Plot
Next, let’s draw a Q-Q plot using the same parameters:
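# `test` is assumed to be a data frame with ndata as a column
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_qq_band() + stat_qq_line() + stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")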
Interesting. In the Q-Q plot, points at both tails fall outside the 95% CI band of a theoretical normal distribution. A Q-Q plot magnifies deviations at the tails. Thus, an emerging scholar who pairs a Q-Q plot with a particular choice of normality tests could conclude that a residual or a variable did (or did not) follow a normal distribution.
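The Q-Q coordinates can be computed by hand in the same spirit. Again this is a base R sketch with a hypothetical seed, and the normal parameters are estimated from the sample as an assumption.

```r
set.seed(42)  # hypothetical seed
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

n <- length(ndata)

# y-axis: the ordered observations, in the original data units
sample_q <- sort(ndata)

# x-axis: quantiles of the fitted normal at evenly spaced probability points
theoretical_q <- qnorm(ppoints(n), mean = mean(ndata), sd = sd(ndata))

plot(theoretical_q, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(a = 0, b = 1)  # reference line: sample quantile equals theoretical quantile

# Gap between each observation and its expected position, in data units
gap <- abs(sample_q - theoretical_q)
print(round(head(gap, 3), 3))  # smallest observations (lower tail)
print(round(tail(gap, 3), 3))  # largest observations (upper tail)
```

Here the axes are in data units and are not bounded, so an extreme observation that sits far from its expected quantile appears at full size instead of being compressed toward a corner.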
It appears a P-P plot is best when used to explore extremely peaked distributions, while a Q-Q plot is best used to explore the influence of tails of a distribution.
Why is a P-P plot chosen more frequently at this school?
I corresponded with a methodologist at this university, and she shared a few thoughts:
- Many universities (and students) use SPSS in their coursework. In the regression menu option, there is a Probability Plot option box. If checked, it creates a P-P plot. Note: A Q-Q plot is not offered within the regression menu. See this link on how to create a Q-Q plot from regression residuals in SPSS.
- Field (2018) is used as the associated textbook when teaching SPSS in doctoral business programs. The author prominently discusses P-P plots in this version of the textbook. Note: He also covers Q-Q plots, but in a more subtle way, and the discussion is buried in a graphics section. When found, the author refers to an earlier discussion on quantiles and quartiles. In the R version of the book (Field et al., 2012), the Q-Q plot is referenced and there is no reference to a P-P plot.
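For comparison, a Q-Q plot of regression residuals takes only a few lines in R. The sketch below is illustrative only: the `mtcars` data set (which ships with R) and the model formula are my choices, not anything used at the university in question.

```r
# Fit an illustrative regression model on a data set bundled with R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals put the points on a common scale
res <- rstandard(fit)

# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(res, main = "Normal Q-Q plot of standardized residuals")
qqline(res)  # line through the first and third quartiles

# Base R also offers a Q-Q plot among its built-in regression diagnostics
plot(fit, which = 2)
```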
Student Notes: Don’t be a slave to a single author’s view: Expand your knowledge by reading different points of view. Don’t be a slave to a menu-based system: Learn about the statistical tests, how they are interpreted, and what the plots represent.
Almeida, A., Loy, A., & Hofmann, H. (2020, February 4). qqplotr: Quantile-quantile plot extensions for ‘ggplot2’. https://cran.r-project.org/web/packages/qqplotr/qqplotr.pdf
Field, A. (2018). Discovering statistics using IBM SPSS Statistics (5th ed.). SAGE Publications.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE Publications.
Gross, J., & Ligges, U. (2015, July 29). nortest: Tests for normality. https://cran.r-project.org/web/packages/nortest/nortest.pdf
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2020, June 19). ggplot2: Create elegant data visualisations using the Grammar of Graphics. https://cloud.r-project.org/web/packages/ggplot2/ggplot2.pdf