P-P plot vs. Q-Q plot…

I noticed a pattern at one university: students in the business program were using a P-P plot to examine the distribution of residuals in regression models, even though the Q-Q plot is the one widely referenced in statistics textbooks. So I looked deeper into the differences between P-P and Q-Q plots using simulated data.

Data Creation

First, I used R to create a normally distributed data set (N = 100) with M = 3.0 and SD = 1.0.

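set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)
# Data frame referenced as `test` by the plotting code and test output below
test <- data.frame(ndata = ndata)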

Review of Histogram

Next, I used ggplot2 (Wickham et al., 2020) to create a histogram of the data.
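A minimal sketch of how such a histogram could be drawn. The bin count, colours, and axis labels are my assumptions, not the settings behind the original figure, and the data are regenerated so the block runs on its own.

```r
library(ggplot2)

# Regenerate the simulated data so this block is self-contained
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)
test <- data.frame(ndata = ndata)

# Histogram of the simulated values; bins = 20 is an arbitrary choice
hist_plot <- ggplot(data = test, mapping = aes(x = ndata)) +
  geom_histogram(bins = 20, colour = "black", fill = "grey80") +
  labs(x = "Simulated value", y = "Count")

print(hist_plot)
```

Changing `bins` can make the same data look more or less normal, which is one reason a histogram alone is a weak basis for the decision.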

The data looks approximately normal; however, note the distance between the two tails and the other data points.

Tests of Normality

Next, I’ll perform a series of statistical tests to see if the data follows a theoretical normal distribution. For this illustration, I’ll use six different tests: Shapiro-Wilk test (found in the stats package, which is loaded automatically by R), Anderson-Darling, Cramer-von Mises, Kolmogorov-Smirnov w/Lilliefors correction, Pearson Chi-Square, and Shapiro-Francia (found in the nortest package; Gross & Ligges, 2015).

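library(nortest)  # provides ad.test(), cvm.test(), lillie.test(), pearson.test(), sf.test()

shapiro.test(ndata)  # Shapiro-Wilk (stats package)
ad.test(ndata)       # Anderson-Darling
cvm.test(ndata)      # Cramer-von Mises
lillie.test(ndata)   # Kolmogorov-Smirnov w/Lilliefors correction
pearson.test(ndata)  # Pearson Chi-Square
sf.test(ndata)       # Shapiro-Francia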

Shapiro-Wilk normality test

data: test$ndata
W = 0.99155, p-value = 0.02226

Anderson-Darling normality test

data: test$ndata
A = 0.72545, p-value = 0.05808

Cramer-von Mises normality test

data: test$ndata
W = 0.10841, p-value = 0.0859

Lilliefors (Kolmogorov-Smirnov) normality test

data: test$ndata
D = 0.042662, p-value = 0.07763

Pearson chi-square normality test

data: test$ndata
P = 62.88, p-value = 1.344e-06

Shapiro-Francia normality test

data: test$ndata
W = 0.99237, p-value = 0.03868

Interesting…three of the tests (Anderson-Darling, Cramer-von Mises, and Kolmogorov-Smirnov w/Lilliefors correction) did not reject the hypothesis that the data follow a theoretical normal distribution (p > .05), while the other three (Shapiro-Wilk, Pearson Chi-square, and Shapiro-Francia) did reject it (p < .05). What to do?
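One way to see a split like this at a glance is to collect the p-values in one place and compare each with the .05 cutoff. This is only a sketch: the data are regenerated from the seed and settings in the data creation code, so the values it prints may not match the output pasted above.

```r
library(nortest)

# Regenerate the simulated data so this block is self-contained
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

# The six tests, stored as functions so they can be applied in one pass
tests <- list(
  "Shapiro-Wilk"       = shapiro.test,
  "Anderson-Darling"   = ad.test,
  "Cramer-von Mises"   = cvm.test,
  "Lilliefors (K-S)"   = lillie.test,
  "Pearson chi-square" = pearson.test,
  "Shapiro-Francia"    = sf.test
)

# Each test returns an "htest" object; keep only its p-value
p_values <- sapply(tests, function(f) f(ndata)$p.value)

# reject = TRUE means the test rejects normality at the .05 level
results <- data.frame(p_value = round(p_values, 4), reject = p_values < 0.05)
print(results)
```

A table like this makes the disagreement explicit instead of leaving it buried in six separate printouts.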

One could pick a test and make a decision, but with the histogram and the conflicting tests side by side, a reader may see that decision as subjective. Let’s try plotting the data against a theoretical normal distribution.

The P-P Plot

Using ggplot2 and qqplotr (Almeida et al., 2020), I created a P-P plot based on the data and plotted a 95% CI band around the reference (AB) line:

library(ggplot2)
library(qqplotr)

# P-P plot: 95% CI band, reference line, then the data points
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_pp_band() +
  stat_pp_line() +
  stat_pp_point() +
  labs(x = "Probability Points", y = "Cumulative Probability")

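Note the “submarine sandwich” shape of the 95% CI band around the data. A P-P plot compares cumulative probabilities, which are bounded between 0 and 1, so it is most sensitive to departures in the middle of the distribution, such as skewness or asymmetry, and least sensitive in the tails. Thus, the region around the mode is magnified. If relying on a P-P plot, an emerging researcher could lean on some of the statistical tests to state that the distribution follows a normal distribution and use the P-P plot to support that conclusion.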
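To make concrete what a P-P plot draws, the coordinates can be computed by hand in base R. The sketch below assumes the normal parameters are estimated from the sample and uses `ppoints()` for the probability points; qqplotr may use slightly different conventions, so treat this as an illustration of the idea rather than its exact algorithm.

```r
# Regenerate the simulated data so this block is self-contained
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

n <- length(ndata)
sorted <- sort(ndata)

# x-axis: probability points, one per ordered observation
prob_points <- ppoints(n)

# y-axis: cumulative probability of each ordered observation under a
# normal distribution with the sample mean and SD
cum_prob <- pnorm(sorted, mean = mean(ndata), sd = sd(ndata))

# Both axes are confined to [0, 1], so tail points are squeezed into the corners
plot(prob_points, cum_prob,
     xlab = "Probability Points", ylab = "Cumulative Probability")
abline(a = 0, b = 1)  # reference line y = x
```

Because an extreme observation can only move its point a tiny distance near 0 or 1, a tail problem has little room to show up on this scale.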

The Q-Q Plot

Next, let’s plot a Q-Q plot using the same parameters:

library(ggplot2)
library(qqplotr)

# Q-Q plot: 95% CI band, reference line, then the data points
ggplot(data = test, mapping = aes(sample = ndata)) +
  stat_qq_band() +
  stat_qq_line() +
  stat_qq_point() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles")

Interesting. In the Q-Q plot, points at both tails fall outside the 95% CI band of a theoretical normal distribution. A Q-Q plot magnifies deviations at the tails. Thus, an emerging scholar looking at a Q-Q plot alongside certain tests of normality could decide that a residual or a variable did (or did not) follow a normal distribution.
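The Q-Q coordinates can be sketched by hand in the same way. This uses base R's `ppoints()`, `qnorm()`, and `qqline()`; as before, qqplotr's defaults may differ in detail, so this is an illustration under those assumptions.

```r
# Regenerate the simulated data so this block is self-contained
set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

n <- length(ndata)

# x-axis: quantiles a standard normal would produce at each probability point
theoretical <- qnorm(ppoints(n))

# y-axis: the ordered observations, in the data's own units
sample_q <- sort(ndata)

plot(theoretical, sample_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(ndata)  # reference line through the first and third quartiles

# Quantiles are unbounded, so neighbouring theoretical quantiles sit much
# farther apart in the tails than at the centre
range(diff(theoretical))
```

That stretching of the axis at both ends is what gives tail observations room to drift visibly away from the line.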

It appears a P-P plot is best when used to explore extremely peaked distributions, while a Q-Q plot is best used to explore the influence of tails of a distribution.
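This contrast can be checked on a sample whose tails are deliberately too heavy. The sketch below uses a Student's t distribution with 3 degrees of freedom as the heavy-tailed example (my choice for illustration, not part of the original analysis) and measures how far each point sits from the reference line in both plots.

```r
set.seed(1)
heavy <- rt(500, df = 3)   # heavy-tailed sample (assumed example)
n <- length(heavy)
p <- ppoints(n)

# Standardised order statistics, so the reference line is y = x in both plots
z <- (sort(heavy) - mean(heavy)) / sd(heavy)

pp_dev <- abs(pnorm(z) - p)   # distance from the line in the P-P plot
qq_dev <- abs(z - qnorm(p))   # distance from the line in the Q-Q plot

# Rank (1 = smallest observation, n = largest) where each plot deviates most
which.max(pp_dev)
which.max(qq_dev)

# Largest deviation on each scale
max(pp_dev)
max(qq_dev)
```

With tails this heavy, the largest Q-Q deviation would be expected at one of the most extreme ranks, while the P-P deviations can never exceed 1 because both axes are probabilities.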

Why is a P-P Plot chosen more frequently at this school?

I corresponded with a methodologist at this university and she shared a few thoughts:

  • Many universities (and students) use SPSS in their coursework. In the regression menu option, there is a Probability Plot option box. If checked, it creates a P-P plot. Note: A Q-Q plot is not offered within the regression menu. See this link on how to create a Q-Q plot from regression residuals in SPSS.
  • Field (2018) is used as the associated textbook when teaching SPSS in doctoral business programs. The author prominently discusses P-P plots in this version of the textbook. Note: He also covers Q-Q plots, but in a more subtle way, and the discussion is buried in a graphics section. When found, the author refers to an earlier discussion on quantiles and quartiles. In the R version of the book (Field et al., 2012), the Q-Q plot is referenced and there is no reference to a P-P plot.

Student Notes: Don’t be a slave to a single author’s view: Expand your knowledge by reading different points of view. Don’t be a slave to a menu-based system: Learn about the statistical tests, how they are interpreted, and what the plots represent.

References:

Almeida, A., Loy, A., & Hofmann, H. (2020, February 4). qqplotr: Quantile-quantile plot extensions for ‘ggplot2’. https://cran.r-project.org/web/packages/qqplotr/qqplotr.pdf

Field, A. (2018). Discovering statistics using IBM SPSS Statistics (5th ed.). SAGE Publications.

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE Publications.

Gross, J., & Ligges, U. (2015, July 29). nortest: Tests for normality. https://cran.r-project.org/web/packages/nortest/nortest.pdf

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2020, June 19). ggplot2: Create elegant data visualizations using the Grammar of Graphics. https://cloud.r-project.org/web/packages/ggplot2/ggplot2.pdf
