Bootstrapping…

I was asked by a colleague to review a nearly completed doctoral manuscript to opine on a chairperson’s recommendation to the student on how to address a small sample size. According to the student’s use of G*Power, a sample size of 67 was required (see below):

While the student received 67 responses, only 46 (68.7%) were usable. In an attempt to help the student, the chairperson recommended the student (a) randomly select 4 records from the 46, and (b) add them back to the sample (increasing the sample from 46 to 50). Why 4? Why not 3? 5? Why not duplicate the entire sample and have an N = 92? With an N = 92, one could find an r = .327 (assuming a SP = .95). The student couldn’t answer those questions as he was merely following instructions of the faculty. That is a topic for another day…

What could have the student done?

Option 1: Use what you have

Do the study with the 46 records. If one reduces their view of statistical power to that of the widely-referenced .80 associated with social science (see the seminal work of the late Jacob Cohen), an r = .349 (well below the estimated effect size) could still be found (see below):

Whatever effect size is found by the student, the value can be compared to the effect size hypothesized (r = .377), and differences explored by the student in the study. A learned faculty would suggest the student report 95% Confidence Intervals (CI), and compare the hypothesized effect size to the upper and lower CI. If the hypothesized values are within the range of the CI, then it’s probably a sampling error issue. If the hypothesized values are NOT within the range of the CI, either the hypothesized effect size was in error or something is unique in the sample and more exploration is needed.

Option 2: Bootstrapping

Bootstrapping uses sampling with replacement to allow an inference to be made about a population (Efron 1979). Bootstrapping is used to assess uncertainty by “resampling data, creating a set of simulated datasets that can be used to approximate some aspects of the sampling distribution, thus giving some sense of the variation that could be expected if the data collection process had be re-done” (Gelman et al., 2021, p. 74). Using bootstrapping, I’m hypothesizing that the test statistic will be within a 95% CI’s, but the CI’s will be wider than those of the original data set, and the distribution of effect sizes will approximate a normal distribution.

To illustrate, I simulated a 46-record data set with two variables that had an projected relationship of r = .25 using the faux package in R (DeBruine et al., 2021). Rather than assume a directional hypothesis, I used the standard NHST assumption (two-sided test). The relationship between the two variables was positively sloped, moderate in size, and statistically significant, r(44) = .398, p = .006, 95% CI (.122, .617).

Next, I bootstrapped three different sample sizes (50, 67, and 1000) using the boot package (Canty & Ripley, 2021). The N = 50 represented what the faculty was trying to recommend. The N = 67 was the originally hypothesized sample size, and the N = 1000 is the widely-used sample size used in bootstrapping. The following table displays the results –

Data SetEffect Size (r)95% Lower Confidence Interval95% Upper Confidence Interval
Simulated Data (N = 46).398.122.617
Bootstrapped Simulation (N = 50).360.105.646
Bootstrapped Simulation (N = 67).360.109.608
Bootstrapped Simulation (N = 1000).360.093.638
Since bootstrapping involves sampling, it’s possible the test statistic may vary between iterations. To address that, I set the R set.seed() command to “040121”

Note in the table how the confidence intervals change as the sample increases. A better way to view the differences is to look at the density function between the three bootstrapped samples –

Note how the distribution base ‘fattens” as the sample size increases from N = 50 to N = 67. Finally, note how the effect size distribution becomes normally distributed with an N = 1000. Regardless, there is always a chance that the true effect size is less than or equal to 0, as depicted by the lower CI meeting or crossing the dotted red line (representing r = 0).

I speculate the chairperson was trying to suggest bootstrapping to the student. Either the faculty didn’t have sufficient knowledge of bootstrapping to guide the student, or the concept of bootstrapping was above the ability of the student to comprehend. I also speculate that faculty was trying to address a high p-value issue. Since the calculation of p-value is based on the sample size, there is nothing a student or faculty can do when a sample size is small except focus on the size of the effect. Perhaps that is what truly was lost in translation. Students, and the faculty advising them, need to understand that it’s not the p-value that is necessarily important but the effect size.

I suspect faculty and students will see more and more low sample sizes over the next year or so as people are fatigued or disinterested in completing surveys (thanks to COVID-19). Students need to be prepared to find a larger population to sample to counter potentially lower than expected response rates.

References:

Canty, A., & Ripley, B. (2021, February 12). boot: Bootstrap functions. https://cran.r-project.org/web/packages/boot/boot.pdf

DeBruine, L., Krystalli, A., & Heiss, A. (2021, March 27). faux: Simulation for factorial designs. https://cran.r-project.org/web/packages/faux/faux.pdf

Efron, B. (1979). Bootstrap methods: Another look at the Jacknife. Annals of Statistics, 7(1), 1-26. https://doi.org/10.1214/aos/1176344552

Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and other stories. Cambridge University Press.

Code Snippet used in analysis:

library(tidyverse)
library(faux)
library(car)
library(boot)
# set random number seed to allow recreation
set.seed(040121)
# create a 46 record datset with an r = 0.25
df <- rnorm_multi(46, 2, r = 0.25)
# perform correlation with original 46 records
cor_value <- cor.test(df$X1, df$X2, meth = "pearson")
cor_value
# bootstrap with N = 50
set.seed(040121)
boot_example1 <- boot(df, 
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method='spearman', use = "complete.obs")
  },
  R = 50
)
summary(boot_example1)
boot.ci(boot_example1, type = c("norm", "basic", "perc", "bca")) 
plot(density(boot_example1$t))
abline(v = c(0, .35998, 0.1054, 0.6461),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# bootstrap with N = 67
set.seed(040121)
boot_example2 <- boot(df, 
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method='spearman')
  },
  R = 67
)
summary(boot_example2)
boot.ci(boot_example2, type = c("norm", "basic", "perc", "bca")) 
plot(density(boot_example2$t))
abline(v = c(0, .35998, 0.1091, 0.6078),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# bootstrap with N = 1000
set.seed(040121)
boot_example3 <- boot(df, 
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method='spearman')
  },
  R = 1000
)
summary(boot_example3)
boot.ci(boot_example3, type = c("norm", "basic", "perc", "bca")) 
plot(density(boot_example3$t))
abline(v = c(0, .35998, 0.0932, 0.6382),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# create three plots in one graphic
par(mfrow=c(1,3))
plot(density(boot_example1$t))
abline(v = c(0, .35998, 0.1054, 0.6461),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
plot(density(boot_example2$t))
abline(v = c(0, .35998, 0.1091, 0.6078),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
plot(density(boot_example3$t))
abline(v = c(0, .35998, 0.0932, 0.6382),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )

Luck, inadvertent omission, or lack of knowledge?

Johnson (2018) explored the willingness to hire people who were convicted of drug crimes. The scope of the study was limited to the Central Virginia region. To answer the first research question (How does the willingness to hire returning citizens by Central Virginia employers differ by position/job role in private sector, for-profit business firms?), the emerging scholar used descriptive measures (rather than inferential statistics).

Johnson stated that the null hypothesis could be rejected for three types of jobs: Unskilled, Semi-skilled Labor, and Skilled Labor. I suppose that the decision was based on point estimates above 50% (Figure 1) –

Figure 1. Willingness to hire by position/job role (Johnson, 2018, p. 56)

However, when rejecting null hypotheses based on sample data, confidence intervals must be considered. Based on the information provided by the emerging scholar in his study, there were 653,193 businesses in the sample frame. A quota sample of 635 was chosen (p. 35). Using R and the samplingbook package (Manitz, 2017), that equates to having a 95% CI of 3.89% (see below).

sample.size.prop(e = .0389, N = 653193, level = .95)

sample.size.prop object: Sample size for proportion estimate
With finite population correction: N=653193, precision e=0.0389 and expected proportion P=0.5

Sample size needed: 635

I then recreated the graphic using the ggplot2 package (Wickham et al., 2020), and added the 95% CI (Figure 2).

Figure 2. Recreated Willingness to hire by position/job role with 95% CI = 3.89%

Okay. I see it. However, only 105 complete responses were obtained, not the target sample of 635. Using the same method to calculate the 95% CI above, I backed into a 9.6% 95% CI (see below):

sample.size.prop(e = .096, N = 653193, level = .95)

sample.size.prop object: Sample size for proportion estimate
With finite population correction: N=653193, precision e=0.096 and expected proportion P=0.5

Sample size needed: 105

Thus, the 95% CI changed from a planned 3.83% to an actual 9.6%; a 2.5x increase in interval width. When overlaying the new 95% CI on the data, new perspectives emerge (Figure 3).

Figure 3 Willingness to hire by position/job role (Johnson, 2019, p. 56) with 95% CI = 9.6%

Visually, one can see that the emerging scholar is correct when stating that Semi-skilled Labor and Skilled Labor fall above the 50% line; even when accounting for the 95% CI. However, the error bar for Unskilled Labor (a) drops below 50%, and the error bar for Clerical Labor (d) rises above 50%. Should Unskilled Labor be omitted from the rejection? Should Clerical Labor be included in the rejection of the null hypothesis? It appears both a Type I and a Type II error occurred.

One note: The emerging scholar reported his findings as being similar to research performed in Sweden by Ahmed and Lang (2017). The authors wrote –

We found that ex-offenders were discriminated against in the occupations accounting clerk, cleaner, preschool teacher, restaurant worker, sales person, and software developer. However, we did not observe any statistically significant discrimination against ex-offenders in the occupations auto mechanic, enrolled nurse, and truck driver. 

Ahmed & Lang, 2017, p. 17

Well, they don’t now. Also…Virginia = Sweden? That may be a stretch…

Student Note: Descriptive statistics are not inferential statistics. Know the difference.

References:

Ahmed, A., & Lang, E. (2017). The employability of ex-offenders: A field experiment in
the Swedish labor market. IZA Journal of Labor Policy, 6(1), Article 6. https://doi.org/10.1186/s40173-017-0084-2

Johnson, R. (2018). Willingness of employers to hire individuals convicted of drug crimes in Central Virginia (Doctoral dissertation). ProQuest LLC. (13421921)

Manitz, J. (2017, May 21). samplingbook: Survey sampling procedures. https://cran.r-project.org/web/packages/samplingbook/samplingbook.pdf

Wickham, H., Chang, W., Henry, L., Pederson, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., Dunningham, D., & RStudio (2020, June 19). ggplot2: Create elegant data visualizations using the Grammar of Graphics. https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

P-P plot vs. Q-Q plot…

I noticed a pattern at one University. The students in the business program were using a P-P plot to examine the distribution of residuals in regression models, when a Q-Q plot is widely referenced in statistics textbooks. So I looked deeper into the differences between P-P and Q-Q plots with simulated data.

Data Creation

First, I used R to create a normally distributed data set (N = 50) with an M = 3.0 and an SD = 1.0.

set.seed(092920)
ndata <- rnorm(n = 100, mean = 3.0, sd = 1.0)

Review of Histogram

Next, I used ggplot2 (Wickham et al., 2020), to create a histogram of the data.

The data looks approximately normal; however, note the distance between the two tails and the other data points.

Tests of Normality

Next, I’ll perform a series of statistical tests to see if the data follows a theoretical normal distribution. For this illustration, I’ll use six different tests: Shapiro-Wilk test (found in the stats package, which is loaded automatically by R), Anderson-Darling, Cramer-von Mises, Kolmogorov-Smirnov w/Lilliefors correction, Pearson Chi-Square, and Shapiro-Francia (found in the nortest package; Gross & Ligges, 2015).

shapiro.test(ndata)
ad.test(ndata)
cvm.test(ndata)
lillie.test(ndata)
pearson.test(ndata)
sf.test(ndata)

Shapiro-Wilk normality test

data: test$ndata
W = 0.99155, p-value = 0.02226

Anderson-Darling normality test

data: test$ndata
A = 0.72545, p-value = 0.05808

Cramer-von Mises normality test

data: test$ndata
W = 0.10841, p-value = 0.0859

Lilliefors (Kolmogorov-Smirnov) normality test

data: test$ndata
D = 0.042662, p-value = 0.07763

Pearson chi-square normality test

data: test$ndata
P = 62.88, p-value = 1.344e-06

Shapiro-Francia normality test

data: test$ndata
W = 0.99237, p-value = 0.03868

Interesting…three of the tests (Anderson-Darling, Cramer-von Mises, and Kolmogorov-Smirnov w/Lilliefors correction) found the distribution to follow a theoretical normal distribution (p > .05), while three others (Shapiro-Wilk, Pearson Chi-square, and Shapiro-Francia) did not. What to do?

One could pick a test and make a decision, but the histogram and test may demonstrate to the reader that the decision was subjective. Let’s try to plot the data against a theoretical normal distribution.

The P-P Plot

Using ggplot2 and qqplotr (Almeida et al., 2020), I created a P-P plot based on the data and plotted a 95% CI band on the AB line –

ggplot(data = test, mapping = aes(sample = ndata)) +
stat_pp_band() +
stat_pp_line() +
stat_pp_point() +
labs(x = “Probability Points”, y = “Cumulative Probability”)

Note the “submarine sandwich” 95% CI band around the data. A P-P plot focuses on the skewness or asymmetry of the distribution. Thus, the mode is magnified. If relying on a P-P plot, an emerging researcher could rely on some of the statistical tests to state the distribution following a normal distribution and use a P-P plot to support that conclusion.

The Q-Q Plot

Next, let’s plot a Q-Q plot using the same parameters –

ggplot(data = test, mapping = aes(sample = ndata)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
labs(x = “Theoretical Quanitles”, y = “Sample Quantiles”)

Interesting. In the Q-Q plot, points at both tails deviate from the 95% CI of a theoretical normal distribution. A Q-Q plot magnifies deviations at the tails. Thus, if an emerging scholar was looking at a Q-Q plot with certain tests of normality, one could decide that a residual or a variable did (or did not) follow a normal distribution.

It appears a P-P plot is best when used to explore extremely peaked distributions, while a Q-Q plot is best used to explore the influence of tails of a distribution.

Why is a P-P Plot is chosen more frequently at this school?

I corresponded with a methodologist at this University and she shared a few thoughts –

  • Many universities (and students) use SPSS in their coursework. In the regression menu option, there is a Probability Plot option box. If checked, it creates a P-P plot. Note: A Q-Q plot is not offered within the regression menu. See this link on how to create a Q-Q plot from regression residuals in SPSS.
  • Field (2018) is used as the associated textbook when teaching SPSS in doctoral business programs. The author prominently discusses P-P plots in this version of the textbook. Note: He also covers Q-Q plots but in a more subtle way and the discussion is buried in a graphics section. When found, the author refers to an earlier discussion on quantiles and quartiles. In the R version of book (Field et al, 2012), the Q-Q plot is referenced and their is no reference to a P-P Plot.

Student Notes: Don’t be a slave to a single author’s view: Expand your knowledge by reading different points of view. Don’t be a slave to a menu-based system: Learn about the statistical tests, how they are interpreted, and what the plots represent.

References:

Almeida, A., Loy, A., & Hofmann, H. (2020, February 4). qqplotr: Quantile-quantile plot extensions for ‘ggplot2’. https://cran.r-project.org/web/packages/qqplotr/qqplotr.pdf

Field, A. (2018). Discovering statistics using IBM SPSS Statistics (5th Ed.). SAGE Publications.

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE Publications

Gross, J., & Ligges, U. (2015, July 29). nortest: Tests for normality. https://cran.r-project.org/web/packages/nortest/nortest.pdf

Wickham, H., Chang, W., Henry, L., Pederson, T. L., Takahshi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2020, June 19). ggplot2: Create elegant data visualizations using the Grammar of Graphics. https://cloud.r-project.org/web/packages/ggplot2/ggplot2.pdf