P-value > 1?

I was skimming ProQuest today to look for something new to discuss. There are many dissertations with incorrect statistical tests and misaligned research questions and research designs. But I found one from a university that, so far this year, has only one graduate (or at least only one graduate dissertation in ProQuest). In this study, the emerging scholar explored a company’s change in financial value pre- and post-merger or acquisition. While outside my domain, I’ve read enough about financial value creation to at least understand what was going on. Plus, I like reading about how time-series analyses are constructed. For a primer on time-series analysis in R, see Avril Chohan and Tavish Srivastava.

As I reviewed the emerging scholar’s data analysis, I began to grow concerned. I saw the term one-way analysis of variance (ANOVA) on p. 59. A one-way ANOVA is used to explore differences in a variable of interest across three or more categories. For example, if one were to examine the test scores of five 4th-grade courses, that would be a job a one-way ANOVA could handle. However, the observations in this study are not independent, and one of the critical assumptions of a one-way ANOVA is that the observations are independent. For example, the stock price of Google (Symbol: GOOGL) on September 15, 2021 is related to the stock price on both September 14, 2021 and September 16, 2021. In other words, an observation at one point in time is related to the observation at another point.

There are many data analysis approaches, statistical tests, and books written about the analysis of time-series data. Within the ANOVA family of tests, which is used to detect changes in the mean value of a variable of interest, a repeated-measures ANOVA could have been selected. A repeated-measures ANOVA, sometimes called an RMANOVA, tracks the mean value of the variable of interest over time. An RMANOVA, at the very least, should have been considered and attempted. So what does the wrong test have to do with the title of this blog entry?
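As a rough illustration only (simulated data and made-up variable names, not the dissertation’s), a repeated-measures ANOVA in R could be set up along these lines, with ten hypothetical companies whose value is measured pre-, at, and post-merger –

set.seed(1)
# hypothetical data: 10 companies, firm value measured at three periods
fake <- data.frame(
  company = factor(rep(1:10, each = 3)),
  period  = factor(rep(c("pre", "merger", "post"), times = 10)),
  value   = rnorm(30, mean = 100, sd = 10)
)
# repeated-measures ANOVA: repeated observations are nested within company
rm_model <- aov(value ~ period + Error(company/period), data = fake)
summary(rm_model)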

Three of the results of the one-way ANOVA, which I know is the wrong test, have p-values listed greater than 1!!! For the record, 11 of the 12 statistical tests had reported p-values > 1 (the highest being 8.94). A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. So does that mean the null hypothesis failed to be rejected with the force of 894% probability? And what does that say about the one result reported as p < .001?

I get it when faculty aren’t strong in quantitative data analysis. That’s the purpose of textbooks, the library, and the Internet. Maybe even reach out to a few colleagues to refresh a memory or two. But seriously? p > 1!!! If that doesn’t raise a red flag, don’t advise students performing quantitative work.

Reference:

Tuggle, J. R. (2021). Evaluating merger activity using a quantitative case study approach to aid in the determination if mergers and acquisitions are a strategic advantage to the creation of financial value for acquiring company shareholders (Doctoral dissertation, Lincoln Memorial University). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28650945)

Assessing organizational culture based on the words a company uses to describe itself…

Duke (2021) explored the association between organizational culture and financial statement fraud. What interested me was how organizational culture was measured. In this study, the emerging scholar apparently reviewed press releases and other publicly available documentation for each sample company and classified it into one of four culture groups –

  • Adhocracy
  • Clan
  • Hierarchy
  • Market

No statement was made by the emerging researcher about her classifications being reviewed by more seasoned, experienced domain experts (like faculty!). How does a reader assess construct validity? I guess it doesn’t matter. Also, there was no assumption or limitation statement acknowledging that the coding was subjective. I guess it’s “Trust me…”

In RQ1, each of these cultures was compared to nominally coded fraud (0 = No; 1 = Yes), and a Chi-square test was performed (2 x 5 = 10 cells). No significant association was found. That makes sense – N = 50. With a sample size of 50, only an effect size w in excess of .489 (which is large!) could be detected.
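For the curious, that detectable-effect threshold is easy to reproduce with the pwr package; the df = 4 below is my assumption based on the 2 x 5 table –

library(pwr)
# smallest detectable effect size w with N = 50, alpha = .05, power = .80,
# and df = (2 - 1) * (5 - 1) = 4; returns w of roughly .49
pwr.chisq.test(N = 50, df = 4, sig.level = .05, power = .80)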

Next, the emerging scholar performed another Chi-square test comparing the four classifications with five types of fraud. However, Table 2 (p. 59), which apparently shows the type of fraud by organizational classification (N = 60; some fraud incidents fall into more than one category, which is problematic), doesn’t reconcile with the data displayed in Tables 7-10 (N = 100 in each classification). If a Chi-square test is performed on the data in Table 2, there is no significant association. But the emerging scholar ran five Chi-square tests on the N = 100 data and found a statistically significant association.

What does this all mean? If you look deep enough, you can find something – especially if you get to interpret the values of the variables yourself. This is why I tell students: Don’t trust doctoral dissertations. Who says the respective emerging scholar or their faculty knows what they’re doing?

Reference:

Duke, K. (2021). Organizational culture and the relationship to financial statement fraud (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28315482)

When to quit reading…

Long (2021) studied “the relationship between internal control weaknesses and lower profitability” (p. ii). Sounds straightforward to me. Internal control weaknesses could be treated as a count of weaknesses, and profitability would be measured from the firm’s reporting. Count data normally follows a Poisson distribution, but perhaps it could be transformed. Profit data can be normalized through a log transformation. A correlation test or regression could then be performed.
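Something along these lines is what I would have expected – a minimal sketch with invented variable names (icw_count, profit) and simulated values, not the dissertation’s data –

set.seed(1)
firms <- data.frame(
  icw_count = rpois(100, lambda = 2),                # simulated count of control weaknesses
  profit    = rlnorm(100, meanlog = 10, sdlog = 1)   # simulated (right-skewed) profitability
)
firms$log_profit <- log(firms$profit)                # log transform to normalize profit
cor.test(firms$icw_count, firms$log_profit)          # Pearson correlation
summary(lm(log_profit ~ icw_count, data = firms))    # simple regression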


The emerging researcher explains later (pp. 2-5) that the focus is on Internal Control Weakness factors (whatever those are) that reduce Return on Net Operating Assets (RNOA) post-merger and acquisition. It appears this type of research was recommended by two accounting academics. Good thing there is an IT professional to perform this study!

To perform the analysis, Mergers and Acquisitions (M&A) and Internal Control Weakness (ICW) were dichotomized (0 = No; 1 = Yes). Companies were divided into four groups: Group 1 (M&A = No; ICW = No); Group 2 (M&A = No; ICW = Yes); Group 3 (M&A = Yes; ICW = No); and Group 4 (M&A = Yes; ICW = Yes). Then, the emerging scholar explains that three types of tests will be performed –

  • Paired Sample t-test (RNOA as DV)
  • Correlation
  • Multiple Regression

First, a paired-sample t-test is used to compare two measurements taken on the same subjects, typically at different points in time. What the student should have performed is a two-sample t-test, in which between-group differences are evaluated. Who reviewed this study? It doesn’t matter; no statistical differences were found. Could that be caused by the wrong test? Maybe. Could it be caused by sample size (an N = 119 was determined [p. 63], but only 38 companies were listed on pp. 83-85) or by significant differences in sample sizes between groups? Also maybe. The reason I answer maybe is that the emerging scholar failed to report descriptive statistics for the study. No group ns. Just the group Ms. Regardless, was it the wrong test? Absolutely! But I’m still scratching my head about why this test was done at all when it wasn’t the focus of the study. I speculate the emerging scholar “mimicked” another study without understanding what was going on, or was advised by faculty to do this.
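To see the difference, here is a small simulated example (made-up RNOA values for two independent groups of firms, not the study’s data) –

set.seed(1)
grp  <- factor(rep(c("No_ICW", "ICW"), each = 20))
rnoa <- c(rnorm(20, mean = 0.10, sd = 0.05), rnorm(20, mean = 0.07, sd = 0.05))
# a paired t-test assumes the same firms were measured twice -- not this design
t.test(rnoa[grp == "No_ICW"], rnoa[grp == "ICW"], paired = TRUE)
# an independent two-sample t-test is what a between-group comparison calls for
t.test(rnoa ~ grp)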

Second, the emerging scholar performed a regression analysis using Cash Flow and Board Size as IVs and RNOA as the DV. Nothing was significant. Finally, the emerging scholar performed two regression analyses using an unknown value related to Groups 1 and 3 and an unknown value related to Groups 2 and 4 as IVs, and (a) RNOA and (b) Cash Flow as DVs. Again, nothing was significant; however, the emerging scholar did identify that the “Control Groups” (Groups 1 and 3) coefficient was significant (p = 0.011) in one model. Unfortunately, I don’t know how to interpret a B = -0.999 when the actual values are neither described nor reported.

What’s funny is that a third regression analysis, using the same IVs, was performed – this time with Board Size as the DV. So these two questionable IVs can predict board size? What does that have to do with the study? Plus, don’t get me started about running tests of normality on categorical variables (see p. 77).

What happened here? I have no idea. I should have stopped reading at the paired samples t-test…

Reference:

Long, L. G. (2021). The effects of internal control weaknesses that undermine acquisitions (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28315391)

What could be found with 16 IVs and an N = 31?

At the time of writing this post, I’m looking at a recently introduced colleague’s doctoral dissertation. The faculty member went to the same school as I did, but several years later. It’s always interesting to look at people’s dissertations because they show where a person’s academic career began. This faculty member will be mentoring emerging scholars, so I always hope more knowledge has been acquired since. Why do I say that? Well…

This faculty member, as a doctoral student, explored the influence of what are described as critical business variables on the success of solo criminal law practitioners. In the study, the faculty member used an instrument that purportedly has been validated many times in many countries. I don’t feel like exploring that claim, so I skipped to the data analysis plan and results.

First, when determining a sample size, one estimates the desired effect size. An effect size is commonly based on prior research but can reflect other factors (e.g., practicality). In this study, the then emerging scholar reported the size of the population (N = 530) but stated that an a priori sample size calculation based on an expected effect size wasn’t needed since a census was planned. What? Surveying a population is fine; however, one should still have some idea of the expected effect size so that, if fewer than the required number of responses are returned, follow-up efforts can be initiated to reach the desired sample size. This issue will rear its ugly head later.

Second, a total of 16 research hypotheses were explored. Of the 16 IVs, 8 were ordinal, 4 were interval, and 4 were nominal. The DV was titled “Degree of Success” and was treated as an ordinal variable. The then emerging scholar decided to perform the analyses using the Kendall rank-order correlation coefficient (tau-b). This is fine and appropriate when two variables are not normally distributed. However, what about the nominal IVs? Kendall is not the right solution there. Those relationships should have been explored with group-comparison tests – parametric or non-parametric (e.g., a two-sample t-test or Mann-Whitney U). Wrong test…
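For illustration only (simulated values and invented variable names, not the dissertation’s data), here is how the two situations differ in R –

set.seed(1)
degree_success  <- sample(1:5, 31, replace = TRUE)   # ordinal DV (made up)
years_marketing <- sample(1:5, 31, replace = TRUE)   # ordinal IV (made up)
# ordinal IV vs. ordinal DV: a Kendall rank-order correlation is reasonable
cor.test(years_marketing, degree_success, method = "kendall")
# nominal IV (e.g., practice setting) vs. ordinal DV: compare groups instead,
# for example with a Wilcoxon rank-sum (Mann-Whitney) test
setting <- factor(sample(c("urban", "rural"), 31, replace = TRUE))
wilcox.test(degree_success ~ setting)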

Third, 16 IVs and 1 DV? Bonferroni, anyone? With 16 tests, a per-test p-value of .003125 (.05/16) would be required to keep the cumulative, family-wise Type I error rate at .05.
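In R, the adjustment is one line; the .037 below is the one nominally significant p-value reported later in the study –

0.05 / 16                                        # Bonferroni-adjusted alpha = .003125
# or adjust the observed p-value and keep alpha at .05:
p.adjust(0.037, method = "bonferroni", n = 16)   # .037 * 16 = .592, not significant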

So, what happened? Well, let’s count the issues –

  • Only 31 responses were received. Was this expected? I bet not. Normally, a social science researcher shoots for 80% statistical power (following Cohen’s convention that beta = 4 x alpha, so power = 1 - 4(.05) = .80). This study was severely under-powered from the start (power ~ .09). Perhaps if the then emerging scholar had calculated a sample size a priori… How big of a sample was needed? Well, to identify a moderate effect size at a p-value of .003125, about 155 observations were needed (see the quick power check after this list). I wonder how much time it would have taken to get the N = 31 closer to N = 155?
  • A p-value of .05 was used, not a p-value of .003125. Thus, any statistically significant results reported had a high probability of being a Type I error.
  • Guess what? One statistically significant result was reported (tau-b = .322, p = .037). I’m not going to list it, since it should be ignored.
  • There may be something to report in this study, but since descriptive statistics for the sample were NOT reported, it’s hard to tell.
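Here is the quick power check referenced above, using the pwr package and assuming a moderate Pearson-type correlation of .30 stands in for the “moderate effect size” –

library(pwr)
# power actually available with N = 31 at the adjusted alpha: roughly .09
pwr.r.test(n = 31, r = 0.30, sig.level = 0.003125)
# sample size needed for 80% power under the same assumptions: roughly 155
pwr.r.test(r = 0.30, sig.level = 0.003125, power = 0.80)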

I’m thinking about asking this faculty member for the data. Perhaps there is something there…who knows?

Bootstrapping…

I was asked by a colleague to review a nearly completed doctoral manuscript and opine on a chairperson’s recommendation to the student on how to address a small sample size. According to the student’s use of G*Power, a sample size of 67 was required (see below):

While the student received 67 responses, only 46 (68.7%) were usable. In an attempt to help, the chairperson recommended the student (a) randomly select 4 records from the 46 and (b) add them back to the sample (increasing it from 46 to 50). Why 4? Why not 3? Or 5? Why not duplicate the entire sample and have an N = 92? With an N = 92, one could find an r = .327 (assuming statistical power of .95). The student couldn’t answer those questions, as he was merely following the faculty member’s instructions. That is a topic for another day…

What could the student have done?

Option 1: Use what you have

Do the study with the 46 records. If one relaxes the target statistical power to the widely referenced .80 used in the social sciences (see the seminal work of the late Jacob Cohen), an r = .349 (well below the estimated effect size) could still be detected (see below):

Whatever effect size the student finds can be compared to the hypothesized effect size (r = .377), and any differences explored in the study. A learned faculty member would suggest the student report the 95% confidence interval (CI) and check whether the hypothesized effect size falls between its lower and upper bounds. If the hypothesized value is within the CI, the discrepancy is probably a sampling-error issue. If the hypothesized value is NOT within the CI, either the hypothesized effect size was in error or something is unique about the sample and more exploration is needed.
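As a sketch (invented column names x and y standing in for the student’s two variables, 46 simulated records), the CI check is straightforward –

set.seed(2)
dat <- data.frame(x = rnorm(46))
dat$y <- 0.35 * dat$x + rnorm(46)
obs <- cor.test(dat$x, dat$y)                          # Pearson r with a 95% CI
obs$conf.int
r_hyp <- 0.377                                         # hypothesized effect size
obs$conf.int[1] <= r_hyp & r_hyp <= obs$conf.int[2]    # TRUE -> likely sampling error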

Option 2: Bootstrapping

Bootstrapping uses sampling with replacement to allow an inference to be made about a population (Efron, 1979). Bootstrapping is used to assess uncertainty by “resampling data, creating a set of simulated datasets that can be used to approximate some aspects of the sampling distribution, thus giving some sense of the variation that could be expected if the data collection process had been re-done” (Gelman et al., 2021, p. 74). Using bootstrapping, I’m hypothesizing that the test statistic will fall within the 95% CIs, that the bootstrapped CIs will be wider than those of the original data set, and that the distribution of effect sizes will approximate a normal distribution.

To illustrate, I simulated a 46-record data set with two variables that had a projected relationship of r = .25 using the faux package in R (DeBruine et al., 2021). Rather than assume a directional hypothesis, I used the standard NHST assumption (a two-sided test). The relationship between the two variables was positively sloped, moderate in size, and statistically significant, r(44) = .398, p = .006, 95% CI [.122, .617].

Next, I bootstrapped the simulated data using the boot package (Canty & Ripley, 2021), running three different numbers of bootstrap replications (R = 50, 67, and 1000); each replication resamples the original 46 records with replacement. R = 50 mirrors the sample size the faculty member was trying to reach, R = 67 mirrors the originally hypothesized sample size, and R = 1000 is a widely used number of bootstrap replications. The following table displays the results –

| Data set | Effect size (r) | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- |
| Simulated data (N = 46) | .398 | .122 | .617 |
| Bootstrapped simulation (R = 50) | .360 | .105 | .646 |
| Bootstrapped simulation (R = 67) | .360 | .109 | .608 |
| Bootstrapped simulation (R = 1000) | .360 | .093 | .638 |
Since bootstrapping involves random resampling, the test statistic and CIs may vary between runs. To make the results reproducible, I set R’s set.seed() to 040121.

Note in the table how the confidence intervals change as the number of replications increases. A better way to view the differences is to look at the density functions of the three bootstrapped runs –

Note how the base of the distribution “fattens” as the number of replications increases from R = 50 to R = 67. Finally, note how the distribution of effect sizes becomes approximately normal with R = 1000. Regardless, there is always a chance that the true effect size is less than or equal to 0, which would be depicted by the lower CI meeting or crossing the dotted red line (representing r = 0).

I speculate the chairperson was trying to suggest bootstrapping to the student. Either the faculty member didn’t have sufficient knowledge of bootstrapping to guide the student, or the concept was above the student’s ability to comprehend. I also speculate that the faculty member was trying to address a high p-value issue. Since the p-value depends on the sample size, there is little a student or faculty member can do when a sample is small except focus on the size of the effect. Perhaps that is what truly was lost in translation. Students, and the faculty advising them, need to understand that it’s not the p-value that is necessarily important but the effect size.

I suspect faculty and students will see more and more small samples over the next year or so as people become fatigued by, or uninterested in, completing surveys (thanks to COVID-19). Students need to be prepared to find a larger population to sample in order to counter potentially lower-than-expected response rates.

References:

Canty, A., & Ripley, B. (2021, February 12). boot: Bootstrap functions. https://cran.r-project.org/web/packages/boot/boot.pdf

DeBruine, L., Krystalli, A., & Heiss, A. (2021, March 27). faux: Simulation for factorial designs. https://cran.r-project.org/web/packages/faux/faux.pdf

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1-26. https://doi.org/10.1214/aos/1176344552

Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and other stories. Cambridge University Press.

Code Snippet used in analysis:

library(tidyverse)
library(faux)
library(car)
library(boot)
# set the random-number seed so the simulation can be recreated
# (R reads 040121 as the number 40121; the leading zero is cosmetic)
set.seed(040121)
# create a 46-record data set with two variables and a population r = 0.25
df <- rnorm_multi(46, 2, r = 0.25)
# Pearson correlation on the original 46 records
cor_value <- cor.test(df$X1, df$X2, method = "pearson")
cor_value
# bootstrap with R = 50 replications (each replication resamples the original
# 46 records with replacement; the bootstrap statistic is a Spearman correlation)
set.seed(040121)
boot_example1 <- boot(df,
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method = "spearman", use = "complete.obs")
  },
  R = 50
)
boot_example1
boot.ci(boot_example1, type = c("norm", "basic", "perc", "bca")) 
# density of the bootstrap distribution, with reference lines at r = 0 (dashed red),
# the point estimate (solid blue), and the 95% CI bounds (dashed blue)
plot(density(boot_example1$t))
abline(v = c(0, 0.35998, 0.1054, 0.6461),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# bootstrap with R = 67 replications
set.seed(040121)
boot_example2 <- boot(df,
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method = "spearman")
  },
  R = 67
)
boot_example2
boot.ci(boot_example2, type = c("norm", "basic", "perc", "bca")) 
plot(density(boot_example2$t))
abline(v = c(0, .35998, 0.1091, 0.6078),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# bootstrap with R = 1000 replications
set.seed(040121)
boot_example3 <- boot(df,
  statistic = function(data, i) {
    cor(data[i, "X1"], data[i, "X2"], method = "spearman")
  },
  R = 1000
)
boot_example3
boot.ci(boot_example3, type = c("norm", "basic", "perc", "bca")) 
plot(density(boot_example3$t))
abline(v = c(0, .35998, 0.0932, 0.6382),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
# create three plots in one graphic
par(mfrow=c(1,3))
plot(density(boot_example1$t))
abline(v = c(0, .35998, 0.1054, 0.6461),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
plot(density(boot_example2$t))
abline(v = c(0, .35998, 0.1091, 0.6078),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )
plot(density(boot_example3$t))
abline(v = c(0, .35998, 0.0932, 0.6382),
       lty = c("dashed", "solid", "dashed", "dashed"),
       col = c("red", "blue", "blue", "blue")
       )