My oldest son asked me once why I blog about the quality of mentor-reviewed, student-completed research. I explained to him that the quality of research being published, social science or not, has been shown to be fraught with errors.

I want to quote from Dr. Gelman’s last paragraph –

Research is not being done for the benefit of author or the journal; it’s for the benefit of the readers of the journal, and ultimately for society. If you don’t want your work to be publicly discussed, you shouldn’t publish it. We make criticisms in public for the same reason that we write articles for publication, because we think this work is ultimately of relevance to people on the outside

Andrew Gelman

Wrong analysis, over analysis, or just confusion?

I was updating my ORCID iD and SSRN account information a few months ago and was searching for something to write about. I found a recently completed dissertation on the relationship between corporate social responsibility and brand equity. Since marketing, specifically brand equity, is my domain, I was interested. During my reading, four issues immediately popped into my head –

  • Descriptive statistics containing unimportant information
  • Exploratory factor analysis results being misinterpreted, while the important information was nowhere to be found
  • Decisions made about a scale based on an incomplete analysis
  • Statistical results normally used to reject a null hypothesis ignored, and an alternative test employed (with the results misinterpreted and not fully analyzed).

Misreporting/Under-reporting Descriptive Statistics

On page 62 of this study, the emerging scholar reported the M, SD, Skewness, and Kurtosis of each survey item. In the paragraph leading up to Table 6, the author states –

Considering the absolute value of skewness is less than +/- 3, it is acceptable but still highly skewed (Aminu & Shariff, 2014; Kline, 2015).

p. 62

The author makes a similar statement about the Kurtosis with a different value (+/- 10). The skewness and kurtosis, which relate to distributional properties, are irrelevant at the item level. The properties are important when the items are combined to form a scale that would be measured (see Table 7, p. 64 of the study). Did the emerging scholar report the M/SD or distributional properties of the scales actually used in the study? Nope.
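To make the point concrete, here is a minimal Python sketch of what reporting at the scale level could look like. The responses, the three-item scale, and the function names are all hypothetical (nothing here comes from the dissertation), and the moment-based formulas are only one of several conventions for skewness and kurtosis.

```python
from statistics import mean, stdev

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one of several conventions)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m4 = mean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0

# Hypothetical data: each row is one respondent, each column one 5-point item.
responses = [
    [4, 5, 4],
    [3, 4, 4],
    [5, 5, 4],
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [4, 3, 4],
    [5, 4, 5],
]

# Composite scale score = mean of a respondent's items on that scale.
scale_scores = [mean(row) for row in responses]

print(f"M = {mean(scale_scores):.2f}, SD = {stdev(scale_scores):.2f}")
print(f"skewness = {skewness(scale_scores):.2f}, "
      f"kurtosis = {excess_kurtosis(scale_scores):.2f}")
```

The descriptive statistics are computed on the composite scores, which are the variables that actually enter the hypothesis tests.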

Exploratory Factor Analysis Under-reporting/Misinterpretation

The purpose of an Exploratory Factor Analysis (EFA) is two-fold. First, an EFA can reduce a series of survey items to a small set of factors. Second, an EFA can provide support for convergent (construct) and divergent validity. The latter is demonstrated by reporting the dimensions (vertical column) based on the weight of the items (horizontal row). See an example below:

[Figure: example factor loading table. Source: Qualtrics]

The emerging scholar stated she performed an EFA, which is fine; however, information similar to the example was not reported. What was reported was –

The principal component analysis, including the oblimin rotation was conducted and verified the Kaiser-Meyer-Olkin (KMO) measures of .808 for Corporate Social Responsibility, .857 for Multi-Dimensional Brand Equity, and .855 for Overall Brand Equity. These indicate a meaningful factor analysis as they are nearing 1.0 and are much higher than the .7 minimum suggested (Stevens, 2012).

p. 64

Here’s where the misinterpretation begins. The KMO measure is used to assess whether the items are suitable for factor analysis; it does not support the meaningfulness of the analysis.

The emerging scholar performed both an EFA and reliability analysis on both survey instruments and their sub-dimensions. This was fine but somewhat redundant. An EFA would identify any potential non-reliable survey items by not loading the item in a dimension. Is somebody following a template or copying a process from another study??

Incomplete Scale Analysis

Composite scales were constructed. Box plots were used to identify outliers. Both outliers and extreme outliers were identified but retained; I suspect they were retained because of the small sample size (N = 66). Finally, both Kolmogorov-Smirnov (with Lilliefors correction) and Shapiro-Wilk tests of normality were used to assess whether the scales followed an approximately normal distribution (Spoiler alert: None did). The emerging scholar then decided to use nonparametric tests for hypothesis testing.

These two tests and others (e.g., Anderson-Darling, Cramer-von Mises) can be sensitive to outliers. Since the outliers (and extreme outliers) were retained, that decision may have caused the normality problem. Reviewing a Q-Q plot could have assisted in visualizing how the outliers influenced the distribution. Perhaps removing the outliers would have supported a decision to assume an approximately normal distribution. Who knows?
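As a rough illustration of the Q-Q idea, the sketch below computes the points of a normal Q-Q plot with nothing but the Python standard library and compares a sample with and without two extreme values. The scores are hypothetical, the plotting position (i - 0.5) / n is just one common convention, and the "largest deviation from the diagonal" summary is my own shortcut for what one would normally judge by eye on the plot.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical z, observed standardized value) pairs for a normal Q-Q plot."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        # Plotting position (i - 0.5) / n; other conventions exist.
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (value - m) / s))
    return points

def max_qq_deviation(sample):
    """Largest vertical distance between the Q-Q points and the 45-degree line."""
    return max(abs(observed - theoretical) for theoretical, observed in qq_points(sample))

# Hypothetical scale scores; the two added values stand in for retained outliers.
clean = [3.2, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.3]
with_outliers = clean + [1.0, 6.9]

print("max deviation without outliers:", round(max_qq_deviation(clean), 3))
print("max deviation with outliers:   ", round(max_qq_deviation(with_outliers), 3))
```

In this made-up example the two extreme values pull the tails of the plot well away from the diagonal, which is the kind of pattern a Q-Q plot makes visible before anyone commits to a nonparametric route.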

Over Analysis/Incorrect Analysis

In the hypothesis testing (pp. 68-71), it gets confusing. Really confusing…

For RQ1, Spearman’s Rank-Order Coefficient was used as the test statistic (probably due to the normality issue). The test was significant with a large/strong effect size (r_s = .692, p < .001). That was all that was needed. However, the emerging scholar then moved to perform simple regression (she states multiple regression, but it wasn’t), where an r² of 0.517 was calculated (r = .719). The validity of regression is dependent on several assumptions. One assumption is normally distributed errors. Without an analysis of the errors, there is no evidence that the linear model is valid. Why even do regression when one has already established a non-parametric relationship between variables? Also, the student makes reference to not being able to perform a t-test because of the non-parametric issues. A t-test?
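For readers who want to see what "analysis of the errors" involves, here is a minimal sketch of a simple least-squares fit that returns the residuals along with r². The CSR and brand equity scores are hypothetical and the helper name simple_ols is mine; the only numbers taken from the dissertation are r = .719 and r² = 0.517, which the last line checks against each other.

```python
from statistics import mean

def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return fit plus residuals."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
        "r_squared": 1.0 - ss_res / ss_tot,
    }

# Hypothetical composite scores for seven respondents.
csr = [2.1, 2.8, 3.0, 3.4, 3.9, 4.2, 4.6]
brand_equity = [2.5, 2.9, 3.6, 3.1, 4.0, 4.4, 4.3]

fit = simple_ols(csr, brand_equity)
print("r squared:", round(fit["r_squared"], 3))
print("residuals:", [round(e, 2) for e in fit["residuals"]])

# In simple regression r squared is the square of Pearson's r.
print("0.719 squared:", round(0.719 ** 2, 3))
```

The residuals are what would go into a Q-Q plot or a residual-versus-fitted plot; reporting r² without them says nothing about whether the assumptions hold.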

For RQ2, the emerging scholar performed multiple regression using the four sub-dimensions of CSR as IVs with the DV being composite Brand Equity. The purpose was to determine which of the four dimensions was “most associated” with brand equity. What does “most associated” mean? The strongest relationship?

The emerging scholar reported both Spearman’s correlation coefficient and regression unstandardized/standardized beta values, along with t values and p values. Why? Who knows? No regression diagnostics charts were included to substantiate that a linear model was appropriate.

In RQ3, which is the funniest one to me, the emerging scholar somewhat mirrored RQ2 but flipped the direction and used the three brand equity dimensions as the IVs and CSR as the DV. If one is using multiple regression, one is trying to establish or confirm some type of cause and effect. Does Brand Equity cause Corporate Social Responsibility? However, only Spearman correlation coefficient values were reported. Why not multiple regression like in RQ2? Through all this, where are controls for Type I errors (see Bonferroni correction)?
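For anyone unfamiliar with the Bonferroni correction, the sketch below shows the arithmetic. The p-values are hypothetical (they are not the dissertation's results) and the function name is mine; the correction itself is simply alpha divided by the number of tests in the family.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (raw p, adjusted p, significant?) for each test in a family."""
    m = len(p_values)
    threshold = alpha / m  # per-test significance level
    results = []
    for p in p_values:
        adjusted = min(1.0, p * m)  # adjusted p-values are capped at 1
        results.append((p, adjusted, p <= threshold))
    return results

# Hypothetical p-values from a family of four correlation tests.
for raw, adjusted, significant in bonferroni([0.004, 0.020, 0.030, 0.200]):
    print(f"p = {raw:.3f}  adjusted = {adjusted:.3f}  significant: {significant}")
```

With four tests the per-test threshold drops from .05 to .0125, so results that look significant one at a time may not survive the correction.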


Somebody got confused or lost along the journey, a standard template may have been used and boxes were being checked, the committee didn’t read the paper in detail, or the committee didn’t understand what was going on. The competence of committee members always concerns me. I just pointed out the highlights. If you review the emerging scholar’s data analysis plan you will find more comedy (I didn’t know a statistical test was calculating a Standard Deviation?).

More infrastructure supporting students and faculty is definitely needed.


Smith, M. (2021). The relationship between corporate social responsibility and brand equity within the business-to-business service sector (Doctoral dissertation, Columbia Southern University).

P-value > 1?

I was skimming ProQuest today to look for something new to discuss. There are many dissertations with incorrect statistical tests, and misaligned research questions and research designs. But I found one from a university that so far this year has only one graduate (or at least only one graduate dissertation in ProQuest). In this study, the emerging scholar explored a company’s change in financial value pre- and post-merger or acquisition. While outside my domain, I’ve read enough about financial value creation to at least understand what was going on. Plus, I like reading about how time-series analyses are constructed. For a primer on time-series analysis in R, see Avril Chohan and Tavish Srivastava.

As I reviewed the emerging scholar’s data analysis, I began to be concerned. I saw the term one-way analysis of variance (ANOVA) on p. 59. A one-way ANOVA is used to explore the differences in a variable of interest based on three or more categories. For example, if one were to examine test scores of five 4th-grade courses, that would be a task that a one-way ANOVA could handle. However, the observations in this study are not independent, and one of the critical assumptions of a one-way ANOVA is that the observations are independent. For example, the stock price of Google (Symbol: GOOGL) on September 15, 2021 is related to the stock price on both September 14, 2021 and September 16, 2021. In other words, an observation at one point is related to the observation at another point.
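One quick way to see this dependence is the lag-1 autocorrelation, which measures how strongly each observation tracks the one before it. The sketch below uses a short, made-up price series (these are not actual GOOGL prices) and a hand-rolled function of my own naming; a value well above zero is a warning sign that the independence assumption does not hold.

```python
from statistics import mean

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation: covariance of neighbors over total variance."""
    m = mean(series)
    numerator = sum(
        (series[t] - m) * (series[t - 1] - m) for t in range(1, len(series))
    )
    denominator = sum((value - m) ** 2 for value in series)
    return numerator / denominator

# Hypothetical daily closing prices that drift upward over ten trading days.
prices = [100.0, 101.5, 102.2, 101.8, 103.0, 104.1, 103.7, 105.2, 106.0, 105.5]

print("lag-1 autocorrelation:", round(lag1_autocorrelation(prices), 2))
```

Independent observations would give a value near zero; a trending series like this one does not.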

There are many data analysis approaches, statistical tests, and books written about the analysis of time-series data. In the ANOVA family of tests, which is used to detect changes in the mean value of the variable of interest, a repeated-measures ANOVA could be selected. A repeated-measures ANOVA, sometimes called RMANOVA, compares the mean value of the variable of interest over time. An RMANOVA, at least, should have been considered and attempted. So what does the wrong test have to do with the title of this blog entry?

Three of the results of the one-way ANOVA, which I know is wrong, have p-values listed greater than 1!!! For the record, 11 of the 12 statistical tests had reported p-values > 1 (the highest being 8.94). A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming a specific hypothesis (normally the null) is true. As a probability, it cannot exceed 1. Thus, does that mean that the null hypothesis was not rejected with the force of 894% probability? What does that say about the one result which was p < .001?
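A check this basic can even be automated. The sketch below is a hypothetical sanity check of my own (the test labels and most of the values are invented; only 8.94 comes from the dissertation) that flags any reported p-value falling outside the range a probability can take.

```python
def invalid_p_values(reported):
    """Return labels of reported p-values that fall outside the interval [0, 1]."""
    return [label for label, p in reported.items() if not 0.0 <= p <= 1.0]

# Hypothetical results table; 8.94 is the highest value reported in the study.
reported = {
    "test_01": 0.0004,
    "test_02": 8.94,
    "test_03": 1.27,
    "test_04": 0.61,
}

print("impossible p-values:", invalid_p_values(reported))
```

Anything the function returns is a reporting or transcription error, whatever the test behind it was.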

I get it when faculty aren’t strong in quantitative data analysis. That’s the purpose of textbooks, the library, and the Internet. Maybe even reach out to a few colleagues to refresh a memory or two. But seriously? p > 1!!! Don’t advise students performing quantitative work.


Tuggle, J. R. (2021). Evaluating merger activity using a quantitative case study approach to aid in the determination if mergers and acquisitions are a strategic advantage to the creation of financial value for acquiring company shareholders (Doctoral dissertation, Lincoln Memorial University). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28650945)

Assessing organizational culture based on words used by company to describe itself…

Duke (2021) explored the association between organizational culture and financial statement fraud. What was interesting to me was how organizational culture was measured. In this study, the emerging scholar apparently reviewed press releases and other publicly available documentation relating to companies, and classified each sample company into one of four culture groups –

  • Adhocracy
  • Clan
  • Hierarchy
  • Market

No statement was made by the emerging researcher about her classification being reviewed by more seasoned, experienced domain experts (like faculty!). How does a reader assess construct validity? I guess it doesn’t matter. Also, there was no assumption or limitation statement that the coding was subjective. I guess it’s “Trust me…”

In RQ1, each of these cultures was compared to nominally-coded fraud (0 = No; 1 = Yes), and a Chi-square test was performed (2 x 5 = 10 cells). No significant association. It makes sense – N = 50. With a sample size of 50, only a w in excess of .489 (which is large!) could be identified.
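For those who want to check a sensitivity figure like this, the sketch below solves for the smallest effect size w that a Chi-square test could detect. Two things are my assumptions rather than anything stated in the study: alpha = .05 and power = .80. The df = 4 follows from the 2 x 5 table, and to stay within the standard library the helper functions (all named by me) only handle even degrees of freedom, where the chi-square CDF has a closed form.

```python
import math

def chi2_cdf_even_df(x, df):
    """CDF of a central chi-square with EVEN df, via its closed-form Poisson sum."""
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def ncx2_cdf_even_df(x, df, lam, terms=150):
    """Noncentral chi-square CDF as a Poisson(lam / 2) mixture of central ones."""
    weight = math.exp(-lam / 2.0)
    total = 0.0
    for j in range(terms):
        total += weight * chi2_cdf_even_df(x, df + 2 * j)
        weight *= (lam / 2.0) / (j + 1)
    return total

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi] by bisection, assuming f changes sign on the interval."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_detectable_w(n, df, alpha=0.05, power=0.80):
    """Smallest Cohen's w detectable with the given n, df, alpha, and power."""
    critical = bisect(lambda x: chi2_cdf_even_df(x, df) - (1.0 - alpha), 0.0, 100.0)
    # Noncentrality parameter at which the test reaches the target power.
    lam = bisect(
        lambda l: (1.0 - ncx2_cdf_even_df(critical, df, l)) - power, 0.0, 60.0
    )
    return math.sqrt(lam / n)  # lam = n * w**2

print("minimum detectable w:", round(min_detectable_w(n=50, df=4), 3))
```

Under those assumptions the result should land close to the .489 quoted above; a different alpha or power target would move it.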

Next, the emerging scholar performed another Chi-square test comparing the four classifications with 5 types of fraud; however, Table 2 (p. 59), which apparently shows the type of fraud by organizational classification (N = 60; some fraud incidents fall into more than 1 category, which is problematic), doesn’t reconcile with data displayed in Tables 7-10 (N = 100 in each classification). If a Chi-square test was performed on data found in Table 2, there is no significant association. But, the emerging scholar did five Chi-square tests on the N = 100 data and found a statistically significant association.

What does this all mean? If you look deep enough you can find something; mainly, if you get to interpret the values of the variables. This is why I tell students: Don’t trust doctoral dissertations. Who says the respective emerging scholar or their faculty know what they’re doing?


Duke, K. (2021). Organizational culture and the relationship to financial statement fraud (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28315482)

When to quit reading…

Long (2021) studied “the relationship between internal control weaknesses and lower profitability” (p. ii). Sounds straightforward to me. Internal control weaknesses could be interval and represent the count of weaknesses, and profitability would be measured by the firm reporting. Count data normally follows a Poisson distribution, but perhaps it could be transformed. Profit data can be normalized through log transformation. A correlation test or regression could be performed.


The emerging researcher explains later (pp. 2-5), that the focus is on Internal Control Weakness factors (whatever that is), that reduce the Return on Net Operating Assets (RNOA) post-merger and acquisition. It appears this type of research was recommended by two accounting academics. Good thing there is an IT professional to perform this study!

To perform the analysis, Mergers and Acquisition (M&A) and Internal Control Weakness (ICW) were dichotomized (0 = No; 1 = Yes). Companies were divided into four groups: Group 1 (M&A = No; ICW = No); Group 2 (M&A = No; ICW = Yes); Group 3 (M&A = Yes; ICW = No); and Group 4 (M&A = Yes; ICW = Yes). Then, the emerging scholar explains that three types of tests will be performed –

  • Paired Sample t-test (RNOA as DV)
  • Correlation
  • Multiple Regression

First, a paired-sample t-test is used to evaluate the same units at different points in time. What the student should have performed is a two-sample t-test, where between-group differences are evaluated. Who reviewed this study? It doesn’t matter, no statistical differences were found. Could that be caused by the wrong test? Maybe. Could it be caused by sample size (an N = 119 was determined [p. 63], but only 38 companies were listed on pp. 83-85), or significant differences in sample sizes between groups? Also, maybe. The reason I answer maybe is that the emerging scholar failed to report descriptive statistics for the study. No Group n. Just the Group M. Regardless, was it the wrong test? Absolutely! But, I’m still scratching my head about why do this test when it wasn’t the focus of the study. I speculate the emerging scholar “mimicked” another study without understanding what was going on, or was advised by faculty to do this.
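The difference between the two tests is easy to see in code. In the sketch below, the RNOA values, group sizes, and function names are all hypothetical; I use Welch's version of the two-sample t statistic because it does not assume equal variances or equal group sizes, which is a design choice of mine and not something from the study.

```python
import math
from statistics import mean, stdev, variance

def paired_t(before, after):
    """Paired t statistic: needs one before/after pair for each unit."""
    if len(before) != len(after):
        raise ValueError("a paired test needs matched observations on the same units")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

def welch_t(group_a, group_b):
    """Welch's two-sample t statistic and approximate df for independent groups."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical RNOA values for two groups made up of DIFFERENT companies.
no_icw = [0.12, 0.09, 0.15, 0.11, 0.08, 0.14, 0.10]
icw = [0.07, 0.05, 0.11, 0.06]

t, df = welch_t(no_icw, icw)
print(f"Welch t = {t:.2f}, df = {df:.1f}")

# The paired test has nothing to pair here: the groups are different companies.
try:
    paired_t(no_icw, icw)
except ValueError as error:
    print("paired test refused:", error)
```

Even if the group sizes happened to match, pairing company A with an unrelated company B would be meaningless, which is why the design, not the software, decides which test applies.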

Second, the emerging scholar performed regression analysis using Cash Flow and Board Size as IVs and RNOA as the DV. Nothing was significant. Finally, the emerging scholar performed two regression analyses using an unknown value related to Groups 1 and 3 and an unknown value related to Groups 2 and 4 as IVs, and (a) RNOA and (b) Cash Flow as DVs. Again, nothing was significant; however, the emerging scholar did identify that the “Control Groups” (Groups 1 and 3) coefficient was significant (p = 0.011) in one model. Unfortunately, I don’t know how to interpret a B = -0.999 when the actual values are not described or reported.

What’s funny is that a third regression analysis, using the same IVs, was performed. This time with Board Size as a DV. So, these two questionable IVs can predict board size? What does that have to do with the study? Plus, don’t get me started about the performance of tests of normality on categorical variables (see p. 77).

What happened here? I have no idea. I should have stopped reading at the paired samples t-test…


Long, L. G. (2021). The effects of internal control weaknesses that undermine acquisitions (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (28315391)