What could be found with 16 IVs and an N = 31?

At the time of writing this post, I’m looking at a recently introduced colleague’s doctoral dissertation. The faculty member went to the same school I did, but several years later. It’s always interesting to look at people’s dissertations because they show where a person’s academic career began. This faculty member will be mentoring emerging scholars, so I always hope more knowledge has been acquired since. Why do I say that? Well…

As a doctoral student, this faculty member explored the influence of what are described as critical business variables on the success of solo criminal law practitioners. The study used an instrument that purports to have been validated many times in many countries. I don’t feel like exploring that claim, so I skipped to the data analysis plan and results.

First, when determining a sample size, one estimates the desired effect size. An effect size is commonly based on prior research but can reflect other factors (e.g., practicality). In this study, the then emerging scholar reported the size of the population (N = 530) but stated that an a priori, effect-size-based sample size calculation wasn’t needed because a census was planned. What? Surveying a population is fine; however, one should still have some idea of the expected effect size so that, if fewer than the required number of responses are returned, follow-up efforts can be initiated to reach the required sample size. This issue will rear its ugly head later.

Second, a total of 16 research hypotheses were explored. Of the 16 IVs, 8 were ordinal, 4 were interval, and 4 were nominal. The DV was titled “Degree of Success” and was treated as an ordinal variable. The then emerging scholar performed the analyses using the Kendall rank-order correlation coefficient (tau-b). For a quick review of this technique, see link. That is fine and appropriate when two ordinal variables are not normally distributed. However, what about the 4 IVs that are nominal? Kendall’s tau is not the right tool. Those relationships should have been explored with group-comparison tests, parametric or non-parametric. Wrong test…
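To make the distinction concrete (with invented data, not the study’s): SciPy’s `kendalltau` defaults to tau-b and suits two ordinal variables, while a two-category nominal IV against an ordinal DV calls for a group-comparison test such as Mann-Whitney U.

```python
from scipy.stats import kendalltau, mannwhitneyu

# Two ordinal variables (e.g., Likert-style ranks) -- tau-b is appropriate
experience_rank = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
success_rank = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
tau, p = kendalltau(experience_rank, success_rank)  # tau-b (handles ties) by default
print(f"tau-b = {tau:.3f}, p = {p:.4f}")

# A two-category nominal IV against the ordinal DV: compare the groups instead
success_group_a = [2, 3, 3, 4, 5]  # hypothetical "Degree of Success" scores, group A
success_group_b = [1, 1, 2, 2, 3]  # group B
u, p_u = mannwhitneyu(success_group_a, success_group_b, alternative="two-sided")
print(f"U = {u}, p = {p_u:.4f}")
```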

Third, 16 IVs and 1 DV? Bonferroni, anyone? With 16 tests, a p-value of .003125 (.05/16) would be required to keep the family-wise Type I error rate at 5% across the cumulative hypothesis tests.

So, what happened? Well, let’s count the issues –

  • Only 31 responses were received. Was this expected? I bet not. A social science researcher normally shoots for 80% statistical power (1 − (4 × α)). This study was severely under-powered from the start (power ≈ .09). Perhaps if the then emerging scholar had calculated a sample size a priori… How big a sample was needed? To identify a moderate effect size at a p-value of .003125, about 155 observations. I wonder how much time it would have taken to move N = 31 closer to N = 155.
  • A p-value of .05 was used, not .003125. Thus, any statistically significant result reported had a high probability of being a Type I error.
  • Guess what? One statistically significant result was reported (tau-b = .322, p = .037). I’m not going to dwell on it, since it should be ignored.
  • There may be something to report in this study, but since descriptive statistics for the sample were NOT reported, it’s hard to tell.
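The ~155 figure is easy to reproduce with the standard Fisher-z approximation for powering a correlation test, assuming a “moderate” effect of r = .30, a two-tailed alpha of .003125, and 80% power (my assumptions; the dissertation does not show this calculation).

```python
import math
from statistics import NormalDist

def n_for_correlation(r: float, alpha: float, power: float) -> int:
    """Approximate N needed to detect correlation r, via Fisher's z transform."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

n = n_for_correlation(r=0.30, alpha=0.003125, power=0.80)
print(n)  # on the order of 150-155 observations
```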

I’m thinking about asking for this faculty member’s data. Perhaps there is something there…who knows?

Alignment of themes to research question…

I started writing this blog post about priming interviewees in qualitative research. Once I got into writing, however, I realized I had simply found another poorly performed qualitative study. Still, I did want to discuss aligning researcher-derived themes with research questions. Here’s the study –

Job Satisfaction and Job-Related Stress among NCAA Division II Athletic Directors in
Historically Black Colleges and Universities

Name withheld (but you can search for the study)

I’ve been involved with many students exploring job satisfaction and job-related stress in a variety of industries, but I’d never heard of a study on this topic in university athletic directors (ADs). What surprised me was that the study wasn’t quantitative; it was qualitative.

The emerging scholar’s overarching research question was –

What strategies do ADs at HBCUs implement to manage departments with limited resources?

p. 14

What does the phrase ‘limited resources’ mean? It would seem some form of quantitative measure would be needed to separate athletic departments into categories based on resources. However, I found this sentence –

…there was an assumption that HBCU athletic directors would experience job dissatisfaction and
job-related stress due to decreased funding, inadequate facility management, and
inconsistent roster management

p. 19

Wow! This statement makes it easy for a researcher…just assume something is happening, whether true or not.

Now, a quick note about priming. The interview guide can be found in Appendix C of the dissertation. Honestly, it’s not really an interview guide. The student employed the ‘oral survey’ Q&A approach often suggested by faculty who have a limited understanding of qualitative data collection methodologies. Rather than critique the self-described “interview questions,” I will point out one issue –

Q3 – What strategies have you implemented to motivate your staff and thereby increase
job satisfaction?

p. 133

This question requires the interviewee to –

  • Understand the word strategy or, at a minimum, understand the researcher’s definition of the term
  • Differentiate a strategy from a tactic
  • Reflect on how a strategy has been specifically applied to or influenced staff motivation
  • Reflect on staff responses to the strategy and subjectively estimate its influence on their own level of job satisfaction

In other words, the emerging scholar placed the responsibility for the study’s results on the interviewees’ responses, not on the interpretation of those responses. Ugh!

What would have happened if the emerging scholar simply started with –

  • How do you motivate your employees?
  • How do your employees respond to the techniques you employ to motivate?
  • When do you decide to change methods?

This approach allows the interviewees to describe the methods they use to motivate employees, which the emerging scholar could then analyze as strategies or tactics. Each motivational technique could be explored in depth through follow-up questions and subsequently tied back to the literature. Next, the emerging scholar could explore with the interviewee, in depth, the responses by employees: Did the descriptions provided align with expectations found in the literature? Finally, discussing changes in methods, and their impetus, could yield alignment with the research question.

When I finally got to the themes, I chuckled:

  • Shared responsibility – “participants believed the workplace demands they face daily do not allow them to have the ability to make all decisions for the department. Having shared responsibilities among other leaders within the department was essential for each athletic director” (p. 97). Every job has some level of work demand. Some demands stem from a lack of resources (e.g., human capital); some do not (e.g., heavy lifting). In the academic literature, sharing responsibility within an organizational unit is a tenet of work-based teams. It would seem the study participants are simply employing widely referenced management techniques. However, since the emerging scholar assumed all HBCU ADs face limited resources, this had to be a theme.
  • Empowering staff – The emerging scholar didn’t describe the meaning of this phrase; rather, paraphrased material from external sources was listed (two of the cited sources don’t appear in the References). However, similar to shared responsibility, employee empowerment is an oft-studied topic in the literature.
  • Limited resources to grow facilities – The term ‘resources’ in this context relates to financial resources. ADs are often held accountable for promotion of their programs; however, how much of that job is part of their normal duties? Based on how the emerging scholar phrased the research question, this theme is not aligned with the research question.
  • Limited female participation – The emerging scholar delved into gender equity, the recruitment of women to play sports, and the balance between male and female athletes. This topic relates to recruitment, and probably says more about society than about management…again, unrelated to the research question.

In the emerging scholar’s biography, she stated that she works for an HBCU athletic department, so I acknowledge the interest. She also stated that she would like to pursue an athletic department job. That’s great! If you, too, are an emerging researcher and come across this study, that’s fine…just be wary of citing these results. Redo the research.

Face Validity…

Yesterday, I briefly discussed face validity in the context of a student creating an instrument to measure a latent variable (e.g., usefulness, intention). Someone read my post and sent me an email asking, “How would I measure face validity?” Well, face validity can’t be measured. Face validity answers the question – “Does this test, on its face, measure what it says it measures?” In other words, face validity is the perception that the test is appropriate or valid.

Why is face validity important? Rogers (1995) posited that if test takers believe a test does not have face validity, they will take it less seriously and hurry through; conversely, if they believe the test has face validity, they will make a conscientious effort to answer honestly.

I advise students (and faculty) to impanel a few experts in the domain under study and get their thoughts on whether the pool of items in the test appear to measure what is under study. If they agree, the first hurdle is passed. The next hurdle is to perform an exploratory factor analysis.

I emphasized the word pool in the prior paragraph for a reason. Developing a valid survey instrument takes time. One of the most time-consuming tasks is creating a pool of items that appear to form a measurable dimension. One has to create a pool because, until the survey instrument is distributed, feedback is received, and an exploratory factor analysis is performed, there is no way to confirm which items strongly form a construct. For example, to arrive at the 36-item Job Satisfaction Survey, Spector (1985) reported starting with 74 items.
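Spector’s winnowing from 74 items to 36 is exactly what the EFA step is for. A minimal, synthetic sketch of that first pass (invented data; a two-factor structure is deliberately built in so the eigenvalue check has something to find):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical respondents

# Synthetic responses: items 1-3 load on one latent factor, items 4-6 on another
f1, f2 = rng.normal(size=(2, n))
items = np.column_stack(
    [0.8 * f1 + 0.6 * rng.normal(size=n) for _ in range(3)]
    + [0.8 * f2 + 0.6 * rng.normal(size=n) for _ in range(3)]
)

# First pass: eigenvalues of the item correlation matrix
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser criterion: retain factors with eigenvalue > 1
n_factors = int((eigenvalues > 1).sum())
print(eigenvalues.round(2), "->", n_factors, "factors retained")
```

In a real analysis this would be followed by rotation and a look at the per-item loadings, which is where weak items get cut from the pool.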

References:

Rogers, T. B. (1995). The psychological testing enterprise: An introduction. Brooks-Cole.

Spector, P. E. (1985). Measurement of human service staff satisfaction: Development of the Job Satisfaction Survey. American Journal of Community Psychology, 13(6), 693-713. https://doi.org/10.1007/bf00929796

Bonferroni correction…

When using Null Hypothesis Significance Testing (NHST), a researcher states that the null hypothesis will be rejected if a certain likelihood criterion is met; most often p < .05. A p-value of .05 equates to a 1-in-20 chance that the null hypothesis will be rejected in error (a Type I error). But what happens if a DV is tested 9 times? One option is to employ a p-value correction method attributed to the Italian mathematician Carlo Bonferroni (the Bonferroni correction). Under the Bonferroni correction, to maintain a 5% likelihood of a Type I error across multiple tests, one divides the alpha level (.05) by the number of tests.
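The arithmetic for the 9-test case is simple (the p-values below are hypothetical, just to show the decision rule):

```python
# Bonferroni: divide the family-wise alpha by the number of tests
alpha = 0.05
n_tests = 9
adjusted_alpha = alpha / n_tests
print(round(adjusted_alpha, 4))  # 0.0056

# Hypothetical observed p-values checked against the corrected threshold
for p in (0.003, 0.04):
    decision = "reject H0" if p < adjusted_alpha else "fail to reject H0"
    print(f"p = {p}: {decision}")
```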

Schvom (2019) explored customer satisfaction in the U.S. airline industry. The emerging scholar focused on two types of analysis: differences in service elements (a survey item-level metric) by group, and customer satisfaction by group. There were 9 different service elements. This is when a p-value correction method should be employed.

In this situation, a learned committee member should have advised the emerging scholar to review the Bonferroni correction and its application to avoid inflating the experiment-wise error rate. In this example, the emerging scholar should have used p = .0056 as the significance threshold rather than p = .05. Thus, a Type I error was made in H2B when the null hypothesis for On-Time Arrival satisfaction by Gender was rejected (p = .009); however, the null for satisfaction with Number of Layovers by Gender (p = .001) was correctly rejected (p. 29; see Appendix F). Was Bonferroni advised and simply not discussed by the student? Or was this a review/oversight error by the student, faculty, and University quality control? Who knows? I usually look at the size of the effect, not the p-value, because the p-value is influenced by the sample size.

Note: Admittedly, the Bonferroni correction is the easiest for faculty to explain and, in my opinion, for business students to understand. However, there are many methods of addressing family-wise error. See the sequential testing methodologies of Dunn (1959, 1961), Šidák (1967), and Holm (1979).
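Of those, Holm’s (1979) step-down procedure is the easiest upgrade: sort the p-values, test the k-th smallest against α/(m − k + 1), and stop at the first failure. A minimal sketch:

```python
def holm_reject(p_values, alpha=0.05):
    """Return a reject/fail decision for each p-value using Holm's step-down method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value against alpha / (m - rank)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Hypothetical p-values from 5 tests
print(holm_reject([0.004, 0.030, 0.002, 0.200, 0.011]))
# -> [True, False, True, False, True]
```

Note that 0.011 is rejected here even though plain Bonferroni (.05/5 = .01) would miss it; that is the sense in which Holm is uniformly more powerful.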

References:

Dunn, O. J. (1959). Confidence intervals for the means of dependent, normally distributed variables. Journal of the American Statistical Association, 54(287), 695-698. https://doi.org/10.1080/01621459.1959.10501524

Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52-64. https://doi.org/10.1080/01621459.1961.10482090

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70. https://www.jstor.org/stable/4615733

Schvom, A. F. (2019). A critical evaluation of service elements related to customer satisfaction in the U.S. Airline industry (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (13856698)

Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626-633. https://doi.org/10.1080/01621459.1967.10482935

Pilot Studies…

I recently had the opportunity to consult about the use of pilot studies. A colleague brought to my attention a student-developed survey instrument that (allegedly) measured three dimensions –

  • Perceived Usefulness
  • Perceived Ease of Use
  • Intention to Use

These three dimensions are normally associated with the Technology Acceptance Model (TAM), developed by Davis (1989) and modified since then.

These three dimensions are often modified to focus on a specific technology. However, the student did something I had never seen before: he measured six different types of technology under the banner of the “technologies of the next industrial revolution.” Rather than use a validated instrument as a base and modify the subject of each item, he wrote new items and allocated one item to each of the six types of technology.

Rather than go deeper into psychometrics and why the student’s approach was flawed, let’s get back to topic.

The purpose of a pilot study is to examine the feasibility of an approach before a large-scale study is performed. Pilot studies are common in both qualitative and quantitative research. Since the student created a new instrument, a pilot study was warranted. The student should have surveyed between 50 and 100 people to validate the a priori face-validity properties via exploratory factor analysis.

What did the student do? He surveyed 13 people to ask whether the items were valid based on the wording!?!? What type of validity was he seeking? Face validity? Language validity?

Cut to the end…the survey instrument failed confirmatory factor analysis (Tucker-Lewis Index < .90), an exploratory factor analysis resulted in the elimination of three items due to loadings < .50, and the resulting 15-item survey produced three dimensions that made no sense (i.e., the a priori face validity was incorrect). What was left was an attempt to salvage the student’s research…an endeavor I dislike but understand why it’s done.
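The loading cutoff itself is mechanical; the judgment is in what the survivors mean. With hypothetical loadings (not the student’s), the .50 screen looks like this:

```python
# Hypothetical EFA loadings for a draft instrument; .50 is the cutoff used above
loadings = {
    "item_01": 0.78, "item_02": 0.61, "item_03": 0.46,
    "item_04": 0.55, "item_05": 0.32, "item_06": 0.70,
}

# Keep items whose absolute loading meets the threshold
retained = {item: load for item, load in loadings.items() if abs(load) >= 0.50}
dropped = sorted(set(loadings) - set(retained))
print(f"retained {len(retained)} items; dropped {dropped}")
```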

Where were the red flags? I see three –

  • Attempting to measure usefulness, ease of use, and intention to use six different types of technology with one item each. (Buy a book! I prefer Furr, but also have Miller and Lovler.)
  • Failing to impanel a group of experts to evaluate the face validity of each item
  • Failing to understand the purpose of a pilot study

Who’s to blame? The three issues mentioned above fall directly in the lap of the faculty and the University. Not everybody has the skills and knowledge to advise doctoral students. Some need to focus on their strength: teaching in their domain.