Face Validity…

Yesterday, I briefly discussed face validity in the context of a student creating an instrument to measure a latent variable (e.g., usefulness, intention). Someone read my post and sent me an email asking, “How would I measure face validity?” Well, face validity can’t be measured. Face validity answers the question: “Does this test, on its face, measure what it says it measures?” In other words, face validity is the perception that the test is appropriate or valid.

Why is face validity important? Rogers (1995) posited that if test takers believe a test does not have face validity, they would take it less seriously and hurry through; conversely, if they believe the test to have face validity, the test takers would make a conscientious effort to answer honestly.

I advise students (and faculty) to impanel a few experts in the domain under study and get their thoughts on whether the pool of items in the test appears to measure what is under study. If they agree, the first hurdle is passed. The next hurdle is to perform an exploratory factor analysis.

I emphasized the word pool in the prior paragraph for a reason. Developing a valid survey instrument takes time. One of the most time-consuming tasks is creating a pool of items that appear to form a measurable dimension. The reason why one has to create a pool is that until the survey instrument is distributed, feedback is received, and exploratory factor analysis is performed, there is no way to confirm which items strongly form a construct. For example, to get the 36-item Job Satisfaction Survey, Spector (1985) reported he started with 74 items.

References:

Rogers, T. B. (1995). The psychological testing enterprise: An introduction. Brooks/Cole.

Spector, P. E. (1985). Measurement of human service staff satisfaction: Development of the Job Satisfaction Survey. American Journal of Community Psychology, 13(6), 693-713. https://doi.org/10.1007/bf00929796

Bonferroni correction…

When using Null Hypothesis Significance Testing (NHST), a researcher is stating that the null hypothesis will be rejected if a certain likelihood criterion is met, most often p < .05. A threshold of .05 equates to a 1-in-20 chance that a true null hypothesis will be rejected in error (Type I error). But what happens if a DV is tested 9 times? Each additional test adds another chance of a false rejection, so the overall likelihood of at least one Type I error grows. One way to address this is to employ a p-value correction method attributed to the Italian mathematician Carlo Bonferroni (the Bonferroni correction). Under the Bonferroni correction, to maintain the limit of a 5% likelihood of a Type I error across multiple tests, one would divide the significance level (.05) by the number of tests.

Schvom (2019) explored customer satisfaction in the US Airline industry. The emerging scholar focused on two types of analysis: differences in service element (a survey item-level metric) by groups and customer satisfaction by groups. There were 9 different service elements. This is when a p-value correction method should be employed.

In this situation, a learned committee member should have advised the emerging scholar to review the Bonferroni correction and its application to control the experiment-wise error rate. In this example, the emerging scholar should have used p = .0056 (.05/9) as the test of significance rather than p = .05. Thus, a Type I error was likely made in H2B when rejecting the null hypothesis regarding On-Time Arrival satisfaction by Gender (p = .009); however, the null hypothesis regarding satisfaction with Number of Layovers by Gender (p = .001) was correctly rejected (p. 29; see Appendix F). Was Bonferroni advised and not discussed by the student? Or was this just a review/oversight error by the student, faculty, and University quality control? Who knows? I usually look at the size of the effect, not the p-values, because the p-value is influenced by the size of the sample.
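To make the arithmetic concrete, here is a minimal Python sketch. It assumes the nine tests are independent (my simplification, not something the dissertation states) and uses only the two p-values quoted above; the variable names are mine.

```python
ALPHA = 0.05
N_TESTS = 9  # nine service elements, per the post

# Chance of at least one Type I error across 9 tests at .05 each,
# ASSUMING the tests are independent (an illustration-only assumption).
familywise_rate = 1 - (1 - ALPHA) ** N_TESTS

# Bonferroni: divide the significance level by the number of tests.
bonferroni_alpha = ALPHA / N_TESTS

# The two p-values quoted in the post.
reported = {
    "On-Time Arrival by Gender": 0.009,
    "Number of Layovers by Gender": 0.001,
}
decisions = {name: p < bonferroni_alpha for name, p in reported.items()}

print(f"Uncorrected family-wise error rate: {familywise_rate:.3f}")
print(f"Bonferroni threshold: {bonferroni_alpha:.4f}")
for name, reject in decisions.items():
    print(f"{name}: {'reject' if reject else 'fail to reject'} the null")
```

Under that independence assumption, nine uncorrected tests carry roughly a 37% chance of at least one false rejection, which is why the correction matters.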

Note: Admittedly, the Bonferroni correction is the easiest for faculty to explain and, in my opinion, business students to understand. However, there are many methods of addressing family-wise errors. See the methods of Dunn (1959, 1961) and Sidak (1967), and the sequentially rejective procedure of Holm (1979).
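For readers who want to see how two of these alternatives differ from Bonferroni, here is a rough Python sketch of the Sidak threshold and Holm's step-down procedure. The function names are mine, and only the first two p-values in the example come from the post; the other seven are hypothetical placeholders, since I don't have the full set in front of me.

```python
def sidak_alpha(alpha, m):
    """Per-test threshold under the Sidak correction for m tests."""
    return 1 - (1 - alpha) ** (1 / m)


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: returns a reject flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail as well
    return reject


# First two values are quoted in the post; the remaining seven are HYPOTHETICAL.
p_values = [0.001, 0.009, 0.020, 0.040, 0.300, 0.450, 0.600, 0.750, 0.900]

print(f"Sidak threshold for 9 tests: {sidak_alpha(0.05, 9):.5f}")
print("Holm decisions:", holm_reject(p_values))
```

With these illustrative inputs, Holm compares .009 against .05/8 = .00625, so it still fails to reject, the same conclusion Bonferroni reaches.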

References:

Dunn, O. J. (1959). Confidence intervals for the means of dependent, normally distributed variables. Journal of the American Statistical Association, 54(287), 695-698. https://doi.org/10.1080/01621459.1959.10501524

Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52-64. https://doi.org/10.1080/01621459.1961.10482090

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70. https://www.jstor.org/stable/4615733

Schvom, A. F. (2019). A critical evaluation of service elements related to customer satisfaction in the U.S. Airline industry (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (13856698)

Sidak, Z. K. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626-633. https://doi.org/10.1080/01621459.1967.10482935

Examining influence in a qualitative study?

It’s very important to use the correct terminology associated with a research method and design. For example, the word influence is widely associated with quantitative research. Influence can be measured by examining the change in the Y variable when the X variable is manipulated, or when a third variable (Z) is involved. Quantitative research is more scientific, less subjective, is repeatable, and can be generalized. Qualitative research is based on the knowledge and skill of the researcher. There are times when an experienced researcher will explore influence in a qualitative study, but those cases are few and far between, generally related to specific disciplines (e.g., medical, social work), and supported with significant academic research (see here, here, and here). I don’t recommend emerging scholars perform qualitative research. Besides skill, the time needed to complete a qualitative study is much longer than the time needed to complete a quantitative study.

Grant (2019) is an example of why emerging scholars shouldn’t do qualitative research. This emerging scholar explored the influence of leadership behaviors on two dimensions: employee engagement and collaboration (the organization is not germane to this discussion). To perform this study, the emerging scholar created a 7-item open-ended survey and distributed it anonymously to 10 people in an organization exceeding 3,800 people. The emerging scholar would interpret the responses and categorize them to answer the following two research questions –

  • What leadership styles and behaviors are being utilized at [organization]?
  • What is the influence of existing leadership styles and behaviors on employee engagement and collaboration?

Yin (2018) describes five situations where a single case study would be appropriate: critical, unusual, common, revelatory, or longitudinal (pp. 48-50). In addition, Yin describes two types of single case studies: holistic and embedded (pp. 51-53). When reviewing the dissertation, the researcher appears to be attempting to build a common, holistic single-case study. Common because leadership is an everyday situation. Holistic because the organization appears to have a single purpose. However, a case study focuses on “how” or “why” a situation occurred (perhaps leadership style evolution), not “what” style is prevalent or which specific styles influence two outcomes. With an anonymous survey, there is no way to follow up with a participant to clarify their responses. To quote a colleague –

Who’s the researcher? Carnac the Magnificent?

Name withheld

As a result, the research method (QUAL) and design (case study) don’t appear to align with the research questions. The results of the study should be ignored. However, I wanted to discuss the themes identified by Grant –

  • A collaborative, or transformational, leadership style is present
  • Organizational leaders are engaging
  • Unfair hiring practices have become standard

First, are collaborative and transformational the same? They’re close, but I believe some scholars would say they’re different. Second, what do the organization’s hiring practices have to do with leadership in an organization? Plus, how can one generalize to an organization of 3,800+ from a sample of 10? Do the math: That’s a 95% margin of error of nearly 31 percentage points! Even if 90% of the sample described the organization’s leaders as collaborative, as interpreted by the researcher, the 95% CI would run from about 60% to 100% (a proportion can’t exceed 100%). What are the other 40%? Non-collaborative?
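Here is the math, sketched in Python. It assumes the usual normal-approximation margin of error at the worst case (p = .5) with a finite population correction, and it treats the organization as exactly 3,800 people, which is my rounding. The normal approximation is crude at n = 10, so treat the output as a ballpark figure; the function name is mine.

```python
import math


def margin_of_error(p, n, population=None, z=1.96):
    """Normal-approximation margin of error for a sample proportion."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; it barely matters when n is tiny relative to N.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se


# Worst case (p = .5) for 10 respondents out of an ASSUMED 3,800 employees.
worst_case = margin_of_error(0.5, 10, population=3800)

# Interval around an observed 90%, clipped to the valid range for a proportion.
lower = max(0.0, 0.90 - worst_case)
upper = min(1.0, 0.90 + worst_case)

print(f"Margin of error: +/- {worst_case:.1%}")
print(f"Interval around 90%: {lower:.0%} to {upper:.0%}")
```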

References:

Grant, R. M. (2019). Investigating the influence of leadership behaviors on employee engagement and collaboration in a Federal organization (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (22615969)

Yin, R. K. (2018). Case study research and applications: Design and methods (6th ed.). SAGE Publications.

Type I errors galore…

Legatti-Maddox (2019) explored the moderating effect of two leadership styles, transformational and transactional, on the relationship between four types of humor and Organizational Citizenship Behavior. The sample for this study consisted of 42 MBA students.

First, the sample size (N = 42) concerned me since it seemed a bit small to find a practical (moderate) effect (f2 = 0.15). Using the pwr.f2.test() function from R’s pwr package (Champely, 2020) with three independent variables and a minimum power of .80, the required denominator degrees of freedom come out to v ≈ 72.7 (see below). Because v = N - u - 1 in a regression, that works out to a sample of at least 77.

          u = 3
          v = 72.70583
         f2 = 0.15
  sig.level = 0.05
      power = 0.8

So, it appears the study was underpowered by design. With underpowered studies, there is a low probability of detecting true effects, and any effects that are found are more likely to be false positives. Let’s move forward…

Second, when performing a moderation analysis, one has to enter both independent variables along with the interaction (see below)

y = X1 + X2 + X1*X2

If the interaction term (X1*X2) is significant, then the interaction is explored and the independent variables generally lose their value from a research perspective. However, since no p-values for the independent variables or the interaction term were reported, no moderation evidence was provided by the emerging scholar. Instead, p-values of unmoderated and moderated models were compared, and an increase in the F-statistic was reported as evidence of a moderating effect. That’s a flawed approach and, when the null hypothesis is rejected based on that approach, a Type I error ensues.
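To illustrate what evidence for moderation would look like, here is a small, self-contained Python sketch. Everything in it is hypothetical: the data are simulated with a known interaction effect, and the helper functions (solve, ols) are my own stand-ins for what SPSS or R would do. It fits the model with and without the interaction term and tests the interaction directly through the F-change for that one added term, which equals the squared t-statistic of the interaction coefficient.

```python
import random


def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Least squares via the normal equations; returns (coefficients, residual SS)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    return beta, sse


# SIMULATED data: 25 centered grid points with a true interaction weight of 0.8.
rng = random.Random(0)
grid = [(x1, x2) for x1 in range(-2, 3) for x2 in range(-2, 3)]
y = [1 + 0.5 * x1 + 0.3 * x2 + 0.8 * x1 * x2 + rng.uniform(-0.3, 0.3) for x1, x2 in grid]

reduced = [[1, x1, x2] for x1, x2 in grid]         # y = X1 + X2
full = [[1, x1, x2, x1 * x2] for x1, x2 in grid]   # y = X1 + X2 + X1*X2

_, sse_reduced = ols(reduced, y)
beta_full, sse_full = ols(full, y)

# F-change for the single added term, on 1 and n - 4 degrees of freedom.
n, k_full = len(y), 4
f_change = (sse_reduced - sse_full) / (sse_full / (n - k_full))

print(f"Interaction coefficient: {beta_full[3]:.3f}")
print(f"F-change (1, {n - k_full} df): {f_change:.1f}")
```

For 1 and 21 degrees of freedom, the tabled .05 critical value of F is about 4.32, so an F-change above that (after any correction for multiple tests) is what would support moderation. Comparing the overall F-statistics of two models does not test the interaction term.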

In a prior post, I discuss the uses of P-P Plots vs. Q-Q Plots and how the P-P Plot is a default option in regression under SPSS. This emerging scholar used this plot (from SPSS) and stated that the homoscedasticity assumption was met.

Figure 1. Normal P-P Plot of Regression Standardized Residual (p. 70)

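However, a normal P-P plot speaks to the normality of the residuals, not to whether their variance is constant, and there is no reference to the independent variables or to which model was plotted. I wonder what would have happened if her faculty advisor had challenged her and said the residuals are heteroskedastic?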

Finally, a quick look at a summary table in the study (Table 19 below)

A learned faculty member should have counseled this student that the p-values would need to be adjusted for potential family-wise errors, as the student’s premise is that all nine models are true. The widely cited Bonferroni correction would result in an adjusted significance threshold of 0.0056 (.05/9). If applied, only Model 9 may have met the criterion. However, the focus of the study was not on whether a model could be constructed, but whether the interaction of humor and leadership explained the relationship better than the direct effects. Thus, more Type I errors.

The interaction of humor and leadership may influence OCB, but this study provides no evidence. The results of this study should be ignored.

Student Note: The way to approach this would be through some Structural Equation Model (SEM) that controls for Type I errors.

References:

Champely, S. (2020, March 16). pwr: Basic functions of power analysis. https://cran.r-project.org/web/packages/pwr/pwr.pdf

Legatti-Maddox, A. C. (2019). Humor style in the workplace as it relates to leadership style and organizational citizenship behavior (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (22622521)

Ethnicity: M = 1.26, SD = .529?

I understand that one has to “numericize” categories for quantitative analysis, but every student and faculty member should understand that the resulting numbers are only labels, so a mean or standard deviation computed from them means nothing (see Figure 1).

Figure 1: Barplot of Ethnicity (Race) with Distribution overlay (Deonarinesingh, 2019, p. 59).

A chairperson has to perform a lot of reading when reviewing a student’s dissertation. A committee member can hopefully help. But this type of error appeared on every chart in this study’s Chapter 4, regardless of the type of variable (e.g., categorical, interval). Did the faculty not know, or did they simply not read the study?

Student Note: Understand your variables and how best to display them. Don’t rely on your committee; they might not know or remember.
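A quick Python sketch shows why a mean of category codes is meaningless. The coding scheme and responses below are invented for illustration (they are not the dissertation’s data): relabeling the same categories changes the “mean,” while a frequency table stays the same.

```python
from collections import Counter
from statistics import mean

# HYPOTHETICAL coding scheme and responses, invented for illustration only.
codes = {1: "Group A", 2: "Group B", 3: "Group C"}
responses = [1, 1, 1, 1, 1, 1, 1, 2, 2, 3]

# The "mean" depends entirely on which arbitrary number each category was given.
mean_original = mean(responses)

# Same people, same categories, different arbitrary codes.
recode = {1: 3, 2: 1, 3: 2}
mean_recoded = mean(recode[r] for r in responses)

# What should be reported instead: counts (or percentages) per category.
counts = Counter(codes[r] for r in responses)

print(f"Mean with original codes: {mean_original}")
print(f"Mean after relabeling:    {mean_recoded}")
for label, count in counts.items():
    print(f"{label}: {count} ({count / len(responses):.0%})")
```

A bar chart of those counts, without a normal curve overlaid, is the appropriate display for a categorical variable.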

This study will return in a later post…stay tuned.

Reference:

Deonarinesingh, S. (2019). The effect of cultural intelligence upon organizational citizenship behavior, mediated by openness to experience (Doctoral dissertation). ProQuest Dissertations & Theses Global: The Humanities and Social Sciences Collection. (13880805)