Pilot Studies…

I recently had the opportunity to consult about the use of pilot studies. A colleague brought to my attention a student-developed survey instrument that (allegedly) measured three dimensions –

  • Perceived Usefulness
  • Perceived Ease of Use
  • Intention to Use

These three dimensions are normally associated with the Technology Acceptance Model (TAM), developed by Davis (1989), and modified since then.

The items measuring these dimensions are often modified to focus on a specific technology. However, the student did something I had never seen before: He measured six different types of technology under the banner of the “technologies of the next industrial revolution.” Rather than use a validated instrument as a base and modify the subject in each item, he wrote new items and allocated one item for each of the six different types of technology.

Rather than go deeper into psychometrics and why the student’s approach was flawed, let’s get back to topic.

The purpose of a pilot study is to examine the feasibility of an approach before a large-scale study is performed. Pilot studies are often performed in both qualitative and quantitative research. Since the student created a new instrument, a pilot study was warranted. The student should have surveyed between 50 and 100 people to validate the a priori face validity properties via Exploratory Factor Analysis.

What did the student do? He surveyed 13 people to ask if the items were valid based on the wording!?!? What type of validity was he seeking? Face Validity? Language validity?

Cut to the end…the survey instrument failed confirmatory factor analysis (Tucker-Lewis Index < .90), an exploratory factor analysis resulted in the elimination of three items due to loadings < .50, and the resultant 15-item survey resulted in three dimensions that made no sense (i.e., the a priori face validity was incorrect). What was left was an attempt to salvage the student’s research…an endeavor I dislike but understand why it’s done.

Where were the red flags? I see three –

  • Attempting to measure usefulness, ease of use, and intention to use six different types of technology with one item each. (Buy a book! I prefer Furr, but also have Miller and Lovler)
  • Failing to impanel a group of experts to evaluate the face validity of each item
  • Failing to understand the purpose of a pilot study

Who’s to blame? The three issues mentioned above fall directly in the lap of the faculty and University. Not everybody has the skills and knowledge to advise doctoral students. Some need to focus on their strength: teaching in their domain.


Measuring perception…

A week ago, a doctoral business student was referred to me for a discussion about a specific research design. While the primary research question was somewhat convoluted, once edited the focus was on measuring perception of a behavior in a group through the lens of another (the sample), then relating that perception of the group to the sample’s perception of project performance. If one were to ask a survey participant a single question, it would be “What do you think the relationship is between A and B?” This type of question falls into the category (loosely) of (public) opinion research, not business research.

Perceptions (and attitudes) are often studied using a qualitative research methodology. The perception of something or someone is generally explored via interviews, where the researcher collapses groups of thoughts into themes that answer the research question. This type of approach is often covered in some depth in many research methods and design textbooks.

When it comes to quantitative research, though, measuring perception is focused on the self-assessment of a sample. For example, the Perceived Stress Scale measures the perception of stress, and the Buss-Durkee Hostility Inventory measures aspects of hostility and guilt; both instruments were developed by psychologists.

Using a subject’s perception of another person is problematic due to cognitive bias. This type of bias involves the systematic error in one’s perception of others. Within cognitive bias, there are three groups –

  • Fundamental Attribution Error, which involves labeling people without sufficient information, knowledge or time
  • Confirmation Bias, which is widely referred to as a common judgmental bias. Research has shown that people trick their minds into focusing on a small piece of information to confirm an already developed belief
  • Self-serving Bias, which involves perceiving a situation in a manner that places the one perceiving in a more positive light

How would you measure validity in the proposed study? Have the sample assess behaviors in people, measure the behaviors of the people, compare the two assessments for accuracy, and factor that accuracy into the study? Seems like a long way to go, and all you are really doing is measuring the assessment ability of the sample.

I don’t know who’s at fault here for not identifying this type of fundamental issue before the student’s research proposal was developed. It may have been identified by faculty along the way and ignored by the student? It could be faculty didn’t really understand what the student was proposing due to how the research question was formed?


I was asked by a colleague to review a nearly completed doctoral manuscript to opine on a chairperson’s recommendation to the student on how to address a small sample size. According to the student’s use of G*Power, a sample size of 67 was required (see below):

While the student received 67 responses, only 46 (68.7%) were usable. In an attempt to help the student, the chairperson recommended the student (a) randomly select 4 records from the 46, and (b) add them back to the sample (increasing the sample from 46 to 50). Why 4? Why not 3? 5? Why not duplicate the entire sample and have an N = 92? With an N = 92, one could find an r = .327 (assuming a statistical power of .95). The student couldn’t answer those questions as he was merely following the instructions of the faculty. That is a topic for another day…
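To make the "why not duplicate the entire sample?" question concrete, here is a minimal Python sketch. The 46 records are simulated stand-ins (an assumption, not the student's data), and `pearson_r` and `t_statistic` are helper names invented for the illustration. It shows that counting every record twice leaves r untouched while the test statistic grows, simply because the formula is handed a larger N.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical data: 46 simulated records, NOT the student's responses.
rng = random.Random(40121)
x = [rng.gauss(0, 1) for _ in range(46)]
y = [0.4 * a + rng.gauss(0, 1) for a in x]

r_original = pearson_r(x, y)
r_doubled = pearson_r(x + x, y + y)  # every record counted twice

t_original = t_statistic(r_original, len(x))
t_doubled = t_statistic(r_doubled, 2 * len(x))

print(f"46 records: r = {r_original:.3f}, t = {t_original:.2f}")
print(f"92 records: r = {r_doubled:.3f}, t = {t_doubled:.2f}")
```

The duplicated records add no new information, so any "improvement" in the test statistic is an artifact of the inflated N rather than evidence about the population.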

What could the student have done?

Option 1: Use what you have

Do the study with the 46 records. If one reduces their view of statistical power to that of the widely-referenced .80 associated with social science (see the seminal work of the late Jacob Cohen), an r = .349 (well below the estimated effect size) could still be found (see below):

Whatever effect size is found by the student, the value can be compared to the effect size hypothesized (r = .377), and differences explored by the student in the study. A learned faculty would suggest the student report 95% Confidence Intervals (CI), and compare the hypothesized effect size to the upper and lower bounds of the CI. If the hypothesized value is within the range of the CI, then it’s probably a sampling error issue. If the hypothesized value is NOT within the range of the CI, either the hypothesized effect size was in error or something is unique in the sample and more exploration is needed.

Option 2: Bootstrapping

Bootstrapping uses sampling with replacement to allow an inference to be made about a population (Efron, 1979). Bootstrapping is used to assess uncertainty by “resampling data, creating a set of simulated datasets that can be used to approximate some aspects of the sampling distribution, thus giving some sense of the variation that could be expected if the data collection process had been re-done” (Gelman et al., 2021, p. 74). Using bootstrapping, I’m hypothesizing that the test statistic will fall within the 95% CIs, but the CIs will be wider than those of the original data set, and the distribution of effect sizes will approximate a normal distribution.
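For readers who want to see the mechanics, the following is a minimal Python sketch of a percentile bootstrap for a correlation. It is an illustration under stated assumptions, not the boot package's algorithm: the data are simulated, the function names are invented for the example, and only the simple percentile interval is computed.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_percentile_ci(x, y, replicates=1000, level=0.95, seed=40121):
    """Resample (x, y) pairs with replacement and return a percentile CI."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(replicates):
        # Each simulated dataset has the same size as the original sample.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    tail = (1 - level) / 2
    lower = estimates[math.floor(tail * (replicates - 1))]
    upper = estimates[math.ceil((1 - tail) * (replicates - 1))]
    return lower, upper, estimates


# Hypothetical data: 46 simulated records (an assumption for illustration).
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(46)]
y = [0.4 * a + rng.gauss(0, 1) for a in x]

lower, upper, estimates = bootstrap_percentile_ci(x, y)
print(f"sample r = {pearson_r(x, y):.3f}")
print(f"95% percentile bootstrap CI: ({lower:.3f}, {upper:.3f})")
```

Every resample keeps the original number of records; what grows with more replicates is the number of simulated datasets, which smooths the picture of the sampling distribution.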

To illustrate, I simulated a 46-record data set with two variables that had a projected relationship of r = .25 using the faux package in R (DeBruine et al., 2021). Rather than assume a directional hypothesis, I used the standard NHST assumption (two-sided test). The relationship between the two variables was positively sloped, moderate in size, and statistically significant, r(44) = .398, p = .006, 95% CI (.122, .617).
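The confidence interval for a Pearson correlation is commonly built with the Fisher z-transformation. As a hedged check, the short Python sketch below applies that transformation to the reported r = .398 and N = 46, then runs the comparison described under Option 1 against the hypothesized r = .377. The function name `fisher_ci` is an assumption made for the example.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    se = 1 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


observed_r = 0.398      # effect size reported for the simulated data
hypothesized_r = 0.377  # effect size assumed in the power analysis
lower, upper = fisher_ci(observed_r, 46)
inside = lower <= hypothesized_r <= upper

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
print(f"hypothesized r inside the CI: {inside}")
```

The bounds come out within rounding of the reported (.122, .617), and the hypothesized value falls inside the interval, which is the "probably a sampling error issue" case.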

Next, I bootstrapped three different sample sizes (50, 67, and 1000) using the boot package (Canty & Ripley, 2021). The N = 50 represented what the faculty was trying to recommend, the N = 67 was the originally hypothesized sample size, and the N = 1000 is a size widely used in bootstrapping. The following table displays the results –

Data Set                              Effect Size (r)   95% CI Lower Bound   95% CI Upper Bound
Simulated Data (N = 46)               .398              .122                 .617
Bootstrapped Simulation (N = 50)      .360              .105                 .646
Bootstrapped Simulation (N = 67)      .360              .109                 .608
Bootstrapped Simulation (N = 1000)    .360              .093                 .638

Since bootstrapping involves sampling, it’s possible the test statistic may vary between iterations. To address that, I set the R set.seed() command to “040121”.

Note in the table how the confidence intervals change as the sample increases. A better way to view the differences is to compare the density functions of the three bootstrapped samples –

Note how the distribution base “fattens” as the sample size increases from N = 50 to N = 67. Finally, note how the effect size distribution becomes normally distributed with an N = 1000. Regardless, there is always a chance that the true effect size is less than or equal to 0, as depicted by the lower CI meeting or crossing the dotted red line (representing r = 0).

I speculate the chairperson was trying to suggest bootstrapping to the student. Either the faculty didn’t have sufficient knowledge of bootstrapping to guide the student, or the concept of bootstrapping was above the ability of the student to comprehend. I also speculate that the faculty was trying to address a high p-value issue. Since a p-value depends on the sample size, there is nothing a student or faculty can do when a sample size is small except focus on the size of the effect. Perhaps that is what truly was lost in translation. Students, and the faculty advising them, need to understand that it’s not necessarily the p-value that is important but the effect size.
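To illustrate how strongly the p-value depends on N, here is a small Python sketch that holds the effect size fixed at r = .30 and varies only the sample size. It uses the Fisher z normal approximation rather than the exact t-test, so the p-values are approximate. The effect size and sample sizes are arbitrary choices for the example.

```python
import math
from statistics import NormalDist


def approx_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


r = 0.30                      # same effect size in every scenario
sample_sizes = [20, 46, 200]  # illustrative sample sizes
p_values = [approx_p_value(r, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"N = {n:>3}: r = {r:.2f}, approximate p = {p:.5f}")
```

The effect is identical in all three rows; only the amount of data changes, and with it the verdict of the significance test.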

I suspect faculty and students will see more and more low sample sizes over the next year or so as people are fatigued or disinterested in completing surveys (thanks to COVID-19). Students need to be prepared to find a larger population to sample to counter potentially lower than expected response rates.


Canty, A., & Ripley, B. (2021, February 12). boot: Bootstrap functions. https://cran.r-project.org/web/packages/boot/boot.pdf

DeBruine, L., Krystalli, A., & Heiss, A. (2021, March 27). faux: Simulation for factorial designs. https://cran.r-project.org/web/packages/faux/faux.pdf

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1-26. https://doi.org/10.1214/aos/1176344552

Gelman, A., Hill, J., & Vehtari, A. (2021). Regression and other stories. Cambridge University Press.

Code Snippet used in analysis:

# load the packages used below (missing from the original snippet)
library(faux)  # rnorm_multi()
library(boot)  # boot(), boot.ci()
# set random number seed to allow recreation
set.seed(040121)
# create a 46-record dataset with an r = 0.25
df <- rnorm_multi(46, 2, r = 0.25)
# perform correlation with original 46 records
cor_value <- cor.test(df$X1, df$X2, method = "pearson")
# statistic passed to boot(): correlation of the resampled rows
# (note: Spearman here, while cor.test() above uses Pearson)
boot_cor <- function(data, i) {
  cor(data[i, "X1"], data[i, "X2"], method = "spearman", use = "complete.obs")
}
# density of the bootstrapped correlations with reference lines at r = 0 (red),
# the estimate (solid blue), and the CI bounds (dashed blue); the plot() call
# is an assumed reconstruction, since abline() needs an open plot
plot_boot <- function(b, est, lower, upper, title) {
  plot(density(b$t), main = title, xlab = "r")
  abline(v = c(0, est, lower, upper),
         lty = c("dashed", "solid", "dashed", "dashed"),
         col = c("red", "blue", "blue", "blue"))
}
# bootstrap with N = 50 (in boot(), R is the number of bootstrap replicates)
boot_example1 <- boot(df, statistic = boot_cor, R = 50)
boot.ci(boot_example1, type = c("norm", "basic", "perc", "bca"))
plot_boot(boot_example1, .35998, 0.1054, 0.6461, "N = 50")
# bootstrap with N = 67
boot_example2 <- boot(df, statistic = boot_cor, R = 67)
boot.ci(boot_example2, type = c("norm", "basic", "perc", "bca"))
plot_boot(boot_example2, .35998, 0.1091, 0.6078, "N = 67")
# bootstrap with N = 1000
boot_example3 <- boot(df, statistic = boot_cor, R = 1000)
boot.ci(boot_example3, type = c("norm", "basic", "perc", "bca"))
plot_boot(boot_example3, .35998, 0.0932, 0.6382, "N = 1000")
# create three plots in one graphic (assumed layout: one row, three panels)
par(mfrow = c(1, 3))
plot_boot(boot_example1, .35998, 0.1054, 0.6461, "N = 50")
plot_boot(boot_example2, .35998, 0.1091, 0.6078, "N = 67")
plot_boot(boot_example3, .35998, 0.0932, 0.6382, "N = 1000")
par(mfrow = c(1, 1))

Superficial vs Thorough Research…

I had interesting conversations with two colleagues, independently but about the same student. We chatted about how some students just touch the surface in their description of survey instruments (among other things), while others will dig deep to demonstrate thoroughness and thoughtfulness.

The student in question cited Campbell and Park (2017) as the source of a subjective-based measure to assess company (firm) performance. There was no discussion in the student’s study about the use of a subjective-based vs objective-based instrument; only that the instrument used in the study had a Cronbach’s alpha (α) of .87. That was it!

I have found that privately-held companies don’t like to share financial information with researchers. Thus, if a student goes down a path of surveying these types of businesses to collect objective data (e.g., sales, gross margin), I advise them they can expect to have a higher non-response rate than anticipated, potentially a larger number of items not answered, and should plan ($$$) to obtain a larger sample.

I had two questions –

Where did the use of subjective-based instruments to measure firm performance begin?

Are subjective-based instruments just as valid as objective-based instruments?

I couldn’t ask the student these questions, since he didn’t discuss it in his proposal. He probably doesn’t know. So, I started reading…starting with Campbell and Park (2017). Campbell and Park cited Campbell et al. (2011) and Runyan et al. (2008) on p. 305. Those references led me to Frazier (2000), Niehm (2002), Droge et al. (2004), Runyan et al. (2006), Richard et al. (2009), and (most importantly) Venkatraman and Ramanujam (1986). Let’s start there…

Venkatraman and Ramanujam (1986) explored ten different approaches to measuring business performance. They posited that business performance has two dimensions: financial vs operational, and primary data sources vs secondary data sources. Relating to this student’s study, the second dimension was of interest. Venkatraman and Ramanujam discussed the benefits and limitations of using primary and secondary data as a measure of business performance (p. 808). More importantly, they discussed the use of financial data from secondary sources and operational data from primary sources to “enlarge the conceptualization of business performance” (p. 811). For example, a gross margin of 65% could be conceptualized as doing better or worse than the competition. Makes sense…but where was this type of instrument used first?

Frazier (2000), citing Venkatraman and Ramanujam, wrote “subjective assessments of performance are generally consistent with secondary performance measures” (p. 53). Frazier appears to have created a three-item instrument to measure firm performance. The three items, measured on a 5-point Likert scale from poor (1) to excellent (5), were –

  • How would you describe the overall performance of your store(s) last year?
  • How would you describe your performance relative to your major competitors?
  • How would you describe your performance relative to other stores like yours in the industry?

Frazier reported an α = .84 (N = 112). Niehm (2002), using a similarly worded instrument, reported an α = .82 (N = 569). Runyan et al. (2008), citing Frazier (2000) and Niehm (2002), used the same instrument and reported an α = .82 (N = 267). However, what’s important to note is that Runyan et al. discussed an advantage that subjective questions have over objective questions – increased response rates – citing a study of one of the co-authors (Droge et al., 2004). Runyan et al. (2008) and Campbell et al. (2011) followed similar approaches and both reported an α = .87 (which could be an editorial error as the wording is similar). Campbell et al. (2011) also incorporated research performed by Richard et al. (2009) where the authors posit that the context of the study should dictate whether to use subjective or objective measures.
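Since the whole comparison rests on Cronbach’s alpha, it may help to see how the coefficient is computed. Below is a minimal Python sketch for a three-item instrument scored from 1 to 5. The eight response rows are hypothetical numbers made up for the illustration; they are not data from Frazier (2000) or any other study cited here.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item,
    with respondents in the same order in every list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    sum_item_variances = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical responses from eight participants (1 = poor, 5 = excellent).
overall_performance = [4, 3, 5, 2, 4, 3, 5, 1]
versus_competitors = [4, 2, 5, 3, 4, 3, 4, 2]
versus_industry = [5, 3, 4, 2, 3, 3, 5, 1]

alpha = cronbach_alpha([overall_performance, versus_competitors, versus_industry])
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha rises when the items move together across respondents, so it speaks to internal consistency. It says nothing by itself about whether the items track objective performance, which is the validity question.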

What did I learn in 2-3 hours of reading and writing –

  • Subjective-based instruments appear to be similar in validity to objective-based instruments, but we (researchers) should periodically confirm by administering both and examining construct validity.
  • A subjective instrument could reduce non-response rates, which is always an issue in research and incredibly important in today’s COVID-19 world as companies and people appear to be over-surveyed and unresponsive.
  • The three-item subjective-based instrument developed by Frazier (2000) appears to be internally consistent across studies (α between .82 and .87)

I also reflected on typical responses from students when asked about their instrument –

  • Superficial Student – “This person used it. Why can’t I?”
  • Thorough Student – “What would you like to know?”


Campbell, J. M., Line, N., Runyan, R. C., & Swinney, J. L. (2010). The moderating effect of family-ownership on firm performance: An examination of entrepreneurial orientation and social capital. Journal of Small Business Strategy, 21(2), 27-46.

Campbell, J. M., & Park, J. (2017). Extending the resource-based view: Effects of strategic orientation toward community on small business practice. Journal of Retailing and Consumer Services, 34(1), 302-308. https://doi.org/10.1016/j.jretconser.2016.01.013

Droge, C., Jayaram, J., & Vickery, S. K. (2004). The effects of internal versus external integration practices on time-based performance and overall performance. Journal of Operations Management, 22(6), 557-573. https://doi.org/10.1016/j.jom.2004.08.001

Frazier, B. J. (2000). The influence of network characteristics on information access, marketing competence, and perceptions of performance in small, rural businesses (Doctoral dissertation: Michigan State University).

Niehm, L. S. (2002). Retail superpreneurs and their influence on small communities (Doctoral dissertation: Michigan State University).

Richard, P., Devinney, T., Yip, G., & Johnson, G. (2009). Measuring organizational performance: Towards methodological best practice. Journal of Management, 35(3), 718-804. https://doi.org/10.1177/0149206308330560

Runyan, R., Droge, C., & Swinney, J. (2008). Entrepreneurial orientation versus small business orientation: What are their relationships to firm performance. Journal of Small Business Management, 46(4), 567-588. https://doi.org/10.1111/j.1540-627x.2008.00257.x

Runyan, R., Huddleston, P., & Swinney, J. (2006). Entrepreneurial orientation and social capital as small firm strategies: A study of gender differences from a resource-based view. The International Entrepreneurship and Management Journal, 2(4), 455-477. https://doi.org/10.1007/s11365-006-0010-3

Venkatraman, N., & Ramanujam, V. (1986). Measurement of business performance in strategy research: A comparison of approaches. Academy of Management Review, 11(4), 801-814. https://doi.org/10.2307/258398