“Current data-based methods of analysis and predictive models are insufficient – big data is able to remedy this. ”
– Dr. Alexander Lenk
Pass/fail decisions are mostly based on criteria developed by faculty, set either before the examination (e.g. the Angoff method) or after it (e.g. Borderline Regression Analysis). Standard setting is a critical part of educational, licensing and certification testing, but outside the cadre of practitioners this aspect of test development is not well understood. Standard setting is the methodology used to define levels of achievement or proficiency and the cut-scores corresponding to those levels. A cut-score is simply the score that classifies students who score below it into one level and students who score at or above it into the next, higher level.
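The cut-score rule is simple enough to state as code. A minimal sketch, in which the cut-score value, level labels, and function name are purely illustrative:

```python
def classify(score: float, cut_score: float) -> str:
    """Below the cut-score -> lower level; at or above it -> the next, higher level."""
    return "pass" if score >= cut_score else "fail"

# Illustrative cut-score of 55 marks: a score exactly at the cut-score passes.
cut = 55.0
results = [classify(s, cut) for s in (54.9, 55.0, 72.0)]
print(results)  # ['fail', 'pass', 'pass']
```

Note the boundary convention: a student scoring exactly at the cut-score is placed in the higher level, as in the definition above.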
The Standard Error of Measurement (SEM) indicates the amount of error around the observed score. The observed score, the score we retrieve, store and analyse from an OSCE, is in fact the combination of the true score and random error around that true score. If we want a reliable decision about passing or failing a station of, for example, an OSCE, we need to incorporate the SEM in that decision.
The observed score is the true ability (true score) of the student plus the random error around that true score. The error is associated with the reliability, or internal consistency, of the score sheets used in OSCEs. Within our system, Qpercom calculates Cronbach’s alpha as a reliability score indicating how consistently scores are measured, and the Intraclass Correlation Coefficient, which indicates how reliable scores are across the different stations (Silva et al., 2017). These classical psychometric measures of the data can be used to calculate the SEM. An observed score ± one SEM means that, with roughly 68% certainty, the ‘true score’ of that station lies somewhere between the actual score minus the SEM and the actual score plus the SEM. In principle, one should consider the 95% confidence interval, which is the observed score plus or minus 1.96 × SEM (Zimmerman & Williams, 1966).
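The calculation described above can be sketched in a few lines. The standard classical-test-theory formula SEM = SD × √(1 − reliability) is assumed here (the text says the reliability measures "can be used to calculate the SEM" without giving the formula), and the station data in the example (SD of 8 marks, alpha of 0.84, observed score of 65) are made up for illustration:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard Error of Measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def true_score_interval(observed: float, sd: float, reliability: float,
                        z: float = 1.96) -> tuple[float, float]:
    """Interval likely to contain the true score: observed +/- z * SEM."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# Illustrative station: score SD = 8 marks, Cronbach's alpha = 0.84.
s = sem(8.0, 0.84)  # 8 * sqrt(0.16) = 3.2 marks

# 68% band (z = 1) and 95% band (z = 1.96) around an observed score of 65.
band68 = true_score_interval(65.0, 8.0, 0.84, z=1.0)    # (61.8, 68.2)
band95 = true_score_interval(65.0, 8.0, 0.84, z=1.96)   # (58.728, 71.272)
print(s, band68, band95)
```

With a cut-score of 60, this hypothetical student passes on the observed score, but the 95% interval dips below the cut-score, which is exactly why the text argues the SEM should be incorporated into pass/fail decisions.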
A few words about this paper…
A historically significant paper, I have to say, and although not yet cited it formed the basis of what I eventually pursued for the last 10 years in the spin-off company Qpercom. According to the Irish Times, we are “dragging exam assessment out of the dark ages” (Oct. 2016). This suppressed paper actually forms the basis of what Qpercom has worked to achieve since 2008 with client partners worldwide. A PT clinician by training, I moved into medical education. As clinical researchers, we put a lot of effort into establishing the Smallest Detectable Difference (SDD) of measurements taken with a ‘ruler’: measuring maximal mouth opening with a metal ruler is one of the outcome variables in patients with maxillofacial pain. With the newly acquired evidence that mouth opening had to change by at least 12 mm before and after an intervention for it to count as successful in patients with temporomandibular joint disc displacement, I changed jobs and moved into medical/dental education. There I was immediately challenged to look into comparable measurements used in oral hygiene training. Probing depth measurements were used as an example to demonstrate the use of generalisability and decision studies in educational decision-making. Fourteen years after this publication, we are comparing 10 different European universities on the quality assurance outcomes of their OSCEs. Have a read, use the evidence, and I hope this will help students and staff measure any kind of assessment outcome. Plus, this historically significant paper needs citations!