The field of paramedicine has evolved considerably over the last three decades. Originally, the role of a paramedic was restricted to patient transportation from the field to a nearby hospital with little medical intervention along the way. Today, paramedics are skilled clinicians who are capable of performing advanced medical and procedural interventions in a prehospital setting (Joyce et al, 2009; Williams et al, 2016).
To ensure that paramedic students are ready to enter the workforce, academic institutions have been adopting novel techniques (Thompson and Houston, 2021).
Simulation training is a hands-on, practical teaching method that has become increasingly used within teaching institutions (McKenna et al, 2015; Van Dillen et al, 2016; Tsai et al, 2020; Thompson and Houston, 2021). It has been established that simulation training is a critical means of improving medical education. Students exposed to simulation acquire more competencies and expertise in practical skills than those whose curriculum does not require simulation training (Van Dillen et al, 2016).
As simulation is required in paramedic education, its quality should be investigated and assessed on an ongoing basis to ensure that students are achieving the most from simulation training. It is also imperative that the methods used for evaluating simulations are deemed fair (Williams et al, 2016).
In the domain of paramedicine, students typically have their simulation scenarios evaluated by using a validated assessment tool known as the global rating scale (GRS) (Tavares et al, 2013). The GRS is a seven-dimension scale that is used to assess the competency of paramedic students regarding situational awareness, history taking, patient assessment, decision making, resource use, communication and procedural skills (Tavares et al, 2013). Each of these seven themes is measured on a scale ranging from 1 (unsafe) to 7 (exceptional) (Tavares et al, 2014).
The GRS is valid and possesses high inter-rater reliability when it is used to evaluate paramedics in training. Additionally, there is evidence to suggest that the scores achieved on a GRS in simulation are transferable to abilities in a real-world clinical context (Tavares et al, 2013; 2014).
Although the GRS is considered the gold standard for paramedic student evaluation, to the best of the authors’ knowledge, no one has explored the effect of differential rater function over time (DRIFT) on the grades obtained on the GRS (Ilgen et al, 2015). The concept of DRIFT has been demonstrated in other areas of medical education and typically arises because of increasing leniency as a result of rater fatigue (McLaughlin et al, 2009).
Fairness in assessment is crucial to education. Additionally, in a domain such as paramedicine, it is important that evaluation is standardised as public health and safety could otherwise become compromised (Yeates et al, 2019).
It is crucial for student success and for public safety to ensure that the evaluation of paramedic performance is accurate and fair. Therefore, the primary purpose of this study was to explore if a DRIFT phenomenon occurred during multiple GRS evaluations by paramedic raters.
Methods
Study design
This was a cohort study: a group of people who rated paramedic students, selected by convenience sampling, was followed over a period of time across GRS stations. No intervention was given to any participant.
Ethics
The research was approved by the Collège Boréal research ethics board. Following ethical approval, consent was acquired from each participant. To ensure that the participants were blinded to the study's purpose, consent forms were signed after the examination period.
Setting
The data were collected at Collège Boréal, Sudbury, Ontario, Canada in the 2020–2021 academic year during the six practical simulation evaluations for first-year paramedic students. A total of 12 raters took part in GRS evaluations during the academic year.
The study examined rater evaluations during practical paramedic student examinations using the GRS, which was applied at 11 stations lasting 12 minutes each. The student examinees were randomly assigned to different starting stations and moved through the GRS circuit in the same order. Each rater scored every paramedic student examinee at the same station and was given a single rest period during the practical examination, which was randomly assigned and staggered throughout the day.
Participants/population
The raters were employed paramedics who held the Advanced Emergency Medical Care Assistant (AEMCA) qualification and were legally allowed to work as a paramedic in the province of Ontario, Canada. Additionally, the raters were staff at Collège Boréal.
Each rater was familiar with the GRS evaluation tool and had been involved in evaluations before. The examinees were first- and second-year paramedic students who were enrolled at the college and were participating in the practical assessments required by the curriculum that take place three times per semester.
Outcome measures
The primary outcome measure was whether a DRIFT occurred during successive evaluations.
Data analysis
Modelling the scores for testing for DRIFT
To model the scores for testing for DRIFT, the probability of a student achieving a given score on a particular task was approximated. This was calculated by assuming the successful completion of six independent cumulative tasks, each with the same probability of completion; the score is then described by a binomial distribution with a fixed parameter (n=6) and an adjustable parameter P (the probability of completing a task).
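This score model can be sketched in a few lines. The mapping of score = 1 + number of successes is an assumption made here so that a six-trial binomial spans the 1–7 GRS scale; it is not stated explicitly by the authors, whose own calculations were done in R.

```python
from math import comb

def score_probability(score: int, p: float) -> float:
    """Probability of a given GRS score (1-7) under the binomial model:
    the score is taken as one plus the number of successes among six
    independent tasks, each completed with probability p, i.e. a
    Binomial(n=6, p) distribution shifted onto the 1-7 scale.
    (The score = 1 + successes mapping is our assumption.)"""
    k = score - 1  # number of task successes needed for this score
    return comb(6, k) * p ** k * (1 - p) ** (6 - k)
```

For example, the seven possible scores always sum to probability 1, and a perfect score of 7 requires all six tasks to succeed.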
To model the effect of other factors on P, the authors considered it to depend on several random factors:
Modelling DRIFT
To model the DRIFT, two mechanisms for fatigue to affect the raters were considered.
The first was the leniency DRIFT model—that the fatigue generated by successive evaluations translates itself into a systematic change in the leniency or strictness of the evaluations. In this model, the same evaluator would consistently drift towards either higher or lower values.
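A minimal sketch of this leniency mechanism, assuming the drift acts as a constant shift on the logit of P per successive test (the parameter names `p0`, `drift` and `t` are illustrative, not the authors' code):

```python
from math import exp, log

def logit(p: float) -> float:
    return log(p / (1 - p))

def inv_logit(x: float) -> float:
    return 1 / (1 + exp(-x))

def leniency_drift(p0: float, drift: float, t: int) -> float:
    """Leniency DRIFT sketch: each successive evaluation shifts the logit
    of the task-success probability by a constant `drift` (positive =
    growing leniency, negative = growing strictness)."""
    return inv_logit(logit(p0) + drift * t)
```

With `drift` = 0 the rater is stable; a positive value inflates every subsequent score and a negative value deflates it, which is the systematic pattern the model tests for.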
The second mechanism is a perceptual failure model, which assumes that fatigue produces an increasing rate of scoring errors during the evaluations.
Therefore, we distinguished two cases and defined two probabilities:
In the model, the authors consider that the probability of error increases linearly with the number of tests performed. In this case, the dispersion of scores would converge over time towards an equilibrium score, irrespective of the student's performance.
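The convergence towards an equilibrium can be illustrated with a small sketch, assuming the two error rates grow linearly from zero (`alpha_rate` and `beta_rate` are illustrative names, not the authors' parameters):

```python
def perceived_p(p: float, t: int, alpha_rate: float, beta_rate: float) -> float:
    """Perceptual-failure sketch: at test t, the rater wrongly credits a
    failed task with probability alpha_rate * t (type I error) and misses
    a completed task with probability beta_rate * t (type II error), both
    growing linearly from zero. High values of p are pulled down and low
    values are pulled up, compressing the score distribution over time."""
    alpha = min(alpha_rate * t, 1.0)
    beta = min(beta_rate * t, 1.0)
    return p * (1 - beta) + (1 - p) * alpha
```

A strong student (P=5/6) drifts downwards and a weak student (P=1/6) drifts upwards, so extreme scores become rarer as fatigue accumulates.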
Several factors were considered in designing the models. The models and calculations were run in RStudio software and are available upon request.
The Bayes factors K were calculated from the difference in the logarithm of the evidence between alternative hypotheses, expressed in decihartleys (dHart). Values of K >20 dHart were considered decisive, values >10 strong and values >5 substantial.
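On this logarithmic scale, a Bayes factor becomes a simple difference of evidences, and the interpretation thresholds are additive. A short sketch of the conversion and of the article's thresholds (function names are illustrative):

```python
from math import log10

def bf_to_dhart(bayes_factor: float) -> float:
    """Express a raw Bayes factor on the decihartley scale:
    K (dHart) = 10 * log10(Bayes factor)."""
    return 10 * log10(bayes_factor)

def strength(k_dhart: float) -> str:
    """The article's interpretation thresholds for |K| in dHart."""
    k = abs(k_dhart)
    if k > 20:
        return "decisive"
    if k > 10:
        return "strong"
    if k > 5:
        return "substantial"
    return "inconclusive"
```

For instance, a raw Bayes factor of 100 corresponds to 20 dHart, just at the decisive boundary, while the study's leniency and perceptual results (|K| of 9.1 and 7.1 dHart) fall in the substantial band.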
Findings
Models tested
Among the models explored, the one comprising the student, the rater and the dimensions had the greatest evidence (-3151). Removing any one of these factors, or adding the time of year, the category, or an interaction between student and dimension, decreased the evidence. Therefore, the authors used an effect of student, rater and dimension as the baseline, and tested the leniency and perceptual failure models against it.
| Model tested | Evidence (decibels) | K (dHart) |
|---|---|---|
| All students equal | -3247 | -419*** |
| Student | -3228 | -333*** |
| Student + rater | -3164 | -55.8*** |
| Student + rater + category | -3171 | -87*** |
| Student + rater + dimension | -3151 | 0 |
| Student + rater + dimension + date of year | -3160 | -39.5*** |
| Student + rater + dimension × student | -3169 | -77.9*** |

*** Indicates a decisive Bayes factor
Leniency model
The posterior distribution of the rate of increase in score with successive tests (expressed as logit) was analysed. In this model, all evaluators would systematically change the scores with time. The prior distribution allowed for either leniency (an increase in scores) or strictness (a decrease in scores).
The posterior distribution was centred on zero DRIFT, and a marked narrowing of the distributions is apparent, indicating that the data restrict the possible change in scores to a small range. After 10 tests, the 1–99 credibility interval (CI 1–99) for the logit of P was (–0.11, 0.18). In terms of score, a student deserving a 4 would, after 10 tests, receive a score within the CI 1–99 of 3.83–4.27. Therefore, the DRIFT would be <0.3 points.
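This interval can be checked against the binomial model. Assuming the expected score is 1 + 6P (the binomial mean shifted onto the 1–7 scale, which is our reading of the model rather than a formula stated by the authors), the reported logit bounds map approximately onto the reported score bounds:

```python
from math import exp

def expected_score(logit_p: float) -> float:
    """Expected GRS score implied by the logit of the task-success
    probability P, assuming score = 1 + 6 * P (the binomial mean
    shifted onto the 1-7 scale; our assumption)."""
    p = 1 / (1 + exp(-logit_p))  # inverse logit
    return 1 + 6 * p
```

A logit of 0 gives exactly 4, while the interval bounds of -0.11 and 0.18 give roughly 3.84 and 4.27, close to the reported 3.83–4.27.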
Perceptual failure model
This model uses two parameters, alpha for the increase in type I error, and beta for the increase in type II errors.
In the model, the authors assume a perfect evaluator at time 0, with an increasing rate of mistakes over the following tests. At the 10th test, the posterior distributions of alpha and beta are narrower than the priors. The CI 1–99 for alpha (0.007–0.13) and beta (0.13–0.19) do not exclude the possibility of a small error rate in scoring. This error would decrease the frequency of very high and very low scores.
To show this effect, the authors tested how individuals with either P=5/6 or P=1/6 would evolve with successive tests. In the first case, a score with a CI of 5.95–6.00 at the first test would fall to a CI of 5.51–6.02. In the second case, a score with an interval of 2.04–2.10 would reach values of 2.04–2.84. Therefore, it is possible that some perceptual fatigue could raise scores by up to 0.84 points on the GRS. There is also the prospect of students failing because they are wrongly given a low score; therefore, the possibility of scoring error in failing students cannot be discarded.
Discussion
The primary objective of this study was to detect if there was a DRIFT in the ability of the raters on successive tests during the day—essentially, a fatigue effect.
This study did not demonstrate that a DRIFT exists, although the possibility of a small effect cannot be completely discarded. However, the results put clear limits on the likelihood of this effect: there is a 1% chance that the effect is ≥0.8 points after 10 tests. Therefore, the results of this study suggest that a substantial DRIFT does not exist in paramedic raters when evaluating paramedic students’ simulations using the GRS; substantial evidence against both the leniency (K=–9.1 dHart) and the perceptual models (K=–7.1 dHart) was established.
This study helps to further validate previous research by providing more support that the GRS possesses high inter- and intra-rater reliability. The fact that minimal (if any) DRIFT exists adds another positive element to the already favoured GRS. This is important as it supports the increasingly popular GRS as a fair tool for simulation assessment. Additionally, there is an established push to conduct more research on prehospital simulation, and this study adds to the existing literature (Maurin Söderholm et al, 2019).
Although the primary purpose of the study was to evaluate DRIFT, the study also revealed some other findings that are noteworthy. For example, five factors (student, evaluator, dimension of the evaluation, category of the task and time of year) were analysed and the first three were found to have predictive power regarding the score whereas the other two did not.
In addition, the accumulation of experience gained on successive tests while training did not translate into improved scores. Possible explanations for this puzzling lack of effect may include: the duration of the study was not long enough to see an increase in student performance; the students had already plateaued in their ability to perform the tasks; or the evaluators changed their rating standards as the year progressed. Further research should be conducted to better understand this issue.
The findings of this study are at odds with other medical-based simulation findings (McLaughlin et al, 2009; Maniar, 2016; Yeates et al, 2019). However, these studies examined different populations (medical residents), had a different type of rater (medical doctors) and used a different tool for assessment (the objective structured clinical examination). Therefore, it may be difficult to draw any conclusions or parallels from this.
To the best of the authors’ knowledge, this was the first study to investigate paramedic raters and the possibility of DRIFT using the GRS. The findings have implications, as an increasing number of academic institutions and paramedic services use the GRS to evaluate paramedic students or when hiring them.
This study provides further evidence that using the GRS is fair and that DRIFT is not having a substantial effect on results.
Although no DRIFT was found or established, future studies should be conducted involving multiple academic centres and with more GRS simulation stations or evaluations to confirm that this remains true beyond 11 stations.
Limitations
There are a few potential limitations in this study that must be taken into consideration when interpreting the findings.
First, this study is limited by the fact that data were collected at one institution (Collège Boréal), during one academic year (2020–2021) and from one cohort of raters.
Second, this study is limited to a maximum of 11 stations so a DRIFT effect could exist beyond this. Additionally, all raters received one randomised station off during the course of the day.
Conclusion
As the profession of paramedicine continues to evolve and more simulation training is implemented, it is important to evaluate the curriculum, including the tools that are used to evaluate students (Joyce et al, 2009; Williams et al, 2016).
The primary purpose of this study was to explore if a DRIFT phenomenon existed in raters during successive testing over the day—essentially, a fatigue effect that occurs when rating using the GRS. This study failed to prove that a DRIFT exists but the possibility of a small effect cannot be discarded. As a result, this study continues to add to the evidence that the GRS is an effective and valid means of evaluating paramedic simulations. However, further multicentre studies with greater numbers of simulation stations should be conducted.