8 February 2024 | Diego G. Campos, Tim Fütterer, Thomas Gfrörer, Rosa Lavelle-Hill, Kou Murayama, Lars König, Martin Hecht, Steffen Zitzmann, Ronny Scherer
This study evaluates the performance of machine learning (ML) algorithms and heuristic stopping criteria for abstract screening in systematic reviews in education and educational psychology. Drawing on 27 systematic reviews, the researchers conducted a retrospective simulation to assess the sensitivity, specificity, and time savings of various ML algorithms and stopping rules. ML algorithms, particularly random forests combined with sentence bidirectional encoder representations from transformers (RF+SBERT), substantially reduced the screening workload, identifying 95% of relevant abstracts while flagging fewer irrelevant records. On average, the screening workload was reduced by 58%, with estimated time savings of 1.66 days.

The study also found that heuristic stopping criteria, such as stopping after classifying 20% of the records and 5% of the irrelevant papers, achieved a specificity of M = 42% (SD = 28%). However, the performance of these criteria depended on the learning algorithm and the proportion of relevant papers in the dataset. The study highlights the importance of incorporating semantic and contextual information into feature extraction and modeling for effective abstract screening, and it emphasizes the need for further research on the performance of ML algorithms in educational research synthesis.
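To make the screening-with-stopping idea concrete, here is a minimal Python sketch of a prioritized screening simulation with a heuristic stopping rule. It is not the paper's implementation: the scoring function stands in for an ML ranker (such as RF+SBERT), and the exact stopping rule (stop once at least 20% of records are screened and the most recently screened 5% contain no relevant abstract) is an illustrative interpretation of the criterion described above.

```python
def screen_with_stopping(records, score, min_frac=0.20, window_frac=0.05):
    """Simulate ML-prioritized abstract screening with a heuristic stop.

    records: list of bools, True meaning the abstract is relevant.
    score:   callable mapping a record index to a priority score,
             standing in for an ML ranker (an assumption here).
    Stopping rule (an illustrative assumption, not the paper's exact
    criterion): stop once at least `min_frac` of all records have been
    screened AND the most recent `window_frac` of screened records
    contained no relevant abstract.

    Returns (recall, workload_saved).
    """
    # Screen records in descending order of predicted relevance.
    order = sorted(range(len(records)), key=score, reverse=True)
    screened, found = [], 0
    for idx in order:
        screened.append(records[idx])
        found += records[idx]
        n = len(screened)
        window = max(1, int(window_frac * n))
        # Stop when enough has been screened and the recent window is all irrelevant.
        if n >= min_frac * len(records) and not any(screened[-window:]):
            break
    recall = found / max(1, sum(records))          # share of relevant abstracts found
    workload_saved = 1 - len(screened) / len(records)  # share of records never screened
    return recall, workload_saved


# Synthetic example: 10 relevant records ranked ahead of 90 irrelevant ones.
records = [True] * 10 + [False] * 90
recall, saved = screen_with_stopping(records, score=lambda i: -i)
```

In this idealized case the ranker puts every relevant record first, so the loop stops after screening 20% of the records with all relevant abstracts found; with a noisier ranker, recall and workload savings trade off, which mirrors the dependence on the learning algorithm reported above.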