22 Jan 2025 | Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck
GREEN is a novel metric for evaluating radiology reports that leverages language models to identify and explain clinically significant errors in generated reports. It provides a score aligned with expert preferences, human-readable explanations of errors, and a lightweight, open-source method that matches commercial counterparts. The metric was validated against GPT-4 and expert error counts, showing higher correlation with expert evaluations and preferences than existing approaches. GREEN also produces a detailed free-text analysis of error explanations, enabling targeted improvements in model performance.

The method was tested on multiple imaging modalities, including chest X-rays and abdominal CT scans, and maintained robust performance on out-of-distribution (OOD) data. GREEN's open-source nature supports widespread use and collaboration, while its adaptability across imaging modalities and datasets encourages further research in medical AI. Its ability to maintain performance on OOD data highlights its versatility and potential as a standard for future work in automated radiology reporting.

However, the method has limitations, including processing time and the inherent uncertainty of error quantification. Overall, GREEN offers a more accurate and interpretable way to evaluate radiology reports than existing metrics.
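At a high level, an LLM judge of this kind counts the reference findings a generated report matches and the clinically significant errors it introduces, then collapses those counts into a single score. A minimal sketch of that aggregation step is below; the function name and the exact formula are illustrative assumptions, not GREEN's published definition.

```python
def green_like_score(matched_findings: int, significant_errors: int) -> float:
    """Collapse an LLM judge's counts into a single score in [0, 1].

    Illustrative formula (an assumption, not the paper's exact definition):
    the fraction of matched reference findings among matches plus
    clinically significant errors. Higher is better.
    """
    total = matched_findings + significant_errors
    if total == 0:
        return 0.0  # convention chosen here: nothing matched, nothing wrong
    return matched_findings / total


# Example: 8 findings matched, 2 significant errors -> 0.8
print(green_like_score(8, 2))
```

In practice the judge's free-text output would be parsed to extract these counts per error category, and the per-report scores averaged over a test set; the hedged formula above only shows the final aggregation.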