22 Jan 2025 | Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck
GREEN is a novel metric for evaluating radiology reports that leverages language models to identify and explain clinically significant errors in generated reports. It provides a score aligned with expert preferences, human-readable explanations of errors, and a lightweight, open-source method that matches commercial counterparts. The metric was validated against GPT-4 and expert error counts, showing higher correlation with expert evaluations and preferences than existing approaches. GREEN also produces a detailed free-text analysis of error explanations, enabling targeted improvements in model performance.

The method was tested on multiple imaging modalities, including chest X-rays and abdominal CT scans, and maintained robust performance on out-of-distribution (OOD) data. GREEN's open-source nature supports widespread use and collaboration, while its adaptability across imaging modalities and datasets encourages further research in medical AI. Its ability to maintain performance on OOD data highlights its versatility and potential as a standard for future work in automated radiology reporting.

However, the method has limitations, including processing time and the inherent uncertainty of error quantification. Overall, GREEN offers a more accurate and interpretable way to evaluate radiology reports than existing metrics.
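At a high level, an LLM judge of this kind counts the reference findings a generated report matches and the clinically significant errors it introduces, then collapses those counts into a single score. A minimal sketch of that aggregation step is below; the function name and the exact formula are illustrative assumptions, not GREEN's published definition.

```python
def green_like_score(matched_findings: int, significant_errors: int) -> float:
    """Collapse an LLM judge's counts into a single score in [0, 1].

    Illustrative formula (an assumption, not the paper's exact definition):
    the fraction of matched reference findings among matches plus
    clinically significant errors. Higher is better.
    """
    total = matched_findings + significant_errors
    if total == 0:
        return 0.0  # convention chosen here: nothing matched, nothing wrong
    return matched_findings / total


# Example: 8 findings matched, 2 significant errors -> 0.8
print(green_like_score(8, 2))
```

In practice the judge's free-text output would be parsed to extract these counts per error category, and the per-report scores averaged over a test set; the hedged formula above only shows the final aggregation.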