10 Jun 2024 | Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari
Merlin is a 3D vision-language model (VLM) designed for interpreting abdominal computed tomography (CT) scans. It leverages both structured electronic health record (EHR) data and unstructured radiology reports for supervision without requiring additional manual annotations. Merlin is trained on a large clinical dataset of paired CT scans (6.38 million images from 15,331 CTs), EHR diagnosis codes (1.84 million codes), and radiology reports (6.04 million tokens). It is evaluated on 6 task types and 752 individual tasks, including zero-shot findings classification, phenotype classification, and cross-modal retrieval, as well as model-adapted tasks such as 5-year disease prediction, radiology report generation, and 3D semantic segmentation.
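A minimal PyTorch sketch of this dual-supervision setup is below, assuming a CLIP-style contrastive objective between CT and report embeddings plus a multi-label binary cross-entropy head over EHR diagnosis codes. All names (image_encoder, text_encoder, ehr_head) are illustrative, not the paper's actual identifiers, and the exact loss weighting in Merlin may differ.

```python
import torch
import torch.nn.functional as F

def training_step(image_encoder, text_encoder, ehr_head,
                  ct_volumes, report_tokens, ehr_labels,
                  temperature=0.07):
    """One dual-supervision step: contrastive CT/report alignment plus
    multi-label EHR diagnosis-code prediction (hypothetical module names)."""
    feats = image_encoder(ct_volumes)                     # (B, D) image features
    img_emb = F.normalize(feats, dim=-1)
    txt_emb = F.normalize(text_encoder(report_tokens), dim=-1)

    # Symmetric InfoNCE: matched CT/report pairs sit on the diagonal.
    logits = img_emb @ txt_emb.T / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets)
                   + F.cross_entropy(logits.T, targets)) / 2

    # Multi-label supervision: one sigmoid per EHR diagnosis code;
    # ehr_labels is a float multi-hot tensor of shape (B, num_codes).
    ehr_loss = F.binary_cross_entropy_with_logits(ehr_head(feats), ehr_labels)

    return contrastive + ehr_loss
```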
Merlin outperforms existing task-specific baselines across the benchmark, with strong results in zero-shot findings classification, phenotype classification, and cross-modal retrieval. For zero-shot findings classification it achieves an average F1 score of 0.741 on internal validation and 0.647 on external validation. In 5-year disease prediction it reaches an AUROC of 0.757 when using 100% of downstream labels. It can also generate radiology reports from CT images, outperforming RadFM on multiple metrics, and in 3D semantic segmentation it surpasses the second-best model variant while using only 10% of the training cases.
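Zero-shot findings classification in this style is typically done by scoring the CT embedding against "present" and "absent" text prompts. The sketch below illustrates that general recipe; the prompt wording, temperature, and function names are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_findings(image_encoder, text_encoder, tokenize,
                       ct_volume, findings, temperature=0.07):
    """Score each finding as P(present) by comparing the CT embedding
    against 'present'/'absent' prompt embeddings (hypothetical names)."""
    img = F.normalize(image_encoder(ct_volume.unsqueeze(0)), dim=-1)  # (1, D)
    probs = {}
    for finding in findings:
        prompts = tokenize([f"{finding} is present", f"no {finding}"])
        txt = F.normalize(text_encoder(prompts), dim=-1)              # (2, D)
        sims = (img @ txt.T).squeeze(0) / temperature                 # (2,)
        probs[finding] = sims.softmax(dim=-1)[0].item()               # P(present)
    return probs
```

Thresholding these probabilities (e.g., at 0.5) yields the binary predictions behind the reported F1 scores.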
Merlin is trained on a single GPU, demonstrating the feasibility of training large 3D models with limited computational resources. The model's performance is influenced by factors such as I3D weight initialization, multi-task learning with EHR codes and radiology reports, and report splitting. The training strategy and performance are validated on internal and external datasets, including VerSe and TotalSegmentator. Merlin's ability to generalize across varied tasks and its efficient training strategy make it a promising foundation model for 3D medical imaging. Future work includes extending Merlin to additional anatomies and modalities, optimizing for radiology report generation and 3D segmentation, and exploring the benefits of pretraining across multiple anatomies and modalities.
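I3D weight initialization inflates pretrained 2D convolution kernels into 3D by repeating them along a new depth axis and rescaling, following Carreira and Zisserman's I3D recipe. A minimal sketch of that inflation step is below; Merlin's exact initialization may refine it.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, depth: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (out, in, kH, kW) into a 3D kernel
    (out, in, depth, kH, kW); dividing by depth preserves activation
    scale on inputs that are constant along the new axis."""
    return w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth

# Example: inflating a ResNet stem filter for a 3D CT encoder.
w2d = torch.randn(64, 3, 7, 7)           # pretrained 2D weights (illustrative)
w3d = inflate_conv_weight(w2d, depth=7)  # -> torch.Size([64, 3, 7, 7, 7])
```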