10 Jun 2024 | Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari
**Merlin: A Vision Language Foundation Model for 3D Computed Tomography**
**Abstract:**
This paper introduces Merlin, a 3D vision-language foundation model designed to interpret abdominal CT scans. Over 85 million CT scans are performed annually in the US, with about one-quarter covering the abdomen. Given the shortage of radiologists, there is a growing need for artificial intelligence that can assist in interpreting these complex images and extracting new physiological insights. Current medical vision-language models (VLMs) are limited to 2D images and short reports, and do not leverage electronic health records (EHR) for supervision. To address this, Merlin integrates structured EHR data and unstructured radiology reports for training, without requiring additional manual annotations. The model is trained on a high-quality clinical dataset of paired CT scans, EHR diagnosis codes, and radiology reports. Comprehensive evaluations span six task types and 752 individual tasks: zero-shot findings classification, phenotype classification, cross-modal retrieval, 5-year disease prediction, radiology report generation, and 3D semantic segmentation. Merlin outperforms task-specific baselines across these evaluations. The paper also derives data scaling laws and presents ablation studies to understand the impact of various training strategies and data types. Training is performed on a single GPU, making the approach computationally efficient and accessible to health systems with limited resources. The authors plan to release their trained models, code, and dataset after removing all protected health information.
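To make the training recipe concrete, below is a minimal, illustrative sketch (not the authors' released code) of such a multi-task objective: a CLIP-style contrastive loss aligns each CT volume with its radiology report, while a classification head is supervised with multi-hot EHR diagnosis codes. The encoder interfaces, embedding dimension, temperature, and loss weighting are assumptions for illustration.

```python
# Sketch of a Merlin-style multi-task training step: contrastive image-report
# alignment plus binary cross-entropy over EHR diagnosis codes.
import torch
import torch.nn.functional as F

def training_step(ct_volumes, report_emb, icd_targets,
                  image_encoder, icd_head, temperature=0.07, ehr_weight=1.0):
    """ct_volumes: (B, 1, D, H, W) CT scans; report_emb: (B, E) text embeddings
    from a report encoder; icd_targets: (B, C) multi-hot diagnosis codes."""
    img_emb = F.normalize(image_encoder(ct_volumes), dim=-1)   # (B, E)
    txt_emb = F.normalize(report_emb, dim=-1)                  # (B, E)

    # Symmetric InfoNCE: match each scan to its own report and vice versa.
    logits = img_emb @ txt_emb.t() / temperature               # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # EHR supervision: predict each diagnosis code as an independent binary label.
    ehr_logits = icd_head(img_emb)                             # (B, C)
    ehr_loss = F.binary_cross_entropy_with_logits(ehr_logits, icd_targets.float())

    return contrastive + ehr_weight * ehr_loss
```

In this formulation the EHR codes provide free structured supervision alongside the reports, which is why no additional manual annotation is needed.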
**Main Contributions:**
1. **Training Strategy:** Merlin leverages structured EHR data and unstructured radiology reports for supervision, without requiring additional manual annotations.
2. **Comprehensive Evaluation:** Evaluations cover six task types and 752 individual tasks, including zero-shot classification, phenotype classification, cross-modal retrieval, disease prediction, report generation, and 3D segmentation.
3. **Performance:** Merlin outperforms task-specific baselines on all evaluated tasks.
4. **Ablation Studies:** Analyzes the impact of different training strategies and data types on model performance.
5. **Data Scaling Laws:** Derives data scaling laws to estimate how much training data is needed to reach a target performance on a given task (see the sketch after this list).
6. **Computational Efficiency:** Training is performed on a single GPU, making it accessible to health systems with limited computational resources.
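As a hedged illustration of how such a scaling law can be derived, the sketch below fits a power law, error ≈ a·N^(−b), to validation error measured at several training-set sizes and extrapolates to a target error. The sample numbers are made up for demonstration and are not the paper's results.

```python
# Fit a power-law data scaling curve in log-log space and extrapolate.
import numpy as np

n_scans = np.array([1_000, 5_000, 10_000, 25_000])   # training-set sizes (hypothetical)
val_error = np.array([0.32, 0.24, 0.21, 0.17])       # measured task error (hypothetical)

# Linear regression in log-log space: log(err) = log(a) - b * log(N).
slope, intercept = np.polyfit(np.log(n_scans), np.log(val_error), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate: how many scans are needed to reach a target error of 0.12?
target = 0.12
n_needed = (a / target) ** (1.0 / b)
print(f"error ~ {a:.2f} * N^(-{b:.2f}); need ~{n_needed:,.0f} scans for error {target}")
```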
**Discussion:**
Merlin's performance is enhanced by I3D weight initialization (inflating pretrained 2D convolution weights into 3D), multi-task learning with EHR codes and radiology reports, and splitting radiology reports into anatomy-specific sections. Future work could focus on increasing dataset size, improving image resolution, optimizing batch size, and extending to additional anatomies and imaging modalities. The study highlights the potential of 3D VLMs in assisting radiologists and enabling the discovery of new biomarkers.
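The I3D initialization mentioned above follows the inflation idea from Carreira and Zisserman's I3D work: each pretrained 2D kernel is replicated along the depth axis and rescaled so the filter response keeps roughly the same magnitude. Below is a minimal sketch of that idea for a PyTorch convolution layer; the function name and depth-rescaling choice are illustrative assumptions, not the authors' exact implementation.

```python
# Inflate a pretrained 2D convolution into a 3D convolution (I3D-style sketch).
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride), padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, depth, kH, kW); dividing by depth keeps
        # the summed response over the new axis equal to the 2D filter's response.
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2).expand_as(conv3d.weight) / depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

For a ResNet-style encoder, each 2D layer would be inflated this way before fine-tuning on CT volumes, e.g. `inflate_conv2d_to_3d(resnet.conv1, depth=7)`.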