19 Jan 2024 | Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Matthew P. Lungren, Maria Wetscherek, Noel Codella, Stephanie L. Hyland, Javier Alvarez-Valle, Ozan Oktay
RAD-DINO is a biomedical image encoder pre-trained solely on unimodal medical imaging data, achieving performance comparable to, or better than, state-of-the-art language-supervised models across a range of benchmarks. Unlike approaches that rely on text supervision, RAD-DINO uses self-supervised learning (SSL), combining masked image modelling (MIM) with contrastive learning. This enables the encoder to learn robust, generalizable features for downstream tasks such as image classification, semantic segmentation, and text report generation from images.

The encoder's performance scales with the size and diversity of the training data, and it outperforms language-supervised models in predicting patient demographics, suggesting utility for broader clinical applications. Ablation studies show that RAD-DINO's performance is influenced by input resolution, training dataset size, and the use of MIM. The encoder aligns well with clinical information, including patient records, and performs strongly on tasks requiring precise image analysis. Its features are also effective for vision-language alignment tasks, such as generating text reports from images. The study highlights the potential of image-only SSL for training scalable, general-purpose biomedical image encoders without reliance on text supervision.
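To make the image-only training signal concrete, below is a minimal sketch of a DINO-style self-distillation loss of the kind that underpins this family of SSL methods: a teacher network's centered, sharpened output distribution supervises a student network seeing a different view of the same image. This is an illustrative simplification, not the authors' exact objective; the function name, temperatures, and toy dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           center: torch.Tensor,
                           student_temp: float = 0.1,
                           teacher_temp: float = 0.04) -> torch.Tensor:
    """Simplified DINO-style loss (illustrative, not the paper's exact code).

    The teacher's centered, sharpened distribution supervises the student;
    gradients flow only through the student branch (teacher is detached).
    """
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    # Cross-entropy between teacher and student output distributions.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy usage: two augmented views of a batch of 8 images, 256-dim heads.
student_out = torch.randn(8, 256)
teacher_out = torch.randn(8, 256)
running_center = torch.zeros(256)  # in practice, an EMA of teacher outputs
loss = self_distillation_loss(student_out, teacher_out, running_center)
print(loss.item())
```

In the full recipe, the teacher's weights are an exponential moving average of the student's, and a patch-level MIM term is trained alongside this image-level objective.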
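For downstream use, the frozen encoder's class (CLS) token serves as a global embedding for classification probes, while patch tokens provide dense features for tasks like segmentation. A minimal feature-extraction sketch follows, assuming the checkpoint is published on the Hugging Face Hub under the microsoft/rad-dino identifier and loads as a DINOv2-style model; the input image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed Hub identifier; substitute your own checkpoint if it differs.
repo = "microsoft/rad-dino"
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModel.from_pretrained(repo).eval()

image = Image.open("chest_xray.png").convert("RGB")  # placeholder input path
inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)

# Global image embedding (CLS token) for linear-probe classification.
cls_embedding = outputs.pooler_output              # shape: (1, hidden_size)
# Patch-level features for dense tasks such as semantic segmentation.
patch_features = outputs.last_hidden_state[:, 1:]  # drop the CLS token

print(cls_embedding.shape, patch_features.shape)
```

A linear classifier trained on cls_embedding corresponds to the linear-probing evaluations summarized above, and the patch tokens can be reshaped into a 2-D grid to feed a segmentation head.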