RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision

19 Jan 2024 | Fernando Pérez-García*¹, Harshita Sharma*¹, Sam Bond-Taylor*¹, Kenza Bouzid¹, Valentina Salvatelli¹, Maximilian Ilse¹, Shruthi Bannur¹, Daniel C. Castro¹, Anton Schwaighofer¹, Matthew P. Lungren², Maria Wetscherek¹, Noel Codella¹, Stephanie L. Hyland¹, Javier Alvarez-Valle¹, and Ozan Oktay¹
RAD-DINO is a biomedical image encoder pre-trained solely on unimodal biomedical imaging data, challenging the prevailing reliance on language supervision for learning general-purpose biomedical imaging encoders. The study demonstrates that RAD-DINO matches or exceeds state-of-the-art biomedical language-supervised models on a range of benchmarks, including image classification, semantic segmentation, and text report generation. Key findings include:

1. **Independence from text supervision**: RAD-DINO performs well without paired image-text datasets, highlighting the limitations of text supervision in capturing detailed and diverse medical information.
2. **Strong correlation with clinical data**: RAD-DINO's learned representations correlate better with patient metadata, such as sex and age, than those of language-supervised models, since such details are often omitted from radiology reports.
3. **Scalability with training data**: RAD-DINO's performance improves with the quantity and diversity of training data, showing that image-only supervision is a scalable approach for training foundational biomedical image encoders.

The study also includes ablation studies on the factors contributing to RAD-DINO's performance, such as domain transfer, the role of masked image modelling (MIM), and input image resolution. Overall, RAD-DINO offers a robust and scalable approach to training biomedical image encoders, with potential applications in broader clinical contexts.
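A common way such frozen image encoders are evaluated on classification benchmarks is linear probing: the pre-trained backbone is kept fixed and only a lightweight linear head is trained on its features. The sketch below illustrates that setup in PyTorch. It is a minimal, hypothetical example: `DummyViTEncoder`, the number of findings, and all hyperparameters are illustrative placeholders, not the authors' model or code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a frozen ViT-style encoder: any backbone that
# maps an image batch to a global embedding fits the same protocol.
class DummyViTEncoder(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.embed_dim = embed_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify
            nn.AdaptiveAvgPool2d(1),                             # pool patch tokens
            nn.Flatten(),                                        # (B, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

# Linear probe: freeze the encoder, train only a linear classification head.
encoder = DummyViTEncoder()
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

num_findings = 5  # e.g. multi-label chest X-ray findings (illustrative)
head = nn.Linear(encoder.embed_dim, num_findings)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # multi-label classification

# One toy training step on random tensors standing in for chest X-rays.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8, num_findings)).float()

with torch.no_grad():
    features = encoder(images)   # frozen, image-only features
logits = head(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(f"linear-probe loss: {loss.item():.4f}")
```

The same frozen features can be reused for other heads (e.g. a segmentation decoder or a report-generation language model), which is what makes a general-purpose image encoder attractive.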