11 Jul 2024 | Gabriele Campanella, Shengjia Chen, Ruchika Verma, Jennifer Zeng, Aryeh Stock, Matt Croken, Brandon Veremis, Abdulkadir Elmas, Kuan-lin Huang, Ricky Kwan, Jane Houldsworth, Adam J. Schoenfeld, Chad Vanderbilt
This paper presents a clinical benchmark for evaluating publicly available pathology foundation models trained using self-supervised learning (SSL). The benchmark includes a collection of pathology datasets from two medical centers, covering clinically relevant tasks such as disease detection, biomarker prediction, and treatment outcome prediction. The authors assess the performance of these models on a variety of tasks and provide insights into best practices for training new foundation models and selecting appropriate pre-trained models. Key findings include:
1. **Model Performance**: Models trained with DINO and DINOv2 generally outperform older models, with UNI and Prov-GigaPath showing superior performance on certain biomarker prediction tasks.
2. **Model Size and Resources**: Larger models did not significantly improve performance in disease detection tasks but showed better performance in biomarker prediction tasks, particularly for tissues overrepresented in the training datasets.
3. **Dataset Composition**: The composition of the pretraining dataset played a crucial role in downstream performance, especially for biomarker prediction tasks.
4. **Computational Resources**: Higher computational resources did not consistently lead to better performance in either disease detection or biomarker prediction tasks.
The paper emphasizes the importance of a systematic benchmark for comparing pathology foundation models and highlights the need for further research to optimize pretraining and improve generalizability. The benchmark is available on GitHub, and the authors plan to regularly update it with new models and tasks.
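For context, the evaluation paradigm summarized above typically amounts to extracting embeddings from a frozen pretrained encoder and fitting a lightweight classifier on top. The sketch below illustrates that pattern under assumptions not taken from the paper: a generic timm ViT stands in for the benchmarked foundation models, and random arrays stand in for the tile data and labels of a clinical task.

```python
# Minimal sketch: linear probing of a frozen pathology foundation model.
# Assumptions (not from the paper): weights are loaded via timm, tiles are
# provided as a NumPy array, and the downstream task has binary labels.
import numpy as np
import timm
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load a generic ViT backbone as a stand-in for a pathology foundation model
# (the benchmarked models, e.g. UNI or Prov-GigaPath, ship their own weights).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

def embed_tiles(tiles: np.ndarray, batch_size: int = 32) -> np.ndarray:
    """Embed (N, 3, 224, 224) float32 tiles with the frozen backbone."""
    feats = []
    with torch.no_grad():
        for i in range(0, len(tiles), batch_size):
            batch = torch.from_numpy(tiles[i:i + batch_size])
            feats.append(model(batch).cpu().numpy())
    return np.concatenate(feats, axis=0)

# Hypothetical data: random tiles and binary labels standing in for a
# disease-detection or biomarker-prediction task from the benchmark.
train_tiles = np.random.rand(64, 3, 224, 224).astype(np.float32)
test_tiles = np.random.rand(16, 3, 224, 224).astype(np.float32)
y_train = np.random.randint(0, 2, size=64)
y_test = np.random.randint(0, 2, size=16)

# Linear probe on frozen embeddings, scored with AUC as is common for
# tile- or slide-level clinical tasks.
probe = LogisticRegression(max_iter=1000).fit(embed_tiles(train_tiles), y_train)
auc = roc_auc_score(y_test, probe.predict_proba(embed_tiles(test_tiles))[:, 1])
print(f"Test AUC: {auc:.3f}")
```

Because the backbone stays frozen, the probe isolates the quality of the pretrained representations, which is what makes this setup a natural basis for comparing foundation models across tasks.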