5 Aug 2024 | Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, Ronald Cheong Kin Chan, Hao Chen
This paper introduces a novel whole-slide pretraining paradigm called Multimodal Self-TAught PRetraining (mSTAR) for computational pathology (CPath). mSTAR leverages a large multimodal dataset, including H&E diagnostic whole slide images (WSIs), pathology reports, and RNA-Seq data, to enhance the performance of pathology foundation models (FMs). The dataset consists of 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types. mSTAR's two-stage pretraining process first injects multimodal knowledge into a slide aggregator through slide-level contrastive learning, then uses self-taught training to propagate this knowledge to the patch extractor. This approach broadens the modeling context from unimodal to multimodal and from patch level to slide level, substantially improving performance on downstream tasks. Extensive experiments across 7 types of applications and 43 subtasks demonstrate that mSTAR consistently outperforms state-of-the-art (SOTA) FMs, with statistically significant differences. mSTAR excels in slide classification, survival analysis, multimodal fusion, few-shot and zero-shot classification, and pathological report generation, showcasing its robustness and generalization capabilities. The integration of multimodal data, particularly pathology reports and gene expression profiles, enhances the model's ability to capture complex interactions and improves performance across diverse clinical tasks.
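To make the stage-1 idea concrete, below is a minimal sketch (not the authors' code) of slide-level contrastive pretraining as described above: an aggregator pools patch features into a slide embedding, which is then aligned with pathology-report and gene-expression embeddings via a symmetric InfoNCE-style loss. The attention-pooling aggregator, CLIP-style loss, module names, and dimensions are illustrative assumptions, not details confirmed by the paper; stage 2 (self-taught propagation of this knowledge into the patch extractor) is omitted.

```python
# Hypothetical sketch of stage-1 slide-level multimodal contrastive learning.
# Architecture and loss choices are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAggregator(nn.Module):
    """Pools a bag of patch features into a single normalized slide embedding."""

    def __init__(self, feat_dim: int = 1024, embed_dim: int = 512):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, 256), nn.Tanh(), nn.Linear(256, 1))
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, feat_dim) for one slide
        weights = torch.softmax(self.attn(patch_feats), dim=0)   # (num_patches, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)          # (feat_dim,)
        return F.normalize(self.proj(slide_feat), dim=-1)        # (embed_dim,)


def contrastive_loss(slide_emb: torch.Tensor, other_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between slide embeddings and another modality's embeddings."""
    logits = slide_emb @ other_emb.t() / temperature              # (B, B)
    targets = torch.arange(slide_emb.size(0), device=slide_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    torch.manual_seed(0)
    aggregator = AttentionAggregator()
    # Toy batch: 4 slides with varying patch counts. In practice, report and
    # expression embeddings would come from a text encoder and an RNA-Seq encoder.
    slides = [torch.randn(n, 1024) for n in (50, 80, 30, 60)]
    slide_embs = torch.stack([aggregator(s) for s in slides])     # (4, 512)
    report_embs = F.normalize(torch.randn(4, 512), dim=-1)
    expr_embs = F.normalize(torch.randn(4, 512), dim=-1)
    loss = contrastive_loss(slide_embs, report_embs) + contrastive_loss(slide_embs, expr_embs)
    print(f"stage-1 multimodal contrastive loss: {loss.item():.4f}")
```

Pairing the slide embedding against both reports and expression profiles in the same batch is one plausible way to inject multimodal knowledge into the aggregator; the knowledge-enriched aggregator would then supervise the patch extractor in the self-taught second stage.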