HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis

23 Jun 2024 | Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Pérez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F.K. Williamson, Ahrong Kim, Faisal Mahmood
HEST-1k is a dataset of 1,108 spatial transcriptomics (ST) profiles, each paired with a whole-slide image (WSI) and metadata. It was compiled from 131 public and internal cohorts covering 25 organs, two species (Homo sapiens and Mus musculus), and 320 cancer samples from 25 cancer types. In total it contains 1.5 million expression-morphology pairs and 60 million nuclei. HEST-1k, HEST-Library, and HEST-Benchmark are freely accessible via GitHub.

Each sample comes with metadata such as species, cancer type, and organ, along with histology information such as image resolution and magnification. Gene expression data are unified into an AnnData object, which can be loaded with scanpy. Nuclei are segmented and each nucleus is classified into one of five categories: neoplastic epithelial, non-neoplastic epithelial, inflammatory, stromal, and necrotic.

HEST-1k is used for three main applications: benchmarking foundation models for histopathology (HEST-Benchmark), biomarker identification, and multimodal representation learning. Because every sample pairs tissue morphology (as seen in H&E) with local gene expression (as provided by ST), the dataset enables analysis of interactions and correlations between the two. It has been used to study morphological correlates of expression changes in invasive breast cancer and to visualize tumor heterogeneity at both the morphological and molecular levels. For multimodal representation learning, the CONCH model was fine-tuned on five Xenium invasive breast cancer cases to better encode the molecular landscape associated with disease-specific morphologies.

HEST-1k is a valuable resource for researchers in computational pathology and spatial transcriptomics, providing a comprehensive dataset for benchmarking and developing new methods in histopathology and biomarker discovery.
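The summary states that expression data are stored as AnnData objects that scanpy can load. A minimal sketch of that step might look like the following. The helper name `load_expression` and the idea that each sample is available as a local `.h5ad` file are assumptions made for illustration, not details confirmed by the summary; `scanpy.read_h5ad` is the standard scanpy reader for that file format.

```python
from pathlib import Path


def load_expression(h5ad_path):
    """Read one sample's expression data into an AnnData object.

    Hypothetical helper: the summary only says the data can be loaded
    with scanpy, not how the files are named or laid out.
    """
    path = Path(h5ad_path)
    # Fail early with a clear message if the file is not there.
    if not path.is_file():
        raise FileNotFoundError(f"No .h5ad file found at {path}")

    # Imported here so this module can be imported even on machines
    # where scanpy is not installed.
    import scanpy as sc

    # read_h5ad returns an AnnData object: rows (obs) are observations
    # such as spots, columns (var) are genes.
    return sc.read_h5ad(path)
```

With scanpy installed, calling `load_expression("sample.h5ad")` (the file name is a placeholder) would return an object whose `.X` holds the expression matrix, `.obs` the per-observation annotations, and `.var` the per-gene annotations.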
The dataset is hosted on HuggingFace and is released under the Attribution-NonCommercial-ShareAlike 4.0 International license. It is intended for research purposes only and must not be used for diagnostic procedures.
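To make the idea of correlating morphology with local gene expression concrete, here is a small sketch using only the Python standard library. It turns per-spot nucleus labels (the five categories named in the summary) into a neoplastic fraction and computes a Pearson correlation against one gene's expression. The spot data, the expression values, and the helper names are toy assumptions for illustration; this is not the analysis pipeline used by the authors.

```python
from collections import Counter
from math import sqrt

# The five nucleus categories named in the summary.
NUCLEUS_CLASSES = (
    "neoplastic epithelial",
    "non-neoplastic epithelial",
    "inflammatory",
    "stromal",
    "necrotic",
)


def class_fractions(labels):
    """Return the fraction of nuclei in each category for one spot."""
    unknown = set(labels) - set(NUCLEUS_CLASSES)
    if unknown:
        raise ValueError(f"Unknown nucleus labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in NUCLEUS_CLASSES}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data: (nucleus labels in the spot, expression of one hypothetical gene).
spots = [
    (["neoplastic epithelial"] * 8 + ["stromal"] * 2, 9.0),
    (["neoplastic epithelial"] * 5 + ["inflammatory"] * 5, 5.0),
    (["stromal"] * 6 + ["neoplastic epithelial"] * 2 + ["necrotic"] * 2, 3.0),
    (["stromal"] * 10, 1.0),
]

neoplastic_fraction = [class_fractions(labels)["neoplastic epithelial"] for labels, _ in spots]
expression = [value for _, value in spots]
r = pearson(neoplastic_fraction, expression)
print(f"Pearson r between neoplastic fraction and expression: {r:.3f}")
```

On real data the same pattern would apply per gene across all spots of a sample, with the labels coming from the dataset's nuclear classification and the expression values from the AnnData object.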