TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models

March 8, 2024 | Jenna Kefeli, Nicholas Tatonetti
The TCGA-Reports dataset is a publicly available, de-identified collection of 9,523 machine-readable pathology reports derived from The Cancer Genome Atlas (TCGA). It serves as a benchmark for researchers applying large language models (LLMs) and other text-based AI models to pathology. The reports were originally in PDF format; their text was extracted with optical character recognition (OCR) followed by post-processing, which makes natural language processing (NLP) methods applicable to tasks such as cancer-type classification. The text of pathology reports is more detailed and nuanced than structured electronic health record (EHR) data and contains valuable clinical information. The dataset is particularly useful for researchers who lack access to institution-specific or controlled-access corpora, and it can be combined with other TCGA data, including genomic and imaging data, to enhance analysis. It was validated through a proof-of-principle cancer-type classification task across 32 tissue types, which achieved an average AU-ROC of 0.992. The corpus covers a variety of cancer types, with breast invasive carcinoma the most prevalent. By providing a large, accessible, machine-readable corpus of pathology reports, TCGA-Reports is expected to facilitate further research in pathology and cancer.
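To make the reported metric concrete, here is a minimal sketch of how an average AU-ROC over several cancer types could be computed. It assumes a one-vs-rest setup with an unweighted mean over classes; the summary above does not state how the averaging was done, so this is an illustration rather than the authors' procedure. The function names, the three TCGA project codes, and the probability values are hypothetical toy inputs, not data from the dataset.

```python
from typing import Dict, List, Sequence


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC from the rank-sum (Mann-Whitney U) identity; ties share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative example")

    # Assign 1-based ranks in ascending score order, averaging within tie groups.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_ovr_auroc(y_true: Sequence[str], probs: Sequence[Dict[str, float]]) -> float:
    """Unweighted mean of one-vs-rest AUROCs, one per class present in y_true."""
    per_class: List[float] = []
    for cancer_type in sorted(set(y_true)):
        labels = [1 if y == cancer_type else 0 for y in y_true]
        scores = [p[cancer_type] for p in probs]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / len(per_class)


# Hypothetical toy predictions for four reports (values are made up).
toy_labels = ["BRCA", "BRCA", "LUAD", "KIRC"]
toy_probs = [
    {"BRCA": 0.8, "LUAD": 0.1, "KIRC": 0.1},
    {"BRCA": 0.4, "LUAD": 0.5, "KIRC": 0.1},
    {"BRCA": 0.5, "LUAD": 0.4, "KIRC": 0.1},
    {"BRCA": 0.1, "LUAD": 0.2, "KIRC": 0.7},
]

if __name__ == "__main__":
    print(f"macro one-vs-rest AUROC: {macro_ovr_auroc(toy_labels, toy_probs):.3f}")
```

An unweighted mean treats every cancer type equally regardless of how many reports it has, which matters for a corpus in which one type (breast invasive carcinoma) is the most prevalent; a prevalence-weighted mean would be an equally defensible choice.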