2019 | Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, Steven Horng
MIMIC-CXR is a publicly available, de-identified database of chest radiographs and free-text radiology reports from 65,379 patients who presented to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016. The dataset includes 227,835 imaging studies and 377,110 images, with each study containing one or more images (usually a frontal and lateral view).
The reports, written by practicing radiologists, describe the radiological findings in the images. All data are de-identified to protect patient privacy and are freely available for research in computer vision, natural language processing, and clinical data mining. Creating the dataset required handling three distinct data modalities: electronic health record (EHR) data, images (chest radiographs), and natural language (free-text reports). The images were sourced from the hospital's picture archiving and communication system (PACS) in DICOM format, and the reports were extracted from the EHR in XML format. De-identification removed protected health information (PHI) from both the DICOM metadata and the pixel values, using a combination of custom algorithms and manual review. The dataset is organized into subfolders named according to anonymous patient identifiers; each patient folder contains one folder and one text file for each imaging study. The project was approved by the Institutional Review Board of BIDMC, and the requirement for individual patient consent was waived because of the project's non-clinical nature. The dataset is available on PhysioNet, where access is controlled and requires a data use agreement.
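The folder layout described above (one folder per patient, holding one image folder and one text report per study) can be traversed with a few lines of Python. The following is a minimal sketch, not the dataset's official tooling. It assumes that each study folder and its report share a name (for example a hypothetical `s1/` next to `s1.txt`) and that images carry a `.dcm` extension. The `Study` class and the function names are illustrative only.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class Study:
    """One imaging study: its images and, if present, its report file."""
    study_id: str
    images: List[Path]
    report: Optional[Path]


def list_studies(patient_dir: Path) -> List[Study]:
    """Collect the studies found in a single patient folder.

    Assumption: a study folder `X/` is paired with a sibling report `X.txt`.
    """
    studies = []
    for study_dir in sorted(p for p in patient_dir.iterdir() if p.is_dir()):
        # Assumption: DICOM images use the `.dcm` extension.
        images = sorted(study_dir.glob("*.dcm"))
        report = patient_dir / f"{study_dir.name}.txt"
        studies.append(
            Study(
                study_id=study_dir.name,
                images=images,
                report=report if report.is_file() else None,
            )
        )
    return studies


def index_dataset(root: Path) -> Dict[str, List[Study]]:
    """Map each patient folder name under `root` to its list of studies."""
    return {
        patient_dir.name: list_studies(patient_dir)
        for patient_dir in sorted(p for p in root.iterdir() if p.is_dir())
    }
```

Calling `index_dataset(Path("/path/to/local/copy"))` returns a dictionary keyed by patient folder name. A study whose report file is missing is kept with `report` set to `None` rather than dropped, so gaps in a partial download remain visible.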