CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation


22 Jan 2024 | Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily B. Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S. Chaudhari, Curtis Langlotz
Chest X-rays (CXRs) are the most frequently performed imaging tests in clinical practice. Recent advances in vision-language foundation models (FMs) have made automated CXR interpretation possible, which can assist physicians and improve patient outcomes. However, developing accurate FMs for CXR interpretation is challenging due to limited large-scale datasets, the lack of encoders specialized for the medical domain, and the absence of suitable evaluation frameworks. This work addresses these challenges by introducing *CheXinstruct*, a large-scale instruction-tuning dataset curated from 28 publicly available datasets. It also presents *CheXagent*, an instruction-tuned FM capable of analyzing and summarizing CXRs. CheXagent is built from a clinical large language model (LLM) for parsing radiology reports, a vision encoder for representing CXR images, and a network that bridges the vision and language modalities. Additionally, *CheXbench* is introduced as a novel benchmark for systematically evaluating FMs across 8 clinically relevant CXR interpretation tasks. Extensive evaluations on CheXbench demonstrate that CheXagent outperforms previously developed general- and medical-domain FMs. The project also includes a fairness evaluation across sex, race, and age to surface potential performance disparities, contributing to model transparency in healthcare AI.

The key contributions of this work are:

1. **CheXinstruct**: a large-scale instruction-tuning dataset with 6M instruction-image-answer triplets.
2. **CheXagent**: an instruction-tuned FM with 8B parameters for CXR interpretation.
3. **CheXbench**: a comprehensive benchmark for evaluating FMs across 8 clinically relevant tasks.

The results show that CheXagent significantly outperforms general-domain and medical-domain FMs on image perception tasks and generates high-quality medical text and summaries. The fairness analysis highlights the need for continued efforts to address biases in AI models used in healthcare.
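To make the three-component design described above more concrete, the sketch below shows, in PyTorch, how a vision encoder, a vision-language bridge, and an LLM embedding layer could be wired together and fed a CheXinstruct-style instruction-image-answer triplet. This is not the authors' implementation: the dimensions, the bridging MLP, the placeholder encoder and embedding layers, and the example triplet are all illustrative assumptions made for this sketch.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the real CheXagent components (vision encoder,
# vision-language bridge, clinical LLM) are not specified here, so this only
# illustrates how the three pieces fit together.
VISION_DIM = 1024   # size of patch embeddings from the vision encoder
LLM_DIM = 4096      # hidden size of the language model
NUM_PATCHES = 256   # number of image tokens produced per CXR


class VisionLanguageBridge(nn.Module):
    """Maps visual features into the LLM embedding space (a stand-in for
    CheXagent's bridging network; the actual architecture may differ)."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.proj(visual_feats)  # (batch, num_patches, llm_dim)


# Toy stand-ins for the real encoder and LLM so the sketch runs end to end.
vision_encoder = nn.Linear(196, VISION_DIM)        # placeholder CXR patch encoder
bridge = VisionLanguageBridge(VISION_DIM, LLM_DIM)
text_embedding = nn.Embedding(32000, LLM_DIM)      # placeholder LLM token embeddings

# One CheXinstruct-style (instruction, image, answer) triplet -- illustrative only.
triplet = {
    "instruction": "Generate the findings section for this chest X-ray.",
    "image": torch.randn(1, NUM_PATCHES, 196),     # fake patch features
    "answer": "The lungs are clear. No pleural effusion or pneumothorax.",
}

# Forward pass: image tokens are projected into the LLM embedding space and
# concatenated with the embedded instruction tokens before the LLM decodes the answer.
image_tokens = bridge(vision_encoder(triplet["image"]))           # (1, 256, 4096)
instruction_ids = torch.randint(0, 32000, (1, 32))                # fake tokenizer output
instruction_tokens = text_embedding(instruction_ids)              # (1, 32, 4096)
llm_input = torch.cat([image_tokens, instruction_tokens], dim=1)  # (1, 288, 4096)
print(llm_input.shape)
```

In this layout, instruction tuning amounts to training the bridge (and optionally the encoder and LLM) so that the answer text is generated conditioned on the concatenated image and instruction tokens.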