24 Mar 2024 | Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
**CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification**
**Authors:** Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
**Institutional Affiliations:** University of Science and Technology of China, iFlytek Co., Ltd.
**Abstract:**
The advancement of Zero-Shot Learning (ZSL) in the medical domain has been driven by models pre-trained on large-scale image-text pairs, with a focus on image-text alignment. However, existing methods primarily rely on cosine similarity, which may not fully capture the complex relationship between medical images and reports. To address this gap, the authors introduce CARZero, a novel approach that uses cross-attention to process image and report features, creating a Similarity Representation (SimR) that more accurately reflects the intricate relationships in medical semantics. This representation is then linearly projected to form an image-text similarity matrix for cross-modality alignment. Additionally, CARZero incorporates a Large Language Model (LLM)-based prompt alignment strategy that standardizes diverse diagnostic expressions into a unified format for both training and inference. The approach demonstrates state-of-the-art performance in zero-shot classification on five official chest radiograph diagnostic test sets, including remarkable results on datasets with long-tail distributions of rare diseases.
**Contributions:**
- A novel cross-attention alignment for medical images and reports, utilizing SimR to articulate complex relationships.
- LLM-based prompt alignment to standardize diverse diagnostic expressions into a unified prompt format.
- State-of-the-art performance on five large-scale radiology diagnosis datasets, with significant improvements in diagnosing rare diseases.
**Methods:**
- **Feature Extraction:** Utilizes ViT-base for image features and BioBERT for text features.
- **Cross-Attention Alignment:** Generates SimR to represent the relationship between images and reports, optimized using InfoNCE loss.
- **LLM-based Prompt Alignment:** Integrates prompt templates into training data to align prompts during training and inference.
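The cross-attention alignment step above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single attention head with no learned query/key/value projections, only the text-to-image direction, random vectors standing in for ViT and BioBERT features, and a symmetric InfoNCE loss with a temperature of 0.07. All function names, `w`, and the temperature value are hypothetical choices made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def cross_attend(query, tokens):
    """Scaled dot-product attention of one query vector over a token sequence.

    The returned vector plays the role of SimR in this sketch (assumption:
    no learned projections, single head).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in tokens])
    return [sum(w * tok[k] for w, tok in zip(weights, tokens))
            for k in range(len(query))]

def similarity_matrix(text_queries, image_tokens, w, b=0.0):
    """sim[i][j]: report i attends to the patches of image j, then the
    resulting vector is linearly projected to a scalar with weights w."""
    return [[dot(cross_attend(q, toks), w) + b for toks in image_tokens]
            for q in text_queries]

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square matrix; matched pairs lie on the diagonal."""
    n = len(sim)

    def one_direction(rows):
        return -sum(log_softmax([s / temperature for s in row])[i]
                    for i, row in enumerate(rows)) / n

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))

# Toy batch: 4 image-report pairs, 5 patch tokens per image, feature size 8.
rng = random.Random(0)
dim, n_pairs, n_patches = 8, 4, 5
image_tokens = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
                for _ in range(n_pairs)]
text_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
w = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for the learned projection

sim = similarity_matrix(text_queries, image_tokens, w)
loss = info_nce(sim)
```

The point of the sketch is the contrast with cosine similarity: each entry of `sim` comes from an attention-weighted mixture of patch tokens that is then projected, rather than from a single dot product between two pooled embeddings.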
**Experiments:**
- Evaluates on five public datasets: MIMIC-CXR, Open-I, PadChest, ChestXray14, and CheXpert.
- Achieves a state-of-the-art AUC of 0.810 on PadChest and, on ChestXray14, surpasses existing methods that were fine-tuned on 1% of the data.
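To make the evaluation protocol concrete, the following sketch shows how per-disease AUC might be computed in a zero-shot setting: each label is turned into a prompt, every image receives a similarity score for that prompt, and AUC is computed per label and then averaged. The template string, the helper names, and the scores are illustrative assumptions; the numbers are toy values, not results from the paper.

```python
def build_prompts(labels, template="There is {}."):
    """Map each disease label to one unified prompt (hypothetical template)."""
    return {label: template.format(label.lower()) for label in labels}

def roc_auc(scores, targets):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(score_table, target_table):
    """Per-label AUC plus the unweighted mean across labels."""
    aucs = {label: roc_auc(score_table[label], target_table[label])
            for label in score_table}
    return aucs, sum(aucs.values()) / len(aucs)

prompts = build_prompts(["Pneumonia", "Edema"])

# Toy image-prompt similarity scores for four images (made up for illustration).
scores = {"Pneumonia": [0.9, 0.2, 0.7, 0.1], "Edema": [0.3, 0.8, 0.4, 0.6]}
targets = {"Pneumonia": [1, 0, 1, 0], "Edema": [0, 1, 1, 0]}

aucs, mean_auc = macro_auc(scores, targets)
print(prompts["Pneumonia"])  # There is pneumonia.
print(aucs, mean_auc)        # {'Pneumonia': 1.0, 'Edema': 0.75} 0.875
```

Averaging per-label AUCs without weighting gives rare diseases the same influence as common ones, which matters for long-tail label sets such as PadChest.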
**Conclusion:**
CARZero effectively captures the complex relationships between medical images and reports, achieving superior performance in zero-shot classification tasks. Future work could explore fine-tuning tasks and natural data to further validate the method's effectiveness.