**FairCLIP: Harnessing Fairness in Vision-Language Learning**
**Authors:** Yan Luo, Min Shi, Muhammad Osama Khan, Muhammad Muneeb Afzal, Hao Huang, Shuaihang Yuan, Yu Tian, Luo Song, Ava Kouhana, Tobias Elze, Yi Fang, Mengyu Wang
**Institution:** Harvard Ophthalmology AI Lab, Harvard University; Tandon School of Engineering, New York University; Multimedia and Visual Computing Lab, New York University Abu Dhabi
**Abstract:**
Fairness is a critical concern in deep learning, especially in healthcare, where models influence diagnoses and treatment decisions. While fairness has been studied extensively in vision-only models, the fairness of medical vision-language (VL) models remains unexplored, owing to the scarcity of medical VL datasets suitable for studying fairness. To address this gap, the authors introduce the first fair vision-language medical dataset, Harvard-FairVLMed, which provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using this dataset, they conduct a comprehensive fairness analysis of two widely used VL models (CLIP and BLIP2) across four protected attributes: race, gender, ethnicity, and language. The results reveal significant biases in all VL models, with Asian, Male, Non-Hispanic, and Spanish being the preferred subgroups across the respective protected attributes. To mitigate these biases, the authors propose FairCLIP, an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions of each demographic group. The dataset and code are available at <https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k>.
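To make the mechanism concrete, below is a minimal PyTorch sketch of the optimal-transport idea described in the abstract: an entropy-regularized (Sinkhorn) distance between the batch-wide distribution of image-text similarity scores and each subgroup's distribution, added as a penalty to the standard CLIP contrastive loss. The function names (`sinkhorn_distance`, `fairclip_loss`), the cost choice, and the weight `lambda_fair` are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_distance(x, y, eps=0.1, n_iters=100):
    """Entropy-regularized OT cost between two 1-D empirical distributions
    of scores x (m,) and y (n,) with uniform weights (log-domain Sinkhorn)."""
    C = (x.view(-1, 1) - y.view(1, -1)).abs()          # pairwise transport cost
    m, n = C.shape
    log_a = torch.full((m,), -math.log(m), device=C.device)
    log_b = torch.full((n,), -math.log(n), device=C.device)
    f = torch.zeros(m, device=C.device)                # dual potentials
    g = torch.zeros(n, device=C.device)
    for _ in range(n_iters):                           # alternating dual updates
        f = eps * (log_a - torch.logsumexp((g.view(1, -1) - C) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f.view(-1, 1) - C) / eps, dim=0))
    P = torch.exp((f.view(-1, 1) + g.view(1, -1) - C) / eps)  # transport plan
    return (P * C).sum()

def fairclip_loss(logits_per_image, groups, lambda_fair=1e-4):
    """Symmetric CLIP contrastive loss plus a Sinkhorn penalty pulling each
    demographic subgroup's similarity-score distribution toward the batch-wide
    one. `lambda_fair` is an assumed weighting, not a value from the paper."""
    targets = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    clip_loss = 0.5 * (F.cross_entropy(logits_per_image, targets)
                       + F.cross_entropy(logits_per_image.t(), targets))
    sims = logits_per_image.diagonal()                 # matched image-text scores
    fair_penalty = sum(sinkhorn_distance(sims, sims[groups == g])
                       for g in torch.unique(groups))
    return clip_loss + lambda_fair * fair_penalty
```

The sketch handles one protected attribute's group labels at a time; in the paper's setting the penalty would be computed with respect to a chosen attribute such as race, gender, ethnicity, or language.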
**Contributions:**
1. Introduction of the first fair vision-language medical dataset (Harvard-FairVLMed) for studying the fairness of VL foundation models.
2. Comprehensive fairness analysis of CLIP and BLIP2 on Harvard-FairVLMed, revealing significant biases (a minimal evaluation sketch follows this list).
3. Proposal of FairCLIP, an optimal transport-based approach to enhance fairness in VL models.
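As a companion to contribution 2, the sketch below shows one simple way such subgroup disparities can be surfaced: per-group AUC and the max-min gap across groups. The function name and the choice of AUC as the metric are assumptions made for illustration; the paper's exact fairness metrics may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def groupwise_auc_gap(y_true, y_score, groups):
    """AUC per demographic subgroup plus the max-min gap across subgroups;
    a large gap suggests the model favors some subgroups over others."""
    aucs = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}
    return aucs, max(aucs.values()) - min(aucs.values())
```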
**Keywords:** Vision-language models, fairness, medical datasets, deep learning, healthcare applications