The limits of fair medical imaging AI in real-world generalization

October 2024 | Yuzhe Yang, Haoran Zhang, Judy W. Gichoya, Dina Katabi & Marzyeh Ghassemi
This study investigates the extent to which medical imaging AI models rely on demographic shortcuts for disease classification, and what that reliance implies for fairness across subpopulations. The analysis spans three medical imaging disciplines (radiology, dermatology and ophthalmology) and six global chest X-ray datasets. The findings show that medical AI models do encode and exploit demographic information, producing unfair predictions across subgroups. Correcting these shortcuts can yield 'locally optimal' models that are fair within the original data distribution, but such models may not keep that fairness in new test settings. Strikingly, models that encode less demographic information are often more 'globally optimal', showing better fairness when deployed in new environments.

The study evaluates model fairness across demographic groups defined by race, sex and age, and finds significant disparities in false-positive and false-negative rates between subgroups. It also examines how distribution shift affects fairness, showing that fairness gaps do not transfer consistently between in-distribution (ID) and out-of-distribution (OOD) settings. A model that appears fair on held-out data from its training distribution may therefore be unfair at the site where it is actually deployed, which underscores the need for models that maintain both performance and fairness beyond their initial training context.

The study proposes model selection criteria that prioritize fairness in OOD settings and finds that models with minimal demographic attribute encoding tend to remain fairer under shift. It also highlights trade-offs between fairness and other clinical metrics: optimizing for fairness alone can degrade other clinically important measures. Together, the findings underscore the difficulty of building fair and effective AI models for healthcare and the necessity of comprehensive evaluations to ensure reliable and equitable outcomes.
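As a rough illustration of the kind of evaluation described above, the sketch below audits a set of candidate classifiers by computing per-subgroup false-positive and false-negative rates on both an ID and an OOD test set, then selects the model with the smallest OOD fairness gap. This is a minimal sketch, not the authors' actual pipeline: the function names, the fixed 0.5 decision threshold, the "largest pairwise difference" definition of the fairness gap, and the `models`/`id_data`/`ood_data` interfaces are assumptions made for illustration.

```python
import numpy as np

def subgroup_rates(y_true, y_score, groups, threshold=0.5):
    """Per-subgroup false-positive and false-negative rates for a binary classifier."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    y_true = np.asarray(y_true)
    groups = np.asarray(groups)
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        neg = (y_true == 0) & mask
        pos = (y_true == 1) & mask
        # FPR: predicted positives among true negatives; FNR: missed cases among true positives
        fpr = y_pred[neg].mean() if neg.any() else np.nan
        fnr = (1 - y_pred[pos]).mean() if pos.any() else np.nan
        rates[g] = {"FPR": fpr, "FNR": fnr}
    return rates

def fairness_gap(rates, metric="FNR"):
    """Largest pairwise difference in the chosen error rate across subgroups."""
    vals = [r[metric] for r in rates.values() if not np.isnan(r[metric])]
    return max(vals) - min(vals) if vals else np.nan

def select_by_ood_fairness(models, id_data, ood_data, metric="FNR"):
    """Pick the candidate model with the smallest fairness gap on the OOD set.

    `models` maps a name to a callable returning scores; `id_data` and `ood_data`
    are (X, y_true, groups) tuples. Hypothetical interface for illustration only.
    """
    audits = {}
    for name, predict in models.items():
        X_id, y_id, g_id = id_data
        X_ood, y_ood, g_ood = ood_data
        audits[name] = {
            "ID gap": fairness_gap(subgroup_rates(y_id, predict(X_id), g_id), metric),
            "OOD gap": fairness_gap(subgroup_rates(y_ood, predict(X_ood), g_ood), metric),
        }
    best = min(audits, key=lambda n: audits[n]["OOD gap"])
    return best, audits
```

Comparing the "ID gap" and "OOD gap" columns of such an audit makes the paper's central point concrete: the model with the smallest fairness gap in distribution is not necessarily the one with the smallest gap after a distribution shift, so selection should weigh the OOD behaviour directly.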