2024 | Cathy Ong Ly, Balagopal Unnikrishnan, Tony Tadic, Tirth Patel, Joe Duhamel, Sonja Kandel, Yasbanoo Moayedi, Michael Brudno, Andrew Hope, Heather Ross, Chris McIntosh
The article addresses the challenge of estimating how well healthcare AI models generalize when extensive external validation data are unavailable. It highlights that shortcut learning, in which models learn spurious correlations from hidden data acquisition biases (DAB), can cause model performance to be overestimated by up to 20%. The authors propose an open-source, bias-corrected external accuracy estimate, \( P_{Est} \), which calibrates for DAB-induced shortcut learning (DABIS) and improves external accuracy estimation by an average of 4%. The method shuffles the data to remove structural and semantic features, then uses the shuffled data to estimate DABIS and calibrate the model's expected performance on external datasets.
The study uses 13 datasets across five modalities (X-rays, CTs, ECGs, clinical discharge summaries, and lung auscultation data) to demonstrate the method's effectiveness, showing a significant reduction in the overestimation of model performance. The results underline the importance of addressing DAB to improve the generalizability of AI models in healthcare.
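The intuition behind the shuffling step can be illustrated with a small synthetic sketch. This is not the authors' released code and it does not compute \( P_{Est} \); the dataset generator, the threshold "model", and all function names are hypothetical stand-ins chosen for illustration. It assumes a DAB that appears as a label-correlated brightness offset, shuffles pixels within each image to destroy spatial structure, and checks how far above chance a model can still score on the shuffled data.

```python
import numpy as np

def shuffle_pixels(images, rng):
    """Independently permute the pixels of each image.

    Spatial structure is destroyed, but each image keeps its intensity values.
    """
    flat = images.reshape(len(images), -1)
    shuffled = np.stack([rng.permutation(row) for row in flat])
    return shuffled.reshape(images.shape)

def make_biased_dataset(n, rng, size=8, offset=0.5):
    """Hypothetical data: label-1 images carry a global brightness offset,
    standing in for an acquisition bias correlated with the label."""
    labels = rng.integers(0, 2, size=n)
    images = rng.normal(0.0, 1.0, size=(n, size, size))
    images += offset * labels[:, None, None]
    return images, labels

def fit_threshold(images, labels):
    """Toy stand-in model: threshold halfway between the class mean intensities."""
    means = images.reshape(len(images), -1).mean(axis=1)
    return 0.5 * (means[labels == 0].mean() + means[labels == 1].mean())

def accuracy(images, labels, threshold):
    """Fraction of images whose thresholded mean intensity matches the label."""
    means = images.reshape(len(images), -1).mean(axis=1)
    return float(((means > threshold).astype(int) == labels).mean())

rng = np.random.default_rng(0)
train_x, train_y = make_biased_dataset(400, rng)
test_x, test_y = make_biased_dataset(400, rng)

# Remove structural content; only non-structural cues such as brightness survive.
train_s = shuffle_pixels(train_x, rng)
test_s = shuffle_pixels(test_x, rng)

threshold = fit_threshold(train_s, train_y)
shuffled_acc = accuracy(test_s, test_y, threshold)

# Chance level here is the majority-class rate on the test labels.
chance = max(test_y.mean(), 1 - test_y.mean())
shortcut_signal = shuffled_acc - chance
print(f"shuffled accuracy={shuffled_acc:.3f}, chance={chance:.3f}, "
      f"excess over chance={shortcut_signal:.3f}")
```

Accuracy well above chance on shuffled inputs suggests the labels are predictable from non-structural cues alone, which is the kind of signal the summary describes as being used to estimate DABIS. How that signal is converted into a calibrated external accuracy estimate is defined by the authors' method and is not reproduced here.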