June 2024 | Noel F. Ayoub, MD, MBA; Karthik Balakrishnan, MD, MPH; Marc S. Ayoub, MD; Thomas F. Barrett, MD; Abel P. David, MD; and Stacey T. Gray, MD
This study investigates the inherent bias in large language models (LLMs), specifically OpenAI's GPT-4, by simulating physicians making life-and-death decisions in resource-limited environments. Thirteen questions were designed to test the model's bias across various demographic characteristics, including race, gender, age, political affiliation, and sexual orientation. The simulation involved 1000 unique physicians and patients per question, with each physician choosing one patient to save. The results consistently showed that the simulated physicians favored patients with similar demographic characteristics, with most pairwise comparisons showing statistical significance (P<.05). Notably, nondescript physicians preferred younger, White, and male patients, while female physicians were more likely to save patients of similar demographic characteristics. Politically, Democratic physicians favored Black and female patients, and Republican physicians preferred White and male patients. Heterosexual and gay/lesbian physicians also showed biases in their preferences. The study highlights the potential negative impact of these biases on patient outcomes if LLMs are used without appropriate precautions in clinical care. The findings underscore the need for critical examination of the biases in LLMs and the importance of addressing implicit bias in healthcare and society.This study investigates the inherent bias in large language models (LLMs), specifically OpenAI's GPT-4, by simulating physicians making life-and-death decisions in resource-limited environments. Thirteen questions were designed to test the model's bias across various demographic characteristics, including race, gender, age, political affiliation, and sexual orientation. The simulation involved 1000 unique physicians and patients per question, with each physician choosing one patient to save. The results consistently showed that the simulated physicians favored patients with similar demographic characteristics, with most pairwise comparisons showing statistical significance (P<.05). Notably, nondescript physicians preferred younger, White, and male patients, while female physicians were more likely to save patients of similar demographic characteristics. Politically, Democratic physicians favored Black and female patients, and Republican physicians preferred White and male patients. Heterosexual and gay/lesbian physicians also showed biases in their preferences. The study highlights the potential negative impact of these biases on patient outcomes if LLMs are used without appropriate precautions in clinical care. The findings underscore the need for critical examination of the biases in LLMs and the importance of addressing implicit bias in healthcare and society.