RareBench: Can LLMs Serve as Rare Diseases Specialists?

August 25-29, 2024 | Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, Ting Chen
This paper introduces RareBench, a benchmark designed to evaluate the capabilities of large language models (LLMs) in rare disease diagnosis. Rare diseases affect approximately 300 million people globally, yet they are often difficult to diagnose: differentiating among many rare diseases is inherently complex, and experienced physicians are scarce. Recent cases, such as ChatGPT correctly diagnosing a rare disease after 17 doctors had failed, highlight the potential of LLMs in this area. RareBench assembles a comprehensive dataset of rare disease patients and introduces a dynamic few-shot prompting methodology to improve LLMs' diagnostic performance. It also includes a comparative study of GPT-4's diagnostic capabilities against those of specialist physicians, finding GPT-4's performance comparable to that of experienced doctors.

RareBench consists of four tasks: phenotype extraction from electronic health records (EHRs), screening for specific rare diseases, comparative analysis of common and rare diseases, and differential diagnosis among universal rare diseases. The dataset includes 2,185 patient cases from public sources and from Peking Union Medical College Hospital (PUMCH), covering 421 rare diseases. LLMs are evaluated on these tasks using metrics such as precision, recall, and F1-score, as sketched below.
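As an illustration of how the phenotype extraction task might be scored, here is a minimal sketch, assuming extracted phenotypes are represented as HPO term IDs and compared as sets against gold annotations; the term IDs and helper names are illustrative, not the paper's actual evaluation code.

```python
# Minimal sketch: set-level precision/recall/F1 for phenotype extraction.
# Phenotypes are represented as HPO term IDs; all data here is illustrative.

def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Compare extracted HPO terms against the annotated gold set."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)           # terms found in both sets
    precision = tp / len(predicted)      # fraction of extractions that are correct
    recall = tp / len(gold)              # fraction of gold terms recovered
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical case: the LLM extracted three terms, two of which are correct.
extracted = {"HP:0001250", "HP:0002321", "HP:0000522"}
gold = {"HP:0001250", "HP:0002321", "HP:0001263", "HP:0000007"}
p, r, f = prf1(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # 0.67 / 0.50 / 0.57
```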
The study also explores integrating a rare disease knowledge graph with an information content (IC) based random walk algorithm to strengthen LLMs' diagnostic capabilities. This approach significantly improves performance in differential diagnosis, in some cases even surpassing GPT-4's baseline performance. Overall, GPT-4 performs well across all tasks and attains the highest recall in differential diagnosis, but the study also underscores the limitations of LLMs, which cannot yet replace human expertise in rare disease diagnosis. The paper concludes that RareBench provides a valuable framework for evaluating LLMs on rare disease diagnosis and highlights the potential of integrating LLMs into clinical practice, while emphasizing the need for further research to address their limitations and ensure their safe and ethical use in healthcare. The study was approved by the Ethics Committees at Peking Union Medical College Hospital, Peking Union Medical College, and the Chinese Academy of Medical Sciences.
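The summary does not spell out the random walk itself, but the core intuition can be sketched under simplifying assumptions: estimate each phenotype's information content from its corpus frequency (rarer phenotypes carry more diagnostic signal), then use IC-weighted overlap to retrieve the most similar annotated patient cases as dynamic few-shot examples. Everything below, from the helper names to the toy HPO IDs, is illustrative rather than the paper's implementation, which walks the knowledge graph instead of using raw frequencies.

```python
import math
from collections import Counter

# Sketch of dynamic few-shot example selection, assuming a simplified
# frequency-based information-content (IC) weighting in place of the
# paper's full knowledge-graph random walk.

def information_content(corpus: list[set[str]]) -> dict[str, float]:
    """IC(t) = -log p(t), where p(t) is the fraction of cases annotated with t.
    Rare phenotypes get high IC and thus dominate the similarity score."""
    n = len(corpus)
    freq = Counter(t for case in corpus for t in case)
    return {t: -math.log(c / n) for t, c in freq.items()}

def weighted_similarity(a: set[str], b: set[str], ic: dict[str, float]) -> float:
    """IC-weighted Jaccard-style overlap between two phenotype sets."""
    shared = sum(ic.get(t, 0.0) for t in a & b)
    total = sum(ic.get(t, 0.0) for t in a | b)
    return shared / total if total else 0.0

def select_few_shot(query: set[str], corpus: list[set[str]], k: int = 3) -> list[int]:
    """Return indices of the k most similar patient cases to use in the prompt."""
    ic = information_content(corpus)
    ranked = sorted(range(len(corpus)),
                    key=lambda i: weighted_similarity(query, corpus[i], ic),
                    reverse=True)
    return ranked[:k]

# Toy corpus of annotated cases (HPO IDs are placeholders).
cases = [{"HP:0001250", "HP:0001263"},
         {"HP:0002321", "HP:0000522"},
         {"HP:0001250", "HP:0002321", "HP:0000007"}]
print(select_few_shot({"HP:0001250", "HP:0002321"}, cases))  # -> [2, 0, 1]
```

Selecting examples per query, rather than fixing one prompt for all patients, is what makes the few-shot setup "dynamic": each new case is accompanied by the historical cases most likely to share its diagnosis.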