**RareBench: Can LLMs Serve as Rare Diseases Specialists?**
This paper introduces *RareBench*, a pioneering benchmark designed to evaluate the capabilities of Large Language Models (LLMs) in diagnosing rare diseases. Rare diseases affect approximately 300 million people worldwide, yet clinical diagnosis rates remain low due to a shortage of experienced physicians and the difficulty of differentiating among the many rare diseases. To address this gap, the authors develop *RareBench*, which pairs the largest open-source dataset of rare disease patients with a comprehensive benchmarking framework. They also introduce a dynamic few-shot prompting method that leverages a rare disease knowledge graph to improve LLMs' diagnostic performance.
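The summary does not reproduce the paper's actual retrieval routine, but the core idea of dynamic few-shot prompting can be illustrated briefly. The sketch below assumes an information-content-weighted phenotype overlap as the similarity measure between patients; the function names (`ic_weights`, `build_dynamic_prompt`) and data shapes are hypothetical, not the paper's released code:

```python
from collections import Counter
import math

def ic_weights(corpus_phenotypes):
    """Estimate the information content (IC) of each phenotype term from its
    corpus frequency: rarer phenotypes carry more diagnostic signal.
    Each case is assumed to list a term at most once."""
    counts = Counter(p for case in corpus_phenotypes for p in case)
    total = len(corpus_phenotypes)
    return {p: -math.log(c / total) for p, c in counts.items()}

def similarity(query, case, ic):
    """IC-weighted overlap between the query patient's phenotypes and a stored case."""
    shared = set(query) & set(case)
    return sum(ic.get(p, 0.0) for p in shared)

def build_dynamic_prompt(query_phenotypes, case_bank, ic, k=3):
    """Select the k most similar diagnosed cases and prepend them as few-shot exemplars."""
    ranked = sorted(case_bank,
                    key=lambda c: similarity(query_phenotypes, c["phenotypes"], ic),
                    reverse=True)
    shots = "\n\n".join(
        f"Phenotypes: {', '.join(c['phenotypes'])}\nDiagnosis: {c['diagnosis']}"
        for c in ranked[:k]
    )
    return f"{shots}\n\nPhenotypes: {', '.join(query_phenotypes)}\nDiagnosis:"
```

The point of selecting exemplars dynamically, rather than using a fixed few-shot set, is that the prompt then contains cases whose phenotype profiles actually resemble the patient under diagnosis.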
The study evaluates LLMs along four dimensions: phenotype extraction from electronic health records (EHRs), screening for specific rare diseases, comparative analysis of common and rare diseases, and differential diagnosis among universal rare diseases. The results show that while all LLMs perform poorly at phenotype extraction, GPT-4 achieves the best performance in screening for specific rare diseases and in differential diagnosis. The dynamic few-shot prompting method substantially improves performance, particularly in differential diagnosis, where GPT-4 outperforms the other models and, in some cases, specialist physicians.
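Differential-diagnosis benchmarks of this kind are typically scored by whether the true diagnosis appears among a model's top-ranked candidates. A minimal sketch of such a metric follows; the function name and the illustrative disease lists are assumptions for exposition, not data from the paper:

```python
def top_k_accuracy(predictions, gold, k=10):
    """Fraction of patients whose true diagnosis appears in the model's
    top-k ranked candidate diseases."""
    hits = sum(1 for ranked, truth in zip(predictions, gold) if truth in ranked[:k])
    return hits / len(gold)

# Illustrative example: the first patient's true diagnosis is ranked second
# (a hit at k=3); the second patient's is missing entirely (a miss).
preds = [["Fabry disease", "Gaucher disease", "Pompe disease"],
         ["Marfan syndrome", "Ehlers-Danlos syndrome", "Loeys-Dietz syndrome"]]
truth = ["Gaucher disease", "Homocystinuria"]
print(top_k_accuracy(preds, truth, k=3))  # 0.5
```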
The paper concludes by highlighting the potential of integrating LLMs into the clinical diagnostic process for rare diseases, paving the way for future advancements in this field. However, it also discusses limitations and ethical considerations, emphasizing the need for continuous refinement and adherence to health information privacy regulations.