AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

28 Jun 2024 | Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, Jingren Zhou
This paper introduces AI Hospital, a multi-agent framework that simulates dynamic medical interactions between a Doctor and non-player characters (NPCs) including Patient, Examiner, and Chief Physician. The framework enables realistic assessments of large language models (LLMs) in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, which evaluates LLMs' performance in symptom collection, examination recommendations, and diagnoses using high-quality Chinese medical records. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

The AI Hospital framework consists of three NPC characters (the Patient, Examiner, and Chief Physician) and one player character, the Doctor. Each character assumes specific roles and responsibilities within the framework. The AI Hospital operates in two phases. In the diagnostic phase, Patient, Examiner, and Doctor engage in conversations to exchange information necessary for accurate diagnosis. The number of interaction turns in this phase can vary depending on the Doctor's diagnostic strategy. Subsequently, during the evaluation phase, Chief Physician is responsible for scoring the performance of Doctor in the diagnostic phase. The following sections will elaborate on the settings and construction methods for each agent in AI Hospital.

Medical records are a valuable resource for reconstructing the hospital visit experience and simulating real-world medical interactions.
By leveraging these records, we can reverse-engineer the diagnostic process and shape the behavior of agents within the AI Hospital framework. We categorize the information in each medical record into three types:

1) Subjective Information: the patient's symptoms, etiology, past medical history, habits, etc., which are primarily provided by the patient during verbal interactions with the doctor.
2) Objective Information: medical test reports such as complete blood counts, urinalysis, and chest X-rays. The presence of these data in a medical record indicates that the patient underwent these tests during the diagnostic process at the doctor's recommendation.
3) Diagnosis and Treatment: diagnostic results, diagnostic rationales, and treatment courses, which are the final conclusions made by the doctor during the diagnostic process, based on the combination of subjective and objective information.

These categories of information are assigned to the corresponding agents in the AI Hospital framework: Patient has access to the subjective information, Examiner is aware of the objective information, and Chief Physician possesses all information, while Doctor does not have access to any information. The AI Hospital framework assigns specific categories of information from medical records to each agent, shaping each agent's scope of information within the diagnostic process.

In the AI Hospital framework, we leverage GPT-3.5 to power Patient and Examiner, and GPT
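
The way record information is partitioned among the agents can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `MedicalRecord`, `VISIBILITY`, and `view_for` are hypothetical, and the record contents are placeholders rather than data from the benchmark. It only shows the access rule stated in the text, where Patient sees subjective information, Examiner sees objective information, Chief Physician sees everything, and Doctor sees nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedicalRecord:
    """Hypothetical container for the three information categories."""
    subjective: str  # symptoms, etiology, past medical history, habits
    objective: str   # test reports (blood count, urinalysis, chest X-ray)
    diagnosis: str   # diagnostic results, rationales, treatment course


# Which record fields each agent may read, following the text.
# The dictionary layout itself is an assumption made for illustration.
VISIBILITY = {
    "Patient": ("subjective",),
    "Examiner": ("objective",),
    "Chief Physician": ("subjective", "objective", "diagnosis"),
    "Doctor": (),
}


def view_for(agent: str, record: MedicalRecord) -> dict[str, str]:
    """Return only the record fields the given agent is allowed to see."""
    return {name: getattr(record, name) for name in VISIBILITY[agent]}


# Placeholder record; the values are invented stand-ins, not real data.
record = MedicalRecord(
    subjective="(placeholder) patient-reported symptoms and history",
    objective="(placeholder) examination and test reports",
    diagnosis="(placeholder) diagnosis, rationale, and treatment",
)

for agent in VISIBILITY:
    print(agent, "can see:", sorted(view_for(agent, record)))
```

Under this sketch the Doctor starts from an empty view, so anything it learns has to come from conversation with Patient and Examiner during the diagnostic phase, while Chief Physician's full view is what allows it to score the Doctor afterwards.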