[slides and audio] AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

The paper introduces AI Hospital, a multi-agent framework designed to simulate dynamic medical interactions between a Doctor (player) and NPCs including Patient, Examiner, and Chief Physician. This setup aims to evaluate the performance of large language models (LLMs) in clinical scenarios. The authors develop the Multi-View Medical Evaluation (MVME) benchmark, which uses high-quality Chinese medical records to assess LLMs' ability to collect symptoms, recommend examinations, and make diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions among doctors. Despite improvements, current LLMs still exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. The findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. The data, code, and experimental results are open-sourced at https://github.com/LibertFan/AI_Hospital.

- **AI Hospital Framework**: A multi-agent framework that simulates real-world medical interactions, including dynamic conversations between the Doctor and NPCs.
- **Multi-View Medical Evaluation (MVME) Benchmark**: A benchmark that evaluates LLMs' performance in symptom collection, examination recommendations, and diagnoses using high-quality Chinese medical records.
- **Dispute Resolution Collaborative Mechanism**: A mechanism that facilitates iterative discussions among doctors to enhance diagnostic accuracy.
- **Performance Gaps**: Current LLMs perform significantly worse in multi-turn interactions compared to one-step approaches, with accuracy levels less than 50% of GPT-4.
- **Collaborative Mechanism**: The collaborative mechanism improves performance but falls short of the upper bound, suggesting that existing LLMs need more effective multi-turn diagnostic strategies.
- **Challenges**: LLMs struggle to pose pertinent questions, elicit crucial symptoms, and recommend appropriate medical examinations, highlighting the difficulties in replicating complex clinical reasoning processes.

- **Introduction**: Overview of the importance of AI in healthcare and the limitations of current LLMs in clinical diagnosis.
- **Setup of AI Hospital**: Detailed description of the AI Hospital framework, including agent setup and dialogue flow.
- **MVME Dataset Construction**: Collection and validation of Chinese medical records for the MVME benchmark.
- **Collaborative Diagnosis**: Introduction of a collaborative mechanism for improving diagnostic accuracy.
- **Experiments**: Analysis of agent behavior and evaluation of LLMs' performance in the AI Hospital framework.
- **Further Analysis**: Examination of collaboration mechanisms, error types, and ethical considerations.
- **Conclusion**: Summary of the main contributions and limitations of the research.

- **LLM-Powered Agents**: Previous efforts in creating agents for medical education and their limitations.
- **Medical Large Language Models**: Development and fine-tuning of LLMs in the medical domain.
- **Evaluation in Medicine AI**: Previous research on automated diagnostic methods and evaluation metrics.

- **AI Hospital Framework**: A novel multi-agent system for simulating medical interactions.
- **MVME Benchmark**: A comprehensive
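
To make the multi-turn setup described in the summary more concrete, here is a minimal Python sketch of one consultation: a Doctor repeatedly chooses between asking the Patient, requesting a result from the Examiner, or committing to a diagnosis. It is an illustration under stated assumptions, not the authors' implementation. Every name in it (`MedicalRecord`, `run_consultation`, `scripted_doctor`, the keyword-matching NPCs, the toy record) is hypothetical, the NPCs are rule-based stand-ins for LLM agents, and the Chief Physician's evaluation role is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical data model: none of these names come from the AI Hospital code.
@dataclass
class MedicalRecord:
    symptoms: Dict[str, str]      # topic keyword -> what the patient would say
    exam_results: Dict[str, str]  # examination name -> reported result
    diagnosis: str                # reference diagnosis, never shown to the Doctor


@dataclass
class Turn:
    speaker: str
    text: str


def patient_reply(record: MedicalRecord, question: str) -> str:
    """Patient NPC stand-in: reveals a symptom only if the question mentions it."""
    for topic, answer in record.symptoms.items():
        if topic in question.lower():
            return answer
    return "I'm not sure."


def examiner_reply(record: MedicalRecord, request: str) -> str:
    """Examiner NPC stand-in: returns a result only for a requested examination."""
    for exam, result in record.exam_results.items():
        if exam in request.lower():
            return result
    return "No result is available for that examination."


# A doctor policy maps the transcript so far to (action, content),
# where action is "ask", "exam", or "diagnose".
DoctorPolicy = Callable[[List[Turn]], Tuple[str, str]]


def run_consultation(
    doctor: DoctorPolicy, record: MedicalRecord, max_turns: int = 6
) -> Tuple[Optional[str], List[Turn]]:
    """Run the dialogue until the Doctor diagnoses or the turn budget runs out."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        action, content = doctor(transcript)
        transcript.append(Turn("Doctor", content))
        if action == "diagnose":
            return content, transcript
        if action == "ask":
            transcript.append(Turn("Patient", patient_reply(record, content)))
        elif action == "exam":
            transcript.append(Turn("Examiner", examiner_reply(record, content)))
    return None, transcript  # no diagnosis reached within the budget


def scripted_doctor(transcript: List[Turn]) -> Tuple[str, str]:
    """Fixed three-step policy standing in for an LLM-driven Doctor."""
    script = [
        ("ask", "Do you have a cough?"),
        ("exam", "Please order a chest x-ray."),
        ("diagnose", "pneumonia"),
    ]
    doctor_turns = sum(1 for t in transcript if t.speaker == "Doctor")
    return script[min(doctor_turns, len(script) - 1)]


# Toy record invented for this example; it is not taken from the MVME data.
example_record = MedicalRecord(
    symptoms={"cough": "Yes, for about a week.", "fever": "A mild fever at night."},
    exam_results={"chest x-ray": "Patchy opacity in the right lower lobe."},
    diagnosis="pneumonia",
)

if __name__ == "__main__":
    final, log = run_consultation(scripted_doctor, example_record)
    for turn in log:
        print(f"{turn.speaker}: {turn.text}")
    print("Matches reference:", final == example_record.diagnosis)
```

The sketch keeps the reference diagnosis on the record and out of the Doctor's view, so the Doctor can only learn what it explicitly asks for. That is the property that makes a multi-turn setting harder than a one-step setting in which the full record is given up front.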

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

28 Jun 2024 | Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, Jingren Zhou