28 Jun 2024 | Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xie, Fei Huang, Jingren Zhou
The paper introduces AI Hospital, a multi-agent framework designed to simulate dynamic medical interactions between a Doctor (player) and NPCs including Patient, Examiner, and Chief Physician. This setup aims to evaluate the performance of large language models (LLMs) in clinical scenarios. The authors develop the Multi-View Medical Evaluation (MVME) benchmark, which uses high-quality Chinese medical records to assess LLMs' ability to collect symptoms, recommend examinations, and make diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions among doctors. Despite improvements, current LLMs still exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. The findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. The data, code, and experimental results are open-sourced at https://github.com/LibertFan/AI_Hospital.
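The interaction flow is easiest to picture as a simple dialogue loop: the Doctor agent alternates between questioning the Patient, requesting results from the Examiner, and finally committing to a diagnosis. The sketch below is a minimal illustration of that loop; the `chat` helper, the `Agent` class, and routing markers such as `#Examination#` are illustrative assumptions, not names taken from the authors' code.

```python
# Illustrative sketch of the AI Hospital interaction loop (not the authors' code).
# Each agent wraps an LLM behind a role-specific system prompt; `chat` is a
# hypothetical helper that would call the underlying model.

from dataclasses import dataclass, field

def chat(system: str, history: list[str], message: str) -> str:
    """Hypothetical LLM call: returns the agent's next utterance."""
    raise NotImplementedError("plug in an LLM backend here")

@dataclass
class Agent:
    role: str            # "doctor", "patient", "examiner", "chief_physician"
    system_prompt: str   # role instructions + (for NPCs) the reference medical record
    history: list[str] = field(default_factory=list)

    def respond(self, message: str) -> str:
        reply = chat(self.system_prompt, self.history, message)
        self.history += [message, reply]
        return reply

def run_consultation(doctor: Agent, patient: Agent, examiner: Agent, max_turns: int = 10) -> str:
    """Doctor (player) interleaves symptom inquiry and examination requests,
    then commits to a final diagnostic report; NPCs answer from the hidden record."""
    last_reply = patient.respond("Hello doctor, I am not feeling well.")
    for _ in range(max_turns):
        action = doctor.respond(last_reply)
        if action.startswith("#Diagnosis#"):       # assumed stop marker
            return action                          # final diagnostic report
        elif action.startswith("#Examination#"):   # assumed routing marker
            last_reply = examiner.respond(action)  # return examination results
        else:
            last_reply = patient.respond(action)   # answer symptom questions
    return doctor.respond("Please give your final diagnosis now.")
```

In the framework described above, the NPCs are grounded in a reference medical record, which is what makes symptom collection and examination recommendation non-trivial for the Doctor in the multi-turn setting.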
- **AI Hospital Framework**: A multi-agent framework that simulates real-world medical interactions, including dynamic conversations between the Doctor and NPCs.
- **Multi-View Medical Evaluation (MVME) Benchmark**: A benchmark that evaluates LLMs' performance in symptom collection, examination recommendations, and diagnoses using high-quality Chinese medical records.
- **Dispute Resolution Collaborative Mechanism**: A mechanism that facilitates iterative discussions among doctors to enhance diagnostic accuracy (a minimal sketch appears at the end of this summary).
- **Performance Gaps**: Current LLMs perform significantly worse in multi-turn interactions than in one-step approaches, with accuracy levels below 50% of GPT-4's.
- **Collaborative Mechanism**: The collaborative mechanism improves performance but falls short of the upper bound, suggesting that existing LLMs need more effective multi-turn diagnostic strategies.
- **Challenges**: LLMs struggle to pose pertinent questions, elicit crucial symptoms, and recommend appropriate medical examinations, highlighting the difficulties in replicating complex clinical reasoning processes.
- **Introduction**: Overview of the importance of AI in healthcare and the limitations of current LLMs in clinical diagnosis.
- **Setup of AI Hospital**: Detailed description of the AI Hospital framework, including agent setup and dialogue flow.
- **MVME Dataset Construction**: Collection and validation of Chinese medical records for the MVME benchmark.
- **Collaborative Diagnosis**: Introduction of a collaborative mechanism for improving diagnostic accuracy.
- **Experiments**: Analysis of agent behavior and evaluation of LLMs' performance in the AI Hospital framework.
- **Further Analysis**: Examination of collaboration mechanisms, error types, and ethical considerations.
- **Conclusion**: Summary of the main contributions and limitations of the research.
- **LLM-Powered Agents**: Previous efforts in creating agents for medical education and their limitations.
- **Medical Large Language Models**: Development and fine-tuning of LLMs in the medical domain.
- **Evaluation in Medicine AI**: Previous research on automated diagnostic methods and evaluation metrics.
- **AI Hospital Framework**: A novel multi-agent system for simulating medical interactions.
- **MVME Benchmark**: A comprehensive evaluation benchmark built from high-quality Chinese medical records for assessing LLMs' symptom collection, examination recommendation, and diagnosis.
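To make the dispute-resolution idea concrete, the sketch below shows one way such iterative discussion among doctor agents could be wired up: each doctor diagnoses independently, and while a judge (for example, a chief-physician agent) finds the reports in disagreement, every doctor revises its report after reading the others'. The function, the prompts, and the consensus check are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the dispute-resolution collaborative mechanism:
# doctor agents diagnose independently, then iteratively revise their reports
# in light of peers' opinions until consensus or a round limit is reached.

def collaborative_diagnosis(doctors, case_summary, judge, max_rounds: int = 3) -> str:
    """`doctors` are Agent-like objects with a .respond(str) method (see the
    earlier sketch); `judge(reports)` returns True when the reports agree,
    e.g., via a chief-physician agent or a string-matching heuristic."""
    reports = [d.respond(f"Diagnose this case:\n{case_summary}") for d in doctors]

    for _ in range(max_rounds):
        if judge(reports):                  # consensus reached -> stop early
            break
        for i, d in enumerate(doctors):     # each doctor reviews peers' reports
            peer_reports = "\n\n".join(r for j, r in enumerate(reports) if j != i)
            reports[i] = d.respond(
                "Other doctors proposed the following diagnoses:\n"
                f"{peer_reports}\n"
                "Reconsider and state your revised diagnostic report."
            )
    # A chief-physician agent could synthesize the final report; returning the
    # first doctor's latest report here is only a placeholder.
    return reports[0]
```

The round limit keeps the discussion bounded, mirroring the paper's observation that collaboration improves accuracy but still falls short of the one-step upper bound.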