2025-5-27 | Samuel Schmidgall, Rojin Ziaei, Carl Harris, Ji Woong Kim, Eduardo Reis, Jeffrey Jopling, Michael Moor
AgentClinic is a multimodal agent benchmark designed to evaluate large language models (LLMs) in simulated clinical environments. It aims to address the limitations of existing benchmarks, which often rely on static question-answering tasks that do not accurately reflect the complex, sequential nature of clinical decision-making. AgentClinic includes patient interactions, multimodal data collection under incomplete information, and the use of various tools, providing an in-depth evaluation across nine medical specialties and seven languages.
Key contributions of AgentClinic include:
1. **Multimodal and Interactive Evaluation**: The benchmark simulates clinical environments with four interacting language agents (a doctor agent, a patient agent, a measurement agent, and a moderator), enabling interactive, dialogue-driven evaluation; a minimal sketch of this loop appears after this list.
2. **Bias Simulation**: Agents are instructed to exhibit 23 different biases, including cognitive and implicit biases, to assess their impact on diagnostic accuracy and patient perceptions.
3. **Tool Integration**: Agents can use tools such as adaptive retrieval, reflection cycles, and notebook editing, which significantly affect their performance.
4. **Multilingual and Specialist Cases**: The benchmark includes cases from nine medical specialties and seven languages, providing a diverse and realistic evaluation environment.
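To make the agent roles in item 1 concrete, the following is a minimal sketch of how such a dialogue loop could be wired up. It is not the authors' implementation: the trigger phrases (`REQUEST TEST:`, `DIAGNOSIS READY:`), the case dictionary fields, and the `LLM` callable interface are illustrative assumptions, and the measurement agent is reduced to a lookup over the case's listed test results.

```python
from typing import Callable

# Hypothetical interface for any chat backend: takes a system prompt plus the
# dialogue so far and returns the agent's next utterance.
LLM = Callable[[str, list[str]], str]

def run_case(case: dict, doctor_llm: LLM, patient_llm: LLM, max_turns: int = 20) -> bool:
    """Simulate one consultation and report whether the final diagnosis matches."""
    doctor_sys = (
        "You are a doctor interviewing a patient. Ask questions, order tests "
        "with 'REQUEST TEST: <name>', and finish with 'DIAGNOSIS READY: <dx>'."
    )
    patient_sys = (
        f"You are a patient with this history: {case['history']}. "
        "Answer the doctor's questions; do not reveal the diagnosis."
    )
    dialogue: list[str] = []

    for _ in range(max_turns):
        doctor_msg = doctor_llm(doctor_sys, dialogue)
        dialogue.append(f"Doctor: {doctor_msg}")

        if "DIAGNOSIS READY:" in doctor_msg:
            # Moderator: compare the proposed diagnosis against the gold label.
            proposed = doctor_msg.split("DIAGNOSIS READY:")[-1].strip().lower()
            return case["diagnosis"].lower() in proposed

        if "REQUEST TEST:" in doctor_msg:
            # Measurement agent: only return results the case actually defines.
            test = doctor_msg.split("REQUEST TEST:")[-1].strip()
            dialogue.append(f"Measurement: {test} -> {case.get('tests', {}).get(test, 'Normal')}")
        else:
            dialogue.append(f"Patient: {patient_llm(patient_sys, dialogue)}")

    return False  # turn budget exhausted without a diagnosis
```

Any chat-completion backend can be plugged in as `doctor_llm` and `patient_llm`; averaging `run_case` over a set of cases yields a diagnostic accuracy for the interactive setting rather than a static question-answering score.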
The study evaluates 11 models, including Claude-3.5, GPT-4, and Llama-3, and finds that solving MedQA problems in AgentClinic's sequential decision-making format is substantially harder: diagnostic accuracy can drop to below a tenth of the original MedQA accuracy. Claude-3.5 outperforms the other LLM backbones in most settings, but agents differ considerably in how well they use tools; Llama-3, for example, improves by up to 92% (relative) with the notebook tool.
The benchmark also introduces patient-centric metrics, such as patient compliance and consultation ratings, to evaluate the quality of care perceived by patients. Human clinicians rated the dialogues from AgentClinic-MedQA, providing insights into the realism and empathy of the agents.
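Below is a minimal sketch of how such patient-reported metrics could be scored once a dialogue ends. The survey prompt, keywords, and rating scale are assumptions made for illustration, not the paper's exact protocol.

```python
import statistics
from typing import Callable

# Same hypothetical chat interface as in the earlier sketch.
LLM = Callable[[str, list[str]], str]

def patient_feedback(dialogue: list[str], patient_llm: LLM) -> tuple[bool, int]:
    """Ask the simulated patient to rate the finished consultation."""
    survey = (
        "The consultation is over. Reply on two lines:\n"
        "COMPLIANT: yes/no (would you follow this doctor's advice?)\n"
        "RATING: 1-10 (how satisfied were you with the consultation?)"
    )
    reply = patient_llm(survey, dialogue).lower()
    compliant = "yes" in reply.split("compliant:")[-1].split("\n")[0]
    rating = int("".join(ch for ch in reply.split("rating:")[-1] if ch.isdigit()) or 0)
    return compliant, rating

def aggregate(feedback: list[tuple[bool, int]]) -> dict:
    """Benchmark-level summaries: compliance rate and mean consultation rating."""
    return {
        "compliance_rate": sum(c for c, _ in feedback) / len(feedback),
        "mean_rating": statistics.mean(r for _, r in feedback),
    }
```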
Overall, AgentClinic advances the field by offering a more comprehensive and interactive evaluation platform for medical AI systems, emphasizing the need for novel evaluation strategies beyond static question-answering benchmarks.