2025-05-27 | Samuel Schmidgall, Rojin Ziaei, Carl Harris, Ji Woong Kim, Eduardo Reis, Jeffrey Jopling, and Michael Moor
AgentClinic is a new multimodal agent benchmark for evaluating large language models (LLMs) in simulated clinical environments. It covers patient interactions, multimodal data collection, and tool use, enabling in-depth evaluation across nine medical specialties and seven languages. Unlike traditional static question-answering formats, the benchmark demands sequential decision-making, and solving MedQA problems in this interactive form proves substantially harder: diagnostic accuracy drops to as low as a tenth of the original. Agents based on Claude-3.5 outperform other models in most settings, but models differ starkly in their ability to use tools such as experiential learning, adaptive retrieval, and reflection cycles; Llama-3, for instance, improves by up to 92% (relative) when given the notebook tool. The benchmark also incorporates real-world electronic health records, clinical reader studies, bias perturbations, and novel patient-centric metrics.
AgentClinic simulates clinical environments with four kinds of language agents: patient, doctor, measurement, and moderator. It incorporates 24 clinically relevant biases, both cognitive and implicit, which can alter the dialogue and the decisions that follow. Patient agents are grounded in real clinical cases, and the environments span nine medical specialties and seven languages. Doctor agents can use tools such as web browsing, medical textbooks, reflection cycles, and note-taking, with effectiveness varying considerably across models. A sketch of how these agents interact appears below.
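To make the agent interplay concrete, here is a minimal Python sketch of one simulated consultation, assuming a generic chat-completion backend. All names here (`query_llm`, the prompt constants, the `REQUEST TEST:` / `DIAGNOSIS READY:` prefixes) are illustrative placeholders, not the benchmark's actual API; the paper does describe the doctor agent signalling test orders and the final diagnosis with special phrases that route control to the measurement and moderator agents.

```python
# Minimal sketch of an AgentClinic-style consultation loop (hypothetical names).

DOCTOR_PROMPT = (
    "You are a doctor interviewing a patient. Ask one question at a time. "
    "Order a test with 'REQUEST TEST: <test>'. When confident, answer with "
    "'DIAGNOSIS READY: <diagnosis>'."
)
PATIENT_PROMPT = "You are a patient with this history: {history}. Never reveal the diagnosis."
MEASUREMENT_PROMPT = "Return readings for the requested test from: {tests}."
MODERATOR_PROMPT = "Does the doctor's final diagnosis match '{gold}'? Answer yes or no."


def query_llm(system_prompt: str, dialogue: list[str]) -> str:
    """Placeholder for any chat LLM call (GPT-4, Claude-3.5, Llama-3, ...)."""
    raise NotImplementedError


def run_case(case: dict, doctor_prompt: str = DOCTOR_PROMPT, max_turns: int = 20) -> bool:
    """Run one simulated consultation; return True if the diagnosis is judged correct."""
    dialogue: list[str] = []
    for _ in range(max_turns):
        doctor_msg = query_llm(doctor_prompt, dialogue)
        dialogue.append(f"Doctor: {doctor_msg}")
        if doctor_msg.startswith("REQUEST TEST:"):
            # Measurement agent reads from the case's hidden test results.
            reading = query_llm(MEASUREMENT_PROMPT.format(tests=case["test_results"]), dialogue)
            dialogue.append(f"Measurement: {reading}")
        elif doctor_msg.startswith("DIAGNOSIS READY:"):
            # Moderator agent grades the final diagnosis against the gold label.
            verdict = query_llm(MODERATOR_PROMPT.format(gold=case["diagnosis"]), dialogue)
            return verdict.strip().lower().startswith("yes")
        else:
            # Patient agent answers from its persona without revealing the diagnosis.
            reply = query_llm(PATIENT_PROMPT.format(history=case["history"]), dialogue)
            dialogue.append(f"Patient: {reply}")
    return False  # turn budget exhausted without a committed diagnosis
```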
Models like GPT-4 and Claude-3.5 perform well overall, but their ability to exploit tools differs markedly. Diagnostic accuracy depends on how much interaction time the doctor agent is given and on which language model plays the patient. Bias evaluations show that cognitive and implicit biases significantly reduce diagnostic accuracy and worsen how the patient agent perceives the interaction. Specialist and multilingual cases reveal large gaps in diagnostic accuracy across medical specialties, and the tool evaluations again place Claude-3.5 first overall.
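The bias perturbations can be pictured as appending a short biasing instruction to an agent's system prompt before an episode runs, then re-measuring accuracy against the unbiased baseline. A hypothetical sketch, reusing `run_case` from above; the bias wordings paraphrase categories named in the paper (e.g. recency, confirmation) and are not the benchmark's exact prompts:

```python
# Hypothetical bias perturbation: append a biasing instruction to the doctor's
# system prompt, then compare diagnostic accuracy with the unbiased baseline.

DOCTOR_BIASES = {
    "recency": "You recently diagnosed several patients with similar symptoms "
               "with influenza, and this weighs heavily on your judgment.",
    "confirmation": "You seek only evidence that confirms your first impression.",
}


def with_bias(system_prompt: str, bias_name: str | None) -> str:
    """Return the system prompt, optionally perturbed with a named bias."""
    if bias_name is None:
        return system_prompt
    return f"{system_prompt}\n{DOCTOR_BIASES[bias_name]}"


def accuracy(cases: list[dict], bias_name: str | None = None) -> float:
    """Diagnostic accuracy over a case set, with or without a bias perturbation."""
    prompt = with_bias(DOCTOR_PROMPT, bias_name)
    return sum(run_case(case, doctor_prompt=prompt) for case in cases) / len(cases)

# e.g. compare accuracy(cases) against accuracy(cases, bias_name="recency")
```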
In human dialogue ratings, the doctor agent scores lower than the other agents, owing to issues such as weak opening statements and premature fixation on a single diagnosis. In the multimodal environment, models struggle to incorporate visual context into their diagnoses. Overall, AgentClinic offers a comprehensive evaluation platform for medical AI systems and underscores the need for evaluation strategies that go beyond static question-answering benchmarks.