03 April 2024 | Nikita Mehandru, Brenda Y. Miao, Eduardo Rodriguez Almaraz, Madhumita Sushil, Atul J. Butte & Ahmed Alaa
Recent advancements in large language models (LLMs) have opened new opportunities in healthcare, including information synthesis and clinical decision support. Unlike models assessed with traditional static benchmarks, LLMs can act as intelligent "agents" that interact with stakeholders in open-ended conversations and influence clinical decisions. These agents should be evaluated in high-fidelity clinical simulations, such as "Artificial Intelligence Structured Clinical Examinations" (AI-SCE), drawing on evaluation frameworks for other autonomous technologies, such as self-driving cars, that operate in dynamic environments.
The release of ChatGPT has brought LLMs into the spotlight, with models like Med-PaLM 2 performing at human expert level on medical questions. GPT-4 has shown potential in summarizing physician-patient encounters, achieving high scores on medical licensing exams, and generating clinical question-answer pairs. These models can perform complex tasks beyond traditional NLP benchmarks, such as multi-step reasoning and generating simulated clinical text.
LLM agents can be developed for various clinical use cases by providing access to clinical guidelines, databases, and other tools. These agents can autonomously retrieve information, perform multi-step analyses, and interact with other agents or external users. Healthcare systems are already integrating LLMs into patient messaging systems, and some medical centers are exploring "virtual-first" approaches where LLMs assist in patient triaging.
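To make the tool-using agent pattern concrete, here is a minimal sketch of such a loop in Python. Everything in it is a hypothetical stand-in rather than any particular vendor's API: `llm_complete` stubs the model call and `guideline_lookup` stubs a clinical knowledge source. A real agent would route `llm_complete` to a hosted or local model and wire in validated guideline databases.

```python
import json

def llm_complete(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion call; a real deployment
    would route this to a hosted or local clinical LLM."""
    if messages and messages[-1]["role"] == "tool":
        # After seeing a tool result, the stub produces a final answer.
        return f"Per guideline: {messages[-1]['content']}"
    # Otherwise the stub requests a tool via a structured JSON call.
    return json.dumps({"tool": "guideline_lookup",
                       "query": "community-acquired pneumonia, adult"})

def guideline_lookup(query: str) -> str:
    """Illustrative tool: a tiny lookup table standing in for a guideline database."""
    guidelines = {"community-acquired pneumonia, adult":
                  "Assess CURB-65; outpatient therapy is reasonable for scores 0-1."}
    return guidelines.get(query, "No guideline found.")

TOOLS = {"guideline_lookup": guideline_lookup}

def run_agent(user_message: str, max_steps: int = 3) -> str:
    """Core agent loop: query the model, execute any requested tool,
    append the result to the conversation, and repeat until a plain answer."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm_complete(messages)
        try:
            call = json.loads(reply)   # structured reply = tool request
        except json.JSONDecodeError:
            return reply               # plain text = final answer
        result = TOOLS[call["tool"]](call["query"])
        messages.append({"role": "tool", "content": result})
    return "Step budget exhausted."

print(run_agent("How should I manage a 58-year-old with suspected pneumonia?"))
```

The key design point is the loop itself: the model's replies are either structured tool calls, which the harness executes and feeds back, or free text, which ends the episode. That same loop is what an evaluation environment needs to intercept and log.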
To evaluate LLM-based chatbots, agent-based modeling (ABM) can be used to create simulated environments. ABM has been used in health policy, biology, and social sciences to study health behaviors and disease spread. Similarly, ABM can simulate clinical settings to evaluate how LLM agents interact with users, use tools, and handle errors.
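As a sketch of what such a simulated environment could look like, the Python below pairs a scripted patient agent (analogous to a standardized-patient actor in an OSCE) with a stub standing in for the LLM under test. The class names, case format, and turn logic are all illustrative assumptions, not a published harness.

```python
class SimulatedPatient:
    """Scripted patient agent: reveals one symptom per question,
    much like a standardized-patient actor in an OSCE."""
    def __init__(self, case: dict):
        self.symptoms = list(case["symptoms"])
        self.diagnosis = case["diagnosis"]

    def respond(self, question: str) -> str:
        return self.symptoms.pop(0) if self.symptoms else "No other complaints."

class LLMClinicianStub:
    """Stand-in for the LLM agent under test; a real harness would call the model."""
    def act(self, history: list) -> tuple:
        if len(history) < 3:
            return ("ask", "Can you tell me more about your symptoms?")
        return ("diagnose", "pneumonia")

def simulate_encounter(case: dict, clinician, max_turns: int = 5) -> dict:
    """Run one simulated encounter and log every exchange for later grading."""
    patient, history = SimulatedPatient(case), []
    for _ in range(max_turns):
        action, content = clinician.act(history)
        if action == "diagnose":
            return {"history": history, "prediction": content,
                    "correct": content == patient.diagnosis}
        history.append({"q": content, "a": patient.respond(content)})
    return {"history": history, "prediction": None, "correct": False}

case = {"symptoms": ["fever for three days", "productive cough",
                     "pleuritic chest pain"],
        "diagnosis": "pneumonia"}
print(simulate_encounter(case, LLMClinicianStub()))
```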
AI-SCE benchmarks, similar to OSCEs in medical education, can assess LLMs' ability to aid in real-world clinical workflows. These benchmarks should involve interdisciplinary teams and draw from real-world clinical tasks. AI-SCEs should evaluate both outputs and intermediate steps, capturing the agent's reasoning process and tool usage.
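One way to operationalize grading of intermediate steps is to record the agent's full trace and score it against a station rubric, reporting a process score (were the required steps taken?) separately from an outcome score (was the final answer right?). The trace and rubric formats below are assumptions made for illustration; actual AI-SCE rubrics would be authored by clinical experts.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    """Full record of one AI-SCE station: every intermediate step, not just the answer."""
    steps: list = field(default_factory=list)   # e.g. ("tool", "guideline_lookup")
    final_answer: str = ""

def score_station(trace: AgentTrace, rubric: dict) -> dict:
    """Process score: fraction of rubric-required tool uses observed in the trace.
    Outcome score: whether the final answer matches the expected one."""
    observed = {name for kind, name in trace.steps if kind == "tool"}
    required = set(rubric["required_tools"])
    return {
        "process": len(observed & required) / len(required),
        "outcome": float(trace.final_answer == rubric["expected_answer"]),
        "missed_steps": sorted(required - observed),
    }

trace = AgentTrace(steps=[("tool", "guideline_lookup"), ("reason", "CURB-65 = 1")],
                   final_answer="outpatient antibiotics")
rubric = {"required_tools": ["guideline_lookup", "lab_review"],
          "expected_answer": "outpatient antibiotics"}
print(score_station(trace, rubric))  # process 0.5, outcome 1.0, missed lab_review
```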
Evaluations should incorporate human evaluators, external datasets, and post-deployment monitoring to detect data distribution shifts and mitigate bias. Randomized controlled trials should compare performance in simulation environments against real-world settings.
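As one concrete example of the post-deployment monitoring described above, a drift statistic such as the population stability index (PSI) can compare the input distribution seen at validation time against the live population. This is a generic sketch, not a method prescribed by the authors; the 0.2 alert threshold is a common rule of thumb (an assumption here) that must be tuned per deployment.

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between a reference window and a live window; a common drift heuristic.
    Values above ~0.2 are often treated as meaningful shift (threshold is an
    assumption and should be tuned per deployment)."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) in empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(50, 10, 5000)   # e.g. patient ages at validation time
live = rng.normal(58, 12, 5000)       # post-deployment population has shifted
psi = population_stability_index(baseline, live)
print(f"PSI = {psi:.3f} -> investigate" if psi > 0.2 else f"PSI = {psi:.3f} -> stable")
```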
As LLMs evolve, benchmarks should shift from static datasets to dynamic simulations, moving from language modeling to agent modeling. This approach could benefit future LLM research and development for clinical applications.