2024 | Giovanni Maria Iannantuono, Dara Bracken-Clarke, Fatima Karzai, Hyoyoung Choo-Wosoba, James L. Gulley, Charalampos S. Floudas
A cross-sectional study evaluated the performance of three large language models (LLMs)—ChatGPT-4, ChatGPT-3.5, and Google Bard—in answering immuno-oncology (IO) questions. The study assessed their ability to provide reproducible, accurate, relevant, and readable responses to 60 open-ended questions across four domains: mechanisms, indications, toxicities, and prognosis. ChatGPT-4 and ChatGPT-3.5 answered all questions, while Google Bard answered only 53.3% of them. The proportion of reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%). Accuracy was highest for ChatGPT-4 (75.4%), followed by ChatGPT-3.5 (58.5%) and Google Bard (43.8%). Relevance was higher for ChatGPT-3.5 (77.4%) and ChatGPT-4 (71.9%) than for Google Bard (43.8%). Readability was highest for ChatGPT-4 (100%), followed by ChatGPT-3.5 (98.1%) and Google Bard (87.5%).
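To make the reported percentages concrete, the short Python sketch below aggregates per-question expert ratings into summary metrics of the kind cited above. The RatedAnswer schema, field names, and denominator choices are illustrative assumptions for this sketch, not the authors' actual scoring instrument.

```python
from dataclasses import dataclass


@dataclass
class RatedAnswer:
    """Expert rating for one model's answer to one IO question (illustrative schema)."""
    answered: bool      # did the model return an answer at all?
    reproducible: bool  # same substance when the question was re-asked
    accurate: bool      # judged factually correct by expert reviewers
    relevant: bool      # judged on-topic for the question asked
    readable: bool      # judged understandable to the intended reader


def percentage(flags: list[bool]) -> float:
    """Share of True flags, as a percentage (0.0 for an empty list)."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summarize(ratings: list[RatedAnswer]) -> dict[str, float]:
    """Aggregate per-question ratings into study-style summary percentages.

    The answer rate is taken over all questions posed; the quality metrics
    are taken only over answered questions. These denominator choices are
    assumptions of this sketch, not the study's reported methodology.
    """
    answered = [r for r in ratings if r.answered]
    return {
        "answer_rate": percentage([r.answered for r in ratings]),
        "reproducible": percentage([r.reproducible for r in answered]),
        "accurate": percentage([r.accurate for r in answered]),
        "relevant": percentage([r.relevant for r in answered]),
        "readable": percentage([r.readable for r in answered]),
    }
```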
The study highlights that ChatGPT-4 and ChatGPT-3.5 are more effective than Google Bard at providing accurate and relevant information on immuno-oncology. However, all three LLMs carried risks of inaccuracy or incompleteness, underscoring the need for expert verification of their outputs. The study also notes that, while LLMs can be useful tools for providing information, their performance varies across tasks and domains, and the datasets used to train these models are not always up to date, which may affect their accuracy. The authors conclude that expert evaluation is essential for the clinical use of LLMs, as they may not always provide reliable information. The findings suggest that ChatGPT-4 and ChatGPT-3.5 are better suited for use in immuno-oncology than Google Bard, but further research is needed to fully understand their capabilities and limitations.