Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

2024 | Giovanni Maria Iannantuono, Dara Bracken-Clarke, Fatima Karzai, Hyoyoung Choo-Wosoba, James L. Gulley, Charalampos S. Floudas
This cross-sectional study evaluates the performance of three large language models (LLMs) (ChatGPT-4, ChatGPT-3.5, and Google Bard) in answering questions related to immuno-oncology (IO). The study generated 60 open-ended questions across four domains: mechanisms, indications, toxicities, and prognosis. Responses were collected and assessed for reproducibility, accuracy, relevance, and readability by two independent reviewers. Key findings include:

- ChatGPT-4 and ChatGPT-3.5 answered all questions, while Google Bard answered only 53.3%.
- ChatGPT-4 and ChatGPT-3.5 had higher reproducibility rates (95% and 86.3%, respectively) than Google Bard (50%).
- Accuracy was higher for ChatGPT-4 (78.4%) and ChatGPT-3.5 (58.5%) than for Google Bard (43.8%).
- Google Bard had the lowest relevance score (43.8%), compared with ChatGPT-4 (71.9%) and ChatGPT-3.5 (77.4%).
- Readability was higher for ChatGPT-4 and ChatGPT-3.5 (98.1% and 100%, respectively) than for Google Bard (87.5%).

The study concludes that ChatGPT-4 and ChatGPT-3.5 are powerful tools for providing information in IO, while Google Bard's performance is relatively poorer. However, all LLMs showed a risk of inaccuracy or incompleteness, emphasizing the need for expert verification of their outputs.
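As a rough illustration of how reviewer ratings could be tabulated into percentages like those reported above, the following Python sketch computes a per-model accuracy rate from hypothetical records. The record layout, field names, sample data, and the rule that an answer counts as accurate only when both reviewers agree are assumptions made for this example, not the authors' actual scoring procedure.

```python
# Illustrative sketch only: field names and the both-reviewers-agree rule
# are assumptions for this example, not the study's analysis code.

ratings = [
    {"model": "ChatGPT-4", "domain": "mechanisms", "reviewer_1": True, "reviewer_2": True},
    {"model": "ChatGPT-4", "domain": "toxicities", "reviewer_1": True, "reviewer_2": False},
    {"model": "Google Bard", "domain": "indications", "reviewer_1": False, "reviewer_2": False},
    # ...one record per question each model actually answered
]

def accuracy_rate(records, model):
    """Percentage of a model's answered questions rated accurate by both reviewers."""
    scored = [r for r in records if r["model"] == model]
    if not scored:
        return None  # model answered none of the questions in this sample
    accurate = sum(1 for r in scored if r["reviewer_1"] and r["reviewer_2"])
    return 100 * accurate / len(scored)

for model in ("ChatGPT-4", "ChatGPT-3.5", "Google Bard"):
    rate = accuracy_rate(ratings, model)
    print(f"{model}: {rate:.1f}%" if rate is not None else f"{model}: no answered questions")
```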