2024 | Giovanni Maria Iannantuono, Dara Bracken-Clarke, Fatima Karzai, Hyoyoung Choo-Wosoba, James L. Gulley, Charalampos S. Floudas
This cross-sectional study evaluates the performance of three large language models (LLMs)—ChatGPT-4, ChatGPT-3.5, and Google Bard—in answering questions related to immuno-oncology (IO). The study generated 60 open-ended questions across four domains: mechanisms, indications, toxicities, and prognosis. Responses were collected and assessed for reproducibility, accuracy, relevance, and readability by two independent reviewers.
Key findings include:
- ChatGPT-4 and ChatGPT-3.5 answered all questions, while Google Bard answered only 53.3%.
- ChatGPT-4 and ChatGPT-3.5 had higher reproducibility rates (95% and 86.3%, respectively) compared to Google Bard (50%).
- Accuracy was higher for ChatGPT-4 (78.4%) and ChatGPT-3.5 (58.5%) than for Google Bard (43.8%).
- Google Bard had the lowest relevance score (43.8%) compared to ChatGPT-4 (71.9%) and ChatGPT-3.5 (77.4%).
- Readability was higher for ChatGPT-4 and ChatGPT-3.5 (98.1% and 100%, respectively) than for Google Bard (87.5%).
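To make the metrics concrete, the sketch below shows one way such per-model rates could be tabulated from two reviewers' judgments. The data structure, field names, and scoring scheme are illustrative assumptions for this summary; the study's actual rubric and statistical analysis are not reproduced here.

```python
from collections import defaultdict

# Hypothetical reviewer judgments: one record per (model, question) pair.
# Field names and values are illustrative, not the study's actual data.
ratings = [
    {"model": "ChatGPT-4", "answered": True, "reproducible": True,
     "accurate": True, "relevant": True, "readable": True},
    {"model": "Google Bard", "answered": False, "reproducible": False,
     "accurate": False, "relevant": False, "readable": False},
    # ... one entry per model/question pair (60 questions x 3 models)
]

def rate(records, field, denominator_field=None):
    """Percentage of records where `field` is True.

    If `denominator_field` is given, only records where that field is True
    count toward the denominator (e.g. accuracy among answered questions).
    """
    pool = [r for r in records if denominator_field is None or r[denominator_field]]
    return 100.0 * sum(r[field] for r in pool) / len(pool) if pool else 0.0

by_model = defaultdict(list)
for r in ratings:
    by_model[r["model"]].append(r)

for model, recs in by_model.items():
    print(f"{model}: answered {rate(recs, 'answered'):.1f}%, "
          f"accuracy {rate(recs, 'accurate', 'answered'):.1f}%, "
          f"relevance {rate(recs, 'relevant', 'answered'):.1f}%")
```

A similar tally over repeated prompts would yield reproducibility figures like those reported above.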
The study concludes that ChatGPT-4 and ChatGPT-3.5 are capable tools for providing information in IO, whereas Google Bard performed noticeably worse. However, all three LLMs showed a risk of inaccuracy or incompleteness, underscoring the need for expert verification of their outputs.