March/April 2024 | Joel Hake, MD; Miles Crowley, MD, MPH; Allison Coy, MD; Denton Shanks, DO, MPH; Aundria Eoff, MD; Kalee Kirmer-Voss, MD; Gurpreet Dhanda, MD; Daniel J. Parente, MD, PhD
This study evaluates how well ChatGPT summarizes medical abstracts and assesses their relevance to particular medical specialties. The researchers analyzed 140 peer-reviewed abstracts from 14 journals, and seven physicians rated the quality, accuracy, and bias of the ChatGPT summaries. The summaries were 70% shorter than the original abstracts and were judged to be of high quality (median score, 90), highly accurate (median, 92.5), and low in bias (median, 0), with ChatGPT's self-assessment agreeing with the physicians' ratings. ChatGPT's relevance classifications at the journal level aligned with human assessments, but its agreement with humans on the relevance of individual articles to specific specialties was much lower.

The researchers concluded that ChatGPT can help family physicians quickly review the scientific literature, and they developed a software tool, pyJournalWatch, to support this application. They cautioned, however, that life-critical medical decisions should not rely solely on ChatGPT summaries, which may contain rare but important inaccuracies, and they highlighted the need for careful validation of AI tools to avoid exacerbating existing health disparities.
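As a rough illustration of the workflow the study describes (an abstract goes in, a much shorter summary comes out), here is a minimal sketch using the OpenAI Python client. This is not the authors' pyJournalWatch implementation; the model name, prompt wording, and 70% length-reduction target are assumptions chosen to mirror the figures reported above.

```python
# Minimal sketch of ChatGPT-based abstract summarization, in the spirit of
# the study's workflow. NOT the authors' pyJournalWatch code: the model,
# prompt, and length target are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_abstract(abstract: str, specialty: str = "family medicine") -> str:
    """Request a summary roughly 70% shorter than the original abstract."""
    target_chars = int(len(abstract) * 0.3)  # mirrors the ~70% reduction reported
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the study used ChatGPT
        messages=[
            {
                "role": "system",
                "content": (
                    "You summarize peer-reviewed medical abstracts for busy "
                    f"physicians in {specialty}. Be factual and unbiased."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Summarize this abstract in about {target_chars} "
                    f"characters:\n\n{abstract}"
                ),
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    example = "Background: ... Methods: ... Results: ... Conclusions: ..."
    print(summarize_abstract(example))
```

A production tool would add the study's other steps, such as rating each summary's relevance to a target specialty, but as the findings above suggest, article-level relevance classification is exactly where agreement with human reviewers was weakest.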