Can GPT-3.5 Generate and Code Discharge Summaries?

24 Jan 2024 | Matúš Falis (MScR), Aryo Pradipta Gema (MScR), Hang Dong (PhD), Luke Daines (PhD), Siddharth Basetti (MBBS), Michael Holder (MMedSci), Rose S Penfold (BMBCh), Alexandra Birch (PhD) and Beatrice Alex (PhD)
This study investigates the potential of GPT-3.5 to generate and code medical documents with ICD-10 codes, for data augmentation in low-resource label settings. The researchers used GPT-3.5 to generate 9,606 discharge summaries based on ICD-10 code descriptions from the MIMIC-IV dataset. These generated summaries were combined with a baseline training set to form an augmented training set. Neural coding models were trained on both the baseline and the augmented data and evaluated on a MIMIC-IV test set. The study reports micro- and macro-F1 scores on the full codeset, the generation codes, and their families, and uses Weak Hierarchical Confusion Matrices to distinguish within-family from out-of-family coding errors. GPT-3.5's own coding performance was evaluated on both prompt-guided self-generated data and real MIMIC-IV data, and clinical professionals assessed the clinical acceptability of the generated documents.
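To make the evaluation setup concrete, below is a minimal sketch of multi-label micro- and macro-F1 scoring plus a within-family versus out-of-family error tally. It assumes, hypothetically, that a code's "family" is its three-character ICD-10 category prefix; the toy codeset, the label matrices, and the `family` helper are illustrative, and the paper's Weak Hierarchical Confusion Matrices may differ in detail.

```python
# Minimal sketch of the evaluation described above (assumptions noted below).
# Gold labels and predictions are assumed to be multi-hot matrices over the
# codeset, and a code's "family" is taken to be its three-character ICD-10
# category prefix (e.g. "I50" for "I50.9").
import numpy as np
from sklearn.metrics import f1_score

CODES = ["I10", "I50.9", "E11.9", "J18.9"]  # toy codeset (illustrative only)

y_true = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 0]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 0, 1, 1]])

# Micro-F1 pools TP/FP/FN across all codes; macro-F1 averages per-code F1,
# so rare codes count as much as frequent ones.
print("micro-F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))

def family(code: str) -> str:
    """Three-character ICD-10 category, used here as the code's family."""
    return code.split(".")[0][:3]

# Tally false positives as within-family (the predicted code shares a family
# with some gold code in the same document) or out-of-family.
within = outside = 0
for t_row, p_row in zip(y_true, y_pred):
    gold_families = {family(CODES[i]) for i in np.flatnonzero(t_row)}
    for i in np.flatnonzero((p_row == 1) & (t_row == 0)):
        if family(CODES[i]) in gold_families:
            within += 1
        else:
            outside += 1
print("within-family FPs:", within, "| out-of-family FPs:", outside)
```

A count of this kind, comparing the baseline and augmented models, is what surfaces the lower out-of-family error rates reported below.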
Results showed that augmentation slightly hindered overall model performance but improved performance on the generation candidate codes and their families, including one family unseen in the baseline training data, and augmented models displayed lower out-of-family error rates. GPT-3.5 could identify ICD-10 codes from prompted descriptions but performed poorly on real data. Clinical evaluators found that the generated documents stated the prompted concepts correctly but suffered in variety, supporting information, and narrative quality.

The discussion and conclusion indicate that GPT-3.5 alone is unsuitable for ICD-10 coding. Augmentation positively affects the generation code families, but it mainly benefits codes with existing examples, and it reduces out-of-family errors. Discharge summaries generated by GPT-3.5 state the prompted concepts correctly but lack variety and authentic narratives, making them unsuitable for clinical practice. The study highlights several challenges in generating natural-looking clinical notes: verbatim reproduction of prompted diagnoses, unnatural diagnosis phrasing, missing supporting information, the introduction of spurious information, and a failure to present diagnoses as interconnected events. These issues undermine the coherence and plausibility of the generated notes, reducing their acceptability and usefulness in clinical settings. The study concludes that while GPT-3.5 shows partial code-identification ability, it is unsuitable for deployment in a clinical setting without further improvements.
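As an illustration of the kind of coding probe the study describes, here is a hedged sketch of prompting GPT-3.5 to assign ICD-10 codes to a discharge summary using the openai-python client. The prompt wording, temperature setting, and output format are assumptions for illustration, not the paper's actual protocol.

```python
# Hedged sketch of probing GPT-3.5 for ICD-10 coding. The prompt text,
# temperature, and output format below are assumptions, not the paper's
# actual experimental setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

summary = "..."  # a de-identified discharge summary goes here

prompt = (
    "You are a clinical coding assistant. List the ICD-10 codes supported "
    "by the following discharge summary, one per line, as CODE: description.\n\n"
    + summary
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # near-deterministic output for evaluation
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In the study, probes of this kind succeeded mainly when the summary was GPT-3.5's own prompt-guided output; performance on real MIMIC-IV notes was poor.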