This study evaluates the performance of ChatGPT-3.5 and ChatGPT-4.0 in answering ophthalmology-related questions across different levels of ophthalmology training. The researchers extracted questions from the United States Medical Licensing Examination (USMLE) steps 1, 2, and 3, as well as from the Ophthalmic Knowledge Assessment Program (OKAP) and the Board of Ophthalmology Written Qualifying Examination (OB-WQE). Each question was submitted to both models with an identical prompt, and the accuracy of the responses was assessed.
**Results:**
- **GPT-3.5:** Achieved 55% accuracy (n=210) overall, with performance decreasing as examination levels advanced (P<.001).
- **GPT-4.0:** Achieved 70% accuracy (n=270) overall, with better performance on USMLE steps 2 and 3 and worse performance on USMLE step 1 and OB-WQE (P<.001).
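As a rough illustration of how the per-level results above could be tabulated, the Python sketch below computes accuracy for each examination level from binary correctness flags and applies a chi-square test of association. It is not the authors' analysis code: the correctness arrays are hypothetical placeholders, and the chi-square test merely stands in for whatever trend statistic the study actually used.

```python
# Minimal sketch of a per-level accuracy summary, assuming one 0/1 correctness
# flag per question. All data below are hypothetical placeholders, not study data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
levels = ["USMLE Step 1", "USMLE Step 2", "USMLE Step 3", "OKAP", "OB-WQE"]
correct_by_level = {name: rng.integers(0, 2, size=50) for name in levels}

# Accuracy per examination level.
for name, flags in correct_by_level.items():
    print(f"{name}: {flags.mean():.0%} ({flags.sum()}/{flags.size})")

# 2 x k contingency table (correct vs. incorrect counts per level),
# tested for association between examination level and correctness.
table = np.array([[flags.sum(), flags.size - flags.sum()]
                  for flags in correct_by_level.values()]).T
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, P = {p:.3f}")
```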
**Correlation:**
- The correlation coefficient between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 and −0.31 (P<.001) for GPT-4.0.
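Because the model's correctness is binary while human performance is typically reported as the percentage of examinees answering each question correctly, a point-biserial correlation is one natural way to compute such a coefficient. The sketch below, with entirely made-up data, shows that calculation; the study's exact correlation method is not specified in this summary.

```python
# Minimal sketch of a point-biserial correlation between ChatGPT correctness
# (0/1 per question) and the percentage of human users answering correctly.
# All values are hypothetical placeholders, not data from the study.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(1)
model_correct = rng.integers(0, 2, size=200)        # 1 = model answered correctly
human_pct_correct = rng.uniform(20, 95, size=200)   # % of human users correct

r, p = pointbiserialr(model_correct, human_pct_correct)
print(f"r = {r:.2f}, P = {p:.3f}")
```

A negative coefficient, as reported for GPT-4.0, would mean the model tended to answer correctly on questions that human users found harder.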
**Performance by Difficulty Level:**
- GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with increasing difficulty.
**Performance by Topic:**
- Both models performed better on certain topics, such as corneal diseases, pediatrics, retina, ocular oncology, and neuro-ophthalmology, than on others.
**Conclusions:**
- ChatGPT is not yet suitable for mainstream medical education because of its moderate accuracy and its limitations in handling more complex, clinically oriented questions. Future models with higher accuracy are needed before the tool can be used effectively in medical education.
The study highlights the need for further research to optimize ChatGPT for medical education and to explore its potential in generating personalized and tailored learning experiences for students.