Re-evaluating GPT-4’s bar exam performance

30 January 2024 | Eric Martínez
The paper "Re-evaluating GPT-4’s bar exam performance" by Eric Martínez investigates the validity of OpenAI's claim that GPT-4 achieved a 90th percentile performance on the Uniform Bar Examination (UBE). The study identifies several methodological challenges in documenting and verifying this claim, leading to four main findings: 1. **Methodological Challenges**: The estimates of GPT-4's UBE percentile are heavily skewed towards repeat test-takers who failed the July administration, resulting in lower scores compared to the general test-taking population. 2. **July Administration Data**: Using data from a recent July administration, GPT-4's overall UBE percentile is estimated to be below the 69th percentile, with a score of ~48th percentile on essays. 3. **First-Time Test-Takers**: When comparing GPT-4's performance against first-time test-takers, the estimated percentile drops to ~62nd percentile, including ~42nd percentile on essays. 4. **Passing the Exam**: When examining only those who passed the exam, GPT-4's performance drops to ~48th percentile overall and ~15th percentile on essays. The paper also examines the validity of GPT-4's reported scaled UBE score of 298, successfully replicating the MBE score of 158 but highlighting methodological issues in the grading of the MPT + MEE components, which question the validity of the essay score (140). Additionally, the study investigates the effect of different hyperparameter settings on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings but a significant effect of prompt engineering. Overall, the findings suggest that OpenAI's estimates of GPT-4’s UBE percentile are likely overinflated and carry important implications for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as the importance of rigorous and transparent capabilities evaluations for AI developers.The paper "Re-evaluating GPT-4’s bar exam performance" by Eric Martínez investigates the validity of OpenAI's claim that GPT-4 achieved a 90th percentile performance on the Uniform Bar Examination (UBE). The study identifies several methodological challenges in documenting and verifying this claim, leading to four main findings: 1. **Methodological Challenges**: The estimates of GPT-4's UBE percentile are heavily skewed towards repeat test-takers who failed the July administration, resulting in lower scores compared to the general test-taking population. 2. **July Administration Data**: Using data from a recent July administration, GPT-4's overall UBE percentile is estimated to be below the 69th percentile, with a score of ~48th percentile on essays. 3. **First-Time Test-Takers**: When comparing GPT-4's performance against first-time test-takers, the estimated percentile drops to ~62nd percentile, including ~42nd percentile on essays. 4. **Passing the Exam**: When examining only those who passed the exam, GPT-4's performance drops to ~48th percentile overall and ~15th percentile on essays. The paper also examines the validity of GPT-4's reported scaled UBE score of 298, successfully replicating the MBE score of 158 but highlighting methodological issues in the grading of the MPT + MEE components, which question the validity of the essay score (140). Additionally, the study investigates the effect of different hyperparameter settings on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings but a significant effect of prompt engineering. 
Overall, the findings suggest that OpenAI's estimate of GPT-4's UBE percentile is likely overinflated. They carry important implications for the desirability and feasibility of outsourcing legally relevant tasks to AI models, and underscore the importance of rigorous and transparent capabilities evaluations for AI developers.
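As a rough illustration of why the choice of comparison population matters, the following sketch computes the percentile of the same scaled score against several hypothetical score distributions. Every mean, spread, and the cut score below is invented for illustration; these are not the paper's data or NCBE statistics, and the printed percentiles will not match the paper's figures.

```python
# Minimal sketch: the same scaled UBE score maps to different percentiles
# depending on the population it is compared against. All distribution
# parameters and the cut score are illustrative assumptions, not real data.
import numpy as np

rng = np.random.default_rng(0)

def percentile_of(score: float, population: np.ndarray) -> float:
    """Percent of the population scoring at or below `score`."""
    return 100.0 * np.mean(population <= score)

gpt4_score = 298  # GPT-4's reported scaled UBE score

# Hypothetical score distributions (means and spreads chosen only for illustration):
february_takers = rng.normal(loc=270, scale=25, size=100_000)  # skewed toward repeat takers
july_takers     = rng.normal(loc=282, scale=23, size=100_000)  # general July population
first_timers    = rng.normal(loc=286, scale=22, size=100_000)  # first-time takers only

cut_score = 270  # illustrative passing threshold (varies by jurisdiction)
passers = july_takers[july_takers >= cut_score]  # selection effect: passers only

for label, pop in [
    ("vs. February takers", february_takers),
    ("vs. July takers", july_takers),
    ("vs. first-time takers", first_timers),
    ("vs. passers only", passers),
]:
    print(f"{label:>24}: ~{percentile_of(gpt4_score, pop):.0f}th percentile")
```

The point of the sketch is directional: holding the score fixed at 298, the estimated percentile falls as the comparison population shifts from repeat-taker-heavy administrations to first-time takers and, finally, to those who passed.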