Re-evaluating GPT-4's bar exam performance

2024 | Martínez, E.
This paper re-evaluates GPT-4's performance on the Uniform Bar Exam (UBE), challenging the claim that it achieved the 90th percentile. The author finds that OpenAI's estimates are overinflated. First, while GPT-4's UBE score nears the 90th percentile when using approximate conversions from February Illinois Bar Exam administrations, those administrations are skewed toward repeat test-takers, who score lower than the general test-taking population. Second, data from a July administration puts GPT-4's UBE performance below the 69th percentile, with essays at ~48th percentile. Third, using official NCBE data and conservative assumptions, GPT-4's performance against first-time test-takers is estimated at ~62nd percentile, including ~42nd percentile on essays. Fourth, when considering only those who passed the exam, GPT-4's performance drops to ~48th percentile overall and ~15th percentile on essays. The paper also investigates the validity of GPT-4's reported scaled UBE score of 298. It successfully replicates the MBE score but highlights methodological issues in grading the MPT + MEE components, calling the essay score into question. The paper further finds that adjusting temperature settings has no significant effect on MBE performance, whereas prompt engineering significantly improves it. These findings suggest that OpenAI's estimates are likely overinflated, particularly if presented as conservative. The results have implications for the feasibility of outsourcing legal tasks to AI and underscore the importance of rigorous, transparent evaluations for AI safety, as well as the need for legal professionals and AI developers to critically assess AI capabilities to ensure safe and trustworthy AI.
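
To make the core statistical point concrete, below is a minimal sketch of how a fixed scaled score maps to very different percentiles depending on the comparison cohort. The mean and standard-deviation values are hypothetical placeholders, not figures from the paper or from NCBE data; only GPT-4's reported score of 298 comes from the text.

    from statistics import NormalDist

    # Minimal sketch: the same scaled UBE score implies very different
    # percentiles depending on which cohort it is compared against.
    # The means/SDs below are hypothetical placeholders, not NCBE data.

    GPT4_SCORE = 298  # GPT-4's reported scaled UBE score (out of 400)

    def implied_percentile(score: float, mean: float, sd: float) -> float:
        """Percentile of `score` under an assumed normal score distribution."""
        return 100 * NormalDist(mu=mean, sigma=sd).cdf(score)

    # Lower-scoring comparison group (e.g., a February administration
    # skewed toward repeat test-takers) -- hypothetical parameters:
    print(f"vs. weaker cohort:     ~{implied_percentile(GPT4_SCORE, 260, 30):.0f}th percentile")

    # Stronger comparison group (e.g., July first-time test-takers)
    # -- also hypothetical parameters:
    print(f"vs. first-time takers: ~{implied_percentile(GPT4_SCORE, 285, 25):.0f}th percentile")

Under these illustrative numbers, the identical score lands roughly 20 percentile points lower against the stronger cohort, mirroring the paper's argument that the choice of comparison group drives the 90th-percentile claim.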