The paper presents GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions from 60 open-source models on the GSM8K benchmark. Extensive experiments show that LLMs' performance on the MC version of GSM8K is strongly correlated with their performance on the original version and is robust to both the choice of distractors and the order of options, while evaluation time is reduced by up to a factor of 30. Building on this result, the authors introduce MATH-MC and PythonIO, new MC datasets derived from the MATH and HumanEval/MBPP benchmarks, respectively. Experimental results indicate that LLMs' performance on these new MC benchmarks leaves substantial room for improvement. The data and code are available at <https://github.com/Geralt-Targaryen/MC-Evaluation>.
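
As a rough illustration of this construction, the sketch below turns one GSM8K-style item into an MC question by sampling three distractors from a pool of incorrect model predictions and shuffling the option order. All names here (`build_mc_item`, `wrong_predictions`, and so on) are hypothetical stand-ins, not the paper's actual pipeline or API.

```python
import random

def build_mc_item(question, gold_answer, wrong_predictions, n_options=4, rng=None):
    """Build a multiple-choice item from a GSM8K-style question.

    `wrong_predictions` is a pool of incorrect final answers collected
    from model outputs; three of them serve as distractors alongside
    the gold answer. This is an illustrative sketch, not the paper's code.
    """
    rng = rng or random.Random(0)
    # Deduplicate and drop any prediction that equals the gold answer;
    # sort for a deterministic pool before sampling.
    distractor_pool = sorted({p for p in wrong_predictions if p != gold_answer})
    distractors = rng.sample(distractor_pool, n_options - 1)
    # Shuffle option order so the gold answer's position is random;
    # the paper reports robustness to both distractor choice and option order.
    options = distractors + [gold_answer]
    rng.shuffle(options)
    labels = "ABCD"[:n_options]
    answer_label = labels[options.index(gold_answer)]
    return {
        "question": question,
        "options": dict(zip(labels, map(str, options))),
        "answer": answer_label,
    }

# Example usage with a toy GSM8K-style item:
item = build_mc_item(
    "Natalia sold clips to 48 friends in April, and half as many in May. "
    "How many clips did she sell altogether?",
    gold_answer=72,
    wrong_predictions=[24, 48, 96, 120],
)
print(item)
```

Scoring such an item requires only matching a single predicted letter against the gold label, which is what makes MC evaluation so much cheaper than grading free-form chain-of-thought outputs.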