The paper presents GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions from 60 open-source models on the GSM8K benchmark. Extensive experiments show that LLMs' performance on the MC version of GSM8K is strongly correlated with their performance on the original version and is robust to both the choice of distractors and the order of options, while evaluation time is reduced by up to a factor of 30. Building on this result, the authors introduce MATH-MC and PythonIO, new MC datasets derived from the MATH and HumanEval/MBPP benchmarks, respectively. Experimental results indicate that LLMs' performance on these new MC benchmarks leaves substantial room for improvement. The data and code are available at <https://github.com/Geralt-Targaryen/MC-Evaluation>.
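
As a rough illustration of this construction, the sketch below turns one GSM8K-style item into an MC question by sampling three distractors from a pool of incorrect model predictions and shuffling the option order. All names here (`build_mc_item`, `wrong_predictions`, and so on) are hypothetical stand-ins, not the paper's actual pipeline or API.

```python
import random

def build_mc_item(question, gold_answer, wrong_predictions, n_options=4, rng=None):
    """Build a multiple-choice item from a GSM8K-style question.

    `wrong_predictions` is a pool of incorrect final answers collected
    from model outputs; three of them serve as distractors alongside
    the gold answer. This is an illustrative sketch, not the paper's code.
    """
    rng = rng or random.Random(0)
    # Deduplicate and drop any prediction that equals the gold answer;
    # sort for a deterministic pool before sampling.
    distractor_pool = sorted({p for p in wrong_predictions if p != gold_answer})
    distractors = rng.sample(distractor_pool, n_options - 1)
    # Shuffle option order so the gold answer's position is random;
    # the paper reports robustness to both distractor choice and option order.
    options = distractors + [gold_answer]
    rng.shuffle(options)
    labels = "ABCD"[:n_options]
    answer_label = labels[options.index(gold_answer)]
    return {
        "question": question,
        "options": dict(zip(labels, map(str, options))),
        "answer": answer_label,
    }

# Example usage with a toy GSM8K-style item:
item = build_mc_item(
    "Natalia sold clips to 48 friends in April, and half as many in May. "
    "How many clips did she sell altogether?",
    gold_answer=72,
    wrong_predictions=[24, 48, 96, 120],
)
print(item)
```

Scoring such an item requires only matching a single predicted letter against the gold label, which is what makes MC evaluation so much cheaper than grading free-form chain-of-thought outputs.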