Can multiple-choice questions really be useful in detecting the abilities of LLMs?

23 May 2024 | Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia
Can multiple-choice questions (MCQs) effectively evaluate the abilities of large language models (LLMs)? This study investigates that question by evaluating nine LLMs on four question-answering (QA) datasets in Chinese and English. The results reveal that LLMs exhibit order sensitivity in bilingual MCQs, favoring answers at specific positions, particularly the first. The study also quantifies the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings, and finds a relatively low correlation between the answers the two formats elicit for identical questions.

In addition, the study proposes two methods for quantifying the consistency and confidence of LLM outputs, both of which generalize to other QA evaluation benchmarks. Notably, the analysis challenges the assumption that higher consistency always indicates better model performance, and MCQs prove less reliable than LFGQs in terms of expected calibration error. The misalignment between MCQs and LFGQs shows up not only in evaluation performance but also in the embedding space. Code and models are available at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.

These findings highlight the limitations of MCQs for evaluating LLMs, particularly in knowledge-intensive scenarios where long-form generation is required. LFGQs align better with real-world use cases and are recommended for professional domains where accuracy is critical, while MCQs remain suitable for general knowledge evaluation. The number of candidate answers and the domain of the dataset do not significantly affect LLM performance. Overall, the study concludes that MCQs are not a reliable method for evaluating LLMs and that LFGQs provide a more accurate assessment of model capabilities.
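One way to make the order-sensitivity finding concrete is to re-ask the same MCQ under every permutation of its candidate options and tally which position the model picks each time; a histogram concentrated on one position indicates position bias rather than content-based answering. The Python sketch below assumes a hypothetical `model_answer(prompt)` wrapper that returns the index of the option the model selects; it illustrates the general permutation test, not the paper's exact protocol.

```python
import itertools
from collections import Counter

def position_bias(model_answer, question, options):
    """Count which option *position* the model picks across all orderings.

    model_answer(prompt) -> int is a hypothetical callable returning the
    index of the selected option; plug in your own LLM wrapper here.
    """
    picks = Counter()
    for perm in itertools.permutations(range(len(options))):
        prompt = question + "\n" + "\n".join(
            f"{chr(ord('A') + i)}. {options[j]}" for i, j in enumerate(perm)
        )
        picks[model_answer(prompt)] += 1
    return picks

# A dummy model that always picks the first option shows pure position bias:
# position_bias(lambda prompt: 0, "2 + 2 = ?", ["4", "5", "6"]) -> Counter({0: 6})
```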
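Expected calibration error (ECE) measures how far a model's stated confidence is from its empirical accuracy, averaged over confidence bins. The snippet below is a minimal sketch of the standard binned ECE, not necessarily the implementation used in the paper; in practice the confidences would come from the probability the model assigns to its chosen answer (e.g. derived from token logits), and the example values here are invented.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted average of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # right-inclusive bins so that a confidence of exactly 1.0 is counted
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Invented example: a well-calibrated model would have confidence ~= accuracy in each bin.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [True, False, True, True]))
```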