7 Jun 2024 | Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger
This paper investigates how large language models (LLMs) perform multiple-choice question answering (MCQA) without access to the question, using only the answer choices. The study evaluates four LLMs on three MCQA datasets (ARC, MMLU, HellaSwag) and finds that in 11 of 12 cases, the LLMs outperform the majority baseline, by as much as 0.33 accuracy on HellaSwag. These results suggest that the models may be exploiting artifacts in the data rather than relying solely on the question-based reasoning the benchmarks are meant to measure.

The paper explores three hypotheses for this choices-only performance: memorization, choice dynamics, and question inference. It finds no evidence that memorization alone explains the results, but does find that LLMs may exploit the group dynamics of the choice set and can sometimes infer the original question from the choices alone, which is itself an impressive reasoning strategy. Even so, question inference does not fully account for the high choices-only accuracy.

Based on these findings, the paper advocates for stronger baselines in MCQA benchmarks and for the design of robust MCQA datasets that support fair evaluations, and it highlights the need for further research into LLM decision-making. Because the analysis is black-box, the study also suggests that partial-input success is not always driven by surface-level shortcuts: while LLMs may use artifacts, they still demonstrate reasoning abilities in MCQA. Overall, the work emphasizes the importance of transparent and robust evaluation of LLMs on MCQA tasks.
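To make the choices-only (partial-input) setup concrete, the sketch below shows how such an evaluation could be wired up. It is a minimal illustration under stated assumptions, not the authors' code: the `query_llm` helper stands in for whatever chat or completion API is used, and the toy `dataset` is hypothetical; real experiments would load ARC, MMLU, or HellaSwag and use their gold labels.

```python
from collections import Counter

# Hypothetical toy examples in a generic MCQA format (illustration only);
# real experiments would load ARC, MMLU, or HellaSwag instead.
dataset = [
    {"choices": ["Paris", "London", "Berlin", "Madrid"], "answer": 0},
    {"choices": ["mitochondria", "nucleus", "ribosome", "cell wall"], "answer": 0},
]

LETTERS = "ABCD"


def choices_only_prompt(choices):
    """Build a partial-input prompt that shows only the answer choices,
    withholding the question entirely."""
    lines = [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    return (
        "Choose the most likely correct answer from the options below.\n"
        + "\n".join(lines)
        + "\nAnswer with a single letter."
    )


def majority_baseline(dataset):
    """Accuracy of always guessing the most frequent gold label position."""
    labels = [ex["answer"] for ex in dataset]
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


def choices_only_accuracy(dataset, query_llm):
    """Accuracy of an LLM answering from the choices alone.

    `query_llm` is an assumed stand-in for any chat/completion API call
    that takes a prompt string and returns the model's raw text response.
    """
    correct = 0
    for ex in dataset:
        response = query_llm(choices_only_prompt(ex["choices"]))
        predicted = response.strip()[:1].upper()
        if predicted == LETTERS[ex["answer"]]:
            correct += 1
    return correct / len(dataset)
```

If choices-only accuracy lands well above `majority_baseline(dataset)`, that gap is the kind of signal the paper treats as evidence of artifacts, or of reasoning strategies such as question inference that do not depend on seeing the question.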