Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?


7 Jun 2024 | Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger
The paper investigates whether large language models (LLMs) can answer multiple-choice question answering (MCQA) tasks from the choices alone, with the question withheld. The authors test this choices-only setting on three MCQA datasets and four LLMs, finding that in 11 of 12 cases the LLMs outperform majority baselines by up to 0.33 accuracy. To explain these results, they conduct a black-box analysis focusing on memorization, choice dynamics, and question inference. Key findings include:

1. **No evidence of memorization**: the high choices-only accuracy cannot be attributed to memorization alone.
2. **Choice dynamics**: LLMs exploit both individual priors on choices and group dynamics within the choice set, suggesting they reason over all choices rather than relying on isolated cues.
3. **Abductive question inference**: LLMs can infer the original question from the choices, with 42% of inferred questions meaningfully matching the original.

The authors conclude that while LLMs can perform MCQA with limited information, these strategies do not fully explain their high accuracy. They advocate for stronger baselines, more robust datasets, and further research to improve transparency and fairness in LLM evaluation. The paper also provides a detailed analysis and releases a black-box MCQA evaluation suite to facilitate future studies.
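To make the choices-only setting concrete, the sketch below shows how such a prompt might be constructed (with the question withheld) and how the majority baseline the paper compares against can be computed. This is an illustrative sketch, not the paper's released evaluation suite; the prompt format and helper names are assumptions.

```python
# Illustrative sketch of the "choices-only" MCQA setup described above.
# The prompt template and function names are hypothetical, not from the paper.
from collections import Counter


def choices_only_prompt(choices):
    """Format a prompt that shows only the answer choices, no question."""
    letters = "ABCD"
    lines = [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "Question: [hidden]\n" + "\n".join(lines) + "\nAnswer:"


def majority_baseline_accuracy(gold_labels):
    """Accuracy of always predicting the most frequent gold label."""
    top_count = Counter(gold_labels).most_common(1)[0][1]
    return top_count / len(gold_labels)


prompt = choices_only_prompt(["Paris", "London", "Rome", "Berlin"])
baseline = majority_baseline_accuracy(["A", "B", "A", "C", "A"])
```

An LLM's accuracy on `prompt`-style inputs would then be compared against `baseline`; the paper's finding is that LLMs exceed this baseline by up to 0.33 accuracy in 11 of 12 dataset-model pairs.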