16 Jun 2024 | Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, Dan Roth
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or instead rely on token bias. It probes LLMs on logical reasoning tasks using carefully controlled synthetic datasets built around conjunction-fallacy and syllogistic problems. The framework formulates alternative hypotheses under which token biases are identifiable, with null hypotheses assuming the models reason genuinely. The pipeline combines synthetic data generation, token perturbation, and statistical hypothesis testing: surface tokens in a problem (e.g., names or topic words) are altered while the underlying logic is held fixed, and a statistically significant drop in accuracy is taken as evidence of token bias rather than genuine reasoning.

Evaluating a range of LLMs on these datasets, the study finds that most models struggle with logical reasoning and rely on superficial patterns rather than true understanding; apparent performance gains are largely attributable to token bias. Models often fail to reason consistently, especially on contextually misleading or perturbed examples. The results highlight the limitations of current LLMs as genuine reasoners and motivate further research into their reasoning capabilities.
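To make the token-perturbation idea concrete, here is a minimal sketch of how a Linda-style conjunction-fallacy problem might be regenerated with perturbed surface tokens while its logical structure stays fixed. The template wording, token pools, and function names are illustrative assumptions, not the paper's actual generation code:

```python
import random

# Template for a classic "Linda"-style conjunction-fallacy problem.
# The logically correct answer is always (a), since P(A) >= P(A and B).
TEMPLATE = (
    "{name} is {age} years old, outspoken, and deeply concerned with "
    "issues of social justice. Which is more probable?\n"
    "(a) {name} is a {job}.\n"
    "(b) {name} is a {job} and is active in the {movement} movement."
)

# Hypothetical token pools; the paper's actual vocabularies may differ.
NAMES = ["Linda", "Bob", "Priya", "Chen", "Amara"]
JOBS = ["bank teller", "nurse", "software engineer", "teacher"]
MOVEMENTS = ["feminist", "environmental", "open-source", "labor"]

def make_problem(rng: random.Random) -> dict:
    """Sample one problem by perturbing surface tokens only.

    The logical structure (a conjunction fallacy) is held fixed, so a
    genuine reasoner should answer (a) regardless of token choices.
    """
    return {
        "question": TEMPLATE.format(
            name=rng.choice(NAMES),
            age=rng.randint(25, 60),
            job=rng.choice(JOBS),
            movement=rng.choice(MOVEMENTS),
        ),
        "gold": "a",
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(2):
        print(make_problem(rng)["question"], "\n")
```

If a model answers the canonical "Linda" wording correctly but fails once the name or topic tokens change, the error pattern points to memorized surface cues rather than reasoning over the logical form.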
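The statistical test can likewise be sketched. For paired binary outcomes (each problem answered in original and perturbed form), an exact McNemar-style test is one natural choice: under the null of genuine reasoning, correct-to-wrong flips after perturbation should be no more common than wrong-to-correct flips. This is a self-contained illustration under that assumption; the paper's exact test statistic may differ:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test for paired binary outcomes.

    b: problems the model got right originally but wrong after token
    perturbation; c: the reverse. Under the null (perturbation does not
    matter), discordant flips are symmetric: b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided p-value: probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 40 right-to-wrong flips vs. 8 wrong-to-right
# flips would strongly reject the "genuine reasoning" null hypothesis.
print(f"p = {mcnemar_exact(40, 8):.2e}")
```

A small p-value rejects the null in favor of the token-bias hypothesis, which is how the framework turns qualitative failure cases into a quantitative claim.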